In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
decorate = thinkplot.config
Let's load the GSS dataset.
In [2]:
%time gss = pd.read_hdf('../homeworks/gss.hdf5', 'gss')
gss.head()
Out[2]:
In [3]:
def counts(series):
    # Count each distinct value, ordered by value rather than by frequency
    return series.value_counts(sort=False).sort_index()
The GSS interviews a few thousand respondents each year.
In [4]:
counts(gss['year'])
Out[4]:
One of the questions they ask is "Do you think the use of marijuana should be made legal or not?"
The answer codes are:
1 Legal
2 Not legal
8 Don't know
9 No answer
0 Not applicable
Here is the distribution of responses for all years.
In [5]:
counts(gss['grass'])
Out[5]:
I'll replace "Don't know", "No answer", and "Not applicable" with NaN.
In [6]:
# Assign the result back; inplace replace on a selected column may not update gss
gss['grass'] = gss['grass'].replace([0, 8, 9], np.nan)
And replace 2, which represents "Not legal", with 0. That way we can use mean to compute the fraction in favor.
In [7]:
# Assign the result back; inplace replace on a selected column may not update gss
gss['grass'] = gss['grass'].replace(2, 0)
Here are the value counts after replacement.
In [8]:
counts(gss['grass'])
Out[8]:
And here's the mean.
In [9]:
gss['grass'].mean()
Out[9]:
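To see why the 0/1 recoding turns mean into a fraction, here is a minimal sketch on a small made-up Series. The values are illustrative, not GSS data; only the answer codes are taken from the list above. It relies on the fact that pandas skips NaN when computing a mean.

```python
import numpy as np
import pandas as pd

# Made-up responses using the same answer codes (not real GSS data)
toy = pd.Series([1, 2, 2, 8, 1, 0, 9, 2])

# Treat 0, 8, and 9 as missing, then recode 2 ("Not legal") as 0
recoded = toy.replace([0, 8, 9], np.nan).replace(2, 0)

# mean() skips NaN, so this is (number of 1s) / (number of valid answers)
fraction_in_favor = recoded.mean()
n_valid = recoded.notna().sum()
```

Here two of the five valid answers are 1, so the mean is the share of valid respondents in favor; the three missing answers do not enter the denominator.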
In [10]:
grouped = gss.groupby('year')
grouped
Out[10]:
The result is a DataFrameGroupBy object we can iterate through:
In [11]:
for name, group in grouped:
    print(name, len(group))
And we can compute summary statistics for each group.
In [12]:
for name, group in grouped:
    print(name, group['grass'].mean())
Using a for loop can be useful for debugging, but it is more concise, more idiomatic, and faster to apply operations directly to the DataFrameGroupBy object.
For example, if you select a column from a DataFrameGroupBy, the result is a SeriesGroupBy that represents one Series for each group.
In [13]:
grouped['grass']
Out[13]:
You can loop through the SeriesGroupBy, but you normally don't.
In [14]:
for name, series in grouped['grass']:
    print(name, series.mean())
Instead, you can apply a function to the SeriesGroupBy; the result is a new Series that maps from group names to the results of the function. In this case, it's the fraction in favor for each interview year.
In [15]:
series = grouped['grass'].mean()
series
Out[15]:
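As a small check that the loop and the direct call compute the same thing, here is a sketch on a made-up DataFrame. The column names mirror the ones used above, but the values are invented for illustration.

```python
import pandas as pd

# Made-up data: three interview years with 0/1 responses (not real GSS data)
toy = pd.DataFrame({
    'year':  [2000, 2000, 2002, 2002, 2002, 2004],
    'grass': [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

# Explicit loop: one mean per group, collected in a dict
looped = {name: group['grass'].mean() for name, group in toy.groupby('year')}

# Direct application: the same numbers, returned as a Series indexed by year
direct = toy.groupby('year')['grass'].mean()
```

The direct version returns a Series whose index holds the group names, which is why it can be plotted against year without any extra bookkeeping.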
Overall support for legalization has been increasing since 1990.
In [16]:
series.plot(color='C0')
decorate(xlabel='Year of interview',
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')
In [17]:
counts(gss['cohort'])
Out[17]:
Pulling together the code from the previous section, we can plot support for legalization by year of birth.
In [18]:
grouped = gss.groupby('cohort')
series = grouped['grass'].mean()
series.plot(color='C1')
decorate(xlabel='Year of birth',
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')
Later generations are more likely to support legalization than earlier generations.
In [19]:
grouped = gss.groupby('age')
series = grouped['grass'].mean()
series.plot(color='C2')
decorate(xlabel='Age at interview',
         ylabel='Fraction in favor',
         title='Should marijuana be made legal?')
Younger people are more likely to support legalization than older people.
In general, it is not easy to separate period, cohort, and age effects, but there are ways. We'll come back to this example to see how.
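One simple way to start looking at two of these effects together is to group by two columns at once. The sketch below does that on made-up data and is only an illustration of the groupby mechanics; it is not necessarily the approach used when we return to this example.

```python
import pandas as pd

# Made-up data: two birth cohorts interviewed in two years (not real GSS data)
toy = pd.DataFrame({
    'year':   [1990, 1990, 1990, 2010, 2010, 2010],
    'cohort': [1940, 1940, 1970, 1940, 1970, 1970],
    'grass':  [0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})

# Mean support for each (cohort, year) pair; unstack moves years into columns
table = toy.groupby(['cohort', 'year'])['grass'].mean().unstack()
```

Each row of the table follows one cohort across interview years, and each column compares cohorts within one year.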