In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
import utils
from utils import decorate
from thinkstats2 import Pmf, Cdf
In [2]:
def read_gss(dirname):
    """Reads GSS files from the given directory.
    dirname: string
    returns: DataFrame
    """
    dct = utils.read_stata_dict(dirname + '/GSS.dct')
    gss = dct.read_fixed_width(dirname + '/GSS.dat.gz',
                               compression='gzip')
    return gss
Read the variables I selected from the GSS dataset. You can look up these variables at https://gssdataexplorer.norc.org/variables/vfilter
In [3]:
gss = read_gss('gss_eda')
print(gss.shape)
gss.head()
Most variables use special codes to indicate missing data. We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with NaN, which Pandas recognizes as a missing value.
In [4]:
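def replace_invalid(df):
    """Replaces special codes for missing data with NaN, modifying df."""
    # Assign back to the column rather than calling replace(inplace=True) on
    # an attribute; the in-place form may act on a copy in newer pandas.
    df['realinc'] = df['realinc'].replace([0], np.nan)
    df['educ'] = df['educ'].replace([98, 99], np.nan)
    # 89 means 89 or older
    df['age'] = df['age'].replace([98, 99], np.nan)
    df['cohort'] = df['cohort'].replace([9999], np.nan)
    df['adults'] = df['adults'].replace([9], np.nan)

replace_invalid(gss)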
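To see why the special codes matter, here is a minimal sketch on made-up data (not GSS values). It assumes 98 and 99 are the missing-data codes, as in the age column above, and compares the mean before and after replacing them with NaN.

```python
import numpy as np
import pandas as pd

# Synthetic ages; 98 and 99 stand in for missing-data codes (an assumption
# for this illustration, mirroring the codes used for age above).
age = pd.Series([25, 40, 61, 98, 99])

# If the codes are left in place, they are treated as real ages.
mean_with_codes = age.mean()

# After replacement, Pandas skips NaN when computing the mean.
cleaned = age.replace([98, 99], np.nan)
mean_cleaned = cleaned.mean()

print(mean_with_codes, mean_cleaned)
```

The two means differ substantially, even though only two of the five values are codes.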
Here are summary statistics for the variables I have validated and cleaned.
In [5]:
gss['year'].describe()
In [6]:
gss['sex'].describe()
In [7]:
gss['age'].describe()
In [8]:
gss['cohort'].describe()
In [9]:
gss['race'].describe()
In [10]:
gss['educ'].describe()
In [11]:
gss['realinc'].describe()
In [12]:
gss['wtssall'].describe()
Exercise:
Look through the column headings to find a few variables that look interesting. Look them up on the GSS data explorer.
Use value_counts to see what values appear in the dataset, and compare the results with the counts in the code book.
Identify special values that indicate missing data and replace them with NaN.
Use describe to compute summary statistics. What do you notice?
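The steps in this exercise can be sketched on a toy Series. The data and the codes 8 and 9 are invented for illustration; the real codes for each GSS variable have to come from the code book.

```python
import numpy as np
import pandas as pd

# Made-up responses; 8 and 9 play the role of missing-data codes here.
responses = pd.Series([1, 2, 2, 3, 8, 9, 2, 1, 9])

# Count how often each value appears, ordered by value.
counts = responses.value_counts().sort_index()
print(counts)

# Replace the (hypothetical) codes with NaN, then summarize.
cleaned = responses.replace([8, 9], np.nan)
summary = cleaned.describe()
print(summary)
```

Note that the count reported by describe excludes the NaN entries, so it is smaller than the length of the Series.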
In [13]:
from thinkstats2 import Hist, Pmf, Cdf
import thinkplot
hist_educ = Hist(gss.educ)
thinkplot.hist(hist_educ)
decorate(xlabel='Years of education',
ylabel='Count')
Hist as defined in thinkstats2 is different from hist as defined in Matplotlib. The difference is that Hist keeps all unique values and does not put them in bins. Also, hist does not handle NaN.
One of the hazards of using hist is that the shape of the result depends on the bin size.
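The difference between exact counts and binned counts can be sketched without any plotting. This example uses collections.Counter as a stand-in for Hist (it is not the thinkstats2 implementation) and np.histogram, which computes the same kind of binned counts a histogram plot displays. The data are made up.

```python
from collections import Counter
import numpy as np

# Synthetic years of education; not GSS data.
values = np.array([8, 12, 12, 12, 13, 14, 14, 16, 16, 16, 16, 18])

# Exact counts: one entry per unique value, no binning.
exact = Counter(values.tolist())

# Binned counts: the same data summarized with two different bin counts.
counts_3, edges_3 = np.histogram(values, bins=3)
counts_5, edges_5 = np.histogram(values, bins=5)

print(sorted(exact.items()))
print(counts_3, edges_3)
print(counts_5, edges_5)
```

Both binned summaries describe the same twelve values, but the apparent shape changes with the number of bins, while the exact counts do not depend on any such choice.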
Exercise:
Run the following cell and compare the result to the Hist above.
Add the keyword argument bins=11 to plt.hist and see how it changes the results.
Experiment with other numbers of bins.
In [14]:
import matplotlib.pyplot as plt
plt.hist(gss.educ.dropna())
decorate(xlabel='Years of education',
ylabel='Count')
However, a drawback of Hist and Pmf is that they are not very informative when the number of unique values is large, as in this example:
In [15]:
hist_realinc = Hist(gss.realinc)
thinkplot.hist(hist_realinc)
decorate(xlabel='Real income (1986 USD)',
ylabel='Count')
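One way to see why exact counts break down for a variable like this is to count how many distinct values a nearly continuous sample contains. The sketch below uses synthetic lognormal values as a rough stand-in for income; the distribution and its parameters are assumptions for illustration, not properties of realinc.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, roughly income-like values rounded to whole dollars; not GSS data.
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000)).round(0)

counts = income.value_counts()
n_unique = counts.size          # number of distinct values
largest_count = counts.max()    # how often the most common value appears

print(n_unique, largest_count)
```

When almost every value appears only once, a plot of exact counts is mostly bars of the same height, so it says little about the shape of the distribution.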
Exercise:
Make and plot a Hist of age.
Make and plot a Pmf of educ.
What fraction of people have 12, 14, and 16 years of education?
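As a hint for the last question, a PMF maps each value to the fraction of the sample with that value. Here is a minimal sketch using pandas value_counts with normalize=True rather than the thinkstats2 Pmf class, on made-up data.

```python
import pandas as pd

# Synthetic years of education; not GSS data.
educ = pd.Series([12, 12, 12, 14, 16, 16, 10, 12, 14, 18])

# normalize=True divides each count by the total, so the values sum to 1.
# value_counts drops NaN by default.
pmf = educ.value_counts(normalize=True).sort_index()
print(pmf)

# Look up the fraction for a single value.
frac_12 = pmf.loc[12]
print(frac_12)
```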
In [16]:
# Solution goes here
In [17]:
# Solution goes here
In [18]:
# Solution goes here
In [19]:
# Solution goes here
In [20]:
# Solution goes here
Exercise:
Make and plot a Cdf of educ.
What fraction of people have more than 12 years of education?
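As a hint, the CDF evaluated at x is the fraction of values less than or equal to x, so the fraction greater than x is its complement. The sketch below defines a small helper, ecdf, which is a name made up for this example and not part of thinkstats2. The data are synthetic.

```python
import numpy as np

# Synthetic years of education; not GSS data. Drop NaN before using real data.
educ = np.array([12, 12, 12, 14, 16, 16, 10, 12, 14, 18])

def ecdf(sample, x):
    """Fraction of values in sample that are less than or equal to x."""
    return np.mean(sample <= x)

frac_at_most_12 = ecdf(educ, 12)
frac_more_than_12 = 1 - frac_at_most_12

print(frac_at_most_12, frac_more_than_12)
```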
In [21]:
# Solution goes here
In [22]:
# Solution goes here
In [23]:
# Solution goes here
Exercise:
Make and plot a Cdf of age.
What is the median age? What is the inter-quartile range (IQR)?
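As a reminder of the definitions, the median is the 50th percentile and the IQR is the difference between the 75th and 25th percentiles. A minimal sketch with np.percentile on made-up ages follows; np.percentile interpolates between data points by default, so its results can differ slightly from percentiles read off an empirical CDF.

```python
import numpy as np

# Synthetic ages; not GSS data. Drop NaN before using real data.
age = np.array([18, 22, 25, 31, 38, 44, 52, 60, 67, 89], dtype=float)

q25, median, q75 = np.percentile(age, [25, 50, 75])
iqr = q75 - q25

print(median, iqr)
```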
In [24]:
# Solution goes here
In [25]:
# Solution goes here
In [26]:
# Solution goes here
Exercise:
Find another numerical variable, plot a histogram, PMF, and CDF, and compute any statistics of interest.
In [27]:
# Solution goes here
In [28]:
# Solution goes here
In [29]:
# Solution goes here
In [30]:
# Solution goes here
Exercise:
Compute the CDF of realinc for male and female respondents, and plot both CDFs on the same axes.
What is the difference in median income between the two groups?
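One possible approach is to split the data with groupby and build an empirical CDF for each group. The sketch below uses a toy DataFrame, and the group codes 1 and 2 are an assumption for illustration; check the code book for the actual coding of sex.

```python
import numpy as np
import pandas as pd

# Toy data; the codes 1 and 2 and the incomes are invented for illustration.
df = pd.DataFrame({
    'sex': [1, 1, 1, 1, 2, 2, 2, 2],
    'realinc': [20000, 35000, 50000, np.nan, 15000, 25000, 30000, 45000],
})

# Median per group; NaN values are skipped by default.
medians = df.groupby('sex')['realinc'].median()
diff = medians.loc[1] - medians.loc[2]

# Empirical CDF per group: sorted values (xs) and cumulative fractions (ps).
cdfs = {}
for code, group in df.groupby('sex'):
    xs = np.sort(group['realinc'].dropna().to_numpy())
    ps = np.arange(1, len(xs) + 1) / len(xs)
    cdfs[code] = (xs, ps)

print(medians)
print(diff)
```

Each (xs, ps) pair can be drawn as a step plot on shared axes, for example with plt.step(xs, ps, where='post').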
In [31]:
# Solution goes here
In [32]:
# Solution goes here
In [33]:
# Solution goes here
In [34]:
# Solution goes here
Exercise:
Use a variable to break the dataset into groups and plot multiple CDFs to compare distribution of something within groups.
Note: Try to find something interesting, but be cautious about overinterpreting the results. Between any two groups, there are often many differences, with many possible causes.
In [35]:
# Solution goes here
In [36]:
# Solution goes here
In [37]:
# Solution goes here
In [38]:
# Solution goes here
In [39]:
np.random.seed(19)
sample = utils.resample_by_year(gss, 'wtssall')
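The implementation of utils.resample_by_year is not shown in this notebook. As a rough sketch of what weighted resampling within each year could look like, the example below draws rows with replacement from each group, with probability proportional to a weight column. The function names resample_rows_weighted and resample_by_group, and the toy data, are hypothetical and may differ from what utils actually does.

```python
import numpy as np
import pandas as pd

def resample_rows_weighted(df, weight_col, rng):
    """Draw len(df) rows with replacement, with probability
    proportional to the values in weight_col."""
    return df.sample(n=len(df), replace=True,
                     weights=df[weight_col], random_state=rng)

def resample_by_group(df, group_col, weight_col, seed=19):
    """Resample each group separately and concatenate the results."""
    rng = np.random.RandomState(seed)
    parts = [resample_rows_weighted(group, weight_col, rng)
             for _, group in df.groupby(group_col)]
    return pd.concat(parts, ignore_index=True)

# Toy data: the respondent with weight 0 can never be drawn.
toy = pd.DataFrame({
    'year': [1972, 1972, 1972, 1974, 1974],
    'wtssall': [1.0, 2.0, 0.0, 0.5, 1.5],
    'resp_id': [1, 2, 3, 4, 5],
})

toy_sample = resample_by_group(toy, 'year', 'wtssall')
print(toy_sample)
```

Resampling within each year keeps the number of respondents per year unchanged, while the weights correct for unequal sampling probabilities within a year.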
Save the file.
In [40]:
# -f avoids an error if the file does not exist yet
!rm -f gss.hdf5
sample.to_hdf('gss.hdf5', key='gss')
Load it and see how fast it is!
In [41]:
%time gss = pd.read_hdf('gss.hdf5', 'gss')
gss.shape