In [1]:
```
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
import utils
from utils import decorate
from thinkstats2 import Pmf, Cdf
```
In [2]:
```
def read_gss(dirname):
    """Reads GSS files from the given directory.

    dirname: string

    returns: DataFrame
    """
    dct = utils.read_stata_dict(dirname + '/GSS.dct')
    gss = dct.read_fixed_width(dirname + '/GSS.dat.gz',
                               compression='gzip')
    return gss
```
In [3]:
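```
gss = read_gss('gss_eda')
print(gss.shape)
gss.head()
```

Some variables use special numeric codes to indicate missing data. The next cell replaces those codes with `NaN`, which Pandas recognizes as a missing value.
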
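To see why the replacement matters, here is a minimal sketch with made-up values (the Series below is hypothetical, not GSS data). It assumes 98 and 99 are missing-data codes, as the next cell does for `educ`.

```python
import numpy as np
import pandas as pd

# Hypothetical responses; 98 and 99 stand in for "missing" codes.
educ = pd.Series([12, 16, 98, 14, 99, 12])

# Treated as ordinary numbers, the codes inflate the mean.
raw_mean = educ.mean()

# Replacing the codes with NaN lets Pandas skip them.
cleaned = educ.replace([98, 99], np.nan)
clean_mean = cleaned.mean()

print(raw_mean, clean_mean, cleaned.count())
```

Methods such as `mean`, `count`, and `describe` ignore `NaN` by default, so the cleaned statistics are based only on the valid responses.
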
In [4]:
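```
def replace_invalid(df):
    """Replaces codes that mean 'missing' with NaN, modifying df in place."""
    # Assign back rather than calling replace(..., inplace=True) on an
    # attribute, which is not guaranteed to modify df in recent Pandas.
    df['realinc'] = df['realinc'].replace([0], np.nan)
    df['educ'] = df['educ'].replace([98, 99], np.nan)
    # 89 means 89 or older
    df['age'] = df['age'].replace([98, 99], np.nan)
    df['cohort'] = df['cohort'].replace([9999], np.nan)
    df['adults'] = df['adults'].replace([9], np.nan)

replace_invalid(gss)
```

Here are summary statistics for the variables I have validated and cleaned.
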
In [5]:
```
gss['year'].describe()
```
In [6]:
```
gss['sex'].describe()
```
In [7]:
```
gss['age'].describe()
```
In [8]:
```
gss['cohort'].describe()
```
In [9]:
```
gss['race'].describe()
```
In [10]:
```
gss['educ'].describe()
```
In [11]:
```
gss['realinc'].describe()
```
In [12]:
```
gss['wtssall'].describe()
```

**Exercise**

Look through the column headings to find a few variables that look interesting. Look them up on the GSS data explorer.

1. Use `value_counts` to see what values appear in the dataset, and compare the results with the counts in the code book.
2. Identify special values that indicate missing data and replace them with `NaN`.
3. Use `describe` to compute summary statistics. What do you notice?

In [13]:
```
from thinkstats2 import Hist, Pmf, Cdf
import thinkplot
hist_educ = Hist(gss.educ)
thinkplot.hist(hist_educ)
decorate(xlabel='Years of education',
         ylabel='Count')
```

`Hist` as defined in `thinkstats2` is different from `hist` as defined in Matplotlib. The difference is that `Hist` keeps all unique values and does not put them in bins. Also, `hist` does not handle `NaN`.

One of the hazards of using `hist` is that the shape of the result depends on the bin size.

**Exercise:**

1. Run the following cell and compare the result to the `Hist` above.
2. Add the keyword argument `bins=11` to `plt.hist` and see how it changes the results.
3. Experiment with other numbers of bins.

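The effect of bin size can also be seen without plotting. The sketch below uses made-up values and `numpy.histogram`, which computes the kind of binned counts that `plt.hist` draws; `collections.Counter` stands in for the unbinned counts that a `Hist` keeps. Neither is the `thinkstats2` implementation.

```python
from collections import Counter
import numpy as np

# Hypothetical years of education, with spikes at 12 and 16.
values = [8, 10, 11, 12, 12, 12, 12, 13, 14, 14, 16, 16, 16, 18, 20]

# Unbinned: one count per unique value, so the spikes stay visible.
exact = Counter(values)

# Binned: the counts depend on how many bins we ask for.
counts_3, edges_3 = np.histogram(values, bins=3)
counts_6, edges_6 = np.histogram(values, bins=6)

print(sorted(exact.items()))
print(counts_3, edges_3)
print(counts_6, edges_6)
```

With only a few bins, the spikes are absorbed into wider intervals; the unbinned counts keep each value separate.
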
In [14]:
```
import matplotlib.pyplot as plt
plt.hist(gss.educ.dropna())
decorate(xlabel='Years of education',
         ylabel='Count')
```

A limitation of `Hist` and `Pmf` is that they basically don't work when the number of unique values is large, as in this example:

In [15]:
```
hist_realinc = Hist(gss.realinc)
thinkplot.hist(hist_realinc)
decorate(xlabel='Real income (1986 USD)',
         ylabel='Count')
```

**Exercise:**

1. Make and plot a `Hist` of `age`.
2. Make and plot a `Pmf` of `educ`.
3. What fraction of people have 12, 14, and 16 years of education?

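A PMF is a histogram whose counts have been divided by the total, so each value maps to the fraction of the sample that has it. The sketch below shows that idea with made-up values and plain Python; it is not the `thinkstats2` implementation, and `make_pmf` is a hypothetical helper.

```python
from collections import Counter

def make_pmf(values):
    """Maps each unique value to its fraction of the sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {x: n / total for x, n in sorted(counts.items())}

# Hypothetical values, not GSS data.
pmf = make_pmf([12, 12, 12, 14, 16, 16, 18, 12])
print(pmf[12], pmf[16])
```

Because the fractions add up to 1, looking up a value in the PMF answers "what fraction of the sample has exactly this value?"
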
In [16]:
```
# Solution goes here
```
In [17]:
```
# Solution goes here
```
In [18]:
```
# Solution goes here
```
In [19]:
```
# Solution goes here
```
In [20]:
```
# Solution goes here
```

**Exercise:**

1. Make and plot a `Cdf` of `educ`.
2. What fraction of people have more than 12 years of education?

In [21]:
```
# Solution goes here
```
In [22]:
```
# Solution goes here
```
In [23]:
```
# Solution goes here
```

**Exercise:**

1. Make and plot a `Cdf` of `age`.
2. What is the median age? What is the inter-quartile range (IQR)?

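The median and IQR can be read off a CDF: the median is the value where the CDF reaches 0.5, and the IQR is the distance between the values where it reaches 0.25 and 0.75. As a minimal sketch on made-up ages, `numpy.percentile` computes these quantiles directly. It interpolates between data points by default, so its results may differ slightly from a lookup on a `thinkstats2` `Cdf`.

```python
import numpy as np

# Hypothetical ages, not GSS data.
ages = np.array([19, 23, 31, 35, 42, 47, 53, 60, 68])

def cdf_at(x):
    """Empirical CDF: fraction of values less than or equal to x."""
    return np.mean(ages <= x)

q25, median, q75 = np.percentile(ages, [25, 50, 75])
iqr = q75 - q25
print(median, iqr, cdf_at(median))
```
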
In [24]:
```
# Solution goes here
```
In [25]:
```
# Solution goes here
```
In [26]:
```
# Solution goes here
```

**Exercise:**

Find another numerical variable, plot a histogram, PMF, and CDF, and compute any statistics of interest.

In [27]:
```
# Solution goes here
```
In [28]:
```
# Solution goes here
```
In [29]:
```
# Solution goes here
```
In [30]:
```
# Solution goes here
```

**Exercise:**

1. Compute the CDF of `realinc` for male and female respondents, and plot both CDFs on the same axes.
2. What is the difference in median income between the two groups?

In [31]:
```
# Solution goes here
```
In [32]:
```
# Solution goes here
```
In [33]:
```
# Solution goes here
```
In [34]:
```
# Solution goes here
```

**Exercise:**

Use a variable to break the dataset into groups and plot multiple CDFs to compare the distribution of something within groups.

Note: Try to find something interesting, but be cautious about overinterpreting the results. Between any two groups, there are often many differences, with many possible causes.

In [35]:
```
# Solution goes here
```
In [36]:
```
# Solution goes here
```
In [37]:
```
# Solution goes here
```
In [38]:
```
# Solution goes here
```
In [39]:
```
np.random.seed(19)
sample = utils.resample_by_year(gss, 'wtssall')
```

Save the file.

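The source of `utils.resample_by_year` is not shown here. Assuming it draws rows with replacement, with probability proportional to the sampling weight in `wtssall`, separately within each survey year, the idea can be sketched with `DataFrame.sample`. The function name `resample_weighted_by_year` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def resample_weighted_by_year(df, weight_col, seed=None):
    """Resamples rows within each year, with probability proportional to weight.

    Assumption: this mirrors what utils.resample_by_year is meant to do.
    """
    rng = np.random.default_rng(seed)
    parts = [
        group.sample(n=len(group), replace=True,
                     weights=group[weight_col], random_state=rng)
        for _, group in df.groupby('year')
    ]
    return pd.concat(parts, ignore_index=True)

# Toy data: a zero weight means the row can never be drawn.
toy = pd.DataFrame({
    'year': [1972, 1972, 1972, 1974, 1974],
    'age': [25, 40, 61, 33, 58],
    'wtssall': [1.0, 2.0, 0.0, 0.5, 1.5],
})
resampled = resample_weighted_by_year(toy, 'wtssall', seed=19)
print(resampled)
```

Each year keeps its original number of rows, but rows with larger weights tend to appear more often, so unweighted statistics on the resampled data approximate weighted statistics on the original.
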
In [40]:
```
# -f avoids an error if the file does not exist yet
!rm -f gss.hdf5
sample.to_hdf('gss.hdf5', key='gss')
```

Load it and see how fast it is!

In [41]:
```
%time gss = pd.read_hdf('gss.hdf5', 'gss')
gss.shape
```