In [ ]:
# Run this cell to set up the notebook.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab04.ok')
Today's lab has a few goals:
We'll be analyzing a dataset of names. The data were collected by the US Social Security Administration; it includes a large fraction of all people who were born in the United States in recent years. (Because it is based on official government records, the dataset includes some anachronisms, including binary sex classification and the use of the word "gender" rather than "sex." We've left it as we found it.)
Note: Most of the plotting in this lab can be done with <DataFrame>.plot.<plot_method_name>. You can find the documentation for Pandas plotting here.
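For instance, here is a minimal sketch of that pattern on a tiny made-up DataFrame (the data are hypothetical, just to show the one-call plotting interface):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import pandas as pd

# Hypothetical toy data, just to illustrate the <DataFrame>.plot.<method> pattern
df = pd.DataFrame({"Year": [2010, 2011, 2012], "Count": [3, 5, 4]})
ax = df.plot.line(x="Year", y="Count")  # a line chart in a single method call
```

The same pattern works for `.plot.bar`, `.plot.barh`, and `.plot.hist`, which we'll use below.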
We've included several CSV and PDF files in names.zip. These files are from kaggle.com's site for this dataset. Peruse them to see what the data look like. Note that Kaggle seems to have modified the data format, so the PDF descriptions are not exactly accurate!
In [ ]:
!unzip -o names.zip
In [ ]:
names = pd.read_csv("NationalNames.csv")
In [ ]:
names.head()
In [ ]:
names.groupby("Year")["Count"].sum().plot.line(); #SOLUTION
In [ ]:
names_2014 = names[names['Year'] == 2014]
least_popular_2014 = names_2014.sort_values("Count").iloc[:20]
most_popular_2014 = names_2014.sort_values("Count").iloc[-20:]
least_popular_2014.plot.bar("Name", "Count")
most_popular_2014.plot.bar("Name", "Count");
Using the plots you made in question 2, figure out one reason why the chart you made in question 1 probably understates the number of births per year in the US. Then find a source online listing US births by year and check whether - and by how much - the chart understates the number of births.
SOLUTION: The dataset apparently does not include names that occur fewer than 5 times, to avoid personally identifying anyone. So such people are not recorded in the dataset. Based on this dataset from HHS, the number of births might be understated by around 100,000 per year - not bad.
The visualization in question 2 didn't help us answer this question:
"Among names of people born in 2014, what is the distribution of frequencies?"
Create a histogram to attempt to answer that question. Your histogram should tell you, for example, how many names occurred fewer than 100 times.
In [ ]:
names_2014["Count"].plot.hist(bins=250); #SOLUTION
You should find that your histogram is not terribly useful, except to tell you that almost all names occurred fewer than, say, 500 times. A few names are much more frequent than the others. Your plot in question 2 should tell you what many of those names are.
Now consider the new question:
"Among people born in 2014, what is the distribution of name popularity?"
SOLUTION: The previous question was about names, while this is about people. It's possible that most names are rare, while most people have common names. From the histogram we made in question 4, we might guess that most people have names that only occur a few times.
The cell below creates a histogram helping us answer questions like this:
"How many people had names occurring between 5000 and 5500 times?"
In [ ]:
names_in_2014 = names[names['Year'] == 2014]
plt.hist(names_in_2014["Count"], weights=names_in_2014["Count"], bins=np.arange(0, names_in_2014["Count"].max()+500, 500));
Create a visualization to help us answer questions like this:
"How many people had names occurring fewer than 5000 times?"
Then use your visualization to estimate the number of people who had names occurring fewer than 2500 times, giving your estimate the name num_people_with_rare_names.
(This question is a bit more natural and should lead to a visualization that's more readable than our histogram.)
In [ ]:
names_in_2014.groupby("Count")["Count"].sum()\
.sort_index(ascending=True)\
.cumsum()\
.plot.line();
num_people_with_rare_names = 2000000
In [ ]:
_ = ok.grade('q06')
_ = ok.backup()
You probably saw in question 2 that the least-common names occur 5 times in a year. That's because the data are "censored" to remove names that occur fewer than 5 times.
SOLUTION: They don't want to give out personally-identifiable information about individuals. If you were the only person with a certain name, then the dataset would tell us when you were born.
However, you might imagine that it's possible to extrapolate from the pattern of name frequencies in our data to the pattern of name frequencies for names that appeared 4 or fewer times. Then we could try to answer interesting questions like, "How many distinct names were given to people born in 2014?" In later weeks, we'll have the tools to do that.
You can generate a random sample of the rows of a Pandas table with the sample method, as in:
names.sample(n = 10000, replace = False)
But if we did that, we wouldn't get the kind of dataset we'd see if the Social Security Administration had sampled randomly from people born in each year and reported their name counts. Why not?
SOLUTION: We would be sampling names, not people. Some names would be missing entirely, and others would have all their data. We need to disaggregate the data and then sample individual people.
Generate a random sample of 100,000 people born in 2014. Do this by implementing the function sample_counts_table according to its documentation. (Note that it should sample with replacement.) Call that function to compute a table called sample_names_2014. It should have the same columns as names, and sample_names_2014["Count"].sum() should equal 100,000.
Hint: In sample_counts_table, use groupby to re-aggregate the table after sampling it. Before returning the result, use reset_index to take the columns you pass to groupby out of the index and make them ordinary columns again.
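To see the hint's groupby-then-reset_index pattern in isolation, here is a sketch on a tiny made-up table (the names and counts are hypothetical):

```python
import pandas as pd

# Hypothetical sampled rows: each row stands for one sampled person
sampled = pd.DataFrame({"Name": ["Ada", "Ada", "Bo"],
                        "Count": [1, 1, 1]})

# Re-aggregate: group by the non-count column(s), sum the counts, then use
# reset_index so "Name" leaves the index and becomes an ordinary column again.
regrouped = (sampled.groupby("Name")
                    .sum()
                    .reset_index())
# regrouped has one row per name, with "Count" summed within each group
```

The same idea applies inside sample_counts_table, just with several grouping columns instead of one.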
In [ ]:
def sample_counts_table(table, count_column_name, sample_size):
    """Given a table with a column of counts, produces a table that's a random sample
    **with replacement** from that table.

    The sampling is not done directly on the rows of the given table. Instead, the sample
    is taken *as if* from a copy of that table, where each row is duplicated once per
    count. The resulting table is aggregated, so it looks similar to the original table,
    except that its counts column sums to sample_size.

    Args:
        table (pd.DataFrame): Any DataFrame
        count_column_name (str): The name of a column in table. That column should
            contain integer counts.
        sample_size (int): The size of the sample

    Returns:
        pd.DataFrame: A copy of table, but with total count equal to sample_size,
            sampled randomly as described above.
    """
    sampled = table.sample(n=sample_size, replace=True, weights=count_column_name)
    other_columns = list(set(table.columns) - set([count_column_name]))
    regrouped = sampled.groupby(other_columns).count()
    return regrouped.reset_index(level=other_columns)

sample_names_2014 = sample_counts_table(names_in_2014, "Count", 100000)
In [ ]:
_ = ok.grade('q09')
_ = ok.backup()
In [ ]:
n = 5
names_in_2014.sort_values("Count", ascending=False).head(n=n).plot.barh("Name", "Count") #SOLUTION
sample_names_2014.sort_values("Count", ascending=False).head(n=n).plot.barh("Name", "Count"); #SOLUTION
When you see a name for the first time in a familiar language (for example, in a fiction book), you can probably make a guess about the sex of the person with the name. For example, a person named Almon is probably male, and a person named Alma is probably female. What patterns has your brain learned to let you make such predictions?
One simple pattern you might guess is that names of females more often end in vowels, while names of males more often end in consonants. Let's find out.
For each letter in English, compute the proportion of males and females whose name ends with that letter. Do this by filling in the function pivot_by_ending
and calling it on sample_names_2014
. Use your sample table, sample_names_2014
, as in the filled-in code.
Your table should look like this (but with different values and with 26 rows):
Ending | F | M |
---|---|---|
a | .7 | .3 |
b | .2 | .8 |
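Before filling in the function, it may help to see how the last letter of each name can be extracted with the pandas string accessor (the names below are made up for illustration):

```python
import pandas as pd

# Hypothetical names; .str[-1] takes the last character of each string
names = pd.Series(["Alma", "Almon", "Ada"])
endings = names.str[-1]
# endings is the Series ["a", "n", "a"]
```

A column built this way can then serve as the index of the pivot table.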
In [ ]:
def pivot_by_ending(names):
    """Find the last letter of each name in the given table and pivot by
    that value and by gender.

    Args:
        names (pd.DataFrame): A table like `names_in_2014`, having at least
            the following columns:
            * 'Name': A column of strings
            * 'Gender': A column of strings 'F' or 'M'
            * 'Count': A column of integers, the number
              of individuals represented by each row

    Returns:
        pd.DataFrame: A table pivoted by last letter of name and by gender.
        See above for an example."""
    # This solution is slick, but it's fine to instead use
    # apply or a list comprehension.
    with_ending = names.assign(Ending=names["Name"].str[-1])
    return with_ending.pivot_table(
        index=['Ending'],    # column(s) that will index the pivoted table
        columns=['Gender'],  # each value in this column (or columns) gets its own column in the pivoted table
        values='Count',      # the field to process in each group
        aggfunc='sum',       # the operation to apply to the collection of field values in each group
        fill_value=0         # the default value for an empty cell in the resulting table
    )

sample_names_2014_by_ending = pivot_by_ending(sample_names_2014)
In [ ]:
_ = ok.grade('q11')
_ = ok.backup()
Run the next cell to make a bar chart of your results. (The male proportion is 1 minus the female proportion.)
In [ ]:
def compute_proportions(pivoted_by_letter):
    totals = pivoted_by_letter['M'] + pivoted_by_letter['F']
    return pd.DataFrame({
        "Ending": pivoted_by_letter.index.values,
        "Male proportion": pivoted_by_letter['M'] / totals,
        "Female proportion": pivoted_by_letter['F'] / totals})

proportions_in_2014_sample = compute_proportions(sample_names_2014_by_ending).sort_values("Female proportion", ascending=False)
proportions_in_2014_sample[["Female proportion"]].plot.bar(color="Pink");
In [ ]:
# Constants provided for your convenience.
num_resamples = 200
resample_size = 100000
# Compute a list of num_resamples tables, each one like `proportions_in_2014_sample`.
# Each should be based on a resample of sample_names_2014.
# This should take around a minute.
resamples = [compute_proportions(pivot_by_ending(sample_counts_table(sample_names_2014, "Count", resample_size))) for _ in range(num_resamples)] #SOLUTION
In [ ]:
_ = ok.grade('q13')
_ = ok.backup()
The cell below displays your bootstrapped proportions.
In [ ]:
resample_proportions = pd.concat(resamples)
sns.boxplot(x="Ending",
y="Female proportion",
data=resample_proportions,
order=proportions_in_2014_sample["Ending"],
color="Pink");
In [ ]:
compute_proportions(pivot_by_ending(names_in_2014))\
.loc[proportions_in_2014_sample["Ending"]]\
[["Female proportion"]]\
.plot.bar(color=("Pink"));
plt.title("")
SOLUTION: Very few names end in 'q', so in our sample of 100,000, there were very few such people. That means our sample size for estimating the proportion of females among people with names ending in 'q' is very small.
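To see this effect directly, here is a quick illustrative simulation (not part of the lab solution): proportion estimates from a small group spread out far more than estimates from a large one.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.5           # suppose the true female proportion is one half
small_n, large_n = 10, 10000

# 1000 simulated proportion estimates at each sample size
small_estimates = rng.binomial(small_n, true_p, size=1000) / small_n
large_estimates = rng.binomial(large_n, true_p, size=1000) / large_n

# The spread of the estimates shrinks roughly like 1/sqrt(n), so the
# small-sample estimates (like those for rare endings such as 'q')
# are much noisier.
print(small_estimates.std(), large_estimates.std())
```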
In [ ]:
i_finished_the_lab = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()
Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal Github repo.
# Tell git to commit your changes to this notebook
git add sp17/lab/lab04/lab04.ipynb
# Tell git to make the commit
git commit -m "lab04 finished"
# Send your updates to your personal private repo
git push origin master