Lab 4: Baby Names


In [ ]:
# Run this cell to set up the notebook.
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab04.ok')

Today's lab has a few goals:

  1. Get more practice with Pandas and visualization code.
  2. Exercise some of the principles of visualization we've been learning about.
  3. Do some basic prediction and inference.

We'll be analyzing a dataset of names. The data were collected by the US Social Security Administration; they cover a large fraction of all people born in the United States in recent years. (Because it is based on official government records, the dataset includes some anachronisms, including binary sex classification and the use of the word "gender" rather than "sex." We've left it as we found it.)

Note: Most of the plotting in this lab can be done with <DataFrame>.plot.<plot_method_name>. You can find the details in the Pandas plotting documentation.
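As a quick illustration of that pattern, here's a toy DataFrame (invented numbers, not part of the lab data) plotted with one of those methods; the call returns a matplotlib Axes you can customize further:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import pandas as pd

# Invented data, purely for illustration.
toy = pd.DataFrame({"Year": [2012, 2013, 2014], "Count": [10, 12, 9]})

# Line plot of Count against Year via the <DataFrame>.plot.<method> accessor.
ax = toy.plot.line(x="Year", y="Count")
```

The same accessor exposes `.bar`, `.barh`, `.hist`, and the other plot types you'll use below.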

What's in a name database?

We've included several CSV and PDF files in names.zip. These files are from kaggle.com's site for this dataset. Peruse them to see what the data look like. Note that Kaggle seems to have modified the data format, so the PDF descriptions are not exactly accurate!


In [ ]:
!unzip -o names.zip

In [ ]:
names = pd.read_csv("NationalNames.csv")

In [ ]:
names.head()

EDA

Let's do some basic exploration.

Question 1

In the national database, how many people were recorded each year? Create an appropriate visualization.


In [ ]:
names.groupby("Year")["Count"].sum().plot.line(); #SOLUTION

Question 2

In the national database, among people born in the most recent year, what are the frequencies of the most common and least common names? Create appropriate visualizations.


In [ ]:
names_2014 = names[names['Year'] == 2014]
by_count = names_2014.sort_values("Count")
least_popular_2014 = by_count.iloc[:20]
most_popular_2014 = by_count.iloc[-20:]
least_popular_2014.plot.bar("Name", "Count")
most_popular_2014.plot.bar("Name", "Count");

Question 3

Using the plots you made in question 2, figure out one reason why the chart you made in question 1 probably understates the number of births per year in the US. Then find a source online listing US births by year and check whether - and by how much - the chart understates the number of births.

SOLUTION: The dataset apparently does not include names that occur fewer than 5 times, to avoid personally identifying anyone. So such people are not recorded in the dataset. Based on this dataset from HHS, the number of births might be understated by around 100,000 per year - not bad.
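To make the undercount concrete, here is a toy sketch with invented counts (the names and numbers are hypothetical, not from the SSA file) showing how dropping names that occur fewer than 5 times shrinks the recorded total:

```python
import pandas as pd

# Hypothetical name counts for one year (invented for illustration).
true_counts = pd.Series({"Ava": 1200, "Noah": 950, "Zuri": 4, "Bodhi": 3, "Wren": 2})

# An SSA-style file would keep only names occurring at least 5 times.
published = true_counts[true_counts >= 5]

# People who exist but never appear in the published totals.
undercount = true_counts.sum() - published.sum()
print(undercount)  # → 9
```

In this toy year, 9 of 2,159 people vanish from the file; in the real data the same mechanism removes on the order of 100,000 people per year.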

Question 4

The visualization in question 2 didn't help us answer this question:

"Among names of people born in 2014, what is the distribution of frequencies?"

Create a histogram to attempt to answer that question. Your histogram should tell you, for example, how many names occurred fewer than 100 times.


In [ ]:
names_2014["Count"].plot.hist(bins=250); #SOLUTION

You should find that your histogram is not terribly useful, except to tell you that almost all names occurred fewer than, say, 500 times. A few names are much more frequent than the others. Your plot in question 2 should tell you what many of those names are.

Now consider the new question:

"Among people born in 2014, what is the distribution of name popularity?"

Question 5

How is this a different question from the one in question 4? How might the plot in question 4 mislead us about the answer to this new question?

SOLUTION: The previous question was about names, while this is about people. It's possible that most names are rare, while most people have common names. From the histogram we made in question 4, we might guess that most people have names that only occur a few times.
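A tiny invented example makes the names-versus-people distinction vivid: most *names* can be rare even while almost every *person* has a common name.

```python
import pandas as pd

# Invented counts: four names, one very common (hypothetical data).
counts = pd.Series({"Emma": 997, "Aoife": 1, "Soren": 1, "Ines": 1})

# Fraction of *names* that are rare (occur once):
frac_rare_names = (counts == 1).mean()

# Fraction of *people* whose name is rare:
frac_people_with_rare_names = counts[counts == 1].sum() / counts.sum()

print(frac_rare_names)              # → 0.75: most names are rare...
print(frac_people_with_rare_names)  # → 0.003: ...but almost nobody has one
```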

The cell below creates a histogram helping us answer questions like this:

"How many people had names occurring between 5000 and 5500 times?"


In [ ]:
names_in_2014 = names[names['Year'] == 2014]
plt.hist(names_in_2014["Count"], weights=names_in_2014["Count"], bins=np.arange(0, names_in_2014["Count"].max()+500, 500));

Question 6

Create a visualization to help us answer questions like this:

"How many people had names occurring fewer than 5000 times?"

Then use your visualization to estimate the number of people who had names occurring fewer than 2500 times, giving your estimate the name num_people_with_rare_names.

(This question is a bit more natural and should lead to a visualization that's more readable than our histogram.)


In [ ]:
names_in_2014.groupby("Count")["Count"].sum()\
          .sort_index(ascending=True)\
          .cumsum()\
          .plot.line();

num_people_with_rare_names = 2000000

In [ ]:
_ = ok.grade('q06')
_ = ok.backup()

You probably saw in question 2 that the least-common names occur 5 times in a year. That's because the data are "censored" to remove names that occur fewer than 5 times.

Question 7

Why do you think the Social Security Administration does that?

SOLUTION: They don't want to give out personally-identifiable information about individuals. If you were the only person with a certain name, then the dataset would tell us when you were born.

However, you might imagine that it's possible to extrapolate from the pattern of name frequencies in our data to the pattern of name frequencies for names that appeared 4 or fewer times. Then we could try to answer interesting questions like, "How many distinct names were given to people born in 2014?" In later weeks, we'll have the tools to do that.

What if we'd had a sample?

We are pretty lucky to have a census of most US births. Let's generate a random sample to see how it would affect our inferences in the remainder of the lab.

Question 8

You can generate a random sample of the rows of a Pandas table with the sample method, as in:

names.sample(n = 10000, replace = False)

But if we did that, we wouldn't get the kind of dataset we'd see if the Social Security Administration had sampled randomly from people born in each year and reported their name counts. Why not?

SOLUTION: We would be sampling names, not people. Some names would be missing entirely, and others would have all their data. We need to disaggregate the data and then sample individual people.
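Here's a hedged toy sketch (invented mini-table, not the real data) of the difference: uniform row sampling treats every name equally, while count-weighted sampling with replacement behaves like sampling individual people.

```python
import pandas as pd

# Invented mini-table: one very common name and two rare ones.
tbl = pd.DataFrame({"Name": ["Emma", "Aoife", "Soren"], "Count": [9998, 1, 1]})

seed = 0  # fixed seed so the sketch is reproducible

# Sampling *rows* uniformly gives Emma only a 1-in-3 chance per draw,
# even though 99.98% of the people in this toy year are named Emma.
row_sample = tbl.sample(n=1000, replace=True, random_state=seed)

# Sampling with weights="Count" is like sampling *people*: nearly every
# draw should be Emma.
person_sample = tbl.sample(n=1000, replace=True, weights="Count", random_state=seed)

print((row_sample["Name"] == "Emma").mean())     # ≈ 1/3
print((person_sample["Name"] == "Emma").mean())  # ≈ 1
```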

Question 9

Generate a random sample of 100,000 people born in 2014. Do this by implementing the function sample_counts_table according to its documentation. (Note that it should sample with replacement.) Call that function to compute a table called sample_names_2014. It should have the same columns as names, and sample_names_2014["Count"].sum() should equal 100,000.

Hint: In sample_counts_table, use groupby to re-aggregate the table after sampling it. Before returning the result, use reset_index to take the columns you pass to groupby out of the index and make them ordinary columns again.


In [ ]:
def sample_counts_table(table, count_column_name, sample_size):
    """Given a table with a column of counts, produces a table that's a random sample
    **with replacement** from that table.
    
    The sampling is not done directly on the rows of the given table.  Instead, the sample
    is taken *as if* from a copy of that table, where each row is duplicated once per
    count.  The resulting table is aggregated, so it looks similar to the original table,
    except that its counts column sums to sample_size.
    
    Args:
        table (pd.DataFrame): Any DataFrame
        count_column_name (str): The name of a column in table.  That column should
                                 contain integer counts.
        sample_size (int): The size of the sample
    
    Returns:
        pd.DataFrame: A copy of table, but with total count equal to sample_size,
                      sampled randomly as described above.
    """
    sampled = table.sample(n = sample_size, replace=True, weights = count_column_name)
    other_columns = list(set(table.columns) - set([count_column_name]))
    regrouped = sampled.groupby(other_columns).count()
    return regrouped.reset_index(level=other_columns)

sample_names_2014 = sample_counts_table(names_in_2014, "Count", 100000)

In [ ]:
_ = ok.grade('q09')
_ = ok.backup()

Question 10

Verify that your sample looks reasonable. Do this by checking whether the following two bar charts look similar:

  1. The frequencies of the top 5 names in 2014.
  2. The frequencies of the top 5 names in 2014, using your sample.

In [ ]:
n = 5
names_in_2014.sort_values("Count", ascending=False).head(n=n).plot.barh("Name", "Count") #SOLUTION
sample_names_2014.sort_values("Count", ascending=False).head(n=n).plot.barh("Name", "Count"); #SOLUTION

Are names visually related to birth-assigned sex?

When you see a name for the first time in a familiar language (in a work of fiction, for example), you can probably make a guess about the sex of the person with the name. For example, a person named Almon is probably male, and a person named Alma is probably female. What patterns has your brain learned that let you make such predictions?

One simple pattern you might guess is that names of females more often end in vowels, while names of males more often end in consonants. Let's find out.

Question 11

For each letter in English, compute the proportion of males and females whose name ends with that letter. Do this by filling in the function pivot_by_ending and calling it on sample_names_2014. Use your sample table, sample_names_2014, as in the filled-in code.

Your table should look like this (but with different values and with 26 rows):

Ending    F    M
a        .7   .3
b        .2   .8

In [ ]:
def pivot_by_ending(names):
    """Find the last letter of each name in the given table and pivot by
    that value and by gender.
    
    Args:
        names (pd.DataFrame): A table like `names_in_2014`, having at least
                              the following columns:
                               * 'Name': A column of strings
                               * 'Gender': A column of strings 'F' or 'M'
                               * 'Count': A column of integers, the number
                                 of individuals represented by each row
    Returns:
        pd.DataFrame: A table pivoted by last letter of name and by gender.
                      See above for an example."""
    # This solution is slick, but it's fine to instead use
    # apply or a list comprehension.
    with_ending = names.assign(Ending = names["Name"].str[-1])
    return with_ending.pivot_table(
        index=['Ending'], # column(s) that will index the pivoted table
        columns=['Gender'], # Each value in this column (or columns) will get its own column in the pivoted table.
        values='Count', # the field to process in each group
        aggfunc='sum', # the operation to apply to the collection of field values in each group
        fill_value=0 # the default value for an empty cell in the resulting table
    )

sample_names_2014_by_ending = pivot_by_ending(sample_names_2014)

In [ ]:
_ = ok.grade('q11')
_ = ok.backup()

Run the next cell to make a bar chart of your results. (The male proportion is 1 minus the female proportion.)


In [ ]:
def compute_proportions(pivoted_by_letter):
    totals = pivoted_by_letter['M'] + pivoted_by_letter['F']
    return pd.DataFrame({
            "Ending": pivoted_by_letter.index.values,
            "Male proportion": pivoted_by_letter['M'] / totals,
            "Female proportion": pivoted_by_letter['F'] / totals})

proportions_in_2014_sample = compute_proportions(sample_names_2014_by_ending).sort_values("Female proportion", ascending=False)
proportions_in_2014_sample[["Female proportion"]].plot.bar(color="Pink");

Question 12

Discuss what the chart tells us. Is there a strong association between sex and the last letter of a name in our sample?

Uncertainty

The bar chart was built from a sample. It's possible that the relationships we saw don't really exist in the full dataset.

To quantify our uncertainty, let's run the bootstrap. We can reuse most of our earlier code. (That's why we packaged everything into functions.)

Question 13

Compute 200 bootstrap resamples from sample_names_2014. For each resample, compute a table of proportions by calling your compute_proportions and pivot_by_ending functions on it.


In [ ]:
# Constants provided for your convenience.
num_resamples = 200
resample_size = 100000

# Compute a list of num_resamples tables, each one like `proportions_in_2014_sample`.
# Each should be based on a resample of sample_names_2014.
# This should take around a minute.
resamples = [compute_proportions(pivot_by_ending(sample_counts_table(sample_names_2014, "Count", resample_size))) for _ in range(num_resamples)] #SOLUTION

In [ ]:
_ = ok.grade('q13')
_ = ok.backup()

The cell below displays your bootstrapped proportions.


In [ ]:
resample_proportions = pd.concat(resamples)
sns.boxplot(x="Ending",
            y="Female proportion",
            data=resample_proportions,
            order=proportions_in_2014_sample["Ending"],
            color="Pink");

For comparison, the next cell plots the same proportions computed from the full 2014 data:


In [ ]:
compute_proportions(pivot_by_ending(names_in_2014))\
    .loc[proportions_in_2014_sample["Ending"]]\
    [["Female proportion"]]\
    .plot.bar(color=("Pink"));
plt.title("")

Question 14

Why is the box for names ending in 'q' so wide?

SOLUTION: Very few names end in 'q', so in our sample of 100,000, there were very few such people. That means our sample size for estimating the proportion of females among people with names ending 'q' is very small.
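This is the usual small-sample effect: the standard error of a proportion shrinks like sqrt(p(1-p)/n). A minimal simulation (hypothetical true proportion p = 0.5, invented sample sizes) shows how much more proportion estimates bounce around when few people have a given ending:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # hypothetical true female proportion for some ending

# Simulate many bootstrap-style proportion estimates at two sample sizes.
small = rng.binomial(10, p, size=2000) / 10      # 10 people with this ending
large = rng.binomial(1000, p, size=2000) / 1000  # 1000 people with this ending

# The small-sample estimates vary roughly 10x more, which is why the
# box for a rare ending like 'q' is so wide.
print(small.std(), large.std())
```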

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.


In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal Github repo.

# Tell git to commit your changes to this notebook
git add sp17/lab/lab04/lab04.ipynb

# Tell git to make the commit
git commit -m "lab04 finished"

# Send your updates to your personal private repo
git push origin master