Lab 01: Introduction to exploratory data analysis

Introduction

This week's lab provides a practical introduction to exploratory data analysis. At the end of the lab, you should be able to use pandas to:

Load data from a CSV file.
Select data by row, column and cell.
Plot the data.
Compute summary statistics about the data.

Before we get started, there are a few things to cover.

Python

To complete the lab, you will need some previous experience of reading and writing Python code. If you don't have any previous experience of Python, don't worry! This course only requires that you know the basics and these are easy to pick up in a few hours.

The best place to start learning is Codecademy. This is a free, interactive, online course and is designed for beginners. You don't have to complete the whole thing - if you stop after Section 8, you will have covered just about everything you need to know for this class.

Jupyter Notebook

Jupyter Notebook is the browser-based environment you're reading this text in. Using the notebook is relatively painless, but if you do get stuck, there's plenty of help available in the official Jupyter Notebook support documentation. Be sure to have a quick glance at What is the Jupyter Notebook?, Notebook Basics and Running Code topics. Each week, labs will be distributed in Jupyter Notebook format, so it's best to get familiar with it as quickly as possible.

Before we start, let's take a quick tour of the notebook interface. To start the tour, click the Help menu above and select User Interface Tour. When you're finished, take a look at Jupyter's keyboard shortcuts. You can find these by clicking the Help menu above and selecting Keyboard Shortcuts.

Pandas

This week's lab involves the use of the Python module pandas. Before getting started, you should watch the brief 10-minute tour of pandas video. You don't need to understand everything in the video to do the lab, but it should give you a good idea of what pandas is and what it can do. To display the video in your browser, click in the cell below and press Shift-Enter.



In [ ]:

    
from IPython.display import VimeoVideo
VimeoVideo(59324550, width=800, height=600)

Pandas basics

In this lab, we're going to analyse registered birth names, as recorded in the United States between 1880 and 2010. Let's start by importing the packages that we'll need and configuring some charting options:



In [ ]:

    
# Import the pandas module
import pandas as pd

# Render charts in the browser, rather than in a pop-up
%matplotlib inline

Loading CSV data

Now that we've imported the pandas module, we can use it to load data from a CSV file, so that we can explore and manipulate the data using Python, right here in the browser.

To get started, we need to enter the location of our CSV file.

Note: If you have extracted the files for this lab in the same order they were archived in, then you shouldn't need to update the location of the CSV file.

If you have changed the location, don't worry - just update it below.

Enter the location of the CSV file in the next cell:



In [ ]:

    
path_to_csv_file = 'data/baby_names.csv'

Loading CSV files is easy in pandas - all we need to do is pass the name of the file to the read_csv method. Before we get started though, we need to think about how we're going to use the data in the file.

If you open the file manually (e.g. using speadsheet software), you'll see that the name of the first column is "year" and that the names of the other columns correspond to male and female names. The year column is important because it indexes the names, i.e. it provides a structure by which we can query the data. For example, we can tell how many people were named "John (M)" in a particular year by looking up the row corresponding to that year under the "John (M)" column.

If we import a CSV file using the read_csv method, we can specify the index column to use by passing the optional argument index_col.

Note: We don't have to pass the index_col argument if we don't want to, but having an index makes things much simpler later on.

Let's load the CSV data now by calling the read_csv method and setting the optional index_col argument to 'year', so that the year column is used to index the data:



In [ ]:

    
df = pd.read_csv(path_to_csv_file, index_col='year')

Data frames

In pandas, data is represented using the data frame object. A data frame is very similar to a spreadsheet, in the sense that it has a tabular layout, consisting of rows and columns. However, data frames are in many ways much more powerful tools than spreadsheets because they allow us to directly manipulate rows, columns and individual data cells using Python.

Note: When we loaded the CSV file above, we copied its contents into a pandas data frame, which we will refer to using the name df (this is by convention).

The amount of data in the CSV file is large! We can check the number of rows and columns using the shape attribute of the data frame, like this:

Note: shape is an attribute of the data frame not a method, so we don't need to append brackets (e.g. df.shape()) as we would normally.



In [ ]:

    
df.shape

You should have 131 rows and 98148 columns in your data frame - that's a lot to visualise in one go! Instead, why don't we just look a the first few rows? We can do this in pandas using the head method of the data frame, which works in a similar way to the Linux command of the same name. To see the first few rows of the data frame, run the cell below:



In [ ]:

    
df.head()

We can also check out the last few rows using the tail method, like this:



In [ ]:

    
df.tail()

We can get a list of the years represented in the data frame using the index attribute of the data frame (recall that we used the year column in the CSV file as our index), like this:



In [ ]:

    
df.index

Similarly, we can get a list of the column names using the columns attribute, like this:



In [ ]:

    
df.columns

We can access specific rows of the data using the loc indexer of the data frame. For instance, to select the data corresponding to the year 2001, we just need to index into the corresponding row, like this:



In [ ]:

    
df.loc[2001]

We can select a specific cell by specifying both a row and a column to the indexer. For instance, to select the number of female babies named Mary in 2001, we can write:



In [ ]:

    
df.loc[2001, 'Mary (F)']

Alternatively, to select the data for Mary for all years, we can pass the shorthand colon symbol ':', like in the cell below. The colon symbol ':' is just shorthand for all the rows.

Note: If we want to select a whole row, then the indexer requires just the row index (e.g. the year). However, If we want to select a column (e.g. 'Mary (F)'), then the indexer requires that we pass both the rows and the column.



In [ ]:

    
df.loc[:, 'Mary (F)']

We've now covered the basics of pandas:

How to load CSV files into data frames.
How to peak at the start and end rows of a data frame.
How to get the index row names.
How to get the column names.
How to look at a specific row, a specific column or a specific cell in a data frame.

Next, let's explore the data.

Exploratory data analysis

Exploring a specific name

Let's start by exploring a specific girl's name: Mary. In our data frame, the corresponding column is named 'Mary (F)'. To access the data in the column, all we need to do is pass the name of the column to the loc indexer of the data frame, just as we did earlier, like this:



In [ ]:

    
df.loc[:, 'Mary (F)']

We can examine how the popularity of the name Mary has changed over time by calling the plot method after we index into the data frame.

Note: Under the hood, the plot method of the data frame is actually calling the plot method in matplotlib. Conveniently, we don't need to import this module separately - it just works.

To generate a plot of the data above, all we need to do is append .plot() to the previous statement, like this:



In [ ]:

    
df.loc[:, 'Mary (F)'].plot()

Note: pandas automatically uses the index of the data frame as the x-axis and the column(s) we've specified as the series to plot.

The Era of Mary in the United States appears to have taken place between around 1910 and 1970. Let's select this data, so we can look at in more detail. To do this, we must specify an exact year range (as opposed to the shorthand all the rows colon operator ':') when we index into the data frame, like this:



In [ ]:

    
df.loc[1910:1970, 'Mary (F)']

To save having to write out the complicated indexing notation every time, let's create a new Python variable called mary and assign the data we want to examine to it:



In [ ]:

    
mary = df.loc[1910:1970, 'Mary (F)']

Now, we can plot the data by calling the plot method on the mary variable, like this:



In [ ]:

    
ax = mary.plot()
ax.set(
    title='The Era of Mary',              # Set the chart title
    xlabel='Year',                        # Set the X axis label
    ylabel='Number of registered births'  # Set the Y axis label
);

Next, let's compute some summary statistics about the Era of Mary. The pandas data frame object supports the direct computation of a number of statistical measures using instance methods such as mean, median, min, max and std. For instance, to compute the average number of girls named Mary in the data frame, we can write:



In [ ]:

    
mary.mean()

So, between 1910 and 1970, there were approximately 55,600 girls named Mary each year, on average.

At this point, you might be asking yourself whether you'll have to write out a line of code to compute each kind of summary statistic you want to compute, but fear not: pandas has a solution. To compute a complete statistical description of a data frame, all we have to do is call the describe method, like this:



In [ ]:

    
mary.describe()

Immediately, we have a lot of new information:

The number of data points we're looking at: count.
The arithmetic mean of the data: mean.
The standard deviation of the data: std.
The minimum value of the data (i.e. the 0$^{th}$ percentile): min.
The first, second (i.e. median) and third quartiles: 25%, 50%, 75%.
The maximum value of the data (i.e. the 100$^{th}$ percentile): max.

So, with just a few quick Python commands, we can say quite a few things about the name Mary:

On average, in the United States, there were approximately 55,600 female babies named Mary every year from 1910 until 1970.
At least 19,000 girls were called Mary each year.
The highest year saw almost 74,000 girls named Mary.

Summary statistics are powerful and pandas makes them simple! If you're done with this section, try selecting another name and investigate its popularity over time. This should help you to familiarise yourself with pandas' indexing notation.

Exploring a specific year

We know how to explore a specific name, how about a specific year? What was the most popular baby name in 1995, for instance? Again, using pandas, this is easy to do. Let's start by indexing into the data frame and assigning the data we want to explore to its own variable:



In [ ]:

    
babies_95 = df.loc[1995]

To find the most popular name, we can use the idxmax method of the data frame, which returns the column with the largest value:



In [ ]:

    
babies_95.idxmax()

Exploring in more detail

We've now covered how to explore data by row (e.g. by year) and by column (e.g. by name). What other information can we extract? How about we take a look at the popularity of the most popular names between 1880 and 2010? This sounds a lot harder than it is. When we looked at names in 1995, we first indexed into the data frame, so that we called idxmax on the data relating to 1995 only. This time, we'll run the same method on the whole data frame.

Note: By default, calls to idxmax are applied across the rows of the data frame. The axis argument controls whether the method is applied to the rows of the data frame (axis='rows') or the columns of the data frame (axis='columns').

As we want to find the most popular name for each year, we will need to select the columns with the largest value for each row (i.e. axis='columns'). Apart from this small change, the call is identical to the one above, where we found the most popular name in 1995.



In [ ]:

    
df.idxmax(axis='columns')

We've now figured out which names were most popular in each year. How about a condensed list? First let's assign the most popular names to a variable to save typing:



In [ ]:

    
most_popular_names = df.idxmax(axis='columns')

Next, we can use the unique method to condense the list of names per year into a single set:



In [ ]:

    
top_names = most_popular_names.unique()

top_names

Incredibly, there are just nine names in the top spot since 1880! Let's use pandas' built in plotting to take a look at how these vary over time:



In [ ]:

    
ax = df.loc[:, top_names].plot(figsize=(12, 7.5))  # Set the size of the chart
ax.set(
    title='Most popular baby names in the US, 1880-2010',
    xlabel='Year',
    ylabel='Number of registered births'
)
ax.legend(loc='upper left');  # Add a legend to the chart

There are some interesting features here, most notably perhaps the notion that names have a "lifespan" (i.e. they grow in popularity, peak, and then decline) and that modern names have increased in diversity (the most popular names now aren't anywhere near as widely used as those in the middle of the 1900s, despite the US population being much larger). As you become more familiar with pandas, finding insights like these becomes much easier.

Poisoned names

Names are often chosen based on the popularity, or otherwise, of celebrities. This can sometimes make for some amusing trends in the data, such as the decline in the popularity of "Hillary" with the rise to fame and election of Bill Clinton in the early 1990s:



In [ ]:

    
df.loc[:, 'Hillary (F)'].plot();

For more information, check out this blog post.