This week's lab provides a practical introduction to exploratory data analysis. At the end of the lab, you should be able to use pandas to:
Before we get started, there are a few things to cover.
To complete the lab, you will need some previous experience of reading and writing Python code. If you don't have any previous experience of Python, don't worry! This course only requires that you know the basics and these are easy to pick up in a few hours.
The best place to start learning is Codecademy. This is a free, interactive, online course and is designed for beginners. You don't have to complete the whole thing - if you stop after Section 8, you will have covered just about everything you need to know for this class.
Jupyter Notebook is the browser-based environment you're reading this text in. Using the notebook is relatively painless, but if you do get stuck, there's plenty of help available in the official Jupyter Notebook support documentation. Be sure to have a quick glance at What is the Jupyter Notebook?, Notebook Basics and Running Code topics. Each week, labs will be distributed in Jupyter Notebook format, so it's best to get familiar with it as quickly as possible.
Before we start, let's take a quick tour of the notebook interface. To start the tour, click the Help menu above and select User Interface Tour. When you're finished, take a look at Jupyter's keyboard shortcuts. You can find these by clicking the Help menu above and selecting Keyboard Shortcuts.
This week's lab involves the use of the Python module pandas. Before getting started, you should watch the brief 10-minute tour of pandas video. You don't need to understand everything in the video to do the lab, but it should give you a good idea of what pandas is and what it can do. To display the video in your browser, click in the cell below and press
In [ ]:from IPython.display import VimeoVideo
VimeoVideo(59324550, width=800, height=600)
In [ ]:# Import the pandas module
import pandas as pd

# Render charts in the browser, rather than in a pop-up
%matplotlib inline
Now that we've imported the pandas module, we can use it to load data from a CSV file, so that we can explore and manipulate the data using Python, right here in the browser.
To get started, we need to enter the location of our CSV file.
Note: If you have extracted the files for this lab in the same order they were archived in, then you shouldn't need to update the location of the CSV file.
If you have changed the location, don't worry - just update it below.
Enter the location of the CSV file in the next cell:
In [ ]:path_to_csv_file = 'data/baby_names.csv'
Loading CSV files is easy in pandas - all we need to do is pass the name of the file to the read_csv function. Before we get started though, we need to think about how we're going to use the data in the file.
If you open the file manually (e.g. using spreadsheet software), you'll see that the name of the first column is "year" and that the names of the other columns correspond to male and female names. The year column is important because it indexes the names, i.e. it provides a structure by which we can query the data. For example, we can tell how many people were named "John (M)" in a particular year by looking up the row corresponding to that year under the "John (M)" column.
If we import a CSV file using the read_csv function, we can specify the index column to use by passing the optional argument index_col.
Note: We don't have to pass the index_col argument if we don't want to, but having an index makes things much simpler later on.
Let's load the CSV data now by calling the read_csv function and setting the optional index_col argument to 'year', so that the year column is used to index the data:
In [ ]:df = pd.read_csv(path_to_csv_file, index_col='year')
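To see what index_col does in isolation, here is a minimal sketch that reads a tiny CSV held in memory rather than the lab's file. The counts and the toy_df name are made up for illustration; only the column naming style follows the lab data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: made-up counts, not the real data
csv_text = """year,John (M),Mary (F)
2000,20,15
2001,18,14
2002,17,13
"""

# index_col='year' turns the year column into the row index
# instead of leaving it as an ordinary data column
toy_df = pd.read_csv(io.StringIO(csv_text), index_col='year')

index_name = toy_df.index.name        # name of the index column
column_names = list(toy_df.columns)   # remaining data columns
print(toy_df)
```

Because year has become the index, it no longer appears among the data columns.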
In pandas, data is represented using the data frame object. A data frame is very similar to a spreadsheet, in the sense that it has a tabular layout, consisting of rows and columns. However, data frames are in many ways much more powerful tools than spreadsheets because they allow us to directly manipulate rows, columns and individual data cells using Python.
Note: When we loaded the CSV file above, we copied its contents into a pandas data frame, which we will refer to using the name df (this is by convention).
The amount of data in the CSV file is large! We can check the number of rows and columns using the shape attribute of the data frame, like this:
Note: shape is an attribute of the data frame, not a method, so we don't need to append brackets (e.g. df.shape()) as we would for a method call.
In [ ]:df.shape
You should have 131 rows and 98148 columns in your data frame - that's a lot to visualise in one go! Instead, why don't we just look at the first few rows? We can do this in pandas using the head method of the data frame, which works in a similar way to the Linux command of the same name. To see the first few rows of the data frame, run the cell below:
In [ ]:df.head()
In [ ]:df.tail()
In [ ]:df.index
In [ ]:df.columns
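The four inspection tools used above (shape, head, tail, and the index and columns attributes) can be tried on a small frame where the results are easy to check by eye. This is only a sketch: toy_df and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical ten-year frame with made-up counts
toy_df = pd.DataFrame(
    {
        'John (M)': list(range(10, 20)),
        'Mary (F)': list(range(20, 30)),
    },
    index=pd.Index(list(range(2000, 2010)), name='year'),
)

n_rows, n_cols = toy_df.shape         # attribute: a (rows, columns) tuple
first_rows = toy_df.head()            # first five rows by default
last_rows = toy_df.tail(3)            # an explicit count is also accepted
row_labels = list(toy_df.index)       # the years
column_labels = list(toy_df.columns)  # the names
```

Both head and tail return new data frames, so their results can be indexed or plotted like any other frame.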
We can access specific rows of the data using the loc indexer of the data frame. For instance, to select the data corresponding to the year 2001, we just need to index into the corresponding row, like this:
In [ ]:df.loc[2001]
We can select a specific cell by specifying both a row and a column to the indexer. For instance, to select the number of female babies named Mary in 2001, we can write:
In [ ]:df.loc[2001, 'Mary (F)']
Alternatively, to select the data for Mary for all years, we can pass the shorthand colon symbol ':', like in the cell below. The colon symbol ':' is just shorthand for all the rows.
Note: If we want to select a whole row, then the indexer requires just the row index (e.g. the year). However, if we want to select a column (e.g. 'Mary (F)'), then the indexer requires that we pass both the rows and the column.
In [ ]:df.loc[:, 'Mary (F)']
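The three forms of loc indexing described above (a whole row, a single cell, a whole column) are sketched side by side below on a small frame. The toy_df name and its counts are hypothetical.

```python
import pandas as pd

# Hypothetical frame with made-up counts, indexed by year
toy_df = pd.DataFrame(
    {'John (M)': [20, 18, 17], 'Mary (F)': [15, 14, 13]},
    index=pd.Index([2000, 2001, 2002], name='year'),
)

row_2001 = toy_df.loc[2001]              # whole row: a Series keyed by column name
cell = toy_df.loc[2001, 'Mary (F)']      # single cell: row label, then column label
mary_column = toy_df.loc[:, 'Mary (F)']  # whole column: ':' stands for all rows
```

A row or column selection comes back as a pandas Series, while a single cell comes back as a plain value.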
We've now covered the basics of pandas:
Next, let's explore the data.
Let's start by exploring a specific girl's name: Mary. In our data frame, the corresponding column is named 'Mary (F)'. To access the data in the column, all we need to do is pass the name of the column to the loc indexer of the data frame, just as we did earlier, like this:
In [ ]:df.loc[:, 'Mary (F)']
We can examine how the popularity of the name Mary has changed over time by calling the plot method after we index into the data frame.
Note: Under the hood, the plot method of the data frame is actually calling the plot method in matplotlib. Conveniently, we don't need to import this module separately - it just works.
To generate a plot of the data above, all we need to do is append .plot() to the previous statement, like this:
In [ ]:df.loc[:, 'Mary (F)'].plot()
Note: pandas automatically uses the index of the data frame as the x-axis and the column(s) we've specified as the series to plot.
The Era of Mary in the United States appears to have taken place between around 1910 and 1970. Let's select this data, so we can look at it in more detail. To do this, we must specify an exact year range (as opposed to the shorthand all the rows colon operator ':') when we index into the data frame, like this:
In [ ]:df.loc[1910:1970, 'Mary (F)']
To save having to write out the complicated indexing notation every time, let's create a new Python variable called mary and assign the data we want to examine to it:
In [ ]:mary = df.loc[1910:1970, 'Mary (F)']
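One detail of the year-range selection is worth checking on a small example: unlike ordinary Python slices, label-based slices with loc include both endpoints, so 1910:1970 covers 1910 and 1970 themselves. The sketch below uses a hypothetical toy_df with made-up counts.

```python
import pandas as pd

# Hypothetical frame with made-up counts for five consecutive years
toy_df = pd.DataFrame(
    {'Mary (F)': [5, 6, 7, 8, 9]},
    index=pd.Index([1908, 1909, 1910, 1911, 1912], name='year'),
)

# Label-based slice: both 1909 and 1911 are part of the result
era = toy_df.loc[1909:1911, 'Mary (F)']
selected_years = list(era.index)
```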
Now, we can plot the data by calling the plot method on the mary variable, like this:
In [ ]:ax = mary.plot()
ax.set(
    title='The Era of Mary',              # Set the chart title
    xlabel='Year',                        # Set the X axis label
    ylabel='Number of registered births'  # Set the Y axis label
);
Next, let's compute some summary statistics about the Era of Mary. The pandas data frame object supports the direct computation of a number of statistical measures using instance methods such as std. For instance, to compute the average number of girls named Mary in the data frame, we can write:
In [ ]:mary.mean()
So, between 1910 and 1970, there were approximately 55,600 girls named Mary each year, on average.
At this point, you might be asking yourself whether you'll have to write out a line of code to compute each kind of summary statistic you want to compute, but fear not: pandas has a solution. To compute a complete statistical description of a data frame, all we have to do is call the describe method, like this:
In [ ]:mary.describe()
Immediately, we have a lot of new information:
So, with just a few quick Python commands, we can say quite a few things about the name Mary:
Summary statistics are powerful and pandas makes them simple! If you're done with this section, try selecting another name and investigate its popularity over time. This should help you to familiarise yourself with pandas' indexing notation.
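To make the output of describe easier to read, it can help to run it on a handful of values whose statistics are easy to work out by hand. The sketch below uses a hypothetical toy_mary series of four made-up counts.

```python
import pandas as pd

# Hypothetical series of made-up yearly counts
toy_mary = pd.Series(
    [10, 20, 30, 40],
    index=pd.Index([1910, 1911, 1912, 1913], name='year'),
    name='Mary (F)',
)

mean_value = toy_mary.mean()       # arithmetic mean of the four values
summary = toy_mary.describe()      # count, mean, std, min, quartiles and max
statistics = list(summary.index)   # the names of the statistics reported
median_value = summary['50%']      # the 50% row is the median
```

The result of describe is itself a Series, so individual statistics can be picked out by label, as with the median here.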
We know how to explore a specific name, how about a specific year? What was the most popular baby name in 1995, for instance? Again, using pandas, this is easy to do. Let's start by indexing into the data frame and assigning the data we want to explore to its own variable:
In [ ]:babies_95 = df.loc[1995]
In [ ]:babies_95.idxmax()
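Note that idxmax returns the label of the largest value, not the value itself; for a row of the data frame, that label is a column name. A minimal sketch, using a hypothetical toy_df with made-up counts:

```python
import pandas as pd

# Hypothetical frame with made-up counts, indexed by year
toy_df = pd.DataFrame(
    {'John (M)': [20, 18, 17], 'Mary (F)': [15, 14, 13]},
    index=pd.Index([2000, 2001, 2002], name='year'),
)

babies_2001 = toy_df.loc[2001]     # one row: a Series keyed by name
top_name = babies_2001.idxmax()    # label of the largest value in that row
top_count = babies_2001.max()      # the largest value itself
```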
We've now covered how to explore data by row (e.g. by year) and by column (e.g. by name). What other information can we extract? How about we take a look at the popularity of the most popular names between 1880 and 2010? This sounds a lot harder than it is. When we looked at names in 1995, we first indexed into the data frame, so that we called idxmax on the data relating to 1995 only. This time, we'll run the same method on the whole data frame.
Note: By default, calls to idxmax are applied across the rows of the data frame. The axis argument controls whether the method is applied to the rows of the data frame (axis='rows') or the columns of the data frame (axis='columns').
As we want to find the most popular name for each year, we will need to select the columns with the largest value for each row (i.e. axis='columns'). Apart from this small change, the call is identical to the one above, where we found the most popular name in 1995.
In [ ]:df.idxmax(axis='columns')
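The effect of the axis argument is easiest to see by running idxmax both ways on the same small frame. In this sketch, toy_df and its counts are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with made-up counts, indexed by year
toy_df = pd.DataFrame(
    {'John (M)': [20, 22, 23], 'Mary (F)': [25, 21, 19]},
    index=pd.Index([2000, 2001, 2002], name='year'),
)

# Default: for each name (column), the year in which it peaked
peak_year_per_name = toy_df.idxmax()

# axis='columns': for each year (row), the name with the largest count
top_name_per_year = toy_df.idxmax(axis='columns')
```

The default call answers "when was each name most popular?", while axis='columns' answers "which name was most popular in each year?".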
We've now figured out which names were most popular in each year. How about a condensed list? First let's assign the most popular names to a variable to save typing:
In [ ]:most_popular_names = df.idxmax(axis='columns')
In [ ]:top_names = most_popular_names.unique()
top_names
Incredibly, there are just nine names in the top spot since 1880! Let's use pandas' built-in plotting to take a look at how these vary over time:
In [ ]:ax = df.loc[:, top_names].plot(figsize=(12, 7.5))  # Set the size of the chart
ax.set(
    title='Most popular baby names in the US, 1880-2010',
    xlabel='Year',
    ylabel='Number of registered births'
)
ax.legend(loc='upper left');  # Add a legend to the chart
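The condensing step can be traced on a small frame without any plotting: unique drops repeated names while keeping the order in which they first appear, and the result can be passed straight to loc to keep only those columns. The toy_df frame, its counts and the 'Anna (F)' column are hypothetical.

```python
import pandas as pd

# Hypothetical frame with made-up counts; 'Anna (F)' never reaches the top spot
toy_df = pd.DataFrame(
    {
        'John (M)': [30, 24, 18],
        'Mary (F)': [25, 26, 20],
        'Anna (F)': [5, 6, 7],
    },
    index=pd.Index([2000, 2001, 2002], name='year'),
)

most_popular = toy_df.idxmax(axis='columns')  # top name for each year
top = most_popular.unique()                   # each top name once, in order of first appearance
top_only = toy_df.loc[:, top]                 # keep only the columns for those names
```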
There are some interesting features here, most notably perhaps the notion that names have a "lifespan" (i.e. they grow in popularity, peak, and then decline) and that modern names have increased in diversity (the most popular names now aren't anywhere near as widely used as those in the middle of the 1900s, despite the US population being much larger). As you become more familiar with pandas, finding insights like these becomes much easier.
Names are often chosen based on the popularity, or otherwise, of celebrities. This can sometimes make for some amusing trends in the data, such as the decline in the popularity of "Hillary" with the rise to fame and election of Bill Clinton in the early 1990s:
In [ ]:df.loc[:, 'Hillary (F)'].plot();