In [ ]:
%matplotlib inline
Today we're going to use a tabular dataset to get hands-on experience with two other core Python data science libraries: pandas and matplotlib. We're going to use these libraries to explore, do statistics on, and visualize different parts of our dataset to get a handle on what's there.
The tabular format is fundamental to data science. We got a taste of tabular data last week when we loaded CSV and JSON data, but let's take a closer look now.
Most people are familiar with tabular data from working with spreadsheet software like Excel. In a table, "records" or "samples" are stored in rows, and "features" or "attributes" are stored in columns. For example, in the good_movies.csv dataset that we took a look at last week, there were 7 columns representing the fields (title, year, oscar_nominations, short_summary, star_1, star_2, and star_3) and 4 rows representing the movies La La Land, Moonlight, Argo, and Gone Girl.
In other words, the rows are individual movies, and the columns represent pieces of information that we know about each movie.
Pandas is the Swiss Army Knife of data analysis in Python. Built on top of NumPy, Pandas wraps arrays with additional functions and metadata to create data frames, a paradigm for storing tabular data borrowed from R.
Let's use Pandas to read and explore the good_movies.csv dataset again:
In [ ]:
import pandas as pd
good_movies = pd.read_csv('data/good_movies.csv')
In [ ]:
good_movies.head()
Pandas automatically gives each row an integer index that guarantees the row can be uniquely identified, but otherwise, the data is exactly the same. The good_movies.head() method prints out a few rows from the "head" (top) of the dataset. Since there were only 4 rows in this dataset, head prints them all.
There are lots of other ways of reading in data as well (we won't cover these, though a rough sketch of the call pattern appears just below the list):
read_json
read_excel
read_sql
read_html
read_clipboard
read_pickle
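Each of these follows the same basic call pattern as read_csv. A quick sketch (the file names here are hypothetical placeholders, not files included with this lesson, so the calls are shown as comments):
In [ ]:
# hypothetical files, shown only to illustrate the call pattern
# movies_from_json = pd.read_json('data/good_movies.json')
# movies_from_excel = pd.read_excel('data/good_movies.xlsx')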
How exactly does Pandas hold the data?
In [ ]:
type(good_movies)
One of the fundamental data structures in Pandas is the DataFrame, which stores 2-dimensional (i.e. tabular/matrix) data. The Pandas DataFrame is basically an ordered collection of heterogeneous Series of 1-dimensional data.
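To make that relationship concrete, here's a minimal sketch (using a couple of illustrative values rather than the actual good_movies.csv contents) that builds a DataFrame by hand out of two Series:
In [ ]:
# each Series is one column; the DataFrame lines them up by their shared index
years = pd.Series([2016, 2016], index=['La La Land', 'Moonlight'])
nominations = pd.Series([14, 8], index=['La La Land', 'Moonlight'])
pd.DataFrame({'year': years, 'oscar_nominations': nominations})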
There are a few core methods for understanding DataFrames that will be important to know. We already saw head() for printing the first several rows of a DataFrame. Some others are:
In [ ]:
good_movies.shape # dataset has 4 rows and 7 columns
In [ ]:
good_movies.info() # tell me some basic info about what's in each column
In [ ]:
good_movies.describe() # give some summary statistics for the numeric columns
Indexing Pandas DataFrames is a bit different than indexing NumPy arrays. In particular, you can index by the names of rows and columns with loc, or by their integer positions with iloc. For example, if we wanted to see the summary of Gone Girl, we could use:
In [ ]:
gone_girl_summary = good_movies.loc[3, 'short_summary']
print(gone_girl_summary)
In [ ]:
gone_girl_summary = good_movies.iloc[3, 3]
print(gone_girl_summary)
Question: Why did the first value in the index argument not change?
Or, instead, we can retrieve all of the summaries:
In [ ]:
summaries = good_movies.loc[:, 'short_summary']
print(summaries)
In [ ]:
summaries = good_movies.iloc[:, 3] # short_summary is at column position 3 (counting from 0)
print(summaries)
Speaking of columns, how are these columns from our dataset being stored after we extract them from our original DataFrame?
In [ ]:
type(summaries)
The Series object is another fundamental data type in Pandas. Series objects store 1-dimensional (i.e. vector) data, like a single column of a DataFrame.
For dealing with tabular data, Series and DataFrames are much more powerful than pure NumPy arrays. For example, we're not forced to index rows by integers; we can specify a column (as long as it contains unique elements) to use as an index:
In [ ]:
good_movies = pd.read_csv('data/good_movies.csv', index_col='title')
good_movies.head()
In [ ]:
good_movies.loc['Gone Girl', 'short_summary']
In [ ]:
summaries = good_movies.loc[:, 'short_summary']
summaries.loc['Gone Girl']
How is data stored in each of these series?
In [ ]:
type(summaries.values)
It's also easy to filter rows on certain conditions:
In [ ]:
good_movies[good_movies['oscar_nominations'] > 5]
It's also very convenient to do arithmetic and summary statistics on the data:
In [ ]:
good_movies['oscar_nominations'].count()
In [ ]:
good_movies['oscar_nominations'].sum()
In [ ]:
good_movies['oscar_nominations'].mean()
In [ ]:
good_movies['oscar_nominations'].median()
In [ ]:
good_movies['oscar_nominations'].std()
Your Turn
For the rest of the class, we're going to dive deep into the passengers of the RMS Titanic. The file data/titanic.csv contains data on each of the passengers.
Questions to answer:
Read data/titanic.csv into a Pandas DataFrame. What is the best index for rows?
In [ ]:
titanic = pd.read_csv('data/titanic.csv', index_col=None)
In [ ]:
When we get a new tabular dataset that we don't know anything about, there is a fairly standard process of exploration that we can do to get a better handle on it: first get the 10,000-foot view of the dataset (its shape, its column types, and a few example rows), then dig into each column individually.
To illustrate the process, we'll use a mystery dataset that we don't know anything about:
In [ ]:
mystery_data = pd.read_csv('data/mystery.csv', index_col=None)
In [ ]:
mystery_data.info()
And we can use head to look at some actual values in the columns:
In [ ]:
mystery_data.head()
For each column, we should then ask: what values does it take, what is a typical value, and how spread out are the values?
Again, Pandas provides a convenient way of looking at the usual univariate statistics:
In [ ]:
mystery_data.describe()
In [ ]:
for col in mystery_data.columns:
    print('median of col {}: {}'.format(col, mystery_data.loc[:, col].median()))
What does it mean for a column to have a median and mean that are different?
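One common cause is a skewed distribution or a few extreme values: the mean gets pulled toward the outliers, while the median stays with the bulk of the data. A tiny sketch with made-up numbers:
In [ ]:
# ten made-up values: nine modest numbers and one extreme outlier
skewed = pd.Series([30, 32, 35, 36, 38, 40, 41, 45, 50, 1000])
print('mean: {}'.format(skewed.mean()))     # pulled up toward the outlier
print('median: {}'.format(skewed.median())) # stays with the bulk of the values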
Sometimes, you'll see two slightly different definitions for the "spread" of a column of data - the standard deviation:
$\sigma_X = \sqrt{ \frac{1}{n}\sum_{i=1}^n (x_i - \mu_X)^2 }$
and the sample deviation:
$s_X = \sqrt{ \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu_X)^2 }$
(where $\mu_X$ is the mean of the column $X$ and $x_i$ is the $i$th value of $X$). What's the difference between dividing by $n$ and by $n-1$?
The standard deviation is used when $X$ contains the entire population that you're estimating the width of. So, for example, if you want to know the spread of test scores in a class and you have the score of each and every student, use the standard deviation.
The sample deviation is used to correct for bias when you're trying to estimate the width of a population that you only have a sample of data points from. For example, if you are trying to extrapolate the spread of test scores on the SAT by looking at how 1,000 students performed, you should use the sample deviation.
In practice, however, standard deviation and sample deviation are very similar if the number of samples is large.
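Pandas' std method computes the sample deviation by default, but its ddof ("delta degrees of freedom") argument lets you switch between the two conventions. A quick sketch using column A of the mystery dataset:
In [ ]:
# ddof=1 (the Pandas default) divides by n-1, i.e. the sample deviation
print(mystery_data['A'].std(ddof=1))
# ddof=0 divides by n, i.e. the population standard deviation
print(mystery_data['A'].std(ddof=0))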
Summary statistics can only take us so far in understanding the data in a new dataset. Visualizations like histograms are another great way of getting a high-level overview for what the data looks like.
Plotting functionality is built into Pandas. In particular, the built-in plot functions are a thin wrapper around a very powerful data visualization library called matplotlib. We won't directly use matplotlib in this course; we'll do all of our plotting by calling the Pandas wrapper functions. The pure matplotlib functions are very similar, however.
In [ ]:
# {dataframe_name}.{column_name}.plot.{plot_type}(plotting options)
mystery_data.A.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.A.mean()))
print('median: {}'.format(mystery_data.A.median()))
print('sample deviation: {}'.format(mystery_data.A.std()))
In [ ]:
mystery_data.B.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.B.mean()))
print('median: {}'.format(mystery_data.B.median()))
print('sample deviation: {}'.format(mystery_data.B.std()))
In [ ]:
mystery_data.C.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.C.mean()))
print('median: {}'.format(mystery_data.C.median()))
print('sample deviation: {}'.format(mystery_data.C.std()))
Your turn
Try following the exploratory data analysis steps on the Titanic dataset (i.e. get the 10,000-foot view, then get some column-by-column insights). What interesting things do you find?
In [ ]:
After understanding what is contained in each individual column, it's important to understand how each column is related to the others. The related ideas of correlation and covariance are common ways of understanding pairwise dependencies between columns.
The covariance measures, unsurprisingly, the extent to which two columns co-vary. In other words, how often do values with a large distance from the mean in one column correspond to a large distance from the mean in another column?
The covariance between columns $X$ and $Y$ is defined as:
$ cov(X, Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X)(y_i - \mu_Y) $
Think about what it means for covariance between 2 columns to be large vs small.
One problem with covariance is that comparing covariances between different pairs of columns can be tricky - columns with a naturally larger spread will tend to have larger covariances. The correlation between $X$ and $Y$ is a similar concept to covariance, but one that corrects for the difference in spreads:
$ corr(X, Y) = \frac{ cov(X, Y) }{ \sigma_X \sigma_Y} $
The correlation is always a number between -1 and 1, making it easy to interpret.
In [ ]:
mystery_data.corr()
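To connect the formulas to the built-in method, here's a small sketch that computes the covariance and correlation between columns A and B by hand and checks the correlation against Pandas:
In [ ]:
x = mystery_data['A']
y = mystery_data['B']
# covariance: average product of the two columns' deviations from their means
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
# correlation: the covariance rescaled by both standard deviations
corr_xy = cov_xy / (x.std(ddof=0) * y.std(ddof=0))
print('covariance: {}'.format(cov_xy))
print('correlation: {}'.format(corr_xy))
print('pandas correlation: {}'.format(x.corr(y)))  # should (very nearly) agree with corr_xy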
Your turn
Calculate the correlation between pairs of columns in the Titanic dataset. Which columns correlate most strongly with the passengers who survived? What other relatively strong correlations do you find?
In [ ]:
In [ ]:
titanic.head()
Missing values are represented by NaN:
In [ ]:
titanic.loc[0, 'body']
In [ ]:
type(titanic.loc[0, 'body'])
And "nothingness" carries forward if you do mathematical operations on missing values:
In [ ]:
titanic.loc[0, 'body'] + 5
...which matches intuition.
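Before deciding what to do about missing values, it's often useful to see how many each column has. A minimal sketch:
In [ ]:
# isnull() marks each missing cell as True; summing counts the missing values per column
titanic.isnull().sum()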
One way of dealing with missing values is to just ignore them!
In [ ]:
titanic_none_missing = titanic.dropna()
titanic_none_missing.info()
By default, dropna drops a row from the dataset if any value is missing. This turned out to be a problem, since it looks like there weren't any rows that didn't have missing values!
Instead, maybe we just want to drop the rows where body is missing:
In [ ]:
# the `subset` argument allows us to only consider certain columns
titanic_body_not_missing = titanic.dropna(subset=['body'])
titanic_body_not_missing.info()
By doing this, we discover that none of the bodies recovered had been assigned to boats.
Sometimes, though, just ignoring missing data is a bit too aggressive. Instead, we might want to fill in the missing values with other data:
In [ ]:
# `inplace` argument lets us change the data without creating a new object
titanic['body'].fillna('Person either survived or their body was not recovered.', inplace=True)
titanic.head()
Your turn
Using the Pandas notnull method, make a histogram of the values in the age column that aren't missing. Then, try filling the missing values in the age column with the column's average and median. Plot a histogram of each filled column - how do the histograms differ?
In [ ]:
The bottom line is - no matter how you choose to deal with your missing data, there will be trade-offs. Proceed with caution.
The idea of method chaining is prominently featured in Pandas. Method chaining allows us to perform several operations on a dataset in one fell swoop, making it much cleaner to code up the multi-step analyses that we might want to perform. For example, instead of writing:
In [ ]:
titanic_temp = titanic.dropna(subset=['body'])
titanic_temp.describe()
...we can write simply:
In [ ]:
titanic.dropna(subset=['body']).describe()
The result is the same, but we avoided using ugly placeholder variables and extra lines of code.
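Chains can string together as many steps as we need. As a rough sketch (assuming the age column in the Titanic data), we could drop the rows with a missing age, sort by age, and peek at the youngest passengers in a single expression:
In [ ]:
# drop rows missing an age, sort ascending by age, and show the first few rows
titanic.dropna(subset=['age']).sort_values('age').head()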
Often, instead of looking at statistics of an entire dataset, we want to look at statistics of groups within the dataset. For example, we can easily calculate the survival rate of all passengers on the Titanic (recall survived = 0 if the person did not survive and 1 if they survived),
In [ ]:
titanic['survived'].mean()
...but it might also be interesting to see the survival rate broken out by gender. If you've used SQL before, this type of command might look familiar:
In [ ]:
titanic.groupby('sex')['survived'].mean()
What happened here? The groupby method groups all of the data in our dataset into groups based on which gender each person was. We then select only the survived column and calculate the mean.
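The same pattern works with other aggregations, and agg lets us request several at once. A small sketch showing the survival rate alongside the size of each group:
In [ ]:
# mean gives the survival rate; count gives the number of passengers in each group
titanic.groupby('sex')['survived'].agg(['mean', 'count'])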
Your turn
Use grouping, selecting, and aggregating to answer the following questions:
In [ ]: