In [ ]:
%matplotlib inline

Exploring and understanding data

Today we're going to use a tabular dataset to get hands-on experience with two other core Python data science libraries: pandas and matplotlib. We're going to use these libraries to explore, do statistics on, and visualize different parts of our dataset to get a handle on what's there.

What is tabular data?

The tabular format is fundamental to data science. We got a taste of tabular data last week when we loaded CSV and JSON data, but let's take a closer look now.

Most people are familiar with tabular data from working with spreadsheet software like Excel. In a table, "records" or "samples" are stored in rows, and "features" or "attributes" are stored in columns. For example, in the good_movies.csv dataset that we took a look at last week, there were 7 columns representing the fields title, year, oscar_nominations, short_summary, star_1, star_2, and star_3, and 4 rows representing the movies La La Land, Moonlight, Argo, and Gone Girl.

In other words, the rows are individual movies, and the columns represent pieces of information that we know about each movie.

Pandas

Pandas is the Swiss Army Knife of data analysis in Python. Built on top of NumPy, Pandas wraps arrays with additional functions and metadata to create data frames, a paradigm for storing tabular data borrowed from R.

The Series and DataFrame

Let's use Pandas to read and explore the good_movies.csv dataset again:


In [ ]:
import pandas as pd

good_movies = pd.read_csv('data/good_movies.csv')

In [ ]:
good_movies.head()

Pandas automatically gives each row an integer index that guarantees the row can be uniquely identified, but otherwise, the data is exactly the same. The good_movies.head() method shows a few rows from the "head" (top) of the dataset (the first 5 by default). Since there are only 4 rows in this dataset, head shows them all.
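
A closely related method, tail(), shows rows from the bottom of the DataFrame instead; both accept an optional number of rows to display. A quick sketch:


In [ ]:
# show the last 2 rows of the dataset
good_movies.tail(2)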

There are lots of other ways of reading in data as well (we won't cover these):

  • read_json
  • read_excel
  • read_sql
  • read_html
  • read_clipboard
  • read_pickle

How exactly does Pandas hold the data?


In [ ]:
type(good_movies)

One of the fundamental data structures in Pandas is the DataFrame, which stores 2-dimensional (i.e. tabular/matrix) data. The Pandas DataFrame is basically an ordered collection of heterogeneous Series of 1-dimensional data.
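
Since a DataFrame is essentially a collection of named Series, you can also build one directly from a dictionary of Series (or plain lists). A minimal sketch using a couple of made-up rows for illustration:


In [ ]:
# a tiny, hand-built table (values are just for illustration)
tiny = pd.DataFrame({
    'title': pd.Series(['La La Land', 'Moonlight']),
    'oscar_nominations': pd.Series([14, 8]),
})
tiny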

There are a few core methods for inspecting DataFrames that will be important to know. We already saw head() for printing the first several rows of a DataFrame. Some others are:


In [ ]:
good_movies.shape    # dataset has 4 rows and 7 columns

In [ ]:
good_movies.info()    # tell me some basic info about what's in each column

In [ ]:
good_movies.describe()    # give some summary statistics for the numeric columns

Indexing Pandas DataFrames is a bit different from indexing NumPy arrays. In particular, you can index by the names of rows and columns with loc, or by their integer positions with iloc. For example, if we wanted to see the summary of Gone Girl, we could use:


In [ ]:
gone_girl_summary = good_movies.loc[3, 'short_summary']
print(gone_girl_summary)

In [ ]:
gone_girl_summary = good_movies.iloc[3, 3]
print(gone_girl_summary)

Question: Why did the first value in the index argument not change?

Or, instead, we can retrieve all of the summaries:


In [ ]:
summaries = good_movies.loc[:, 'short_summary']
print(summaries)

In [ ]:
summaries = good_movies.iloc[:, 3]    # short_summary is at integer position 3 (counting from 0)
print(summaries)

Speaking of columns, how are these columns from our dataset being stored after we extract them from our original DataFrame?


In [ ]:
type(summaries)

The Series object is another fundamental data type in Pandas. Series objects store 1-dimensional (i.e. vector) data, like a single column of a DataFrame.

For dealing with tabular data, Series and DataFrames are much more powerful than pure NumPy arrays. For example, we're not forced to index rows by integers; we can specify a column (as long as it contains unique elements) to use as an index:


In [ ]:
good_movies = pd.read_csv('data/good_movies.csv', index_col='title')
good_movies.head()

In [ ]:
good_movies.loc['Gone Girl', 'short_summary']

In [ ]:
summaries = good_movies.loc[:, 'short_summary']
summaries.loc['Gone Girl']

How is data stored in each of these series?


In [ ]:
type(summaries.values)

It's also easy to filter rows on certain conditions:


In [ ]:
good_movies[good_movies['oscar_nominations'] > 5]
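
Multiple conditions can be combined with & (and) and | (or), as long as each comparison is wrapped in parentheses. A quick sketch (which rows come back depends, of course, on the data):


In [ ]:
# movies released after 2013 with more than 5 Oscar nominations
good_movies[(good_movies['year'] > 2013) & (good_movies['oscar_nominations'] > 5)]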

It's also very convenient to do arithmetic and summary statistics on the data:


In [ ]:
good_movies['oscar_nominations'].count()

In [ ]:
good_movies['oscar_nominations'].sum()

In [ ]:
good_movies['oscar_nominations'].mean()

In [ ]:
good_movies['oscar_nominations'].median()

In [ ]:
good_movies['oscar_nominations'].std()
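
Another handy summary, especially for columns with repeated values, is value_counts(), which tallies how many times each distinct value appears. A quick sketch:


In [ ]:
good_movies['year'].value_counts()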

Your Turn

For the rest of the class, we're going to dive deep into the passengers of the RMS Titanic. The file data/titanic.csv contains the following data on each of the passengers:

  • pclass: Passenger class (1 = first; 2 = second; 3 = third)
  • survived: Survival (0 = no; 1 = yes)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of siblings/spouses aboard
  • parch: Number of parents/children aboard
  • ticket: Ticket number
  • fare: Passenger fare
  • cabin: Cabin number
  • embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Lifeboat (if survived)
  • body: Body number (if did not survive and body was recovered)

Questions to answer:

  • Read data/titanic.csv into a Pandas DataFrame. What is the best index for rows?
  • What were the average and median ages of the passengers?
  • What was the overall survival rate?
  • What was the price of the most expensive ticket? Which passengers held it?
  • How many passengers had no family on board?

In [ ]:
titanic = pd.read_csv('data/titanic.csv', index_col=None)

In [ ]:

The process of exploratory data analysis

When we get a new tabular dataset that we don't know anything about, there is a fairly standard process of exploration that we can do on that dataset to get a better handle on it:

  • Get some really general information about the dataset as a whole
  • Dig into each column individually
  • Try to understand the relationships between columns

To illustrate the process, we'll use a mystery dataset that we don't know anything about:


In [ ]:
mystery_data = pd.read_csv('data/mystery.csv', index_col=None)

The 10,000-foot view

A few questions we should ask:

  • How many rows are there?
  • How many columns are there?
  • What are the datatypes of each column?
  • How many missing values (if any) are there?

The Pandas built-in .info() method gives a great, quick answer to these questions:


In [ ]:
mystery_data.info()

And we can use head to look at some actual values in the columns:


In [ ]:
mystery_data.head()

Univariate

For each column, we should then ask:

  • What are the central tendencies of the column?
    • "Central tendencies" = "typical" value(s)
    • Especially - Mean, median, mode
  • What is the spread of the data?
    • Often, variance/standard deviation
    • But can also be: quartiles (or other percentiles), min/max values, etc
  • What does the distribution look like?

Again, Pandas provides a convenient way of looking at the usual univariate statistics:


In [ ]:
mystery_data.describe()

In [ ]:
for col in mystery_data.columns:
    print('median of col {}: {}'.format(col, mystery_data.loc[:, col].median()))
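
As an aside, the loop above is mostly for illustration; pandas can compute the same statistic for every column at once (a sketch, assuming all of the mystery columns are numeric):


In [ ]:
mystery_data.median()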

What does it mean for a column to have a median and mean that are different?

An aside: the difference between standard deviation and sample deviation

Sometimes, you'll see two slightly different definitions for the "spread" of a column of data - the standard deviation:

$\sigma_X = \sqrt{ \frac{1}{n}\sum_{i=1}^n (x_i - \mu_X)^2 }$

and the sample deviation:

$s_X = \sqrt{ \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu_X)^2 }$

(where $\mu$ is the mean of the column $X$ and $x_i$ is the $i$th value of $X$). What's the difference between dividing by $n$ and by $n-1$?

The standard deviation is used when $X$ contains the entire population that you're estimating the width of. So, for example, if you want to know the spread of test scores in a class and you have the score of each and every student, use the standard deviation.

The sample deviation is used to correct for bias when you're trying to estimate the width of a population that you only have a sample of data points from. For example, if you are trying to extrapolate the spread of test scores on the SAT by looking at how 1,000 students performed, you should use the sample deviation.

In practice, however, standard deviation and sample deviation are very similar if the number of samples is large.
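
In code, the two formulas correspond to the ddof ("delta degrees of freedom") argument: NumPy's std defaults to dividing by $n$ (ddof=0), while the pandas .std() method defaults to dividing by $n-1$ (ddof=1). A small sketch with made-up test scores:


In [ ]:
import numpy as np

scores = np.array([82.0, 91.0, 77.0, 95.0, 88.0])    # hypothetical test scores

print(np.std(scores))             # standard deviation: divide by n
print(np.std(scores, ddof=1))     # sample deviation: divide by n - 1
print(pd.Series(scores).std())    # pandas defaults to ddof=1, matching the previous line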

Visualizing a DataFrame

Summary statistics can only take us so far in understanding a new dataset. Visualizations like histograms are another great way of getting a high-level overview of what the data looks like.

Plotting functionality is built into Pandas. In particular, the built-in plot functions are a thin wrapper around a very powerful data visualization library called matplotlib. We won't directly use matplotlib in this course; we'll do all of our plotting by calling the Pandas wrapper functions. The pure matplotlib functions are very similar, however.


In [ ]:
# {dataframe_name}.{column_name}.plot.{plot_type}(plotting options)
mystery_data.A.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.A.mean()))
print('median: {}'.format(mystery_data.A.median()))
print('sample deviation: {}'.format(mystery_data.A.std()))

In [ ]:
mystery_data.B.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.B.mean()))
print('median: {}'.format(mystery_data.B.median()))
print('sample deviation: {}'.format(mystery_data.B.std()))

In [ ]:
mystery_data.C.plot.hist(bins=30)
print('mean: {}'.format(mystery_data.C.mean()))
print('median: {}'.format(mystery_data.C.median()))
print('sample deviation: {}'.format(mystery_data.C.std()))

Your turn

Try following the exploratory data analysis steps on the Titanic dataset (i.e. get the 10,000-foot view, then get some column-by-column insights). What interesting things do you find?


In [ ]:

Multivariate

After understanding what is contained in each individual column, it's important to understand how each column is related to the others. The related ideas of correlation and covariance are common ways of understanding pairwise dependencies between columns.

The covariance measures, unsurprisingly, the extent to which two columns co-vary. In other words, how often do values far from the mean in one column line up with values far from the mean (in the same direction) in the other column?

The covariance between columns $X$ and $Y$ is defined as:

$ cov(X, Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X)(y_i - \mu_Y) $

Think about what it means for covariance between 2 columns to be large vs small.

One problem with covariance is that comparing covariances between different pairs of columns can be tricky - columns with naturally larger spreads will tend to have larger covariances. The correlation between $X$ and $Y$ is a similar concept to covariance, but one that corrects for the difference in spreads:

$ corr(X, Y) = \frac{ cov(X, Y) }{ \sigma_X \sigma_Y} $

The correlation is always a number between -1 and 1, making it easy to interpret.


In [ ]:
mystery_data.corr()
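
To connect the formula to the table above, correlation is just covariance rescaled by the two standard deviations. A quick sketch verifying this for a single pair of columns (assuming columns named A and B, as used earlier; pandas computes both .cov() and .std() with the $n-1$ convention, so the two quantities are consistent):


In [ ]:
cov_ab = mystery_data['A'].cov(mystery_data['B'])
corr_ab = cov_ab / (mystery_data['A'].std() * mystery_data['B'].std())

print(corr_ab)
print(mystery_data['A'].corr(mystery_data['B']))    # should match, up to rounding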

Your turn

Calculate the correlation between pairs of columns in the Titanic dataset. Which columns correlate most strongly with the passengers who survived? What other relatively strong correlations do you find?


In [ ]:

More advanced Pandas

Missing data

Unfortunately, missing data is a fact of life, so being able to easily deal with blank values in our datasets is crucial. Pandas treats missing values as true "first-class citizens" in datasets:


In [ ]:
titanic.head()

Missing values are represented by NaN:


In [ ]:
titanic.loc[0, 'body']

In [ ]:
type(titanic.loc[0, 'body'])

And "nothingness" carries forward if you do mathematical operations on missing values:


In [ ]:
titanic.loc[0, 'body'] + 5

...which matches intuition.
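
One quirk worth knowing: NaN is not even equal to itself, so you can't check for missing values with ==. Instead, pandas provides isnull / notnull (newer versions also spell these isna / notna). A quick sketch:


In [ ]:
value = titanic.loc[0, 'body']

print(value == value)      # False when the value is NaN
print(pd.isnull(value))    # True when the value is missing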

One way of dealing with missing values is to just ignore them!


In [ ]:
titanic_none_missing = titanic.dropna()
titanic_none_missing.info()

By default, dropna drops a row from the dataset if any value is missing. That turns out to be a problem here: every row had at least one missing value, so dropna removed them all!
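
dropna has a couple of knobs that make it less all-or-nothing: how='all' only drops rows in which every value is missing, and thresh=k keeps rows that have at least k non-missing values. A sketch:


In [ ]:
print(titanic.dropna(how='all').shape)     # drop only rows that are entirely empty
print(titanic.dropna(thresh=10).shape)     # keep rows with at least 10 non-missing values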

Instead, maybe we just want to drop the rows where body is missing:


In [ ]:
# the `subset` argument allows us to only consider certain columns
titanic_body_not_missing = titanic.dropna(subset=['body'])
titanic_body_not_missing.info()

By doing this, we discover that none of the bodies recovered had been assigned to boats.

Sometimes, though, just ignoring missing data is a bit too aggressive. Instead, sometimes we just want to fill in the missing data with other data:


In [ ]:
# the `inplace` argument lets us change the data without creating a new object;
# passing a dict fills only the named column
titanic.fillna({'body': 'Person either survived or their body was not recovered.'}, inplace=True)
titanic.head()

Your turn

Using the Pandas notnull method, make a histogram of the values in the age column that aren't missing. Then, try filling the missing values in the age column with the column's average and median. Plot a histogram of each filled column - how do the histograms differ?


In [ ]:

The bottom line is - no matter how you choose to deal with your missing data, there will be trade-offs. Proceed with caution.

Chaining and grouping

Further reading:

  • https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb
  • https://blog.socialcops.com/engineering/machine-learning-python/

The idea of method chaining is prominently featured in Pandas. Method chaining allows us to perform several operations on a dataset in one fell swoop, making it much cleaner to code up any multi-step analysis we might want to perform. For example, instead of writing:


In [ ]:
titanic_temp = titanic.dropna(subset=['body'])
titanic_temp.describe()

...we can write simply:


In [ ]:
titanic.dropna(subset=['body']).describe()

The result is the same, but we avoided using ugly placeholder variables and extra lines of code.

Often, instead of looking at statistics of an entire dataset, we want to look at statistics of groups within the dataset. For example, we can easily calculate the survival rate of all passengers on the Titanic (recall survived = 0 if the person did not survive and 1 if they survived),


In [ ]:
titanic['survived'].mean()

...but it might also be interesting to see the survival rate broken out by gender. If you've used SQL before, this type of command might look familiar:


In [ ]:
titanic.groupby('sex')['survived'].mean()

What happened here? The groupby method splits the rows of our dataset into groups based on each passenger's sex. We then select only the survived column and calculate the mean within each group.
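
Grouping isn't limited to a single key or a single statistic: you can group by several columns at once, or compute multiple aggregates in one go with agg. A quick sketch (using columns from the Titanic description above):


In [ ]:
# average age broken out by sex and passenger class
titanic.groupby(['sex', 'pclass'])['age'].mean()

In [ ]:
# several aggregates of the same column at once
titanic.groupby('pclass')['age'].agg(['mean', 'median', 'count'])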

Your turn

Use grouping, selecting, and aggregating to answer the following questions:

  • What was the survival rate of first class passengers? What about third class?
  • What percentage of the total fares paid was paid by passengers in first class?
  • What port of embarkation had the highest average ticket price?

In [ ]: