Pandas 5: Summarizing data

Another in a series of notebooks that describe Pandas' powerful data management tools. In this one we summarize our data in a variety of ways. Which is more interesting than it sounds.

Outline:

  • WEO government debt data. Something to work with. How does Argentina's government debt compare to the debt of other countries? How did it compare when it defaulted in 2001?
  • Describing numerical data. Descriptive statistics: numbers of non-missing values, mean, median, quantiles.
  • Describing categorical data. The excellent value_counts method.
  • Grouping data. An incredibly useful collection of tools based on grouping data by a variable: men and women, grads and undergrads, and so on.

Note: requires internet access to run.

This Jupyter notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [ ]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())

WEO data on government debt

We use the IMF's data on government debt again, specifically its World Economic Outlook database, commonly referred to as the WEO. We focus on government debt expressed as a percentage of GDP, variable code GGXWDG_NGDP.

The central question here is how the debt of Argentina, which defaulted in 2001, compared to other countries. Was it a matter of too much debt or something else?

Load data

First step: load the data and extract a single variable: government debt (code GGXWDG_NGDP) expressed as a percentage of GDP.


In [ ]:
url1 = "http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/"
url2 = "WEOOct2016all.xls"
url = url1 + url2 
# despite the .xls extension, the WEO file is tab-delimited text,
# so read_csv with sep='\t' handles it
weo = pd.read_csv(url, sep='\t', 
                  usecols=[1,2] + list(range(19,46)), 
                  thousands=',', 
                  na_values=['n/a', '--']) 
print('Variable dtypes:\n', weo.dtypes.head(6), sep='')

Clean and shape

Second step: select the variable we want and generate the two dataframes.


In [ ]:
# select debt variable 
variables = ['GGXWDG_NGDP']
db = weo[weo['WEO Subject Code'].isin(variables)]

# drop variable code column (they're all the same) 
db = db.drop('WEO Subject Code', axis=1)

# set index to country code 
db = db.set_index('ISO')

# name columns 
db.columns.name = 'Year'

# transpose 
dbt = db.T

# see what we have 
dbt.head()

Example. Let's try a simple graph of the dataframe dbt. The goal is to put Argentina in perspective by plotting it along with many other countries.


In [ ]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.3, 
         ylim=(0,150)
        )
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)

Exercise.

  • What do you take away from this graph?
  • What would you change to make it look better?
  • To make it more informative?
  • To put Argentina's debt in context?

In [ ]:

Exercise. Do the same graph with Greece (GRC) as the country of interest. How does it differ? Why do you think that is?


In [ ]:

Describing numerical data

Let's step back a minute. What we're trying to do is compare Argentina to other countries. What's the best way to do that? This isn't a question with an obvious best answer, but we can try some things, see how they look. One thing we could do is compare Argentina to the mean or median. Or to some other feature of the distribution.

We work up to this by looking first at some features of the distribution of government debt numbers across countries. Some of this we've seen, some is new.

What's (not) there?

Let's check out the data first. How many non-missing values do we have at each date? We can do that with the count method. The argument axis=1 says to do this by date, counting across columns (axis number 1).


In [ ]:
dbt.shape

In [ ]:
# count non-missing values 
dbt.count(axis=1).plot()

In [ ]:

Describing series

Let's take the data for 2001 -- the year of Argentina's default -- and see how Argentina compares. Was its debt high compared to other countries?

Which leads to more questions. How would we compare? Compare Argentina to the mean or median? Something else?

Let's see how that works.


In [ ]:
# 2001 data 
db01 = db['2001']

In [ ]:
db01['ARG']

In [ ]:
db01.mean()

In [ ]:
db01.median()

In [ ]:
db01.describe()

In [ ]:
db01.quantile(q=[0.25, 0.5, 0.75])

In [ ]:

Comment. If we add enough quantiles, we might as well plot the whole distribution. The easiest way to do this is with a histogram.


In [ ]:
fig, ax = plt.subplots()
db01.hist(bins=15, ax=ax, alpha=0.35)
ax.set_xlabel('Government Debt (Percent of GDP)')
ax.set_ylabel('Number of Countries')

ymin, ymax = ax.get_ylim()
ax.vlines(db01['ARG'], ymin, ymax, color='blue', lw=2)

Comment. Compared to the whole sample of countries in 2001, it doesn't seem that Argentina had particularly high debt.


In [ ]:


In [ ]:

Describing dataframes

We can compute the same statistics for dataframes. Here we have a choice: we can compute (say) the mean down rows (axis=0) or across columns (axis=1). If we use the dataframe dbt, computing the mean across countries (columns) calls for axis=1.
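A two-by-two toy frame makes the axis convention concrete. The frame and its values here are made up for illustration; it just mimics dbt's shape, with years down the rows and countries across the columns.

```python
import pandas as pd

# hypothetical toy frame: years down the rows, countries across the columns
toy = pd.DataFrame({'A': [10.0, 20.0], 'B': [30.0, 40.0]},
                   index=['2000', '2001'])

# axis=0: compute down the rows, giving one number per column (country)
print(toy.mean(axis=0))   # A -> 15.0, B -> 35.0

# axis=1: compute across the columns, giving one number per row (year)
print(toy.mean(axis=1))   # 2000 -> 20.0, 2001 -> 30.0
```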


In [ ]:
# here we compute the mean across countries at every date
dbt.mean(axis=1).head()

In [ ]:
# or we could do the median
dbt.median(axis=1).head()

In [ ]:
# or a bunch of stats at once 
# NB:  db not dbt (there's no axis argument here)
db.describe()

In [ ]:
# the other way 
dbt.describe()

In [ ]:


In [ ]:

Example. Let's add the mean to our graph. We make it a dashed line with linestyle='dashed'.


In [ ]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.2, 
         ylim=(0,200)
        )
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt.mean(axis=1).plot(ax=ax, color='black', linewidth=2, linestyle='dashed')

Question. Do you think this looks better when the mean varies with time, or when we use a constant mean? Let's try it and see.


In [ ]:
dbar = dbt.mean().mean()
dbar

In [ ]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.3, 
         ylim=(0,150)
        )
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left') 
xmin, xmax = ax.get_xlim()
ax.hlines(dbar, xmin, xmax, linewidth=2, linestyle='dashed')

Exercise. Which do we like better?

Exercise. Replace the (constant) mean with the (constant) median. Which do you prefer?


In [ ]:

Describing categorical data

A categorical variable is one that takes on a small number of values. States take on one of fifty values. University students are either grad or undergrad. Students select majors and concentrations.

We're going to do two things with categorical data:

  • In this section, we count the number of observations in each category using the value_counts method. This is a series method: we apply it to one series/variable at a time.
  • In the next section, we go on to describe how other variables differ across categories. How do students who major in finance differ from those who major in English? And so on.
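Before turning to real data, here's a toy illustration of value_counts on a made-up series of student levels (the data is invented for this sketch):

```python
import pandas as pd

# hypothetical series of student levels
students = pd.Series(['grad', 'undergrad', 'undergrad', 'grad', 'undergrad'])

# count the number of observations in each category,
# sorted from most to least common
counts = students.value_counts()
print(counts)   # undergrad -> 3, grad -> 2
```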

We start with the combined MovieLens data we constructed in the previous notebook.


In [ ]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/mlcombined.csv'
ml  = pd.read_csv(url, index_col=0, encoding="ISO-8859-1")
print('Dimensions:', ml.shape)

# fix up the dates
ml["timestamp"] = pd.to_datetime(ml["timestamp"], unit="s")
ml.head(10)

In [ ]:
# which movies have the most ratings? 
ml['title'].value_counts().head(10)

In [ ]:
ml['title'].value_counts().head(10).plot.barh(alpha=0.5)

In [ ]:
# which people have rated the most movies?
ml['userId'].value_counts().head(10)

In [ ]:

Grouping data

Next up: group data by some variable. As an example, how would we compute the average rating of each movie? If you think for a minute, you might think of these steps:

  • Group the data by movie: Put all the "Pulp Fiction" ratings in one bin, all the "Shawshank" ratings in another. We do that with the groupby method.
  • Compute a statistic (the mean, for example) for each group.

Pandas has tools that make that relatively easy.
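Here's the two-step recipe on a tiny made-up ratings frame (titles and ratings invented for illustration), before we apply it to the real MovieLens data:

```python
import pandas as pd

# hypothetical ratings data
toy = pd.DataFrame({'title':  ['Pulp Fiction', 'Shawshank', 'Pulp Fiction'],
                    'rating': [5.0, 4.0, 3.0]})

# step 1: group by movie; step 2: compute the mean of each group
means = toy.groupby('title')['rating'].mean()
print(means)   # Pulp Fiction -> 4.0, Shawshank -> 4.0
```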


In [ ]:
# group 
g = ml[['title', 'rating']].groupby('title')
type(g)

In [ ]:

Now that we have a groupby object, what can we do with it?


In [ ]:
# the number in each category
g.count().head(10)

In [ ]:
# what type of object have we created?
type(g.count())

In [ ]:

Comment. Note that the combination of groupby and count created a dataframe in which

  • the index is the variable we grouped by. If we group by more than one variable, we get a multi-index.
  • the columns are the other variables.
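A quick sketch of the multi-index case on made-up data (schools, levels, and GPAs are invented for illustration):

```python
import pandas as pd

# hypothetical data with two grouping variables
toy = pd.DataFrame({'school': ['Stern', 'Stern', 'Courant'],
                    'level':  ['grad', 'undergrad', 'grad'],
                    'gpa':    [3.5, 3.0, 3.8]})

# grouping by two variables gives a two-level (multi-) index
g2 = toy.groupby(['school', 'level']).count()
print(g2.index.nlevels)    # 2: levels are school and level
print(list(g2.columns))    # ['gpa']: the remaining variable
```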

Exercise. Take the code

counts = ml.groupby(['title', 'movieId']).count()

Without running it, what is the index of counts? What are its columns?


In [ ]:
counts = ml.groupby(['title', 'movieId']).count()

In [ ]:
gm = g.mean()
gm.head()

In [ ]:


In [ ]:
# we can put them together 
grouped = g.count()
grouped = grouped.rename(columns={'rating': 'Number'})
grouped['Mean'] = g.mean()
grouped.head(10)
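An alternative to building the table column by column is the agg method, which computes several statistics in one call. A minimal sketch on made-up data standing in for the MovieLens frame:

```python
import pandas as pd

# hypothetical ratings data standing in for the MovieLens frame
toy = pd.DataFrame({'title':  ['Pulp Fiction', 'Shawshank', 'Pulp Fiction'],
                    'rating': [5.0, 4.0, 3.0]})

# agg applies each named statistic to every group at once,
# producing one column per statistic
stats = toy.groupby('title')['rating'].agg(['count', 'mean'])
print(stats)
```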

In [ ]:


In [ ]:
grouped.plot.scatter(x='Number', y='Mean')

In [ ]:

Exercise. Compute the median and add it to the dataframe.


In [ ]:


In [ ]:

Resources

The Brandon Rhodes video covers most of this, too.


In [ ]: