Pandas 5: Summarizing data

Another in a series of notebooks that describe Pandas' powerful data management tools. In this one we summarize our data in a variety of ways. Which is more interesting than it sounds.

Outline:

  • WEO government debt data. Something to work with. How does Argentina's government debt compare to the debt of other countries? How did it compare when it defaulted in 2001?
  • Describing numerical data. Descriptive statistics: numbers of non-missing values, mean, median, quantiles.
  • Describing categorical data. The excellent value_counts method.
  • Grouping data. An incredibly useful collection of tools for grouping data by a variable: men and women, grads and undergrads, and so on.

Note: requires internet access to run.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [2]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())


Python version: 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Today:  2016-04-20

WEO data on government debt

We use the IMF's data on government debt again, specifically its World Economic Outlook database, commonly referred to as the WEO. We focus on government debt expressed as a percentage of GDP, variable code GGXWDG_NGDP.

The central question here is how the debt of Argentina, which defaulted in 2001, compared to other countries. Was it a matter of too much debt or something else?

Load data

First step: load the data and extract a single variable: government debt (code GGXWDG_NGDP) expressed as a percentage of GDP.


In [4]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', 
                  usecols=[1,2] + list(range(19,45)), 
                  thousands=',', 
                  na_values=['n/a', '--']) 
print('Variable dtypes:\n', weo.dtypes.head(6), sep='')


Variable dtypes:
ISO                  object
WEO Subject Code     object
1990                float64
1991                float64
1992                float64
1993                float64
dtype: object

Clean and shape

Second step: select the variable we want and generate the two dataframes.


In [5]:
# select debt variable 
variables = ['GGXWDG_NGDP']
db = weo[weo['WEO Subject Code'].isin(variables)]

# drop variable code column (they're all the same) 
db = db.drop('WEO Subject Code', axis=1)

# set index to country code 
db = db.set_index('ISO')

# name columns 
db.columns.name = 'Year'

# transpose 
dbt = db.T

# see what we have 
dbt.head()


Out[5]:
ISO AFG ALB DZA AGO ATG ARG ARM AUS AUT AZE ... GBR USA URY UZB VUT VEN VNM YEM ZMB ZWE
Year
1990 NaN NaN NaN NaN 98.071 NaN NaN 16.442 56.057 NaN ... 31.029 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1991 NaN NaN 77.834 NaN 99.075 NaN NaN 21.627 56.260 NaN ... 31.026 NaN NaN NaN 15.908 NaN NaN NaN NaN NaN
1992 NaN NaN 62.926 NaN 95.479 23.309 NaN 27.650 56.153 NaN ... 36.129 NaN NaN NaN 21.036 NaN NaN NaN NaN NaN
1993 NaN NaN 74.028 NaN 90.147 25.113 NaN 30.650 60.765 NaN ... 41.292 NaN NaN NaN 20.274 NaN NaN NaN NaN NaN
1994 NaN NaN 98.376 NaN 89.655 26.536 NaN 31.763 63.863 NaN ... 44.309 NaN NaN NaN 21.599 NaN NaN NaN NaN NaN

5 rows × 189 columns

Example. Let's try a simple graph of the dataframe dbt. The goal is to put Argentina in perspective by plotting it along with many other countries.


In [9]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.3, 
         ylim=(0,150)
        )
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x97d4438>

Exercise.

  • What do you take away from this graph?
  • What would you change to make it look better?
  • To make it more informative?
  • To put Argentina's debt in context?

In [ ]:

Exercise. Do the same graph with Greece (GRC) as the country of interest. How does it differ? Why do you think that is?


In [ ]:

Describing numerical data

Let's step back a minute. What we're trying to do is compare Argentina to other countries. What's the best way to do that? This isn't a question with an obvious best answer, but we can try some things and see how they look. One thing we could do is compare Argentina to the mean or median. Or to some other feature of the distribution.

We work up to this by looking first at some features of the distribution of government debt numbers across countries. Some of this we've seen before; some is new.

What's (not) there?

Let's check out the data first. How many non-missing values do we have at each date? We can do that with the count method. The argument axis=1 says to do this by date, counting across columns (axis number 1).
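The axis convention trips people up, so here's a minimal sketch on a made-up dataframe (the numbers are invented for illustration, not WEO data): axis=0 counts down the rows of each column, axis=1 counts across the columns of each row.

```python
import numpy as np
import pandas as pd

# tiny dataframe with some missing values
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [np.nan, np.nan, 6.0]},
                   index=['a', 'b', 'c'])

# non-missing values in each column: x -> 2, y -> 1
print(toy.count(axis=0))

# non-missing values in each row: a -> 1, b -> 0, c -> 2
print(toy.count(axis=1))
```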


In [6]:
dbt.shape


Out[6]:
(26, 189)

In [20]:
# count non-missing values 
dbt.count(axis=1).plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x920c518>

In [ ]:

Describing series

Let's take the data for 2001 -- the year of Argentina's default -- and see how Argentina compares. Was its debt high compared to other countries?

Which leads to more questions. How would we compare? Compare Argentina to the mean or median? Something else?

Let's see how that works.


In [10]:
# 2001 data 
db01 = db['2001']

In [11]:
db01['ARG']


Out[11]:
44.756999999999998

In [12]:
db01.mean()


Out[12]:
72.91579881656806

In [13]:
db01.median()


Out[13]:
56.523999999999994

In [14]:
db01.describe()


Out[14]:
count    169.000000
mean      72.915799
std       72.257092
min        0.000000
25%       36.968000
50%       56.524000
75%       93.769000
max      612.226000
Name: 2001, dtype: float64

In [15]:
db01.quantile(q=[0.25, 0.5, 0.75])


Out[15]:
0.25    36.968
0.50    56.524
0.75    93.769
Name: 2001, dtype: float64

In [ ]:

Comment. If we add enough quantiles, we might as well plot the whole distribution. The easiest way to do this is with a histogram.


In [16]:
fig, ax = plt.subplots()
db01.hist(bins=15, ax=ax, alpha=0.35)
ax.set_xlabel('Government Debt (Percent of GDP)')
ax.set_ylabel('Number of Countries')

darg = db01['ARG'] 
ymin, ymax = ax.get_ylim()
ax.vlines(darg, ymin, ymax, color='blue', lw=2)


Out[16]:
<matplotlib.collections.LineCollection at 0x93fbe80>

In [ ]:

Describing dataframes

We can compute the same statistics for dataframes. Here we have a choice: we can compute (say) the mean down rows (axis=0) or across columns (axis=1). If we use the dataframe dbt, computing the mean across countries (columns) calls for axis=1.


In [21]:
# here we compute the mean across countries at every date
dbt.mean(axis=1).head()


Out[21]:
Year
1990    66.487102
1991    66.227729
1992    69.696406
1993    68.236130
1994    69.928247
dtype: float64

In [22]:
# or we could do the median
dbt.median(axis=1).head()


Out[22]:
Year
1990    56.482
1991    56.856
1992    55.624
1993    58.599
1994    59.735
dtype: float64

In [23]:
# or a bunch of stats at once 
# NB:  db not dbt (there's no axis argument here)
db.describe()


Out[23]:
Year 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
count 49.000000 59.000000 64.000000 69.000000 73.000000 88.000000 96.000000 112.000000 124.000000 132.000000 ... 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000 182.000000
mean 66.487102 66.227729 69.696406 68.236130 69.928247 62.102545 56.744188 56.986161 58.682137 59.659538 ... 53.035082 46.838967 45.018973 47.706577 46.481049 46.683604 47.686456 49.210324 51.306412 54.344247
std 55.489269 49.025392 62.713445 45.234081 53.652731 40.891238 37.642378 34.630448 38.194130 36.556886 ... 56.775113 46.953668 38.423596 33.993966 31.753314 33.059382 33.386893 33.999834 33.904244 33.923656
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.593000 0.684000 0.585000 0.671000 0.665000 0.000000 0.576000 0.546000 0.069000 0.062000
25% 28.353000 33.583000 35.971250 41.292000 43.052000 30.766750 28.317250 29.401250 30.370750 32.680750 ... 24.022750 21.209500 20.597750 25.888250 25.907000 24.616000 26.516250 28.047750 29.655000 32.763500
50% 56.482000 56.856000 55.624000 58.599000 59.735000 57.001000 53.500500 53.802500 53.690000 56.354000 ... 39.138000 35.747500 36.552500 39.440500 39.649000 39.139500 40.073000 40.070500 43.361000 47.370000
75% 80.353000 85.518500 86.676000 90.135000 87.985000 86.854000 73.508250 74.393000 75.460750 82.958500 ... 65.469000 62.184750 60.010250 63.158250 61.114500 61.253000 62.716500 64.636000 67.710000 70.120000
max 243.429000 203.364000 454.949000 262.657000 355.061000 219.593000 202.885000 156.196000 173.560000 164.268000 ... 593.515000 482.764000 309.660000 210.247000 215.821000 229.706000 236.649000 242.593000 246.173000 245.896000

8 rows × 26 columns


In [24]:
# the other way 
dbt.describe()


Out[24]:
ISO AFG ALB DZA AGO ATG ARG ARM AUS AUT AZE ... GBR USA URY UZB VUT VEN VNM YEM ZMB ZWE
count 0.0 19.000000 25.000000 16.000000 26.000000 24.000000 20.000000 26.000000 26.000000 21.000000 ... 26.000000 15.000000 15.000000 18.000000 25.000000 18.000000 16.000000 17.000000 16.000000 11.000000
mean NaN 62.007789 46.233680 47.985750 99.331308 50.230958 33.859000 21.928846 69.135846 15.938619 ... 52.135808 79.630533 72.594800 24.253167 24.902960 40.506000 43.055188 52.637941 24.360500 56.294091
std NaN 6.104841 34.091374 26.904972 13.553858 29.697460 10.162803 8.135018 9.050139 5.588846 ... 19.653787 20.306936 18.642755 16.621395 6.588914 11.230156 8.742659 13.964724 6.610021 10.191885
min NaN 53.427000 8.278000 16.426000 77.271000 23.309000 14.249000 9.677000 56.057000 7.297000 ... 31.026000 53.022000 54.890000 8.276000 15.908000 23.099000 31.432000 36.421000 16.718000 38.576000
25% NaN 57.582000 10.814000 30.740500 90.803500 31.498000 31.268000 15.407000 64.097750 11.599000 ... 39.336750 63.825000 59.798500 10.216000 20.568000 31.966250 37.220000 43.801000 20.207000 50.936500
50% NaN 59.667000 44.535000 40.515500 96.786500 39.686000 37.150500 22.107500 66.455000 14.291000 ... 43.923000 72.835000 64.067000 18.563000 22.162000 39.192000 40.158000 48.722000 21.912500 54.230000
75% NaN 66.391000 72.863000 59.234000 104.104500 52.408750 39.717000 28.952250 76.924000 20.603000 ... 62.298500 100.743000 79.808000 33.685750 28.046000 50.253000 48.215750 57.827000 26.263500 65.743000
max NaN 73.305000 116.195000 104.542000 128.165000 137.715000 46.503000 35.972000 86.712000 25.414000 ... 89.397000 104.853000 111.546000 59.375000 38.806000 61.443000 61.215000 96.050000 41.892000 69.291000

8 rows × 189 columns


In [ ]:

Example. Let's add the mean to our graph. We make it a dashed line with linestyle='dashed'.


In [25]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.3, 
         ylim=(0,150)
        )
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left')
dbt.mean(axis=1).plot(ax=ax, color='black', linewidth=2, linestyle='dashed')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x9249128>

Question. Do you think this looks better when the mean varies with time, or when we use a constant mean? Let's try it and see.


In [26]:
dbar = dbt.mean().mean()
dbar


Out[26]:
57.044499497307044

In [27]:
fig, ax = plt.subplots()
dbt.plot(ax=ax, 
         legend=False, color='blue', alpha=0.3, 
         ylim=(0,150)
        )
dbt['ARG'].plot(ax=ax, color='black', linewidth=1.5)
ax.set_ylabel('Percent of GDP')
ax.set_xlabel('')
ax.set_title('Government debt', fontsize=14, loc='left') 
xmin, xmax = ax.get_xlim()
ax.hlines(dbar, xmin, xmax, linewidth=2, linestyle='dashed')


Out[27]:
<matplotlib.collections.LineCollection at 0xb1b59b0>

Exercise. Which do we like better?

Exercise. Replace the (constant) mean with the (constant) median. Which do you prefer?


In [ ]:

Describing categorical data

A categorical variable is one that takes on a small number of values. States take on one of fifty values. University students are either grad or undergrad. Students select majors and concentrations.

We're going to do two things with categorical data:

  • In this section, we count the number of observations in each category using the value_counts method. This is a series method, so we apply it to one series/variable at a time.
  • In the next section, we go on to describe how other variables differ across categories. How do students who major in finance differ from those who major in English? And so on.
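To see what value_counts does before we load real data, here's a minimal sketch on an invented series (the majors are made up for illustration):

```python
import pandas as pd

# a small categorical variable: each student's major
majors = pd.Series(['Finance', 'English', 'Finance', 'Economics', 'Finance'])

# count observations in each category, sorted from most to least common
print(majors.value_counts())
```

The result is itself a series, indexed by category, which is why we can chain things like .head() or .plot() onto it.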

We start with the combined MovieLens data we constructed in the previous notebook.


In [30]:
url = 'http://pages.stern.nyu.edu/~dbackus/Data/mlcombined.csv'
ml  = pd.read_csv(url)
print('Dimensions:', ml.shape)
ml.head(10)


Dimensions: (105339, 7)
Out[30]:
Unnamed: 0 userId movieId rating timestamp title genres
0 0 1 16 4.0 1217897793 Casino (1995) Crime|Drama
1 1 1 24 1.5 1217895807 Powder (1995) Drama|Sci-Fi
2 2 1 32 4.0 1217896246 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) Mystery|Sci-Fi|Thriller
3 3 1 47 4.0 1217896556 Seven (a.k.a. Se7en) (1995) Mystery|Thriller
4 4 1 50 4.0 1217896523 Usual Suspects, The (1995) Crime|Mystery|Thriller
5 5 1 110 4.0 1217896150 Braveheart (1995) Action|Drama|War
6 6 1 150 3.0 1217895940 Apollo 13 (1995) Adventure|Drama|IMAX
7 7 1 161 4.0 1217897864 Crimson Tide (1995) Drama|Thriller|War
8 8 1 165 3.0 1217897135 Die Hard: With a Vengeance (1995) Action|Crime|Thriller
9 9 1 204 0.5 1217895786 Under Siege 2: Dark Territory (1995) Action

In [ ]:


In [32]:
# which movies have the most ratings? 
ml['title'].value_counts().head(10)


Out[32]:
Pulp Fiction (1994)                          325
Forrest Gump (1994)                          311
Shawshank Redemption, The (1994)             308
Jurassic Park (1993)                         294
Silence of the Lambs, The (1991)             290
Star Wars: Episode IV - A New Hope (1977)    273
Matrix, The (1999)                           261
Terminator 2: Judgment Day (1991)            253
Schindler's List (1993)                      248
Braveheart (1995)                            248
Name: title, dtype: int64

In [41]:
ml['title'].value_counts().head(10).plot.barh(alpha=0.5)


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0xcbe83c8>

In [ ]:


In [42]:
# which people have rated the most movies?
ml['userId'].value_counts().head(10)


Out[42]:
668    5678
575    2837
458    2086
232    1421
310    1287
475    1249
128    1231
224    1182
607    1176
63     1107
Name: userId, dtype: int64

In [ ]:

Grouping data

Next up: group data by some variable. As an example, how would we compute the average rating of each movie? If you think for a minute, you might think of these steps:

  • Group the data by movie: Put all the "Pulp Fiction" ratings in one bin, all the "Shawshank" ratings in another. We do that with the groupby method.
  • Compute a statistic (the mean, for example) for each group.

Pandas has tools that make that relatively easy.


In [43]:
# group 
g = ml[['title', 'rating']].groupby('title')
type(g)


Out[43]:
pandas.core.groupby.DataFrameGroupBy

In [ ]:

Now that we have a groupby object, what can we do with it?


In [50]:
# the number in each category
g.count().head(10)


Out[50]:
rating
title
'71 (2014) 1
'Hellboy': The Seeds of Creation (2004) 1
'Round Midnight (1986) 1
'Til There Was You (1997) 3
'burbs, The (1989) 20
'night Mother (1986) 1
(500) Days of Summer (2009) 37
*batteries not included (1987) 11
...And Justice for All (1979) 10
10 (1979) 3

In [48]:
# what type of object have we created?
type(g.count())


Out[48]:
pandas.core.frame.DataFrame

In [ ]:

Comment. Note that the combination of groupby and count created a dataframe in which:

  • The index is the variable we grouped by. If we group by more than one, we get a multi-index.
  • The columns are the other variables.
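Here's a minimal sketch of the multi-index case on invented data (the school/level dataframe is made up for illustration, not from MovieLens):

```python
import pandas as pd

# toy data: four students, two grouping variables and one other variable
df = pd.DataFrame({'school': ['Stern', 'Stern', 'CAS', 'CAS'],
                   'level':  ['grad', 'undergrad', 'grad', 'grad'],
                   'gpa':    [3.5, 3.2, 3.8, 3.6]})

# group by two variables, then count
counts = df.groupby(['school', 'level']).count()

print(counts.index.names)   # the grouping variables form a multi-index
print(counts.columns)       # the remaining variable(s) are the columns
```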

Exercise. Take the code

counts = ml.groupby(['title', 'movieId']).count()

Without running it, what is the index of counts? What are its columns?


In [ ]:

In [46]:
gm = g.mean()
gm.head(10)


Out[46]:
rating
title
'71 (2014) 3.500000
'Hellboy': The Seeds of Creation (2004) 3.000000
'Round Midnight (1986) 2.500000
'Til There Was You (1997) 4.000000
'burbs, The (1989) 3.125000
'night Mother (1986) 3.000000
(500) Days of Summer (2009) 3.932432
*batteries not included (1987) 3.318182
...And Justice for All (1979) 3.650000
10 (1979) 2.000000

In [ ]:

In [59]:
# we can put them together 
grouped = g.count()
grouped = grouped.rename(columns={'rating': 'Number'})
grouped['Mean'] = g.mean()
grouped.head(10)


Out[59]:
Number Mean
title
'71 (2014) 1 3.500000
'Hellboy': The Seeds of Creation (2004) 1 3.000000
'Round Midnight (1986) 1 2.500000
'Til There Was You (1997) 3 4.000000
'burbs, The (1989) 20 3.125000
'night Mother (1986) 1 3.000000
(500) Days of Summer (2009) 37 3.932432
*batteries not included (1987) 11 3.318182
...And Justice for All (1979) 10 3.650000
10 (1979) 3 2.000000

In [ ]:


In [63]:
grouped.plot.scatter(x='Number', y='Mean')


Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0xd0a8208>

In [ ]:

Exercise. Compute the median and add it to the dataframe.


In [ ]:

Resources

The Brandon Rhodes video covers most of this, too.


In [ ]: