bit.ly/osb2015pandas

Introduction to data munging with Jupyter and pandas

Open Source Bridge 2015


In [1]:
from __future__ import division

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn
import rpy2

from IPython.display import display, Image, YouTubeVideo

%matplotlib inline

The case for open source data tools

IPython/Jupyter Notebook: some cool tips for beginners

1. holy moly tab completion


In [3]:
# type "pd.read_" and hit Tab to see the available readers
pd.read_csv


Out[3]:
<function pandas.io.parsers.read_csv>

2. built-in documentation


In [12]:
?pd.read_csv

3. markup!

Markdown and HTML



LaTeX

$\bar{x} = \frac{\sum_{i=1}^{n}w_i\cdot x_i}{\sum_{i=1}^{n}w_i}$

Syntax highlighting for other languages

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

4. magics!


In [14]:
%quickref

In [15]:
%load_ext rmagic


The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

In [17]:
%R x <- c(0:10, 50)
%R xm <- mean(x)
%R c(xm, mean(x, trim = 0.10))


Out[17]:
array([ 8.75,  5.5 ])

5. embed rich media like video


In [8]:
YouTubeVideo("L4Hbv4ugUWk")


Out[8]:
(embedded YouTube video)

6. Notebooks are now viewable directly on GitHub, and can also be exported as PDF or HTML.
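
One way to export, as a sketch: the notebook filename below is just a placeholder, and depending on your version the command may be ipython nbconvert or jupyter nbconvert. Run it from a terminal, or from a notebook cell with a leading "!":

!ipython nbconvert --to html my_notebook.ipynb
!ipython nbconvert --to pdf my_notebook.ipynb   # PDF export needs a LaTeX installation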

pandas

Input

Source: IRS.gov


In [22]:
?pd.read_csv

In [18]:
# read in a CSV
# specify that zipcode should be treated as a string rather than an int!
AGI = pd.read_csv('12zpallagi.csv',dtype={'zipcode': str})
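
Why the dtype matters: read as integers, ZIP codes lose their leading zeros. A tiny illustration with a made-up two-row CSV (not the IRS file):

from StringIO import StringIO   # Python 2; on Python 3 use io.StringIO

demo_csv = "zipcode,N1\n00501,100\n35004,1600\n"
pd.read_csv(StringIO(demo_csv))['zipcode'].tolist()                          # [501, 35004] -- leading zero gone
pd.read_csv(StringIO(demo_csv), dtype={'zipcode': str})['zipcode'].tolist()  # ['00501', '35004']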

In [19]:
AGI.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 166904 entries, 0 to 166903
Data columns (total 77 columns):
STATEFIPS    166904 non-null int64
STATE        166904 non-null object
zipcode      166904 non-null object
AGI_STUB     166904 non-null int64
N1           166904 non-null float64
MARS1        166904 non-null float64
MARS2        166904 non-null float64
MARS4        166904 non-null float64
PREP         166904 non-null float64
N2           166904 non-null float64
NUMDEP       166904 non-null float64
A00100       166904 non-null float64
N00200       166904 non-null float64
A00200       166904 non-null float64
N00300       166904 non-null float64
A00300       166904 non-null float64
N00600       166904 non-null float64
A00600       166904 non-null float64
N00650       166904 non-null float64
A00650       166904 non-null float64
N00900       166904 non-null float64
A00900       166904 non-null float64
SCHF         166904 non-null float64
N01000       166904 non-null float64
A01000       166904 non-null float64
N01400       166904 non-null float64
A01400       166904 non-null float64
N01700       166904 non-null float64
A01700       166904 non-null float64
N02300       166904 non-null float64
A02300       166904 non-null float64
N02500       166904 non-null float64
A02500       166904 non-null float64
N03300       166904 non-null float64
A03300       166904 non-null float64
N00101       166904 non-null float64
A00101       166904 non-null float64
N04470       166904 non-null float64
A04470       166904 non-null float64
N18425       166904 non-null float64
A18425       166904 non-null float64
N18450       166904 non-null float64
A18450       166904 non-null float64
N18500       166904 non-null float64
A18500       166904 non-null float64
N18300       166904 non-null float64
A18300       166904 non-null float64
N19300       166904 non-null float64
A19300       166904 non-null float64
N19700       166904 non-null float64
A19700       166904 non-null float64
N04800       166904 non-null float64
A04800       166904 non-null float64
N07100       166904 non-null float64
A07100       166904 non-null float64
N07220       166904 non-null float64
A07220       166904 non-null float64
N07180       166904 non-null float64
A07180       166904 non-null float64
N07260       166904 non-null float64
A07260       166904 non-null float64
N59660       166904 non-null float64
A59660       166904 non-null float64
N59720       166904 non-null float64
A59720       166904 non-null float64
N11070       166904 non-null float64
A11070       166904 non-null float64
N09600       166904 non-null float64
A09600       166904 non-null float64
N06500       166904 non-null float64
A06500       166904 non-null float64
N10300       166904 non-null float64
A10300       166904 non-null float64
N11901       166904 non-null float64
A11901       166904 non-null float64
N11902       166904 non-null float64
A11902       166904 non-null float64
dtypes: float64(73), int64(2), object(2)
memory usage: 99.3+ MB

Transformation


In [20]:
# you can select columns by label or position!
AGI_column_subset = AGI[['STATE','AGI_STUB','zipcode','N1','A00100']]
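
The same subset by integer position instead of label would use .iloc; the positions below are read off the info() listing above, so treat them as illustrative:

# label-based vs. position-based selection of the same five columns
AGI.loc[:, ['STATE', 'AGI_STUB', 'zipcode', 'N1', 'A00100']]
AGI.iloc[:, [1, 3, 2, 4, 11]]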

In [21]:
AGI_column_subset.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 166904 entries, 0 to 166903
Data columns (total 5 columns):
STATE       166904 non-null object
AGI_STUB    166904 non-null int64
zipcode     166904 non-null object
N1          166904 non-null float64
A00100      166904 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 7.6+ MB

In [31]:
# note this is inclusive!
AGI_row_subset = AGI.ix[6:11]

In [32]:
AGI_row_subset


Out[32]:
STATEFIPS STATE zipcode AGI_STUB N1 MARS1 MARS2 MARS4 PREP N2 ... N09600 A09600 N06500 A06500 N10300 A10300 N11901 A11901 N11902 A11902
6 1 AL 35004 1 1600 990 270 300 840 2390 ... 0 0 570 397 760 667 130 107 1370 2855
7 1 AL 35004 2 1310 570 400 290 680 2620 ... 0 0 1030 2595 1060 2752 150 186 1140 2965
8 1 AL 35004 3 900 280 500 100 460 2090 ... 0 0 860 4382 870 4604 170 278 710 1982
9 1 AL 35004 4 590 70 490 30 270 1590 ... 0 0 580 4605 580 4774 120 300 460 1460
10 1 AL 35004 5 480 30 440 0 250 1340 ... 0 0 480 7397 480 7623 150 582 340 1330
11 1 AL 35004 6 50 0 50 0 50 140 ... 30 70 50 4956 50 5125 30 585 0 0

6 rows × 77 columns
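
.ix slices by label here, which is why the endpoint 11 is included. Roughly equivalent forms with the more explicit indexers (in later pandas versions .ix was deprecated and then removed):

AGI.loc[6:11]    # label-based: endpoint included
AGI.iloc[6:12]   # position-based: half-open, endpoint excluded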


In [34]:
AGI_column_subset.rename(columns={'N1':'population','A00100':'amount'},inplace=True)


/usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2417: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)
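
The warning appears because AGI_column_subset was built by indexing into AGI, so pandas can't tell whether it is a view or an independent copy. One common way to sidestep it (a sketch, not part of the original demo) is to take an explicit copy when subsetting:

# an explicit copy makes later in-place edits unambiguous
AGI_column_subset = AGI[['STATE', 'AGI_STUB', 'zipcode', 'N1', 'A00100']].copy()
AGI_column_subset.rename(columns={'N1': 'population', 'A00100': 'amount'}, inplace=True)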

In [35]:
AGI_column_subset


Out[35]:
STATE AGI_STUB zipcode population amount
0 AL 1 00000 889920 11517112
1 AL 2 00000 491150 17617800
2 AL 3 00000 254280 15644666
3 AL 4 00000 160160 13885434
4 AL 5 00000 183320 24641055
5 AL 6 00000 44840 21926706
6 AL 1 35004 1600 20639
7 AL 2 35004 1310 48501
8 AL 3 35004 900 55790
9 AL 4 35004 590 50978
10 AL 5 35004 480 59932
11 AL 6 35004 50 21723
12 AL 1 35005 1400 19059
13 AL 2 35005 1060 38184
14 AL 3 35005 490 29808
15 AL 4 35005 250 21171
16 AL 5 35005 170 22122
17 AL 6 35005 0 0
18 AL 1 35006 500 6356
19 AL 2 35006 330 12039
20 AL 3 35006 180 11227
21 AL 4 35006 150 12855
22 AL 5 35006 140 17443
23 AL 6 35006 0 0
24 AL 1 35007 4240 50439
25 AL 2 35007 2640 96955
26 AL 3 35007 1860 115225
27 AL 4 35007 1360 117808
28 AL 5 35007 1510 195070
29 AL 6 35007 190 55060
... ... ... ... ... ...
166874 WY 1 83126 50 633
166875 WY 2 83126 20 809
166876 WY 3 83126 30 2048
166877 WY 4 83126 40 6122
166878 WY 5 83126 0 0
166879 WY 6 83126 0 0
166880 WY 1 83127 520 6466
166881 WY 2 83127 300 11343
166882 WY 3 83127 230 14287
166883 WY 4 83127 170 14615
166884 WY 5 83127 190 24920
166885 WY 6 83127 30 16535
166886 WY 1 83128 340 4881
166887 WY 2 83128 240 9295
166888 WY 3 83128 160 10042
166889 WY 4 83128 80 6885
166890 WY 5 83128 90 11905
166891 WY 6 83128 30 11589
166892 WY 1 83414 60 531
166893 WY 2 83414 50 1700
166894 WY 3 83414 40 2491
166895 WY 4 83414 0 0
166896 WY 5 83414 30 4645
166897 WY 6 83414 20 14668
166898 WY 1 99999 8870 113864
166899 WY 2 99999 6070 220135
166900 WY 3 99999 3720 229700
166901 WY 4 99999 2590 224438
166902 WY 5 99999 2960 391340
166903 WY 6 99999 1070 1317401

166904 rows × 5 columns


In [63]:
# group by zipcode and sum other values, resetting index
AGI_grouped = AGI_column_subset.groupby('zipcode').sum().reset_index()

In [64]:
AGI_grouped.head()


Out[64]:
zipcode AGI_STUB population amount
0 00000 1071 142098490 9123982917
1 01001 21 8780 458716
2 01002 21 9460 732849
3 01005 21 2230 122744
4 01007 21 7320 467891
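
Note that .sum() also adds up AGI_STUB, which isn't meaningful after grouping (each ZIP code has brackets 1 through 6, so the sum is always 21). A sketch that aggregates only the columns worth summing:

# sum just the population and amount columns per ZIP code
AGI_grouped_alt = (AGI_column_subset
                   .groupby('zipcode')[['population', 'amount']]
                   .sum()
                   .reset_index())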

A WILD ZIP CODE APPEARS!

Delete or render null? You decide!


In [65]:
AGI_grouped['population'].mean()


Out[65]:
10254.987012987012

In [66]:
# this can also be done with the na_values parameter when the CSV is read in
null_zips = AGI_grouped['zipcode'] == '00000'
AGI_grouped.loc[null_zips, 'zipcode'] = np.nan
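
Doing it at read time instead would look roughly like this; na_values accepts a dict mapping a column to the values that should become NaN:

# treat '00000' in the zipcode column as missing while reading the CSV
AGI = pd.read_csv('12zpallagi.csv', dtype={'zipcode': str},
                  na_values={'zipcode': ['00000']})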

In [67]:
AGI_grouped.head()


Out[67]:
zipcode AGI_STUB population amount
0 NaN 1071 142098490 9123982917
1 01001 21 8780 458716
2 01002 21 9460 732849
3 01005 21 2230 122744
4 01007 21 7320 467891

In [68]:
AGI_notnull = AGI_grouped.dropna()

In [69]:
AGI_notnull['population'].mean()


Out[69]:
5128.9638875861319

In [70]:
AGI_grouped.dropna(inplace=True)

In [73]:
# make a new column with the real amount, not in thousands
AGI_grouped['actual_amount'] = AGI_grouped['amount'] * 1000

In [74]:
# compute the mean AGI per return for each ZIP code
AGI_grouped['weighted_mean_AGI'] = AGI_grouped['actual_amount']/AGI_grouped['population']
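
This is just the weighted mean from the LaTeX example earlier: dividing the summed amounts by the summed return counts is the same as weighting each AGI bracket's mean AGI $x_i$ by its number of returns $w_i$:

$\text{weighted\_mean\_AGI} = \frac{\sum_{i=1}^{6}w_i\cdot x_i}{\sum_{i=1}^{6}w_i} = \frac{\text{actual\_amount}}{\text{population}}$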

In [75]:
#use anonymous functions to change every value in a column!
AGI_grouped['weighted_mean_AGI']= AGI_grouped['weighted_mean_AGI'].apply(lambda x: round(x, 0))
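
The same rounding can also be done without apply; NumPy's round works element-wise on a Series:

# equivalent, vectorized
AGI_grouped['weighted_mean_AGI'] = np.round(AGI_grouped['weighted_mean_AGI'], 0)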

In [77]:
AGI_grouped.head()


Out[77]:
zipcode AGI_STUB population amount actual_amount weighted_mean_AGI
1 01001 21 8780 458716 458716000 52246
2 01002 21 9460 732849 732849000 77468
3 01005 21 2230 122744 122744000 55042
4 01007 21 7320 467891 467891000 63920
5 01008 21 640 38487 38487000 60136

In [79]:
# drop columns you won't need
AGI_grouped.drop(['AGI_STUB','amount','actual_amount'],axis=1,inplace=True)

In [81]:
AGI_grouped.head()


Out[81]:
zipcode population weighted_mean_AGI
1 01001 8780 52246
2 01002 9460 77468
3 01005 2230 55042
4 01007 7320 63920
5 01008 640 60136

Merging! Better than in traffic!


In [37]:
# also look into pandas.Series.unique
AGI_subset_geo = AGI[['zipcode','STATE']].drop_duplicates()
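
For reference, the Series.unique mentioned in the comment returns a NumPy array of distinct values (in order of first appearance) rather than a DataFrame:

AGI['STATE'].unique()   # array of distinct state codes (50 states plus DC in this file)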

In [82]:
AGI_subset_geo


Out[82]:
zipcode STATE
0 00000 AL
6 35004 AL
12 35005 AL
18 35006 AL
24 35007 AL
30 35010 AL
36 35014 AL
42 35016 AL
48 35019 AL
54 35020 AL
60 35022 AL
66 35023 AL
72 35031 AL
78 35033 AL
84 35034 AL
90 35035 AL
96 35040 AL
102 35042 AL
108 35043 AL
114 35044 AL
120 35045 AL
126 35046 AL
132 35049 AL
138 35051 AL
144 35053 AL
150 35054 AL
156 35055 AL
162 35057 AL
168 35058 AL
174 35061 AL
... ... ...
166724 82922 WY
166730 82923 WY
166736 82925 WY
166742 82930 WY
166748 82932 WY
166754 82933 WY
166760 82935 WY
166766 82937 WY
166772 82941 WY
166778 83001 WY
166784 83011 WY
166790 83012 WY
166796 83013 WY
166802 83014 WY
166808 83101 WY
166814 83110 WY
166820 83111 WY
166826 83112 WY
166832 83113 WY
166838 83114 WY
166844 83115 WY
166850 83118 WY
166856 83120 WY
166862 83122 WY
166868 83123 WY
166874 83126 WY
166880 83127 WY
166886 83128 WY
166892 83414 WY
166898 99999 WY

27820 rows × 2 columns


In [83]:
#merge rather than join if you want to use a common column other than the index
AGI_final = pd.merge(AGI_grouped, AGI_subset_geo, how='left', on='zipcode')
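
For comparison, DataFrame.join aligns on the index by default, so getting the same result with join means indexing the geo frame by zipcode first; a rough equivalent:

# join the caller's zipcode column against the other frame's index
AGI_final_via_join = AGI_grouped.join(AGI_subset_geo.set_index('zipcode'),
                                      on='zipcode', how='left')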

In [84]:
AGI_final.head()


Out[84]:
zipcode population weighted_mean_AGI STATE
0 01001 8780 52246 MA
1 01002 9460 77468 MA
2 01005 2230 55042 MA
3 01007 7320 63920 MA
4 01008 640 60136 MA

In [56]:
# this gives you the greatest weighted_mean_AGI first
AGI_final.sort('weighted_mean_AGI',ascending=False).head()


Out[56]:
zipcode population weighted_mean_AGI STATE
9031 33109 250 2694776 FL
26104 94027 3220 1464534 CA
4926 19035 2040 1052019 PA
2194 10005 5580 983554 NY
9168 33480 5580 966673 FL
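
A note for anyone following along with a newer pandas: DataFrame.sort was later deprecated in favor of sort_values, so the equivalent call would be roughly:

AGI_final.sort_values('weighted_mean_AGI', ascending=False).head()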

In [89]:
# chain methods!
AGI_final.groupby('STATE').mean().sort('weighted_mean_AGI',ascending=False)


Out[89]:
population weighted_mean_AGI
STATE
DC 139477.826087 101818.217391
CT 17525.992366 95944.801527
NJ 13011.868132 89198.278388
MA 12580.020747 86974.653527
CA 12839.494609 75530.843666
MD 14052.049383 73656.022222
NV 31566.615385 71164.100000
NY 7778.382924 68857.815653
ND 11232.543554 68087.386760
FL 12769.008715 65399.029412
RI 48424.285714 65349.700000
NH 15360.649351 64192.991342
WY 29240.462963 64139.157407
DE 59146.428571 63232.964286
CO 13168.686869 62932.449495
WA 12110.442656 61812.977867
TX 8598.743068 60810.509550
IL 7203.452769 60420.749186
VA 8309.132075 59568.041509
AK 58637.962963 58816.740741
MN 6894.545455 58649.895202
UT 21698.804348 57757.282609
AZ 18891.527778 56760.319444
NE 7706.639175 56243.342268
PA 6472.874909 55454.958303
HI 59265.254237 55327.932203
IA 5183.317191 55095.415254
SD 11328.965517 53825.644828
KS 6921.953642 53340.508278
WI 7870.336606 52714.803647
OK 8129.944954 52095.623853
LA 10651.504425 51541.876106
OR 13021.898017 51247.753541
OH 8307.505010 50755.814629
MI 8311.313131 50467.204265
VT 13467.805907 50113.248945
MT 14748.628319 49128.296460
IN 8683.679525 49115.709199
GA 10617.046477 48504.626687
NC 9685.696552 48415.451034
ID 16758.625592 47691.592417
ME 9470.948509 47603.149051
TN 9664.397284 46767.217317
SC 13080.294118 45617.171123
MO 6242.913386 45408.147357
AL 8463.923611 45086.321181
NM 17494.028436 44060.383886
WV 7027.126214 43301.866019
KY 7155.471125 42994.249240
AR 8252.953157 42244.458248
MS 11071.300813 40697.222222

In [88]:
AGI_final.sort('weighted_mean_AGI').head().plot(kind='bar')


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173f2c90>

References

Python for Data Analysis written by Wes McKinney, creator of pandas

The inimitable Julia Evans' pandas cookbook

Cyrille Rossant's books on IPython

Interesting notebooks