In [3]:
from __future__ import division, print_function

In [4]:
from IPython.display import HTML, Image, display as disp

In [5]:
with open("../css/css.css", "r") as f:
    style = f.read()
HTML(style)


Out[5]:

Whirlwind Introduction to Pandas!


About me: Phillip Cloud

Past:

  • MA Psychology from CUNY CCNY (analysis of rat spike train data)
  • Started using pandas/numpy for loading data from proprietary formats
  • Decided I should contribute

Now:

  • Data infrastructure/research/development @ neuromatters (neurotechnology R & D company)
  • Core contributer to pandas for a little over 1 year

In [6]:
!./shortlog.sh ../../pandas


  2985	Wes McKinney
  1149	jreback
   765	y-p
   607	Chang She
   323	Phillip Cloud
   315	Adam Klein
   120	Jeffrey Tratner
   109	Vytautas Jancauskas
    89	Joris Van den Bossche
    84	Andy Hayden

Why pandas?

pandas fills the need for a Python library that is awesome at data munging. With pandas you don't need to go outside of Python to do data analysis. You can write a web app using django and have data dashboards that are backed by pandas.

What is pandas?

Strengths:

pandas is the library for tabular you've always wished you had. It supports operations such as

  • groupby
  • merge
  • aggregations (mean, sum, any, all, etc.)
  • advanced indexing capabilities
  • automatic data alignment during e.g., arithmetic operations
  • powerful vectorized text manipulation
  • missing data handling
  • comprehensive and extremely flexible IO (e.g., CSV, Excel, HDF5, HTML)

Limitations:

  • pandas was designed to work with "medium" data, i.e., data that fit into memory.
  • That said, I routinely work with $\approx$ 20-100M row datasets fairly easily with pandas.

My reaction after discovering pandas:


In [7]:
disp(Image("../img/m.jpg"))



In [8]:
disp(Image("../img/u.jpg"))


First rule of pandas talks: they aren't complete without a meme.

Deps for the talk

Optional

  • mpltools (for nice looking plots by default)

Enough talking, let's see some code.


In [9]:
# imports that I'll use throughout the talk
import numpy as np
import pandas as pd
from pandas import DataFrame, Series, Index
from numpy.random import randn, randint, rand, choice
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10

try:
    from mpltools import style
    style.use('ggplot')
except ImportError:
    pass


# because of our bg color
plt.rc('text', color='white')
plt.rc('axes', labelcolor='white')
plt.rc('xtick', color='white')
plt.rc('ytick', color='white')

In [10]:
%matplotlib inline

DataFrame is the flagship data structure in pandas


In [11]:
df = pd.DataFrame(randn(10, 2), columns=list('ab'))
df


Out[11]:
a b
0 1.525911 -1.536062
1 -0.773639 0.786538
2 0.673406 0.336026
3 -0.778567 0.557999
4 0.932137 -1.204426
5 0.139991 0.406078
6 -0.687879 -0.497632
7 0.271935 -0.287440
8 0.240098 1.125080
9 0.235167 -0.532541

In [ ]:
# if you have boto installed and have set up credentials
# df = pd.read_csv('s3://nypug/tips.csv')

In [12]:
df = pd.read_csv('https://s3.amazonaws.com/nyqpug/tips.csv')
df


Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns


In [13]:
df.dtypes


Out[13]:
total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

One of pandas' most useful and powerful features is its ability to slice and dice data in almost any way you can think of.


In [14]:
# column access by name
df['day']  # ⟵ that's a Series object


Out[14]:
0    Sun
1    Sun
2    Sun
...
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object

In [15]:
# by attribute
df.time


Out[15]:
0    Dinner
1    Dinner
2    Dinner
...
241    Dinner
242    Dinner
243    Dinner
Name: time, Length: 244, dtype: object

In [16]:
df.time.value_counts()


Out[16]:
Dinner    176
Lunch      68
dtype: int64

In [17]:
# multiple columns
df[['tip', 'sex']]


Out[17]:
tip sex
0 1.01 Female
1 1.66 Male
2 3.50 Male
3 3.31 Male
4 3.61 Female
... ... ...
239 5.92 Male
240 2.00 Female
241 2.00 Male
242 1.75 Male
243 3.00 Female

244 rows × 2 columns


In [18]:
t = df.set_index('day')
t.head()


Out[18]:
total_bill tip sex smoker time size
day
Sun 16.99 1.01 Female No Dinner 2
Sun 10.34 1.66 Male No Dinner 3
Sun 21.01 3.50 Male No Dinner 3
Sun 23.68 3.31 Male No Dinner 2
Sun 24.59 3.61 Female No Dinner 4

In [ ]:
t.loc['Sun']

In [19]:
df.loc[df.day == 'Sun']


Out[19]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
186 20.90 3.50 Female Yes Sun Dinner 3
187 30.46 2.00 Male Yes Sun Dinner 5
188 18.15 3.50 Female Yes Sun Dinner 3
189 23.10 4.00 Male Yes Sun Dinner 3
190 15.69 1.50 Male Yes Sun Dinner 2

76 rows × 7 columns


In [21]:
df.loc[:, 'smoker']


Out[21]:
0    No
1    No
2    No
...
241    Yes
242     No
243     No
Name: smoker, Length: 244, dtype: object

In [22]:
df.iloc[:, 3]  # same as df.loc[:, 'smoker']


Out[22]:
0    No
1    No
2    No
...
241    Yes
242     No
243     No
Name: smoker, Length: 244, dtype: object

Add and delete columns as you please


In [23]:
df['pct_tip'] = df.tip / df.total_bill
df['avg_price'] = df.total_bill / df.size

In [24]:
df.avg_price.hist(bins=20)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x4ae7a50>

In [25]:
del df['avg_price']
del df['pct_tip']

In [26]:
# multiple columns, multiple rows
df.loc[[0, 2], ['sex', 'tip']]


Out[26]:
sex tip
0 Female 1.01
2 Male 3.50

In [27]:
df.loc[:10, ['total_bill', 'tip']]  # note this is inclusive


Out[27]:
total_bill tip
0 16.99 1.01
1 10.34 1.66
2 21.01 3.50
3 23.68 3.31
4 24.59 3.61
... ... ...
6 8.77 2.00
7 26.88 3.12
8 15.04 1.96
9 14.78 3.23
10 10.27 1.71

11 rows × 2 columns


In [28]:
# and with iloc
df.iloc[:5]  # exclusive endpoints


Out[28]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Applying Functions to Pandas Objects

Apply a function over the columns of a DataFrame


In [29]:
# np.ptp is peak-to-peak difference, i.e., range
df[['total_bill', 'tip', 'size']].apply(np.ptp)


Out[29]:
total_bill    47.74
tip            9.00
size           5.00
dtype: float64

In [31]:
days = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']

In [32]:
df.day.map(days.index)


Out[32]:
0    0
1    0
2    0
...
241    6
242    6
243    4
Name: day, Length: 244, dtype: int64

For example, the DataFrame.sum() method results in a Series object and by default sums across the rows. If you want to sum across the columns pass axis=1, similar to the numpy convention

In general, reduction operations result in a Series object.

Some example reduction functions:

  • mean, median, and mode
  • count
  • std, var

NOTE: These ignore NaNs


In [33]:
df.sum()  # whoa! + is defined for strings


Out[33]:
total_bill                                              4827.77
tip                                                      731.58
sex           FemaleMaleMaleMaleFemaleMaleMaleMaleMaleMaleMa...
smoker        NoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNo...
day           SunSunSunSunSunSunSunSunSunSunSunSunSunSunSunS...
time          DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinn...
size                                                        627
dtype: object

In [34]:
df.sum(numeric_only=True)


Out[34]:
total_bill    4827.77
tip            731.58
size           627.00
dtype: float64

In [37]:
df.count() / df.shape[0]  # we don't have any nans


Out[37]:
total_bill    1
tip           1
sex           1
smoker        1
day           1
time          1
size          1
dtype: float64

In [35]:
df.var()


Out[35]:
total_bill    79.252939
tip            1.914455
size           0.904591
dtype: float64

In [36]:
df.mean()


Out[36]:
total_bill    19.785943
tip            2.998279
size           2.569672
dtype: float64

pandas' GroupBy functionality allows you to perform operations on subsets of a DataFrame and then combines the results for you at the end


In [38]:
df.head()


Out[38]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

In [39]:
gb = df.groupby('sex')
gb.mean()


Out[39]:
total_bill tip size
sex
Female 18.056897 2.833448 2.459770
Male 20.744076 3.089618 2.630573

In [40]:
gb = df.groupby(['sex', 'smoker'])
gb.std()


Out[40]:
total_bill tip size
sex smoker
Female No 7.286455 1.128425 1.073146
Yes 9.189751 1.219916 0.613917
Male No 8.726566 1.489559 0.989094
Yes 9.911845 1.500120 0.892530

In [41]:
# can pass multiple reducers to agg
gb.agg(['mean', 'std', 'median'])


Out[41]:
total_bill tip size
mean std median mean std median mean std median
sex smoker
Female No 18.105185 7.286455 16.69 2.773519 1.128425 2.68 2.592593 1.073146 2
Yes 17.977879 9.189751 16.27 2.931515 1.219916 2.88 2.242424 0.613917 2
Male No 19.791237 8.726566 18.24 3.113402 1.489559 2.74 2.711340 0.989094 2
Yes 22.284500 9.911845 20.39 3.051167 1.500120 3.00 2.500000 0.892530 2

apply

GroupBy.apply() take a callable and calls it on each group


In [42]:
def stdize(x):
    return (x - x.mean()) / x.std()
    
df['tb_std'] = gb.total_bill.apply(stdize)
df


Out[42]:
total_bill tip sex smoker day time size tb_std
0 16.99 1.01 Female No Sun Dinner 2 -0.153049
1 10.34 1.66 Male No Sun Dinner 3 -1.083042
2 21.01 3.50 Male No Sun Dinner 3 0.139661
3 23.68 3.31 Male No Sun Dinner 2 0.445623
4 24.59 3.61 Female No Sun Dinner 4 0.889982
... ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 1.058694
240 27.18 2.00 Female Yes Sat Dinner 2 1.001346
241 22.67 2.00 Male Yes Sat Dinner 2 0.038893
242 17.82 1.75 Male No Sat Dinner 2 -0.225889
243 18.78 3.00 Female No Thur Dinner 2 0.092612

244 rows × 8 columns


In [43]:
ax = df.tb_std.plot(kind='kde', lw=3)
df.tb_std.hist(ax=ax, normed=True)

ax.set_xlabel('$z$-score', fontsize=20)
ax.set_title(r'Total Bill Stdized Across Sex$\times$ Smoker')

ax.axis('tight')
plt.gcf().tight_layout()


Often times you have data sets of different shapes with a common column that you want to join on

pandas does this a couple of ways, but the main entry point is pandas.merge


In [45]:
# create some frames
n = 2000
n2 = n // 2
visits = DataFrame({'page_visits_per_day': np.random.poisson(10, size=n),
                 'user_id': randint(9, size=n)})
likes = DataFrame({'likes_per_day': np.random.poisson(30, size=n2),
                   'user_id': randint(6, size=n2)})
visits


Out[45]:
page_visits_per_day user_id
0 6 4
1 12 1
2 13 5
3 7 1
4 6 7
... ... ...
1995 7 1
1996 8 5
1997 9 3
1998 10 6
1999 8 4

2000 rows × 2 columns


In [46]:
likes


Out[46]:
likes_per_day user_id
0 27 0
1 33 4
2 23 1
3 31 5
4 26 0
... ... ...
995 28 5
996 30 3
997 33 4
998 29 0
999 27 5

1000 rows × 2 columns


In [47]:
merg = pd.merge(visits, likes)
merg.sort('user_id')


Out[47]:
page_visits_per_day user_id likes_per_day
152516 15 0 22
174447 11 0 21
174446 11 0 25
174445 11 0 25
174444 11 0 26
... ... ... ...
85623 17 5 24
85624 17 5 35
85625 17 5 34
85645 17 5 25
73182 10 5 34

227153 rows × 3 columns


In [48]:
s = df.day
s


Out[48]:
0    Sun
1    Sun
2    Sun
...
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object

In [49]:
is_weekend = s.str.startswith('S')
is_weekend


Out[49]:
0    True
1    True
2    True
...
241     True
242     True
243    False
Name: day, Length: 244, dtype: bool

In [50]:
correct = s[is_weekend].str.contains(r'^(?:Sat|Sun)$')
correct.all()


Out[50]:
True

In [51]:
s.str.len()


Out[51]:
0    3
1    3
2    3
...
241    3
242    3
243    4
Name: day, Length: 244, dtype: int64

In [52]:
s.str[:2]


Out[52]:
0    Su
1    Su
2    Su
...
241    Sa
242    Sa
243    Th
Name: day, Length: 244, dtype: object

Construction


In [53]:
n = 10000
idx = pd.date_range(start='today', periods=n, freq='D')
idx


Out[53]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-05-28, ..., 2041-10-12]
Length: 10000, Freq: D, Timezone: None

In [55]:
s = Series(np.random.poisson(10, size=n), index=idx, name='login_count')
s


Out[55]:
2014-05-28     6
2014-05-29    12
2014-05-30    11
...
2041-10-10     6
2041-10-11    14
2041-10-12     8
Freq: D, Name: login_count, Length: 10000

resample


In [56]:
s.resample('W', how='sum')


Out[56]:
2014-06-01    46
2014-06-08    56
2014-06-15    74
...
2041-09-29    66
2041-10-06    77
2041-10-13    59
Freq: W-SUN, Name: login_count, Length: 1429

In [57]:
# multiple functions
rs = s.resample('W', how=['mean', 'std', 'count'])
rs


Out[57]:
mean std count
2014-06-01 9.200000 2.387467 5
2014-06-08 8.000000 2.768875 7
2014-06-15 10.571429 4.755949 7
2014-06-22 10.857143 3.132016 7
2014-06-29 9.285714 2.927700 7
... ... ... ...
2041-09-15 8.285714 2.984085 7
2041-09-22 10.000000 2.000000 7
2041-09-29 9.428571 3.154739 7
2041-10-06 11.000000 2.828427 7
2041-10-13 9.833333 2.857738 6

1429 rows × 3 columns


In [58]:
mu = rs['mean']
fig, ax = plt.subplots(figsize=(12, 6))
ax.step(mu.index[:50], mu.iloc[:50], lw=3, where='post')
fig.tight_layout()
ax.set_xlabel('Time')
ax.set_ylabel('Average Count / Week')
ax.set_title('Count vs. Time')


Out[58]:
<matplotlib.text.Text at 0x58f0550>

Basic IO


In [59]:
%%writefile tmp.csv
a,b,c
1,d,3.0
4,e,6.28
2,f,4.4


Overwriting tmp.csv

In [60]:
df = pd.read_csv('tmp.csv')
df


Out[60]:
a b c
0 1 d 3.00
1 4 e 6.28
2 2 f 4.40

In [61]:
df.dtypes


Out[61]:
a      int64
b     object
c    float64
dtype: object

In [62]:
df.to_csv('tmp.csv')

In [63]:
%cat tmp.csv # saves the index created by default


,a,b,c
0,1,d,3.0
1,4,e,6.28
2,2,f,4.4

In [64]:
df = DataFrame(dict(a=randn(10), b=randint(10, size=10),
                    c=choice(list('abc'), size=10)))
df


Out[64]:
a b c
0 0.311531 2 a
1 0.466764 3 b
2 -0.134696 9 a
3 -1.227535 4 c
4 -0.572367 8 a
5 0.330043 7 a
6 -0.684184 4 c
7 -1.097351 9 c
8 -0.694465 1 b
9 -1.376861 5 b

In [65]:
df.to_hdf('tmp.h5', 'df', format='table')

In [66]:
pd.read_hdf('tmp.h5', 'df')


Out[66]:
a b c
0 0.311531 2 a
1 0.466764 3 b
2 -0.134696 9 a
3 -1.227535 4 c
4 -0.572367 8 a
5 0.330043 7 a
6 -0.684184 4 c
7 -1.097351 9 c
8 -0.694465 1 b
9 -1.376861 5 b

In [68]:
!ptdump tmp.h5


/ (RootGroup) ''
/df (Group) ''
/df/table (Table(10,)) ''

Next Up: Features!