In [3]:

    
from __future__ import division, print_function



In [4]:

    
from IPython.display import HTML, Image, display as disp



In [5]:

    
with open("../css/css.css", "r") as f:
    style = f.read()
HTML(style)









    Out[5]:

Whirlwind Introduction to Pandas!

About me: Phillip Cloud

Past:

MA Psychology from CUNY CCNY (analysis of rat spike train data)
Started using pandas/numpy for loading data from proprietary formats
Decided I should contribute

Now:

Data infrastructure/research/development @ neuromatters (neurotechnology R & D company)
Core contributer to pandas for a little over 1 year



In [6]:

    
!./shortlog.sh ../../pandas









    



  2985	Wes McKinney
  1149	jreback
   765	y-p
   607	Chang She
   323	Phillip Cloud
   315	Adam Klein
   120	Jeffrey Tratner
   109	Vytautas Jancauskas
    89	Joris Van den Bossche
    84	Andy Hayden

Why pandas?

pandas fills the need for a Python library that is awesome at data munging. With pandas you don't need to go outside of Python to do data analysis. You can write a web app using django and have data dashboards that are backed by pandas.

What is pandas?

Strengths:

pandas is the library for tabular you've always wished you had. It supports operations such as

groupby
merge
aggregations (mean, sum, any, all, etc.)
advanced indexing capabilities
automatic data alignment during e.g., arithmetic operations
powerful vectorized text manipulation
missing data handling
comprehensive and extremely flexible IO (e.g., CSV, Excel, HDF5, HTML)

Limitations:

pandas was designed to work with "medium" data, i.e., data that fit into memory.
That said, I routinely work with $\approx$ 20-100M row datasets fairly easily with pandas.

My reaction after discovering pandas:



In [7]:

    
disp(Image("../img/m.jpg"))



In [8]:

    
disp(Image("../img/u.jpg"))

First rule of pandas talks: they aren't complete without a meme.

Deps for the talk

These notebooks
pandas with all optional deps (Anaconda provides most if not all of these)
sh: pip install sh
jq
requests because urllib et al are less than desirable

Optional

mpltools (for nice looking plots by default)

Enough talking, let's see some code.



In [9]:

    
# imports that I'll use throughout the talk
import numpy as np
import pandas as pd
from pandas import DataFrame, Series, Index
from numpy.random import randn, randint, rand, choice
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10

try:
    from mpltools import style
    style.use('ggplot')
except ImportError:
    pass


# because of our bg color
plt.rc('text', color='white')
plt.rc('axes', labelcolor='white')
plt.rc('xtick', color='white')
plt.rc('ytick', color='white')



In [10]:

    
%matplotlib inline

`pandas.DataFrame`

`DataFrame` is the flagship data structure in `pandas`



In [11]:

    
df = pd.DataFrame(randn(10, 2), columns=list('ab'))
df



In [ ]:

    
# if you have boto installed and have set up credentials
# df = pd.read_csv('s3://nypug/tips.csv')



In [12]:

    
df = pd.read_csv('https://s3.amazonaws.com/nyqpug/tips.csv')
df









    Out[12]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0  
       16.99
       1.01
       Female
        No
        Sun
       Dinner
       2
    
    
      1  
       10.34
       1.66
         Male
        No
        Sun
       Dinner
       3
    
    
      2  
       21.01
       3.50
         Male
        No
        Sun
       Dinner
       3
    
    
      3  
       23.68
       3.31
         Male
        No
        Sun
       Dinner
       2
    
    
      4  
       24.59
       3.61
       Female
        No
        Sun
       Dinner
       4
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      239
       29.03
       5.92
         Male
        No
        Sat
       Dinner
       3
    
    
      240
       27.18
       2.00
       Female
       Yes
        Sat
       Dinner
       2
    
    
      241
       22.67
       2.00
         Male
       Yes
        Sat
       Dinner
       2
    
    
      242
       17.82
       1.75
         Male
        No
        Sat
       Dinner
       2
    
    
      243
       18.78
       3.00
       Female
        No
       Thur
       Dinner
       2
    
  

244 rows × 7 columns



In [13]:

    
df.dtypes









    Out[13]:





total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

One of `pandas`' most useful and powerful features is its ability to slice and dice data in almost any way you can think of.



In [14]:

    
# column access by name
df['day']  # ⟵ that's a Series object









    Out[14]:





0    Sun
1    Sun
2    Sun
...
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object



In [15]:

    
# by attribute
df.time









    Out[15]:





0    Dinner
1    Dinner
2    Dinner
...
241    Dinner
242    Dinner
243    Dinner
Name: time, Length: 244, dtype: object



In [16]:

    
df.time.value_counts()









    Out[16]:





Dinner    176
Lunch      68
dtype: int64



In [17]:

    
# multiple columns
df[['tip', 'sex']]









    Out[17]:






  
    
      
      tip
      sex
    
  
  
    
      0  
       1.01
       Female
    
    
      1  
       1.66
         Male
    
    
      2  
       3.50
         Male
    
    
      3  
       3.31
         Male
    
    
      4  
       3.61
       Female
    
    
      ...
      ...
      ...
    
    
      239
       5.92
         Male
    
    
      240
       2.00
       Female
    
    
      241
       2.00
         Male
    
    
      242
       1.75
         Male
    
    
      243
       3.00
       Female
    
  

244 rows × 2 columns



In [18]:

    
t = df.set_index('day')
t.head()









    Out[18]:






  
    
      
      total_bill
      tip
      sex
      smoker
      time
      size
    
    
      day
      
      
      
      
      
      
    
  
  
    
      Sun
       16.99
       1.01
       Female
       No
       Dinner
       2
    
    
      Sun
       10.34
       1.66
         Male
       No
       Dinner
       3
    
    
      Sun
       21.01
       3.50
         Male
       No
       Dinner
       3
    
    
      Sun
       23.68
       3.31
         Male
       No
       Dinner
       2
    
    
      Sun
       24.59
       3.61
       Female
       No
       Dinner
       4



In [ ]:

    
t.loc['Sun']



In [19]:

    
df.loc[df.day == 'Sun']









    Out[19]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0  
       16.99
       1.01
       Female
        No
       Sun
       Dinner
       2
    
    
      1  
       10.34
       1.66
         Male
        No
       Sun
       Dinner
       3
    
    
      2  
       21.01
       3.50
         Male
        No
       Sun
       Dinner
       3
    
    
      3  
       23.68
       3.31
         Male
        No
       Sun
       Dinner
       2
    
    
      4  
       24.59
       3.61
       Female
        No
       Sun
       Dinner
       4
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      186
       20.90
       3.50
       Female
       Yes
       Sun
       Dinner
       3
    
    
      187
       30.46
       2.00
         Male
       Yes
       Sun
       Dinner
       5
    
    
      188
       18.15
       3.50
       Female
       Yes
       Sun
       Dinner
       3
    
    
      189
       23.10
       4.00
         Male
       Yes
       Sun
       Dinner
       3
    
    
      190
       15.69
       1.50
         Male
       Yes
       Sun
       Dinner
       2
    
  

76 rows × 7 columns



In [21]:

    
df.loc[:, 'smoker']









    Out[21]:





0    No
1    No
2    No
...
241    Yes
242     No
243     No
Name: smoker, Length: 244, dtype: object



In [22]:

    
df.iloc[:, 3]  # same as df.loc[:, 'smoker']









    Out[22]:





0    No
1    No
2    No
...
241    Yes
242     No
243     No
Name: smoker, Length: 244, dtype: object

Add and delete columns as you please



In [23]:

    
df['pct_tip'] = df.tip / df.total_bill
df['avg_price'] = df.total_bill / df.size



In [24]:

    
df.avg_price.hist(bins=20)









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x4ae7a50>



In [25]:

    
del df['avg_price']
del df['pct_tip']



In [26]:

    
# multiple columns, multiple rows
df.loc[[0, 2], ['sex', 'tip']]



In [27]:

    
df.loc[:10, ['total_bill', 'tip']]  # note this is inclusive









    Out[27]:






  
    
      
      total_bill
      tip
    
  
  
    
      0 
       16.99
       1.01
    
    
      1 
       10.34
       1.66
    
    
      2 
       21.01
       3.50
    
    
      3 
       23.68
       3.31
    
    
      4 
       24.59
       3.61
    
    
      ...
      ...
      ...
    
    
      6 
        8.77
       2.00
    
    
      7 
       26.88
       3.12
    
    
      8 
       15.04
       1.96
    
    
      9 
       14.78
       3.23
    
    
      10
       10.27
       1.71
    
  

11 rows × 2 columns



In [28]:

    
# and with iloc
df.iloc[:5]  # exclusive endpoints









    Out[28]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
       16.99
       1.01
       Female
       No
       Sun
       Dinner
       2
    
    
      1
       10.34
       1.66
         Male
       No
       Sun
       Dinner
       3
    
    
      2
       21.01
       3.50
         Male
       No
       Sun
       Dinner
       3
    
    
      3
       23.68
       3.31
         Male
       No
       Sun
       Dinner
       2
    
    
      4
       24.59
       3.61
       Female
       No
       Sun
       Dinner
       4

Applying Functions to Pandas Objects

Apply a function over the columns of a DataFrame



In [29]:

    
# np.ptp is peak-to-peak difference, i.e., range
df[['total_bill', 'tip', 'size']].apply(np.ptp)









    Out[29]:





total_bill    47.74
tip            9.00
size           5.00
dtype: float64



In [31]:

    
days = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat']



In [32]:

    
df.day.map(days.index)









    Out[32]:





0    0
1    0
2    0
...
241    6
242    6
243    4
Name: day, Length: 244, dtype: int64

Reductions

For example, the DataFrame.sum() method results in a Series object and by default sums across the rows. If you want to sum across the columns pass axis=1, similar to the numpy convention

In general, reduction operations result in a Series object.

Some example reduction functions:

mean, median, and mode
count
std, var

NOTE: These ignore NaNs



In [33]:

    
df.sum()  # whoa! + is defined for strings









    Out[33]:





total_bill                                              4827.77
tip                                                      731.58
sex           FemaleMaleMaleMaleFemaleMaleMaleMaleMaleMaleMa...
smoker        NoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNoNo...
day           SunSunSunSunSunSunSunSunSunSunSunSunSunSunSunS...
time          DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinn...
size                                                        627
dtype: object



In [34]:

    
df.sum(numeric_only=True)









    Out[34]:





total_bill    4827.77
tip            731.58
size           627.00
dtype: float64



In [37]:

    
df.count() / df.shape[0]  # we don't have any nans









    Out[37]:





total_bill    1
tip           1
sex           1
smoker        1
day           1
time          1
size          1
dtype: float64



In [35]:

    
df.var()









    Out[35]:





total_bill    79.252939
tip            1.914455
size           0.904591
dtype: float64



In [36]:

    
df.mean()









    Out[36]:





total_bill    19.785943
tip            2.998279
size           2.569672
dtype: float64

`GroupBy`

pandas' GroupBy functionality allows you to perform operations on subsets of a DataFrame and then combines the results for you at the end



In [38]:

    
df.head()









    Out[38]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
    
  
  
    
      0
       16.99
       1.01
       Female
       No
       Sun
       Dinner
       2
    
    
      1
       10.34
       1.66
         Male
       No
       Sun
       Dinner
       3
    
    
      2
       21.01
       3.50
         Male
       No
       Sun
       Dinner
       3
    
    
      3
       23.68
       3.31
         Male
       No
       Sun
       Dinner
       2
    
    
      4
       24.59
       3.61
       Female
       No
       Sun
       Dinner
       4



In [39]:

    
gb = df.groupby('sex')
gb.mean()



In [40]:

    
gb = df.groupby(['sex', 'smoker'])
gb.std()



In [41]:

    
# can pass multiple reducers to agg
gb.agg(['mean', 'std', 'median'])

`apply`

GroupBy.apply() take a callable and calls it on each group



In [42]:

    
def stdize(x):
    return (x - x.mean()) / x.std()
    
df['tb_std'] = gb.total_bill.apply(stdize)
df









    Out[42]:






  
    
      
      total_bill
      tip
      sex
      smoker
      day
      time
      size
      tb_std
    
  
  
    
      0  
       16.99
       1.01
       Female
        No
        Sun
       Dinner
       2
      -0.153049
    
    
      1  
       10.34
       1.66
         Male
        No
        Sun
       Dinner
       3
      -1.083042
    
    
      2  
       21.01
       3.50
         Male
        No
        Sun
       Dinner
       3
       0.139661
    
    
      3  
       23.68
       3.31
         Male
        No
        Sun
       Dinner
       2
       0.445623
    
    
      4  
       24.59
       3.61
       Female
        No
        Sun
       Dinner
       4
       0.889982
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      239
       29.03
       5.92
         Male
        No
        Sat
       Dinner
       3
       1.058694
    
    
      240
       27.18
       2.00
       Female
       Yes
        Sat
       Dinner
       2
       1.001346
    
    
      241
       22.67
       2.00
         Male
       Yes
        Sat
       Dinner
       2
       0.038893
    
    
      242
       17.82
       1.75
         Male
        No
        Sat
       Dinner
       2
      -0.225889
    
    
      243
       18.78
       3.00
       Female
        No
       Thur
       Dinner
       2
       0.092612
    
  

244 rows × 8 columns



In [43]:

    
ax = df.tb_std.plot(kind='kde', lw=3)
df.tb_std.hist(ax=ax, normed=True)

ax.set_xlabel('$z$-score', fontsize=20)
ax.set_title(r'Total Bill Stdized Across Sex$\times$ Smoker')

ax.axis('tight')
plt.gcf().tight_layout()

`merge`

Often times you have data sets of different shapes with a common column that you want to join on

pandas does this a couple of ways, but the main entry point is pandas.merge



In [45]:

    
# create some frames
n = 2000
n2 = n // 2
visits = DataFrame({'page_visits_per_day': np.random.poisson(10, size=n),
                 'user_id': randint(9, size=n)})
likes = DataFrame({'likes_per_day': np.random.poisson(30, size=n2),
                   'user_id': randint(6, size=n2)})
visits









    Out[45]:






  
    
      
      page_visits_per_day
      user_id
    
  
  
    
      0   
        6
       4
    
    
      1   
       12
       1
    
    
      2   
       13
       5
    
    
      3   
        7
       1
    
    
      4   
        6
       7
    
    
      ...
      ...
      ...
    
    
      1995
        7
       1
    
    
      1996
        8
       5
    
    
      1997
        9
       3
    
    
      1998
       10
       6
    
    
      1999
        8
       4
    
  

2000 rows × 2 columns



In [46]:

    
likes









    Out[46]:






  
    
      
      likes_per_day
      user_id
    
  
  
    
      0  
       27
       0
    
    
      1  
       33
       4
    
    
      2  
       23
       1
    
    
      3  
       31
       5
    
    
      4  
       26
       0
    
    
      ...
      ...
      ...
    
    
      995
       28
       5
    
    
      996
       30
       3
    
    
      997
       33
       4
    
    
      998
       29
       0
    
    
      999
       27
       5
    
  

1000 rows × 2 columns



In [47]:

    
merg = pd.merge(visits, likes)
merg.sort('user_id')









    Out[47]:






  
    
      
      page_visits_per_day
      user_id
      likes_per_day
    
  
  
    
      152516
       15
       0
       22
    
    
      174447
       11
       0
       21
    
    
      174446
       11
       0
       25
    
    
      174445
       11
       0
       25
    
    
      174444
       11
       0
       26
    
    
      ...
      ...
      ...
      ...
    
    
      85623 
       17
       5
       24
    
    
      85624 
       17
       5
       35
    
    
      85625 
       17
       5
       34
    
    
      85645 
       17
       5
       25
    
    
      73182 
       10
       5
       34
    
  

227153 rows × 3 columns

String Manipulation



In [48]:

    
s = df.day
s









    Out[48]:





0    Sun
1    Sun
2    Sun
...
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object



In [49]:

    
is_weekend = s.str.startswith('S')
is_weekend









    Out[49]:





0    True
1    True
2    True
...
241     True
242     True
243    False
Name: day, Length: 244, dtype: bool



In [50]:

    
correct = s[is_weekend].str.contains(r'^(?:Sat|Sun)$')
correct.all()









    Out[50]:





True



In [51]:

    
s.str.len()









    Out[51]:





0    3
1    3
2    3
...
241    3
242    3
243    4
Name: day, Length: 244, dtype: int64



In [52]:

    
s.str[:2]









    Out[52]:





0    Su
1    Su
2    Su
...
241    Sa
242    Sa
243    Th
Name: day, Length: 244, dtype: object

Time Series Functionality

Construction



In [53]:

    
n = 10000
idx = pd.date_range(start='today', periods=n, freq='D')
idx









    Out[53]:





<class 'pandas.tseries.index.DatetimeIndex'>
[2014-05-28, ..., 2041-10-12]
Length: 10000, Freq: D, Timezone: None



In [55]:

    
s = Series(np.random.poisson(10, size=n), index=idx, name='login_count')
s









    Out[55]:





2014-05-28     6
2014-05-29    12
2014-05-30    11
...
2041-10-10     6
2041-10-11    14
2041-10-12     8
Freq: D, Name: login_count, Length: 10000

`resample`



In [56]:

    
s.resample('W', how='sum')









    Out[56]:





2014-06-01    46
2014-06-08    56
2014-06-15    74
...
2041-09-29    66
2041-10-06    77
2041-10-13    59
Freq: W-SUN, Name: login_count, Length: 1429



In [57]:

    
# multiple functions
rs = s.resample('W', how=['mean', 'std', 'count'])
rs









    Out[57]:






  
    
      
      mean
      std
      count
    
  
  
    
      2014-06-01
        9.200000
       2.387467
       5
    
    
      2014-06-08
        8.000000
       2.768875
       7
    
    
      2014-06-15
       10.571429
       4.755949
       7
    
    
      2014-06-22
       10.857143
       3.132016
       7
    
    
      2014-06-29
        9.285714
       2.927700
       7
    
    
      ...
      ...
      ...
      ...
    
    
      2041-09-15
        8.285714
       2.984085
       7
    
    
      2041-09-22
       10.000000
       2.000000
       7
    
    
      2041-09-29
        9.428571
       3.154739
       7
    
    
      2041-10-06
       11.000000
       2.828427
       7
    
    
      2041-10-13
        9.833333
       2.857738
       6
    
  

1429 rows × 3 columns

Plotting



In [58]:

    
mu = rs['mean']
fig, ax = plt.subplots(figsize=(12, 6))
ax.step(mu.index[:50], mu.iloc[:50], lw=3, where='post')
fig.tight_layout()
ax.set_xlabel('Time')
ax.set_ylabel('Average Count / Week')
ax.set_title('Count vs. Time')









    Out[58]:





<matplotlib.text.Text at 0x58f0550>

Basic IO

CSV: `read_csv`/`to_csv`



In [59]:

    
%%writefile tmp.csv
a,b,c
1,d,3.0
4,e,6.28
2,f,4.4









    



Overwriting tmp.csv



In [60]:

    
df = pd.read_csv('tmp.csv')
df



In [61]:

    
df.dtypes









    Out[61]:





a      int64
b     object
c    float64
dtype: object



In [62]:

    
df.to_csv('tmp.csv')



In [63]:

    
%cat tmp.csv # saves the index created by default









    



,a,b,c
0,1,d,3.0
1,4,e,6.28
2,2,f,4.4

HDF5: `read_hdf`/`to_hdf`



In [64]:

    
df = DataFrame(dict(a=randn(10), b=randint(10, size=10),
                    c=choice(list('abc'), size=10)))
df



In [65]:

    
df.to_hdf('tmp.h5', 'df', format='table')



In [66]:

    
pd.read_hdf('tmp.h5', 'df')



In [68]:

    
!ptdump tmp.h5









    



/ (RootGroup) ''
/df (Group) ''
/df/table (Table(10,)) ''

	a	b
0	1.525911	-1.536062
1	-0.773639	0.786538
2	0.673406	0.336026
3	-0.778567	0.557999
4	0.932137	-1.204426
5	0.139991	0.406078
6	-0.687879	-0.497632
7	0.271935	-0.287440
8	0.240098	1.125080
9	0.235167	-0.532541

		total_bill			tip			size
		mean	std	median	mean	std	median	mean	std	median
sex	smoker
Female	No	18.105185	7.286455	16.69	2.773519	1.128425	2.68	2.592593	1.073146	2
Female	Yes	17.977879	9.189751	16.27	2.931515	1.219916	2.88	2.242424	0.613917	2
Male	No	19.791237	8.726566	18.24	3.113402	1.489559	2.74	2.711340	0.989094	2
Male	Yes	22.284500	9.911845	20.39	3.051167	1.500120	3.00	2.500000	0.892530	2

	a	b	c
0	0.311531	2	a
1	0.466764	3	b
2	-0.134696	9	a
3	-1.227535	4	c
4	-0.572367	8	a
5	0.330043	7	a
6	-0.684184	4	c
7	-1.097351	9	c
8	-0.694465	1	b
9	-1.376861	5	b

	a	b	c
0	0.311531	2	a
1	0.466764	3	b
2	-0.134696	9	a
3	-1.227535	4	c
4	-0.572367	8	a
5	0.330043	7	a
6	-0.684184	4	c
7	-1.097351	9	c
8	-0.694465	1	b
9	-1.376861	5	b

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

	total_bill	tip	size
sex
Female	18.056897	2.833448	2.459770
Male	20.744076	3.089618	2.630573

	page_visits_per_day	user_id	likes_per_day
152516	15	0	22
174447	11	0	21
174446	11	0	25
174445	11	0	25
174444	11	0	26
...	...	...	...
85623	17	5	24
85624	17	5	35
85625	17	5	34
85645	17	5	25
73182	10	5	34

	mean	std	count
2014-06-01	9.200000	2.387467	5
2014-06-08	8.000000	2.768875	7
2014-06-15	10.571429	4.755949	7
2014-06-22	10.857143	3.132016	7
2014-06-29	9.285714	2.927700	7
...	...	...	...
2041-09-15	8.285714	2.984085	7
2041-09-22	10.000000	2.000000	7
2041-09-29	9.428571	3.154739	7
2041-10-06	11.000000	2.828427	7
2041-10-13	9.833333	2.857738	6