In [1]:
# pandas is commonly imported as `pd`
import pandas as pd
import numpy as np

1-D Data: Series


In [2]:
s = pd.Series([4, 7, -5, 3])
s


Out[2]:
0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
s.index


Out[3]:
RangeIndex(start=0, stop=4, step=1)

In [4]:
s.values


Out[4]:
array([ 4,  7, -5,  3])

In [5]:
s[1:3]


Out[5]:
1    7
2   -5
dtype: int64

In [6]:
s2 = s**2
s2


Out[6]:
0    16
1    49
2    25
3     9
dtype: int64

In [7]:
s2+s


Out[7]:
0    20
1    56
2    20
3    12
dtype: int64

In [8]:
print(np.sum(s))
print(np.mean(s))
print(np.std(s))


9
2.25
4.43705983732
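
The same statistics are also available as methods on the Series itself. One caveat: pandas' std() computes the sample standard deviation (ddof=1), while np.std() defaults to the population version (ddof=0), so the two disagree slightly. A quick sketch:

# Same statistics as Series methods; note that std() differs from
# np.std() above because pandas uses ddof=1 by default.
print(s.sum())   # 9
print(s.mean())  # 2.25
print(s.std())   # ~5.12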

In [9]:
s3 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
s3


Out[9]:
d    4
b    7
a   -5
c    3
dtype: int64
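
Because this Series has a string index, we can look values up by label, much like a dictionary, and select several labels at once with a list:

# A single label, and a list of labels
print(s3['b'])
s3[['a', 'c']]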

In [10]:
# !!! the integer index of s2 and the string index of s3 share no labels,
# so pandas aligns on the index and every entry becomes NaN
s2+s3


/home/patricia/dev/tools/anaconda3/envs/playground/lib/python3.6/site-packages/pandas/core/indexes/base.py:2984: RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
  return this.join(other, how=how, return_indexers=return_indexers)
Out[10]:
0   NaN
1   NaN
2   NaN
3   NaN
d   NaN
b   NaN
a   NaN
c   NaN
dtype: float64

In [11]:
s4 = pd.Series({'d':10, 'b':12, 'a':3, 'c':9})
s4


Out[11]:
a     3
b    12
c     9
d    10
dtype: int64

In [12]:
s3+s4


Out[12]:
a    -2
b    19
c    12
d    14
dtype: int64
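
Here the result has no NaNs because every label appears in both Series; pandas aligned the values by label, not by position. If the indexes only partially overlapped, the unmatched labels would come out as NaN just as before. The add method with fill_value treats missing labels as a default instead; a small sketch (s5 is a made-up Series, not part of this data):

# s5 only shares labels 'a' and 'b' with s3
s5 = pd.Series({'a': 1, 'b': 2})

print(s3 + s5)                   # 'c' and 'd' come out as NaN
print(s3.add(s5, fill_value=0))  # 'c' and 'd' keep their values from s3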

2-D Data: DataFrames

The most important feature pandas gives us access to is the DataFrame. DataFrames are two-dimensional structures that you can think of very much like a spreadsheet with named columns and rows. In fact, pandas can read CSV and Excel files directly into DataFrames.

First, let's read a CSV containing population information of the States of the USA.


In [13]:
# Create a DataFrame, `df`, from the csv located in `data/population.csv`
df = pd.read_csv('../../data/population.csv')

To see what our data looks like, we can use df.head(n), which shows the first n rows (n=5 by default):


In [14]:
df.head()


Out[14]:
            Id  Id2   Geography      2010      2011      2012      2013      2014      2015
0  0400000US01    1     Alabama   4785161   4801108   4816089   4830533   4846411   4858979
1  0400000US02    2      Alaska    714021    722720    731228    737442    737046    738432
2  0400000US04    4     Arizona   6408208   6468732   6553262   6630799   6728783   6828065
3  0400000US05    5    Arkansas   2922394   2938538   2949499   2957957   2966835   2978204
4  0400000US06    6  California  37334079  37700034  38056055  38414128  38792291  39144818
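
The complementary df.tail(n) shows the last n rows, which is handy for checking the end of the file:

# The last 3 rows of the DataFrame
df.tail(3)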

We see that for each state we have two IDs followed by six columns of years; in this case, each column holds the population of that state in the given year.

We can access columns by referencing the column name, just like a Python dictionary:


In [15]:
df['2010'].head()


Out[15]:
0     4785161
1      714021
2     6408208
3     2922394
4    37334079
Name: 2010, dtype: int64

We can get multiple columns at once by passing a list of column names:


In [16]:
df[['2010', '2011']].head()


Out[16]:
       2010      2011
0   4785161   4801108
1    714021    722720
2   6408208   6468732
3   2922394   2938538
4  37334079  37700034

And then we can access groups of rows using a range of row IDs:


In [17]:
df[5:10]


Out[17]:
            Id  Id2             Geography      2010      2011      2012      2013      2014      2015
5  0400000US08    8              Colorado   5048254   5119480   5191731   5271132   5355588   5456574
6  0400000US09    9           Connecticut   3579717   3589759   3593541   3597168   3594762   3590886
7  0400000US10   10              Delaware    899791    907916    917099    925353    935968    945934
8  0400000US11   11  District of Columbia    605126    620472    635342    649540    659836    672228
9  0400000US12   12               Florida  18849890  19105533  19352021  19594467  19905569  20271272

Accessing individual rows is different: you can't just use df[i]. Instead, you need to use df.loc:


In [18]:
df.loc[20]


Out[18]:
Id           0400000US24
Id2                   24
Geography       Maryland
2010             5788409
2011             5844171
2012             5890740
2013             5936040
2014             5975346
2015             6006401
Name: 20, dtype: object

In [19]:
type(df.loc[20])


Out[19]:
pandas.core.series.Series
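
df.loc selects rows by index label; its positional counterpart is df.iloc. Both also accept a column selection as a second argument. A couple of sketches (with the default RangeIndex, labels and positions happen to coincide):

# The row at position 20 (the same row as df.loc[20] here)
df.iloc[20]

# Rows with labels 5 through 9 (label slices are inclusive), two columns only
df.loc[5:9, ['Geography', '2010']]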

Pandas gives us an easy way to get summary statistics of our data:


In [20]:
df.describe()


Out[20]:
             Id2          2010          2011          2012          2013          2014          2015
count  52.000000  5.200000e+01  5.200000e+01  5.200000e+01  5.200000e+01  5.200000e+01  5.200000e+01
mean   29.788462  6.020546e+06  6.065338e+06  6.110329e+06  6.154240e+06  6.200813e+06  6.247942e+06
std    16.774557  6.780109e+06  6.851091e+06  6.920737e+06  6.988942e+06  7.064993e+06  7.140930e+06
min     1.000000  5.645160e+05  5.677680e+05  5.770800e+05  5.831310e+05  5.843040e+05  5.861070e+05
25%    16.750000  1.765265e+06  1.777821e+06  1.791004e+06  1.792935e+06  1.795265e+06  1.796828e+06
50%    29.500000  4.092954e+06  4.118196e+06  4.141056e+06  4.163265e+06  4.191910e+06  4.227034e+06
75%    42.500000  6.609542e+06  6.664655e+06  6.717658e+06  6.774928e+06  6.832134e+06  6.913636e+06
max    72.000000  3.733408e+07  3.770003e+07  3.805606e+07  3.841413e+07  3.879229e+07  3.914482e+07

One thing you might notice is that describe only lists the numeric columns, and Id2 is included among them even though it would be better treated as a string. pandas tries to guess the datatype of each column; in this case, all of the values in Id2 are integers, so the column gets treated as an integer column.

We can see the datatype details:


In [21]:
df.dtypes


Out[21]:
Id           object
Id2           int64
Geography    object
2010          int64
2011          int64
2012          int64
2013          int64
2014          int64
2015          int64
dtype: object

We can cast Id2 to a string using astype and then overwrite the original column:


In [22]:
df['Id2'] = df['Id2'].astype(str)

df.dtypes


Out[22]:
Id           object
Id2          object
Geography    object
2010          int64
2011          int64
2012          int64
2013          int64
2014          int64
2015          int64
dtype: object

Or we could have specified the data type when we originally read the CSV:


In [23]:
# Pass a dictionary to the dtype parameter with `'column': dtype`
df = pd.read_csv('../../data/population.csv', dtype={'Id2': str})

df.dtypes


Out[23]:
Id           object
Id2          object
Geography    object
2010          int64
2011          int64
2012          int64
2013          int64
2014          int64
2015          int64
dtype: object

Operations and Filtering

Each column of data behaves very much like a normal numpy array and thus can be used for mathematical operations. For example, to get the population change from 2014 to 2015 for each state:


In [24]:
df['2015'] - df['2014']


Out[24]:
0      12568
1       1386
2      99282
3      11369
4     352527
5     100986
6      -3876
7       9966
8      12392
9     365703
10    117728
11     11346
12     20124
13    -22194
14     21800
15     14418
16      9134
17     12475
18     21734
19      -928
20     31055
21     39298
22      6270
23     32469
24     -1110
25     19845
26      9697
27     13210
28     52564
29      2612
30     19169
31      -458
32     46933
33    102415
34     16887
35     16425
36     31728
37     57775
38      8736
39      1391
40     66986
41      5165
42     52520
43    490036
44     51421
45      -725
46     54895
47    107185
48     -4623
49     11905
50      1803
51    -60706
dtype: int64

Rather than continually recomputing that value, we can save it as a new column in the DataFrame. Let's make a new column called 'pop_change':


In [25]:
df['pop_change'] = df['2015'] - df['2014']

df.head()


Out[25]:
            Id  Id2   Geography      2010      2011      2012      2013      2014      2015  pop_change
0  0400000US01    1     Alabama   4785161   4801108   4816089   4830533   4846411   4858979       12568
1  0400000US02    2      Alaska    714021    722720    731228    737442    737046    738432        1386
2  0400000US04    4     Arizona   6408208   6468732   6553262   6630799   6728783   6828065       99282
3  0400000US05    5    Arkansas   2922394   2938538   2949499   2957957   2966835   2978204       11369
4  0400000US06    6  California  37334079  37700034  38056055  38414128  38792291  39144818      352527

Just like numpy, we can also do element-wise comparisons. For example, to find out whether a state's population decreased over that year:


In [26]:
df['pop_change'] < 0


Out[26]:
0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
19     True
20    False
21    False
22    False
23    False
24     True
25    False
26    False
27    False
28    False
29    False
30    False
31     True
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45     True
46    False
47    False
48     True
49    False
50    False
51     True
Name: pop_change, dtype: bool

The True values are the states with a negative population change (a decrease). That boolean array by itself isn't very useful, but we can use it as an index to filter our DataFrame:


In [27]:
df[df['pop_change'] < 0]


Out[27]:
             Id  Id2      Geography      2010      2011      2012      2013      2014      2015  pop_change
 6  0400000US09    9    Connecticut   3579717   3589759   3593541   3597168   3594762   3590886       -3876
13  0400000US17   17       Illinois  12841249  12861882  12875167  12889580  12882189  12859995      -22194
19  0400000US23   23          Maine   1327695   1328257   1328888   1328778   1330256   1329328        -928
24  0400000US28   28    Mississippi   2970316   2977999   2985660   2990976   2993443   2992333       -1110
31  0400000US35   35     New Mexico   2064741   2078226   2084792   2086890   2085567   2085109        -458
45  0400000US50   50        Vermont    625984    626687    626398    627129    626767    626042        -725
48  0400000US54   54  West Virginia   1854225   1854948   1856283   1852985   1848751   1844128       -4623
51  0400000US72   72    Puerto Rico   3721526   3678736   3634487   3593079   3534888   3474182      -60706

Now we have a subset of the DataFrame containing only the states with decreasing populations.
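
Conditions can also be combined with & (and) and | (or), wrapping each comparison in parentheses. For example, with an arbitrary population cutoff:

# States that lost population despite having over 3 million people in 2015
df[(df['pop_change'] < 0) & (df['2015'] > 3000000)]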

Statistical Operations


In [28]:
print(df.mean())
print(df.std())
print(df.max())
print(df.sum())


2010          6.020546e+06
2011          6.065338e+06
2012          6.110329e+06
2013          6.154240e+06
2014          6.200813e+06
2015          6.247942e+06
pop_change    4.712910e+04
dtype: float64
2010          6.780109e+06
2011          6.851091e+06
2012          6.920737e+06
2013          6.988942e+06
2014          7.064993e+06
2015          7.140930e+06
pop_change    9.629267e+04
dtype: float64
Id            0400000US72
Id2                     9
Geography         Wyoming
2010             37334079
2011             37700034
2012             38056055
2013             38414128
2014             38792291
2015             39144818
pop_change         490036
dtype: object
Id            0400000US010400000US020400000US040400000US0504...
Id2           1245689101112131516171819202122232425262728293...
Geography     AlabamaAlaskaArizonaArkansasCaliforniaColorado...
2010                                                  313068389
2011                                                  315397593
2012                                                  317737110
2013                                                  320020474
2014                                                  322442289
2015                                                  324893002
pop_change                                              2450713
dtype: object
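
Note what sum() did to the object columns: "adding" strings concatenates them, which is rarely useful. If we only care about the numeric columns, we can select them first:

# Restrict the statistics to the numeric columns
df.select_dtypes(include='number').sum()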

In [29]:
# or over single columns
print(df[['2010','2011']].mean())


2010    6.020546e+06
2011    6.065338e+06
dtype: float64

Merging DataFrames

If you have data spread across multiple files, it can be joined as long as there is a common column. To start, let's read in a CSV containing the number of housing units in each state for each year from 2010 to 2015.


In [30]:
housing = pd.read_csv('../../data/housing.csv', dtype={'Id2': str})

housing.head()


Out[30]:
            Id  Id2   Geography      2010      2011      2012      2013      2014      2015
0  0400000US01    1     Alabama   2174266   2181371   2189738   2199413   2207849   2218287
1  0400000US02    2      Alaska    307059    307377    307664    308089    308578    309448
2  0400000US04    4     Arizona   2848293   2857849   2870144   2887506   2907820   2929030
3  0400000US05    5    Arkansas   1317756   1323130   1328315   1334978   1341017   1347528
4  0400000US06    6  California  13688185  13728585  13771494  13828412  13907648  13987625

Since the Id column is shared, it can easily be merged with our original DataFrame:


In [31]:
merged = df.merge(housing, on='Id')

merged.head()


Out[31]:
            Id  Id2_x  Geography_x    2010_x    2011_x    2012_x    2013_x    2014_x    2015_x  pop_change  Id2_y  Geography_y    2010_y    2011_y    2012_y    2013_y    2014_y    2015_y
0  0400000US01      1      Alabama   4785161   4801108   4816089   4830533   4846411   4858979       12568      1      Alabama   2174266   2181371   2189738   2199413   2207849   2218287
1  0400000US02      2       Alaska    714021    722720    731228    737442    737046    738432        1386      2       Alaska    307059    307377    307664    308089    308578    309448
2  0400000US04      4      Arizona   6408208   6468732   6553262   6630799   6728783   6828065       99282      4      Arizona   2848293   2857849   2870144   2887506   2907820   2929030
3  0400000US05      5     Arkansas   2922394   2938538   2949499   2957957   2966835   2978204       11369      5     Arkansas   1317756   1323130   1328315   1334978   1341017   1347528
4  0400000US06      6   California  37334079  37700034  38056055  38414128  38792291  39144818      352527      6   California  13688185  13728585  13771494  13828412  13907648  13987625

Since the other column names are all shared, pandas appends '_x' and '_y' to the columns from the left and right DataFrames, respectively.

This isn't very user-friendly, so we can use the suffixes parameter to specify custom labels to append. Furthermore, we can also include Id2 and Geography in on so those columns aren't duplicated.


In [32]:
merged = df.merge(housing, on=['Id', 'Id2', 'Geography'], suffixes=('_population', '_housing'))

merged.head()


Out[32]:
            Id  Id2   Geography  2010_population  2011_population  2012_population  2013_population  2014_population  2015_population  pop_change  2010_housing  2011_housing  2012_housing  2013_housing  2014_housing  2015_housing
0  0400000US01    1     Alabama          4785161          4801108          4816089          4830533          4846411          4858979       12568       2174266       2181371       2189738       2199413       2207849       2218287
1  0400000US02    2      Alaska           714021           722720           731228           737442           737046           738432        1386        307059        307377        307664        308089        308578        309448
2  0400000US04    4     Arizona          6408208          6468732          6553262          6630799          6728783          6828065       99282       2848293       2857849       2870144       2887506       2907820       2929030
3  0400000US05    5    Arkansas          2922394          2938538          2949499          2957957          2966835          2978204       11369       1317756       1323130       1328315       1334978       1341017       1347528
4  0400000US06    6  California         37334079         37700034         38056055         38414128         38792291         39144818      352527      13688185      13728585      13771494      13828412      13907648      13987625

Notice that the merge dropped one row. That is because the housing dataset doesn't contain data for Puerto Rico.


In [33]:
print('Population:', len(df), 'Merged:', len(merged))


Population: 52 Merged: 51
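
By default, merge performs an inner join, keeping only rows whose key appears in both DataFrames. If we wanted to keep Puerto Rico (with NaN in the housing columns), we could ask for a left join instead:

# Keep every row of `df`; unmatched housing values become NaN
merged_left = df.merge(housing, on=['Id', 'Id2', 'Geography'],
                       suffixes=('_population', '_housing'), how='left')
len(merged_left)  # 52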

Now we can do something involving both datasets. For example, finding the ratio of people to houses:


In [34]:
merged['ratio'] = merged['2015_population']/merged['2015_housing']

merged['ratio'].head()


Out[34]:
0    2.190419
1    2.386288
2    2.331169
3    2.210124
4    2.798532
Name: ratio, dtype: float64

Now let's use sort_values to view the states with the lowest ratio of people to houses and view just the state name and ratio columns:


In [35]:
# Sort the data by ratio
merged_sorted = merged.sort_values(by=['ratio'])

# Just get the `Geography` and `ratio` columns
merged_subset = merged_sorted[['Geography', 'ratio']]

# View the first 5
merged_subset.head()


Out[35]:
        Geography     ratio
19          Maine  1.822463
45        Vermont  1.915122
48  West Virginia  2.082643
34   North Dakota  2.085428
26        Montana  2.090152

And to view the top 5, use ascending=False:


In [36]:
merged.sort_values(by=['ratio'], ascending=False)[['Geography', 'ratio']].head()


Out[36]:
     Geography     ratio
44        Utah  2.886233
 4  California  2.798532
11      Hawaii  2.688684
43       Texas  2.594424
30  New Jersey  2.492766

Grouping Rows by Value

Sometimes you'd like to aggregate groups of similar rows. For instance, let's compare the change in housing stock between the states with decreasing population to those with increasing population.

First, let's make a column for the housing change, and a boolean column recording whether the population is increasing.


In [37]:
merged['housing_change'] = merged['2015_housing'] - merged['2014_housing']
merged['pop_change_increasing'] = merged['pop_change'] > 0

Then use groupby to group our rows by whether their population increased or decreased from 2014 to 2015:


In [38]:
grouped = merged.groupby('pop_change_increasing')

Then we can run aggregate functions on our groups, or use describe to get the same summary statistics we computed before:


In [39]:
grouped.describe()['housing_change']


Out[39]:
                       count          mean           std    min     25%      50%       75%       max
pop_change_increasing
False                    7.0   3615.571429   3433.576069  886.0  1344.0   2406.0   4546.00   10237.0
True                    44.0  17930.954545  27113.340721  -41.0  6383.0  10905.5  17219.25  162312.0

We can see that the average increase in housing for states with decreasing population is lower, but the change in housing is still positive for all of those states.
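
Besides describe, we can compute specific aggregations per group with agg. For instance, a sketch comparing a few statistics of the housing change at once:

# Several aggregate functions over a single column, per group
grouped['housing_change'].agg(['mean', 'median', 'min', 'max'])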

Statistical Operations on Groups


In [40]:
grouped.mean()


Out[40]:
                       2010_population  2011_population  2012_population  2013_population  2014_population  2015_population    pop_change  2010_housing  2011_housing  2012_housing  2013_housing  2014_housing  2015_housing     ratio  housing_change
pop_change_increasing
False                     3.609132e+06     3.616823e+06     3.621533e+06     3.624787e+06     3.623105e+06     3.618260e+06  -4844.857143  1.555811e+06  1.557427e+06  1.558741e+06  1.560897e+06  1.563665e+06  1.567280e+06  2.174036     3615.571429
True                      6.456430e+06     6.509116e+06     6.562543e+06     6.614861e+06     6.671492e+06     6.729341e+06  57848.477273  2.747959e+06  2.756344e+06  2.765614e+06  2.779173e+06  2.796137e+06  2.814068e+06  2.328579    17930.954545

In [41]:
grouped.mean()['housing_change']


Out[41]:
pop_change_increasing
False     3615.571429
True     17930.954545
Name: housing_change, dtype: float64

Original Source

This notebook is based on the Awesome Python Data Science Primer.