In [2]:
## Import the packages we will use
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
This first example follows Chapter 4, section 3 of Richard McElreath's book Statistical Rethinking.
The task is to understand height in a population, in this case using data about the !Kung San people. Anthropologist Nancy Howell conducted interviews with the !Kung San and collected the data used here.
The data are available in the github repository for the book: https://github.com/rmcelreath/rethinking/blob/master/data/Howell1.csv
In [5]:
# Read some data into a frame
# A frame is like a table in a spreadsheet.
# It contains columns (which usually have names) and rows (which can be indexed by number,
# but may also have names)
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[5]:
In [6]:
# Graph the data -- let's look at height vs. age
df.plot.scatter(x='age', y='height')
Out[6]:
In [7]:
# Filter to adults, since height and age are correlated in children
adults_df = df[df['age'] >= 18]
# Look at height vs. age again
adults_df.plot.scatter(x='age', y='height')
Out[7]:
In [8]:
# Print out how many rows are in each frame
len(df), len(adults_df)
Out[8]:
In [9]:
# Let's look at how the data are distributed
adults_df['height'].plot.hist()
Out[9]:
In [10]:
# Split data into male and female
# -- first add in a sex column to make it less confusing
df['sex'] = df.apply(lambda row: 'Male' if row['male'] == 1 else 'Female', axis=1)
# -- re-apply the filter, since we modified the data
#    (.copy() so we can safely modify this subset later without a SettingWithCopyWarning)
adults_df = df[df['age'] >= 18].copy()
adults_df.head()
Out[10]:
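To check the split, value_counts() tallies how many adults fall into each category:
In [ ]:
# Count adults by sex
adults_df['sex'].value_counts()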
In [11]:
# Let's summarize the data
adults_df[['age', 'height', 'weight']].describe()
Out[11]:
In [12]:
# Let's look at the data broken down by sex
adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
Out[12]:
In [13]:
# Let's focus on the mean and standard deviation
summary_df = adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
summary_df.loc[:, (slice(None), ['mean', 'std'])]
Out[13]:
In [14]:
# Let's look at this visually -- plot height broken down by sex
g = sns.FacetGrid(adults_df, hue='sex', height=6)
g.map(sns.histplot, "height", kde=True)
g.add_legend()
Out[14]:
In [15]:
# Actually, let's look at everything
# -- first, get rid of the male column, it's redundant and confusing
del adults_df['male']
adults_df.head()
Out[15]:
In [16]:
# -- now reshape ('flatten') the data into long format: one row per person and measurement
flat_df = adults_df.set_index('sex', append=True)
flat_df = flat_df.stack().reset_index([1, 2])
flat_df.columns = ['sex', 'measurement', 'value']
flat_df.head()
Out[16]:
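To see what stack() is doing on its own, here is a tiny made-up frame (the numbers are placeholders, not the !Kung data). stack() moves the column labels into an inner level of the index, producing one value per row/column pair, which is exactly the 'long' shape built above:
In [ ]:
# A minimal sketch of stack() on a 2x2 frame with made-up values
tiny = pd.DataFrame({'height': [150, 160], 'weight': [50, 60]})
tiny.stack()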
In [17]:
# Plot!
g = sns.FacetGrid(flat_df, col='measurement', hue='sex', height=6, sharex=False)
g.map(sns.histplot, "value", kde=True)
g.add_legend()
Out[17]:
In [3]:
dummy_df = pd.DataFrame([['a', 'b'], ['c','d']])
dummy_df
Out[3]:
A column of a dataframe is a series.
In [4]:
type(dummy_df[0])
Out[4]:
As is a row of a dataframe.
In [5]:
type(dummy_df.loc[0])
Out[5]:
One can also construct a series from a list.
In [6]:
pd.Series([1,2,3,4])
Out[6]:
Unlike a normal Python list, you can label the index.
In [7]:
pd.Series([1,2,3,4], index=['one', 'two', 'three', 'four'])
Out[7]:
One central abstraction in pandas is the DataFrame, which is similar to a data frame in R: basically a spreadsheet, with columns and rows.
Pandas is designed to be fast and efficient, so the table isn't necessarily stored internally the way you might expect. In particular, the data is stored by column, and each column is a pandas Series, which itself builds on NumPy arrays, not native Python lists.
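A quick way to see this, using a small made-up Series rather than the survey data:
In [ ]:
# The values of a Series live in a numpy.ndarray with a NumPy dtype
s = pd.Series([1.0, 2.0, 3.0])
type(s.to_numpy()), s.dtype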
Pandas can read data in many formats; CSV and JSON are common ones. You can control many aspects of how the data is read. Below, the file is CSV-like, but uses ';' as the column separator, which the sep argument handles. Pandas can also handle different file encodings (UTF-8 is the default), and so on.
In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[8]:
If you were already manipulating data in Python and had it as Series objects (the pandas type we discussed above), you could construct a DataFrame directly from them.
In [9]:
col1 = pd.Series([0,1,2,3])
col2 = pd.Series([4,5,6,7])
pd.DataFrame({"col1": col1, "col2": col2})
Out[9]:
In [10]:
df['height'].head()
Out[10]:
In many cases, a column of a frame can be accessed like an array. The result is a pandas Series object, which, as you can see, has a name, an index, and a type.
Series and frames can be very large. The methods head() and tail() can be used to get a few of the first and last rows, respectively; by default they return 5. They're used here to limit the output to a small number of rows, since there is no need to see the whole table.
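tail() works the same way from the other end; for example, the last three rows of the frame:
In [ ]:
df.tail(3)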
In [11]:
df.loc[0]
Out[11]:
However, if you look at summary_df below, you'll see that we can't access its rows by number using loc, because describe() labels the rows 'count', 'mean', 'std', and so on.
In [12]:
summary_df = df.describe()
summary_df
Out[12]:
In [14]:
# summary_df.loc[0]  # raises a KeyError -- 0 is not one of summary_df's row labels
In [15]:
summary_df.loc['mean']
Out[15]:
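If you do want the first row of summary_df by position rather than by label, iloc (shown below) does that:
In [ ]:
# The row at position 0 is the 'count' row
summary_df.iloc[0]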
To access an individual cell, specify a row and column using loc.
In [16]:
summary_df.loc['mean', 'age']
Out[16]:
In [8]:
# select row index 0, and all the columns in that row
df.iloc[0,:]
Out[8]:
In [9]:
# select all the rows in column 0 by index
df.iloc[:,0].head(3)
Out[9]:
What if you had more columns and rows than you cared for in a dataframe? How do you subset the data?
In [19]:
df[['age', 'height', 'weight']].head()
Out[19]:
Specifying a list of column names returns a frame containing just those columns.
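Note the difference between a single column name and a one-element list: the former returns a Series, the latter a one-column frame:
In [ ]:
type(df['age']), type(df[['age']])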
In [21]:
df.iloc[0:5]
Out[21]:
It's also possible to access a subset of the rows by position. More commonly, though, you will want to subset the data by some property.
In [22]:
df[df['age'] >= 18].head()
Out[22]:
This is intuitive to understand, but may seem a little magical at first. It is worth understanding what is going on under the covers.
The expression
df['age'] >= 18
returns a Series of booleans indicating whether the expression is true or false for each row (identified by index).
In [23]:
(df['age'] >= 18).head()
Out[23]:
In [24]:
(df['male'] == 0).head()
Out[24]:
When such a series is the argument to the indexing operator [], pandas returns a frame containing the rows where the value is True. These kinds of expressions can be combined as well, using the bitwise operators: & (and), | (or), and ~ (not).
In [29]:
((df['age'] >= 18) & (df['male'] == 0)).head()
Out[29]:
In [30]:
df[(df['age'] >= 18) & (df['male'] == 0)].head()
Out[30]:
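The other combinators work the same way; for instance, | keeps rows matching either condition and ~ negates a condition:
In [ ]:
# Number of people who are under 18 or male, and the number of non-adults
len(df[(df['age'] < 18) | (df['male'] == 1)]), len(df[~(df['age'] >= 18)])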
This makes subsetting code intuitive to read. It is also possible to subset rows and columns simultaneously.
In [31]:
df.loc[(df['age'] >= 18) & (df['male'] == 0), ['height', 'weight', 'age']].head()
Out[31]:
Renaming columns: just assign a list of new column names to df.columns.
In [36]:
df.columns = ['new_height', 'new_weight', 'new_age', 'coded_gender']
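If you only want to rename some columns, or want a renamed copy rather than changing the frame in place, DataFrame.rename with a dict also works (the name 'is_male' here is just for illustration):
In [ ]:
# rename returns a new frame; df itself is unchanged
df.rename(columns={'coded_gender': 'is_male'}).head(2)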
If I wanted to create a new column based on adding up the weight and age, I could do this:
In [39]:
df['new_id'] = df['new_weight'] + df['new_age']
df.head(2)
Out[39]:
If I wanted to create a calculated column using a dictionary replacement, I could use the map function.
In [40]:
gender_text = {1: 'Male', 0: 'Female'}
df['text_gender'] = df['coded_gender'].map(gender_text)
df.head(2)
Out[40]:
What about using a lambda function to create a new column?
In [41]:
df['double_age'] = df['new_age'].apply(lambda x: x*2)
In [42]:
df.head(2)
Out[42]:
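For simple arithmetic like this, apply isn't actually needed: operations on a Series are vectorized, so this computes the same values directly:
In [ ]:
(df['new_age'] * 2).head(2)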
What happens when you group by? Well, for one, you get a Pandas groupby object.
In [47]:
df.groupby('coded_gender')
Out[47]:
If you call get_group, you get back all of the rows that have a particular value of the column you grouped by.
In [42]:
df.groupby('coded_gender').get_group(0).head(5)
Out[42]:
What if I wanted to "sum by" gender?
In [48]:
df.groupby('coded_gender').sum(numeric_only=True)
Out[48]:
That's great, but I actually want that returned as a normally indexed DataFrame, so I'll call reset_index().
In [49]:
df.groupby('coded_gender').sum(numeric_only=True).reset_index()
Out[49]:
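groupby isn't limited to sum; agg computes several statistics per group at once. A small sketch, using the renamed columns from above:
In [ ]:
# Mean and maximum height for each coded gender
df.groupby('coded_gender')['new_height'].agg(['mean', 'max'])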
How about a similar (but different) question: who is the tallest male, and who is the tallest female?
In [17]:
# nlargest returns the n largest values within each group
pd.DataFrame(df.groupby('coded_gender')['new_height'].nlargest(1))
Out[17]:
What if I wanted to know the 2 shortest males/females?
In [18]:
pd.DataFrame(df.groupby('coded_gender')['new_height'].nsmallest(2))
Out[18]:
What about multiple dataframes that I want to work with together? In pandas, "concatenation" adds dataframes together. If you don't specify the axis along which to concatenate, it assumes you want to stack rows on top of each other and figures out how the columns match.
In [22]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df2 = pd.DataFrame({'A': ['E0', 'E1', 'E2', 'E3'],
                    'B': ['F0', 'F1', 'F2', 'F3'],
                    'C': ['G0', 'G1', 'G2', 'G3'],
                    'D': ['H0', 'H1', 'H2', 'H3']})
df1.head(2)
Out[22]:
In [17]:
pd.concat([df1, df2])
Out[17]:
Pandas used to offer DataFrame.append as a shortcut for adding one frame's rows to another, but that method has been removed in recent versions. Concatenating along axis 0 (the default) does the same job; pass ignore_index=True if you want the result to get a fresh 0..n-1 index:
In [23]:
pd.concat([df1, df2], ignore_index=True)
Out[23]:
If I wanted to keep the same rows and instead add the frames' columns side by side, I would specify axis=1, as in the sketch below.
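Here is that sketch, using df1 and df2 from the concatenation example above:
In [ ]:
# Same rows, columns of both frames side by side
pd.concat([df1, df2], axis=1)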
Merging in pandas is similar to a JOIN in SQL. You specify which column(s) to match on, so that the relevant columns are combined, and what kind of join to perform: left, right, outer, or inner.
In [31]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3'],
                    'G': ['G0', 'G1', 'G2', 'G3'],
                    'H': ['H0', 'H1', 'H2', 'H3']})
pd.merge(df1, df2, on='A', how='left')
Out[31]:
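An inner join, by contrast, keeps only the keys present in both frames; here the row with 'A4' would be dropped:
In [ ]:
pd.merge(df1, df2, on='A', how='inner')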
Pandas also provides tools for missing data: dropna() drops rows that contain missing values, and fillna() replaces them with a value you supply.
In [37]:
df1 = pd.DataFrame({'A': ['A0', None , 'A2', 'A3', 'A4'], 'B': ['B0', 'B1', 'B2', None , None]})
df1
Out[37]:
In [38]:
df1.dropna()
Out[38]:
In [39]:
df1.fillna('No longer missing!')
Out[39]:
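To see how much is missing in the first place, isna().sum() counts the missing values in each column:
In [ ]:
df1.isna().sum()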
In [ ]: