In [2]:
## Import the packages we will use
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
This first example follows Chapter 4, section 3 of Richard McElreath's book Statistical Rethinking.
The task is to understand height in a population, in this case using data about the !Kung San people. Anthropologist Nancy Howell conducted interviews with the !Kung San and collected the data used here.
The data are available in the github repository for the book: https://github.com/rmcelreath/rethinking/blob/master/data/Howell1.csv
In [5]:
# Read some data into a frame
# A frame is like a table in a spreadsheet.
# It contains columns (which usually have names) and rows (which can be indexed by number,
# but may also have names)
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[5]:
In [6]:
# Graph the data -- let's look at height vs. age
df.plot.scatter(x='age', y='height')
Out[6]:
In [7]:
# Filter to adults, since height and age are correlated in children
adults_df = df[df['age'] >= 18]
# Look at height vs. age again
adults_df.plot.scatter(x='age', y='height')
Out[7]:
In [8]:
# Print out how many rows are in each frame
len(df), len(adults_df)
Out[8]:
In [9]:
# Let's look at how the data are distributed
adults_df['height'].plot.hist()
Out[9]:
In [10]:
# Split data into male and female
# -- first add in a sex column to make it less confusing
df['sex'] = df.apply(lambda row: 'Male' if row['male'] == 1 else 'Female', axis=1)
# -- re-apply the filter, since we modified the data
#    (.copy() so we can safely modify this subset later without a SettingWithCopyWarning)
adults_df = df[df['age'] >= 18].copy()
adults_df.head()
Out[10]:
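To check the split, value_counts() tallies how many adults fall into each category:
In [ ]:
# Count adults by sex
adults_df['sex'].value_counts()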
In [11]:
# Let's summarize the data
adults_df[['age', 'height', 'weight']].describe()
Out[11]:
In [12]:
# Let's look at the data broken down by sex
adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
Out[12]:
In [13]:
# Let's focus on the mean and standard deviation
summary_df = adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
summary_df.loc[:, (slice(None), ['mean', 'std'])]
Out[13]:
In [14]:
# Let's look at this visually -- plot height broken down by sex
g = sns.FacetGrid(adults_df, hue='sex', height=6)
g.map(sns.histplot, "height", kde=True)
g.add_legend()
Out[14]:
In [15]:
# Actually, let's look at everything
# -- first, get rid of the male column, it's redundant and confusing
del adults_df['male']
adults_df.head()
Out[15]:
In [16]:
# -- now reshape ('flatten') the data into long format: one row per person and measurement
flat_df = adults_df.set_index('sex', append=True)
flat_df = flat_df.stack().reset_index([1, 2])
flat_df.columns = ['sex', 'measurement', 'value']
flat_df.head()
Out[16]:
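To see what stack() is doing on its own, here is a tiny made-up frame (the numbers are placeholders, not the !Kung data). stack() moves the column labels into an inner level of the index, producing one value per row/column pair, which is exactly the 'long' shape built above:
In [ ]:
# A minimal sketch of stack() on a 2x2 frame with made-up values
tiny = pd.DataFrame({'height': [150, 160], 'weight': [50, 60]})
tiny.stack()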
In [17]:
# Plot!
g = sns.FacetGrid(flat_df, col='measurement', hue='sex', height=6, sharex=False)
g.map(sns.histplot, "value", kde=True)
g.add_legend()
Out[17]:
In [3]:
dummy_df = pd.DataFrame([['a', 'b'], ['c','d']])
dummy_df
Out[3]:
A column of a dataframe is a series.
In [4]:
type(dummy_df[0])
Out[4]:
As is a row of a dataframe.
In [5]:
type(dummy_df.loc[0])
Out[5]:
One can also construct a series from a list.
In [6]:
pd.Series([1,2,3,4])
Out[6]:
Unlike a normal Python list, you can label the index.
In [7]:
pd.Series([1,2,3,4], index=['one', 'two', 'three', 'four'])
Out[7]:
One central abstraction in pandas is the DataFrame, which is similar to a data frame in R: basically a spreadsheet, with columns and rows.
Pandas is designed to be fast and efficient, so the table isn't necessarily stored internally the way you might expect. In particular, the data is stored by column, and each column is a pandas Series, which itself builds on NumPy arrays, not native Python lists.
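A quick way to see this, using a small made-up Series rather than the survey data:
In [ ]:
# The values of a Series live in a numpy.ndarray with a NumPy dtype
s = pd.Series([1.0, 2.0, 3.0])
type(s.to_numpy()), s.dtype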
Pandas can read data in many formats; CSV and JSON are common ones. You can control many aspects of how the data is read. Below, the file is CSV-like, but uses ';' as the column separator, which the sep argument handles. Pandas can also handle different file encodings (UTF-8 is the default), and so on.
In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[8]:
If you were already manipulating data in Python and had it as Series objects (the pandas type we discussed above), you could construct a DataFrame directly from them.
In [9]:
col1 = pd.Series([0,1,2,3])
col2 = pd.Series([4,5,6,7])
pd.DataFrame({"col1": col1, "col2": col2})
Out[9]:
In [10]:
df['height'].head()
Out[10]:
In many cases, a column of a frame can be accessed like an array. The result is a pandas Series object, which, as you can see, has a name, an index, and a type.
Series and frames can be very large. The methods head() and tail() can be used to get a few of the first and last rows, respectively; by default they return 5. They're used here to limit the output to a small number of rows, since there is no need to see the whole table.
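tail() works the same way from the other end; for example, the last three rows of the frame:
In [ ]:
df.tail(3)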
In [11]:
df.loc[0]
Out[11]:
However, if you look at summary_df below, you'll see that we can't access its rows by number using loc, because describe() labels the rows 'count', 'mean', 'std', and so on.
In [12]:
summary_df = df.describe()
summary_df
Out[12]:
In [14]:
# summary_df.loc[0]  # raises a KeyError -- 0 is not one of summary_df's row labels
In [15]:
summary_df.loc['mean']
Out[15]:
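If you do want the first row of summary_df by position rather than by label, iloc (shown below) does that:
In [ ]:
# The row at position 0 is the 'count' row
summary_df.iloc[0]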
To access an individual cell, specify a row and column using loc.
In [16]:
summary_df.loc['mean', 'age']
Out[16]:
In [8]:
# select row index 0, and all the columns in that row
df.iloc[0,:]
Out[8]:
In [9]:
# select all the rows in column 0 by index
df.iloc[:,0].head(3)
Out[9]:
What if you had more columns and rows than you cared for in a dataframe? How do you subset the data?
In [19]:
df[['age', 'height', 'weight']].head()
Out[19]:
Specifying a list of column names returns a frame containing just those columns.
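Note the difference between a single column name and a one-element list: the former returns a Series, the latter a one-column frame:
In [ ]:
type(df['age']), type(df[['age']])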
In [21]:
df.iloc[0:5]
Out[21]:
It's also possible to access a subset of the rows by position. More commonly, though, you will want to subset the data by some property.
In [22]:
df[df['age'] >= 18].head()
Out[22]:
This is intuitive to understand, but may seem a little magical at first. It is worth understanding what is going on under the covers.
The expression
df['age'] >= 18
returns a Series of booleans indicating whether the expression is true or false for each row (identified by index).
In [23]:
(df['age'] >= 18).head()
Out[23]:
In [24]:
(df['male'] == 0).head()
Out[24]:
When such a series is the argument to the indexing operator [], pandas returns a frame containing the rows where the value is True. These kinds of expressions can be combined as well, using the bitwise operators: & (and), | (or), and ~ (not).
In [29]:
((df['age'] >= 18) & (df['male'] == 0)).head()
Out[29]:
In [30]:
df[(df['age'] >= 18) & (df['male'] == 0)].head()
Out[30]:
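The other combinators work the same way; for instance, | keeps rows matching either condition and ~ negates a condition:
In [ ]:
# Number of people who are under 18 or male, and the number of non-adults
len(df[(df['age'] < 18) | (df['male'] == 1)]), len(df[~(df['age'] >= 18)])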
This makes subsetting code intuitive to read. It is also possible to subset rows and columns simultaneously.
In [31]:
df.loc[(df['age'] >= 18) & (df['male'] == 0), ['height', 'weight', 'age']].head()
Out[31]:
Renaming columns: just assign a list of new column names to df.columns.
In [36]:
df.columns = ['new_height', 'new_weight', 'new_age', 'coded_gender']
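If you only want to rename some columns, or want a renamed copy rather than changing the frame in place, DataFrame.rename with a dict also works (the name 'is_male' here is just for illustration):
In [ ]:
# rename returns a new frame; df itself is unchanged
df.rename(columns={'coded_gender': 'is_male'}).head(2)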
If I wanted to create a new column based on adding up the weight and age, I could do this:
In [39]:
df['new_id'] = df['new_weight'] + df['new_age']
df.head(2)
Out[39]:
If I wanted to create a calculated column using a dictionary replacement, I could use the map function.
In [40]:
gender_text = {1: 'Male', 0: 'Female'}
df['text_gender'] = df['coded_gender'].map(gender_text)
df.head(2)
Out[40]:
What about using a lambda function to create a new column?
In [41]:
df['double_age'] = df['new_age'].apply(lambda x: x*2)
In [42]:
df.head(2)
Out[42]:
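For simple arithmetic like this, apply isn't actually needed: operations on a Series are vectorized, so this computes the same values directly:
In [ ]:
(df['new_age'] * 2).head(2)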
What happens when you group by? Well, for one, you get a Pandas groupby object.
In [47]:
df.groupby('coded_gender')
Out[47]:
If you call get_group, you get back all of the rows that have a particular value of the column you grouped by.
In [42]:
df.groupby('coded_gender').get_group(0).head(5)
Out[42]:
What if I wanted to "sum by" gender?
In [48]:
df.groupby('coded_gender').sum(numeric_only=True)
Out[48]:
That's great, but I actually want that returned as a normally indexed DataFrame, so I'll call reset_index().
In [49]:
df.groupby('coded_gender').sum(numeric_only=True).reset_index()
Out[49]:
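groupby isn't limited to sum; agg computes several statistics per group at once. A small sketch, using the renamed columns from above:
In [ ]:
# Mean and maximum height for each coded gender
df.groupby('coded_gender')['new_height'].agg(['mean', 'max'])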
How about a similar (but different) question: who is the tallest male, and who is the tallest female?
In [17]:
# nlargest returns the n largest values within each group
pd.DataFrame(df.groupby('coded_gender')['new_height'].nlargest(1))
Out[17]:
What if I wanted to know the 2 shortest males/females?
In [18]:
pd.DataFrame(df.groupby('coded_gender')['new_height'].nsmallest(2))
Out[18]:
What about multiple dataframes that I want to work with together? In pandas, "concatenation" adds dataframes together. If you don't specify the axis along which to concatenate, it assumes you want to stack rows on top of each other and figures out how the columns match.
In [22]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df2 = pd.DataFrame({'A': ['E0', 'E1', 'E2', 'E3'],
                    'B': ['F0', 'F1', 'F2', 'F3'],
                    'C': ['G0', 'G1', 'G2', 'G3'],
                    'D': ['H0', 'H1', 'H2', 'H3']})
df1.head(2)
Out[22]:
In [17]:
pd.concat([df1, df2])
Out[17]:
Pandas used to offer DataFrame.append as a shortcut for adding one frame's rows to another, but that method has been removed in recent versions. Concatenating along axis 0 (the default) does the same job; pass ignore_index=True if you want the result to get a fresh 0..n-1 index:
In [23]:
pd.concat([df1, df2], ignore_index=True)
Out[23]:
If I wanted to keep the same rows and instead add the frames' columns side by side, I would specify axis=1, as in the sketch below.
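Here is that sketch, using df1 and df2 from the concatenation example above:
In [ ]:
# Same rows, columns of both frames side by side
pd.concat([df1, df2], axis=1)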
Merging in pandas is similar to a JOIN in SQL. You specify which column(s) to match on, so that the relevant columns are combined, and what kind of join to perform: left, right, outer, or inner.
In [31]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3'],
                    'G': ['G0', 'G1', 'G2', 'G3'],
                    'H': ['H0', 'H1', 'H2', 'H3']})
pd.merge(df1, df2, on='A', how='left')
Out[31]:
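An inner join, by contrast, keeps only the keys present in both frames; here the row with 'A4' would be dropped:
In [ ]:
pd.merge(df1, df2, on='A', how='inner')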
Pandas also provides tools for missing data: dropna() drops rows that contain missing values, and fillna() replaces them with a value you supply.
In [37]:
df1 = pd.DataFrame({'A': ['A0', None , 'A2', 'A3', 'A4'], 'B': ['B0', 'B1', 'B2', None , None]})
df1
Out[37]:
In [38]:
df1.dropna()
Out[38]:
In [39]:
df1.fillna('No longer missing!')
Out[39]:
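To see how much is missing in the first place, isna().sum() counts the missing values in each column:
In [ ]:
df1.isna().sum()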
In [ ]: