In [4]:
# Import the packages we will use
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
This first example follows Chapter 4, section 3 of Richard McElreath's book Statistical Rethinking.
The task is understand height in a population, in this case using data about the !Kung San people. Anthropologist Nancy Howell conducted interviews with the !Kung San and collected the data used here.
The data are available in the github repository for the book: https://github.com/rmcelreath/rethinking/blob/master/data/Howell1.csv
In [5]:
# Read some data into a frame
# A frame is like an table in a spreadsheet.
# It contains columns (which usually have names) and rows (which can be indexed by number,
# but may also have names)
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[5]:
In [6]:
# Graph the data -- let's look at height vs. age
df.plot.scatter(x='age', y='height')
Out[6]:
In [7]:
# Filter to adults, since height and age are correlated in children
adults_df = df[df['age'] >= 18]
# Look at height vs. age again
adults_df.plot.scatter(x='age', y='height')
Out[7]:
In [8]:
# Print out how many rows are in each frame
len(df), len(adults_df)
Out[8]:
In [9]:
# Let's look at how the data are distributed
adults_df['height'].plot.hist()
Out[9]:
In [10]:
# Split data in to male and female
# -- first add in a sex column to make it less confusing
df['sex'] = df.apply(lambda row: 'Male' if row['male'] == 1 else 'Female', axis=1)
# -- re-apply the filter, since we modified the data
adults_df = df[df['age'] >= 18]
adults_df.head()
Out[10]:
In [11]:
# Let's summarize the data
adults_df[['age', 'height', 'weight']].describe()
Out[11]:
In [12]:
# Let's look at the data broken down by sex
adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
Out[12]:
In [13]:
# Let's focus on the means and std
summary_df = adults_df[['age', 'height', 'weight', 'sex']].groupby('sex').describe()
summary_df.loc[(slice(None),['mean', 'std']), :]
Out[13]:
In [14]:
# Let's look at this visually -- plot height broken down by sex
g = sns.FacetGrid(adults_df, hue='sex', size=6)
g.map(sns.distplot, "height")
g.add_legend()
Out[14]:
In [15]:
# Actually, let's look at everything
# -- first, get rid of the male column, it's redundant and confusing
del adults_df['male']
adults_df.head()
Out[15]:
In [16]:
# -- now flatten the data -- very confusing, it will be explained later
flat_df = adults_df.set_index('sex', append=True)
flat_df = flat_df.stack().reset_index([1, 2])
flat_df.columns = ['sex', 'measurement', 'value']
flat_df.head()
Out[16]:
In [17]:
# Plot!
g = sns.FacetGrid(flat_df, col='measurement', hue='sex', size=6, sharex=False)
g.map(sns.distplot, "value")
g.add_legend()
Out[17]:
In [18]:
df = pd.read_csv('https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv', sep=";")
df.head()
Out[18]:
One central abstraction in pandas is the DataFrame, which is similar to a data frame in R — that is, basically a spreadsheet. It is made up of columns, which are usually names, and rows, which may be named or just accessed by index.
Pandas is designed to be fast and efficient, so the table isn't necessarily stored the way you think it is internally. In particular, data is stored in columns, and each column is a pandas Series, which itself builds on numpy arrays, not native Python arrays.
Pandas can read data in many formats. CSV and JSON are common ones to use. You can control many aspects about how the data is read. Above, you see that the structure of the file is csv-like, but instead the ';' is used as the column separator. This is not a problem. Pandas can also handle different file encodings (UTF-8 is the default), etc.
In [19]:
df['height'].head()
Out[19]:
In many cases, columns of a frame can be accessed like an array. The result is a pandas Series object, which, as you see, has a name, index, and type.
Series and frames can be very large. The methods head() and tail() can be used get a few of the first and last rows, respectively. By default, first/last 5 rows are returned. It's used here to limit output to a small number of rows, since there is no need to see the whole table.
In [20]:
df.loc[0]
Out[20]:
Rows are accessed using the method loc or iloc. The method 'loc' takes the index, 'iloc' takes the row index. In the above case, these are the same, but that is not always true. For example...
In [21]:
summary_df = df.describe()
summary_df.loc['mean']
Out[21]:
To access an individual cell, specify a row and column using loc.
In [22]:
summary_df.loc['mean', 'age']
Out[22]:
In [27]:
# select row index 0, and all the columns in that row
df.iloc[0,:]
Out[27]:
In [30]:
# select all the rows in column 0 by index
df.iloc[:,0]
Out[30]:
In [19]:
df[['age', 'height', 'weight']].head()
Out[19]:
Specifiying an array of column names returns a frame containing just those columns.
In [21]:
df.iloc[0:5]
Out[21]:
It's also possible to access a subset of the rows by index. More commonly, though, you will want to subset the data by some property.
In [22]:
df[df['age'] >= 18].head()
Out[22]:
This is intiutive to understand, but may seem a little magical at first. It is worth understanding what is going on underneath the covers.
The expression
df['age'] >= 18
returns an series of bool indicating whether the expression is true or false for that row (identified by index).
In [23]:
(df['age'] >= 18).head()
Out[23]:
In [24]:
(df['male'] == 0).head()
Out[24]:
When such a series is the argument to the indexing operator, []
, pandas returns a frame containing the rows where the value is True. These kinds of expressions can be combined as well, using the bitwise operators (not and
/or
).
In [29]:
((df['age'] >= 18) & (df['male'] == 0)).head()
Out[29]:
In [30]:
df[(df['age'] >= 18) & (df['male'] == 0)].head()
Out[30]:
This way, code for subsetting is intuitive to understand. It is also possible to subset rows and columns simultaneously.
In [31]:
df.loc[(df['age'] >= 18) & (df['male'] == 0), ['height', 'weight', 'age']].head()
Out[31]:
Renaming columns: just feed a list of new columns and pass it to df.columns
In [36]:
df.columns = ['new_height', 'new_weight', 'new_age', 'coded_gender']
If I wanted to create a new column based on adding up the weight and age, I could do this:
In [39]:
df['new_id'] = df['new_weight'] + df['new_age']
df.head(2)
Out[39]:
If I wanted to create a calculated column using a dictionary replacement, I could use the map
function
In [40]:
gender_text = {1: 'Male', 0: 'Female'}
df['text_gender'] = df['coded_gender'].map(gender_text)
df.head(2)
Out[40]:
What about using a lambda function to create a new column?
In [41]:
df['double_age'] = df['new_age'].apply(lambda x: x*2)
In [42]:
df.head(2)
Out[42]:
In [ ]: