In [1]:
# pandas is commonly imported as `pd`
import pandas as pd
import numpy as np
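The simplest pandas data structure is the Series: a one-dimensional array of values with an associated index of labels. If you don't supply an index, pandas uses the integers 0 through N-1: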
In [2]:
s = pd.Series([4, 7, -5, 3])
s
Out[2]:
In [3]:
s.index
Out[3]:
In [4]:
s.values
Out[4]:
In [5]:
s[1:3]
Out[5]:
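With this default integer index, slicing behaves positionally, just like a numpy array: s[1:3] returns the second and third elements.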
In [6]:
s2 = s**2
s2
Out[6]:
In [7]:
s2+s
Out[7]:
In [8]:
# numpy functions work directly on Series
print(np.sum(s))
print(np.mean(s))
print(np.std(s))
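Series also provide these reductions as methods. One caveat worth a quick sketch: s.std() defaults to the sample standard deviation (ddof=1), while np.std defaults to the population standard deviation (ddof=0), so the two can disagree.
print(s.sum(), s.mean())   # same results as the numpy calls above
print(s.std())             # sample standard deviation, ddof=1
print(np.std(s))           # population standard deviation, ddof=0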
In [9]:
s3 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
s3
Out[9]:
In [10]:
# !!! watch what happens when the indexes don't match
s2+s3
Out[10]:
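This is pandas index alignment at work: when two Series are added, values are matched by index label rather than by position. s2 is indexed 0 through 3 while s3 is indexed by letters, so no labels match and every element of the result is NaN. When the labels do line up, as with s3 and s4 below, the values are added pairwise by label.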
In [11]:
s4 = pd.Series({'d':10, 'b':12, 'a':3, 'c':9})
s4
Out[11]:
In [12]:
s3+s4
Out[12]:
The most important feature pandas gives us access to is the DataFrame. DataFrames are two-dimensional structures that you can think of very much like a spreadsheet with named columns and rows. In fact, pandas can read CSV and Excel files directly.
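To make the spreadsheet analogy concrete, here is a minimal sketch with toy values (not from our dataset) that builds a DataFrame from a dictionary mapping column names to columns of values:
# Each dictionary key becomes a column name, each list a column of values
pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]})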
First, let's read a CSV containing population information for the states of the USA.
In [13]:
# Create a DataFrame, `df`, from the csv located in `data/population.csv`
df = pd.read_csv('../../data/population.csv')
To see what our data looks like, we can use df.head(n) to see the first n rows, with n=5 the default:
In [14]:
df.head()
Out[14]:
We see that for each state we have two IDs and then six columns of years; in this case each column holds the population of that state in the given year.
We can access columns by referencing the column name, just like a Python dictionary:
In [15]:
df['2010'].head()
Out[15]:
We can get multiple columns at once by passing a list of column names:
In [16]:
df[['2010', '2011']].head()
Out[16]:
And then we can access groups of rows using a range of row IDs:
In [17]:
df[5:10]
Out[17]:
Accessing individual rows is different. You can't just use df[i]. Instead you need to use df.loc:
In [18]:
df.loc[20]
Out[18]:
In [19]:
type(df.loc[20])
Out[19]:
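df.loc selects rows by index label; here the labels just happen to be the default integers. For purely positional access pandas also provides df.iloc, sketched here:
df.iloc[20]    # 21st row, by position
df.iloc[0:3]   # first three rows, by position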
Pandas gives us an easy way to get summary statistics of our data:
In [20]:
df.describe()
Out[20]:
One thing you might notice is that describe only lists the numeric columns, and Id2 is included even though it would be better to treat it as a string. pandas tries to guess the datatype of each column, and in this case all of the values in Id2 are integers, so the column gets treated as an integer.
We can see the datatype details:
In [21]:
df.dtypes
Out[21]:
We can cast Id2 to a string using astype and then overwrite the original column:
In [22]:
df['Id2'] = df['Id2'].astype(str)
df.dtypes
Out[22]:
Or we could have specified the data type when we originally read the CSV:
In [23]:
# Pass a dictionary to the dtype parameter with `'column': dtype`
df = pd.read_csv('../../data/population.csv', dtype={'Id2': str})
df.dtypes
Out[23]:
Column arithmetic works element-wise, just like Series arithmetic. For example, each state's population change from 2014 to 2015:
In [24]:
df['2015'] - df['2014']
Out[24]:
Rather than continually computing that value, we can save it to a new column in the DataFrame. Let's make a new column called 'pop_change':
In [25]:
df['pop_change'] = df['2015'] - df['2014']
df.head()
Out[25]:
Just like numpy, we can also do element-wise comparisons. For example, to find out whether a state's population decreased over that year:
In [26]:
df['pop_change'] < 0
Out[26]:
The True values are the states with negative population change (decrease). But that boolean array by itself isn't very useful. We can use that array as an index to filter our DataFrame:
In [27]:
df[df['pop_change'] < 0]
Out[27]:
Now we have a subset of the DataFrame with only the decreasing populations.
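Boolean masks can also be combined with & (and) and | (or), wrapping each condition in parentheses. A sketch, with an arbitrary population cutoff:
# States that shrank despite having over a million residents in 2010 (threshold is arbitrary)
df[(df['pop_change'] < 0) & (df['2010'] > 1000000)]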
We can also compute aggregate statistics over the whole DataFrame:
In [28]:
# numeric_only=True skips the non-numeric columns
print(df.mean(numeric_only=True))
print(df.std(numeric_only=True))
print(df.max(numeric_only=True))
print(df.sum(numeric_only=True))
In [29]:
# or over a subset of columns
print(df[['2010','2011']].mean())
pandas also makes it easy to combine datasets. Let's read a second CSV containing housing data for the same states:
In [30]:
housing = pd.read_csv('../../data/housing.csv', dtype={'Id2': str})
housing.head()
Out[30]:
Since the Id column is shared, it can easily be merged with our original DataFrame:
In [31]:
merged = df.merge(housing, on='Id')
merged.head()
Out[31]:
Since the column names are all shared, pandas appends '_x' and '_y' to columns from the left and right DataFrames, respectively.
This isn't very user-friendly, so we can use the suffixes parameter to specify custom labels to append. Furthermore, we can also include Id2 and Geography in on so those columns aren't duplicated:
In [32]:
merged = df.merge(housing, on=['Id', 'Id2', 'Geography'], suffixes=('_population', '_housing'))
merged.head()
Out[32]:
We can also see that the merge lost one row: the housing dataset doesn't contain data for Puerto Rico.
In [33]:
print('Population:', len(df), 'Merged:', len(merged))
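merge defaults to an inner join, keeping only rows whose keys appear in both DataFrames. If we instead wanted to keep Puerto Rico with missing housing values, passing how='outer' would keep unmatched rows, for example:
# An outer merge keeps rows from both sides, filling gaps with NaN
outer = df.merge(housing, on=['Id', 'Id2', 'Geography'],
                 suffixes=('_population', '_housing'), how='outer')
len(outer)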
Now we can do something involving both datasets. For example, finding the ratio of people to houses:
In [34]:
merged['ratio'] = merged['2015_population']/merged['2015_housing']
merged['ratio'].head()
Out[34]:
Now let's use sort_values to view the states with the lowest ratio of people to houses, looking at just the state name and ratio columns:
In [35]:
# Sort the data by ratio
merged_sorted = merged.sort_values(by=['ratio'])
# Just get the `Geography` and `ratio` columns
merged_subset = merged_sorted[['Geography', 'ratio']]
# View the first 5
merged_subset.head()
Out[35]:
And now to view the top 5, use ascending=False:
In [36]:
merged.sort_values(by=['ratio'], ascending=False)[['Geography', 'ratio']].head()
Out[36]:
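As an aside, nlargest (and its counterpart nsmallest) is a shortcut for this sort-then-head pattern:
merged.nlargest(5, 'ratio')[['Geography', 'ratio']]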
Sometimes you'd like to aggregate groups of similar rows. For instance, let's compare the change in housing stock between states with decreasing population and states with increasing population.
First, let's make a column for the housing change, and a boolean column for whether the population is increasing.
In [37]:
merged['housing_change'] = merged['2015_housing'] - merged['2014_housing']
merged['pop_change_increasing'] = merged['pop_change'] > 0
Then use groupby to group our rows by whether their population increased or decreased from 2014 to 2015:
In [38]:
grouped = merged.groupby('pop_change_increasing')
Then we can run aggregate functions on our groups, or use describe to get the same summary statistics we computed before:
In [39]:
grouped.describe()['housing_change']
Out[39]:
We can see that the average housing increase for states with decreasing population is lower. But the change in housing for all those states is still positive.
In [40]:
grouped.mean(numeric_only=True)
Out[40]:
In [41]:
grouped.mean(numeric_only=True)['housing_change']
Out[41]:
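If we want specific aggregations rather than the full describe output, agg accepts a list of function names, for example:
# Several aggregates of housing_change per group
grouped['housing_change'].agg(['mean', 'median', 'min', 'max'])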