Torture numbers, and they'll confess to anything.
-Gregg Easterbrook
Pandas is a Python library that focuses on data manipulation and numerical analysis. Although originally designed as an econometrics platform, it has proven suitable for a wide variety of domains. It will be our primary interface going forward.
There are two important concepts we should discuss before we begin.
Pandas borrowed the concept of the pd.DataFrame from R.
In [ ]:
# Import pandas under the name pd
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline
matplotlib.style.use('fivethirtyeight')
# Create a dataframe from a CSV file
df = pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv')
# Any dataframe at end of cell gets HTML representation
# Head limits the number of output rows
df
In [ ]:
# For clarity, add color.
from IPython.display import HTML
with open('static/notebook_style.css', 'r', encoding='utf8') as f:
    data = f.read()
HTML('<style>' + data + '</style>')
A dataframe can be thought of as a relational database table or an Excel sheet. It has rows and columns. The rows correspond to individual items or entities. The columns correspond to various features of those entities. In the above, the columns and the row index are in navy. The actual data falls in the middle of the table.
DataFrames can be thought of as a group of columns, each of which is a Series object. They can also be thought of as a group of rows, each also a Series. Unlike a dataframe, a Series has only an index and data.
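To make the relationship concrete, here is a minimal sketch using a tiny frame with made-up data (not the CFPB file used below): selecting one column of a DataFrame yields a Series that shares the frame's index.

```python
import pandas as pd

# A DataFrame is a dict-like collection of named columns
toy = pd.DataFrame({'Product': ['Mortgage', 'Credit card'],
                    'State': ['CA', 'NY']})

# Selecting one column returns a Series: just an index plus data
col = toy['Product']
print(type(col).__name__)   # Series
print(col.index.tolist())   # [0, 1]
```

Note that the Series keeps the DataFrame's row index, which is what lets pandas align data between columns automatically.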
In [ ]:
# You can select a column series of a dataframe with this notation:
# dataframe[column_name]
# OR, if the name is a valid Python identifier:
# df.Product
# Head limits its output to the first 5 elements
df['Product'].head()
Warning: If you are a Python person, some of the syntax you're about to see may frighten and confuse you. This is a perfectly normal reaction. To get R-like syntax, pandas heavily overloads the __getitem__() magic method, dispatching on the type of the argument. It's kludgy, but it works beautifully.
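The dispatch-on-type behavior described above can be sketched on a toy frame (data invented for illustration): the same square-bracket syntax does three different things depending on what you pass it.

```python
import pandas as pd

toy = pd.DataFrame({'Company': ['Experian', 'Equifax', 'Experian'],
                    'State': ['CA', 'NY', 'TX']})

# A string selects one column (a Series)
a = toy['Company']

# A list of strings selects a sub-DataFrame
b = toy[['Company', 'State']]

# A boolean Series filters rows
c = toy[toy['Company'] == 'Experian']
print(len(c))  # 2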
In [ ]:
# We can cut down the dataframe as needed
# You can also pass a list of columns to get back a subframe
# dataframe[list_of_column_names]
# http://pandas.pydata.org/pandas-docs/stable/indexing.html
tdf = df[['Product', 'Company', 'State']]
tdf.head()
In [ ]:
# We can also cut down rows by position
# iloc is for integer-position indexing; loc is for label-based indexing
tdf.iloc[:5]
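The iloc/loc distinction is easy to see on a toy frame (made-up data): iloc slices by integer position, regardless of what the index labels are.

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]})

# iloc[:2] takes the first two rows by position
first_two = toy.iloc[:2]
print(first_two['x'].tolist())  # [10, 20]
```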
In [ ]:
# Or by content
# This works because of boolean indexing, which we will get to.
tdf[tdf['Company'] == 'Experian'].head(5)
In [ ]:
# We can filter and then analyze columns.
filtered_df = tdf[tdf['Company'] == 'Experian']
# And we can use Series methods if we want to examine columns
filtered_df['State'].value_counts()[:5]
In [ ]:
# Or get unique values
filtered_df['Product'].unique()[:5]
In [ ]:
# We can get column datatypes
filtered_df.dtypes
In [ ]:
# We can group the data and view it in aggregate
# http://pandas.pydata.org/pandas-docs/stable/groupby.html
gb = df.groupby(['Product', 'Sub-product'])
# Taking the mean of each numeric column gives basic summary numbers per group.
size = gb.mean().head(15)
size
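The split-apply-combine pattern behind groupby can be sketched with a small invented frame (column names are made up, not from the CFPB data): rows are split by key, an aggregate runs per group, and the results are combined into one Series.

```python
import pandas as pd

toy = pd.DataFrame({'Product': ['Card', 'Card', 'Loan'],
                    'Claim': [100.0, 300.0, 50.0]})

# Split rows by 'Product', then average 'Claim' within each group
means = toy.groupby('Product')['Claim'].mean()
print(means['Card'])  # 200.0
```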
In [ ]:
# Pandas also has simplified plotting
size['Consumer Claim'].sort_values().plot.barh()
In [ ]:
# We can reshape data as necessary.
size.unstack().fillna(' ')
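What unstack does is easier to see on a tiny hand-built Series with a two-level index (values invented): the inner index level pivots out to become the columns of a table, with missing combinations filled with NaN.

```python
import pandas as pd

# A Series indexed by (outer, inner) pairs...
s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples(
                  [('Card', 'Late fee'), ('Card', 'Fraud'), ('Loan', 'Fraud')]))

# ...unstacks into a table: outer level -> rows, inner level -> columns
table = s.unstack()
print(table.loc['Card', 'Fraud'])  # 2.0
```

The ('Loan', 'Late fee') cell has no source value, so it becomes NaN, which is why the cell above chains .fillna(' ') for display.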
In [ ]:
# Or compute several aggregates at once over each group.
output = gb['Consumer Claim'].agg([np.mean, np.median, np.max, np.min]).head(10)
output
In [ ]:
# We can use apply to run a function over each row (axis=1)
def transmogrify(row):
    complaint_id = row['Complaint ID']
    product = row['Product']
    state = row['State']
    return 'Complaint {} is a {} complaint from the state of {}.'.format(complaint_id, product, state)
output = df.apply(transmogrify, axis=1)[:5]
output[0]
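A self-contained sketch of the same row-wise apply pattern, using a toy frame with made-up values: with axis=1, each row is passed to the function as a Series keyed by column name.

```python
import pandas as pd

toy = pd.DataFrame({'Complaint ID': [1, 2],
                    'Product': ['Card', 'Loan']})

def describe(row):
    # row is a Series; column names are its index labels
    return 'Complaint {} is about {}.'.format(row['Complaint ID'], row['Product'])

msgs = toy.apply(describe, axis=1)
print(msgs[0])  # Complaint 1 is about Card.
```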
In [ ]:
# We can do database style merges, joins, and concatenations
# http://pandas.pydata.org/pandas-docs/stable/merging.html
df2 = pd.read_csv('data/simple.csv')
df2.head(5)
In [ ]:
# Here we nonsensically merge arbitrary numbers from simple.csv onto the CFPB dataset
tdf = df.merge(df2, how='inner', left_on='Date received', right_on='Date')
tdf[['Date received', 'Product', 'Count']].head(5)
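The merge semantics can be checked on two tiny hand-built frames (dates and counts invented): an inner merge keeps only the keys that appear on both sides.

```python
import pandas as pd

left = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'],
                     'Product': ['Card', 'Loan']})
right = pd.DataFrame({'Date': ['2020-01-02', '2020-01-03'],
                      'Count': [7, 9]})

# Only '2020-01-02' exists in both frames, so one row survives
merged = left.merge(right, how='inner', on='Date')
print(len(merged))  # 1
```

Switching how='inner' to 'left', 'right', or 'outer' changes which side's unmatched keys are kept.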
In [ ]:
# We can write the data to disk in a single line
output.to_csv('data/custom_function_output.csv')
In [ ]:
# Pandas has built in support for datetime objects, too.
df['Date received'] = pd.to_datetime(df['Date received'])
gb = df.groupby([df['Date received'].dt.year,
df['Date received'].dt.month])
gb.size().plot()
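The same to_datetime / .dt pattern on a small made-up Series of date strings: once parsed, the .dt accessor exposes components like year for the whole Series, which can then serve as a groupby key.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-03-01', '2019-03-15', '2020-07-04']))

# .dt.year pulls the year out of every element at once
years = dates.dt.year

# Group the dates by year and count each group
counts = dates.groupby(years).size()
print(counts[2019])  # 2
```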
In [ ]:
# And support for string methods
# na=False treats missing narratives as non-matches
contains_lawyer = df['Consumer complaint narrative'].str.contains('lawyer', na=False)
# Get all items containing lawyer
data = df[contains_lawyer]['Consumer complaint narrative']
# Show the matching narratives
data
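On a toy Series with a missing entry (text invented), the effect of na=False is visible directly: without it, the missing value would produce NaN in the mask, which cannot be used for boolean indexing.

```python
import pandas as pd

narratives = pd.Series(['I hired a lawyer.', 'No issue here.', None])

# na=False turns the missing entry into False instead of NaN
mask = narratives.str.contains('lawyer', na=False)
print(mask.tolist())  # [True, False, False]

# The mask can now safely filter the Series
matches = narratives[mask]
```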
In [ ]:
# This includes regexes for text mining ... https://en.wikipedia.org/wiki/Regular_expression
regex_string = r'([Ll]awyer[\S\s]*?\.|[Aa]ttorney[\S\s]*?\.)'
# Extract the first match in each narrative
lawyer_to_sentence_end = df['Consumer complaint narrative'].str.extract(regex_string,
expand=True)
lawyer_to_sentence_end.dropna().head(5)
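A simplified sketch of str.extract on invented text, using a one-group version of the regex above: each row yields the first match of the capture group, and rows with no match come back as NaN (which is why the cell above chains .dropna()).

```python
import pandas as pd

s = pd.Series(['Call my lawyer today. Then rest.',
               'nothing relevant'])

# Non-greedy match from 'lawyer' to the next period
hits = s.str.extract(r'(lawyer[\s\S]*?\.)', expand=True)
print(hits.iloc[0, 0])  # lawyer today.
```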