In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize']=(8,5) # optional
plt.style.use('bmh') # optional
We're going to investigate the Titanic passenger dataset. The data come from http://www.kaggle.com/c/titanic; you can download them there, though you'll need to sign up and agree to the rules of the competition.
In [2]:
#change the paths as needed
train = pd.read_csv('../data/titanic_train.csv')
test = pd.read_csv('../data/titanic_test.csv')
In [3]:
print(train.info()) # overview of the training data
In [4]:
print(test.info()) # note no 'Survived' column
In [5]:
train.head() # first few rows
Out[5]:
In [6]:
print(train.shape) # 891 rows, 12 columns
print(test.shape) # 418 rows, 11 columns
In [7]:
train.shape[0]
Out[7]:
In [8]:
train.describe(include='all') # summary statistics
Out[8]:
Note that the above summary gives you information about missing values indirectly: for instance, we have 891 observations in the data, but the count for 'Age' is only 714, implying that we don't have the age for 177 passengers. A direct way to get this is by combining the command isnull() with sum():
In [9]:
print(train.isnull().sum()) # the sum is taken columnwise
Let's head towards developing a simple predictive model for who survives the shipwreck. For now we'll explore and clean the data; later on we'll implement the model. Let's look at the distributions of some of the data.
In [10]:
train.hist(column='Age');
In [11]:
train.hist(column=['Age', 'Fare'], figsize=(16, 6));
In [12]:
pd.crosstab(train['Sex'], train['Survived'], margins=True) # 'margins' gives total counts
Out[12]:
This will be easier to analyze via proportions. Below, 'axis=1' applies the function to each row ('axis=0' would apply it to each column).
In [13]:
# saving the table for plotting
table = pd.crosstab(train.Sex, train.Survived).apply(lambda x: x/x.sum(), axis=1) # no use for margins here
table
Out[13]:
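As an aside, more recent versions of pandas let crosstab do this normalization directly, without the apply/lambda idiom. A quick sketch on a tiny made-up frame (hypothetical values, just to show the call):

```python
import pandas as pd

# hypothetical stand-in for the Titanic columns
df = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                   'Survived': [0, 1, 1, 1]})

# normalize='index' divides each row by its row total,
# equivalent to .apply(lambda x: x/x.sum(), axis=1)
table = pd.crosstab(df['Sex'], df['Survived'], normalize='index')
print(table)
```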
In [14]:
train['Survived'].sum() / train.shape[0]
Out[14]:
In [15]:
table.plot(kind='bar', stacked=True); # another plot call directly from a Pandas object
In [16]:
table.plot(kind='bar'); # without stacking
At this point, we're going to start addressing the missing values in Age, Embarked, Cabin, etc. It is important that any changes we make to 'train' are also made to 'test,' otherwise any predictive model we build will be flawed. Let's make a combined data frame.
In [17]:
test['Survived'] = 0 # create a new 'Survived' column in test, set all values to 0
In [18]:
test.head()
Out[18]:
In [19]:
alldata = pd.concat([train,test], keys=['train', 'test']) # tag the training and test data with a multiindex
In [20]:
alldata.tail() #note the multiindex on the rows
Out[20]:
In [21]:
alldata.shape[0] == 891 + 418
Out[21]:
Below are some examples of how we could use the multi-index on the rows.
In [22]:
print(alldata.loc['train', 'Name']) # .ix is removed in modern pandas; use .loc for label-based selection
In [23]:
print(alldata.iloc[1]['Name']) # purely positional selection
print(alldata.iloc[892]['Name']) # position 892 falls in the test set
In [24]:
print(alldata.loc['test', 'Name'][:5]) # note the numerical indices start at 0
With .loc you can also select the 'outer' index first, then pick out an 'inner' index from the result.
In [25]:
print(alldata.loc['test', 'Name'][1])
In [26]:
sum(alldata.Age.isnull())
Out[26]:
In [27]:
alldata.Age.var() # the variance of the Age column (missing values excluded)
Out[27]:
In [28]:
alldata['Age'] = alldata['Age'].interpolate() # linear interpolation between neighboring values; note this is not mean imputation
In [29]:
sum(alldata.Age.isnull())
Out[29]:
This isn't a flawless fix: simple imputation schemes like interpolation can shift the mean, reduce the variance, and distort any model built on the data. On the other hand, no imputation method is perfect, and you may get better results than if you left the missing values alone. Later we'll look at more advanced methods for imputing missing values.
In [30]:
alldata.Age.var()
Out[30]:
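As a preview of a slightly smarter approach (a sketch only, not what we did above): impute Age using the median within groups that correlate with age, such as Sex and Pclass. Here on a hypothetical mini-frame standing in for alldata:

```python
import numpy as np
import pandas as pd

# made-up rows standing in for alldata
df = pd.DataFrame({'Sex': ['male', 'male', 'female', 'female'],
                   'Pclass': [1, 1, 3, 3],
                   'Age': [40.0, np.nan, 18.0, np.nan]})

# fill each missing Age with the median Age of its (Sex, Pclass) group
df['Age'] = df.groupby(['Sex', 'Pclass'])['Age'].transform(
    lambda s: s.fillna(s.median()))
print(df)
```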
In [31]:
# your code here
print(alldata.Embarked.isnull().sum())
print(alldata.Embarked.value_counts())
There are a lot of missing Cabin values:
In [32]:
alldata.Cabin.isnull().sum()
Out[32]:
The simplest answer is to not include this column in the model for now. Three-quarters of the information in this column is missing, which makes it difficult to say much. However, it may be possible to infer some of the missing values based on implied relationships between passengers, perhaps by looking at last names, number of siblings on board, and so on. It's worth further investigation, but for now let's leave it.
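As a sketch of the last-name idea (using made-up names in the Titanic 'Last, Title. First' format), you could split Name on the comma and count shared surnames to find likely family groups:

```python
import pandas as pd

# hypothetical names in the Titanic format: 'Last, Title. First'
df = pd.DataFrame({'Name': ['Smith, Mr. John', 'Smith, Mrs. Jane',
                            'Jones, Miss. Amy']})

# everything before the first comma is the surname
df['LastName'] = df['Name'].str.split(',').str[0]
print(df['LastName'].value_counts())
```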
In [33]:
alldata.drop('Cabin', inplace=True, axis=1)
In [34]:
alldata.Embarked.isnull().sum()
Out[34]:
In [35]:
alldata.Embarked.value_counts()
Out[35]:
We have only two missing values, and the majority of the passengers embarked in Southampton (UK). Absent any other information, I'd guess that the two passengers embarked in Southampton as well.
In [36]:
# use the 'fillna' method
alldata['Embarked'] = alldata['Embarked'].fillna('S') # assigning back avoids chained-assignment pitfalls
In [37]:
alldata.Embarked.isnull().sum()
Out[37]:
In [38]:
cleaned_train = alldata.loc['train', :] #first 891 rows, use multi-index to select
cleaned_test = alldata.loc['test',:] #last 418 rows, same
print(cleaned_train.shape)
print(cleaned_test.shape)
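One caveat before saving: cleaned_test still carries the placeholder 'Survived' column we filled with zeros earlier, and those fake labels shouldn't leak into a model or submission. A sketch of dropping it (using a hypothetical stand-in frame):

```python
import pandas as pd

# stand-in for cleaned_test: the real test data has no true 'Survived' values
cleaned_test = pd.DataFrame({'PassengerId': [892, 893], 'Survived': [0, 0]})

# drop the placeholder column before modeling
cleaned_test = cleaned_test.drop('Survived', axis=1)
print(cleaned_test.columns.tolist())
```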
Saving your work as csv:
In [39]:
cleaned_train.to_csv('../data/cleaned_train.csv', index=False)
cleaned_test.to_csv('../data/cleaned_test.csv', index=False)
or as a pickle:
In [40]:
cleaned_train.to_pickle('../data/cleaned_train.p') # index is not an option for a pickle
cleaned_test.to_pickle('../data/cleaned_test.p')
In [41]:
del cleaned_train
In [42]:
whos
In [43]:
# verify that the pickle looks like the same object
ct = pd.read_pickle('../data/cleaned_train.p')
ct.describe(include='all')
Out[43]:
In [ ]:
# other options - autocomplete below:
ct.to
Pandas plays well with most databases. Form a connection to the database, then write ordinary SQL queries to bring data into Python for analysis. Here is an example with a sqlite3 database (this database is too large to host on github as is, so just follow along for this example).
In [44]:
import sqlite3
In [45]:
con = sqlite3.connect('/Volumes/data/taxis_old/taxis.sqlite3')
In [46]:
pd.read_sql_query('select * from sqlite_master', con)
Out[46]:
In [47]:
# metadata
pd.read_sql_query("select name from sqlite_master where type = 'table'", con)
Out[47]:
In [48]:
names = pd.read_sql_query("select sql from sqlite_master where tbl_name = 'trip_data'", con)
names.iloc[0, 0]
Out[48]:
In [49]:
pd.read_sql_query("select count(*) from trip_data", con) # takes a while
Out[49]:
In [50]:
tips = pd.read_sql_query('select tip_amount from fare_data where tip_amount > 0', con) # takes a while
In [51]:
tips.tip_amount.max()
Out[51]:
In [52]:
tips.describe()
Out[52]:
In [53]:
tips.hist(bins=500)
plt.xlim(0, 20);
In [54]:
whos
In [56]:
tips.info()
In [57]:
tips[tips.tip_amount > 800].shape
Out[57]:
In [ ]:
names = pd.read_sql_query("select sql from sqlite_master where tbl_name = 'fare_data'", con)
names.iloc[0, 0]
In [ ]:
# how much time do you have?
# big_fares = pd.read_sql_query('select * from fare_data ' \
#                               'inner join trip_data on trip_data.pickup_datetime = fare_data.pickup_datetime ' \
#                               'and trip_data.medallion = fare_data.medallion ' \
#                               'and trip_data.hack_license = fare_data.hack_license ' \
#                               'and fare_data.fare_amount > 100 limit 500', con)
In [58]:
con.close() # don't forget to close the connection!
In [59]:
pd.read_sql_query("select count(*) from trip_data", con) # raises an error: the connection is closed
In [60]:
whos
In [ ]: