Part I


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize']=(8,5) # optional
plt.style.use('bmh')  # optional

We're going to investigate the set of data on the passengers of the Titanic. The datasets I'm providing come from the website http://www.kaggle.com/c/titanic. You can download it there - you'll need to sign up and agree to the rules of the competition.


In [2]:
#change the paths as needed
train = pd.read_csv('../data/titanic_train.csv')
test = pd.read_csv('../data/titanic_test.csv')

In [3]:
print(train.info())  # overview of the training data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

In [4]:
print(test.info())  # note no 'Survived' column


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None

In [5]:
train.head()  # first few rows


Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [6]:
print(train.shape)  # 891 rows, 12 columns
print(test.shape)  # 418 rows, 11 columns


(891, 12)
(418, 11)

In [7]:
train.shape[0]


Out[7]:
891

In [8]:
train.describe(include='all')  # summary statistics


Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
count 891.000000 891.000000 891.000000 891 891 714.000000 891.000000 891.000000 891 891.000000 204 889
unique NaN NaN NaN 891 2 NaN NaN NaN 681 NaN 147 3
top NaN NaN NaN Dantcheff, Mr. Ristiu male NaN NaN NaN CA. 2343 NaN B96 B98 S
freq NaN NaN NaN 1 577 NaN NaN NaN 7 NaN 4 644
mean 446.000000 0.383838 2.308642 NaN NaN 29.699118 0.523008 0.381594 NaN 32.204208 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN 14.526497 1.102743 0.806057 NaN 49.693429 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN 0.420000 0.000000 0.000000 NaN 0.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN 20.125000 0.000000 0.000000 NaN 7.910400 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN 38.000000 1.000000 0.000000 NaN 31.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN 80.000000 8.000000 6.000000 NaN 512.329200 NaN NaN

Note that the above summary gives you information about missing values indirectly: for instance, we have 891 observations in the data, but the count for 'Age' is only 714, implying that we don't have the age for 177 passengers. A direct way to get this is by combining the command isnull() with sum():


In [9]:
print(train.isnull().sum())  # the sum is taken columnwise


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Let's head towards developing a simple predictive model for who survives the shipwreck. For now we'll explore and clean the data; later on we'll implement the model. Let's look at the distributions of some of the data.


In [10]:
train.hist(column='Age');



In [11]:
train.hist(column=['Age', 'Fare'], figsize=(16, 6));



In [12]:
pd.crosstab(train['Sex'], train['Survived'], margins=True)  # 'margins' gives total counts


Out[12]:
Survived 0 1 All
Sex
female 81 233 314
male 468 109 577
All 549 342 891

This will be easier to analyze via proportions. Below, 'axis=1' means to apply the sum horizontally ('axis = 0' would mean to apply the sum vertically)


In [13]:
# saving the table for plotting
table = pd.crosstab(train.Sex, train.Survived).apply(lambda x: x/x.sum(), axis=1)  # no use for margins here
table


Out[13]:
Survived 0 1
Sex
female 0.257962 0.742038
male 0.811092 0.188908

In [14]:
train['Survived'].sum() / train.shape[0]


Out[14]:
0.3838383838383838

In [15]:
table.plot(kind='bar', stacked=True);  # another plot call directly from a Pandas object



In [16]:
table.plot(kind='bar');  # without stacking


At this point, we're going to start addressing the missing values in Age, Embarked, Cabin, etc. It is important that any changes we make to 'train' are also made to 'test,' otherwise any predictive model we build will be flawed. Let's make a combined data frame.


In [17]:
test['Survived'] = 0  # create a new 'Survived' column in test, set all values to 0

In [18]:
test.head()


Out[18]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S 0
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 0

In [19]:
alldata = pd.concat([train,test], keys=['train', 'test'])  # tag the training and test data with a multiindex

In [20]:
alldata.tail() #note the multiindex on the rows


Out[20]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
test 413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 0 A.5. 3236
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 0 PC 17758
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 0 SOTON/O.Q. 3101262
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 0 359309
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 0 2668

In [21]:
alldata.shape[0] == 891 + 418


Out[21]:
True

Below are some examples of how we could use the multi-index on the rows.


In [22]:
print(alldata.ix['train', 'Name'])


0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Williams, Mr. Charles Eugene
18     Vander Planke, Mrs. Julius (Emelia Maria Vande...
19                               Masselmani, Mrs. Fatima
20                                  Fynney, Mr. Joseph J
21                                 Beesley, Mr. Lawrence
22                           McGowan, Miss. Anna "Annie"
23                          Sloper, Mr. William Thompson
24                         Palsson, Miss. Torborg Danira
25     Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
26                               Emir, Mr. Farred Chehab
27                        Fortune, Mr. Charles Alexander
28                         O'Dwyer, Miss. Ellen "Nellie"
29                                   Todoroff, Mr. Lalio
                             ...                        
861                          Giles, Mr. Frederick Edward
862    Swift, Mrs. Frederick Joel (Margaret Welles Ba...
863                    Sage, Miss. Dorothy Edith "Dolly"
864                               Gill, Mr. John William
865                             Bystrom, Mrs. (Karolina)
866                         Duran y More, Miss. Asuncion
867                 Roebling, Mr. Washington Augustus II
868                          van Melkebeke, Mr. Philemon
869                      Johnson, Master. Harold Theodor
870                                    Balkic, Mr. Cerin
871     Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
872                             Carlsson, Mr. Frans Olof
873                          Vander Cruyssen, Mr. Victor
874                Abelson, Mrs. Samuel (Hannah Wizosky)
875                     Najib, Miss. Adele Kiamie "Jane"
876                        Gustafsson, Mr. Alfred Ossian
877                                 Petroff, Mr. Nedelio
878                                   Laleff, Mr. Kristo
879        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
880         Shelley, Mrs. William (Imanita Parrish Hall)
881                                   Markun, Mr. Johann
882                         Dahlberg, Miss. Gerda Ulrika
883                        Banfield, Mr. Frederick James
884                               Sutehall, Mr. Henry Jr
885                 Rice, Mrs. William (Margaret Norton)
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, dtype: object

In [23]:
print(alldata.ix[1, 'Name'])
print(alldata.ix[892, 'Name'])  # in test set


Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Wilkes, Mrs. James (Ellen Needs)

In [24]:
print(alldata.ix['test', 'Name'][:5])  # note the numerical indices start at 0


0                                Kelly, Mr. James
1                Wilkes, Mrs. James (Ellen Needs)
2                       Myles, Mr. Thomas Francis
3                                Wirz, Mr. Albert
4    Hirvonen, Mrs. Alexander (Helga E Lindqvist)
Name: Name, dtype: object

To use the .loc method of selecting data instead, you could first select the 'outer' indices, then the 'inner' indices.


In [25]:
print(alldata.loc['test', 'Name'][1])


Wilkes, Mrs. James (Ellen Needs)

In [26]:
sum(alldata.Age.isnull())


Out[26]:
263

In [27]:
alldata.Age.var()  # the variance of the Age column (missing values excluded)


Out[27]:
207.74878655136482

In [28]:
alldata.Age.interpolate(inplace=True)  # fills in missing values with mean column value

In [29]:
sum(alldata.Age.isnull())


Out[29]:
0

This isn't a flawless fix; the downside is that mean imputation will change the mean, may reduce the variance, and may distort the model. On the other hand, no imputation method is perfect and you may get better results than if you did not impute the missing values. Later we'll look at more advanced methods for imputing missing values.


In [30]:
alldata.Age.var()


Out[30]:
189.658152309465

Exercise

How would you handle the missing values for Cabin, Embarked?


In [31]:
# your code here
print(alldata.Embarked.isnull().sum())
print(alldata.Embarked.value_counts())


2
S    914
C    270
Q    123
Name: Embarked, dtype: int64

Solution

Don't look until you've thought about it a bit.

There are a lot of missing Cabin values:


In [32]:
alldata.Cabin.isnull().sum()


Out[32]:
1014

The simplest answer is to not include this column in the model for now. Three-quarters of the information in this column is missing, which makes it difficult to say much. However, it may be possible to infer some of the missing values based on implied relationships between passengers, perhaps by looking at last names, number of siblings on board, and so on. It's worth further investigation, but for now let's leave it.


In [33]:
alldata.drop('Cabin', inplace=True, axis=1)

In [34]:
alldata.Embarked.isnull().sum()


Out[34]:
2

In [35]:
alldata.Embarked.value_counts()


Out[35]:
S    914
C    270
Q    123
Name: Embarked, dtype: int64

We have only two missing values, and the majority of the passengers embarked in Southampton (UK). Absent any other information, I'd guess that the two passengers embarked in Southampton as well.


In [36]:
# use the 'fillna' method
alldata.Embarked.fillna('S', inplace=True)

In [37]:
alldata.Embarked.isnull().sum()


Out[37]:
0

In [38]:
cleaned_train = alldata.loc['train', :] #first 891 rows, use multi-index to select
cleaned_test = alldata.loc['test',:] #last 418 rows, same
print(cleaned_train.shape)
print(cleaned_test.shape)


(891, 11)
(418, 11)

Saving your work as csv:


In [39]:
cleaned_train.to_csv('../data/cleaned_train.csv', index=False)
cleaned_test.to_csv('../data/cleaned_test.csv', index=False)

or as a pickle:


In [40]:
cleaned_train.to_pickle('../data/cleaned_train.p')  # index is not an option for a pickle
cleaned_test.to_pickle('../data/cleaned_test.p')

In [41]:
del cleaned_train

In [42]:
whos


Variable       Type         Data/Info
-------------------------------------
alldata        DataFrame                Age Embarked <...>n[1309 rows x 11 columns]
cleaned_test   DataFrame          Age Embarked      F<...>\n[418 rows x 11 columns]
np             module       <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd             module       <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt            module       <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
table          DataFrame    Survived         0       <...>e      0.811092  0.188908
test           DataFrame         PassengerId  Pclass <...>\n[418 rows x 12 columns]
train          DataFrame         PassengerId  Survive<...>\n[891 rows x 12 columns]

In [43]:
# verify that the pickle looks like the same object
ct = pd.read_pickle('../data/cleaned_train.p')
ct.describe(include='all')


Out[43]:
Age Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
count 891.000000 891 891.000000 891 891.000000 891.000000 891.000000 891 891.000000 891.000000 891
unique NaN 3 NaN 891 NaN NaN NaN 2 NaN NaN 681
top NaN S NaN Dantcheff, Mr. Ristiu NaN NaN NaN male NaN NaN CA. 2343
freq NaN 646 NaN 1 NaN NaN NaN 577 NaN NaN 7
mean 29.726061 NaN 32.204208 NaN 0.381594 446.000000 2.308642 NaN 0.523008 0.383838 NaN
std 13.902353 NaN 49.693429 NaN 0.806057 257.353842 0.836071 NaN 1.102743 0.486592 NaN
min 0.420000 NaN 0.000000 NaN 0.000000 1.000000 1.000000 NaN 0.000000 0.000000 NaN
25% 21.000000 NaN 7.910400 NaN 0.000000 223.500000 2.000000 NaN 0.000000 0.000000 NaN
50% 28.500000 NaN 14.454200 NaN 0.000000 446.000000 3.000000 NaN 0.000000 0.000000 NaN
75% 38.000000 NaN 31.000000 NaN 0.000000 668.500000 3.000000 NaN 1.000000 1.000000 NaN
max 80.000000 NaN 512.329200 NaN 6.000000 891.000000 3.000000 NaN 8.000000 1.000000 NaN

In [ ]:
# other options - autocomplete below:
ct.to

Part II

Pandas plays well with most databases. Form a connection to the database, then write ordinary SQL queries to bring data into Python for analysis. Here is an example with a sqlite3 database (this database is too large to host on github as is, so just follow along for this example).


In [44]:
import sqlite3

In [45]:
con = sqlite3.connect('/Volumes/data/taxis_old/taxis.sqlite3')

In [46]:
pd.read_sql_query('select * from sqlite_master', con)


Out[46]:
type name tbl_name rootpage sql
0 table trip_data trip_data 2 CREATE TABLE trip_data (\nmedallion varchar(40...
1 index pickup_datetime trip_data 33862931 CREATE INDEX pickup_datetime on trip_data(pick...
2 table fare_data fare_data 38828040 CREATE TABLE fare_data (\nmedallion varchar(40...
3 index tip_amount fare_data 63425735 CREATE INDEX tip_amount on fare_data(tip_amount)
4 index fame_amount fare_data 65633919 CREATE INDEX fame_amount on fare_data(fare_amo...
5 index hack_license fare_data 68113447 CREATE INDEX hack_license on fare_data(hack_li...

In [47]:
# metadata
pd.read_sql_query("select name from sqlite_master where type = 'table'", con)


Out[47]:
name
0 trip_data
1 fare_data

In [48]:
names = pd.read_sql_query('select sql from sqlite_master where tbl_name="trip_data"', con)
names.ix[0,0]


Out[48]:
'CREATE TABLE trip_data (\nmedallion varchar(40),\nhack_license varchar(40),\nvendor_id varchar(5),\nrate_code int,\nstore_and_fwd_flag varchar(5),\npickup_datetime timestamp,\ndropoff_datetime timestamp,\npassenger_count int,\ntrip_time_in_sec int,\ntrip_distance numeric,\npickup_longitude numeric,\npickup_latitude numeric,\ndropoff_longitude numeric,\ndropoff_latitude numeric)'

In [49]:
pd.read_sql_query("select count(*) from trip_data", con)  # takes a while


Out[49]:
count(*)
0 173179759

In [50]:
tips = pd.read_sql_query('select tip_amount from fare_data where tip_amount > 0', con)  # takes a while

In [51]:
tips.tip_amount.max()


Out[51]:
888.19000000000005

In [52]:
tips.describe()


Out[52]:
tip_amount
count 9.066210e+07
mean 2.607578e+00
std 2.439769e+00
min 1.000000e-02
25% 1.300000e+00
50% 2.000000e+00
75% 3.000000e+00
max 8.881900e+02

In [53]:
tips.hist(bins=500)
plt.xlim(0, 20);



In [54]:
whos


Variable       Type          Data/Info
--------------------------------------
alldata        DataFrame                 Age Embarked <...>n[1309 rows x 11 columns]
cleaned_test   DataFrame           Age Embarked      F<...>\n[418 rows x 11 columns]
con            Connection    <sqlite3.Connection object at 0x11766c9d0>
ct             DataFrame           Age Embarked      F<...>\n[891 rows x 11 columns]
names          DataFrame                              <...>time on trip_data(pick...
np             module        <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd             module        <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt            module        <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
sqlite3        module        <module 'sqlite3' from '/<...>3.5/sqlite3/__init__.py'>
table          DataFrame     Survived         0       <...>e      0.811092  0.188908
test           DataFrame          PassengerId  Pclass <...>\n[418 rows x 12 columns]
tips           DataFrame               tip_amount\n0  <...>0662100 rows x 1 columns]
train          DataFrame          PassengerId  Survive<...>\n[891 rows x 12 columns]

In [56]:
tips.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90662100 entries, 0 to 90662099
Data columns (total 1 columns):
tip_amount    float64
dtypes: float64(1)
memory usage: 691.7 MB

In [57]:
tips[tips.tip_amount > 800].shape


Out[57]:
(3, 1)

In [ ]:
names = pd.read_sql_query('select sql from sqlite_master where tbl_name="fare_data"', con)
names.ix[0,0]

In [ ]:
# how much time do you have?

# big_fares = pd.read_sql_query('select * from fare_data ' \
#                              'inner join trip_data where trip_data.pickup_datetime = fare_data.pickup_datetime and ' \
#                              'trip_data.medallion = fare_data.medallion and ' \
#                              'trip_data.hack_license = fare_data.medallion and ' \
#                              'fare_data.fare_amount > 100 limit 500', con)

In [58]:
con.close()  # don't forget to close the connection!!!!!

In [59]:
pd.read_sql_query("select count(*) from trip_data", con)


---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
<ipython-input-59-2575b3764c60> in <module>()
----> 1 pd.read_sql_query("select count(*) from trip_data", con)

/Users/jwj2/anaconda/envs/snowflakes/lib/python3.5/site-packages/pandas/io/sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
    429     return pandas_sql.read_query(
    430         sql, index_col=index_col, params=params, coerce_float=coerce_float,
--> 431         parse_dates=parse_dates, chunksize=chunksize)
    432 
    433 

/Users/jwj2/anaconda/envs/snowflakes/lib/python3.5/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1593 
   1594         args = _convert_params(sql, params)
-> 1595         cursor = self.execute(*args)
   1596         columns = [col_desc[0] for col_desc in cursor.description]
   1597 

/Users/jwj2/anaconda/envs/snowflakes/lib/python3.5/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1553             cur = self.con
   1554         else:
-> 1555             cur = self.con.cursor()
   1556         try:
   1557             if kwargs:

ProgrammingError: Cannot operate on a closed database.

In [60]:
whos


Variable       Type          Data/Info
--------------------------------------
alldata        DataFrame                 Age Embarked <...>n[1309 rows x 11 columns]
cleaned_test   DataFrame           Age Embarked      F<...>\n[418 rows x 11 columns]
con            Connection    <sqlite3.Connection object at 0x11766c9d0>
ct             DataFrame           Age Embarked      F<...>\n[891 rows x 11 columns]
names          DataFrame                              <...>time on trip_data(pick...
np             module        <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
pd             module        <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
plt            module        <module 'matplotlib.pyplo<...>es/matplotlib/pyplot.py'>
sqlite3        module        <module 'sqlite3' from '/<...>3.5/sqlite3/__init__.py'>
table          DataFrame     Survived         0       <...>e      0.811092  0.188908
test           DataFrame          PassengerId  Pclass <...>\n[418 rows x 12 columns]
tips           DataFrame               tip_amount\n0  <...>0662100 rows x 1 columns]
train          DataFrame          PassengerId  Survive<...>\n[891 rows x 12 columns]

In [ ]: