Intro to Data Manipulation and Pandas

Dan Tamayo

Material draws from a blog working through the Titanic dataset by Dave Novelli, as well as the Pandas Cookbook by Julia Evans

Follow Along

In a terminal, navigate to the directory where you want to add the MachineLearning folder. Then

cd /path/to/MachineLearning
git pull
cd Day2
source activate ml
jupyter notebook TitanicPandas.ipynb

Loading a Dataset



In [70]:

    
import pandas as pd
df = pd.read_csv('data/train.csv')



In [2]:

    
df.head()









    Out[2]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      NaN
      S



In [3]:

    
df.tail()









    Out[3]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      886
      887
      0
      2
      Montvila, Rev. Juozas
      male
      27
      0
      0
      211536
      13.00
      NaN
      S
    
    
      887
      888
      1
      1
      Graham, Miss. Margaret Edith
      female
      19
      0
      0
      112053
      30.00
      B42
      S
    
    
      888
      889
      0
      3
      Johnston, Miss. Catherine Helen "Carrie"
      female
      NaN
      1
      2
      W./C. 6607
      23.45
      NaN
      S
    
    
      889
      890
      1
      1
      Behr, Mr. Karl Howell
      male
      26
      0
      0
      111369
      30.00
      C148
      C
    
    
      890
      891
      0
      3
      Dooley, Mr. Patrick
      male
      32
      0
      0
      370376
      7.75
      NaN
      Q



In [4]:

    
df.shape









    Out[4]:





(891, 12)

Magic Pandas

Selecting Columns



In [5]:

    
df['Fare'].head()









    Out[5]:





0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64



In [6]:

    
df[['Fare', 'Sex']].head()

Data at a glance



In [7]:

    
df['Sex'].value_counts()









    Out[7]:





male      577
female    314
Name: Sex, dtype: int64



In [8]:

    
df['Age'].median()









    Out[8]:





28.0

Plotting



In [71]:

    
%matplotlib inline
import seaborn
fig = df['Pclass'].hist()



In [72]:

    
fig = df.hist(figsize=(15,5))

Filtering



In [73]:

    
df.head()









    Out[73]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      NaN
      S

Boolean Masks



In [74]:

    
mask = df['Embarked'] == 'C'
mask.head()









    Out[74]:





0    False
1     True
2    False
3    False
4    False
Name: Embarked, dtype: bool

Filtering Dataframes



In [75]:

    
df_filter = df[mask] # df_filter = df[df['Embarked'] == 'C']
df_filter.head()









    Out[75]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      9
      10
      1
      2
      Nasser, Mrs. Nicholas (Adele Achem)
      female
      14
      1
      0
      237736
      30.0708
      NaN
      C
    
    
      19
      20
      1
      3
      Masselmani, Mrs. Fatima
      female
      NaN
      0
      0
      2649
      7.2250
      NaN
      C
    
    
      26
      27
      0
      3
      Emir, Mr. Farred Chehab
      male
      NaN
      0
      0
      2631
      7.2250
      NaN
      C
    
    
      30
      31
      0
      1
      Uruchurtu, Don. Manuel E
      male
      40
      0
      0
      PC 17601
      27.7208
      NaN
      C

Memory Issues



In [81]:

    
df_filter = df.loc[df['Embarked'] == 'C']
df_filter.head()









    Out[81]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      9
      10
      1
      2
      Nasser, Mrs. Nicholas (Adele Achem)
      female
      14
      1
      0
      237736
      30.0708
      NaN
      C
    
    
      19
      20
      1
      3
      Masselmani, Mrs. Fatima
      female
      NaN
      0
      0
      2649
      7.2250
      NaN
      C
    
    
      26
      27
      0
      3
      Emir, Mr. Farred Chehab
      male
      NaN
      0
      0
      2631
      7.2250
      NaN
      C
    
    
      30
      31
      0
      1
      Uruchurtu, Don. Manuel E
      male
      40
      0
      0
      PC 17601
      27.7208
      NaN
      C

Selecting specific columns

df.loc[criterion, columns]



In [82]:

    
test = df.loc[df['Age'] > 30., ['Age', 'Fare', 'Sex']]
test.head()

Combining criteria / columns



In [83]:

    
test = df.loc[(df['Age'] > 30.) & (df['Fare'] < 50.), 'Age':'Fare']
test.head()

Requirements for working with data in scikit-learn

Features should not have missing values
Features and response should be numeric
Features and response should be NumPy arrays
Features and response are separate objects
Features and response should have specific shapes

Requirements for working with data in scikit-learn

Features should not have missing values
Features and response should be numeric
Features and response should be NumPy arrays
Features and response are separate objects
Features and response should have specific shapes

How to Find Valid Values



In [84]:

    
df.head()









    Out[84]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      NaN
      S

Wrong way:



In [90]:

    
import numpy as np
df[df['Cabin']!=np.nan]









    Out[90]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      NaN
      S
    
    
      5
      6
      0
      3
      Moran, Mr. James
      male
      NaN
      0
      0
      330877
      8.4583
      NaN
      Q
    
    
      6
      7
      0
      1
      McCarthy, Mr. Timothy J
      male
      54
      0
      0
      17463
      51.8625
      E46
      S
    
    
      7
      8
      0
      3
      Palsson, Master. Gosta Leonard
      male
      2
      3
      1
      349909
      21.0750
      NaN
      S
    
    
      8
      9
      1
      3
      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
      female
      27
      0
      2
      347742
      11.1333
      NaN
      S
    
    
      9
      10
      1
      2
      Nasser, Mrs. Nicholas (Adele Achem)
      female
      14
      1
      0
      237736
      30.0708
      NaN
      C
    
    
      10
      11
      1
      3
      Sandstrom, Miss. Marguerite Rut
      female
      4
      1
      1
      PP 9549
      16.7000
      G6
      S
    
    
      11
      12
      1
      1
      Bonnell, Miss. Elizabeth
      female
      58
      0
      0
      113783
      26.5500
      C103
      S
    
    
      12
      13
      0
      3
      Saundercock, Mr. William Henry
      male
      20
      0
      0
      A/5. 2151
      8.0500
      NaN
      S
    
    
      13
      14
      0
      3
      Andersson, Mr. Anders Johan
      male
      39
      1
      5
      347082
      31.2750
      NaN
      S
    
    
      14
      15
      0
      3
      Vestrom, Miss. Hulda Amanda Adolfina
      female
      14
      0
      0
      350406
      7.8542
      NaN
      S
    
    
      15
      16
      1
      2
      Hewlett, Mrs. (Mary D Kingcome)
      female
      55
      0
      0
      248706
      16.0000
      NaN
      S
    
    
      16
      17
      0
      3
      Rice, Master. Eugene
      male
      2
      4
      1
      382652
      29.1250
      NaN
      Q
    
    
      17
      18
      1
      2
      Williams, Mr. Charles Eugene
      male
      NaN
      0
      0
      244373
      13.0000
      NaN
      S
    
    
      18
      19
      0
      3
      Vander Planke, Mrs. Julius (Emelia Maria Vande...
      female
      31
      1
      0
      345763
      18.0000
      NaN
      S
    
    
      19
      20
      1
      3
      Masselmani, Mrs. Fatima
      female
      NaN
      0
      0
      2649
      7.2250
      NaN
      C
    
    
      20
      21
      0
      2
      Fynney, Mr. Joseph J
      male
      35
      0
      0
      239865
      26.0000
      NaN
      S
    
    
      21
      22
      1
      2
      Beesley, Mr. Lawrence
      male
      34
      0
      0
      248698
      13.0000
      D56
      S
    
    
      22
      23
      1
      3
      McGowan, Miss. Anna "Annie"
      female
      15
      0
      0
      330923
      8.0292
      NaN
      Q
    
    
      23
      24
      1
      1
      Sloper, Mr. William Thompson
      male
      28
      0
      0
      113788
      35.5000
      A6
      S
    
    
      24
      25
      0
      3
      Palsson, Miss. Torborg Danira
      female
      8
      3
      1
      349909
      21.0750
      NaN
      S
    
    
      25
      26
      1
      3
      Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
      female
      38
      1
      5
      347077
      31.3875
      NaN
      S
    
    
      26
      27
      0
      3
      Emir, Mr. Farred Chehab
      male
      NaN
      0
      0
      2631
      7.2250
      NaN
      C
    
    
      27
      28
      0
      1
      Fortune, Mr. Charles Alexander
      male
      19
      3
      2
      19950
      263.0000
      C23 C25 C27
      S
    
    
      28
      29
      1
      3
      O'Dwyer, Miss. Ellen "Nellie"
      female
      NaN
      0
      0
      330959
      7.8792
      NaN
      Q
    
    
      29
      30
      0
      3
      Todoroff, Mr. Lalio
      male
      NaN
      0
      0
      349216
      7.8958
      NaN
      S
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      861
      862
      0
      2
      Giles, Mr. Frederick Edward
      male
      21
      1
      0
      28134
      11.5000
      NaN
      S
    
    
      862
      863
      1
      1
      Swift, Mrs. Frederick Joel (Margaret Welles Ba...
      female
      48
      0
      0
      17466
      25.9292
      D17
      S
    
    
      863
      864
      0
      3
      Sage, Miss. Dorothy Edith "Dolly"
      female
      NaN
      8
      2
      CA. 2343
      69.5500
      NaN
      S
    
    
      864
      865
      0
      2
      Gill, Mr. John William
      male
      24
      0
      0
      233866
      13.0000
      NaN
      S
    
    
      865
      866
      1
      2
      Bystrom, Mrs. (Karolina)
      female
      42
      0
      0
      236852
      13.0000
      NaN
      S
    
    
      866
      867
      1
      2
      Duran y More, Miss. Asuncion
      female
      27
      1
      0
      SC/PARIS 2149
      13.8583
      NaN
      C
    
    
      867
      868
      0
      1
      Roebling, Mr. Washington Augustus II
      male
      31
      0
      0
      PC 17590
      50.4958
      A24
      S
    
    
      868
      869
      0
      3
      van Melkebeke, Mr. Philemon
      male
      NaN
      0
      0
      345777
      9.5000
      NaN
      S
    
    
      869
      870
      1
      3
      Johnson, Master. Harold Theodor
      male
      4
      1
      1
      347742
      11.1333
      NaN
      S
    
    
      870
      871
      0
      3
      Balkic, Mr. Cerin
      male
      26
      0
      0
      349248
      7.8958
      NaN
      S
    
    
      871
      872
      1
      1
      Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
      female
      47
      1
      1
      11751
      52.5542
      D35
      S
    
    
      872
      873
      0
      1
      Carlsson, Mr. Frans Olof
      male
      33
      0
      0
      695
      5.0000
      B51 B53 B55
      S
    
    
      873
      874
      0
      3
      Vander Cruyssen, Mr. Victor
      male
      47
      0
      0
      345765
      9.0000
      NaN
      S
    
    
      874
      875
      1
      2
      Abelson, Mrs. Samuel (Hannah Wizosky)
      female
      28
      1
      0
      P/PP 3381
      24.0000
      NaN
      C
    
    
      875
      876
      1
      3
      Najib, Miss. Adele Kiamie "Jane"
      female
      15
      0
      0
      2667
      7.2250
      NaN
      C
    
    
      876
      877
      0
      3
      Gustafsson, Mr. Alfred Ossian
      male
      20
      0
      0
      7534
      9.8458
      NaN
      S
    
    
      877
      878
      0
      3
      Petroff, Mr. Nedelio
      male
      19
      0
      0
      349212
      7.8958
      NaN
      S
    
    
      878
      879
      0
      3
      Laleff, Mr. Kristo
      male
      NaN
      0
      0
      349217
      7.8958
      NaN
      S
    
    
      879
      880
      1
      1
      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
      female
      56
      0
      1
      11767
      83.1583
      C50
      C
    
    
      880
      881
      1
      2
      Shelley, Mrs. William (Imanita Parrish Hall)
      female
      25
      0
      1
      230433
      26.0000
      NaN
      S
    
    
      881
      882
      0
      3
      Markun, Mr. Johann
      male
      33
      0
      0
      349257
      7.8958
      NaN
      S
    
    
      882
      883
      0
      3
      Dahlberg, Miss. Gerda Ulrika
      female
      22
      0
      0
      7552
      10.5167
      NaN
      S
    
    
      883
      884
      0
      2
      Banfield, Mr. Frederick James
      male
      28
      0
      0
      C.A./SOTON 34068
      10.5000
      NaN
      S
    
    
      884
      885
      0
      3
      Sutehall, Mr. Henry Jr
      male
      25
      0
      0
      SOTON/OQ 392076
      7.0500
      NaN
      S
    
    
      885
      886
      0
      3
      Rice, Mrs. William (Margaret Norton)
      female
      39
      0
      5
      382652
      29.1250
      NaN
      Q
    
    
      886
      887
      0
      2
      Montvila, Rev. Juozas
      male
      27
      0
      0
      211536
      13.0000
      NaN
      S
    
    
      887
      888
      1
      1
      Graham, Miss. Margaret Edith
      female
      19
      0
      0
      112053
      30.0000
      B42
      S
    
    
      888
      889
      0
      3
      Johnston, Miss. Catherine Helen "Carrie"
      female
      NaN
      1
      2
      W./C. 6607
      23.4500
      NaN
      S
    
    
      889
      890
      1
      1
      Behr, Mr. Karl Howell
      male
      26
      0
      0
      111369
      30.0000
      C148
      C
    
    
      890
      891
      0
      3
      Dooley, Mr. Patrick
      male
      32
      0
      0
      370376
      7.7500
      NaN
      Q
    
  

891 rows × 12 columns

Right Way:



In [91]:

    
df.loc[df['Cabin'].notnull()]









    Out[91]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      6
      7
      0
      1
      McCarthy, Mr. Timothy J
      male
      54.0
      0
      0
      17463
      51.8625
      E46
      S
    
    
      10
      11
      1
      3
      Sandstrom, Miss. Marguerite Rut
      female
      4.0
      1
      1
      PP 9549
      16.7000
      G6
      S
    
    
      11
      12
      1
      1
      Bonnell, Miss. Elizabeth
      female
      58.0
      0
      0
      113783
      26.5500
      C103
      S
    
    
      21
      22
      1
      2
      Beesley, Mr. Lawrence
      male
      34.0
      0
      0
      248698
      13.0000
      D56
      S
    
    
      23
      24
      1
      1
      Sloper, Mr. William Thompson
      male
      28.0
      0
      0
      113788
      35.5000
      A6
      S
    
    
      27
      28
      0
      1
      Fortune, Mr. Charles Alexander
      male
      19.0
      3
      2
      19950
      263.0000
      C23 C25 C27
      S
    
    
      31
      32
      1
      1
      Spencer, Mrs. William Augustus (Marie Eugenie)
      female
      NaN
      1
      0
      PC 17569
      146.5208
      B78
      C
    
    
      52
      53
      1
      1
      Harper, Mrs. Henry Sleeper (Myna Haxtun)
      female
      49.0
      1
      0
      PC 17572
      76.7292
      D33
      C
    
    
      54
      55
      0
      1
      Ostby, Mr. Engelhart Cornelius
      male
      65.0
      0
      1
      113509
      61.9792
      B30
      C
    
    
      55
      56
      1
      1
      Woolner, Mr. Hugh
      male
      NaN
      0
      0
      19947
      35.5000
      C52
      S
    
    
      61
      62
      1
      1
      Icard, Miss. Amelie
      female
      38.0
      0
      0
      113572
      80.0000
      B28
      NaN
    
    
      62
      63
      0
      1
      Harris, Mr. Henry Birkhardt
      male
      45.0
      1
      0
      36973
      83.4750
      C83
      S
    
    
      66
      67
      1
      2
      Nye, Mrs. (Elizabeth Ramell)
      female
      29.0
      0
      0
      C.A. 29395
      10.5000
      F33
      S
    
    
      75
      76
      0
      3
      Moen, Mr. Sigurd Hansen
      male
      25.0
      0
      0
      348123
      7.6500
      F G73
      S
    
    
      88
      89
      1
      1
      Fortune, Miss. Mabel Helen
      female
      23.0
      3
      2
      19950
      263.0000
      C23 C25 C27
      S
    
    
      92
      93
      0
      1
      Chaffee, Mr. Herbert Fuller
      male
      46.0
      1
      0
      W.E.P. 5734
      61.1750
      E31
      S
    
    
      96
      97
      0
      1
      Goldschmidt, Mr. George B
      male
      71.0
      0
      0
      PC 17754
      34.6542
      A5
      C
    
    
      97
      98
      1
      1
      Greenfield, Mr. William Bertram
      male
      23.0
      0
      1
      PC 17759
      63.3583
      D10 D12
      C
    
    
      102
      103
      0
      1
      White, Mr. Richard Frasar
      male
      21.0
      0
      1
      35281
      77.2875
      D26
      S
    
    
      110
      111
      0
      1
      Porter, Mr. Walter Chamberlain
      male
      47.0
      0
      0
      110465
      52.0000
      C110
      S
    
    
      118
      119
      0
      1
      Baxter, Mr. Quigg Edmond
      male
      24.0
      0
      1
      PC 17558
      247.5208
      B58 B60
      C
    
    
      123
      124
      1
      2
      Webber, Miss. Susan
      female
      32.5
      0
      0
      27267
      13.0000
      E101
      S
    
    
      124
      125
      0
      1
      White, Mr. Percival Wayland
      male
      54.0
      0
      1
      35281
      77.2875
      D26
      S
    
    
      128
      129
      1
      3
      Peter, Miss. Anna
      female
      NaN
      1
      1
      2668
      22.3583
      F E69
      C
    
    
      136
      137
      1
      1
      Newsom, Miss. Helen Monypeny
      female
      19.0
      0
      2
      11752
      26.2833
      D47
      S
    
    
      137
      138
      0
      1
      Futrelle, Mr. Jacques Heath
      male
      37.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      139
      140
      0
      1
      Giglio, Mr. Victor
      male
      24.0
      0
      0
      PC 17593
      79.2000
      B86
      C
    
    
      148
      149
      0
      2
      Navratil, Mr. Michel ("Louis M Hoffman")
      male
      36.5
      0
      2
      230080
      26.0000
      F2
      S
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      751
      752
      1
      3
      Moor, Master. Meier
      male
      6.0
      0
      1
      392096
      12.4750
      E121
      S
    
    
      759
      760
      1
      1
      Rothes, the Countess. of (Lucy Noel Martha Dye...
      female
      33.0
      0
      0
      110152
      86.5000
      B77
      S
    
    
      763
      764
      1
      1
      Carter, Mrs. William Ernest (Lucile Polk)
      female
      36.0
      1
      2
      113760
      120.0000
      B96 B98
      S
    
    
      765
      766
      1
      1
      Hogeboom, Mrs. John C (Anna Andrews)
      female
      51.0
      1
      0
      13502
      77.9583
      D11
      S
    
    
      772
      773
      0
      2
      Mack, Mrs. (Mary)
      female
      57.0
      0
      0
      S.O./P.P. 3
      10.5000
      E77
      S
    
    
      776
      777
      0
      3
      Tobin, Mr. Roger
      male
      NaN
      0
      0
      383121
      7.7500
      F38
      Q
    
    
      779
      780
      1
      1
      Robert, Mrs. Edward Scott (Elisabeth Walton Mc...
      female
      43.0
      0
      1
      24160
      211.3375
      B3
      S
    
    
      781
      782
      1
      1
      Dick, Mrs. Albert Adrian (Vera Gillespie)
      female
      17.0
      1
      0
      17474
      57.0000
      B20
      S
    
    
      782
      783
      0
      1
      Long, Mr. Milton Clyde
      male
      29.0
      0
      0
      113501
      30.0000
      D6
      S
    
    
      789
      790
      0
      1
      Guggenheim, Mr. Benjamin
      male
      46.0
      0
      0
      PC 17593
      79.2000
      B82 B84
      C
    
    
      796
      797
      1
      1
      Leader, Dr. Alice (Farnham)
      female
      49.0
      0
      0
      17465
      25.9292
      D17
      S
    
    
      802
      803
      1
      1
      Carter, Master. William Thornton II
      male
      11.0
      1
      2
      113760
      120.0000
      B96 B98
      S
    
    
      806
      807
      0
      1
      Andrews, Mr. Thomas Jr
      male
      39.0
      0
      0
      112050
      0.0000
      A36
      S
    
    
      809
      810
      1
      1
      Chambers, Mrs. Norman Campbell (Bertha Griggs)
      female
      33.0
      1
      0
      113806
      53.1000
      E8
      S
    
    
      815
      816
      0
      1
      Fry, Mr. Richard
      male
      NaN
      0
      0
      112058
      0.0000
      B102
      S
    
    
      820
      821
      1
      1
      Hays, Mrs. Charles Melville (Clara Jennings Gr...
      female
      52.0
      1
      1
      12749
      93.5000
      B69
      S
    
    
      823
      824
      1
      3
      Moor, Mrs. (Beila)
      female
      27.0
      0
      1
      392096
      12.4750
      E121
      S
    
    
      829
      830
      1
      1
      Stone, Mrs. George Nelson (Martha Evelyn)
      female
      62.0
      0
      0
      113572
      80.0000
      B28
      NaN
    
    
      835
      836
      1
      1
      Compton, Miss. Sara Rebecca
      female
      39.0
      1
      1
      PC 17756
      83.1583
      E49
      C
    
    
      839
      840
      1
      1
      Marechal, Mr. Pierre
      male
      NaN
      0
      0
      11774
      29.7000
      C47
      C
    
    
      849
      850
      1
      1
      Goldenberg, Mrs. Samuel L (Edwiga Grabowska)
      female
      NaN
      1
      0
      17453
      89.1042
      C92
      C
    
    
      853
      854
      1
      1
      Lines, Miss. Mary Conover
      female
      16.0
      0
      1
      PC 17592
      39.4000
      D28
      S
    
    
      857
      858
      1
      1
      Daly, Mr. Peter Denis
      male
      51.0
      0
      0
      113055
      26.5500
      E17
      S
    
    
      862
      863
      1
      1
      Swift, Mrs. Frederick Joel (Margaret Welles Ba...
      female
      48.0
      0
      0
      17466
      25.9292
      D17
      S
    
    
      867
      868
      0
      1
      Roebling, Mr. Washington Augustus II
      male
      31.0
      0
      0
      PC 17590
      50.4958
      A24
      S
    
    
      871
      872
      1
      1
      Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
      female
      47.0
      1
      1
      11751
      52.5542
      D35
      S
    
    
      872
      873
      0
      1
      Carlsson, Mr. Frans Olof
      male
      33.0
      0
      0
      695
      5.0000
      B51 B53 B55
      S
    
    
      879
      880
      1
      1
      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
      female
      56.0
      0
      1
      11767
      83.1583
      C50
      C
    
    
      887
      888
      1
      1
      Graham, Miss. Margaret Edith
      female
      19.0
      0
      0
      112053
      30.0000
      B42
      S
    
    
      889
      890
      1
      1
      Behr, Mr. Karl Howell
      male
      26.0
      0
      0
      111369
      30.0000
      C148
      C
    
  

204 rows × 12 columns

Approach 1: Throw out any rows with missing data values



In [92]:

    
df.notnull().head()









    Out[92]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      False
      True
    
    
      1
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
    
    
      2
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      False
      True
    
    
      3
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
    
    
      4
      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
      False
      True



In [93]:

    
df_filter = df.loc[df.notnull().all(axis=1)]

How much data is left?



In [94]:

    
df_filter.shape









    Out[94]:





(183, 12)

Approach 2: Assign missing identifier



In [105]:

    
df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
df.head()









    Out[105]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      U0
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      U0
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      U0
      S

Approach 3: Assign the average/median/mode value



In [106]:

    
df.loc[df['Fare'].isnull()].shape









    Out[106]:





(0, 12)



In [107]:

    
df['Fare'].median()









    Out[107]:





14.4542



In [108]:

    
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()

Categorical Variables



In [109]:

    
df.loc[df['Embarked'].isnull()].shape









    Out[109]:





(0, 12)



In [111]:

    
df['Embarked'].mode()









    Out[111]:





0    S
dtype: object



In [112]:

    
df['Embarked'].mode()[0]









    Out[112]:





'S'



In [113]:

    
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]

Approach 4: Fit a regression model to predict missing values



In [114]:

    
for col in df.columns:
    print("NaNs in column {0} = {1}".format(col, df.loc[df[col].isnull()].shape[0]))









    



NaNs in column PassengerId = 0
NaNs in column Survived = 0
NaNs in column Pclass = 0
NaNs in column Name = 0
NaNs in column Sex = 0
NaNs in column Age = 0
NaNs in column SibSp = 0
NaNs in column Parch = 0
NaNs in column Ticket = 0
NaNs in column Fare = 0
NaNs in column Cabin = 0
NaNs in column Embarked = 0



In [115]:

    
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()

Requirements for working with data in scikit-learn

Features should not have missing values
Features and response should be numeric
Features and response should be NumPy arrays
Features and response are separate objects
Features and response should have specific shapes



In [116]:

    
df.head()









    Out[116]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      U0
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      U0
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      U0
      S

Approach 1: Creating Dummy (Binary) Variables



In [117]:

    
print(df['Embarked'].unique())









    



['S' 'C' 'Q']



In [118]:

    
dummies_df = pd.get_dummies(df['Embarked'])
dummies_df.head()



In [119]:

    
def addEmbarked(name):
    return 'Embarked_' + name
dummies_df = dummies_df.rename(columns=addEmbarked)
dummies_df.head()



In [120]:

    
df = pd.concat([df, dummies_df], axis=1)
df.head()









    Out[120]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Embarked_C
      Embarked_Q
      Embarked_S
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      U0
      S
      0
      0
      1
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
      1
      0
      0
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      U0
      S
      0
      0
      1
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
      0
      0
      1
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      U0
      S
      0
      0
      1

Approach 2: Factorizing (Make single multi-class feature)



In [121]:

    
df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
df.head(6)









    Out[121]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.000000
      1
      0
      A/5 21171
      7.2500
      U0
      S
      0
      0
      1
      0
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.000000
      1
      0
      PC 17599
      71.2833
      C85
      C
      1
      0
      0
      1
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.000000
      0
      0
      STON/O2. 3101282
      7.9250
      U0
      S
      0
      0
      1
      0
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.000000
      1
      0
      113803
      53.1000
      C123
      S
      0
      0
      1
      0
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.000000
      0
      0
      373450
      8.0500
      U0
      S
      0
      0
      1
      0
    
    
      5
      6
      0
      3
      Moran, Mr. James
      male
      29.699118
      0
      0
      330877
      8.4583
      U0
      Q
      0
      1
      0
      2

What are the classes?



In [122]:

    
pd.factorize(df['Sex'])









    Out[122]:





(array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
        0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
        0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
        0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
        0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
        0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0,
        1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
        1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
        1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
        0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0]),
 Index([u'male', u'female'], dtype='object'))



In [123]:

    
df['Female'] = pd.factorize(df['Sex'])[0]
df.head()









    Out[123]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
      Female
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      U0
      S
      0
      0
      1
      0
      0
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
      1
      0
      0
      1
      1
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      U0
      S
      0
      0
      1
      0
      1
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
      0
      0
      1
      0
      1
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      U0
      S
      0
      0
      1
      0
      0

Continuous Features: Feature Scaling



In [124]:

    
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[['Age', 'Fare']]) # .fit(df) to scale all numerical columns
print("Means = {0}".format(scaler.mean_))
print("Stdevs = {0}".format(scaler.scale_))
df[['Age', 'Fare']] = scaler.transform(df[['Age', 'Fare']])
df.head()









    



Means = [ 29.69911765  32.20420797]
Stdevs = [ 12.99471687  49.66553444]






    Out[124]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
      Female
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      -0.592481
      1
      0
      A/5 21171
      -0.502445
      U0
      S
      0
      0
      1
      0
      0
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      0.638789
      1
      0
      PC 17599
      0.786845
      C85
      C
      1
      0
      0
      1
      1
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      -0.284663
      0
      0
      STON/O2. 3101282
      -0.488854
      U0
      S
      0
      0
      1
      0
      1
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      0.407926
      1
      0
      113803
      0.420730
      C123
      S
      0
      0
      1
      0
      1
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      0.407926
      0
      0
      373450
      -0.486337
      U0
      S
      0
      0
      1
      0
      0

Requirements for working with data in scikit-learn

Features should not have missing values
Features and response should be numeric
Features and response should be NumPy arrays
Features and response are separate objects
Features and response should have specific shapes

Underneath, Pandas series are numpy arrays



In [125]:

    
fares = df['Fare'].values
type(fares)









    Out[125]:





numpy.ndarray

Getting a pipeline ready for sklearn

Combining Data Frames



In [126]:

    
input_df = pd.read_csv('data/train.csv')
submit_df = pd.read_csv('data/test.csv')

print(input_df.shape)
print(submit_df.shape)









    



(891, 12)
(418, 11)



In [127]:

    
submit_df.head()









    Out[127]:






  
    
      
      PassengerId
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      892
      3
      Kelly, Mr. James
      male
      34.5
      0
      0
      330911
      7.8292
      NaN
      Q
    
    
      1
      893
      3
      Wilkes, Mrs. James (Ellen Needs)
      female
      47.0
      1
      0
      363272
      7.0000
      NaN
      S
    
    
      2
      894
      2
      Myles, Mr. Thomas Francis
      male
      62.0
      0
      0
      240276
      9.6875
      NaN
      Q
    
    
      3
      895
      3
      Wirz, Mr. Albert
      male
      27.0
      0
      0
      315154
      8.6625
      NaN
      S
    
    
      4
      896
      3
      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
      female
      22.0
      1
      1
      3101298
      12.2875
      NaN
      S



In [128]:

    
input_df.head()









    Out[128]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35
      0
      0
      373450
      8.0500
      NaN
      S

Indexing



In [129]:

    
submit_df = pd.read_csv('data/test.csv', index_col=0)
submit_df.head()









    Out[129]:






  
    
      
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      892
      3
      Kelly, Mr. James
      male
      34.5
      0
      0
      330911
      7.8292
      NaN
      Q
    
    
      893
      3
      Wilkes, Mrs. James (Ellen Needs)
      female
      47.0
      1
      0
      363272
      7.0000
      NaN
      S
    
    
      894
      2
      Myles, Mr. Thomas Francis
      male
      62.0
      0
      0
      240276
      9.6875
      NaN
      Q
    
    
      895
      3
      Wirz, Mr. Albert
      male
      27.0
      0
      0
      315154
      8.6625
      NaN
      S
    
    
      896
      3
      Hirvonen, Mrs. Alexander (Helga E Lindqvist)
      female
      22.0
      1
      1
      3101298
      12.2875
      NaN
      S



In [130]:

    
input_df = pd.read_csv('data/train.csv', index_col=0)
submit_df = pd.read_csv('data/test.csv', index_col=0)
df = pd.concat([input_df, submit_df])
df.tail()









    Out[130]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1305
      NaN
      NaN
      S
      8.0500
      Spector, Mr. Woolf
      0
      3
      male
      0
      NaN
      A.5. 3236
    
    
      1306
      39.0
      C105
      C
      108.9000
      Oliva y Ocana, Dona. Fermina
      0
      1
      female
      0
      NaN
      PC 17758
    
    
      1307
      38.5
      NaN
      S
      7.2500
      Saether, Mr. Simon Sivertsen
      0
      3
      male
      0
      NaN
      SOTON/O.Q. 3101262
    
    
      1308
      NaN
      NaN
      S
      8.0500
      Ware, Mr. Frederick
      0
      3
      male
      0
      NaN
      359309
    
    
      1309
      NaN
      NaN
      C
      22.3583
      Peter, Master. Michael J
      1
      3
      male
      1
      NaN
      2668



In [131]:

    
print(df.shape[1], "columns")
print(df.shape[0], "rows")
print(df.columns.values)









    



(11, 'columns')
(1309, 'rows')
['Age' 'Cabin' 'Embarked' 'Fare' 'Name' 'Parch' 'Pclass' 'Sex' 'SibSp'
 'Survived' 'Ticket']

Putting it all together



In [49]:

    
def process_data(df):
    df['Female'] = pd.factorize(df['Sex'])[0]
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()
    df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]
    
    dummies_df = pd.get_dummies(df['Embarked'])
    def addEmbarked(name):
        return 'Embarked_' + name
    dummies_df = dummies_df.rename(columns=addEmbarked)
    df = pd.concat([df, dummies_df], axis=1)
    df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
    
    return df



In [50]:

    
df = process_data(df)
df.tail()









    Out[50]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      Female
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1305
      29.881138
      U0
      S
      8.0500
      Spector, Mr. Woolf
      0
      3
      male
      0
      NaN
      A.5. 3236
      0
      0
      0
      1
      0
    
    
      1306
      39.000000
      C105
      C
      108.9000
      Oliva y Ocana, Dona. Fermina
      0
      1
      female
      0
      NaN
      PC 17758
      1
      1
      0
      0
      1
    
    
      1307
      38.500000
      U0
      S
      7.2500
      Saether, Mr. Simon Sivertsen
      0
      3
      male
      0
      NaN
      SOTON/O.Q. 3101262
      0
      0
      0
      1
      0
    
    
      1308
      29.881138
      U0
      S
      8.0500
      Ware, Mr. Frederick
      0
      3
      male
      0
      NaN
      359309
      0
      0
      0
      1
      0
    
    
      1309
      29.881138
      U0
      C
      22.3583
      Peter, Master. Michael J
      1
      3
      male
      1
      NaN
      2668
      0
      1
      0
      0
      1



In [51]:

    
features = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Female', 'EmbarkedNum']
df_test = df.loc[df['Survived'].isnull(), features]
df_train = df.loc[df['Survived'].notnull(), features+['Survived']]
df_train.head()









    Out[51]:






  
    
      
      Age
      Fare
      Parch
      Pclass
      SibSp
      Female
      EmbarkedNum
      Survived
    
    
      PassengerId
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22
      7.2500
      0
      3
      1
      0
      0
      0
    
    
      2
      38
      71.2833
      0
      1
      1
      1
      1
      1
    
    
      3
      26
      7.9250
      0
      3
      0
      1
      0
      1
    
    
      4
      35
      53.1000
      0
      1
      1
      1
      0
      1
    
    
      5
      35
      8.0500
      0
      3
      0
      0
      0
      0

Requirements for working with data in scikit-learn

Features should not have missing values
Features and response should be numeric
Features and response should be NumPy arrays
Features and response are separate objects
Features and response should have specific shapes



In [52]:

    
X_train = df_train[features].values
y_train = df_train['Survived'].values
print(X_train[0:5])
print(y_train[0:5])
print("X has {0} rows".format(X_train.shape[0]))
print("y has {0} rows".format(y_train.shape[0]))









    



[[ 22.       7.25     0.       3.       1.       0.       0.    ]
 [ 38.      71.2833   0.       1.       1.       1.       1.    ]
 [ 26.       7.925    0.       3.       0.       1.       0.    ]
 [ 35.      53.1      0.       1.       1.       1.       0.    ]
 [ 35.       8.05     0.       3.       0.       0.       0.    ]]
[ 0.  1.  1.  1.  0.]
X has 891 rows
y has 891 rows

Feature Engineering

Parsing Alphanumeric Features



In [53]:

    
df['Cabin'].unique()









    Out[53]:





array(['U0', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87',
       'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D',
       'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7',
       'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128', 'D37',
       'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111',
       'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30', 'D20',
       'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22', 'C86',
       'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20', 'D19',
       'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126', 'B71',
       'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64', 'E24', 'C90',
       'C45', 'E8', 'B101', 'D45', 'C46', 'D30', 'E121', 'D11', 'E77',
       'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36', 'B102', 'B69', 'E49',
       'C47', 'D28', 'E17', 'A24', 'C50', 'B42', 'C148', 'B45', 'B36',
       'A21', 'D34', 'A9', 'C31', 'B61', 'C53', 'D43', 'C130', 'C132',
       'C55 C57', 'C116', 'F', 'A29', 'C6', 'C28', 'C51', 'C97', 'D22',
       'B10', 'E45', 'E52', 'A11', 'B11', 'C80', 'C89', 'F E46', 'B26',
       'F E57', 'A18', 'E60', 'E39 E41', 'B52 B54 B56', 'C39', 'B24',
       'D40', 'D38', 'C105'], dtype=object)

Regular Expressions



In [54]:

    
import re
def getDeck(cabin):
    match = re.search("([A-Z])", cabin)
    return match.group(1) if match is not None else None
def getCabinNum(cabin):
    match = re.search("([0-9]+)", cabin)
    return match.group(1) if match is not None else None
print(getDeck('C237'))
print(getCabinNum('C237'))

Apply a function to all rows to generate a new feature



In [55]:

    
df['Deck'] = df['Cabin'].map(getDeck)
df['CabinNum'] = df['Cabin'].map(getCabinNum)
df.head()









    Out[55]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      Female
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
      Deck
      CabinNum
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22
      U0
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      A/5 21171
      0
      0
      0
      1
      0
      U
      0
    
    
      2
      38
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      PC 17599
      1
      1
      0
      0
      1
      C
      85
    
    
      3
      26
      U0
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      STON/O2. 3101282
      1
      0
      0
      1
      0
      U
      0
    
    
      4
      35
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      113803
      1
      0
      0
      1
      0
      C
      123
    
    
      5
      35
      U0
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      373450
      0
      0
      0
      1
      0
      U
      0



In [56]:

    
df['CabinNum'].isnull().value_counts()









    Out[56]:





False    1303
True        6
Name: CabinNum, dtype: int64



In [57]:

    
df.loc[df['CabinNum'].isnull(), 'CabinNum'] = 0



In [58]:

    
df['Deck'].isnull().value_counts()









    Out[58]:





False    1309
Name: Deck, dtype: int64



In [59]:

    
df['DeckNum'] = pd.factorize(df['Deck'])[0]

What to do with the name?

Number of names



In [60]:

    
testname = df.loc[1, 'Name']
print(testname)









    



Braund, Mr. Owen Harris



In [61]:

    
re.split(' ', testname)









    Out[61]:





['Braund,', 'Mr.', 'Owen', 'Harris']



In [62]:

    
def numNames(name):
    return len(re.split(' ', name))

df['NumNames'] = df['Name'].map(numNames)
df.head()









    Out[62]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      Ticket
      Female
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
      Deck
      CabinNum
      DeckNum
      NumNames
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22
      U0
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      A/5 21171
      0
      0
      0
      1
      0
      U
      0
      0
      4
    
    
      2
      38
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      PC 17599
      1
      1
      0
      0
      1
      C
      85
      1
      7
    
    
      3
      26
      U0
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      STON/O2. 3101282
      1
      0
      0
      1
      0
      U
      0
      0
      3
    
    
      4
      35
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      113803
      1
      0
      0
      1
      0
      C
      123
      1
      7
    
    
      5
      35
      U0
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      373450
      0
      0
      0
      1
      0
      U
      0
      0
      4

Title



In [63]:

    
testname









    Out[63]:





'Braund, Mr. Owen Harris'



In [64]:

    
re.search(", (.+?)\.", testname).group(1)









    Out[64]:





'Mr'



In [65]:

    
def getTitle(name):
    match = re.search(", (.*?)\.", name)
    return match.group(1) if match is not None else None

df['Title'] = df['Name'].map(getTitle)
df.head()









    Out[65]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      ...
      Female
      Embarked_C
      Embarked_Q
      Embarked_S
      EmbarkedNum
      Deck
      CabinNum
      DeckNum
      NumNames
      Title
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22
      U0
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      ...
      0
      0
      0
      1
      0
      U
      0
      0
      4
      Mr
    
    
      2
      38
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      ...
      1
      1
      0
      0
      1
      C
      85
      1
      7
      Mrs
    
    
      3
      26
      U0
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      ...
      1
      0
      0
      1
      0
      U
      0
      0
      3
      Miss
    
    
      4
      35
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      ...
      1
      0
      0
      1
      0
      C
      123
      1
      7
      Mrs
    
    
      5
      35
      U0
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      ...
      0
      0
      0
      1
      0
      U
      0
      0
      4
      Mr
    
  

5 rows × 21 columns



In [66]:

    
df.loc[df['Title'] == 'Jonkheer', 'Title'] = 'Master'
df.loc[df['Title'].isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
df.loc[df['Title'] == 'Mme', 'Title'] = 'Mrs'
df.loc[df['Title'].isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
df.loc[df['Title'].isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'
df['Title'].value_counts()









    Out[66]:





Mr        757
Miss      264
Mrs       198
Master     62
Sir         9
Dr          8
Rev         8
Lady        3
Name: Title, dtype: int64



In [67]:

    
df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
df.head()









    Out[67]:






  
    
      
      Age
      Cabin
      Embarked
      Fare
      Name
      Parch
      Pclass
      Sex
      SibSp
      Survived
      ...
      NumNames
      Title
      Title_Dr
      Title_Lady
      Title_Master
      Title_Miss
      Title_Mr
      Title_Mrs
      Title_Rev
      Title_Sir
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22
      U0
      S
      7.2500
      Braund, Mr. Owen Harris
      0
      3
      male
      1
      0
      ...
      4
      Mr
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      2
      38
      C85
      C
      71.2833
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      0
      1
      female
      1
      1
      ...
      7
      Mrs
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      3
      26
      U0
      S
      7.9250
      Heikkinen, Miss. Laina
      0
      3
      female
      0
      1
      ...
      3
      Miss
      0
      0
      0
      1
      0
      0
      0
      0
    
    
      4
      35
      C123
      S
      53.1000
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      0
      1
      female
      1
      1
      ...
      7
      Mrs
      0
      0
      0
      0
      0
      1
      0
      0
    
    
      5
      35
      U0
      S
      8.0500
      Allen, Mr. William Henry
      0
      3
      male
      0
      0
      ...
      4
      Mr
      0
      0
      0
      0
      1
      0
      0
      0
    
  

5 rows × 29 columns

Final Data Processing Pipeline



In [68]:

    
def process_data(df):
    df['Female'] = pd.factorize(df['Sex'])[0]
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()
    df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]
    
    dummies_df = pd.get_dummies(df['Embarked'])
    def addEmbarked(name):
        return 'Embarked_' + name
    dummies_df = dummies_df.rename(columns=addEmbarked)
    df = pd.concat([df, dummies_df], axis=1)
    df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
    
    import re
    def getDeck(cabin):
        match = re.search("([A-Z])", cabin)
        return match.group(1) if match is not None else None
    def getCabinNum(cabin):
        match = re.search("([0-9]+)", cabin)
        return match.group(1) if match is not None else None
    
    df['Deck'] = df['Cabin'].map(getDeck)
    df['DeckNum'] = pd.factorize(df['Deck'])[0]
    df['CabinNum'] = df['Cabin'].map(getCabinNum)
    df.loc[df['CabinNum'].isnull(), 'CabinNum'] = 0
    
    def numNames(name):
        return len(re.split(' ', name))
    df['NumNames'] = df['Name'].map(numNames)
    
    def getTitle(name):
        match = re.search(", (.*?)\.", name)
        return match.group(1) if match is not None else None
    df['Title'] = df['Name'].map(getTitle)
    
    df.loc[df['Title'] == 'Jonkheer', 'Title'] = 'Master'
    df.loc[df['Title'].isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
    df.loc[df['Title'] == 'Mme', 'Title'] = 'Mrs'
    df.loc[df['Title'].isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
    df.loc[df['Title'].isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'
    df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
    
    return df



In [69]:

    
input_df = pd.read_csv('data/train.csv', index_col=0)
submit_df = pd.read_csv('data/test.csv', index_col=0)
df = pd.concat([input_df, submit_df])
df = process_data(df)

features = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Female', 'EmbarkedNum', 'DeckNum', 'CabinNum', 'NumNames', 'Title_Dr','Title_Lady','Title_Master','Title_Miss','Title_Mr','Title_Mrs','Title_Rev','Title_Sir']
df_test = df.loc[df['Survived'].isnull(), features]
df_train = df.loc[df['Survived'].notnull(), features+['Survived']]

df_train.to_csv('data/train_processed.csv') # NOT FEATURE SCALED!
df_test.to_csv('data/test_processed.csv')



In [ ]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500	NaN	S

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32	0	0	370376	7.75	NaN	Q

	Age	SibSp	Parch	Ticket	Fare
4	35	0	0	373450	8.050
11	58	0	0	113783	26.550
13	39	1	5	347082	31.275
15	55	0	0	248706	16.000
18	31	1	0	345763	18.000

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
21	22	1	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S
23	24	1	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.0000	C23 C25 C27	S
31	32	1	1	Spencer, Mrs. William Augustus (Marie Eugenie)	female	NaN	1	0	PC 17569	146.5208	B78	C
52	53	1	1	Harper, Mrs. Henry Sleeper (Myna Haxtun)	female	49.0	1	0	PC 17572	76.7292	D33	C
54	55	0	1	Ostby, Mr. Engelhart Cornelius	male	65.0	0	1	113509	61.9792	B30	C
55	56	1	1	Woolner, Mr. Hugh	male	NaN	0	0	19947	35.5000	C52	S
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0000	B28	NaN
62	63	0	1	Harris, Mr. Henry Birkhardt	male	45.0	1	0	36973	83.4750	C83	S
66	67	1	2	Nye, Mrs. (Elizabeth Ramell)	female	29.0	0	0	C.A. 29395	10.5000	F33	S
75	76	0	3	Moen, Mr. Sigurd Hansen	male	25.0	0	0	348123	7.6500	F G73	S
88	89	1	1	Fortune, Miss. Mabel Helen	female	23.0	3	2	19950	263.0000	C23 C25 C27	S
92	93	0	1	Chaffee, Mr. Herbert Fuller	male	46.0	1	0	W.E.P. 5734	61.1750	E31	S
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C
97	98	1	1	Greenfield, Mr. William Bertram	male	23.0	0	1	PC 17759	63.3583	D10 D12	C
102	103	0	1	White, Mr. Richard Frasar	male	21.0	0	1	35281	77.2875	D26	S
110	111	0	1	Porter, Mr. Walter Chamberlain	male	47.0	0	0	110465	52.0000	C110	S
118	119	0	1	Baxter, Mr. Quigg Edmond	male	24.0	0	1	PC 17558	247.5208	B58 B60	C
123	124	1	2	Webber, Miss. Susan	female	32.5	0	0	27267	13.0000	E101	S
124	125	0	1	White, Mr. Percival Wayland	male	54.0	0	1	35281	77.2875	D26	S
128	129	1	3	Peter, Miss. Anna	female	NaN	1	1	2668	22.3583	F E69	C
136	137	1	1	Newsom, Miss. Helen Monypeny	female	19.0	0	2	11752	26.2833	D47	S
137	138	0	1	Futrelle, Mr. Jacques Heath	male	37.0	1	0	113803	53.1000	C123	S
139	140	0	1	Giglio, Mr. Victor	male	24.0	0	0	PC 17593	79.2000	B86	C
148	149	0	2	Navratil, Mr. Michel ("Louis M Hoffman")	male	36.5	0	2	230080	26.0000	F2	S
...	...	...	...	...	...	...	...	...	...	...	...	...
751	752	1	3	Moor, Master. Meier	male	6.0	0	1	392096	12.4750	E121	S
759	760	1	1	Rothes, the Countess. of (Lucy Noel Martha Dye...	female	33.0	0	0	110152	86.5000	B77	S
763	764	1	1	Carter, Mrs. William Ernest (Lucile Polk)	female	36.0	1	2	113760	120.0000	B96 B98	S
765	766	1	1	Hogeboom, Mrs. John C (Anna Andrews)	female	51.0	1	0	13502	77.9583	D11	S
772	773	0	2	Mack, Mrs. (Mary)	female	57.0	0	0	S.O./P.P. 3	10.5000	E77	S
776	777	0	3	Tobin, Mr. Roger	male	NaN	0	0	383121	7.7500	F38	Q
779	780	1	1	Robert, Mrs. Edward Scott (Elisabeth Walton Mc...	female	43.0	0	1	24160	211.3375	B3	S
781	782	1	1	Dick, Mrs. Albert Adrian (Vera Gillespie)	female	17.0	1	0	17474	57.0000	B20	S
782	783	0	1	Long, Mr. Milton Clyde	male	29.0	0	0	113501	30.0000	D6	S
789	790	0	1	Guggenheim, Mr. Benjamin	male	46.0	0	0	PC 17593	79.2000	B82 B84	C
796	797	1	1	Leader, Dr. Alice (Farnham)	female	49.0	0	0	17465	25.9292	D17	S
802	803	1	1	Carter, Master. William Thornton II	male	11.0	1	2	113760	120.0000	B96 B98	S
806	807	0	1	Andrews, Mr. Thomas Jr	male	39.0	0	0	112050	0.0000	A36	S
809	810	1	1	Chambers, Mrs. Norman Campbell (Bertha Griggs)	female	33.0	1	0	113806	53.1000	E8	S
815	816	0	1	Fry, Mr. Richard	male	NaN	0	0	112058	0.0000	B102	S
820	821	1	1	Hays, Mrs. Charles Melville (Clara Jennings Gr...	female	52.0	1	1	12749	93.5000	B69	S
823	824	1	3	Moor, Mrs. (Beila)	female	27.0	0	1	392096	12.4750	E121	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0000	B28	NaN
835	836	1	1	Compton, Miss. Sara Rebecca	female	39.0	1	1	PC 17756	83.1583	E49	C
839	840	1	1	Marechal, Mr. Pierre	male	NaN	0	0	11774	29.7000	C47	C
849	850	1	1	Goldenberg, Mrs. Samuel L (Edwiga Grabowska)	female	NaN	1	0	17453	89.1042	C92	C
853	854	1	1	Lines, Miss. Mary Conover	female	16.0	0	1	PC 17592	39.4000	D28	S
857	858	1	1	Daly, Mr. Peter Denis	male	51.0	0	0	113055	26.5500	E17	S
862	863	1	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S
867	868	0	1	Roebling, Mr. Washington Augustus II	male	31.0	0	0	PC 17590	50.4958	A24	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	True	True	True	True	True	True	True	True	True	True	False	True
1	True	True	True	True	True	True	True	True	True	True	True	True
2	True	True	True	True	True	True	True	True	True	True	False	True
3	True	True	True	True	True	True	True	True	True	True	True	True
4	True	True	True	True	True	True	True	True	True	True	False	True

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Embarked_C	Embarked_Q	Embarked_S	EmbarkedNum
0	1	0	3	Braund, Mr. Owen Harris	male	22.000000	1	A/5 21171	7.2500	U0	S	0	0	1	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.000000	1	PC 17599	71.2833	C85	C	1	0	0	1
2	3	1	3	Heikkinen, Miss. Laina	female	26.000000	0	STON/O2. 3101282	7.9250	U0	S	0	0	1	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.000000	1	113803	53.1000	C123	S	0	0	1	0
4	5	0	3	Allen, Mr. William Henry	male	35.000000	0	373450	8.0500	U0	S	0	0	1	0
5	6	0	3	Moran, Mr. James	male	29.699118	0	330877	8.4583	U0	Q	0	1	0	2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Embarked_C	Embarked_S	EmbarkedNum	Female
0	1	0	3	Braund, Mr. Owen Harris	male	-0.592481	1	A/5 21171	-0.502445	U0	S	0	1	0	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	0.638789	1	PC 17599	0.786845	C85	C	1	0	1	1
2	3	1	3	Heikkinen, Miss. Laina	female	-0.284663	0	STON/O2. 3101282	-0.488854	U0	S	0	1	0	1
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	0.407926	1	113803	0.420730	C123	S	0	1	0	1
4	5	0	3	Allen, Mr. William Henry	male	0.407926	0	373450	-0.486337	U0	S	0	1	0	0

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

	Age	Cabin	Embarked	Fare	Name	Parch	Pclass	Sex	SibSp	Survived	Ticket
PassengerId
1305	NaN	NaN	S	8.0500	Spector, Mr. Woolf	0	3	male	0	NaN	A.5. 3236
1306	39.0	C105	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1	female	0	NaN	PC 17758
1307	38.5	NaN	S	7.2500	Saether, Mr. Simon Sivertsen	0	3	male	0	NaN	SOTON/O.Q. 3101262
1308	NaN	NaN	S	8.0500	Ware, Mr. Frederick	0	3	male	0	NaN	359309
1309	NaN	NaN	C	22.3583	Peter, Master. Michael J	1	3	male	1	NaN	2668

	Age	Cabin	Embarked	Fare	Name	Parch	Pclass	Sex	SibSp	Survived	Ticket	Female	Embarked_C	Embarked_Q	Embarked_S	EmbarkedNum
PassengerId
1305	29.881138	U0	S	8.0500	Spector, Mr. Woolf	0	3	male	0	NaN	A.5. 3236	0	0	0	1	0
1306	39.000000	C105	C	108.9000	Oliva y Ocana, Dona. Fermina	0	1	female	0	NaN	PC 17758	1	1	0	0	1
1307	38.500000	U0	S	7.2500	Saether, Mr. Simon Sivertsen	0	3	male	0	NaN	SOTON/O.Q. 3101262	0	0	0	1	0
1308	29.881138	U0	S	8.0500	Ware, Mr. Frederick	0	3	male	0	NaN	359309	0	0	0	1	0
1309	29.881138	U0	C	22.3583	Peter, Master. Michael J	1	3	male	1	NaN	2668	0	1	0	0	1