Intro to Data Manipulation and Pandas

Dan Tamayo

Material draws from a blog working through the Titanic dataset by Dave Novelli, as well as the Pandas Cookbook by Julia Evans

Follow Along

In a terminal, navigate to the directory where you want to add the MachineLearning folder. Then

cd /path/to/MachineLearning
git pull
cd Day2
source activate ml
jupyter notebook TitanicPandas.ipynb

Loading a Dataset


In [70]:
import pandas as pd
df = pd.read_csv('data/train.csv')

In [2]:
df.head()


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

In [3]:
df.tail()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.75 NaN Q

In [4]:
df.shape


Out[4]:
(891, 12)

Magic Pandas

Selecting Columns


In [5]:
df['Fare'].head()


Out[5]:
0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: Fare, dtype: float64

In [6]:
df[['Fare', 'Sex']].head()


Out[6]:
Fare Sex
0 7.2500 male
1 71.2833 female
2 7.9250 female
3 53.1000 female
4 8.0500 male

Data at a glance


In [7]:
df['Sex'].value_counts()


Out[7]:
male      577
female    314
Name: Sex, dtype: int64

In [8]:
df['Age'].median()


Out[8]:
28.0

Plotting


In [71]:
%matplotlib inline
import seaborn
fig = df['Pclass'].hist()



In [72]:
fig = df.hist(figsize=(15,5))


Filtering


In [73]:
df.head()


Out[73]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Boolean Masks


In [74]:
mask = df['Embarked'] == 'C'
mask.head()


Out[74]:
0    False
1     True
2    False
3    False
4    False
Name: Embarked, dtype: bool

Filtering Dataframes


In [75]:
df_filter = df[mask] # df_filter = df[df['Embarked'] == 'C']
df_filter.head()


Out[75]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
30 31 0 1 Uruchurtu, Don. Manuel E male 40 0 0 PC 17601 27.7208 NaN C

Memory Issues


In [81]:
df_filter = df.loc[df['Embarked'] == 'C']
df_filter.head()


Out[81]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
30 31 0 1 Uruchurtu, Don. Manuel E male 40 0 0 PC 17601 27.7208 NaN C

Selecting specific columns

df.loc[criterion, columns]


In [82]:
test = df.loc[df['Age'] > 30., ['Age', 'Fare', 'Sex']]
test.head()


Out[82]:
Age Fare Sex
1 38 71.2833 female
3 35 53.1000 female
4 35 8.0500 male
6 54 51.8625 male
11 58 26.5500 female

Combining criteria / columns


In [83]:
test = df.loc[(df['Age'] > 30.) & (df['Fare'] < 50.), 'Age':'Fare']
test.head()


Out[83]:
Age SibSp Parch Ticket Fare
4 35 0 0 373450 8.050
11 58 0 0 113783 26.550
13 39 1 5 347082 31.275
15 55 0 0 248706 16.000
18 31 1 0 345763 18.000

Requirements for working with data in scikit-learn

  1. Features should not have missing values
  2. Features and response should be numeric
  3. Features and response should be NumPy arrays
  4. Features and response are separate objects
  5. Features and response should have specific shapes

Requirements for working with data in scikit-learn

  1. Features should not have missing values
  2. Features and response should be numeric
  3. Features and response should be NumPy arrays
  4. Features and response are separate objects
  5. Features and response should have specific shapes

How to Find Valid Values


In [84]:
df.head()


Out[84]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Wrong way:


In [90]:
import numpy as np
df[df['Cabin']!=np.nan]


Out[90]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Right Way:


In [91]:
df.loc[df['Cabin'].notnull()]


Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
31 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) female NaN 1 0 PC 17569 146.5208 B78 C
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
55 56 1 1 Woolner, Mr. Hugh male NaN 0 0 19947 35.5000 C52 S
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 B28 NaN
62 63 0 1 Harris, Mr. Henry Birkhardt male 45.0 1 0 36973 83.4750 C83 S
66 67 1 2 Nye, Mrs. (Elizabeth Ramell) female 29.0 0 0 C.A. 29395 10.5000 F33 S
75 76 0 3 Moen, Mr. Sigurd Hansen male 25.0 0 0 348123 7.6500 F G73 S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.0000 C23 C25 C27 S
92 93 0 1 Chaffee, Mr. Herbert Fuller male 46.0 1 0 W.E.P. 5734 61.1750 E31 S
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
97 98 1 1 Greenfield, Mr. William Bertram male 23.0 0 1 PC 17759 63.3583 D10 D12 C
102 103 0 1 White, Mr. Richard Frasar male 21.0 0 1 35281 77.2875 D26 S
110 111 0 1 Porter, Mr. Walter Chamberlain male 47.0 0 0 110465 52.0000 C110 S
118 119 0 1 Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C
123 124 1 2 Webber, Miss. Susan female 32.5 0 0 27267 13.0000 E101 S
124 125 0 1 White, Mr. Percival Wayland male 54.0 0 1 35281 77.2875 D26 S
128 129 1 3 Peter, Miss. Anna female NaN 1 1 2668 22.3583 F E69 C
136 137 1 1 Newsom, Miss. Helen Monypeny female 19.0 0 2 11752 26.2833 D47 S
137 138 0 1 Futrelle, Mr. Jacques Heath male 37.0 1 0 113803 53.1000 C123 S
139 140 0 1 Giglio, Mr. Victor male 24.0 0 0 PC 17593 79.2000 B86 C
148 149 0 2 Navratil, Mr. Michel ("Louis M Hoffman") male 36.5 0 2 230080 26.0000 F2 S
... ... ... ... ... ... ... ... ... ... ... ... ...
751 752 1 3 Moor, Master. Meier male 6.0 0 1 392096 12.4750 E121 S
759 760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 110152 86.5000 B77 S
763 764 1 1 Carter, Mrs. William Ernest (Lucile Polk) female 36.0 1 2 113760 120.0000 B96 B98 S
765 766 1 1 Hogeboom, Mrs. John C (Anna Andrews) female 51.0 1 0 13502 77.9583 D11 S
772 773 0 2 Mack, Mrs. (Mary) female 57.0 0 0 S.O./P.P. 3 10.5000 E77 S
776 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.7500 F38 Q
779 780 1 1 Robert, Mrs. Edward Scott (Elisabeth Walton Mc... female 43.0 0 1 24160 211.3375 B3 S
781 782 1 1 Dick, Mrs. Albert Adrian (Vera Gillespie) female 17.0 1 0 17474 57.0000 B20 S
782 783 0 1 Long, Mr. Milton Clyde male 29.0 0 0 113501 30.0000 D6 S
789 790 0 1 Guggenheim, Mr. Benjamin male 46.0 0 0 PC 17593 79.2000 B82 B84 C
796 797 1 1 Leader, Dr. Alice (Farnham) female 49.0 0 0 17465 25.9292 D17 S
802 803 1 1 Carter, Master. William Thornton II male 11.0 1 2 113760 120.0000 B96 B98 S
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0000 A36 S
809 810 1 1 Chambers, Mrs. Norman Campbell (Bertha Griggs) female 33.0 1 0 113806 53.1000 E8 S
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
820 821 1 1 Hays, Mrs. Charles Melville (Clara Jennings Gr... female 52.0 1 1 12749 93.5000 B69 S
823 824 1 3 Moor, Mrs. (Beila) female 27.0 0 1 392096 12.4750 E121 S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
835 836 1 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C
839 840 1 1 Marechal, Mr. Pierre male NaN 0 0 11774 29.7000 C47 C
849 850 1 1 Goldenberg, Mrs. Samuel L (Edwiga Grabowska) female NaN 1 0 17453 89.1042 C92 C
853 854 1 1 Lines, Miss. Mary Conover female 16.0 0 1 PC 17592 39.4000 D28 S
857 858 1 1 Daly, Mr. Peter Denis male 51.0 0 0 113055 26.5500 E17 S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

204 rows × 12 columns

Approach 1: Throw out any rows with missing data values


In [92]:
df.notnull().head()


Out[92]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 True True True True True True True True True True False True
1 True True True True True True True True True True True True
2 True True True True True True True True True True False True
3 True True True True True True True True True True True True
4 True True True True True True True True True True False True

In [93]:
df_filter = df.loc[df.notnull().all(axis=1)]

How much data is left?


In [94]:
df_filter.shape


Out[94]:
(183, 12)

Approach 2: Assign missing identifier


In [105]:
df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
df.head()


Out[105]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 U0 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 U0 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 U0 S

Approach 3: Assign the average/median/mode value


In [106]:
df.loc[df['Fare'].isnull()].shape


Out[106]:
(0, 12)

In [107]:
df['Fare'].median()


Out[107]:
14.4542

In [108]:
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()

Categorical Variables


In [109]:
df.loc[df['Embarked'].isnull()].shape


Out[109]:
(0, 12)

In [111]:
df['Embarked'].mode()


Out[111]:
0    S
dtype: object

In [112]:
df['Embarked'].mode()[0]


Out[112]:
'S'

In [113]:
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]

Approach 4: Fit a regression model to predict missing values


In [114]:
for col in df.columns:
    print("NaNs in column {0} = {1}".format(col, df.loc[df[col].isnull()].shape[0]))


NaNs in column PassengerId = 0
NaNs in column Survived = 0
NaNs in column Pclass = 0
NaNs in column Name = 0
NaNs in column Sex = 0
NaNs in column Age = 0
NaNs in column SibSp = 0
NaNs in column Parch = 0
NaNs in column Ticket = 0
NaNs in column Fare = 0
NaNs in column Cabin = 0
NaNs in column Embarked = 0

In [115]:
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()

Requirements for working with data in scikit-learn

  1. Features should not have missing values
  2. Features and response should be numeric
  3. Features and response should be NumPy arrays
  4. Features and response are separate objects
  5. Features and response should have specific shapes

In [116]:
df.head()


Out[116]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 U0 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 U0 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 U0 S

Approach 1: Creating Dummy (Binary) Variables


In [117]:
print(df['Embarked'].unique())


['S' 'C' 'Q']

In [118]:
dummies_df = pd.get_dummies(df['Embarked'])
dummies_df.head()


Out[118]:
C Q S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1

In [119]:
def addEmbarked(name):
    return 'Embarked_' + name
dummies_df = dummies_df.rename(columns=addEmbarked)
dummies_df.head()


Out[119]:
Embarked_C Embarked_Q Embarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1

In [120]:
df = pd.concat([df, dummies_df], axis=1)
df.head()


Out[120]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 U0 S 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 1 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 U0 S 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 U0 S 0 0 1

Approach 2: Factorizing (Make single multi-class feature)


In [121]:
df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
df.head(6)


Out[121]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S EmbarkedNum
0 1 0 3 Braund, Mr. Owen Harris male 22.000000 1 0 A/5 21171 7.2500 U0 S 0 0 1 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.000000 1 0 PC 17599 71.2833 C85 C 1 0 0 1
2 3 1 3 Heikkinen, Miss. Laina female 26.000000 0 0 STON/O2. 3101282 7.9250 U0 S 0 0 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.000000 1 0 113803 53.1000 C123 S 0 0 1 0
4 5 0 3 Allen, Mr. William Henry male 35.000000 0 0 373450 8.0500 U0 S 0 0 1 0
5 6 0 3 Moran, Mr. James male 29.699118 0 0 330877 8.4583 U0 Q 0 1 0 2

What are the classes?


In [122]:
pd.factorize(df['Sex'])


Out[122]:
(array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
        0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
        0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
        0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
        0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
        0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0,
        1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
        1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
        1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
        0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
        1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
        1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0]),
 Index([u'male', u'female'], dtype='object'))

In [123]:
df['Female'] = pd.factorize(df['Sex'])[0]
df.head()


Out[123]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S EmbarkedNum Female
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 U0 S 0 0 1 0 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 1 0 0 1 1
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 U0 S 0 0 1 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0 0 1 0 1
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 U0 S 0 0 1 0 0

Continuous Features: Feature Scaling


In [124]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[['Age', 'Fare']]) # .fit(df) to scale all numerical columns
print("Means = {0}".format(scaler.mean_))
print("Stdevs = {0}".format(scaler.scale_))
df[['Age', 'Fare']] = scaler.transform(df[['Age', 'Fare']])
df.head()


Means = [ 29.69911765  32.20420797]
Stdevs = [ 12.99471687  49.66553444]
Out[124]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S EmbarkedNum Female
0 1 0 3 Braund, Mr. Owen Harris male -0.592481 1 0 A/5 21171 -0.502445 U0 S 0 0 1 0 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 0.638789 1 0 PC 17599 0.786845 C85 C 1 0 0 1 1
2 3 1 3 Heikkinen, Miss. Laina female -0.284663 0 0 STON/O2. 3101282 -0.488854 U0 S 0 0 1 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 0.407926 1 0 113803 0.420730 C123 S 0 0 1 0 1
4 5 0 3 Allen, Mr. William Henry male 0.407926 0 0 373450 -0.486337 U0 S 0 0 1 0 0

Requirements for working with data in scikit-learn

  1. Features should not have missing values
  2. Features and response should be numeric
  3. Features and response should be NumPy arrays
  4. Features and response are separate objects
  5. Features and response should have specific shapes

Underneath, Pandas series are numpy arrays


In [125]:
fares = df['Fare'].values
type(fares)


Out[125]:
numpy.ndarray

Getting a pipeline ready for sklearn

Combining Data Frames


In [126]:
input_df = pd.read_csv('data/train.csv')
submit_df = pd.read_csv('data/test.csv')

print(input_df.shape)
print(submit_df.shape)


(891, 12)
(418, 11)

In [127]:
submit_df.head()


Out[127]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [128]:
input_df.head()


Out[128]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Indexing


In [129]:
submit_df = pd.read_csv('data/test.csv', index_col=0)
submit_df.head()


Out[129]:
Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [130]:
input_df = pd.read_csv('data/train.csv', index_col=0)
submit_df = pd.read_csv('data/test.csv', index_col=0)
df = pd.concat([input_df, submit_df])
df.tail()


Out[130]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket
PassengerId
1305 NaN NaN S 8.0500 Spector, Mr. Woolf 0 3 male 0 NaN A.5. 3236
1306 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1 female 0 NaN PC 17758
1307 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 3 male 0 NaN SOTON/O.Q. 3101262
1308 NaN NaN S 8.0500 Ware, Mr. Frederick 0 3 male 0 NaN 359309
1309 NaN NaN C 22.3583 Peter, Master. Michael J 1 3 male 1 NaN 2668

In [131]:
print(df.shape[1], "columns")
print(df.shape[0], "rows")
print(df.columns.values)


(11, 'columns')
(1309, 'rows')
['Age' 'Cabin' 'Embarked' 'Fare' 'Name' 'Parch' 'Pclass' 'Sex' 'SibSp'
 'Survived' 'Ticket']

Putting it all together


In [49]:
def process_data(df):
    df['Female'] = pd.factorize(df['Sex'])[0]
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()
    df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]
    
    dummies_df = pd.get_dummies(df['Embarked'])
    def addEmbarked(name):
        return 'Embarked_' + name
    dummies_df = dummies_df.rename(columns=addEmbarked)
    df = pd.concat([df, dummies_df], axis=1)
    df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
    
    return df

In [50]:
df = process_data(df)
df.tail()


Out[50]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Female Embarked_C Embarked_Q Embarked_S EmbarkedNum
PassengerId
1305 29.881138 U0 S 8.0500 Spector, Mr. Woolf 0 3 male 0 NaN A.5. 3236 0 0 0 1 0
1306 39.000000 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1 female 0 NaN PC 17758 1 1 0 0 1
1307 38.500000 U0 S 7.2500 Saether, Mr. Simon Sivertsen 0 3 male 0 NaN SOTON/O.Q. 3101262 0 0 0 1 0
1308 29.881138 U0 S 8.0500 Ware, Mr. Frederick 0 3 male 0 NaN 359309 0 0 0 1 0
1309 29.881138 U0 C 22.3583 Peter, Master. Michael J 1 3 male 1 NaN 2668 0 1 0 0 1

In [51]:
features = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Female', 'EmbarkedNum']
df_test = df.loc[df['Survived'].isnull(), features]
df_train = df.loc[df['Survived'].notnull(), features+['Survived']]
df_train.head()


Out[51]:
Age Fare Parch Pclass SibSp Female EmbarkedNum Survived
PassengerId
1 22 7.2500 0 3 1 0 0 0
2 38 71.2833 0 1 1 1 1 1
3 26 7.9250 0 3 0 1 0 1
4 35 53.1000 0 1 1 1 0 1
5 35 8.0500 0 3 0 0 0 0

Requirements for working with data in scikit-learn

  1. Features should not have missing values
  2. Features and response should be numeric
  3. Features and response should be NumPy arrays
  4. Features and response are separate objects
  5. Features and response should have specific shapes

In [52]:
X_train = df_train[features].values
y_train = df_train['Survived'].values
print(X_train[0:5])
print(y_train[0:5])
print("X has {0} rows".format(X_train.shape[0]))
print("y has {0} rows".format(y_train.shape[0]))


[[ 22.       7.25     0.       3.       1.       0.       0.    ]
 [ 38.      71.2833   0.       1.       1.       1.       1.    ]
 [ 26.       7.925    0.       3.       0.       1.       0.    ]
 [ 35.      53.1      0.       1.       1.       1.       0.    ]
 [ 35.       8.05     0.       3.       0.       0.       0.    ]]
[ 0.  1.  1.  1.  0.]
X has 891 rows
y has 891 rows

Feature Engineering

Parsing Alphanumeric Features


In [53]:
df['Cabin'].unique()


Out[53]:
array(['U0', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87',
       'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D',
       'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7',
       'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128', 'D37',
       'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111',
       'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30', 'D20',
       'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22', 'C86',
       'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20', 'D19',
       'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126', 'B71',
       'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64', 'E24', 'C90',
       'C45', 'E8', 'B101', 'D45', 'C46', 'D30', 'E121', 'D11', 'E77',
       'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36', 'B102', 'B69', 'E49',
       'C47', 'D28', 'E17', 'A24', 'C50', 'B42', 'C148', 'B45', 'B36',
       'A21', 'D34', 'A9', 'C31', 'B61', 'C53', 'D43', 'C130', 'C132',
       'C55 C57', 'C116', 'F', 'A29', 'C6', 'C28', 'C51', 'C97', 'D22',
       'B10', 'E45', 'E52', 'A11', 'B11', 'C80', 'C89', 'F E46', 'B26',
       'F E57', 'A18', 'E60', 'E39 E41', 'B52 B54 B56', 'C39', 'B24',
       'D40', 'D38', 'C105'], dtype=object)

Regular Expressions


In [54]:
import re
def getDeck(cabin):
    match = re.search("([A-Z])", cabin)
    return match.group(1) if match is not None else None
def getCabinNum(cabin):
    match = re.search("([0-9]+)", cabin)
    return match.group(1) if match is not None else None
print(getDeck('C237'))
print(getCabinNum('C237'))


C
237

Apply a function to all rows to generate a new feature


In [55]:
df['Deck'] = df['Cabin'].map(getDeck)
df['CabinNum'] = df['Cabin'].map(getCabinNum)
df.head()


Out[55]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Female Embarked_C Embarked_Q Embarked_S EmbarkedNum Deck CabinNum
PassengerId
1 22 U0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 A/5 21171 0 0 0 1 0 U 0
2 38 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 PC 17599 1 1 0 0 1 C 85
3 26 U0 S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 STON/O2. 3101282 1 0 0 1 0 U 0
4 35 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 113803 1 0 0 1 0 C 123
5 35 U0 S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 373450 0 0 0 1 0 U 0

In [56]:
df['CabinNum'].isnull().value_counts()


Out[56]:
False    1303
True        6
Name: CabinNum, dtype: int64

In [57]:
df.loc[df['CabinNum'].isnull(), 'CabinNum'] = 0

In [58]:
df['Deck'].isnull().value_counts()


Out[58]:
False    1309
Name: Deck, dtype: int64

In [59]:
df['DeckNum'] = pd.factorize(df['Deck'])[0]

What to do with the name?

Number of names


In [60]:
testname = df.loc[1, 'Name']
print(testname)


Braund, Mr. Owen Harris

In [61]:
re.split(' ', testname)


Out[61]:
['Braund,', 'Mr.', 'Owen', 'Harris']

In [62]:
def numNames(name):
    return len(re.split(' ', name))

df['NumNames'] = df['Name'].map(numNames)
df.head()


Out[62]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Female Embarked_C Embarked_Q Embarked_S EmbarkedNum Deck CabinNum DeckNum NumNames
PassengerId
1 22 U0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 A/5 21171 0 0 0 1 0 U 0 0 4
2 38 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 PC 17599 1 1 0 0 1 C 85 1 7
3 26 U0 S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 STON/O2. 3101282 1 0 0 1 0 U 0 0 3
4 35 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 113803 1 0 0 1 0 C 123 1 7
5 35 U0 S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 373450 0 0 0 1 0 U 0 0 4

Title


In [63]:
testname


Out[63]:
'Braund, Mr. Owen Harris'

In [64]:
re.search(", (.+?)\.", testname).group(1)


Out[64]:
'Mr'

In [65]:
def getTitle(name):
    match = re.search(", (.*?)\.", name)
    return match.group(1) if match is not None else None

df['Title'] = df['Name'].map(getTitle)
df.head()


Out[65]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived ... Female Embarked_C Embarked_Q Embarked_S EmbarkedNum Deck CabinNum DeckNum NumNames Title
PassengerId
1 22 U0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 ... 0 0 0 1 0 U 0 0 4 Mr
2 38 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 ... 1 1 0 0 1 C 85 1 7 Mrs
3 26 U0 S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 ... 1 0 0 1 0 U 0 0 3 Miss
4 35 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 ... 1 0 0 1 0 C 123 1 7 Mrs
5 35 U0 S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 ... 0 0 0 1 0 U 0 0 4 Mr

5 rows × 21 columns


In [66]:
df.loc[df['Title'] == 'Jonkheer', 'Title'] = 'Master'
df.loc[df['Title'].isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
df.loc[df['Title'] == 'Mme', 'Title'] = 'Mrs'
df.loc[df['Title'].isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
df.loc[df['Title'].isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'
df['Title'].value_counts()


Out[66]:
Mr        757
Miss      264
Mrs       198
Master     62
Sir         9
Dr          8
Rev         8
Lady        3
Name: Title, dtype: int64

In [67]:
df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
df.head()


Out[67]:
Age Cabin Embarked Fare Name Parch Pclass Sex SibSp Survived ... NumNames Title Title_Dr Title_Lady Title_Master Title_Miss Title_Mr Title_Mrs Title_Rev Title_Sir
PassengerId
1 22 U0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0 ... 4 Mr 0 0 0 0 1 0 0 0
2 38 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 1 female 1 1 ... 7 Mrs 0 0 0 0 0 1 0 0
3 26 U0 S 7.9250 Heikkinen, Miss. Laina 0 3 female 0 1 ... 3 Miss 0 0 0 1 0 0 0 0
4 35 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 female 1 1 ... 7 Mrs 0 0 0 0 0 1 0 0
5 35 U0 S 8.0500 Allen, Mr. William Henry 0 3 male 0 0 ... 4 Mr 0 0 0 0 1 0 0 0

5 rows × 29 columns

Final Data Processing Pipeline


In [68]:
def process_data(df):
    df['Female'] = pd.factorize(df['Sex'])[0]
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()
    df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]
    
    dummies_df = pd.get_dummies(df['Embarked'])
    def addEmbarked(name):
        return 'Embarked_' + name
    dummies_df = dummies_df.rename(columns=addEmbarked)
    df = pd.concat([df, dummies_df], axis=1)
    df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
    
    import re
    def getDeck(cabin):
        match = re.search("([A-Z])", cabin)
        return match.group(1) if match is not None else None
    def getCabinNum(cabin):
        match = re.search("([0-9]+)", cabin)
        return match.group(1) if match is not None else None
    
    df['Deck'] = df['Cabin'].map(getDeck)
    df['DeckNum'] = pd.factorize(df['Deck'])[0]
    df['CabinNum'] = df['Cabin'].map(getCabinNum)
    df.loc[df['CabinNum'].isnull(), 'CabinNum'] = 0
    
    def numNames(name):
        return len(re.split(' ', name))
    df['NumNames'] = df['Name'].map(numNames)
    
    def getTitle(name):
        match = re.search(", (.*?)\.", name)
        return match.group(1) if match is not None else None
    df['Title'] = df['Name'].map(getTitle)
    
    df.loc[df['Title'] == 'Jonkheer', 'Title'] = 'Master'
    df.loc[df['Title'].isin(['Ms', 'Mlle']), 'Title'] = 'Miss'
    df.loc[df['Title'] == 'Mme', 'Title'] = 'Mrs'
    df.loc[df['Title'].isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
    df.loc[df['Title'].isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'
    df = pd.concat([df, pd.get_dummies(df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
    
    return df

In [69]:
input_df = pd.read_csv('data/train.csv', index_col=0)
submit_df = pd.read_csv('data/test.csv', index_col=0)
df = pd.concat([input_df, submit_df])
df = process_data(df)

features = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Female', 'EmbarkedNum', 'DeckNum', 'CabinNum', 'NumNames', 'Title_Dr','Title_Lady','Title_Master','Title_Miss','Title_Mr','Title_Mrs','Title_Rev','Title_Sir']
df_test = df.loc[df['Survived'].isnull(), features]
df_train = df.loc[df['Survived'].notnull(), features+['Survived']]

df_train.to_csv('data/train_processed.csv') # NOT FEATURE SCALED!
df_test.to_csv('data/test_processed.csv')

In [ ]: