Import all the libraries we need


In [78]:
# pandas
import pandas as pd
from pandas import DataFrame
import re

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve, train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

Load the data and take a look at its state


In [79]:
train_df = pd.read_csv("titanic/train.csv")
test_df    = pd.read_csv("titanic/test.csv")

test_df.head()


Out[79]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [80]:
train_df.info()
print("----------------------------")
test_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

It's easy to see that the training set is missing values for Age, Cabin, and Embarked (the port of embarkation). The test set is missing values for Age, Cabin, and Fare.
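
To make these gaps explicit, we can count missing values per column directly. A quick check with the standard pandas API, shown as an unexecuted sketch:


In [ ]:
# count missing values per column in both datasets
print(train_df.isnull().sum())
print(test_df.isnull().sum())
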

Let's start with the Embarked column in the training set, which holds the port of embarkation. First, check which rows are missing it.


In [81]:
# Embarked
train_df[train_df.Embarked.isnull()]


Out[81]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

Let's look at the overall relationship between survival chance and port of embarkation.


In [82]:
# plot
#sns.factorplot('Embarked','Survived', data=train_df,size=4,aspect=3)

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

sns.countplot(x='Embarked', data=train_df, ax=axis1)
sns.countplot(x='Survived', hue="Embarked", data=train_df, order=[1,0], ax=axis2)

# group by embarked, and get the mean for survived passengers for each value in Embarked
embark_perc = train_df[["Embarked", "Survived"]].groupby(['Embarked'],as_index=False).mean()
sns.barplot(x='Embarked', y='Survived', data=embark_perc,order=['S','C','Q'],ax=axis3)


Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x20c49171160>

Now let's look at other clues that might tell us where these two passengers boarded.


In [83]:
train_df.loc[train_df.Ticket == '113572']


Out[83]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

In [84]:
fare_mask = (train_df.Pclass == 1) & (train_df.Fare > 75) & (train_df.Fare < 85)
print('C == ' + str(len(train_df[fare_mask & (train_df.Embarked == 'C')])))
print('S == ' + str(len(train_df[fare_mask & (train_df.Embarked == 'S')])))


C == 16
S == 13

In [85]:
train_df.loc[train_df.Embarked.isnull(), 'Embarked'] = 'C'

In [86]:
train_df.loc[train_df.Embarked.isnull()]


Out[86]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

Now let's fill in the missing Fare value in the test set.


In [87]:
test_df[test_df.Fare.isnull()]


Out[87]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S

Let's look at all passengers whose other attributes are similar.


In [88]:
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111)

test_df[(test_df.Pclass==3)&(test_df.Embarked=='S')].Fare.hist(bins=100, ax=ax)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Histogram of Fare, Pclass 3 and Embarked S')


Out[88]:
<matplotlib.text.Text at 0x20c4941e4a8>

In [89]:
print("The top 5 most common values of Fare")
test_df[(test_df.Pclass==3)&(test_df.Embarked=='S')].Fare.value_counts().head()


The top 5 most common values of Fare
Out[89]:
8.0500    17
7.8958    10
7.7750    10
7.8542     8
8.6625     8
Name: Fare, dtype: int64

We conclude that the fare was most likely around this amount, so we fill the gap with the most common value, 8.05.


In [90]:
test_df.loc[test_df.Fare.isnull(), 'Fare'] = 8.05
test_df.loc[test_df.Fare.isnull()]


Out[90]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
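
Filling with the group's median fare is a common alternative to the mode. A sketch, assuming the value had not already been filled above:


In [ ]:
# alternative (not used above): fill with the median fare of similar passengers
median_fare = test_df[(test_df.Pclass == 3) & (test_df.Embarked == 'S')].Fare.median()
test_df['Fare'] = test_df['Fare'].fillna(median_fare)
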

Now let's deal with the Age column. It deserves extra attention: age is an important feature that strongly affects a passenger's chance of survival, and it is missing in both datasets.


In [91]:
test_df.loc[test_df.Age.isnull()].head()


Out[91]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 902 3 Ilieff, Mr. Ylio male NaN 0 0 349220 7.8958 NaN S
22 914 1 Flegenheim, Mrs. Alfred (Antoinette) female NaN 0 0 PC 17598 31.6833 NaN S
29 921 3 Samaan, Mr. Elias male NaN 2 0 2662 21.6792 NaN C
33 925 3 Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)" female NaN 1 2 W./C. 6607 23.4500 NaN S
36 928 3 Roth, Miss. Sarah A female NaN 0 0 342712 8.0500 NaN S

In [92]:
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values')
axis2.set_title('New Age values')

# mean, standard deviation, and missing-value count in the training set
average_age_train   = train_df["Age"].mean()
std_age_train       = train_df["Age"].std()
count_nan_age_train = train_df["Age"].isnull().sum()

# mean, standard deviation, and missing-value count in the test set
average_age_test   = test_df["Age"].mean()
std_age_test       = test_df["Age"].std()
count_nan_age_test = test_df["Age"].isnull().sum()

# generate random ages in the range (mean - std, mean + std)
rand_1 = np.random.randint(average_age_train - std_age_train, average_age_train + std_age_train, size = count_nan_age_train)
rand_2 = np.random.randint(average_age_test - std_age_test, average_age_test + std_age_test, size = count_nan_age_test)

# histogram of the original Age values (drop NaNs, cast to int)
train_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
test_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

# fill the missing ages with the generated values;
# .loc avoids the chained-assignment SettingWithCopyWarning
train_df.loc[train_df["Age"].isnull(), "Age"] = rand_1
test_df.loc[test_df["Age"].isnull(), "Age"] = rand_2

# convert floats to ints
train_df['Age'] = train_df['Age'].astype(int)
test_df['Age']  = test_df['Age'].astype(int)

# histogram of the imputed Age values
train_df['Age'].hist(bins=70, ax=axis2)
test_df['Age'].hist(bins=70, ax=axis2)


Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x20c4af494a8>

In [93]:
# A few more plots

# survival density as a function of age
facet = sns.FacetGrid(train_df, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train_df['Age'].max()))
facet.add_legend()

# mean survival rate by age
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_age = train_df[["Age", "Survived"]].groupby(['Age'],as_index=False).mean()
sns.barplot(x='Age', y='Survived', data=average_age)


Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x20c49c88e80>

In [94]:
train_df.info()
test_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null int32
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(1), int32(1), int64(5), object(5)
memory usage: 80.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null int32
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(1), int32(1), int64(4), object(5)
memory usage: 34.4+ KB

Names contain titles, and we can make use of them as well: social status may be an important predictor of survival.


In [95]:
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Nobel",
                    "Don":        "Nobel",
                    "Sir" :       "Nobel",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Nobel",
                    "Dona":       "Nobel",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Nobel"
                    } 

train_df['Title'] = train_df['Name'].apply(lambda x: Title_Dictionary[x.split(',')[1].split('.')[0].strip()])
test_df['Title'] = test_df['Name'].apply(lambda x: Title_Dictionary[x.split(',')[1].split('.')[0].strip()])

train_df.head(100)


Out[95]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S Mrs
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S Mr
5 6 0 3 Moran, Mr. James male 26 0 0 330877 8.4583 NaN Q Mr
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S Mr
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S Master
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S Mrs
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C Mrs
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4 1 1 PP 9549 16.7000 G6 S Miss
11 12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S Miss
12 13 0 3 Saundercock, Mr. William Henry male 20 0 0 A/5. 2151 8.0500 NaN S Mr
13 14 0 3 Andersson, Mr. Anders Johan male 39 1 5 347082 31.2750 NaN S Mr
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 0 350406 7.8542 NaN S Miss
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55 0 0 248706 16.0000 NaN S Mrs
16 17 0 3 Rice, Master. Eugene male 2 4 1 382652 29.1250 NaN Q Master
17 18 1 2 Williams, Mr. Charles Eugene male 43 0 0 244373 13.0000 NaN S Mr
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 0 345763 18.0000 NaN S Mrs
19 20 1 3 Masselmani, Mrs. Fatima female 41 0 0 2649 7.2250 NaN C Mrs
20 21 0 2 Fynney, Mr. Joseph J male 35 0 0 239865 26.0000 NaN S Mr
21 22 1 2 Beesley, Mr. Lawrence male 34 0 0 248698 13.0000 D56 S Mr
22 23 1 3 McGowan, Miss. Anna "Annie" female 15 0 0 330923 8.0292 NaN Q Miss
23 24 1 1 Sloper, Mr. William Thompson male 28 0 0 113788 35.5000 A6 S Mr
24 25 0 3 Palsson, Miss. Torborg Danira female 8 3 1 349909 21.0750 NaN S Miss
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 5 347077 31.3875 NaN S Mrs
26 27 0 3 Emir, Mr. Farred Chehab male 16 0 0 2631 7.2250 NaN C Mr
27 28 0 1 Fortune, Mr. Charles Alexander male 19 3 2 19950 263.0000 C23 C25 C27 S Mr
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female 24 0 0 330959 7.8792 NaN Q Miss
29 30 0 3 Todoroff, Mr. Lalio male 27 0 0 349216 7.8958 NaN S Mr
... ... ... ... ... ... ... ... ... ... ... ... ... ...
70 71 0 2 Jenkin, Mr. Stephen Curnow male 32 0 0 C.A. 33111 10.5000 NaN S Mr
71 72 0 3 Goodwin, Miss. Lillian Amy female 16 5 2 CA 2144 46.9000 NaN S Miss
72 73 0 2 Hood, Mr. Ambrose Jr male 21 0 0 S.O.C. 14879 73.5000 NaN S Mr
73 74 0 3 Chronopoulos, Mr. Apostolos male 26 1 0 2680 14.4542 NaN C Mr
74 75 1 3 Bing, Mr. Lee male 32 0 0 1601 56.4958 NaN S Mr
75 76 0 3 Moen, Mr. Sigurd Hansen male 25 0 0 348123 7.6500 F G73 S Mr
76 77 0 3 Staneff, Mr. Ivan male 31 0 0 349208 7.8958 NaN S Mr
77 78 0 3 Moutal, Mr. Rahamin Haim male 15 0 0 374746 8.0500 NaN S Mr
78 79 1 2 Caldwell, Master. Alden Gates male 0 0 2 248738 29.0000 NaN S Master
79 80 1 3 Dowdell, Miss. Elizabeth female 30 0 0 364516 12.4750 NaN S Miss
80 81 0 3 Waelens, Mr. Achille male 22 0 0 345767 9.0000 NaN S Mr
81 82 1 3 Sheerlinck, Mr. Jan Baptist male 29 0 0 345779 9.5000 NaN S Mr
82 83 1 3 McDermott, Miss. Brigdet Delia female 28 0 0 330932 7.7875 NaN Q Miss
83 84 0 1 Carrau, Mr. Francisco M male 28 0 0 113059 47.1000 NaN S Mr
84 85 1 2 Ilett, Miss. Bertha female 17 0 0 SO/C 14885 10.5000 NaN S Miss
85 86 1 3 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... female 33 3 0 3101278 15.8500 NaN S Mrs
86 87 0 3 Ford, Mr. William Neal male 16 1 3 W./C. 6608 34.3750 NaN S Mr
87 88 0 3 Slocovski, Mr. Selman Francis male 40 0 0 SOTON/OQ 392086 8.0500 NaN S Mr
88 89 1 1 Fortune, Miss. Mabel Helen female 23 3 2 19950 263.0000 C23 C25 C27 S Miss
89 90 0 3 Celotti, Mr. Francesco male 24 0 0 343275 8.0500 NaN S Mr
90 91 0 3 Christmann, Mr. Emil male 29 0 0 343276 8.0500 NaN S Mr
91 92 0 3 Andreasson, Mr. Paul Edvin male 20 0 0 347466 7.8542 NaN S Mr
92 93 0 1 Chaffee, Mr. Herbert Fuller male 46 1 0 W.E.P. 5734 61.1750 E31 S Mr
93 94 0 3 Dean, Mr. Bertram Frank male 26 1 2 C.A. 2315 20.5750 NaN S Mr
94 95 0 3 Coxon, Mr. Daniel male 59 0 0 364500 7.2500 NaN S Mr
95 96 0 3 Shorney, Mr. Charles Joseph male 29 0 0 374910 8.0500 NaN S Mr
96 97 0 1 Goldschmidt, Mr. George B male 71 0 0 PC 17754 34.6542 A5 C Mr
97 98 1 1 Greenfield, Mr. William Bertram male 23 0 1 PC 17759 63.3583 D10 D12 C Mr
98 99 1 2 Doling, Mrs. John T (Ada Julia Bone) female 34 0 1 231919 23.0000 NaN S Mrs
99 100 0 2 Kantor, Mr. Sinai male 34 1 0 244367 26.0000 NaN S Mr

100 rows × 13 columns
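
The split-based lambda above works for this dataset, but the re module imported at the top gives a more explicit way to pull a title out of a name. A sketch of the same extraction (extract_title is a hypothetical helper, not used elsewhere in this notebook):


In [ ]:
# alternative title extraction with a regular expression:
# the title is the text between the comma and the first period
def extract_title(name):
    match = re.search(r',\s*([^.]+)\.', name)
    return match.group(1).strip() if match else None

train_df['Name'].apply(extract_title).head()
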

Instead of two separate columns counting siblings/spouses aboard (SibSp) and parents/children aboard (Parch), let's build a single FamilySize feature.


In [96]:
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch']
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']

train_df.head()


Out[96]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title FamilySize
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S Mr 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C Mrs 1
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S Miss 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S Mrs 1
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S Mr 0

Sex is also a very important feature, but if you've seen the movie Titanic, you probably remember "Women and children first." So let's create a new feature that accounts for both sex and age.


In [97]:
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
train_df['Person'] = train_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)

train_df.head()


Out[97]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title FamilySize Person
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S Mr 1 male
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C Mrs 1 female
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S Miss 0 female
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S Mrs 1 female
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S Mr 0 male

In [98]:
train_df.info()
print("----------------------------")
test_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null int32
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
Title          891 non-null object
FamilySize     891 non-null int64
Person         891 non-null object
dtypes: float64(1), int32(1), int64(6), object(7)
memory usage: 101.0+ KB
----------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 14 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null int32
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
Title          418 non-null object
FamilySize     418 non-null int64
Person         418 non-null object
dtypes: float64(1), int32(1), int64(5), object(7)
memory usage: 44.2+ KB

Our data is now in order, so let's move on to dropping what we no longer need.


In [99]:
train_df.drop(labels=['PassengerId', 'Name', 'Cabin', 'Ticket', 'SibSp', 'Parch', 'Sex'], axis=1, inplace=True)
test_df.drop(labels=['Name', 'Cabin', 'Ticket', 'SibSp', 'Parch', 'Sex'], axis=1, inplace=True)

In [100]:
train_df.head()


Out[100]:
Survived Pclass Age Fare Embarked Title FamilySize Person
0 0 3 22 7.2500 S Mr 1 male
1 1 1 38 71.2833 C Mrs 1 female
2 1 3 26 7.9250 S Miss 0 female
3 1 1 35 53.1000 S Mrs 1 female
4 0 3 35 8.0500 S Mr 0 male

We still have categorical variables, and they need to be encoded. pandas already provides the get_dummies function for exactly this.


In [101]:
dummies_person_train = pd.get_dummies(train_df['Person'],prefix='Person')
dummies_embarked_train = pd.get_dummies(train_df['Embarked'], prefix= 'Embarked') 
dummies_title_train = pd.get_dummies(train_df['Title'], prefix= 'Title')
dummies_pclass_train = pd.get_dummies(train_df['Pclass'], prefix= 'Pclass')

train_df = pd.concat([train_df, dummies_person_train, dummies_embarked_train, dummies_title_train, dummies_pclass_train], axis=1)
train_df = train_df.drop(['Person','Embarked','Title', 'Pclass'], axis=1)

train_df.head()


Out[101]:
Survived Age Fare FamilySize Person_child Person_female Person_male Embarked_C Embarked_Q Embarked_S Title_Master Title_Miss Title_Mr Title_Mrs Title_Nobel Title_Officer Pclass_1 Pclass_2 Pclass_3
0 0 22 7.2500 1 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 1 38 71.2833 1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 1 26 7.9250 0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 1 35 53.1000 1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
4 0 35 8.0500 0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

In [102]:
dummies_person_test = pd.get_dummies(test_df['Person'],prefix='Person')
dummies_embarked_test = pd.get_dummies(test_df['Embarked'], prefix= 'Embarked') 
dummies_title_test = pd.get_dummies(test_df['Title'], prefix= 'Title')
dummies_pclass_test = pd.get_dummies(test_df['Pclass'], prefix= 'Pclass')

test_df = pd.concat([test_df, dummies_person_test, dummies_embarked_test, dummies_title_test, dummies_pclass_test], axis=1)
test_df = test_df.drop(['Person','Embarked','Title', 'Pclass'], axis=1)

test_df.head()


Out[102]:
PassengerId Age Fare FamilySize Person_child Person_female Person_male Embarked_C Embarked_Q Embarked_S Title_Master Title_Miss Title_Mr Title_Mrs Title_Nobel Title_Officer Pclass_1 Pclass_2 Pclass_3
0 892 34 7.8292 0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 893 47 7.0000 1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 894 62 9.6875 0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 895 27 8.6625 0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
4 896 22 12.2875 2 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
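
Since get_dummies only creates columns for categories that actually occur in a given frame, it's worth checking that the train and test feature columns match. A sanity-check sketch (both frames contain every category here, so the assertion passes silently):


In [ ]:
# apart from the target and the id, the two frames must end up with identical columns
assert set(train_df.columns) - {'Survived'} == set(test_df.columns) - {'PassengerId'}
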

Let's define a function that plots learning curves: how the model's score changes as the number of training examples grows.


In [103]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5), scoring='accuracy'):
    plt.figure(figsize=(10,6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel(scoring)
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, scoring=scoring,
                                                            n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

We split the training set in two so that, before submitting the model, we can verify it doesn't overfit our data (a hold-out validation split).


In [104]:
X = train_df.drop(['Survived'], axis=1)
y = train_df.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.3)

Let's try a random forest model. We start with ordinary parameters, then use GridSearchCV to find the optimal combination, and finally take a look at what we got.


In [105]:
# Choose the type of classifier. 
clf = RandomForestClassifier()

# Choose some parameter combinations to try
parameters = {'n_estimators': [4, 6, 9], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }


# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
clf = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
clf.fit(X_train, y_train)


Out[105]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='log2', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            n_estimators=9, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
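
To see just the winning parameter combination rather than the full estimator repr, GridSearchCV exposes it via best_params_:


In [ ]:
# the parameter combination that won the grid search
print(grid_obj.best_params_)
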

In [106]:
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
plot_learning_curve(clf, 'Random Forest', X, y, cv=4);


0.813432835821

In [107]:
from sklearn.model_selection import KFold

def run_kfold(clf):
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(X):
        fold += 1
        X_train, X_test = X.values[train_index], X.values[test_index]
        y_train, y_test = y.values[train_index], y.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))     
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome)) 

run_kfold(clf)


Fold 1 accuracy: 0.8222222222222222
Fold 2 accuracy: 0.8651685393258427
Fold 3 accuracy: 0.7640449438202247
Fold 4 accuracy: 0.8764044943820225
Fold 5 accuracy: 0.8314606741573034
Fold 6 accuracy: 0.8426966292134831
Fold 7 accuracy: 0.7865168539325843
Fold 8 accuracy: 0.7865168539325843
Fold 9 accuracy: 0.8651685393258427
Fold 10 accuracy: 0.8539325842696629
Mean Accuracy: 0.8294132334581773
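
The same evaluation fits in one call with cross_val_score, which runs the fold loop internally. Note that for classifiers an integer cv defaults to stratified folds, so the numbers will differ slightly from the manual loop above. A sketch:


In [ ]:
# equivalent 10-fold evaluation in a single call
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
print("Mean Accuracy: {0}".format(scores.mean()))
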

Now repeat everything we did for the random forest, this time for logistic regression.


In [108]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(random_state=42, penalty='l1')
parameters = {'C':[0.5]}


# Type of scoring used to compare parameter combinations
acc_scorer_lg = make_scorer(accuracy_score)

# Run the grid search
grid_obj_lg = GridSearchCV(lg, parameters, scoring=acc_scorer_lg)
grid_obj_lg = grid_obj_lg.fit(X_train, y_train)

# Set the clf to the best combination of parameters
lg = grid_obj_lg.best_estimator_

# Fit the best algorithm to the data. 
lg.fit(X_train, y_train)


Out[108]:
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [109]:
predictions_lg = lg.predict(X_test)
print(accuracy_score(y_test, predictions_lg))
plot_learning_curve(lg, 'Logistic Regression', X, y, cv=4);


0.809701492537

Finally, pick whichever model we like best and submit its predictions to Kaggle.


In [110]:
ids = test_df['PassengerId']
predictions = clf.predict(test_df.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('titanic-predictions.csv', index = False)
output.head()


Out[110]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
