scikit-learn实践

场景

就是那个大家都熟悉的『Jack and Rose』的故事，豪华游艇倒了，大家都惊恐逃生，可是救生艇的数量有限，无法人人都有，副船长发话了『lady and kid first！』，所以是否获救其实并非随机，而是基于一些背景有rank先后的。

数据

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable Definition Key survival Survival 0 = No, 1 = Yes pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd sex Sex Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.



In [9]:

    
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns



In [11]:

    
data_train = pd.read_csv("../kaggle_titanic/data//train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data_train.head(5)









    Out[11]:






  
    
      
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

数据理解



In [8]:

    
data_train.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

上面的数据说啥了？它告诉我们，训练数据中总共有891名乘客，但是很不幸，我们有些属性的数据不全，比如说：

Age（年龄）属性只有714名乘客有记录
Cabin（客舱）更是只有204名乘客是已知的

似乎信息略少啊，想再瞄一眼具体数据数值情况呢？恩，我们用下列的方法，得到数值型数据的一些分布(因为有些属性，比如姓名，是文本型；而另外一些属性，比如登船港口，是类目型。这些我们用下面的函数是看不到的)：



In [9]:

    
data_train.describe()

我们从上面看到更进一步的什么信息呢？ mean字段告诉我们，大概0.383838的人最后获救了，2/3等舱的人数比1等舱要多，平均乘客年龄大概是29.7岁(计算这个时候会略掉无记录的)等等…

这个时候我们可能会有一些想法了：

不同舱位/乘客等级可能和财富/地位有关系，最后获救概率可能会不一样
年龄对获救概率也一定是有影响的，毕竟前面说了，副船长还说『小孩和女士先走』呢
和登船港口是不是有关系呢？也许登船港口不同，人的出身地位不同？

口说无凭，空想无益。老老实实再来统计统计，看看这些属性值的统计分布吧。



In [68]:

    
plt.figure(figsize=(12,8))
plt.subplot(2,3,1)
sns.countplot(x='Pclass',hue='Survived',data=data_train)
plt.title('Pclass vs Survived')

plt.subplot(2,3,2)
sns.countplot(x='Sex',hue='Survived',data=data_train)
plt.title('Sex vs Survived')

plt.subplot(2,3,3)
sns.countplot(x='Embarked',hue='Survived',data=data_train)
plt.title('Embarked vs Survived')

plt.subplot(2,2,3)
sns.countplot(x='SibSp',hue='Survived',data=data_train)
plt.title('SibSp vs Survived')

plt.subplot(2,2,4)
sns.countplot(x='Parch',hue='Survived',data=data_train)
plt.title('Parch vs Survived')









    Out[68]:





<matplotlib.text.Text at 0x7f2014566d50>

结论：

钱和地位对舱位有影响，进而对获救的可能性也有影响
很尊重lady，lady first践行得不错。性别无疑也要作为重要特征加入最后的模型之中
登船港口居然是有影响的
单身一个人，获救的可能性还是比较低的



In [45]:

    
fig = plt.figure()
fig.set(alpha=0.2)
data_train.Age[data_train.Survived == 0].plot(kind='kde')   
data_train.Age[data_train.Survived == 1].plot(kind='kde')









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f20177fd550>



In [46]:

    
#ticket是船票编号，应该是unique的，和最后的结果没有太大的关系，先不纳入考虑的特征范畴把
#cabin只有204个乘客有值，我们先看看它的一个分布
data_train.Cabin.value_counts()









    Out[46]:





C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
C22 C26            3
E101               3
F2                 3
F33                3
B57 B59 B63 B66    2
C68                2
B58 B60            2
E121               2
D20                2
E8                 2
E44                2
B77                2
C65                2
D26                2
E24                2
E25                2
B20                2
C93                2
D33                2
E67                2
D35                2
D36                2
C52                2
F4                 2
C125               2
C124               2
                  ..
F G63              1
A6                 1
D45                1
D6                 1
D56                1
C101               1
C54                1
D28                1
D37                1
B102               1
D30                1
E17                1
E58                1
F E69              1
D10 D12            1
E50                1
A14                1
C91                1
A16                1
B38                1
B39                1
C95                1
B78                1
B79                1
C99                1
B37                1
A19                1
E12                1
A7                 1
D15                1
Name: Cabin, dtype: int64

Cabin这鬼属性，应该算作类目型的，本来缺失值就多，还如此不集中，注定是个棘手货…第一感觉，这玩意儿如果直接按照类目特征处理的话，太散了，估计每个因子化后的特征都拿不到什么权重。-----> 找技术人员去了解什么含义。。。为方便，我们以Un表示没有的情况

数据预处理与转化



In [12]:

    
data_train = pd.read_csv("../kaggle_titanic/data/train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data = pd.concat([data_train, data_test],axis=0)



In [13]:

    
data.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1046 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Fare        1308 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
Ticket      1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 122.7+ KB

缺失值处理



In [14]:

    
data.Embarked[data.Embarked.isnull()] = data.Embarked.dropna().mode().values









    



/usr/local/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [15]:

    
data.groupby(data.Pclass).mean()['Fare']









    Out[15]:





Pclass
1    87.508992
2    21.179196
3    13.302889
Name: Fare, dtype: float64



In [16]:

    
data.Fare[data.Fare.isnull()] = 13.302889









    



/usr/local/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [18]:

    
from sklearn.ensemble import RandomForestRegressor
age_df = data[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
# 乘客分成已知年龄和未知年龄两部分
known_age = age_df[age_df.Age.notnull()].as_matrix()
unknown_age = age_df[age_df.Age.isnull()].as_matrix()
rf_age = RandomForestRegressor(n_estimators=1000,n_jobs=-1)
rf_age.fit(known_age[:,1:], known_age[:,0])
rf_predict = rf_age.predict(unknown_age[:,1:])
data.loc[data.Age.isnull(),"Age"] = rf_predict



In [19]:

    
data.loc[data.Cabin.notnull(),"Cabin"] = "Know"
data.loc[data.Cabin.isnull(),"Cabin"] = "Unknow"



In [20]:

    
data.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1309 non-null float64
Cabin       1309 non-null object
Embarked    1309 non-null object
Fare        1309 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
Ticket      1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 122.7+ KB



In [21]:

    
### 特征转化
dummies_Cabin = pd.get_dummies(data['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix= 'Pclass')

df = pd.concat([data, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df.head(5)









    Out[21]:






  
    
      
      Age
      Fare
      Parch
      SibSp
      Survived
      Cabin_Know
      Cabin_Unknow
      Embarked_C
      Embarked_Q
      Embarked_S
      Sex_female
      Sex_male
      Pclass_1
      Pclass_2
      Pclass_3
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      22.0
      7.2500
      0
      1
      0.0
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1
    
    
      2
      38.0
      71.2833
      0
      1
      1.0
      1
      0
      1
      0
      0
      1
      0
      1
      0
      0
    
    
      3
      26.0
      7.9250
      0
      0
      1.0
      0
      1
      0
      0
      1
      1
      0
      0
      0
      1
    
    
      4
      35.0
      53.1000
      0
      1
      1.0
      1
      0
      0
      0
      1
      1
      0
      1
      0
      0
    
    
      5
      35.0
      8.0500
      0
      0
      0.0
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1



In [22]:

    
### 去纲量化
from sklearn.preprocessing import StandardScaler
age_scaler = StandardScaler().fit(df.Age)
df['Age_scaled'] = age_scaler.transform(df.Age.values.reshape(-1,1))

fare_scaler = StandardScaler().fit(df.Fare)
df['Fare_scaled'] = age_scaler.transform(df.Fare.values.reshape(-1,1))

df.drop(['Age','Fare'],axis=1,inplace=True)









    



/usr/local/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/usr/local/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)

建模



In [37]:

    
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import sklearn.externals.joblib as joblib



In [24]:

    
X = df.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
y = df.loc[:,'Survived']

X_train = X.loc[y.notnull(),:]
y_train = y.loc[y.notnull()]
X_test = X.loc[y.isnull(),:]



In [25]:

    
# 保存训练数据，后面会用到
from cPickle import dump
data = (np.array(X_train),np.array(y_train))
with file("data/train_data","wb")  as f:
    dump(data, f)



In [27]:

    
lr = LogisticRegression()
lr.fit(X_train,y_train)
print lr.score(X_train,y_train)

# 
y_test = lr.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index,columns=['survival'],dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_0.csv")









    



0.814814814815

提交模型，在测试集上的准确率为76.6%, 排名2700++



In [38]:

    
# 保存模型
joblib.dump(lr, '../kaggle_titanic/model/lr.ckpt')

# lr_read = joblib.load('../kaggle_titanic/model/lr.ckpt')









    Out[38]:





['../kaggle_titanic/model/lr.ckpt']

改进



In [30]:

    
import  cPickle
with open("../kaggle_titanic/data/train_data","rb") as f:
    X_train, y_train = cPickle.load(f)



In [31]:

    
# 因为没有测试集，我们使用KFold来评估模型的有效性
lr = LogisticRegression()
score = cross_val_score(lr,X_train,y_train,cv=5)
score.mean()









    Out[31]:





0.80470539086817561



In [32]:

    
# 我们使用sklearn选择特征
from sklearn.feature_selection import SelectKBest
skb = SelectKBest(k=12)
X30 = skb.fit_transform(X_train,y_train)

lr = LogisticRegression()
score = cross_val_score(lr,X30,y_train,cv=5)
score.mean()









    Out[32]:





0.79574801217255065

特征选择无益于效果的提升，可能是因为特征本身比较少，相关性都比较强。
如果在文件分类中，特征数据多达百万，这时进行特征选择，是很必要的。



In [33]:

    
# 流水线 + 优化
clf = Pipeline([('skb',SelectKBest(k=100)),('lr',LogisticRegression(C=1))])

grid_param =  {'skb__k':[8,9,10,11,12,13,14],
               'lr__C':[0.01, 0.1, 1, 10, 100],
              'lr__penalty':['l1','l2']}

grid = GridSearchCV(clf, grid_param, n_jobs=-1, cv=5)
grid.fit(X_train,y_train)
print grid.best_params_
print grid.best_score_









    



{'lr__penalty': 'l2', 'lr__C': 0.1, 'skb__k': 14}
0.805836139169



In [35]:

    
#使用调优后的数据
clf = Pipeline([('skb',SelectKBest(k=14)),('lr',LogisticRegression(C=0.1,penalty='l2'))])
clf.fit(X_train, y_train)

print clf.score(X_train,y_train)

# 
y_test =clfpp.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index,columns=['survival'],dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_1.csv")









    



0.817059483726

提交结果，得分为0.78,排名2000+。



In [288]:

    
# 学习曲线
from sklearn.model_selection import learning_curve

pp = Pipeline([('skb',SelectKBest(k=14)),('lr',LogisticRegression(C=0.1))])

train_sizes, train_scores, test_scores = learning_curve(pp, X_train, y_train, cv=5, n_jobs=-1, train_sizes=np.linspace(0.2, 1.0, 20))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure()
plt.plot(train_sizes, train_scores_mean, 'bo-')
plt.fill_between(train_sizes, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std,alpha=0.1, color='b')

plt.plot(train_sizes, test_scores_mean, 'ro-')
plt.fill_between(train_sizes, test_scores_mean-test_scores_std, test_scores_mean+test_scores_std,alpha=0.1, color='r')









    Out[288]:





<matplotlib.collections.PolyCollection at 0x7f20140f4050>

改进2



In [300]:

    
from sklearn.model_selection import train_test_split



In [307]:

    
X_tra, X_val, y_tra, y_val = train_test_split(X_train,y_train, test_size=0.3, random_state=0)



In [309]:

    
lrl1 = LogisticRegression(C=1.0, penalty="l1", tol=1e-6)
lrl1.fit(X_tra, y_tra)
y_val_pred = lrl1.predict(X_val)



In [325]:

    
val_pid_set = X_val.iloc[y_val.tolist() != y_val_pred,:].index
data_train.iloc[data_train.index.isin(val_pid_set),:].head(10)









    Out[325]:






  
    
      
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      PassengerId
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      15
      0
      3
      Vestrom, Miss. Hulda Amanda Adolfina
      female
      14.0
      0
      0
      350406
      7.8542
      NaN
      S
    
    
      50
      0
      3
      Arnold-Franchi, Mrs. Josef (Josefine Franchi)
      female
      18.0
      1
      0
      349237
      17.8000
      NaN
      S
    
    
      56
      1
      1
      Woolner, Mr. Hugh
      male
      NaN
      0
      0
      19947
      35.5000
      C52
      S
    
    
      66
      1
      3
      Moubarek, Master. Gerios
      male
      NaN
      1
      1
      2661
      15.2458
      NaN
      C
    
    
      69
      1
      3
      Andersson, Miss. Erna Alexandra
      female
      17.0
      4
      2
      3101281
      7.9250
      NaN
      S
    
    
      86
      1
      3
      Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...
      female
      33.0
      3
      0
      3101278
      15.8500
      NaN
      S
    
    
      114
      0
      3
      Jussila, Miss. Katriina
      female
      20.0
      1
      0
      4136
      9.8250
      NaN
      S
    
    
      141
      0
      3
      Boulos, Mrs. Joseph (Sultana)
      female
      NaN
      0
      2
      2678
      15.2458
      NaN
      C
    
    
      205
      1
      3
      Cohen, Mr. Gurshon "Gus"
      male
      18.0
      0
      0
      A/5 3540
      8.0500
      NaN
      S
    
    
      241
      0
      3
      Zabour, Miss. Thamine
      female
      NaN
      1
      0
      2665
      14.4542
      NaN
      C

我们随便列一些可能可以做的优化操作：

Age属性不使用现在的拟合方式，而是根据名称中的『Mr』『Mrs』『Miss』等的平均值进行填充。
Age不做成一个连续值属性，而是使用一个步长进行离散化，变成离散的类目feature。
Cabin再细化一些，对于有记录的Cabin属性，我们将其分为前面的字母部分(我猜是位置和船层之类的信息) 和后面的数字部分(应该是房间号，有意思的事情是，如果你仔细看看原始数据，你会发现，这个值大的情况下，似乎获救的可能性高一些)。
Pclass和Sex俩太重要了，我们试着用它们去组出一个组合属性来试试，这也是另外一种程度的细化。
单加一个Child字段，Age<=12的，设为1，其余为0(你去看看数据，确实小盆友优先程度很高啊)
如果名字里面有『Mrs』，而Parch>1的，我们猜测她可能是一个母亲，应该获救的概率也会提高，因此可以多加一个Mother字段，此种情况下设为1，其余情况下设为0
登船港口可以考虑先去掉试试(Q和C本来就没权重，S有点诡异)
把堂兄弟/兄妹和 Parch 还有自己个数加在一起组一个Family_size字段(考虑到大家族可能对最后的结果有影响)
Name是一个我们一直没有触碰的属性，我们可以做一些简单的处理，比如说男性中带某些字眼的(‘Capt’, ‘Don’, ‘Major’, ‘Sir’)可以统一到一个Title，女性也一样。

进一步和优化特征，把以把得分做到0.804,基本可以排到前700+了。

模型整合



In [ ]:

    
from sklearn.ensemble import BaggingClassifier



In [326]:

    
clf = LogisticRegression(C=1.0, penalty="l1", tol=1e-6)
bag_clf = BaggingClassifier(clf, n_estimators=20, max_samples=0.75, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=-1)
bag_clf.fit(X_train, y_train)
bag_clf.score(X_train, y_train)









    Out[326]:





0.81593714927048255



In [328]:

    
scores = cross_val_score(bag_clf, X_train, y_train, cv=5)
scores.mean()









    Out[328]:





0.80808252538223635

模型整合的确能提升模型的效果。
我们还可以再构建其它不同的分类器，再把所有的输出结果做为特征，进行二次学习。。。。

回顾

回顾整个分析，我们绝大部分时间是在做数据预处理，特征选择，特征构造等相关的工作，而真正用在建模的时间不是很多。特征工程很重要



In [ ]:

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
50	0	3	Arnold-Franchi, Mrs. Josef (Josefine Franchi)	female	18.0	1	0	349237	17.8000	NaN	S
56	1	1	Woolner, Mr. Hugh	male	NaN	0	0	19947	35.5000	C52	S
66	1	3	Moubarek, Master. Gerios	male	NaN	1	1	2661	15.2458	NaN	C
69	1	3	Andersson, Miss. Erna Alexandra	female	17.0	4	2	3101281	7.9250	NaN	S
86	1	3	Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...	female	33.0	3	0	3101278	15.8500	NaN	S
114	0	3	Jussila, Miss. Katriina	female	20.0	1	0	4136	9.8250	NaN	S
141	0	3	Boulos, Mrs. Joseph (Sultana)	female	NaN	0	2	2678	15.2458	NaN	C
205	1	3	Cohen, Mr. Gurshon "Gus"	male	18.0	0	0	A/5 3540	8.0500	NaN	S
241	0	3	Zabour, Miss. Thamine	female	NaN	1	0	2665	14.4542	NaN	C