scikit-learn实践

场景

就是那个大家都熟悉的『Jack and Rose』的故事,豪华游艇倒了,大家都惊恐逃生,可是救生艇的数量有限,无法人人都有,副船长发话了『lady and kid first!』,所以是否获救其实并非随机,而是基于一些背景有rank先后的。

数据

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable Definition Key survival Survival 0 = No, 1 = Yes pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd sex Sex Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.


In [9]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns

In [11]:
data_train = pd.read_csv("../kaggle_titanic/data//train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data_train.head(5)


Out[11]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

数据理解


In [8]:
data_train.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

上面的数据说啥了?它告诉我们,训练数据中总共有891名乘客,但是很不幸,我们有些属性的数据不全,比如说:

  • Age(年龄)属性只有714名乘客有记录
  • Cabin(客舱)更是只有204名乘客是已知的

似乎信息略少啊,想再瞄一眼具体数据数值情况呢?恩,我们用下列的方法,得到数值型数据的一些分布(因为有些属性,比如姓名,是文本型;而另外一些属性,比如登船港口,是类目型。这些我们用下面的函数是看不到的):


In [9]:
data_train.describe()


Out[9]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

我们从上面看到更进一步的什么信息呢? mean字段告诉我们,大概0.383838的人最后获救了,2/3等舱的人数比1等舱要多,平均乘客年龄大概是29.7岁(计算这个时候会略掉无记录的)等等…

这个时候我们可能会有一些想法了:

  • 不同舱位/乘客等级可能和财富/地位有关系,最后获救概率可能会不一样
  • 年龄对获救概率也一定是有影响的,毕竟前面说了,副船长还说『小孩和女士先走』呢
  • 和登船港口是不是有关系呢?也许登船港口不同,人的出身地位不同?

口说无凭,空想无益。老老实实再来统计统计,看看这些属性值的统计分布吧。


In [68]:
plt.figure(figsize=(12,8))
plt.subplot(2,3,1)
sns.countplot(x='Pclass',hue='Survived',data=data_train)
plt.title('Pclass vs Survived')

plt.subplot(2,3,2)
sns.countplot(x='Sex',hue='Survived',data=data_train)
plt.title('Sex vs Survived')

plt.subplot(2,3,3)
sns.countplot(x='Embarked',hue='Survived',data=data_train)
plt.title('Embarked vs Survived')

plt.subplot(2,2,3)
sns.countplot(x='SibSp',hue='Survived',data=data_train)
plt.title('SibSp vs Survived')

plt.subplot(2,2,4)
sns.countplot(x='Parch',hue='Survived',data=data_train)
plt.title('Parch vs Survived')


Out[68]:
<matplotlib.text.Text at 0x7f2014566d50>

结论:

  • 钱和地位对舱位有影响,进而对获救的可能性也有影响
  • 很尊重lady,lady first践行得不错。性别无疑也要作为重要特征加入最后的模型之中
  • 登船港口居然是有影响的
  • 单身一个人,获救的可能性还是比较低的

In [45]:
fig = plt.figure()
fig.set(alpha=0.2)
data_train.Age[data_train.Survived == 0].plot(kind='kde')   
data_train.Age[data_train.Survived == 1].plot(kind='kde')


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f20177fd550>

In [46]:
#ticket是船票编号,应该是unique的,和最后的结果没有太大的关系,先不纳入考虑的特征范畴把
#cabin只有204个乘客有值,我们先看看它的一个分布
data_train.Cabin.value_counts()


Out[46]:
C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
C22 C26            3
E101               3
F2                 3
F33                3
B57 B59 B63 B66    2
C68                2
B58 B60            2
E121               2
D20                2
E8                 2
E44                2
B77                2
C65                2
D26                2
E24                2
E25                2
B20                2
C93                2
D33                2
E67                2
D35                2
D36                2
C52                2
F4                 2
C125               2
C124               2
                  ..
F G63              1
A6                 1
D45                1
D6                 1
D56                1
C101               1
C54                1
D28                1
D37                1
B102               1
D30                1
E17                1
E58                1
F E69              1
D10 D12            1
E50                1
A14                1
C91                1
A16                1
B38                1
B39                1
C95                1
B78                1
B79                1
C99                1
B37                1
A19                1
E12                1
A7                 1
D15                1
Name: Cabin, dtype: int64

Cabin这鬼属性,应该算作类目型的,本来缺失值就多,还如此不集中,注定是个棘手货…第一感觉,这玩意儿如果直接按照类目特征处理的话,太散了,估计每个因子化后的特征都拿不到什么权重。-----> 找技术人员去了解什么含义。。。 为方便,我们以Un表示没有的情况

数据预处理与转化


In [12]:
data_train = pd.read_csv("../kaggle_titanic/data/train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data = pd.concat([data_train, data_test],axis=0)

In [13]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1046 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Fare        1308 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
Ticket      1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 122.7+ KB

缺失值处理


In [14]:
data.Embarked[data.Embarked.isnull()] = data.Embarked.dropna().mode().values


/usr/local/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [15]:
data.groupby(data.Pclass).mean()['Fare']


Out[15]:
Pclass
1    87.508992
2    21.179196
3    13.302889
Name: Fare, dtype: float64

In [16]:
data.Fare[data.Fare.isnull()] = 13.302889


/usr/local/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

In [18]:
from sklearn.ensemble import RandomForestRegressor
age_df = data[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
# 乘客分成已知年龄和未知年龄两部分
known_age = age_df[age_df.Age.notnull()].as_matrix()
unknown_age = age_df[age_df.Age.isnull()].as_matrix()
rf_age = RandomForestRegressor(n_estimators=1000,n_jobs=-1)
rf_age.fit(known_age[:,1:], known_age[:,0])
rf_predict = rf_age.predict(unknown_age[:,1:])
data.loc[data.Age.isnull(),"Age"] = rf_predict

In [19]:
data.loc[data.Cabin.notnull(),"Cabin"] = "Know"
data.loc[data.Cabin.isnull(),"Cabin"] = "Unknow"

In [20]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 1 to 1309
Data columns (total 11 columns):
Age         1309 non-null float64
Cabin       1309 non-null object
Embarked    1309 non-null object
Fare        1309 non-null float64
Name        1309 non-null object
Parch       1309 non-null int64
Pclass      1309 non-null int64
Sex         1309 non-null object
SibSp       1309 non-null int64
Survived    891 non-null float64
Ticket      1309 non-null object
dtypes: float64(3), int64(3), object(5)
memory usage: 122.7+ KB

In [21]:
### 特征转化
dummies_Cabin = pd.get_dummies(data['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix= 'Pclass')

df = pd.concat([data, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df.head(5)


Out[21]:
Age Fare Parch SibSp Survived Cabin_Know Cabin_Unknow Embarked_C Embarked_Q Embarked_S Sex_female Sex_male Pclass_1 Pclass_2 Pclass_3
PassengerId
1 22.0 7.2500 0 1 0.0 0 1 0 0 1 0 1 0 0 1
2 38.0 71.2833 0 1 1.0 1 0 1 0 0 1 0 1 0 0
3 26.0 7.9250 0 0 1.0 0 1 0 0 1 1 0 0 0 1
4 35.0 53.1000 0 1 1.0 1 0 0 0 1 1 0 1 0 0
5 35.0 8.0500 0 0 0.0 0 1 0 0 1 0 1 0 0 1

In [22]:
### 去纲量化
from sklearn.preprocessing import StandardScaler
age_scaler = StandardScaler().fit(df.Age)
df['Age_scaled'] = age_scaler.transform(df.Age.values.reshape(-1,1))

fare_scaler = StandardScaler().fit(df.Fare)
df['Fare_scaled'] = age_scaler.transform(df.Fare.values.reshape(-1,1))

df.drop(['Age','Fare'],axis=1,inplace=True)


/usr/local/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/usr/local/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)

建模


In [37]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import sklearn.externals.joblib as joblib

In [24]:
X = df.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
y = df.loc[:,'Survived']

X_train = X.loc[y.notnull(),:]
y_train = y.loc[y.notnull()]
X_test = X.loc[y.isnull(),:]

In [25]:
# 保存训练数据,后面会用到
from cPickle import dump
data = (np.array(X_train),np.array(y_train))
with file("data/train_data","wb")  as f:
    dump(data, f)

In [27]:
lr = LogisticRegression()
lr.fit(X_train,y_train)
print lr.score(X_train,y_train)

# 
y_test = lr.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index,columns=['survival'],dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_0.csv")


0.814814814815

提交模型,在测试集上的准确率为76.6%, 排名2700++


In [38]:
# 保存模型
joblib.dump(lr, '../kaggle_titanic/model/lr.ckpt')

# lr_read = joblib.load('../kaggle_titanic/model/lr.ckpt')


Out[38]:
['../kaggle_titanic/model/lr.ckpt']

改进


In [30]:
import  cPickle
with open("../kaggle_titanic/data/train_data","rb") as f:
    X_train, y_train = cPickle.load(f)

In [31]:
# 因为没有测试集,我们使用KFold来评估模型的有效性
lr = LogisticRegression()
score = cross_val_score(lr,X_train,y_train,cv=5)
score.mean()


Out[31]:
0.80470539086817561

In [32]:
# 我们使用sklearn选择特征
from sklearn.feature_selection import SelectKBest
skb = SelectKBest(k=12)
X30 = skb.fit_transform(X_train,y_train)

lr = LogisticRegression()
score = cross_val_score(lr,X30,y_train,cv=5)
score.mean()


Out[32]:
0.79574801217255065

特征选择无益于效果的提升,可能是因为特征本身比较少,相关性都比较强。
如果在文件分类中,特征数据多达百万,这时进行特征选择,是很必要的。


In [33]:
# 流水线 + 优化
clf = Pipeline([('skb',SelectKBest(k=100)),('lr',LogisticRegression(C=1))])

grid_param =  {'skb__k':[8,9,10,11,12,13,14],
               'lr__C':[0.01, 0.1, 1, 10, 100],
              'lr__penalty':['l1','l2']}

grid = GridSearchCV(clf, grid_param, n_jobs=-1, cv=5)
grid.fit(X_train,y_train)
print grid.best_params_
print grid.best_score_


{'lr__penalty': 'l2', 'lr__C': 0.1, 'skb__k': 14}
0.805836139169

In [35]:
#使用调优后的数据
clf = Pipeline([('skb',SelectKBest(k=14)),('lr',LogisticRegression(C=0.1,penalty='l2'))])
clf.fit(X_train, y_train)

print clf.score(X_train,y_train)

# 
y_test =clfpp.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index,columns=['survival'],dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_1.csv")


0.817059483726

提交结果,得分为0.78,排名2000+。


In [288]:
# 学习曲线
from sklearn.model_selection import learning_curve

pp = Pipeline([('skb',SelectKBest(k=14)),('lr',LogisticRegression(C=0.1))])

train_sizes, train_scores, test_scores = learning_curve(pp, X_train, y_train, cv=5, n_jobs=-1, train_sizes=np.linspace(0.2, 1.0, 20))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure()
plt.plot(train_sizes, train_scores_mean, 'bo-')
plt.fill_between(train_sizes, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std,alpha=0.1, color='b')

plt.plot(train_sizes, test_scores_mean, 'ro-')
plt.fill_between(train_sizes, test_scores_mean-test_scores_std, test_scores_mean+test_scores_std,alpha=0.1, color='r')


Out[288]:
<matplotlib.collections.PolyCollection at 0x7f20140f4050>

改进2


In [300]:
from sklearn.model_selection import train_test_split

In [307]:
X_tra, X_val, y_tra, y_val = train_test_split(X_train,y_train, test_size=0.3, random_state=0)

In [309]:
lrl1 = LogisticRegression(C=1.0, penalty="l1", tol=1e-6)
lrl1.fit(X_tra, y_tra)
y_val_pred = lrl1.predict(X_val)

In [325]:
val_pid_set = X_val.iloc[y_val.tolist() != y_val_pred,:].index
data_train.iloc[data_train.index.isin(val_pid_set),:].head(10)


Out[325]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
50 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 NaN S
56 1 1 Woolner, Mr. Hugh male NaN 0 0 19947 35.5000 C52 S
66 1 3 Moubarek, Master. Gerios male NaN 1 1 2661 15.2458 NaN C
69 1 3 Andersson, Miss. Erna Alexandra female 17.0 4 2 3101281 7.9250 NaN S
86 1 3 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... female 33.0 3 0 3101278 15.8500 NaN S
114 0 3 Jussila, Miss. Katriina female 20.0 1 0 4136 9.8250 NaN S
141 0 3 Boulos, Mrs. Joseph (Sultana) female NaN 0 2 2678 15.2458 NaN C
205 1 3 Cohen, Mr. Gurshon "Gus" male 18.0 0 0 A/5 3540 8.0500 NaN S
241 0 3 Zabour, Miss. Thamine female NaN 1 0 2665 14.4542 NaN C

我们随便列一些可能可以做的优化操作:

  • Age属性不使用现在的拟合方式,而是根据名称中的『Mr』『Mrs』『Miss』等的平均值进行填充。
  • Age不做成一个连续值属性,而是使用一个步长进行离散化,变成离散的类目feature。
  • Cabin再细化一些,对于有记录的Cabin属性,我们将其分为前面的字母部分(我猜是位置和船层之类的信息) 和 后面的数字部分(应该是房间号,有意思的事情是,如果你仔细看看原始数据,你会发现,这个值大的情况下,似乎获救的可能性高一些)。
  • Pclass和Sex俩太重要了,我们试着用它们去组出一个组合属性来试试,这也是另外一种程度的细化。
  • 单加一个Child字段,Age<=12的,设为1,其余为0(你去看看数据,确实小盆友优先程度很高啊)
  • 如果名字里面有『Mrs』,而Parch>1的,我们猜测她可能是一个母亲,应该获救的概率也会提高,因此可以多加一个Mother字段,此种情况下设为1,其余情况下设为0
  • 登船港口可以考虑先去掉试试(Q和C本来就没权重,S有点诡异)
  • 把堂兄弟/兄妹 和 Parch 还有自己 个数加在一起组一个Family_size字段(考虑到大家族可能对最后的结果有影响)
  • Name是一个我们一直没有触碰的属性,我们可以做一些简单的处理,比如说男性中带某些字眼的(‘Capt’, ‘Don’, ‘Major’, ‘Sir’)可以统一到一个Title,女性也一样。

进一步和优化特征,把以把得分做到0.804,基本可以排到前700+了。

模型整合


In [ ]:
from sklearn.ensemble import BaggingClassifier

In [326]:
clf = LogisticRegression(C=1.0, penalty="l1", tol=1e-6)
bag_clf = BaggingClassifier(clf, n_estimators=20, max_samples=0.75, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=-1)
bag_clf.fit(X_train, y_train)
bag_clf.score(X_train, y_train)


Out[326]:
0.81593714927048255

In [328]:
scores = cross_val_score(bag_clf, X_train, y_train, cv=5)
scores.mean()


Out[328]:
0.80808252538223635

模型整合的确能提升模型的效果。
我们还可以再构建其它不同的分类器,再把所有的输出结果做为特征,进行二次学习。。。。

回顾

回顾整个分析,我们绝大部分时间是在做数据预处理,特征选择,特征构造等相关的工作,而真正用在建模的时间不是很多。特征工程很重要


In [ ]: