The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
Data Dictionary
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [11]:
data_train = pd.read_csv("../kaggle_titanic/data/train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data_train.head(5)
Out[11]:
In [8]:
data_train.info()
What does the output above tell us? There are 891 passengers in the training set, but unfortunately some attributes are incomplete: Age is missing for some passengers, and Cabin is recorded for only a fraction of them.
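To see exactly which columns are incomplete, one quick check (just the standard pandas API) is to count the missing values per column:
In [ ]:
# Number of missing values in each column of the training set
data_train.isnull().sum()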
That still feels a little thin; we'd like a closer look at the actual values. The call below gives the distribution of the numeric columns (some attributes, such as the name, are text, and others, such as the port of embarkation, are categorical, so they will not show up in this summary):
In [9]:
data_train.describe()
Out[9]:
What more can we read off here? The mean row tells us that about 0.383838 of the passengers survived, that 2nd/3rd class passengers outnumber 1st class, that the average age is roughly 29.7 (rows with no recorded age are skipped in this calculation), and so on.
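As an aside, the text/categorical columns that describe() skips by default can be summarized as well by passing pandas' include argument:
In [ ]:
# Count, number of unique values and most frequent value for the non-numeric (object) columns
data_train.describe(include=['O'])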
At this point we probably already have some hypotheses about which attributes matter.
Talk is cheap, though, so let's go back to the data and look at how these attribute values are actually distributed.
In [68]:
plt.figure(figsize=(12,8))
plt.subplot(2,3,1)
sns.countplot(x='Pclass',hue='Survived',data=data_train)
plt.title('Pclass vs Survived')
plt.subplot(2,3,2)
sns.countplot(x='Sex',hue='Survived',data=data_train)
plt.title('Sex vs Survived')
plt.subplot(2,3,3)
sns.countplot(x='Embarked',hue='Survived',data=data_train)
plt.title('Embarked vs Survived')
plt.subplot(2,2,3)
sns.countplot(x='SibSp',hue='Survived',data=data_train)
plt.title('SibSp vs Survived')
plt.subplot(2,2,4)
sns.countplot(x='Parch',hue='Survived',data=data_train)
plt.title('Parch vs Survived')
Out[68]:
Conclusions:
In [45]:
# Age distribution (kernel density) for passengers who died vs. survived
fig = plt.figure()
fig.set(alpha=0.2)
data_train.Age[data_train.Survived == 0].plot(kind='kde', label='Survived = 0')
data_train.Age[data_train.Survived == 1].plot(kind='kde', label='Survived = 1')
plt.xlabel('Age')
plt.legend()
Out[45]:
In [46]:
# Ticket is the ticket number and should be (nearly) unique, so it probably has little to do with the outcome; leave it out of the feature set for now
# Cabin has values for only 204 passengers; let's first look at how those values are distributed
data_train.Cabin.value_counts()
Out[46]:
Cabin is an awkward attribute: it should be treated as categorical, but it has many missing values and the recorded values are very spread out, so it is bound to be a headache. First instinct: if we factorize it directly as a categorical feature, it is so scattered that each resulting dummy feature will hardly carry any weight. (-----> worth asking someone who knows the data what the codes actually mean.) For convenience, we will simply mark the missing cases as unknown.
In [12]:
data_train = pd.read_csv("../kaggle_titanic/data/train.csv",index_col='PassengerId')
data_test = pd.read_csv("../kaggle_titanic/data/test.csv",index_col='PassengerId')
data = pd.concat([data_train, data_test],axis=0)
In [13]:
data.info()
In [14]:
# Fill the missing Embarked values with the most common port
data.loc[data.Embarked.isnull(), 'Embarked'] = data.Embarked.dropna().mode()[0]
In [15]:
# Mean fare per ticket class (used below to fill the one missing Fare value)
data.groupby('Pclass')['Fare'].mean()
Out[15]:
In [16]:
# The passenger with the missing Fare is in 3rd class, so use the 3rd-class mean fare from above
data.loc[data.Fare.isnull(), 'Fare'] = 13.302889
In [18]:
from sklearn.ensemble import RandomForestRegressor
age_df = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
# Split passengers into those with a known age and those without
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
# Fit a random forest on the known ages and predict the missing ones
rf_age = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
rf_age.fit(known_age[:, 1:], known_age[:, 0])
rf_predict = rf_age.predict(unknown_age[:, 1:])
data.loc[data.Age.isnull(), "Age"] = rf_predict
In [19]:
# Reduce Cabin to a binary known/unknown indicator
data.loc[data.Cabin.notnull(), "Cabin"] = "Known"
data.loc[data.Cabin.isnull(), "Cabin"] = "Unknown"
In [20]:
data.info()
In [21]:
### Convert the categorical features to one-hot dummies
dummies_Cabin = pd.get_dummies(data['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix= 'Pclass')
df = pd.concat([data, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df.head(5)
Out[21]:
In [22]:
### Standardize Age and Fare to remove the difference in scale
from sklearn.preprocessing import StandardScaler
age_scaler = StandardScaler().fit(df.Age.values.reshape(-1, 1))
df['Age_scaled'] = age_scaler.transform(df.Age.values.reshape(-1, 1))
fare_scaler = StandardScaler().fit(df.Fare.values.reshape(-1, 1))
df['Fare_scaled'] = fare_scaler.transform(df.Fare.values.reshape(-1, 1))
df.drop(['Age', 'Fare'], axis=1, inplace=True)
In [37]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import sklearn.externals.joblib as joblib
In [24]:
X = df.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
y = df.loc[:,'Survived']
X_train = X.loc[y.notnull(),:]
y_train = y.loc[y.notnull()]
X_test = X.loc[y.isnull(),:]
In [25]:
# Save the training arrays; we will reuse them later
import pickle
train_data = (np.array(X_train), np.array(y_train))
with open("../kaggle_titanic/data/train_data", "wb") as f:
    pickle.dump(train_data, f)
In [27]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
# Predict on the test set and write out a submission file
y_test = lr.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index, columns=['survival'], dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_0.csv")
After submitting, this model scores 76.6% accuracy on the test set, a rank of roughly 2700+.
In [38]:
# Save the fitted model to disk
joblib.dump(lr, '../kaggle_titanic/model/lr.ckpt')
# lr_read = joblib.load('../kaggle_titanic/model/lr.ckpt')
Out[38]:
In [30]:
import pickle
with open("../kaggle_titanic/data/train_data", "rb") as f:
    X_train, y_train = pickle.load(f)
In [31]:
# There is no labelled test set, so use k-fold cross-validation to assess the model
lr = LogisticRegression()
score = cross_val_score(lr,X_train,y_train,cv=5)
score.mean()
Out[31]:
In [32]:
# Use sklearn's univariate feature selection to keep only the k best features
from sklearn.feature_selection import SelectKBest
skb = SelectKBest(k=12)
X30 = skb.fit_transform(X_train,y_train)
lr = LogisticRegression()
score = cross_val_score(lr,X30,y_train,cv=5)
score.mean()
Out[32]:
Feature selection does not improve the score here, probably because there are only a handful of features to begin with and all of them are fairly informative.
In something like document classification, where the features can number in the millions, feature selection becomes essential.
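To illustrate that second point, here is a minimal sketch of chi-squared feature selection on a bag-of-words text matrix, using the same SelectKBest API; the toy documents and labels are invented for the example and have nothing to do with this dataset:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus; in real text classification the vocabulary (and hence the
# number of features) can easily reach the hundreds of thousands or millions
docs = ["the ship sank", "the ship arrived safely",
        "passengers survived the wreck", "the voyage was calm"]
labels = [1, 0, 1, 0]

X_text = CountVectorizer().fit_transform(docs)   # sparse bag-of-words matrix
selector = SelectKBest(chi2, k=3)                # keep only the 3 highest-scoring terms
X_selected = selector.fit_transform(X_text, labels)
print(X_text.shape, '->', X_selected.shape)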
In [33]:
# Pipeline (feature selection + logistic regression), tuned with grid search
# liblinear supports both the l1 and l2 penalties searched below
clf = Pipeline([('skb', SelectKBest(k=10)), ('lr', LogisticRegression(C=1, solver='liblinear'))])
grid_param = {'skb__k': [8, 9, 10, 11, 12, 13, 14],
              'lr__C': [0.01, 0.1, 1, 10, 100],
              'lr__penalty': ['l1', 'l2']}
grid = GridSearchCV(clf, grid_param, n_jobs=-1, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
In [35]:
# Refit with the tuned hyper-parameters
clf = Pipeline([('skb', SelectKBest(k=14)), ('lr', LogisticRegression(C=0.1, penalty='l2'))])
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
# Predict on the test set and write out a submission file
y_test = clf.predict(X_test)
ret = pd.DataFrame(y_test, index=X_test.index, columns=['survival'], dtype=np.int32)
ret.to_csv("../kaggle_titanic/res/lr_1.csv")
Submitting this result scores 0.78, a rank of roughly 2000+.
In [288]:
# Learning curve: training score vs. cross-validation score as the training set grows
from sklearn.model_selection import learning_curve
pp = Pipeline([('skb', SelectKBest(k=14)), ('lr', LogisticRegression(C=0.1))])
train_sizes, train_scores, test_scores = learning_curve(pp, X_train, y_train, cv=5, n_jobs=-1, train_sizes=np.linspace(0.2, 1.0, 20))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure()
# Training score (blue) and cross-validation score (red), each with a one-standard-deviation band
plt.plot(train_sizes, train_scores_mean, 'bo-', label='Training score')
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color='b')
plt.plot(train_sizes, test_scores_mean, 'ro-', label='Cross-validation score')
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color='r')
plt.legend(loc='best')
Out[288]:
In [300]:
from sklearn.model_selection import train_test_split
In [307]:
X_tra, X_val, y_tra, y_val = train_test_split(X_train,y_train, test_size=0.3, random_state=0)
In [309]:
# l1-regularized logistic regression fitted on the 70% training split
lrl1 = LogisticRegression(C=1.0, penalty="l1", tol=1e-6, solver="liblinear")
lrl1.fit(X_tra, y_tra)
y_val_pred = lrl1.predict(X_val)
In [325]:
# PassengerIds of the validation examples the model misclassified
val_pid_set = X_val[y_val.values != y_val_pred].index
data_train.loc[data_train.index.isin(val_pid_set), :].head(10)
Out[325]:
There are plenty of further optimizations we could try.
With more feature engineering and tuning, the score can be pushed to about 0.804, which is roughly enough for the top 700.
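One illustrative example of such a feature (a hypothetical sketch, not necessarily what produced the 0.804 score): the title embedded in Name carries both sex and social-status information and can be one-hot encoded like the other categorical columns. It is shown on data_train for brevity; in the pipeline above it would be applied to the combined data frame before the get_dummies step.
In [ ]:
# Hypothetical extra feature: extract the title from Name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
title = data_train.Name.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
# Collapse a few rare spellings into the common titles
title = title.replace(['Mlle', 'Ms'], 'Miss').replace('Mme', 'Mrs')
dummies_title = pd.get_dummies(title, prefix='Title')
dummies_title.head()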
In [ ]:
from sklearn.ensemble import BaggingClassifier
In [326]:
# Bagging: 20 logistic regressions, each trained on a bootstrap sample of 75% of the training rows
clf = LogisticRegression(C=1.0, penalty="l1", tol=1e-6, solver="liblinear")
bag_clf = BaggingClassifier(clf, n_estimators=20, max_samples=0.75, max_features=1.0, bootstrap=True, bootstrap_features=False, n_jobs=-1)
bag_clf.fit(X_train, y_train)
bag_clf.score(X_train, y_train)
Out[326]:
In [328]:
scores = cross_val_score(bag_clf, X_train, y_train, cv=5)
scores.mean()
Out[328]:
Model ensembling does indeed improve the result.
We could also build several other kinds of classifiers and feed all of their outputs, as features, into a second-level learner.
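A minimal sketch of that second-level (stacking) idea, reusing the X_train / y_train arrays from above; the choice of base models here is purely illustrative:
In [ ]:
import numpy as np
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# First-level models whose predictions become the new features
base_models = [
    LogisticRegression(C=0.1, solver='liblinear'),
    RandomForestClassifier(n_estimators=200, n_jobs=-1),
]
# cross_val_predict keeps every prediction out-of-fold, so no model ever
# predicts a row it was trained on
meta_features = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])
# Second-level learner trained on the stacked out-of-fold predictions
meta_clf = LogisticRegression()
print(cross_val_score(meta_clf, meta_features, y_train, cv=5).mean())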
Looking back over the whole analysis, the vast majority of the time went into data preprocessing, feature selection and feature construction, and comparatively little into the modelling itself. Feature engineering matters.
In [ ]: