Import libraries.
In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load training data and test data.
In [27]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')
Well, I don't really have a plan for handling this data yet, so let's just take a look at it, starting with the training data.
In [28]:
df_train.head()
Out[28]:
In [29]:
df_train.describe()
Out[29]:
In [30]:
df_train.info()
In [31]:
df_train.isnull().sum()
Out[31]:
Hmm... Some data is missing. Age could be an important feature. Cabin seems like a useless feature, so I am going to discard it. That leads to my first question: how do you decide which features to use and which to drop?
After reading other people's analyses, I came across this:
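One rough way to approach the question above is to look at the fraction of missing values per column and flag the mostly empty ones. The sketch below is only an illustration: the `toy` frame is made-up data standing in for `df_train`, and the 0.5 cutoff is an arbitrary assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Sex':   ['male', 'female', 'female', 'male'],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Hypothetical cutoff: flag columns that are more than half empty.
threshold = 0.5
drop_candidates = missing_ratio[missing_ratio > threshold].index.tolist()
print(drop_candidates)
```

A high missing ratio alone does not prove a feature is useless, so this is a first filter rather than a final decision.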
In [32]:
df_train.describe(include=['O'])
Out[32]:
Hmm... It seems some people shared a cabin. Did people in the same cabin help each other and improve their chances of survival? Cabin has too few values to tell, though. Also, one ticket number is shared by up to 7 people, which suggests they travelled as a group. Were they more likely to help each other and survive?
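The shared-ticket idea can be checked by counting how many passengers hold each ticket and comparing survival rates across group sizes. This is a minimal sketch on made-up data; `TicketGroupSize` is a hypothetical column name, not something from the dataset.

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    'Ticket':   ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'Survived': [1, 0, 0, 1, 1, 0],
})

# Number of passengers sharing each ticket, broadcast back to every row.
toy['TicketGroupSize'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Survival rate for each group size.
rate_by_group_size = toy.groupby('TicketGroupSize')['Survived'].mean()
print(rate_by_group_size)
```

On the real data, group members may be split between the training and test files, so counting within `df_train` alone could underestimate the group size.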
Among the 891 rows, 577 are male and 314 are female.
Now, do the same thing for the test data.
In [33]:
df_test.head()
Out[33]:
In [34]:
df_test.describe()
Out[34]:
In [35]:
df_test.describe(include=['O'])
Out[35]:
In [36]:
df_test.info()
In [37]:
sns.countplot(x='Survived', data=df_train)
plt.show()
In [38]:
df_train['Percentage'] = 1 # this is a helper column
df_train[['Percentage','Survived']].groupby('Survived').count().apply(lambda x: (100 * x)/x.sum())
Out[38]:
In [39]:
df_train[['Pclass','Survived']].groupby('Pclass').mean()
Out[39]:
In [40]:
df_train['Count'] = 1 # this is a helper column
df_train[['Pclass','Survived','Count']].groupby(['Pclass','Survived']).count()
Out[40]:
In [41]:
df_train[['Sex','Survived']].groupby('Sex').mean()
Out[41]:
In [42]:
df_train[['Sex','Survived','Count']].groupby(['Sex','Survived']).count()
Out[42]:
In [43]:
df_train[['Pclass','Sex','Survived','Count']].groupby(['Pclass','Sex','Survived']).count()
Out[43]:
In [44]:
df_train[['Pclass','Sex','Survived']].groupby(['Pclass','Sex']).mean()
Out[44]:
The female survival rates in Pclass 1 and 2 are similar, but the rate in Pclass 3 is much lower. Well, the story goes that the gate from the Pclass 3 area to the deck was locked at the very beginning. That's sad...
The male survival rates in Pclass 2 and 3 are similar, but the rate in Pclass 1 is much higher.
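The comparison above is easier to read when the rates are laid out as a Pclass-by-Sex table. A minimal sketch, using made-up data in place of `df_train`:

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Pclass, Sex), with Sex moved into the columns.
rate_table = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack('Sex')
print(rate_table)
```

Each row is a class and each column a sex, so the gap between classes can be read down a column.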
In [45]:
sns.boxplot(x='Survived', y='Age', hue='Sex',data=df_train, palette="coolwarm")
plt.show()
In [56]:
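def SimplyAge(colage):
    # Missing ages become -1 so they fall into their own (-2, 0] bin.
    colage = colage.fillna(-1)
    bins = (-2,0,5,10,20,35,60,100)
    # Replace each age with the interval it belongs to.
    colage = pd.cut(colage,bins)
    return colage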
colage = SimplyAge(df_train['Age'])
# for test
df_train['Age'] = colage
dfage = df_train
#dfage = pd.DataFrame()
#dfage['Age'] = colage
#dfage['Survived'] = df_train['Survived']
In [74]:
df_train[['Age','Survived','Count']].groupby(['Age','Survived']).count()
Out[74]:
In [75]:
df_train[['Age','Survived']].groupby('Age').mean()
Out[75]:
Well, babies appear to have the highest survival rate.
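The interval labels produced by `pd.cut` are a bit hard to read in the tables. One option is to pass named labels for the same bins. In this sketch the label names are my own assumption, and the ages are made-up sample values.

```python
import numpy as np
import pandas as pd

# Made-up sample ages, including one missing value.
ages = pd.Series([0.42, 4.0, 22.0, np.nan, 71.0])

# Same bin edges as SimplyAge; the label names are hypothetical.
bins = (-2, 0, 5, 10, 20, 35, 60, 100)
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Young Adult', 'Adult', 'Senior']

# Missing ages become -1 and land in the 'Unknown' bin.
age_group = pd.cut(ages.fillna(-1), bins, labels=labels)
print(age_group)
```

The 'Unknown' group then shows up as its own row in the groupby tables instead of being silently dropped.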
In [72]:
df_train[['Age','Survived','Sex','Count']].groupby(['Age','Sex','Survived']).count()
Out[72]:
In [71]:
df_train[['Age','Sex','Survived']].groupby(['Age','Sex']).mean()
Out[71]:
In [64]:
#sns.countplot(x='Age',hue='Survived',data=dfage)
sns.countplot(x='Age',data=dfage,color='Red')
sns.barplot(x='Age',y='Survived',data=dfage,estimator=np.sum,color='Blue')
Out[64]:
In [76]:
sns.barplot(x='Age',y='Count',hue='Survived',data=dfage,estimator=np.sum)
Out[76]:
In [49]:
sns.barplot(x='Pclass', y='Survived', hue='Sex',data=df_train,estimator=np.sum)
plt.show()
In [50]:
sns.countplot(x='Fare',data=df_train)
plt.show()