Below is the data dictionary, which explains the columns of the CSV file.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.
For more details, see https://www.kaggle.com/c/titanic/data
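To make the keys in the table concrete, here is a small sketch that decodes a few coded values back into their labels. The three rows are made up for illustration, and the column names (`Survived`, `Pclass`, `Embarked`) are assumed to match the capitalised names used in train.csv.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
})

# Keys copied from the data dictionary above.
survived_key = {0: "No", 1: "Yes"}
pclass_key = {1: "1st", 2: "2nd", 3: "3rd"}
embarked_key = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Translate each coded column into its human-readable label.
decoded = pd.DataFrame({
    "Survived": toy["Survived"].map(survived_key),
    "Pclass": toy["Pclass"].map(pclass_key),
    "Embarked": toy["Embarked"].map(embarked_key),
})
print(decoded)
```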
Let's wrangle the data!
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('ggplot')
# Note: with Anaconda on Windows, this cell may emit a warning
In [3]:
df=pd.read_csv('train.csv')
In [4]:
df.head()
Out[4]:
In [5]:
df.hist(figsize=(16, 10))
Out[5]:
In [6]:
df.describe()
Out[6]:
In [7]:
df.keys()
Out[7]:
In [8]:
# Drop columns that are not used as features (axis=1 means columns)
df2 = df.drop(['Name', 'PassengerId', 'Cabin', 'Ticket'], axis=1)
In [9]:
df2.head()
Out[9]:
In [10]:
# Encode Sex as a number: female -> 0, male -> 1
code = {'female': 0, 'male': 1}
df2["sexc"] = df2["Sex"].map(code)
In [11]:
df2.head()
Out[11]:
In [12]:
df2["Embarked"].value_counts()
Out[12]:
In [13]:
# Encode the port of embarkation; missing values (NaN) stay NaN
code = {'C': 0, 'Q': 1, 'S': 2}
df2["embarkedc"] = df2["Embarked"].map(code)
In [14]:
df2.head(10)
Out[14]:
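The integer codes above put the three ports in an arbitrary order (C < Q < S). As a possible alternative, not used in this notebook, here is a sketch of one-hot encoding with `pd.get_dummies` next to the integer coding. The toy `Embarked` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, invented for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Integer coding, as in the notebook: NaN has no entry in the dict, so it stays NaN.
code = {"C": 0, "Q": 1, "S": 2}
toy["embarkedc"] = toy["Embarked"].map(code)

# One-hot coding: one indicator column per port; the NaN row is all zeros.
onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy)
print(onehot)
```

Tree-based models usually cope fine with integer codes, so treat this as an option to try rather than a required fix.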
In [15]:
# Remove the original text columns now that they are encoded
df3 = df2.drop(['Sex', 'Embarked'], axis=1)
In [16]:
df4=df3.dropna(axis=0)
df4.head(10)
Out[16]:
In [17]:
df5=df3[ df3['Age'].notnull() ]
df5.head()
Out[17]:
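The two cells above filter rows in slightly different ways: `dropna(axis=0)` removes a row if any column is missing, while the `notnull()` mask only looks at `Age`. A minimal sketch on made-up data shows the difference (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: one row lacks Age, another lacks embarkedc.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "embarkedc": [2.0, 1.0, np.nan, 0.0],
})

all_complete = toy.dropna(axis=0)        # keeps rows with no missing value at all
age_known = toy[toy["Age"].notnull()]    # keeps rows where only Age is required

print(len(all_complete), len(age_known))
```

Only the fully complete frame (`df4` in the notebook) is safe to feed to the classifier below, since the row kept by the `Age` mask can still hold a NaN in another column.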
In [18]:
# Feature matrix: every column except the first one (Survived)
X = df4[df4.columns[1:]].to_numpy()
In [19]:
# Target as a column vector of shape (n, 1); flattened in a later cell
y = df4[["Survived"]].to_numpy()
In [20]:
y.shape
Out[20]:
In [21]:
y=y.reshape(-1)
In [22]:
y.shape
Out[22]:
In [23]:
X.shape
Out[23]:
In [24]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
In [25]:
forest = ExtraTreesClassifier(n_estimators=250,random_state=0)
In [26]:
cross_val_score(forest, X, y)
Out[26]:
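`cross_val_score` returns one accuracy score per fold. When `cv` is not given, the number of folds depends on the scikit-learn version (3 in older releases, 5 since 0.22), so it can help to set it explicitly and look at the mean. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features; the `_demo` names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: 7 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

forest_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)

# Explicit 5-fold cross-validation: one accuracy value per fold.
scores = cross_val_score(forest_demo, X_demo, y_demo, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```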
In [27]:
forest.fit(X,y)
Out[27]:
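Once the forest is fitted, its `feature_importances_` attribute gives a rough idea of which columns drive the predictions. Here is a minimal sketch on synthetic data, where only the first feature carries signal. The feature names are the ones assumed from the notebook's column order, and the data itself is random.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Random stand-in data: 7 features, target depends only on the first one.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 7))
y_demo = (X_demo[:, 0] > 0).astype(int)

forest_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Column order assumed from the notebook (everything after Survived).
names = ["Pclass", "Age", "SibSp", "Parch", "Fare", "sexc", "embarkedc"]
importances = forest_demo.feature_importances_
for name, value in zip(names, importances):
    print(f"{name}: {value:.3f}")
```

On the real data the same loop would run on `forest` right after `forest.fit(X, y)`.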
In [28]:
# predict expects a 2D array: one row per passenger
forest.predict([[3, 22.0, 1, 0, 7.2500, 1.0, 2.0]])
Out[28]:
In [29]:
forest.predict([[1, 38.0, 1, 0, 71.2833, 0.0, 0.0]])
Out[29]:
In [30]:
#Pclass=3 (lower), Age=25 (guessing), SibSp=0, Parch=0, Fare=0.0 (free trip), sexc=1 (male), embarkedc=1 (Queenstown, guessing)
forest.predict([[3, 25.0, 0, 0, 0.0, 1, 1]])
Out[30]:
-- The prediction is 0, which means that, sadly, Jack does not survive.
In [31]:
forest.predict([[1, 23.0, 0, 1, 50.0, 0, 1]])
Out[31]:
-- But Rose has to carry on.
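Two practical notes on single-passenger predictions: recent scikit-learn versions expect a 2D input (one row per passenger), and `predict_proba` shows how confident the forest is instead of just the 0/1 label. A sketch on random stand-in data, where the passenger values are the guesses used above and the model is not the one trained in this notebook:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Random stand-in training data with 7 features, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = (X_demo[:, 0] > 0).astype(int)

forest_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# One passenger as a single row: shape (1, 7), not a flat list.
passenger = np.array([[3, 25.0, 0, 0, 0.0, 1, 1]])

label = forest_demo.predict(passenger)        # array with one class label
proba = forest_demo.predict_proba(passenger)  # one row: [P(class 0), P(class 1)]
print(label, proba)
```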
In [32]:
%%HTML
<img src="https://upload.wikimedia.org/wikipedia/en/b/bb/Titanic_breaks_in_half.jpg">