Here is the data dictionary. It will help you understand the columns in the CSV.
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
sibsp: The dataset defines family relations as follows. Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
parch: The dataset defines family relations as follows. Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
For the full description, see https://www.kaggle.com/c/titanic/data
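The age convention in the notes above can be checked directly. This is a minimal sketch, assuming a small made-up `ages` Series instead of the real `Age` column; the helper name `is_estimated` is hypothetical, and treating 0.5 as an infant age (not an estimate) is an assumption.

```python
import pandas as pd

# Hypothetical sample values, not taken from train.csv
ages = pd.Series([0.42, 22.0, 28.5, 0.5, 40.5, None])

def is_estimated(age):
    """True where age is at least 1 and ends in .5 (the 'xx.5' form)."""
    return (age >= 1) & ((age % 1) == 0.5)

estimated = is_estimated(ages)
infants = ages < 1  # fractional ages below 1 are treated as real ages here

print(ages[estimated].tolist())
print(ages[infants].tolist())
```

Missing ages compare as False in both masks, so they land in neither group.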
Let's wrangle the data!
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('ggplot')
# Note: this may emit a warning under Anaconda on Windows
In [2]:
df=pd.read_csv('train.csv')
In [3]:
df.head()
Out[3]:
In [4]:
df.hist( figsize=(16, 10))
Out[4]:
In [5]:
df.describe()
Out[5]:
In [6]:
df.keys()
Out[6]:
In [7]:
# Drop columns not used below; axis must be passed by keyword in pandas 2.0+
df2=df.drop(['Name','PassengerId','Cabin','Ticket'],axis=1)
In [8]:
df2.head()
Out[8]:
In [9]:
# Encode Sex as a numeric column
code={'female':0,'male':1}
for k,i in df2.iterrows():
    df2.loc[k,"sexc"]=code[i['Sex']]
In [10]:
df2.head()
Out[10]:
In [11]:
df2["Embarked"].value_counts()
Out[11]:
In [12]:
code={'C':0,'Q':1, 'S':2}
for k,i in df2.iterrows():
    # NaN != NaN, so this test skips rows where Embarked is missing
    if i['Embarked']==i['Embarked']:
        df2.loc[k,"embarkedc"]=code[i['Embarked']]
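The two row-by-row encoding loops above could also be written with `Series.map`, which does the dictionary lookup for a whole column at once. This is a sketch on a hypothetical three-row frame named `demo`, not the notebook's `df2`.

```python
import pandas as pd

# Hypothetical rows standing in for df2; the real frame comes from train.csv
demo = pd.DataFrame({"Sex": ["male", "female", "female"],
                     "Embarked": ["S", None, "C"]})

# Series.map looks each value up in the dict; values with no match become NaN
demo["sexc"] = demo["Sex"].map({"female": 0, "male": 1})
demo["embarkedc"] = demo["Embarked"].map({"C": 0, "Q": 1, "S": 2})

print(demo)
```

Missing ports stay NaN, so no explicit NaN check is needed and a later `dropna` still removes those rows.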
In [13]:
df2.head(10)
Out[13]:
In [14]:
# remove the original text columns now that they are encoded
df3=df2.drop(['Sex','Embarked'],axis=1)
In [15]:
df4=df3.dropna(axis=0)
df4.head(10)
Out[15]:
In [70]:
import numpy as np
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
A=df4.to_numpy()
X=df4.iloc[:,[2,5]].to_numpy()  # columns 2 and 5: Age and Fare
In [71]:
#X[:,1]=np.log(X[:,1]+1)
In [72]:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=1.5, min_samples=20).fit(X)
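To make the meaning of `labels_` concrete, here is a small sketch on hypothetical toy points (not the Titanic data), using the same `eps` and `min_samples` as the cell above. DBSCAN numbers clusters from 0 and labels points that belong to no cluster as -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight 5x5 grids of points plus two distant outliers
grid = np.array([[x, y] for x in range(5) for y in range(5)]) * 0.1
pts = np.vstack([grid, grid + 10.0, [[50.0, 50.0], [-50.0, 50.0]]])

labels = DBSCAN(eps=1.5, min_samples=20).fit(pts).labels_

n_clusters = labels.max() + 1        # cluster ids run from 0 upward
n_noise = int((labels == -1).sum())  # -1 marks points assigned to no cluster
print(n_clusters, n_noise)
```

Because cluster ids start at 0, `db.labels_.max()` is the highest cluster id, and adding one gives the number of clusters found.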
In [73]:
db.labels_.max()
Out[73]:
In [74]:
fig=plt.figure(figsize=(12,9))
#plt.rcParams['figure.figsize'] = (12,9)
plt.xlabel("Age")
plt.ylabel("Fare (log scale)")
plt.title("Titanic")
plt.yscale('log')
color_list=['r','g','b','c','m','y','k']
# label -1 is DBSCAN noise; color_list[-1] draws it in black
for i in range( -1,db.labels_.max()+1 ):
    # circles: survived
    is_survive=(A[:,0]==1) & (db.labels_==i)
    plt.scatter(X[is_survive,0], X[is_survive,1], marker='o',color=color_list[i])
    # crosses: did not survive
    is_not_survive=(A[:,0]==0) & (db.labels_==i)
    plt.scatter(X[is_not_survive,0], X[is_not_survive,1], marker='x',color=color_list[i])