Titanic Survival

*Note: The dataset has been obtained from kaggle.com.

Loading Data

To begin with, the train and test datasets are loaded with the tools provided by pandas and concatenated into a single DataFrame. Working on one combined dataset makes the transformations below easier to apply consistently.


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

#load the files
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
data = pd.concat([train, test]).reset_index(drop=True)

#size of training dataset
train_samples = train.shape[0]

#print some of them
data.head()


Out[1]:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450

In [2]:
#show the data types
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

Data Analysis


In [3]:
#heatmap
#fig = plt.figure(figsize=(9, 7))
sns.heatmap(data.corr(), annot=True);


At first sight, the features most correlated with 'Survived' are 'Pclass' and 'Fare'.


In [4]:
data.groupby(data.Survived).Survived.count()


Out[4]:
Survived
0.0    549
1.0    342
Name: Survived, dtype: int64

There are more non-survivors than survivors, so the classes are imbalanced; we should keep this in mind when training and evaluating the models.
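One common way to compensate for this imbalance (an aside; it is not applied later in this notebook) is to weight the classes inversely to their frequency, e.g. with scikit-learn's class_weight option. A minimal sketch:

from sklearn.ensemble import RandomForestClassifier

#'balanced' reweights samples inversely to the class frequencies
clf = RandomForestClassifier(class_weight='balanced', random_state=42)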

Data Engineering

Dropping

A few features can be dropped. The ticket number, the passenger ID and the name should not, by themselves, be relevant to whether a passenger survives.


In [5]:
#Dropping useless features
data = data.drop(['Name', 'PassengerId', 'Ticket'], axis=1)
data.head(10)


Out[5]:
Age Cabin Embarked Fare Parch Pclass Sex SibSp Survived
0 22.0 NaN S 7.2500 0 3 male 1 0.0
1 38.0 C85 C 71.2833 0 1 female 1 1.0
2 26.0 NaN S 7.9250 0 3 female 0 1.0
3 35.0 C123 S 53.1000 0 1 female 1 1.0
4 35.0 NaN S 8.0500 0 3 male 0 0.0
5 NaN NaN Q 8.4583 0 3 male 0 0.0
6 54.0 E46 S 51.8625 0 1 male 0 0.0
7 2.0 NaN S 21.0750 1 3 male 3 0.0
8 27.0 NaN S 11.1333 2 3 female 0 1.0
9 14.0 NaN C 30.0708 0 2 female 1 1.0

Cabin

The cabin is formed by a deck letter followed by a number, and in some cases a passenger has more than one cabin. We can extract the letter to capture the passenger's deck, which indicates their status and location inside the ship (the letters correspond to the ship's real deck distribution).

Empty values will be set to zero.


In [6]:
data.Cabin.head(5)


Out[6]:
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [7]:
# Cabin: map the deck letter (first character) to a number, 'A' -> 1 ... 'T' -> 8
data.Cabin.fillna('0', inplace=True)
for i, letter in enumerate('ABCDEFGT', start=1):
    data.loc[data.Cabin.str[0] == letter, 'Cabin'] = i

In [8]:
# Change the type from object to int
data.Cabin = pd.to_numeric(data.Cabin)

Now we can show how this feature relates to survival in the train dataset.


In [9]:
data[data.Survived==1].groupby(data.Cabin).Cabin.hist();



In [10]:
data[data.Survived==0].groupby(data.Cabin).Cabin.hist();



In [11]:
data[data.Cabin==0].Cabin.count()/data.Cabin.count()


Out[11]:
0.77463712757830405

Another problem: 77% of the cabins are unknown. But from the previous plots we can assume that a passenger with an unknown cabin most likely did not survive.

So we can transform 'Cabin' into a binary feature: cabin known versus unknown.


In [12]:
data.loc[data.Cabin != 0, 'Cabin'] = 1

In [13]:
sns.heatmap(data.corr(), annot=True);


Embarked


In [14]:
g  = sns.factorplot(x="Embarked",y="Survived",kind="bar",data=data)


/Users/samuel/anaconda/envs/py3/lib/python3.5/site-packages/seaborn/categorical.py:1460: FutureWarning: remove_na is deprecated and is a private function. Do not use.
  stat_data = remove_na(group_data)

In [15]:
# Embarked
data.loc[data.Embarked == 'C', 'Embarked'] = 1
data.loc[data.Embarked == 'Q', 'Embarked'] = 2
data.loc[data.Embarked == 'S', 'Embarked'] = 3

In [16]:
data[data.Embarked.isna()]


Out[16]:
Age Cabin Embarked Fare Parch Pclass Sex SibSp Survived
61 38.0 1 NaN 80.0 0 1 female 0 1.0
829 62.0 1 NaN 80.0 0 1 female 0 1.0

In [17]:
# Both are first-class female survivors -> assumption: they embarked in C
data.Embarked.fillna(1, inplace=True)
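A quick sanity check for this assumption (an aside, using the untouched train DataFrame, which still has the letter-coded 'Embarked' column) is to compare their fare of 80 with the typical first-class fare per port:

#Median first-class fare per embarkation port; a median near 80 for 'C'
#would support the assumption
train[train.Pclass == 1].groupby('Embarked')['Fare'].median()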

Family Size

The family size is clearly relevant, since it can directly affect a person's chances of survival. The dataset has two features related to it: 'SibSp' (siblings and spouse) and 'Parch' (parents and children). Both are numeric, so they can be combined into a new feature representing the size of the family: sum them and assign the corresponding category. The categories used are 'alone' (0), 'medium' (1-4) and 'large' (more than 4).

After that, the original features will be deleted.


In [18]:
#New column to know if the passenger has family on board
def family(size):
    if size == 0:
        return "alone"
    elif size < 5:
        return "medium"
    else:
        return "large"

data['FamilySize'] = (data['SibSp']+data['Parch']).apply(family)
data = data.drop(['SibSp', 'Parch'], axis=1)

data.head()


Out[18]:
Age Cabin Embarked Fare Pclass Sex Survived FamilySize
0 22.0 0 3 7.2500 3 male 0.0 medium
1 38.0 1 1 71.2833 1 female 1.0 medium
2 26.0 0 3 7.9250 3 female 1.0 alone
3 35.0 1 3 53.1000 1 female 1.0 medium
4 35.0 0 3 8.0500 3 male 0.0 alone

Filling 'Age' NaN with Mean

Both the train and test datasets have rows with missing values. To deal with them, a mean is computed and used to fill in the blanks.


In [19]:
cond = (data['Sex']=='female') & (data['Pclass']==3)
data.groupby(['Survived','Sex','Pclass'])['Age'].mean()
data[cond].groupby(['Sex','Pclass'])['Age'].mean()


Out[19]:
Sex     Pclass
female  3         22.185329
Name: Age, dtype: float64

In the case of age this is not that easy. Age correlates with the class the passenger is traveling in, their sex and whether they survived, so it is worth taking these three features into account to calculate the most likely age.


In [20]:
def getAge(row):
    surv = row.Survived
    sex = row.Sex
    pclass = row.Pclass
    
    if surv==0 or surv==1:
        # Survival is known: condition the mean age on survival, sex and class
        condition = (data['Survived']==surv) & (data['Sex']==sex) & (data['Pclass']==pclass)
        df_mean = data[condition].groupby(['Survived','Sex','Pclass'])['Age'].mean()
    else:
        # Survival is NaN (test rows): condition on sex and class only
        condition = (data['Sex']==sex) & (data['Pclass']==pclass)
        df_mean = data[condition].groupby(['Sex','Pclass'])['Age'].mean()
    
    #print("surv: {}, sex: {}, class: {} -> age (mean): {}".format(surv, sex, pclass, df_mean.mean()))
    return df_mean.mean()

#fill every missing age with the mean age of its group
data['Age'] = data['Age'].fillna(data.apply(getAge, axis=1))

In [21]:
len(data['Age']) - data['Age'].count()


Out[21]:
0

'Fare' Distribution

Plotting the 'Fare' distribution shows the feature is skewed to the right. We can manage this by applying the log to the values. This works here because the minimum of the range is 0 and the log is only applied to strictly positive values; for negative numbers the log would be undefined.


In [22]:
data.Fare.hist(bins=50)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d44e518>

In [23]:
d1 = data["Fare"].map(lambda i: np.log(i) if i > 0 else 0)
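As an aside, np.log1p avoids the special case for zero: it computes log(1 + x), which is exactly 0 at x = 0 (the values differ slightly from the plain log):

#log1p(0) == 0, so no conditional is needed
d1_alt = data["Fare"].map(np.log1p)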

In [24]:
d1.hist(bins=50)


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d0bef98>

Filling 'Fare' NaN with Mean


In [25]:
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())

Age Range: Grouping

Similar to the engineered family-size feature, age can be categorized. The classes are: 'child' (under 5), 'young' (5-19), 'adult' (20-49) and 'old' (50 and over).


In [26]:
#define age by ranges
def getAgeRange(age):
    if age < 5:
        return "child"
    elif age < 20:
        return "young"
    elif age < 50:
        return "adult"
    else:
        return "old"

data['Age'] = data['Age'].apply(getAgeRange)
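pandas also offers pd.cut for this kind of binning; an equivalent sketch on a toy series (the bin edges mirror getAgeRange):

#right=False makes the bins [0,5), [5,20), [20,50), [50,inf)
ages = pd.Series([3, 15, 35, 60])
pd.cut(ages, bins=[0, 5, 20, 50, np.inf], right=False,
       labels=['child', 'young', 'adult', 'old'])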

In [27]:
# Age distribution
data.groupby(['Age'])['Survived'].describe()


Out[27]:
count mean std min 25% 50% 75% max
Age
adult 626.0 0.333866 0.471970 0.0 0.0 0.0 1.0 1.0
child 40.0 0.675000 0.474342 0.0 0.0 1.0 1.0 1.0
old 74.0 0.364865 0.484678 0.0 0.0 0.0 1.0 1.0
young 151.0 0.523179 0.501125 0.0 0.0 1.0 1.0 1.0

Getting Feature Dummies

The categorical features have to be converted to dummies. In this way we obtain new numerical (0/1) features.


In [28]:
#Transform categorical to dummies
data = pd.get_dummies(data)
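For reference, each categorical column expands into one 0/1 column per category; a toy sketch:

#Toy example: a 'Sex' column expands into 'Sex_female' and 'Sex_male'
pd.get_dummies(pd.Series(['male', 'female', 'male']), prefix='Sex')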

Function to Remove Anomalies

EllipticEnvelope fits a robust Gaussian estimate of the data and flags the most extreme samples as outliers (contamination=0.05 marks the 5% most extreme). The function below returns their indices.


In [29]:
from sklearn.covariance import EllipticEnvelope

def idxAnomalies(X):
    ee = EllipticEnvelope(contamination=0.05,
                          assume_centered=True,
                          random_state=13)
    ee.fit(X)
    pred = ee.predict(X)  # 1 for inliers, -1 for outliers

    return np.where(pred != 1)[0].tolist()

Checking NaN

Only the test dataset should have NaN values for the 'Survived' column.


In [30]:
#finding NaN
data.columns[data.isnull().any()].tolist()


Out[30]:
['Survived']

Normalizing

The numerical features do not all share the same range, and this can distort the results. A proper way to get good calculations is to normalize all the features.

In this case we're going to use the min-max scaler, which maps the values of each feature to the range [0, 1].
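For reference, MinMaxScaler applies, per feature, x' = (x - min) / (max - min); a minimal sketch of the equivalent manual computation on one column:

#Manual min-max scaling of one column, equivalent to MinMaxScaler
fare = data['Fare']
fare_scaled = (fare - fare.min()) / (fare.max() - fare.min())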


In [31]:
data.describe().T


Out[31]:
count mean std min 25% 50% 75% max
Cabin 1309.0 0.225363 0.417981 0.0 0.0000 0.0000 0.000 1.0000
Embarked 1309.0 2.490451 0.816089 1.0 2.0000 3.0000 3.000 3.0000
Fare 1309.0 33.295479 51.738879 0.0 7.8958 14.4542 31.275 512.3292
Pclass 1309.0 2.294882 0.837836 1.0 2.0000 3.0000 3.000 3.0000
Survived 891.0 0.383838 0.486592 0.0 0.0000 0.0000 1.000 1.0000
Age_adult 1309.0 0.723453 0.447461 0.0 0.0000 1.0000 1.000 1.0000
Age_child 1309.0 0.038961 0.193576 0.0 0.0000 0.0000 0.000 1.0000
Age_old 1309.0 0.084034 0.277544 0.0 0.0000 0.0000 0.000 1.0000
Age_young 1309.0 0.153552 0.360657 0.0 0.0000 0.0000 0.000 1.0000
Sex_female 1309.0 0.355997 0.478997 0.0 0.0000 0.0000 1.000 1.0000
Sex_male 1309.0 0.644003 0.478997 0.0 0.0000 1.0000 1.000 1.0000
FamilySize_alone 1309.0 0.603514 0.489354 0.0 0.0000 1.0000 1.000 1.0000
FamilySize_large 1309.0 0.045837 0.209210 0.0 0.0000 0.0000 0.000 1.0000
FamilySize_medium 1309.0 0.350649 0.477356 0.0 0.0000 0.0000 1.000 1.0000

Leaving the dummy features aside, we can see that 'Pclass' (as well as 'Embarked' and 'Fare') does not lie within the (0, 1) range.


In [32]:
import matplotlib.pyplot as plt

data.head(15).plot()
plt.show()


After the scaling we'll have all the features in the same range. This helps the calculations: no feature dominates just because of its scale, and small, comparable numbers are easier to work with.

The picture below shows the feature ranges.


In [33]:
#Squeeze the data to [0,1]
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
data[['Pclass']] = scaler.fit_transform(data[['Pclass']])
data[['Fare']] = scaler.fit_transform(data[['Fare']])
data[['Cabin']] = scaler.fit_transform(data[['Cabin']])
data[['Embarked']] = scaler.fit_transform(data[['Embarked']])
print("Train shape: {}".format(data.shape))

data.head(15).plot()
plt.show()
data.describe().T


Train shape: (1309, 14)
Out[33]:
count mean std min 25% 50% 75% max
Cabin 1309.0 0.225363 0.417981 0.0 0.000000 0.000000 0.000000 1.0
Embarked 1309.0 0.745225 0.408045 0.0 0.500000 1.000000 1.000000 1.0
Fare 1309.0 0.064988 0.100988 0.0 0.015412 0.028213 0.061045 1.0
Pclass 1309.0 0.647441 0.418918 0.0 0.500000 1.000000 1.000000 1.0
Survived 891.0 0.383838 0.486592 0.0 0.000000 0.000000 1.000000 1.0
Age_adult 1309.0 0.723453 0.447461 0.0 0.000000 1.000000 1.000000 1.0
Age_child 1309.0 0.038961 0.193576 0.0 0.000000 0.000000 0.000000 1.0
Age_old 1309.0 0.084034 0.277544 0.0 0.000000 0.000000 0.000000 1.0
Age_young 1309.0 0.153552 0.360657 0.0 0.000000 0.000000 0.000000 1.0
Sex_female 1309.0 0.355997 0.478997 0.0 0.000000 0.000000 1.000000 1.0
Sex_male 1309.0 0.644003 0.478997 0.0 0.000000 1.000000 1.000000 1.0
FamilySize_alone 1309.0 0.603514 0.489354 0.0 0.000000 1.000000 1.000000 1.0
FamilySize_large 1309.0 0.045837 0.209210 0.0 0.000000 0.000000 0.000000 1.0
FamilySize_medium 1309.0 0.350649 0.477356 0.0 0.000000 0.000000 1.000000 1.0

Splitting the data to train and test

As good practice, we split the data into two different datasets, training and testing. Using the number of training samples (saved at the beginning) we can split at the right index.

Besides, we'll use the stratified k-fold method to get different batches of the data (configured here with 3 splits).

StratifiedKFold training set


In [34]:
from sklearn.model_selection import StratifiedKFold

y = np.array(data['Survived'])
X = np.array(data.drop('Survived', axis=1))

#split by idx
idx = train_samples
X_train, X_test = X[:idx], X[idx:]
y_train, y_test = y[:idx], y[idx:]

# Remove anomalies only in train dataset
idx_anomalies = idxAnomalies(X_train)
X_train = np.delete(X_train, idx_anomalies, axis=0)
y_train = np.delete(y_train, idx_anomalies, axis=0)

print("Shape train: {}".format(X_train.shape))
print("Shape test: {}".format(X_test.shape))
#print(y_train[0:1])
#print(X_train[0:1].tolist())

kf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
print(kf)


/Users/samuel/anaconda/envs/py3/lib/python3.5/site-packages/sklearn/covariance/robust_covariance.py:622: UserWarning: The covariance matrix associated to your dataset is not full rank
  warnings.warn("The covariance matrix associated to your dataset "
Shape train: (846, 13)
Shape test: (418, 13)
StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)

rf = RandomForestClassifier(random_state=42)
rf_param_grid = {"max_depth": [3, 5, 8],
              "min_samples_split": [3, 10, 20],
              "min_samples_leaf": [3, 10, 20],
              "n_estimators" :[10, 20, 30, 40, 50]}

gsrf = GridSearchCV(rf, param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)

gsrf.fit(X_train, y_train)

# Best score
gsrf.best_score_


Fitting 10 folds for each of 135 candidates, totalling 1350 fits
[Parallel(n_jobs=4)]: Done 212 tasks      | elapsed:    5.1s
[Parallel(n_jobs=4)]: Done 1112 tasks      | elapsed:   26.8s
[Parallel(n_jobs=4)]: Done 1350 out of 1350 | elapsed:   32.6s finished
Out[35]:
0.84042553191489366

In [36]:
brf = gsrf.best_estimator_
gsrf.best_params_


Out[36]:
{'max_depth': 8,
 'min_samples_leaf': 3,
 'min_samples_split': 20,
 'n_estimators': 10}

In [63]:
from sklearn.svm import SVC

svc = SVC(random_state=42, probability=True)

Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gammas = [0.0001, 0.001, 0.01, 0.1, 1]
    
svc_param_grid = [{'kernel': ['rbf'], 'gamma': gammas, 'C': Cs}]

gssvc = GridSearchCV(svc, param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)

gssvc.fit(X_train, y_train)

# Best score
gssvc.best_score_


Fitting 10 folds for each of 35 candidates, totalling 350 fits
[Parallel(n_jobs=4)]: Done  76 tasks      | elapsed:    3.7s
[Parallel(n_jobs=4)]: Done 350 out of 350 | elapsed:   19.5s finished
Out[63]:
0.82624113475177308

In [64]:
bsvc = gssvc.best_estimator_
gssvc.best_params_


Out[64]:
{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

Voting Ensemble

An ensemble method is configured with the two tuned classifiers, both with the same weight (soft voting). In earlier runs the accuracy was higher when the random forest classifier was used alone, so the ensemble is compared against the individual models below.
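For reference, with voting='soft' and equal weights the ensemble averages the members' predicted class probabilities and picks the most probable class; a conceptual sketch:

#Conceptual equivalent of soft voting with equal weights
def soft_vote(models, X):
    probas = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probas, axis=1)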


In [65]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

eclf = VotingClassifier(estimators=[('rf', brf), ('svc', bsvc)], voting='soft')

for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train), start=1):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]
    
    eclf.fit(X_t, y_t)
    scores = cross_val_score(eclf, X_v, y_v)
    print("Scores({}): {}".format(fold, scores))


Scores(1): [ 0.8         0.78947368  0.80645161]
Scores(2): [ 0.78947368  0.88297872  0.78494624]
Scores(3): [ 0.76842105  0.8172043   0.76344086]

Learning Curve


In [66]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

#
plot_learning_curve(eclf, "Voting Ensemble", X_train, y_train, cv=kfold, n_jobs=4)
plot_learning_curve(brf, "Random Forest", X_train, y_train, cv=kfold, n_jobs=4)
plot_learning_curve(bsvc, "SVC", X_train, y_train, cv=kfold, n_jobs=4)


Out[66]:
<module 'matplotlib.pyplot' from '/Users/samuel/anaconda/envs/py3/lib/python3.5/site-packages/matplotlib/pyplot.py'>

Post-Analysis

Finally, we print the confusion matrix and the classification report to see the outcome. In the matrix, rows are the true labels and columns the predictions. Note that these metrics are computed on the training set, so they are optimistic.


In [71]:
from sklearn.metrics import classification_report,confusion_matrix

predictions = eclf.predict(X_train)
cm = confusion_matrix(y_train, predictions)
print(cm)
plt.matshow(cm)
plt.colorbar()
ax = plt.gca()
ax.set_xlabel('Predicted')
ax.set_ylabel('True')

plt.show()


[[510  25]
 [110 201]]

In [72]:
print(classification_report(y_train, predictions))


             precision    recall  f1-score   support

        0.0       0.82      0.95      0.88       535
        1.0       0.89      0.65      0.75       311

avg / total       0.85      0.84      0.83       846

The ROC curve and its AUC let us evaluate whether the model is a good classifier.


In [73]:
from sklearn.metrics import roc_curve, auc

# calculate the fpr and tpr for all thresholds of the classification
def plot_roc_curve(y_test, preds):
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'best')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([-0.01, 1.01])
    plt.ylim([-0.01, 1.01])
    plt.title('Receiver Operating Characteristic')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [74]:
plot_roc_curve(y_train, predictions)
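Note the curve above is built from hard 0/1 predictions, which gives a single operating point; roc_curve is usually fed probability scores, which trace the full curve. A sketch, assuming the fitted ensemble:

#ROC from class-1 probabilities instead of hard labels
probs = eclf.predict_proba(X_train)[:, 1]
plot_roc_curve(y_train, probs)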


Get Predictions

*Note: The following code is not related to the calculations; it composes the CSV file required by the Kaggle competition.


In [43]:
import os

predictions = gsrf.best_estimator_.predict(X_test)

passengerId = 892
file = "PassengerId,Survived" + os.linesep

for i in range(len(X_test)):
    file += "{},{}".format(passengerId, int(predictions[i])) + os.linesep
    passengerId += 1

In [44]:
# Save to file
with open('attempt.txt', 'w') as f:
    f.write(file)
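
The same file can be produced more directly with pandas; a sketch assuming the same predictions array (the 'attempt.csv' filename is illustrative):

#Equivalent submission file via pandas
submission = pd.DataFrame()
submission['PassengerId'] = np.arange(892, 892 + len(predictions))
submission['Survived'] = predictions.astype(int)
submission.to_csv('attempt.csv', index=False)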