Titanic Dataset

Titanic & Python

Throughout this practice we will analyze the Titanic dataset. The [Kaggle website](https://www.kaggle.com/c/titanic) has more information about the different variables contained in the dataset.

The objective of the practice will be to answer the question: What factors influenced the survival of passengers?

The practice will consist of the following phases:

1. Getting started with the dataset
2. Exploring the dataset and generating statistics
3. Presentation of results

If we have time for more, we will build a model based on our findings to predict survival, and we will compare it with a decision tree trained on the dataset.

IMPORTANT: This document is written using markdown notation. Here is a good [manual](https://daringfireball.net/projects/markdown/syntax).

Getting Started with the dataset:

The purpose of this section is to familiarize ourselves with Pandas and be able to manipulate the dataset to our interest.

The first thing we have to do is load our data, which in this case is in `csv` format, into a dataset. [Pandas manages multiple inputs / outputs](http://pandas.pydata.org/pandas-docs/stable/io.html), so we have to use the specific one for this case: [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv), where we define the separator and the path of the file (none of the other parameters need to be touched in this case):



In [1]:

    
import pandas

input_file = 'static/titanic_data.csv'
separador = ","
dataset = pandas.read_csv(filepath_or_buffer=input_file, sep=separador)

We now have the whole dataset loaded into memory in our dataset variable. To start exploring it we will use some Pandas attributes and methods to answer the following questions:

* How many rows and columns does the dataset contain?
* What are the names of the columns in the dataset?
* What information does the first row of the dataset contain?

To answer these questions we will use .shape, .columns, .head() and .tail():



In [2]:

    
print "The number of rows and columns in the dataset is: ",dataset.shape
print "\nThe column names are: \n",dataset.columns
print "\nThe first row of the dataset is: \n",dataset.head(1)
print "\nThe last row of the dataset is: \n",dataset.tail(1)



The number of rows and columns in the dataset is:  (1045, 12)

The column names are: 
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

The first row of the dataset is: 
   PassengerId  Survived  Pclass                     Name   Sex   Age  SibSp  \
0            1         0       3  Braund, Mr. Owen Harris  male  22.0      1   

   Parch     Ticket  Fare Cabin Embarked  
0      0  A/5 21171  7.25   NaN        S  

The last row of the dataset is: 
      PassengerId  Survived  Pclass                          Name   Sex   Age  \
1044         1307         0       3  Saether, Mr. Simon Sivertsen  male  38.5   

      SibSp  Parch              Ticket  Fare Cabin Embarked  
1044      0      0  SOTON/O.Q. 3101262  7.25   NaN        S

Below is a description of the different columns obtained from the Kaggle website



In [3]:

    
from IPython.display import Image
Image(filename='static/DescripcionVariablesKaggle.png')









    Out[3]:

Now we will learn how to handle our dataset. For this we will use different techniques that are described in detail in the Pandas documentation. To learn these techniques we will first use another example dataframe:



In [4]:

    
import numpy as np

df1 = pandas.DataFrame(np.random.randn(6,4),index=list(range(0,12,2)),columns=list(range(10,18,2)))

As you can see, I have generated a matrix of random values with np.random.randn(6,4), defined an index with list(range(0,12,2)), which is the list of even values between 0 and 10, and defined the columns with list(range(10,18,2)), which is the list of even values between 10 and 16:



In [5]:

    
print df1









    



          10        12        14        16
0   1.394064  0.826406  0.429552  1.213356
2  -0.942928  1.140223  0.295492 -0.876468
4   0.313370  0.645356  0.417849  0.740132
6  -1.477904  0.406932  0.072277  0.617387
8  -0.076441  0.819743 -0.453792  0.632975
10  0.909475 -0.803407  1.129866  2.113239

In this first part we will use .loc and .iloc to select subsets of our dataset. Both select rows and columns, but while .loc refers to labels (the values of the index), .iloc refers to integer positions:



In [6]:

    
# Select the row at position 2 (remember that in Python positions start at zero!)
print df1.iloc[2]









    



10    0.313370
12    0.645356
14    0.417849
16    0.740132
Name: 4, dtype: float64

As we can see, the third row of the dataframe is returned (position 0 corresponds to the first row).



In [7]:

    
# Select the row whose index label is 2
print df1.loc[2]









    



10   -0.942928
12    1.140223
14    0.295492
16   -0.876468
Name: 2, dtype: float64

In this case, the row whose index is equal to 2 is returned.
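To make the difference between the two selectors easier to see, here is a minimal sketch on a small deterministic frame. The frame `demo` and its values are hypothetical, and the sketch is written for Python 3, unlike the Python 2 cells of this practice.

```python
import pandas as pd

# Hypothetical frame: the row labels (0, 2, 4, 6) deliberately differ from the row positions
demo = pd.DataFrame(
    {"a": [10, 11, 12, 13], "b": [20, 21, 22, 23]},
    index=[0, 2, 4, 6],
)

by_label = demo.loc[2]      # row whose index label is 2 (the second row)
by_position = demo.iloc[2]  # row at position 2 (the third row, whose label is 4)

print(by_label["a"], by_position["a"])
```

When the index is the default 0, 1, 2, ... both selectors return the same row for a single integer, which is why the distinction is easy to miss.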

We can make more complex selections including a range of indexes and certain columns, for example:



In [8]:

    
# Select the rows whose index labels are 2 and 8 and the columns 12 and 16
print df1.loc[[2,8],[12,16]]









    



         12        16
2  1.140223 -0.876468
8  0.819743  0.632975



In [9]:

    
# Select all rows up to the one labelled 4 and the columns up to the one labelled 12
print df1.loc[:4,:12]









    



         10        12
0  1.394064  0.826406
2 -0.942928  1.140223
4  0.313370  0.645356



In [10]:

    
# Select the first 4 rows and the first three columns
print df1.iloc[:4,:3]









    



         10        12        14
0  1.394064  0.826406  0.429552
2 -0.942928  1.140223  0.295492
4  0.313370  0.645356  0.417849
6 -1.477904  0.406932  0.072277
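One detail of the two slices above is worth spelling out: a label slice with .loc includes its end label, while a position slice with .iloc excludes its end position. A minimal sketch, using a hypothetical frame `sliced_demo` and Python 3 syntax:

```python
import pandas as pd

# Hypothetical frame with the same even index labels as df1
sliced_demo = pd.DataFrame({"a": range(6)}, index=[0, 2, 4, 6, 8, 10])

label_slice = sliced_demo.loc[:4]      # labels 0, 2 and 4: the end label is included
position_slice = sliced_demo.iloc[:4]  # positions 0 to 3: the end position is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```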

Now that the use of .loc and .iloc is clear (hopefully!), we will operate on our dataset. For example, we will select the columns Name, Sex and Age and the rows with index 3 to 5 (remember that in our dataset the first row has index 0, so these are the fourth to sixth rows):



In [11]:

    
print "Rows 3 to 5 and columns Name, Sex, Age\n", dataset.loc[[3,4,5],['Name','Sex','Age']]



Rows 3 to 5 and columns Name, Sex, Age
                                           Name     Sex   Age
3  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0
4                      Allen, Mr. William Henry    male  35.0
5                       McCarthy, Mr. Timothy J    male  54.0

However, I now want to select the first 4 rows of the dataset and the third and fourth columns:



In [12]:

    
print "First 4 rows and third and fourth columns\n", dataset.iloc[:4,[2,3]]



First 4 rows and third and fourth columns
   Pclass                                               Name
0       3                            Braund, Mr. Owen Harris
1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2       3                             Heikkinen, Miss. Laina
3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)

Perfect! We already know how to select rows and columns. Now the goal is to make modifications. For example, let's change the age of Allen, Mr. William Henry (currently 35 years old) to 260. To do this we have to select the row in which the column 'Name' is equal to 'Allen, Mr. William Henry', in the following way:



In [13]:

    
print "Age of Allen, Mr. William Henry:\n ", dataset.loc[dataset['Name'].isin(['Allen, Mr. William Henry']),'Age']



Age of Allen, Mr. William Henry:
  4    35.0
Name: Age, dtype: float64

To modify this value, we will have to assign the one we want (260):



In [14]:

    
# Assign a new age to poor Allen:
dataset.loc[dataset['Name']=='Allen, Mr. William Henry','Age'] = 260

What is Allen's age now?



In [15]:

    
print "The new age of Allen, Mr. William Henry:\n ", dataset.loc[dataset['Name']=='Allen, Mr. William Henry','Age']



The new age of Allen, Mr. William Henry:
  4    260.0
Name: Age, dtype: float64

As we can see, his age has been edited! Next we set it to 54; note that the value printed before the edit was 35, so from here on the dataset holds 54 for this passenger. By now we know how to select subsets of data and edit them when we need to.



In [16]:

    
# Set Allen's age to 54:
dataset.loc[dataset['Name']=='Allen, Mr. William Henry','Age'] = 54
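The select, edit and restore pattern used above can be sketched on a tiny frame. The frame `people` and its values are hypothetical and the code is written for Python 3. Saving the value before overwriting it is one simple way to make sure the restore step puts back exactly what was there.

```python
import pandas as pd

# Hypothetical data, not taken from the Titanic dataset
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Age": [31.0, 45.0, 27.0]})

mask = people["Name"] == "Bob"                  # boolean Series, True only for matching rows
original_age = people.loc[mask, "Age"].iloc[0]  # remember the value before editing it

people.loc[mask, "Age"] = 260                   # assigning through .loc modifies the frame itself
edited_age = people.loc[mask, "Age"].iloc[0]

people.loc[mask, "Age"] = original_age          # restore the saved value
print(edited_age, people.loc[mask, "Age"].iloc[0])
```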

Exploring the dataset and generating statistics

What do we know about our data so far? Practically nothing! We will use a number of functions included in Pandas to easily obtain valuable information from it.

The first thing to find out when you start working with a dataset is how many NaN values it contains. For this we will use the Pandas method .info():



In [17]:

    
# Get information about the number of NaN values in the dataset:
print dataset.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1045 entries, 0 to 1044
Data columns (total 12 columns):
PassengerId    1045 non-null int64
Survived       1045 non-null int64
Pclass         1045 non-null int64
Name           1045 non-null object
Sex            1045 non-null object
Age            1045 non-null float64
SibSp          1045 non-null int64
Parch          1045 non-null int64
Ticket         1045 non-null object
Fare           1045 non-null float64
Cabin          272 non-null object
Embarked       1043 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 98.0+ KB
None

Which columns include null values? Cabin has only 272 non-null values (773 are missing) and Embarked has 2 missing values.

As we can see, the dataset is quite complete except for these 2 columns. For your reference, Pandas includes valuable documentation on handling NaN values. We will not handle these values here, but keep this information in mind!
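Instead of subtracting the non-null counts of .info() from the number of rows by hand, the missing values can be counted directly. A minimal sketch in Python 3, using a hypothetical frame `toy_missing` in place of the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few holes in it
toy_missing = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# isnull() marks every missing cell with True; summing the booleans counts them per column
missing_per_column = toy_missing.isnull().sum()
print(missing_per_column)
```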

Well, we already know how the holes in the dataset are distributed, but we return to the previous question: What do we know about the information contained in the dataset? Still practically nothing!

OK, let's start looking at the dataset so that we can answer our main question: What factors influenced the survival of the Titanic passengers?

Pandas includes the function .describe(), which returns basic statistics about the different columns of the dataset:



In [18]:

    
print "Basic statistics of the dataset: \n", dataset.describe()



Basic statistics of the dataset: 
       PassengerId     Survived       Pclass          Age        SibSp  \
count  1045.000000  1045.000000  1045.000000  1045.000000  1045.000000   
mean    654.990431     0.399043     2.206699    29.870019     0.503349   
std     377.650551     0.489936     0.841542    14.407698     0.912471   
min       1.000000     0.000000     1.000000     0.170000     0.000000   
25%     326.000000     0.000000     1.000000    21.000000     0.000000   
50%     662.000000     0.000000     2.000000    28.000000     0.000000   
75%     973.000000     1.000000     3.000000    39.000000     1.000000   
max    1307.000000     1.000000     3.000000    80.000000     8.000000   

             Parch         Fare  
count  1045.000000  1045.000000  
mean      0.421053    36.686080  
std       0.840052    55.732533  
min       0.000000     0.000000  
25%       0.000000     8.050000  
50%       0.000000    15.750000  
75%       1.000000    35.500000  
max       6.000000   512.329200

What information does `.describe()` include for each column?

* Number of values (count)
* Mean (mean)
* Standard deviation (std)
* Minimum value (min)
* Maximum value (max)
* First quartile (25%)
* Median (50%)
* Third quartile (75%)

Here we already have useful information! We can answer the following questions:

* What percentage of passengers survived? 39.9%
* What was the average age? 29.9
* What was the most expensive ticket (Fare) paid, and the cheapest? 512.3 and 0.0
* How many classes (Pclass) does the boat include?

Isn't a column missing? 'Sex' is not included in this analysis. Why? If you look at the output of .info(), only the numeric columns (float64 or int64) have been included here. Let's take a look at 'Sex' to understand its format:



In [19]:

    
print dataset.Sex.head()









    



0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object

As we can see, "male" and "female" are the values included. Let's check that these are the only values in the entire dataset:



In [20]:

    
# Print all the distinct values of Sex in the dataset:
print dataset.Sex.unique()









    



['male' 'female']

Since this format prevents us from working with the `.describe()` function, we are going to transform this column into values that are more useful to us. To do this we will define a Python function that assigns 1 to 'female' and 0 to 'male', and then use the Pandas .apply method to modify these values in the dataset:



In [21]:

    
# Define a function that makes the change:
def gender_number(gender):
    if gender == 'male': return 0.
    return 1.

# Apply the change to every row with .apply:
dataset.Sex = dataset.Sex.apply(gender_number)

What does 'Sex' look like now?



In [22]:

    
# Print all the distinct values of Sex in the dataset:
print dataset.Sex.unique()









    



[ 0.  1.]
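The same recoding can also be written with a lookup table instead of a function. The sketch below is an alternative under assumptions, not the method used in this practice: the Series `sex` and the dictionary `codes` are hypothetical, and the code is Python 3.

```python
import pandas as pd

# Hypothetical values standing in for the 'Sex' column
sex = pd.Series(["male", "female", "female", "male"])

# Lookup table from label to number
codes = {"male": 0.0, "female": 1.0}
encoded = sex.map(codes)

# Unlike gender_number, which returns 1. for anything that is not 'male',
# map leaves values that are missing from the table as NaN
unexpected = pd.Series(["male", "unknown"]).map(codes)
print(encoded.tolist(), unexpected.isnull().tolist())
```

The NaN behaviour can be useful as a check: an unexpected label shows up as a hole instead of being silently counted as 'female'.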

Well, let's run .describe() again on the 'Sex' column.



In [23]:

    
# Get statistics of 'Sex'
print dataset.Sex.describe()









    



count    1045.000000
mean        0.371292
std         0.483382
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Sex, dtype: float64

Hmm, does this help us understand how the sexes are distributed? Not very clearly. Let's try something new: Pandas aggregations! As an example, I will use an aggregation to check how many men and women were on the boat:



In [23]:

    
# Get the count of men and women on the boat:
print dataset.groupby('Sex').count().iloc[:,1]









    



Sex
0.0    657
1.0    388
Name: Survived, dtype: int64

How many men and women were there? 388 women and 657 men
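For a simple tally of one column, value_counts is a shorter alternative to grouping and counting. A minimal Python 3 sketch on a hypothetical Series `sex_codes` (0 for men, 1 for women, as in the cells above):

```python
import pandas as pd

# Hypothetical values, not the real passenger list
sex_codes = pd.Series([0.0, 1.0, 0.0, 0.0, 1.0], name="Sex")

counts = sex_codes.value_counts()                # number of rows per distinct value
shares = sex_codes.value_counts(normalize=True)  # same tally expressed as fractions of the total

print(counts)
print(shares)
```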

Let's do one more thing. To understand how the passengers are distributed by age, we will create a new column called 'Age_ranges' that holds the age range each passenger belongs to. The objective is to find out how many passengers there are in the ranges [0-10), [10-20), and so on.



In [24]:

    
# Function that generates the age ranges:
def age_ranges(x):
    return '[{0}-{1})'.format(10*int(x/10), 10 +10*int(x/10))

# Create the new column with the age range of each passenger:
dataset['Age_ranges'] = dataset.Age.apply(age_ranges)

# Check that things work as expected:
print dataset[['Age_ranges','Age']].head(10)









    



  Age_ranges   Age
0    [20-30)  22.0
1    [30-40)  38.0
2    [20-30)  26.0
3    [30-40)  35.0
4    [50-60)  54.0
5    [50-60)  54.0
6     [0-10)   2.0
7    [20-30)  27.0
8    [10-20)  14.0
9     [0-10)   4.0

How are the passengers distributed by age ranges on the boat?



In [25]:

    
# Print the distribution of age ranges on the boat:
print dataset.groupby('Age_ranges').count().iloc[:,1]









    



Age_ranges
[0-10)      82
[10-20)    143
[20-30)    344
[30-40)    231
[40-50)    135
[50-60)     71
[60-70)     31
[70-80)      7
[80-90)      1
Name: Survived, dtype: int64

How many children under 10 are included? 82
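The same kind of age-range labels can also be produced with pandas.cut, which bins a numeric column in one call. This is a sketch of an alternative, not the method used above: the Series `ages`, the bin edges and the label format are assumptions chosen to mimic age_ranges, and the code is Python 3.

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bin boundary
ages = pd.Series([0.17, 9.9, 10.0, 22.0, 54.0, 80.0])

edges = list(range(0, 100, 10))  # 0, 10, ..., 90 gives nine bins
labels = ["[{0}-{1})".format(lo, lo + 10) for lo in edges[:-1]]

# right=False makes every bin closed on the left and open on the right, e.g. [10-20)
binned = pd.cut(ages, bins=edges, right=False, labels=labels)
print(binned.tolist())
```

The result is a categorical column, so the ranges keep their numeric order when grouped, and ages outside the edges become NaN instead of getting a label.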

And now ... Who knows how the passengers are distributed by age and sex?



In [26]:

    
# Print the distribution of age ranges on the boat by sex:
print dataset.groupby(['Sex','Age_ranges']).count().iloc[:,1]









    



Sex  Age_ranges
0.0  [0-10)         43
     [10-20)        79
     [20-30)       229
     [30-40)       145
     [40-50)        89
     [50-60)        44
     [60-70)        21
     [70-80)         6
     [80-90)         1
1.0  [0-10)         39
     [10-20)        64
     [20-30)       115
     [30-40)        86
     [40-50)        46
     [50-60)        27
     [60-70)        10
     [70-80)         1
Name: Survived, dtype: int64

I think that by this point we should have no trouble identifying the factors that influenced the survival of passengers on the Titanic. For example, how are survivors distributed by sex, age, class, ticket price, etc.?



In [27]:

    
# Distribution of survivors by age range:
print dataset.groupby(['Survived','Age_ranges']).count().iloc[:,1]
# Distribution of survivors by sex:
print dataset.groupby(['Survived','Sex']).count().iloc[:,1]
# Distribution of survivors by class:
print dataset.groupby(['Survived','Pclass']).count().iloc[:,1]









    



Survived  Age_ranges
0         [0-10)         35
          [10-20)        83
          [20-30)       224
          [30-40)       132
          [40-50)        87
          [50-60)        42
          [60-70)        19
          [70-80)         6
1         [0-10)         47
          [10-20)        60
          [20-30)       120
          [30-40)        99
          [40-50)        48
          [50-60)        29
          [60-70)        12
          [70-80)         1
          [80-90)         1
Name: Pclass, dtype: int64
Survived  Sex
0         0.0    564
          1.0     64
1         0.0     93
          1.0    324
Name: Pclass, dtype: int64
Survived  Pclass
0         1         114
          2         149
          3         365
1         1         170
          2         112
          3         135
Name: Name, dtype: int64

Well, we already have the number of passengers that survived by age, sex and class, but the ideal (and expected) is to have the percentage of survival by age, sex, class, etc.

For example, the percentage of survival by age can be calculated as follows:

NOTE: A surviving passenger has a value of 1 and any other passenger a value of 0, so the mean of 'Survived' is equal to the survival rate. For example, if 5 passengers out of 10 survive, the mean is 0.5.



In [28]:

    
# Survival rate by age range:
print dataset.groupby('Age_ranges').mean().loc[:,'Survived']









    



Age_ranges
[0-10)     0.573171
[10-20)    0.419580
[20-30)    0.348837
[30-40)    0.428571
[40-50)    0.355556
[50-60)    0.408451
[60-70)    0.387097
[70-80)    0.142857
[80-90)    1.000000
Name: Survived, dtype: float64
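The reason the mean of a 0/1 column gives the survival rate can be checked on a few rows. In this minimal Python 3 sketch the frame `toy_groups` is hypothetical, and the column is selected before aggregating so that only 'Survived' is averaged.

```python
import pandas as pd

# Hypothetical frame: group "a" has 3 survivors out of 4, group "b" has 0 out of 2
toy_groups = pd.DataFrame({
    "Group": ["a", "a", "a", "a", "b", "b"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# The mean of the 0/1 values is survivors divided by passengers, i.e. the survival rate
rate = toy_groups.groupby("Group")["Survived"].mean()

# The same information as explicit counts: the number of survivors and the group size
tally = toy_groups.groupby("Group")["Survived"].agg(["sum", "count"])
print(rate)
print(tally)
```

Looking at the counts next to the rates also shows how many passengers each rate is based on, which matters for very small groups.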

OK, what are the survival rates by sex, by class, by sex and age, and by any other combination you can think of?



In [29]:

    
# Survival rate by Sex:
print dataset.groupby('Sex').mean().loc[:,'Survived']
# Survival rate by Pclass:
print dataset.groupby('Pclass').mean().loc[:,'Survived']
# Survival rate by Sex and Age_ranges:
print dataset.groupby(['Sex','Age_ranges']).mean().loc[:,'Survived']
# Survival rate by Sex and Pclass:
print dataset.groupby(['Sex','Pclass']).mean().loc[:,'Survived']
# Survival rate by Pclass and Age_ranges:
print dataset.groupby(['Pclass','Age_ranges']).mean().loc[:,'Survived']









    



Sex
0.0    0.141553
1.0    0.835052
Name: Survived, dtype: float64
Pclass
1    0.598592
2    0.429119
3    0.270000
Name: Survived, dtype: float64
Sex  Age_ranges
0.0  [0-10)        0.441860
     [10-20)       0.088608
     [20-30)       0.109170
     [30-40)       0.158621
     [40-50)       0.134831
     [50-60)       0.090909
     [60-70)       0.095238
     [70-80)       0.000000
     [80-90)       1.000000
1.0  [0-10)        0.717949
     [10-20)       0.828125
     [20-30)       0.826087
     [30-40)       0.883721
     [40-50)       0.782609
     [50-60)       0.925926
     [60-70)       1.000000
     [70-80)       1.000000
Name: Survived, dtype: float64
Sex  Pclass
0.0  1         0.264901
     2         0.094937
     3         0.109195
1.0  1         0.977444
     2         0.941748
     3         0.638158
Name: Survived, dtype: float64
Pclass  Age_ranges
1       [0-10)        0.500000
        [10-20)       0.772727
        [20-30)       0.673077
        [30-40)       0.694444
        [40-50)       0.500000
        [50-60)       0.521739
        [60-70)       0.428571
        [70-80)       0.250000
        [80-90)       1.000000
2       [0-10)        0.909091
        [10-20)       0.482759
        [20-30)       0.388889
        [30-40)       0.375000
        [40-50)       0.387097
        [50-60)       0.294118
        [60-70)       0.285714
        [70-80)       0.000000
3       [0-10)        0.446429
        [10-20)       0.315217
        [20-30)       0.247525
        [30-40)       0.263158
        [40-50)       0.119048
        [50-60)       0.000000
        [60-70)       0.333333
        [70-80)       0.000000
Name: Survived, dtype: float64

Now we can answer: what factors mainly influenced the survival of the passengers?

Gender
Class
Age

Results presentation

When working in data science, it is very important not only to analyze the data with which you work and to obtain conclusions, but also to communicate effectively and clearly the results to the other participants so that they can understand the conclusions and the work done. For this reason, we are going to generate a series of graphs that serve to represent in a clear and forceful way the results of our analysis.

The first type of graphs that we are going to generate will be to describe the variables that the dataset includes, for example to represent how the passengers are distributed by classes, by sex, by age ....

Pandas and Matplotlib allow you to generate graphs from dataframes with which we have been working.

For example, below is a representation with boxplots, which shows information similar to what the .describe() function offers:



In [30]:

    
# Define a function that reverses the earlier change:
def numbertogender(gender):
    if gender == 0.: return 'male'
    return 'female'

# Apply the change to every row with .apply:
dataset.Sex = dataset.Sex.apply(numbertogender)



In [31]:

    
import matplotlib.pyplot as plt
%matplotlib inline

# Select the style of the graphs:
plt.style.use('ggplot')

# Generate the boxplot:
dataset.boxplot(return_type='axes')

# Show the graph:
plt.show()

It does not look like much, and besides, the PassengerId column stretches the y-axis and squashes the other boxes. Let's improve the graph:



In [32]:

    
# Generate the boxplot without the columns 'PassengerId','Survived','Pclass','SibSp','Parch':
dataset.loc[:,~dataset.columns.isin(['PassengerId','Survived','Pclass','SibSp','Parch'])].boxplot(return_type='axes')

# Add title and labels:
plt.title('Variables of the Titanic dataset')
plt.xlabel('Variables')
plt.ylim([0,300])

# Show the graph:
plt.show()

What conclusions can we draw from this representation?

Median (red)
First and third quartiles
Outliers

How do you generate a histogram of any variable? Here's how the histogram of the Age column is generated:



In [33]:

    
# Generate the histogram of Age:
dataset.loc[:,'Age'].hist(bins=15)

# Add title and labels:
plt.title('Age distribution on the Titanic')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Show the graph:
plt.show()

Could you generate a histogram for the Fare Column?



In [34]:

    
# Generate the histogram of Fare:
dataset.loc[:,'Fare'].hist(bins=15)

# Add title and labels:
plt.title('Ticket price distribution on the Titanic')
plt.xlabel('Price')
plt.ylabel('Frequency')

# Show the graph:
plt.show()

Could we see how the passengers are distributed by age and by another variable (gender, survival, class ....)? Check out the documentation!



In [35]:

    
# Generate the histogram of Age by Sex:
dataset.loc[:,['Age','Sex']].hist(bins=15,by='Sex',sharey=True)

# Show the graph:
plt.show()

Another option is:



In [36]:

    
# Generate the histogram of Age by Sex:
dataset.loc[dataset.Sex=='male','Age'].hist(bins=15,color='blue',alpha=0.5,label='male')
dataset.loc[dataset.Sex=='female','Age'].hist(bins=15,color='red',alpha=0.5,label='female')

# Add title and labels:
plt.title('Age distribution by sex on the Titanic')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()

# Show the graph:
plt.show()

How are passengers distributed by age and by class? (Select [colors](http://html-color-codes.info/) for each of the classes.)



In [37]:

    
# Generate the histogram of Age by class:
dataset.loc[dataset.Pclass==3,'Age'].hist(bins=15,color='#40FF00',alpha=0.5,label='Third')
dataset.loc[dataset.Pclass==2,'Age'].hist(bins=15,color='#00FFFF',alpha=0.5,label='Second')
dataset.loc[dataset.Pclass==1,'Age'].hist(bins=15,color='#610B5E',alpha=0.5,label='First')

# Add title and labels:
plt.title('Age distribution by class on the Titanic')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()

# Show the graph:
plt.show()
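When several histograms are overlaid as above, each call with bins=15 chooses its own bin edges from the subset it receives, so the bars of different groups need not line up. One possible refinement is to pass the same edges to every group. The sketch below is written for Python 3 on a hypothetical frame `toy_ages`; the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical ages for two classes
toy_ages = pd.DataFrame({
    "Age": [2, 25, 31, 47, 8, 22, 36, 64],
    "Pclass": [3, 3, 3, 3, 1, 1, 1, 1],
})

# Six edges spanning the whole column give the same five bins to every group
shared_edges = np.linspace(toy_ages["Age"].min(), toy_ages["Age"].max(), 6)

fig, ax = plt.subplots()
counts_by_class = {}
for pclass, group in toy_ages.groupby("Pclass"):
    counts, _, _ = ax.hist(group["Age"], bins=shared_edges, alpha=0.5,
                           label="Class {0}".format(pclass))
    counts_by_class[pclass] = counts  # bar heights for this class
ax.legend()
plt.close(fig)
```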

Now let's start generating graphs to show how survivors are distributed by Sex, Class and Age. Earlier we obtained this information by making aggregations over the dataset. Here we will repeat that process and plot the result. For example, the following cell shows how passengers are distributed by sex and the survival rate associated with each sex:



In [38]:

    
# Create a figure with two subplots:
f, axarr = plt.subplots(1,2,figsize=(10,7))
f.suptitle('Distribution and survival probability\n of passengers by sex', fontsize=14, fontweight='bold')

# The left plot shows the number of passengers by sex
dataset.groupby('Sex').count().iloc[:,1].plot.pie(ax=axarr[0],colors=['red','blue'],autopct='%1.1f%%')
axarr[0].set_ylabel('Distribution of passengers by sex')

# The right plot shows the survival probability by sex
dataset.groupby('Sex').mean().iloc[:,1].plot.bar(ax=axarr[1],color=['red','blue'],alpha=0.5)
axarr[1].set_ylabel('Survival probability')
axarr[1].set_ylim([0,1])

plt.show()

Could we do the same with 'Age_ranges' or 'Pclass'?



In [39]:

    
# Create a figure with two subplots:
f, axarr = plt.subplots(1,2,figsize=(10,7))
f.suptitle('Distribution and survival probability\n of passengers by class', fontsize=14, fontweight='bold')

# The left plot shows the number of passengers by class
dataset.groupby('Pclass').count().iloc[:,1].plot.pie(ax=axarr[0],labels=['First','Second','Third']\
                                                     ,autopct='%1.1f%%',colors=['#610B5E','#00FFFF','#40FF00'])
axarr[0].set_ylabel('Distribution of passengers by class')

# The right plot shows the survival probability by class
dataset.groupby('Pclass').mean().iloc[:,1].plot.bar(ax=axarr[1],color=['#610B5E','#00FFFF','#40FF00']\
                                                    ,alpha=0.5)
axarr[1].set_ylabel('Survival probability')
axarr[1].set_ylim([0,1])

plt.show()



In [40]:

    
# Create a figure with two subplots:
f, axarr = plt.subplots(1,2,figsize=(10,7))
f.suptitle('Distribution and survival probability\n of passengers by age range', fontsize=14, fontweight='bold')

# The left plot shows the number of passengers by age range
dataset.groupby('Age_ranges').count().iloc[:,1].plot.pie(ax=axarr[0],autopct='%1.1f%%',colors=['#610B5E','#00FFFF','#40FF00'])
axarr[0].set_ylabel('Distribution of passengers by age range')

# The right plot shows the survival probability by age range
dataset.groupby('Age_ranges').mean().iloc[:,1].plot.bar(ax=axarr[1],alpha=0.5)
axarr[1].set_ylabel('Survival probability')
axarr[1].set_ylim([0,1])

f.tight_layout()
plt.show()

Finally, we will check how the probability of survival is distributed by Sex and by Class. For this we will use the pivot function:



In [41]:

    
# This is the dataframe that we are going to plot:
print dataset.groupby(['Sex','Pclass']).mean().reset_index().pivot(index='Pclass',columns='Sex',values='Survived')









    



Sex       female      male
Pclass                    
1       0.977444  0.264901
2       0.941748  0.094937
3       0.638158  0.109195
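A table with the same shape (one row per class, one column per sex) can also be obtained by unstacking one level of the grouped result, without reset_index and pivot. This Python 3 sketch is an alternative under assumptions; the frame `toy_survival` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical passengers
toy_survival = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Mean of Survived per (Pclass, Sex) pair, then move the Sex level into the columns
table = toy_survival.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(table)
```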



In [42]:

    
dataset.groupby(['Sex','Pclass']).mean().reset_index().pivot(
    index='Pclass',columns='Sex',values='Survived').plot.bar(color=['red','blue'],alpha=0.5)

# Add title and labels:
plt.title('Survival probability\n by sex and class on the Titanic')
plt.xlabel('Class')
plt.ylabel('Survival prob.')
plt.xticks(range(3), ['First', 'Second', 'Third'], color='black')
plt.legend()

plt.show()

Congratulations! At this point we have visually represented the conclusions we reached earlier. We have shown that the probability of survival is mostly determined by Sex and Class and, to a lesser extent, by Age (and it probably also depends on other variables that we have not analyzed in this exercise). Could we create a model to predict whether a passenger survives or not?

Modeling

In this section we are going to work on the creation of a model that serves to determine if a passenger is going to survive or not. The first thing we will do is create our own model based on the conclusions we have drawn, and then compare it with a decision tree that we will train with the dataset.

The first thing we are going to define is the accuracy score that we will use to measure the quality of our results:



In [43]:

    
def accuracy_score(truth, pred):
    """ Returns an accuracy score comparing predicted values (pred) against real ones (truth). """
    
    # Ensure that the number of predictions matches number of outcomes
    if len(truth) == len(pred): 
        
        # Calculate and return the accuracy as a percent
        return "Predicciones tienen un accuracy de {:.2f}%.".format((truth == pred).mean()*100)
    
    else:
        return "El número de predicciones no es igual al numero de valores reales!"
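The function above returns a formatted message, which is convenient to read but awkward to compare between models. A possible variant returns the accuracy as a number instead. The helper `accuracy_fraction` below is hypothetical and written for Python 3. Converting both inputs to arrays means the values are compared by position, regardless of any pandas index they carry.

```python
import numpy as np

def accuracy_fraction(truth, pred):
    """Return the fraction of predictions equal to the true labels (hypothetical helper)."""
    truth = np.asarray(truth)
    pred = np.asarray(pred)
    # Refuse to compare sequences of different length
    if truth.shape != pred.shape:
        raise ValueError("truth and pred must have the same length")
    # The mean of the element-wise comparison is the share of correct predictions
    return float((truth == pred).mean())

score = accuracy_fraction([1, 0, 1, 1], [1, 0, 0, 1])
print("{:.2f}%".format(100 * score))
```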

Next we separate the 'Survived' column from the dataset and keep it as the outcome we want to predict:



In [44]:

    
# 'Survived' will be our label, the value we want to predict:
outcomes = dataset['Survived']
data = dataset.drop(['Survived','Name','Ticket','Cabin','Age_ranges','Embarked','PassengerId'], axis = 1)
data['Sex'] = data['Sex'].apply(lambda x: 1. if x == 'female' else 0.)
# Print the first 5 rows to check the changes
print data.head()









    



   Pclass  Sex   Age  SibSp  Parch     Fare
0       3  0.0  22.0      1      0   7.2500
1       1  1.0  38.0      1      0  71.2833
2       3  1.0  26.0      0      0   7.9250
3       1  1.0  35.0      1      0  53.1000
4       3  0.0  54.0      0      0   8.0500

Could we improve the following model?



In [45]:

    
def modelo(data):
    """ Model that predicts whether each passenger survives. """

    predictions = []
    for _, passenger in data.iterrows():
        # Predict the survival of the passenger (modify these lines as needed)
        if passenger['Sex'] == 1:
            predictions.append(1) # 1 means that the passenger survives
        else:
            if passenger['Age'] >= 80:
                predictions.append(1) # 1 means that the passenger survives
            else:
                predictions.append(0) # 0 means that the passenger does not survive
    
    # Return the predictions
    return pandas.Series(predictions)

# Create the predictions
predictions = modelo(data)

# Get the result:
accuracy_score(outcomes, predictions)









    Out[45]:





'Predicciones tienen un accuracy de 85.07%.'
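The rule in the model above (predict survival for women and for anyone aged 80 or more) can also be written without a loop, as a single boolean expression over the columns. A minimal Python 3 sketch on a hypothetical frame `passengers`:

```python
import pandas as pd

# Hypothetical passengers, with Sex encoded as in this practice (1 = female, 0 = male)
passengers = pd.DataFrame({
    "Sex": [1.0, 0.0, 0.0, 1.0],
    "Age": [30.0, 80.0, 40.0, 5.0],
})

# True where either condition holds, then converted to the 0/1 labels used by 'Survived'
rule_predictions = ((passengers["Sex"] == 1) | (passengers["Age"] >= 80)).astype(int)
print(rule_predictions.tolist())
```

Each additional rule becomes one more condition in the expression, which keeps experiments with the model short.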

On the other hand, we are going to train a decision tree and compare its results with ours:



In [46]:

    
# Create a dataset to train the decision tree (train) and a validation dataset (test)
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(outcomes, 1, test_size=0.3, random_state=450)
for train_index, test_index in sss:
    X_train = data.iloc[train_index]
    y_train = outcomes.iloc[train_index]
    X_test = data.iloc[test_index]
    y_test = outcomes.iloc[test_index]
    
# Define the decision tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_features=3,max_depth=2)

# Train it with the train data
clf = clf.fit(X_train, y_train)

# Create the predictions
predictions = clf.predict(X_test)

# Get the result:
accuracy_score(y_test, predictions)









    



/home/rafaelcastillo/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)






    Out[46]:





'Predicciones tienen un accuracy de 84.71%.'
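The deprecation warning printed above says that the splitting classes moved to the model_selection module and that their interface changed. A sketch of the same stratified 70/30 split with that newer interface could look as follows. It assumes a scikit-learn version that provides sklearn.model_selection, uses synthetic data (`X_demo`, `y_demo`) in place of the Titanic columns, and is written for Python 3.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and the outcome
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    "Sex": rng.randint(0, 2, 200).astype(float),
    "Age": rng.uniform(1, 80, 200),
})
y_demo = (X_demo["Sex"] == 1).astype(int)

# In the newer interface the data is passed to split(), not to the constructor
sss_demo = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=450)
train_idx, test_idx = next(sss_demo.split(X_demo, y_demo))

X_tr, X_te = X_demo.iloc[train_idx], X_demo.iloc[test_idx]
y_tr, y_te = y_demo.iloc[train_idx], y_demo.iloc[test_idx]

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
accuracy = (tree_demo.predict(X_te) == y_te.values).mean()
print(len(train_idx), len(test_idx), accuracy)
```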

Now we are going to represent the decision tree:



In [55]:

    
from sklearn.externals.six import StringIO
import pydot 
print data.columns
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data,  
                         feature_names=data.columns,
                         class_names=['Perished','Survived'], 
                         filled=True, rounded=True,
                         proportion = True,
                         special_characters=True)  
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())









    



Index([u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare'], dtype='object')






    Out[55]:
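If Graphviz and pydot are not available, the rules of a fitted tree can also be printed as plain text. The sketch below assumes a scikit-learn version that provides sklearn.tree.export_text, trains on a few hypothetical rows (Sex and Age only) and is written for Python 3; it is not the tree trained above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows: [Sex, Age], with Sex encoded as 1 = female, 0 = male
X_rows = [[0, 22], [1, 38], [1, 26], [0, 35], [0, 54], [1, 4]]
y_rows = [0, 1, 1, 0, 0, 1]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_rows, y_rows)

# One line per node, indented by depth, using the feature names given here
rules = export_text(small_tree, feature_names=["Sex", "Age"])
print(rules)
```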