[Data, the Humanist's New Best Friend](index.ipynb)
Assignment 1
Titanic: Women and Children First

*Yep, no room for more people, sorry, mate*

From Wikipedia, "the passengers of the RMS Titanic were among the estimated 2,223 people who sailed on the maiden voyage of the second of the White Star Line's Olympic class ocean liners, from Southampton to New York City. Halfway through the voyage, the ship struck an iceberg and sank in the early morning of 15 April 1912, resulting in the deaths of over 1,500 people, including approximately 703 of the passengers."



In [1]:

    
from IPython.display import YouTubeVideo; YouTubeVideo("9xoqXVjBEF8")









    Out[1]:

The goal is to analyze the passenger list, look for patterns and, ultimately, find out whether women and children really went first.

Assignment

Your mission is to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I recommend using a temporary notebook to work with the dataset and, when the code is ready and producing the expected output, copying and pasting it into this notebook. Once it is done and validated, just copy the file elsewhere, as the notebook will be the only file sent for evaluation. Of course, everything starts by downloading this notebook.

*No worries, there is no test in this class, just... assignments!*

Deadline

February $24^{th}$.

Data

The Titanic's passengers were divided into three separate classes, determined not only by the price of their ticket but by wealth and social class: those travelling in first class, the wealthiest passengers on board, were prominent members of the upper class and included businessmen, politicians, high-ranking military personnel, industrialists, bankers and professional athletes. Second class passengers were middle class travellers and included professors, authors, clergymen and tourists. Third class or steerage passengers were primarily immigrants moving to the United States and Canada.

In the file titanic.xls you will find part of the original list of passengers. The variables or columns are described below:

survived: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare in pound sterling (£)
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Boat number used for survival
home.dest: Home / Final destination

Consider that pclass is a proxy for socio-economic status:

1st ~ Upper
2nd ~ Middle
3rd ~ Lower

And that age is given in years, with a couple of exceptions:

If the age is less than 1, it is given as a fraction
If the age is an estimate, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so parch=0 for them. Likewise, some passengers travelled with very close friends or neighbors from the same village, but the definitions do not cover such relations.

When loaded into a DataFrame, the dataset looks like this:



In [2]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)



In [3]:

    
titanic = pd.read_excel("data/titanic.xls")
titanic.head()









    Out[3]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON |

Preparation

Before we can actually start playing with the data, we need to add a few columns. First, create a new column, cabin_type, with the letter of the cabin if known. For example, if the cabin is 'C22 C26', cabin_type would be C; for something like 'A90 B11' (which never happens), only the first code is used, so cabin_type would be A.
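As a hint, the rule for a single value can be sketched in plain Python. The helper name `first_cabin_letter` is hypothetical (it is not part of the assignment), and returning None for an unknown cabin is an assumption; a function like this could be passed to a column's apply method.

```python
def first_cabin_letter(cabin):
    """Return the letter of the first cabin code, or None if the cabin is unknown."""
    # Unknown cabins arrive from pandas as float NaN, not as strings.
    if not isinstance(cabin, str) or not cabin.strip():
        return None
    # 'C22 C26' -> first code 'C22' -> letter 'C'
    return cabin.split()[0][0]


print(first_cabin_letter("C22 C26"))     # two codes, only the first is used
print(first_cabin_letter("B5"))          # a single code
print(first_cabin_letter(float("nan")))  # unknown cabin
```

pandas string methods on the column are another way to get the same result; the sketch only makes the rule explicit.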



In [4]:

    
titanic["cabin_type"] = titanic["cabin"].
titanic.head()









    Out[4]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest | cabin_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | St Louis, MO | B |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | Montreal, PQ / Chesterville, ON | C |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C |

We also need a column, name_title, with the title included in the name, in lower case. For example, if the name is Allison, Mrs. Hudson J C (Bessie Waldo Daniels), name_title would be mrs. Note that the dot is excluded. If the title is none of Master, Miss, Mr, Ms or Mrs, it will be classified as other.
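One possible way to pull the title out of a name is a regular expression that captures the text between the comma and the first dot. This is only a sketch: the names `extract_title` and `KNOWN_TITLES` are hypothetical, and mr is kept as its own title because the expected output shows it.

```python
import re

# Titles kept as-is; anything else is reported as "other".
KNOWN_TITLES = {"master", "miss", "mr", "ms", "mrs"}


def extract_title(name):
    """Return the lower-cased title found in 'Surname, Title. Given names'."""
    # Capture whatever sits between the comma and the first dot.
    match = re.search(r",\s*([^.,]+)\.", name)
    if match is None:
        return "other"
    title = match.group(1).strip().lower()
    return title if title in KNOWN_TITLES else "other"


print(extract_title("Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"))
print(extract_title("Allison, Master. Hudson Trevor"))
print(extract_title("Duff Gordon, Sir. Cosmo Edmund"))
```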



In [5]:

    
def assign_name(name):
    pass

titanic["name_title"] = titanic["name"].apply(assign_name)
titanic.head()









    Out[5]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest | cabin_type | name_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | St Louis, MO | B | miss |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | Montreal, PQ / Chesterville, ON | C | master |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | miss |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | mr |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | mrs |

Finally, a third new column, age_cat, that maps the age to one of the following values: children (under 14 years), adolescents (14-20), adult (21-64), and senior (65+).
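The thresholds can be written as a chain of comparisons. The sketch below uses a hypothetical helper, `age_category`, and makes two assumptions that the text does not settle: unknown ages are returned as None, and estimated ages such as 20.5 fall into the lower category (adolescents).

```python
import math


def age_category(age):
    """Map an age in years to children, adolescents, adult or senior."""
    # Assumption: a missing age (None or NaN) has no category.
    if age is None or (isinstance(age, float) and math.isnan(age)):
        return None
    if age < 14:
        return "children"
    if age < 21:   # covers 14-20, including estimates such as 20.5
        return "adolescents"
    if age < 65:   # covers 21-64
        return "adult"
    return "senior"


for sample in (0.9167, 14, 20.5, 29, 65, float("nan")):
    print(sample, age_category(sample))
```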



In [6]:

    
def assign_age(age):
    pass

titanic["age_cat"] = titanic["age"].apply(assign_age)
titanic.head()









    Out[6]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest | cabin_type | name_title | age_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | St Louis, MO | B | miss | adult |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | Montreal, PQ / Chesterville, ON | C | master | children |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | miss | children |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | mr | adult |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | Montreal, PQ / Chesterville, ON | C | mrs | adult |

The last cosmetic change will be to replace the letters in the variable embarked with the actual names of the cities:

C: Cherbourg
Q: Queenstown
S: Southampton

Note that in this case we won't create a new column but replace the original one.
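A dictionary lookup is one way to express this kind of replacement. The sketch below runs on a small made-up Series rather than the real column; the names `ports`, `embarked` and `named` are illustrative only.

```python
import pandas as pd

# Letter -> city name, as listed above.
ports = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Toy stand-in for the embarked column, including one unknown port.
embarked = pd.Series(["S", "C", "Q", None])

# Series.map looks each value up in the dictionary; values without
# a matching key (here the missing one) come back as NaN.
named = embarked.map(ports)
print(named.tolist())
```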



In [7]:

    
titanic["embarked"] = titanic["embarked"].
titanic.head()









    Out[7]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest | cabin_type | name_title | age_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | Southampton | 2 | St Louis, MO | B | miss | adult |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | 11 | Montreal, PQ / Chesterville, ON | C | master | children |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | miss | children |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | mr | adult |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | mrs | adult |

Analysis

*Rose: "I'll never let go, Jack..."*

Where did the richer people embark from? And what about the destination? Where did the 10 poorest people want to go to or come from? To answer this, take a look at the fare they paid, and return the average per city.
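The general pattern of grouping, averaging and sorting can be tried first on a tiny made-up table. The data and the names `toy` and `mean_fare` below are illustrative and are not taken from titanic.xls.

```python
import pandas as pd

# Made-up fares for three ports.
toy = pd.DataFrame({
    "embarked": ["Cherbourg", "Cherbourg", "Queenstown", "Southampton", "Southampton"],
    "fare": [80.0, 40.0, 10.0, 30.0, 20.0],
})

# Average fare per port, cheapest first.
mean_fare = toy.groupby("embarked")[["fare"]].mean().sort_values("fare")
print(mean_fare)
```

Slicing the sorted result, for example with `[:10]`, keeps only the first rows.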



In [8]:

    
titanic.groupby()[["fare"]].aggregate().sort_values("fare")









    Out[8]:






  
    
      
| embarked | fare |
|---|---|
| Queenstown | 12.409012 |
| Southampton | 27.418824 |
| Cherbourg | 62.336267 |



In [9]:

    
titanic.groupby()[[]].aggregate().sort_values()[:10]









    Out[9]:






  
    
      
| home.dest | fare |
|---|---|
| Liverpool, England / Belfast | 0.000000 |
| Belfast, NI | 0.000000 |
| Belfast | 0.000000 |
| Rotterdam, Netherlands | 0.000000 |
| Syria | 6.155567 |
| Liverpool | 6.500000 |
| Co Cork, Ireland Charlestown, MA | 6.750000 |
| Effington Rut, SD | 6.975000 |
| Portugal | 7.050000 |
| Argentina | 7.050000 |

Let's now consider proportions (ratios) of survival. Calculate the proportion of passengers that survived by sex, and then the same proportion by age category.
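Because survived is coded as 0 or 1, the number of survivors in a group is the sum of the column and the group size is its count. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Seven made-up passengers: 2 of 3 women and 1 of 4 men survive.
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 1, 0],
})

survived = toy.groupby("sex")[["survived"]].sum()   # survivors per group
total = toy.groupby("sex")[["survived"]].count()    # passengers per group
ratio = 100 * survived / total                      # percentage that survived
print(ratio)
```

Passing a list of column names to groupby gives the same ratio for combinations of columns.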



In [10]:

    
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio



In [11]:

    
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio









    Out[11]:






  
    
      
| age_cat | survived |
|---|---|
| adolescents | 38.255034 |
| adult | 39.617834 |
| children | 35.911602 |
| senior | 15.384615 |

Calculate the same proportion, but by sex and class.



In [12]:

    
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio

Calculate survival proportions by age category, class and sex.



In [13]:

    
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()









    Out[13]:






  
    
      
      
| age_cat | pclass | survived (female) | survived (male) |
|---|---|---|---|
| adolescents | 1 | 100.000000 | 20.000000 |
| | 2 | 92.307692 | 11.764706 |
| | 3 | 54.285714 | 12.500000 |
| adult | 1 | 96.551724 | 34.328358 |
| | 2 | 86.842105 | 7.812500 |
| | 3 | 44.186047 | 15.918367 |
| children | 1 | 91.666667 | 39.393939 |
| | 2 | 94.117647 | 54.166667 |
| | 3 | 51.578947 | 15.469613 |
| senior | 1 | 100.000000 | 14.285714 |
| | 2 | NaN | 0.000000 |
| | 3 | NaN | 0.000000 |

Calculate survival proportions by age category and sex.



In [14]:

    
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()









    Out[14]:






  
    
      
| age_cat | survived (female) | survived (male) |
|---|---|---|
| adolescents | 73.015873 | 12.790698 |
| adult | 77.697842 | 18.737673 |
| children | 61.290323 | 22.689076 |
| senior | 100.000000 | 8.333333 |

So, women and children first?

*[Mrs. Charlotte Collyer and her daughter Marjorie](http://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic)*

Let's now look at price. To see the distribution of prices, the first thing we need to do is calculate the average fare per class, as well as the average age and the number of people in each class.
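A pivot table can apply a different aggregate to each column by passing a dictionary as aggfunc, and margins=True appends an "All" row computed over the whole table. The sketch below shows the mechanics on four made-up passengers; `toy` and `table` are illustrative names.

```python
import pandas as pd

# Four made-up passengers in three classes.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3],
    "name": ["A", "B", "C", "D"],
    "age": [30.0, 50.0, 20.0, 40.0],
    "fare": [100.0, 200.0, 30.0, 10.0],
})

table = pd.pivot_table(
    toy,
    index=["pclass"],
    values=["fare", "age", "name"],
    # One aggregate per column: averages for numbers, a count for names.
    aggfunc={"fare": "mean", "age": "mean", "name": "count"},
    margins=True,  # adds the "All" row
)
print(table)
```

Counting a column that is never empty, such as name, is one way to get the number of people per group.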



In [15]:

    
pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={"fare": , "age": , "name": },
               margins=True)

And we can even split the latter per cabin type, adding an aggregate that counts the number of survivors.



In [16]:

    
pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={"fare": , "age": , "name": , "survived": },
               margins=)

Well, it starts to look like the fare maybe, and only maybe, had something to do with the probability of survival, doesn't it?

Visualizations

Before getting deeper into that relationship, let's just explore some basic plots. For example, let's plot the proportion of survivors by name title.



In [17]:

    
ax = titanic.boxplot(column=, by=, grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("")  # Do not change this statement









    Out[17]:





<matplotlib.text.Text at 0x7fb758706cf8>

Also, if we define a new column, alone, that is True if the passenger was travelling without family and False otherwise, we can plot it in a box plot by age.
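Under the definitions given earlier, a passenger has no family aboard when both sibsp and parch are zero. One way to write that test, shown on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Three made-up passengers: no relatives, a large family, one parent or child.
toy = pd.DataFrame({"sibsp": [0, 1, 0], "parch": [0, 2, 1]})

# Both counts are non-negative, so their sum is zero only if both are zero.
toy["alone"] = (toy["sibsp"] + toy["parch"]) == 0
print(toy)
```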



In [18]:

    
titanic["alone"] = 
ax = titanic.boxplot(column="", by="", grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("")  # Do not change this statement









    Out[18]:





<matplotlib.text.Text at 0x7fb7587d34e0>

On the other hand, to better see the relationship between the proportion of survivors and fare, we are going to create a table that has two row indices, sex and age category, and, as values, the average age, the average fare, and the proportion of passengers that survived. Define and use the function ratio as an aggregate to calculate the proportion of people that survived.
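An aggregate function receives all the values of a group and returns a single number. For a column of 0s and 1s, the percentage of 1s is the sum divided by the length. The sketch below is one possible body for such a function; returning NaN for an empty group is an assumption.

```python
def ratio(values):
    """Percentage of 1s in a sequence of 0/1 values."""
    values = list(values)  # works for lists and for pandas Series alike
    if not values:
        # Assumption: an empty group has no defined ratio.
        return float("nan")
    return 100 * sum(values) / len(values)


print(ratio([1, 0, 0, 1]))  # two survivors out of four
```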



In [19]:

    
def ratio(values):
    pass



In [20]:

    
pt_titanic = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
                    "fare": ,
                    "age": ,
                    "survived": 
                },
               margins=)
pt_titanic









    Out[20]:






  
    
      
      
| sex | age_cat | age | fare | survived |
|---|---|---|---|---|
| female | adolescents | 17.428571 | 34.826592 | 73.015873 |
| | adult | 34.953237 | 57.661422 | 77.697842 |
| | children | 5.208335 | 26.012198 | 61.290323 |
| | senior | 76.000000 | 78.850000 | 100.000000 |
| male | adolescents | 17.970930 | 21.331057 | 12.790698 |
| | adult | 34.433925 | 28.636848 | 18.737673 |
| | children | 5.416666 | 21.671076 | 22.689076 |
| | senior | 69.541667 | 44.978483 | 8.333333 |
| All | | 29.881135 | 33.295479 | 38.197097 |

Now let's create two plots, one for females and another for males, comparing the series of average fare and survival proportion, sorted by fare.



In [21]:

    
female = pt_titanic.query("sex == 'female'").sort_values()[[]]
male = pt_titanic.query().sort_values()[[]]

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot
ax2 = fig.add_subplot

female.plot(ax=)
ax1.set_ylabel()

male.plot(ax=)
ax2.set_xlabel()
ax2.set_ylabel()

ax2.set_xticklabels(["children", "", "adolescent", "", "adult", "", "senior"])









    Out[21]:





[<matplotlib.text.Text at 0x7fb75852add8>,
 <matplotlib.text.Text at 0x7fb75852e3c8>,
 <matplotlib.text.Text at 0x7fb7584fcdd8>,
 <matplotlib.text.Text at 0x7fb75850ba58>,
 <matplotlib.text.Text at 0x7fb75850e4e0>,
 <matplotlib.text.Text at 0x7fb75850ef28>,
 <matplotlib.text.Text at 0x7fb7585119b0>]

At least for women, it looks like the richer the woman, the greater her chances of survival. For men, on the other hand, it does not look like that; if anything, it would be the opposite.

But let's take a closer look at the distribution of ages (ignoring the people with unknown age) by creating a histogram of the age distribution.



In [22]:

    
ax = titanic[].dropna()...(bins=30, alpha=0.8, grid=False, figsize=(12, 4))
ax.set_title()









    Out[22]:





<matplotlib.text.Text at 0x7fb7584adb70>

To further support this idea, let's create another pivot table but, instead of using the age categories, let's calculate the corresponding values for fare and proportion of survival for each individual age.



In [23]:

    
pt_titanic2 = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
               },
               margins=True)
pt_titanic2[[]]









    Out[23]:






  
    
      
      
| sex | age | fare | survived |
|---|---|---|---|
| female | 0.1667 | 20.575000 | 100.000000 |
| | 0.75 | 19.258300 | 100.000000 |
| | 0.9167 | 27.750000 | 100.000000 |
| | 1.0 | 19.467500 | 80.000000 |
| | 2.0 | 39.955357 | 28.571429 |
| | 3.0 | 25.476400 | 33.333333 |
| | 4.0 | 22.828340 | 100.000000 |
| | 5.0 | 22.717700 | 100.000000 |
| | 6.0 | 32.137500 | 50.000000 |
| | 7.0 | 26.250000 | 100.000000 |
| | 8.0 | 24.441667 | 66.666667 |
| | 9.0 | 24.808320 | 20.000000 |
| ... | ... | ... | ... |
| male | 62.0 | 18.321875 | 25.000000 |
| | 63.0 | 26.000000 | 0.000000 |
| | 64.0 | 121.416667 | 0.000000 |
| | 65.0 | 32.093067 | 0.000000 |
| | 66.0 | 10.500000 | 0.000000 |
| | 67.0 | 221.779200 | 0.000000 |
| | 70.0 | 40.750000 | 0.000000 |
| | 70.5 | 7.750000 | 0.000000 |
| | 71.0 | 42.079200 | 0.000000 |
| | 74.0 | 7.775000 | 0.000000 |
| | 80.0 | 30.000000 | 100.000000 |
| All | | 33.295479 | 38.197097 |
    
  

167 rows × 2 columns

And now plot that information in a scatter plot, with the average fare in pound sterling on the X-axis and the average ratio of survival on the Y-axis.



In [24]:

    
female2 = pt_titanic2.query
male2 = pt_titanic2.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title()
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])









    Out[24]:





<matplotlib.legend.Legend at 0x7fb758420ba8>

The pattern, if any, is hard to see using this visualization. However, we can use pandas.cut() to create bigger age groups and try again.
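To see how the binning behaves before applying it to the real column, here is a small sketch on made-up ages (`ages` and `groups` are illustrative names). With right=False each bin includes its left edge and excludes its right edge, so 10 falls in 10-19 and a value equal to the last edge, 80, is left without a group.

```python
import pandas as pd

# Same labels and edges as in the completed cell below: 0-9, 10-19, ..., 70-79.
labels = ["{}-{}".format(i, i + 9) for i in range(0, 71, 10)]
ages = pd.Series([0.9167, 9.0, 10.0, 29.0, 30.0, 80.0])

# Bins are [0, 10), [10, 20), ..., [70, 80).
groups = pd.cut(ages, range(0, 81, 10), right=False, labels=labels)
print(groups.tolist())
```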



In [25]:

    
# This cell is complete
labels = ["{}-{}".format(i, i+9) for i in range(0, 71, 10)]
titanic["age_group"] = pd.cut(titanic["age"].dropna(), range(0, 81, 10), right=False, labels=labels)
titanic.head()









    Out[25]:






  
    
      
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | home.dest | cabin_type | name_title | age_cat | alone | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | Southampton | 2 | St Louis, MO | B | miss | adult | True | 20-29 |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | 11 | Montreal, PQ / Chesterville, ON | C | master | children | False | 0-9 |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | miss | children | False | 0-9 |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | mr | adult | False | 30-39 |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | Southampton | NaN | Montreal, PQ / Chesterville, ON | C | mrs | adult | False | 20-29 |



In [26]:

    
pt_titanic3 = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
                },
               margins=True)
female3 = pt_titanic3.query
male3 = pt_titanic3.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title("")
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])









    Out[26]:





<matplotlib.legend.Legend at 0x7fb75839eda0>

Effectively, once grouped into 10-year groups, the more expensive the fare, the more likely you were to survive. But only if you were a woman, and the trend actually gets stronger for older women than for children. For males, it doesn't matter; if anything, the trend seems to be that the poorer you were, the better your chances of survival.

And now for a video.



In [27]:

    
YouTubeVideo("FHG2oizTlpY")









    Out[27]:

Statistics

By applying simple ordinary least squares, we can see whether these trends are actually significant.



In [28]:

    
import statsmodels.api as sm

First, we calculate the ordinary least squares for both male and female, using the fare as our independent variable (X).
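To make clear what the regression estimates, the sketch below fits a line with one independent variable in plain Python and reports the intercept, the slope and the R-squared. The helper `simple_ols` and the numbers are made up for illustration; this is not the statsmodels API. In a statsmodels summary the intercept appears as the const row, which is typically obtained by adding a constant column to X (for example with sm.add_constant) before fitting.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; also return R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance of y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up average fares (X) and survival percentages (Y) on an exact line.
fares = [10.0, 20.0, 30.0, 40.0]
survival = [21.0, 41.0, 61.0, 81.0]
intercept, slope, r_squared = simple_ols(fares, survival)
print(intercept, slope, r_squared)
```

With only eight age groups per sex, as in the summaries shown below, any such fit rests on very few observations.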



In [29]:

    
female_model3 = sm.OLS(, )
female_results3 = female_model3.fit()
print(female_results3.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.544
Method:                 Least Squares   F-statistic:                     9.335
Date:                Tue, 10 Feb 2015   Prob (F-statistic):             0.0224
Time:                        16:45:22   Log-Likelihood:                -26.637
No. Observations:                   8   AIC:                             57.27
Df Residuals:                       6   BIC:                             57.43
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         57.7050      7.754      7.442      0.000        38.731    76.679
fare           0.3644      0.119      3.055      0.022         0.073     0.656
==============================================================================
Omnibus:                        1.994   Durbin-Watson:                   2.682
Prob(Omnibus):                  0.369   Jarque-Bera (JB):                0.165
Skew:                           0.325   Prob(JB):                        0.921
Kurtosis:                       3.269   Cond. No.                         183.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.






    



/home/versae/.venvs/dh2304/lib/python3.4/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  int(n))



In [30]:

    
male_model3 = sm.OLS(, )
male_results3 = male_model3.fit()
print(male_results3.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.034
Model:                            OLS   Adj. R-squared:                 -0.126
Method:                 Least Squares   F-statistic:                    0.2142
Date:                Tue, 10 Feb 2015   Prob (F-statistic):              0.660
Time:                        16:45:23   Log-Likelihood:                -33.415
No. Observations:                   8   AIC:                             70.83
Df Residuals:                       6   BIC:                             70.99
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         27.6788     19.551      1.416      0.207       -20.160    75.517
fare          -0.2435      0.526     -0.463      0.660        -1.531     1.044
==============================================================================
Omnibus:                       11.693   Durbin-Watson:                   1.310
Prob(Omnibus):                  0.003   Jarque-Bera (JB):                3.872
Skew:                           1.484   Prob(JB):                        0.144
Kurtosis:                       4.676   Cond. No.                         113.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now, we plot everything together.



In [31]:

    
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(female3["fare"], female_results3.fittedvalues, 'r--.', label='$R^2$: {:.2}'.format(female_results3.rsquared))
ax.plot(, , 'bo', label="Male")
ax.plot(male3["fare"], male_results3.fittedvalues, 'g--.', label='$R^2$: {:.2}'.format(male_results3.rsquared))
ax.legend(loc='best')
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.set_title("")









    Out[31]:





<matplotlib.text.Text at 0x7fb7501ae198>

Now repeat the whole process for age group and proportion of survivors.



In [32]:

    
pt_titanic4 = pd.pivot_table(titanic,
               index=,
               values=,
               aggfunc=,
               margins=True)
female4 = pt_titanic4.query
male4 = pt_titanic4.query

female_model4 = sm.OLS(, )
female_results4 = female_model4.
print(female_results4.)

male_model4 = sm.OLS(, )
male_results4 = male_model4.
print(male_results4.)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(, , 'r--.', label='$R^2$: {:.2}'.format(female_results4.))
ax.plot(, , 'bo', label="Male")
ax.plot(, , 'g--.', label='$R^2$: {:.2}'.format(male_results4.))
ax.legend(loc='best')
ax.set_xlabel()
ax.set_ylabel()
ax.set_title()









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.801
Model:                            OLS   Adj. R-squared:                  0.767
Method:                 Least Squares   F-statistic:                     24.09
Date:                Tue, 10 Feb 2015   Prob (F-statistic):            0.00269
Time:                        16:45:23   Log-Likelihood:                -23.941
No. Observations:                   8   AIC:                             51.88
Df Residuals:                       6   BIC:                             52.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         63.0402      3.950     15.959      0.000        53.375    72.706
age            0.4267      0.087      4.908      0.003         0.214     0.639
==============================================================================
Omnibus:                        0.953   Durbin-Watson:                   3.483
Prob(Omnibus):                  0.621   Jarque-Bera (JB):                0.488
Skew:                          -0.542   Prob(JB):                        0.783
Kurtosis:                       2.461   Cond. No.                         91.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.569
Model:                            OLS   Adj. R-squared:                  0.498
Method:                 Least Squares   F-statistic:                     7.937
Date:                Tue, 10 Feb 2015   Prob (F-statistic):             0.0305
Time:                        16:45:23   Log-Likelihood:                -30.185
No. Observations:                   8   AIC:                             64.37
Df Residuals:                       6   BIC:                             64.53
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         40.7309      8.789      4.635      0.004        19.226    62.236
age           -0.5572      0.198     -2.817      0.030        -1.041    -0.073
==============================================================================
Omnibus:                        1.502   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.472   Jarque-Bera (JB):                0.019
Skew:                          -0.073   Prob(JB):                        0.991
Kurtosis:                       3.186   Cond. No.                         90.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.






    Out[32]:





<matplotlib.text.Text at 0x7fb7501515f8>
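
The columns of the coefficient tables printed above are linked by simple arithmetic: the t statistic is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus a critical value times the standard error. Below is a minimal sketch using the `age` row of the first summary. The critical value is hard-coded from a standard Student's t table for 6 residual degrees of freedom, which is an assumption of this sketch rather than something statsmodels prints.

```python
# Values read off the `age` row of the first summary above (already rounded).
coef = 0.4267
std_err = 0.087

# Two-sided 95% critical value of Student's t with 6 degrees of freedom,
# taken from a standard t table (assumption of this sketch).
T_CRIT_6DF = 2.447

t_stat = coef / std_err
ci_low = coef - T_CRIT_6DF * std_err
ci_high = coef + T_CRIT_6DF * std_err

print(round(t_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

The numbers should agree with the table (t of about 4.9, interval of about 0.214 to 0.639) up to the rounding of the inputs. An interval that excludes zero, as this one does, goes hand in hand with a small value in the `P>|t|` column.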

Conclusions

Who survived in the end? What about the correlation between fare and survival ratio for women and men? Is OLS the best approach for men?

What about class? And cabins?

Did any age category have a better chance of survival? Which one(s)?

Why? Did children survive at a higher rate than older people?
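
When thinking about the correlation question above, it may help to remember that in a regression with one predictor and an intercept, the R-squared equals the square of the Pearson correlation between the two variables. The sketch below checks this on made-up numbers (not Titanic data). The helper names `pearson_r` and `ols_r_squared` are hypothetical, and the fit uses the closed-form least-squares formulas rather than statsmodels.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ols_r_squared(xs, ys):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up illustration data: a fare-like predictor and a percentage-like response.
fares = [10.0, 20.0, 30.0, 40.0, 50.0]
survival = [20.0, 35.0, 30.0, 55.0, 60.0]

r = pearson_r(fares, survival)
r2 = ols_r_squared(fares, survival)
print(round(r ** 2, 6), round(r2, 6))  # the two values coincide
```

A low R-squared therefore says the linear correlation is weak; it does not rule out a relationship of some other shape.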

*My Heart Will Go On*

		survived
sex	pclass
female	1	96.527778
	2	88.679245
	3	49.074074
male	1	34.078212
	2	14.619883
	3	15.212982

		age	fare	name	survived
pclass	cabin_type
1	A	44.157895	41.244314	22	11
	B	36.476190	122.383078	65	47
	C	38.382752	107.926598	94	57
	D	41.040541	58.919065	40	28
	E	39.593750	63.464706	34	24
	T	45.000000	35.500000	1	0
2	D	29.800000	13.595833	6	4
	E	38.833333	11.587500	4	3
	F	19.076923	23.423077	13	10
3	E	21.666667	11.000000	3	3
	F	27.200000	9.395838	8	3
	G	12.000000	14.205000	5	3
All		29.881135	33.295479	1309	500

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	Montreal, PQ / Chesterville, ON

	fare
embarked
Queenstown	12.409012
Southampton	27.418824
Cherbourg	62.336267

	fare
home.dest
Liverpool, England / Belfast	0.000000
Belfast, NI	0.000000
Belfast	0.000000
Rotterdam, Netherlands	0.000000
Syria	6.155567
Liverpool	6.500000
Co Cork, Ireland Charlestown, MA	6.750000
Effington Rut, SD	6.975000
Portugal	7.050000
Argentina	7.050000

	survived
age_cat
adolescents	38.255034
adult	39.617834
children	35.911602
senior	15.384615

	age	fare	name
pclass
1	39.159918	87.508992	323
2	29.506705	21.179196	277
3	24.816367	13.302889	709
All	29.881135	33.295479	1309

		fare	survived
sex	age
female	0.1667	20.575000	100.000000
	0.75	19.258300	100.000000
	0.9167	27.750000	100.000000
	1.0	19.467500	80.000000
	2.0	39.955357	28.571429
	3.0	25.476400	33.333333
	4.0	22.828340	100.000000
	5.0	22.717700	100.000000
	6.0	32.137500	50.000000
	7.0	26.250000	100.000000
	8.0	24.441667	66.666667
	9.0	24.808320	20.000000
...	...	...	...
male	62.0	18.321875	25.000000
	63.0	26.000000	0.000000
	64.0	121.416667	0.000000
	65.0	32.093067	0.000000
	66.0	10.500000	0.000000
	67.0	221.779200	0.000000
	70.0	40.750000	0.000000
	70.5	7.750000	0.000000
	71.0	42.079200	0.000000
	74.0	7.775000	0.000000
	80.0	30.000000	100.000000
All		33.295479	38.197097
