[Data, the Humanist's New Best Friend](index.ipynb)
*Assignment 1*
Titanic: Women and Children First

*Yep, no room for more people, sorry, mate*

From Wikipedia, "the passengers of the RMS Titanic were among the estimated 2,223 people who sailed on the maiden voyage of the second of the White Star Line's Olympic class ocean liners, from Southampton to New York City. Halfway through the voyage, the ship struck an iceberg and sank in the early morning of 15 April 1912, resulting in the deaths of over 1,500 people, including approximately 703 of the passengers."


In [1]:
from IPython.display import YouTubeVideo; YouTubeVideo("9xoqXVjBEF8")


Out[1]:

The goal will be to analyze the passenger list, look for patterns and, ultimately, find out whether women and children really went first.

Assignment

Your mission will be to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I would recommend using a temporary notebook to work with the dataset and, when the code is ready and producing the expected output, copying it into this notebook. Once it is done and validated, copy the file elsewhere, as the notebook will be the only file sent for evaluation. Of course, everything starts by downloading this notebook.

*No worries, there is no test in this class, just... assignments!*

Deadline

February $24^{th}$.

Data

The Titanic's passengers were divided into three separate classes, determined not only by the price of their ticket but by wealth and social class: those travelling in first class, the wealthiest passengers on board, were prominent members of the upper class and included businessmen, politicians, high-ranking military personnel, industrialists, bankers and professional athletes. Second class passengers were middle class travellers and included professors, authors, clergymen and tourists. Third class or steerage passengers were primarily immigrants moving to the United States and Canada.

In the file titanic.xls you will find part of the original list of passengers. The variables or columns are described below:

  • survival: Survival (0 = No; 1 = Yes)
  • pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • ticket: Ticket Number
  • fare: Passenger Fare in pound sterling (£)
  • cabin: Cabin
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat: Boat number used for survival
  • home.dest: Home / Final destination

Consider that pclass is a proxy for socio-economic status:

  • 1st ~ Upper
  • 2nd ~ Middle
  • 3rd ~ Lower

And note that age is given in years, with a couple of exceptions:

  • If the age is less than 1, it is given as a fraction
  • If the age is an estimate, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiancés Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so parch=0 for them. Some also travelled with very close friends or neighbors from their village, but the definitions do not support such relations.

When loaded into a DataFrame, the dataset looks like this:


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)

In [3]:
titanic = pd.read_excel("data/titanic.xls")
titanic.head()


Out[3]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON

Preparation

Before we can actually start playing with the data, we need to add a few columns. First, create a new column, cabin_type, with the letter of the cabin if known. For example, if the cabin is 'C22 C26', cabin_type would be C; for something like 'A90 B11' (which never actually happens), only the first code is used, making A the cabin_type.


In [4]:
titanic["cabin_type"] = titanic["cabin"].
titanic.head()


Out[4]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest cabin_type
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 St Louis, MO B
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 Montreal, PQ / Chesterville, ON C
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C
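One way to derive cabin_type (a sketch, not necessarily the intended solution) is to take the first character of the cabin string, which also covers multi-cabin entries like 'C22 C26' since only the first code matters. On a toy DataFrame with assumed values:

```python
import pandas as pd

# Toy data mimicking the titanic "cabin" column (illustrative values)
df = pd.DataFrame({"cabin": ["B5", "C22 C26", None]})

# The first character of the cabin string is its deck letter; .str
# indexing propagates missing values, so unknown cabins stay NaN.
df["cabin_type"] = df["cabin"].str[0]
print(df["cabin_type"].tolist())  # ['B', 'C', nan]
```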

We also need a column, name_title, with the title included in the name but in lower case. For example, if the name is Allison, Mrs. Hudson J C (Bessie Waldo Daniels), name_title would be mrs. Note the dot is excluded. If the title is none of Master, Miss, Mr, Ms or Mrs, it will be classified as other.


In [5]:
def assign_name(name):
    pass

titanic["name_title"] = titanic["name"].apply(assign_name)
titanic.head()


Out[5]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest cabin_type name_title
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 St Louis, MO B miss
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 Montreal, PQ / Chesterville, ON C master
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C miss
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C mr
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C mrs
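A sketch of one possible assign_name: the names follow the "Surname, Title. Given names" pattern, so the title sits between the comma and the dot. The sample names and the exact title list here are assumptions for illustration:

```python
import pandas as pd

def assign_name(name):
    # The title sits between the comma and the following dot,
    # e.g. "Allison, Mrs. Hudson J C" -> "mrs".
    title = name.split(",")[1].strip().split(".")[0].lower()
    if title in ("master", "miss", "mr", "ms", "mrs"):
        return title
    return "other"

names = pd.Series(["Allen, Miss. Elisabeth Walton",
                   "Allison, Master. Hudson Trevor",
                   "Astor, Col. John Jacob"])
print(names.apply(assign_name).tolist())  # ['miss', 'master', 'other']
```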

Finally, create a third new column, age_cat, that maps the age to one of the following values: children (under 14 years), adolescents (14-20), adult (21-64), and senior (65+).


In [6]:
def assign_age(age):
    pass

titanic["age_cat"] = titanic["age"].apply(assign_age)
titanic.head()


Out[6]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest cabin_type name_title age_cat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 St Louis, MO B miss adult
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 Montreal, PQ / Chesterville, ON C master children
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C miss children
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C mr adult
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN Montreal, PQ / Chesterville, ON C mrs adult
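One way to sketch assign_age, using the category boundaries from the text (children under 14, adolescents 14-20, adult 21-64, senior 65+) and keeping unknown ages missing:

```python
import pandas as pd

def assign_age(age):
    # Boundaries taken from the text: children (<14),
    # adolescents (14-20), adult (21-64), senior (65+).
    if pd.isna(age):
        return age            # keep unknown ages missing
    if age < 14:
        return "children"
    if age <= 20:
        return "adolescents"
    if age <= 64:
        return "adult"
    return "senior"

ages = pd.Series([0.9167, 16, 29, 71, None])
print(ages.apply(assign_age).tolist())
# ['children', 'adolescents', 'adult', 'senior', nan]
```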

The last cosmetic change will be to replace the letters in the variable embarked with the actual names of the cities:

  • C: Cherbourg
  • Q: Queenstown
  • S: Southampton

Note that in this case we won't create a new column but replace the original one.


In [7]:
titanic["embarked"] = titanic["embarked"].
titanic.head()


Out[7]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest cabin_type name_title age_cat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 Southampton 2 St Louis, MO B miss adult
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 Southampton 11 Montreal, PQ / Chesterville, ON C master children
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C miss children
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C mr adult
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C mrs adult
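One possible way to do the replacement (a sketch) is a dictionary lookup with Series.map, shown here on a toy series:

```python
import pandas as pd

ports = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
embarked = pd.Series(["S", "C", "Q", None])

# Series.map applies the dictionary value-by-value; keys that are
# missing (NaN here) map to NaN, so unknown ports stay unknown.
print(embarked.map(ports).tolist())
# ['Southampton', 'Cherbourg', 'Queenstown', nan]
```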

Analysis

*Rose: "I'll never let go, Jack..."*

Where did the richer people embark from? And what about the destination? Where did the 10 poorest people want to go to or come from? To answer this, take a look at the fare they paid, and return the average per city.


In [8]:
titanic.groupby()[["fare"]].aggregate().sort("fare")


Out[8]:
fare
embarked
Queenstown 12.409012
Southampton 27.418824
Cherbourg 62.336267
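The general shape of this computation, sketched on toy data (note the scaffold's .sort() was the 2015 pandas spelling; in current pandas it is sort_values):

```python
import pandas as pd

# Toy data with assumed fares, one small group per port
toy = pd.DataFrame({"embarked": ["Queenstown", "Cherbourg",
                                 "Cherbourg", "Southampton"],
                    "fare": [10.0, 60.0, 64.0, 27.0]})

# Group by port, average the fare, then sort ascending; the old
# .sort("fare") call is spelled .sort_values("fare") nowadays.
avg = toy.groupby("embarked")[["fare"]].mean().sort_values("fare")
print(avg)
```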

In [9]:
titanic.groupby()[[]].aggregate().sort()[:10]


Out[9]:
fare
home.dest
Liverpool, England / Belfast 0.000000
Belfast, NI 0.000000
Belfast 0.000000
Rotterdam, Netherlands 0.000000
Syria 6.155567
Liverpool 6.500000
Co Cork, Ireland Charlestown, MA 6.750000
Effington Rut, SD 6.975000
Portugal 7.050000
Argentina 7.050000

Let's now consider proportions (ratios) of survival. Calculate the proportion of passengers that survived by sex, and then the same proportion by age category.


In [10]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio


Out[10]:
survived
sex
female 72.746781
male 19.098458
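A sketch of the survived/total pattern used in the cell above, on toy data: because survived is a 0/1 column, sum() counts survivors and count() counts passengers, so their ratio is the survival proportion per group.

```python
import pandas as pd

# Assumed toy passengers for illustration
toy = pd.DataFrame({"sex": ["female", "female", "male", "male", "male"],
                    "survived": [1, 1, 0, 1, 0]})

# sum() counts survivors per group, count() counts passengers,
# so 100 * sum / count is the survival percentage.
survived = toy.groupby("sex")[["survived"]].sum()
total = toy.groupby("sex")[["survived"]].count()
ratio = 100 * survived / total
print(ratio)
```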

In [11]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio


Out[11]:
survived
age_cat
adolescents 38.255034
adult 39.617834
children 35.911602
senior 15.384615

Calculate the same proportion, but by sex and class.


In [12]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio


Out[12]:
survived
sex pclass
female 1 96.527778
2 88.679245
3 49.074074
male 1 34.078212
2 14.619883
3 15.212982

Calculate survival proportions by age category, class and sex.


In [13]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()


Out[13]:
survived
sex female male
age_cat pclass
adolescents 1 100.000000 20.000000
2 92.307692 11.764706
3 54.285714 12.500000
adult 1 96.551724 34.328358
2 86.842105 7.812500
3 44.186047 15.918367
children 1 91.666667 39.393939
2 94.117647 54.166667
3 51.578947 15.469613
senior 1 100.000000 14.285714
2 NaN 0.000000
3 NaN 0.000000
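The multi-key pattern behind tables like the one above, sketched on toy data: grouping by several columns yields a MultiIndex on the rows, and .unstack() pivots the innermost level into columns (since survived is 0/1, the group mean times 100 is the survival percentage).

```python
import pandas as pd

# Assumed toy passengers, one per (sex, pclass) combination
toy = pd.DataFrame({"sex": ["female", "female", "male", "male"],
                    "pclass": [1, 3, 1, 3],
                    "survived": [1, 0, 1, 0]})

# Two grouping keys produce a two-level row index; .unstack() moves
# the innermost level (pclass here) into the columns.
ratio = 100 * toy.groupby(["sex", "pclass"])[["survived"]].mean()
print(ratio.unstack())
```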

Calculate survival proportions by age category and sex.


In [14]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()


Out[14]:
survived
sex female male
age_cat
adolescents 73.015873 12.790698
adult 77.697842 18.737673
children 61.290323 22.689076
senior 100.000000 8.333333

So, women and children first?

*[Mrs. Charlotte Collyer and her daughter Marjorie](http://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic)*

Let's now look at price. In order to see the distribution of prices, the first thing we need to do is calculate the average fare per class, as well as the average age and the number of people in each class.


In [15]:
pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={"fare": , "age": , "name": },
               margins=True)


Out[15]:
age fare name
pclass
1 39.159918 87.508992 323
2 29.506705 21.179196 277
3 24.816367 13.302889 709
All 29.881135 33.295479 1309
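The pivot_table shape used here, sketched on toy data: the aggfunc dictionary assigns one aggregate per value column (means for the numeric ones, a row count for name), and margins=True appends the "All" row. The toy values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"pclass": [1, 1, 3, 3, 3],
                    "fare": [80.0, 100.0, 10.0, 12.0, 14.0],
                    "age": [40, 38, 20, 25, 30],
                    "name": ["a", "b", "c", "d", "e"]})

# One aggregate per value column: mean for age and fare, a row
# count for name; margins=True appends the "All" totals row.
pt = pd.pivot_table(toy, index=["pclass"],
                    values=["fare", "age", "name"],
                    aggfunc={"fare": "mean", "age": "mean", "name": len},
                    margins=True)
print(pt)
```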

And we can even split the latter per cabin type, adding an aggregate counting the number of survivors.


In [16]:
pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={"fare": , "age": , "name": , "survived": },
               margins=)


Out[16]:
age fare name survived
pclass cabin_type
1 A 44.157895 41.244314 22 11
B 36.476190 122.383078 65 47
C 38.382752 107.926598 94 57
D 41.040541 58.919065 40 28
E 39.593750 63.464706 34 24
T 45.000000 35.500000 1 0
2 D 29.800000 13.595833 6 4
E 38.833333 11.587500 4 3
F 19.076923 23.423077 13 10
3 E 21.666667 11.000000 3 3
F 27.200000 9.395838 8 3
G 12.000000 14.205000 5 3
All 29.881135 33.295479 1309 500

Well, it starts to look like the fare maybe, and only maybe, had something to do with the probability of survival, doesn't it?

Visualizations

Before getting deeper into that relationship, let's explore some basic plots. For example, let's plot the distribution of survival by name title.


In [17]:
ax = titanic.boxplot(column=, by=, grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("")  # Do not change this statement


Out[17]:
<matplotlib.text.Text at 0x7fb758706cf8>

And also, if we define a new column alone that is True if the passenger was travelling without family and False otherwise, let's plot that against age in a box plot.


In [18]:
titanic["alone"] = 
ax = titanic.boxplot(column="", by="", grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("")  # Do not change this statement


Out[18]:
<matplotlib.text.Text at 0x7fb7587d34e0>

On the other hand, to better see the relationship between the proportion of survivors and the fare, we are going to create a table with two row indices, sex and age category, and as values the average age, the average fare, and the proportion of passengers that survived. Define and use the function ratio as an aggregate to calculate the proportion of people that survived.


In [19]:
def ratio(values):
    pass

In [20]:
pt_titanic = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
                    "fare": ,
                    "age": ,
                    "survived": 
                },
               margins=)
pt_titanic


Out[20]:
age fare survived
sex age_cat
female adolescents 17.428571 34.826592 73.015873
adult 34.953237 57.661422 77.697842
children 5.208335 26.012198 61.290323
senior 76.000000 78.850000 100.000000
male adolescents 17.970930 21.331057 12.790698
adult 34.433925 28.636848 18.737673
children 5.416666 21.671076 22.689076
senior 69.541667 44.978483 8.333333
All 29.881135 33.295479 38.197097
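One way to sketch the ratio aggregate and plug it into a pivot table: on a 0/1 column, the proportion of ones times 100 gives the percentage. The toy data is assumed for illustration.

```python
import pandas as pd

def ratio(values):
    # Proportion of 1s in a 0/1 column, as a percentage.
    return 100 * sum(values) / len(values)

toy = pd.DataFrame({"sex": ["female", "female", "male", "male"],
                    "age_cat": ["adult", "adult", "adult", "adult"],
                    "survived": [1, 1, 1, 0]})

# A custom function works as an aggfunc just like a builtin one;
# margins=True also applies it over the whole column.
pt = pd.pivot_table(toy, index=["sex", "age_cat"], values=["survived"],
                    aggfunc={"survived": ratio}, margins=True)
print(pt)
```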

Now let's create two plots, one for female and one for male passengers, comparing the series of average fare and survival proportion, sorted by fare.


In [21]:
female = pt_titanic.query("sex == 'female'").sort()[[]]
male = pt_titanic.query().sort()[[]]

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot
ax2 = fig.add_subplot

female.plot(ax=)
ax1.set_ylabel()

male.plot(ax=)
ax2.set_xlabel()
ax2.set_ylabel()

ax2.set_xticklabels(["children", "", "adolescent", "", "adult", "", "senior"])


Out[21]:
[<matplotlib.text.Text at 0x7fb75852add8>,
 <matplotlib.text.Text at 0x7fb75852e3c8>,
 <matplotlib.text.Text at 0x7fb7584fcdd8>,
 <matplotlib.text.Text at 0x7fb75850ba58>,
 <matplotlib.text.Text at 0x7fb75850e4e0>,
 <matplotlib.text.Text at 0x7fb75850ef28>,
 <matplotlib.text.Text at 0x7fb7585119b0>]

At least for women, it looks like the richer the woman, the greater her chances of survival. For men, on the other hand, it does not look like that; if anything, it would be the opposite.

But let's take a closer look at the distribution of ages (ignoring the people with unknown age) by creating a histogram of the age distribution.


In [22]:
ax = titanic[].dropna()...(bins=30, alpha=0.8, grid=False, figsize=(12, 4))
ax.set_title()


Out[22]:
<matplotlib.text.Text at 0x7fb7584adb70>

To further support this idea, let's create another pivot table, but instead of using the age categories, let's calculate the corresponding values of fare and proportion of survival for each individual age.


In [23]:
pt_titanic2 = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
               },
               margins=True)
pt_titanic2[[]]


Out[23]:
fare survived
sex age
female 0.1667 20.575000 100.000000
0.75 19.258300 100.000000
0.9167 27.750000 100.000000
1.0 19.467500 80.000000
2.0 39.955357 28.571429
3.0 25.476400 33.333333
4.0 22.828340 100.000000
5.0 22.717700 100.000000
6.0 32.137500 50.000000
7.0 26.250000 100.000000
8.0 24.441667 66.666667
9.0 24.808320 20.000000
... ... ... ...
male 62.0 18.321875 25.000000
63.0 26.000000 0.000000
64.0 121.416667 0.000000
65.0 32.093067 0.000000
66.0 10.500000 0.000000
67.0 221.779200 0.000000
70.0 40.750000 0.000000
70.5 7.750000 0.000000
71.0 42.079200 0.000000
74.0 7.775000 0.000000
80.0 30.000000 100.000000
All 33.295479 38.197097

167 rows × 2 columns

And now plot that information in a scatter plot, with the average fare in pound sterling on the X-axis and the average ratio of survival on the Y-axis.


In [24]:
female2 = pt_titanic2.query
male2 = pt_titanic2.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title()
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])


Out[24]:
<matplotlib.legend.Legend at 0x7fb758420ba8>

The pattern, if any, is hard to see in this visualization. However, we can use pandas.cut() to create bigger age groups and try again.


In [25]:
# This cell is complete
labels = ["{}-{}".format(i, i+9) for i in range(0, 71, 10)]
titanic["age_group"] = pd.cut(titanic["age"].dropna(), range(0, 81, 10), right=False, labels=labels)
titanic.head()


Out[25]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat home.dest cabin_type name_title age_cat alone age_group
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 Southampton 2 St Louis, MO B miss adult True 20-29
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 Southampton 11 Montreal, PQ / Chesterville, ON C master children False 0-9
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C miss children False 0-9
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C mr adult False 30-39
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 Southampton NaN Montreal, PQ / Chesterville, ON C mrs adult False 20-29

In [26]:
pt_titanic3 = pd.pivot_table(titanic,
               index=[],
               values=[],
               aggfunc={
                },
               margins=True)
female3 = pt_titanic3.query
male3 = pt_titanic3.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title("")
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])


Out[26]:
<matplotlib.legend.Legend at 0x7fb75839eda0>

Effectively, once ages are grouped into 10-year bins, the more expensive the fare, the more likely you were to survive. But only if you were a woman, and the trend is actually stronger for older women than for children. For males it doesn't seem to matter; if anything, the trend is the poorer, the better your chances of survival.

And now for a video.


In [27]:
YouTubeVideo("FHG2oizTlpY")


Out[27]:

Statistics

By applying simple ordinary least squares (OLS) we can see whether these trends are actually significant.


In [28]:
import statsmodels.api as sm

First, we calculate the ordinary least squares fit for both male and female passengers, using the fare as our independent variable (X).


In [29]:
female_model3 = sm.OLS(, )
female_results3 = female_model3.fit()
print(female_results3.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.544
Method:                 Least Squares   F-statistic:                     9.335
Date:                Tue, 10 Feb 2015   Prob (F-statistic):             0.0224
Time:                        16:45:22   Log-Likelihood:                -26.637
No. Observations:                   8   AIC:                             57.27
Df Residuals:                       6   BIC:                             57.43
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         57.7050      7.754      7.442      0.000        38.731    76.679
fare           0.3644      0.119      3.055      0.022         0.073     0.656
==============================================================================
Omnibus:                        1.994   Durbin-Watson:                   2.682
Prob(Omnibus):                  0.369   Jarque-Bera (JB):                0.165
Skew:                           0.325   Prob(JB):                        0.921
Kurtosis:                       3.269   Cond. No.                         183.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/versae/.venvs/dh2304/lib/python3.4/site-packages/scipy/stats/stats.py:1205: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  int(n))

In [30]:
male_model3 = sm.OLS(, )
male_results3 = male_model3.fit()
print(male_results3.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.034
Model:                            OLS   Adj. R-squared:                 -0.126
Method:                 Least Squares   F-statistic:                    0.2142
Date:                Tue, 10 Feb 2015   Prob (F-statistic):              0.660
Time:                        16:45:23   Log-Likelihood:                -33.415
No. Observations:                   8   AIC:                             70.83
Df Residuals:                       6   BIC:                             70.99
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         27.6788     19.551      1.416      0.207       -20.160    75.517
fare          -0.2435      0.526     -0.463      0.660        -1.531     1.044
==============================================================================
Omnibus:                       11.693   Durbin-Watson:                   1.310
Prob(Omnibus):                  0.003   Jarque-Bera (JB):                3.872
Skew:                           1.484   Prob(JB):                        0.144
Kurtosis:                       4.676   Cond. No.                         113.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
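What sm.OLS fits here can be checked by hand: it solves the same least-squares problem as numpy on a [1, x] design matrix (sm.add_constant is what prepends the column of ones). A sketch on synthetic data with a known trend, not the Titanic numbers:

```python
import numpy as np

# Synthetic data with a known linear trend (illustrative, not the
# Titanic values): y = 10 + 0.5 * x plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 8)
y = 10 + 0.5 * x + rng.normal(0, 1, size=8)

# sm.OLS(y, sm.add_constant(x)) solves the same normal equations as
# np.linalg.lstsq on a design matrix with a column of ones.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [10, 0.5]
```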

Now, we plot everything together.


In [31]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(female3["fare"], female_results3.fittedvalues, 'r--.', label='$R^2$: {:.2}'.format(female_results3.rsquared))
ax.plot(, , 'bo', label="Male")
ax.plot(male3["fare"], male_results3.fittedvalues, 'g--.', label='$R^2$: {:.2}'.format(male_results3.rsquared))
ax.legend(loc='best')
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.set_title("")


Out[31]:
<matplotlib.text.Text at 0x7fb7501ae198>

And repeat the whole process for age group and proportion of survived.


In [32]:
pt_titanic4 = pd.pivot_table(titanic,
               index=,
               values=,
               aggfunc=,
               margins=True)
female4 = pt_titanic4.query
male4 = pt_titanic4.query

female_model4 = sm.OLS(, )
female_results4 = female_model4.
print(female_results4.)

male_model4 = sm.OLS(, )
male_results4 = male_model4.
print(male_results4.)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(, , 'r--.', label='$R^2$: {:.2}'.format(female_results4.))
ax.plot(, , 'bo', label="Male")
ax.plot(, , 'g--.', label='$R^2$: {:.2}'.format(male_results4.))
ax.legend(loc='best')
ax.set_xlabel()
ax.set_ylabel()
ax.set_title()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.801
Model:                            OLS   Adj. R-squared:                  0.767
Method:                 Least Squares   F-statistic:                     24.09
Date:                Tue, 10 Feb 2015   Prob (F-statistic):            0.00269
Time:                        16:45:23   Log-Likelihood:                -23.941
No. Observations:                   8   AIC:                             51.88
Df Residuals:                       6   BIC:                             52.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         63.0402      3.950     15.959      0.000        53.375    72.706
age            0.4267      0.087      4.908      0.003         0.214     0.639
==============================================================================
Omnibus:                        0.953   Durbin-Watson:                   3.483
Prob(Omnibus):                  0.621   Jarque-Bera (JB):                0.488
Skew:                          -0.542   Prob(JB):                        0.783
Kurtosis:                       2.461   Cond. No.                         91.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.569
Model:                            OLS   Adj. R-squared:                  0.498
Method:                 Least Squares   F-statistic:                     7.937
Date:                Tue, 10 Feb 2015   Prob (F-statistic):             0.0305
Time:                        16:45:23   Log-Likelihood:                -30.185
No. Observations:                   8   AIC:                             64.37
Df Residuals:                       6   BIC:                             64.53
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         40.7309      8.789      4.635      0.004        19.226    62.236
age           -0.5572      0.198     -2.817      0.030        -1.041    -0.073
==============================================================================
Omnibus:                        1.502   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.472   Jarque-Bera (JB):                0.019
Skew:                          -0.073   Prob(JB):                        0.991
Kurtosis:                       3.186   Cond. No.                         90.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[32]:
<matplotlib.text.Text at 0x7fb7501515f8>

Conclusions

Who survived in the end? What about the correlation between fare and ratio of survival for women and men? Is OLS the best approach for men?

What about class? And cabins?

Did any age category have better chances at survival? Which one(s)?

Why? Did children survive more than older people?

*My Heart Will Go On*