KNN

Describe the data.

  • How big?

In [1]:
import pandas
import numpy

MY_TITANIC_TRAIN = 'train_titanic.csv'
MY_TITANIC_TEST = 'test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
print('length: {0}'.format(len(titanic_dataframe)))
titanic_dataframe.head(5)


length: 891
Out[1]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
  • What are the columns and what do they mean?

In [2]:
titanic_dataframe.columns


Out[2]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

VARIABLE DESCRIPTIONS:

  • survival: Survival (0 = No; 1 = Yes)
  • pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • ticket: Ticket Number
  • fare: Passenger Fare
  • cabin: Cabin
  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.

Age is in years; it is fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so parch=0 for them. Others travelled with very close friends or neighbors from their village; the definitions do not cover such relations.
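The Age convention in the notes above (fractional below one year, xx.5 when estimated) can be turned into flags. The following is only a sketch on made-up ages, not part of the original analysis; note that an age of 0.5 is ambiguous under the stated rules, and this sketch treats it as an infant rather than an estimate.

```python
import pandas as pd

# Hypothetical ages, including an infant, two estimates and a missing value.
ages = pd.Series([22.0, 28.5, 0.75, 0.5, None, 40.5])

# Estimated ages: at least one year old with a fractional part of exactly .5.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

# Infants: ages recorded as a fraction of a year.
is_infant = ages < 1

print(is_estimated.tolist())
print(is_infant.tolist())
```

Comparisons against a missing age evaluate to False, so passengers without a recorded age fall into neither group.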

What's the average age of...

  • any Titanic passenger

In [3]:
titanic_dataframe.Age.mean()


Out[3]:
29.69911764705882
  • a survivor

In [4]:
survivors = titanic_dataframe[(titanic_dataframe.Survived==1)]
survivors.Age.mean()


Out[4]:
28.343689655172415
  • a non-surviving first-class passenger

In [5]:
dead_rich = titanic_dataframe[(titanic_dataframe.Survived==0)&(titanic_dataframe.Pclass==1)]
dead_rich.Age.mean()


Out[5]:
43.6953125
  • Male survivors older than 30 from anywhere but Queenstown

In [6]:
is_survivor = titanic_dataframe['Survived'] == 1
is_male = titanic_dataframe['Sex'] == 'male'
not_Queenstown = titanic_dataframe['Embarked'] != 'Q'
over_30 = titanic_dataframe['Age'] > 30
moniq = titanic_dataframe[is_survivor & is_male & not_Queenstown & over_30]
moniq.Age.mean()


Out[6]:
41.48780487804878
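The filters above are boolean Series combined with `&`. A minimal sketch on a made-up frame (the rows below are hypothetical, not Titanic data) shows two behaviours worth keeping in mind: a missing port passes the `!= 'Q'` test, and `mean()` skips missing ages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the training data.
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1, 1],
    'Sex': ['male', 'male', 'male', 'female', 'male'],
    'Embarked': ['S', 'Q', 'S', 'C', np.nan],
    'Age': [40.0, 50.0, 60.0, 35.0, 32.0],
})

mask = (
    (toy['Survived'] == 1)
    & (toy['Sex'] == 'male')
    & (toy['Embarked'] != 'Q')   # NaN != 'Q' is True, so unknown ports pass
    & (toy['Age'] > 30)
)
selected = toy[mask]
mean_age = selected['Age'].mean()
print(selected.index.tolist(), mean_age)
```

If passengers with an unknown port should be excluded, adding `& toy['Embarked'].notnull()` to the mask would do it.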

For the groups you chose, how far (in years) are the average ages from the median ages?

  • any Titanic passenger

In [7]:
titanic_dataframe.Age.mean() - titanic_dataframe.Age.median()


Out[7]:
1.69911764705882
  • a survivor

In [8]:
survivors.Age.mean() - survivors.Age.median()


Out[8]:
0.34368965517241534
  • a non-surviving first-class passenger

In [9]:
dead_rich.Age.mean() - dead_rich.Age.median()


Out[9]:
-1.5546875
  • Male survivors older than 30 from anywhere but Queenstown

In [10]:
moniq.Age.mean() - moniq.Age.median()


Out[10]:
3.4878048780487774
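The sign of mean minus median hints at the direction of skew: a positive gap suggests a few high values pull the mean up. A small sketch with a hypothetical helper, `mean_median_gap`, on made-up ages:

```python
import pandas as pd

def mean_median_gap(ages):
    """Return mean minus median; pandas skips missing values in both."""
    return ages.mean() - ages.median()

# Hypothetical ages: three young passengers, one much older, one unknown.
ages = pd.Series([20.0, 22.0, 24.0, 60.0, None])
gap = mean_median_gap(ages)
print(gap)
```

Here the single older passenger raises the mean well above the median, giving a positive gap.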

What's the most common...

  • passenger class

In [11]:
titanic_dataframe['Pclass'].mode().item()


Out[11]:
3
  • port of Embarkation

In [12]:
titanic_dataframe['Embarked'].mode().item()


Out[12]:
'S'
  • number of siblings or spouses aboard for survivors

In [13]:
survivors['SibSp'].mode().item()


Out[13]:
0
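`Series.mode()` returns a Series because several values can tie for most common, and `.item()` raises a `ValueError` when more than one value comes back. A hedged sketch on made-up class labels shows one way to handle a tie explicitly:

```python
import pandas as pd

# Hypothetical class labels where 1 and 3 are equally common.
classes = pd.Series([3, 3, 1, 1, 2])

modes = classes.mode()            # sorted Series of all tied modes
first_mode = modes.iloc[0]        # deterministic pick: the smallest tied value
counts = classes.value_counts()   # full frequency table, for inspection

print(modes.tolist(), first_mode)
```

Taking `modes.iloc[0]` is an arbitrary tie-break; whether that is acceptable depends on the question being asked.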

Within what range of standard deviations from the mean (0-1, 1-2, 2-3) is the median ticket price? Is it above or below the mean?


In [14]:
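ticket_sd = titanic_dataframe['Fare'].std()
median_ticket_price = titanic_dataframe['Fare'].median()
mean_ticket_price = titanic_dataframe['Fare'].mean()
# Distance of the median from the mean, in standard deviations.
# abs() discards the sign: the raw difference is negative when the median is below the mean.
abs((median_ticket_price - mean_ticket_price) / ticket_sd)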


Out[14]:
0.3571902456652297
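The cell above takes an absolute value, which answers the "which range" part but not the "above or below" part. A minimal sketch on made-up fares keeps the sign; the names `z`, `position` and `band` are illustrative only.

```python
import pandas as pd

# Hypothetical right-skewed fares: one expensive ticket among cheap ones.
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 70.0])

# Signed distance of the median from the mean, in sample standard deviations.
z = (fares.median() - fares.mean()) / fares.std()

position = 'below' if z < 0 else 'above'
band = int(abs(z))   # 0 means within 0-1 SD, 1 means within 1-2 SD, and so on

print(z, position, band)
```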

How much more expensive was the 90th percentile ticket than the 5th percentile ticket?


In [15]:
import numpy as np
percentile_expense = np.percentile(titanic_dataframe['Fare'], [90, 5])
percentile_expense[0] - percentile_expense[1]


Out[15]:
70.7333
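`np.percentile` propagates missing values, so it only works directly on a column without NaN. As a sketch on made-up fares, `Series.quantile` gives the same kind of spread while skipping missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical fares 0, 10, ..., 100 plus one missing value.
fares = pd.Series(list(np.arange(0.0, 110.0, 10.0)) + [np.nan])

raw = np.percentile(fares, [90, 5])      # NaN in the input makes both results NaN
p90, p5 = fares.quantile([0.90, 0.05])   # Series.quantile skips NaN
spread = p90 - p5

print(raw, spread)
```

`np.nanpercentile` is another option when staying in NumPy.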

Are they the same class?


In [16]:
ticket_price_90 = titanic_dataframe[(titanic_dataframe['Fare'] >= percentile_expense[0]) & (titanic_dataframe['Pclass'] != 1)]

In [17]:
print(len(ticket_price_90))


0

In [18]:
ticket_price_5 = titanic_dataframe[(titanic_dataframe['Fare'] <= percentile_expense[1]) & (titanic_dataframe['Pclass'] < 3)]

In [19]:
print(len(ticket_price_5))


12
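Instead of counting the exceptions, the class make-up of each tail can be tabulated directly. This is a sketch on made-up rows, assuming the same cut-offs (5th and 90th percentile of Fare):

```python
import pandas as pd

# Hypothetical fares and classes.
toy = pd.DataFrame({
    'Fare': [5.0, 5.0, 7.0, 8.0, 9.0, 50.0, 80.0, 90.0, 200.0, 200.0],
    'Pclass': [3, 2, 3, 3, 3, 2, 1, 1, 1, 1],
})

low_cut, high_cut = toy['Fare'].quantile([0.05, 0.90])

# Class counts among the cheapest and the most expensive tickets.
bottom_classes = toy.loc[toy['Fare'] <= low_cut, 'Pclass'].value_counts()
top_classes = toy.loc[toy['Fare'] >= high_cut, 'Pclass'].value_counts()

print(bottom_classes.to_dict(), top_classes.to_dict())
```

A tail that contains a single class shows up as a one-entry table; a mixed tail lists every class present with its count.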

The highest average ticket price was paid by passengers from which port? Null ports don't count


In [20]:
# Drop the null port: unique() includes NaN when some passengers have no recorded port.
ports = [port for port in titanic_dataframe.Embarked.unique() if pandas.notnull(port)]
by_port = [(p, titanic_dataframe[titanic_dataframe.Embarked==p]) for p in ports]
for p, people in by_port:
    print(p, people.Fare.mean())


S 27.07981180124218
C 59.95414404761905
Q 13.276029870129872

Which port has passengers from the most similar class?


In [21]:
for p, people in by_port:
    print(p, people.Pclass.std())


S 0.7894024365513868
C 0.944099832575668
Q 0.36927447293799803
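The two per-port loops above can also be written as one `groupby`, which drops null keys by default. A sketch on made-up rows, using named aggregation (available in pandas 0.25 and later); the column names `mean_fare` and `pclass_std` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one has no recorded port.
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'Q', np.nan],
    'Fare': [10.0, 20.0, 60.0, 80.0, 8.0, 8.0, 500.0],
    'Pclass': [3, 2, 1, 2, 3, 3, 1],
})

summary = toy.groupby('Embarked').agg(
    mean_fare=('Fare', 'mean'),
    pclass_std=('Pclass', 'std'),
)

priciest_port = summary['mean_fare'].idxmax()        # highest average fare
most_uniform_port = summary['pclass_std'].idxmin()   # smallest spread in class

print(summary)
print(priciest_port, most_uniform_port)
```

Using the standard deviation of Pclass as a measure of "similar class" follows the cell above; it treats class as a number, which is a simplification.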

What fraction of surviving 1st-class males paid lower than the overall median ticket price?


In [22]:
lower_than_median = titanic_dataframe['Fare'] < median_ticket_price
is_first_class = titanic_dataframe['Pclass'] == 1
cheapo_dude_survivor_fraction = titanic_dataframe[is_male & lower_than_median & is_first_class & is_survivor]
print(cheapo_dude_survivor_fraction.count())


PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

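The count is 0: no surviving first-class male paid less than the overall median fare. The cell above does not compute the denominator (the number of surviving first-class males), so the fraction is 0 as long as that group is non-empty.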
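A fraction needs both a numerator and a denominator. One way to get it, sketched here on made-up rows, is to select the group first and then take the mean of a boolean condition within it:

```python
import pandas as pd

# Hypothetical first-class passengers.
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 1],
    'Sex': ['male', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 1, 1, 1, 1],
    'Fare': [30.0, 5.0, 80.0, 4.0, 6.0],
})

median_fare = toy['Fare'].median()   # overall median, across all passengers

# Denominator group: surviving first-class males.
group = toy[(toy['Survived'] == 1) & (toy['Sex'] == 'male') & (toy['Pclass'] == 1)]

# Mean of a boolean Series is the share of True values; guard the empty case.
fraction = (group['Fare'] < median_fare).mean() if len(group) else float('nan')

print(len(group), fraction)
```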

How much older/younger was the average surviving passenger with family members than the average non-surviving passenger without them?


In [23]:
fam = titanic_dataframe[((titanic_dataframe.SibSp > 0) | (titanic_dataframe.Parch > 0)) & (titanic_dataframe.Survived == 1)]
no_fam = titanic_dataframe[((titanic_dataframe.SibSp == 0) & (titanic_dataframe.Parch == 0)) & (titanic_dataframe.Survived == 0)]
fam.Age.mean() - no_fam.Age.mean()


Out[23]:
-6.888171076642337
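The same comparison reads a little more clearly with named masks. A sketch on made-up rows, where `has_family` is an illustrative name for "SibSp or Parch greater than zero":

```python
import pandas as pd

# Hypothetical passengers; one age is missing.
toy = pd.DataFrame({
    'SibSp': [1, 0, 0, 0, 2],
    'Parch': [0, 2, 0, 0, 0],
    'Survived': [1, 1, 0, 0, 0],
    'Age': [20.0, 30.0, 40.0, None, 50.0],
})

has_family = (toy['SibSp'] + toy['Parch']) > 0
survived = toy['Survived'] == 1

with_family_survived = toy.loc[has_family & survived, 'Age'].mean()
alone_died = toy.loc[~has_family & ~survived, 'Age'].mean()   # missing age is skipped

gap = with_family_survived - alone_died
print(gap)
```

A negative gap means the surviving passengers with family were younger on average than the non-survivors travelling alone.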

Display the relationship (i.e. make a plot) between survival rate and the quantile of the ticket price for 20 integer quantiles. Make sure you label your axes.


In [24]:
import matplotlib.pyplot as pl
%matplotlib inline

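percentiles = np.arange(5, 105, 5.0)
fare_quantiles = np.percentile(titanic_dataframe.Fare, percentiles)
survival_quantiles = []
previous_quantile = 0
for f_q in fare_quantiles:
    # Passengers whose fare falls in this bin only: [previous_quantile, f_q).
    people = titanic_dataframe[(previous_quantile <= titanic_dataframe.Fare) & (titanic_dataframe.Fare < f_q)]
    # The mean of the 0/1 Survived column is the survival rate; an empty bin yields NaN.
    survival_quantiles.append(people.Survived.mean())
    previous_quantile = f_q  # advance the lower edge; otherwise every bin starts at 0
pl.plot(percentiles, survival_quantiles)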
pl.xlabel('Fare percentile')
pl.ylabel('Survival rate')


Out[24]:
<matplotlib.text.Text at 0x2351bc46ef0>
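An alternative to looping over percentile edges is `pandas.qcut`, which assigns each row to a quantile bin so that a `groupby` can compute the rate per bin. This is a sketch on made-up data, not a rerun of the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 distinct fares; the upper half survives.
toy = pd.DataFrame({'Fare': np.arange(100.0)})
toy['Survived'] = (toy['Fare'] >= 50).astype(int)

# labels=False returns integer bin codes 0..19; duplicates='drop' merges
# bins whose edges coincide, which can happen when many fares are equal.
toy['fare_bin'] = pd.qcut(toy['Fare'], q=20, labels=False, duplicates='drop')

rate_by_bin = toy.groupby('fare_bin')['Survived'].mean()
print(rate_by_bin)
```

With real fares, tied values may leave fewer than 20 bins after `duplicates='drop'`. The resulting Series can be passed to a plotting call with axis labels in the same way as the lists above.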