A crash course in

Surviving Titanic

(with numpy and matplotlib)

This notebook teaches you to use the basic data science stack for Python: Jupyter, pandas, NumPy, matplotlib, and scikit-learn.

Part I: Jupyter notebooks in a nutshell

  • You are reading this line in a jupyter notebook.
  • A notebook consists of cells. A cell can contain either code or hypertext.
    • This cell contains hypertext. The next cell contains code.
  • You can run a cell with code by selecting it (click) and pressing Ctrl + Enter to execute the code and display output (if any).
  • If you're running this on a device with no keyboard, use the top bar (esp. play/stop/restart buttons) to run code.
  • Behind the curtains, there's a Python interpreter that runs that code and remembers anything you defined.

Run these cells to get started


In [ ]:
a = 5

In [ ]:
print(a * 2)


10
  • Ctrl + S to save changes (or use the button that looks like a floppy disk)
  • Top menu → Kernel → Interrupt (or Stop button) if you want to stop a running cell midway.
  • Top menu → Kernel → Restart (or cyclic arrow button) if interrupt doesn't fix the problem (you will lose all variables).
  • For shortcut junkies like us: Top menu → Help → Keyboard Shortcuts

Now the most important feature of jupyter notebooks for this course:

  • if you're typing something, press Tab to see automatic suggestions, use arrow keys + enter to pick one.
  • if you move your cursor inside some function and press Shift + Tab, you'll get a help window. Shift + (Tab , Tab) (press Tab twice) will expand it.

In [ ]:
# run this first
import math

In [ ]:
# place your cursor at the end of the unfinished line below and press Tab to find a function
# that computes arctangent from two parameters (should have 2 in its name)
# once you have chosen it, press Shift + Tab twice to see the docs

math.a  # <---

Part II: Loading data with Pandas

Pandas is a library that helps you load the data, prepare it and perform some lightweight analysis. The god object here is the pandas.DataFrame - a 2D table with batteries included.

In the cells below we use it to read the data on the infamous titanic shipwreck.

Please keep running all the code cells as you read.
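Before loading the real file, it may help to see a DataFrame built by hand. This is only a minimal sketch: the table, its column names and its values are made up for illustration and have nothing to do with the Titanic data.

```python
import pandas as pd

# Toy table with made-up values (not the Titanic data).
people = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid"],
        "age": [31, 45, 27],
        "member": [True, False, True],
    },
    index=pd.Index([101, 102, 103], name="id"),
)

print(people)                # the whole table
print(people.shape)          # (number of rows, number of columns)
print(people.dtypes)         # every column has its own type
print(people["age"].mean())  # a column is a pandas.Series with its own methods
```

The same attributes and methods work on the table returned by pd.read_csv in the next cells.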


In [ ]:
# If you are running in Google Colab, this cell will download the dataset from our repository.
# Otherwise, this cell will do nothing.

import sys
if 'google.colab' in sys.modules:
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week01_intro/primer_python_for_ml/train.csv

In [ ]:
import pandas as pd
# this yields a pandas.DataFrame
data = pd.read_csv("train.csv", index_col='PassengerId')

In [ ]:
# Selecting rows
head = data[:10]

head  # if you leave an expression at the end of a cell, jupyter will "display" it automatically


Out[ ]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

About the data

Here are some of the columns:

  • Name - a string with person's full name
  • Survived - 1 if a person survived the shipwreck, 0 otherwise.
  • Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
  • Sex - the passenger's sex as recorded in the dataset (male or female)
  • Age - age in years, if available
  • SibSp - number of siblings and spouses aboard the ship
  • Parch - number of parents and children aboard the ship
  • Fare - ticket cost
  • Embarked - port where the passenger embarked
    • C = Cherbourg; Q = Queenstown; S = Southampton

In [ ]:
# table dimensions
print("len(data) =", len(data))
print("data.shape =", data.shape)


len(data) = 891
data.shape = (891, 11)

In [ ]:
# select a single row by PassengerId (using .loc)
print(data.loc[4])


Survived                                               1
Pclass                                                 1
Name        Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex                                               female
Age                                                   35
SibSp                                                  1
Parch                                                  0
Ticket                                            113803
Fare                                                53.1
Cabin                                               C123
Embarked                                               S
Name: 4, dtype: object

In [ ]:
# select a single row by index (using .iloc)
print(data.iloc[3])


Survived                                               1
Pclass                                                 1
Name        Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex                                               female
Age                                                   35
SibSp                                                  1
Parch                                                  0
Ticket                                            113803
Fare                                                53.1
Cabin                                               C123
Embarked                                               S
Name: 4, dtype: object
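The two cells above return the same passenger because .loc looks rows up by index label while .iloc counts positions from zero. A small sketch on a made-up table (the name labelled and its contents are assumptions for illustration) shows the difference, including how slices behave:

```python
import pandas as pd

# Toy table whose index labels start at 1, like PassengerId does.
labelled = pd.DataFrame(
    {"value": ["a", "b", "c", "d"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

by_label = labelled.loc[4]      # the row whose index label is 4
by_position = labelled.iloc[3]  # the fourth row (positions start at 0)
assert by_label.equals(by_position)

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
print(labelled.loc[1:3])   # labels 1, 2 and 3: three rows
print(labelled.iloc[1:3])  # positions 1 and 2: two rows
```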

In [ ]:
# select a single column.
ages = data["Age"]
print(ages[:10])  # alternatively: data.Age


PassengerId
1     22.0
2     38.0
3     26.0
4     35.0
5     35.0
6      NaN
7     54.0
8      2.0
9     27.0
10    14.0
Name: Age, dtype: float64

In [ ]:
# select several columns and rows at once
# alternatively: data[["Fare","Pclass"]].loc[5:10]
data.loc[5:10, ["Fare", "Pclass"]]


Out[ ]:
Fare Pclass
PassengerId
5 8.0500 3
6 8.4583 3
7 51.8625 1
8 21.0750 3
9 11.1333 3
10 30.0708 2

Your turn:


In [ ]:
# Select passengers number 13 and 666 (with these PassengerId values). Did they survive?

<YOUR CODE>

In [ ]:
# Compute the overall survival rate: what fraction of passengers survived the shipwreck?

<YOUR CODE>

Pandas also has some basic data analysis tools. For one, you can quickly display statistical aggregates for each column using .describe()


In [ ]:
data.describe()

Some columns contain NaN values, which means that there is no data there. For example, passenger #6 has unknown age. To simplify further analysis, we'll replace NaN values using the pandas fillna method.

Note: we do this so easily because it's a tutorial. In general, you should think twice before you modify data like this.


In [ ]:
data.loc[6]

In [ ]:
data['Age'] = data['Age'].fillna(value=data['Age'].mean())
data['Fare'] = data['Fare'].fillna(value=data['Fare'].mean())

In [ ]:
data.loc[6]
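To see what the fillna step does in isolation, here is a minimal sketch on a made-up Series (the name ages_toy and its values are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

ages_toy = pd.Series([20.0, np.nan, 40.0, np.nan])

print(ages_toy.isna().sum())   # number of missing values: 2
mean_age = ages_toy.mean()     # NaN entries are skipped by default: 30.0
filled = ages_toy.fillna(mean_age)
print(filled.tolist())         # [20.0, 30.0, 40.0, 30.0]

# fillna returns a new Series; the original still contains the NaN values.
print(ages_toy.isna().sum())   # still 2
```

That is why the cell above assigns the result back to data['Age'] instead of just calling fillna.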


Part III: Numpy and vectorized computing

Almost any machine learning model requires some computational heavy lifting, usually involving linear algebra. Unfortunately, pure Python is slow at this because every operation is interpreted at runtime.

So instead, we'll use numpy - a library that lets you run blazing fast computation with vectors, matrices and other tensors. Again, the god object here is numpy.ndarray:


In [ ]:
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print("a =", a)
print("b =", b)

# math and boolean operations can be applied to each element of an array
print("a + 1 =", a + 1)
print("a * 2 =", a * 2)
print("a == 2", a == 2)
# ... or corresponding elements of two (or more) arrays
print("a + b =", a + b)
print("a * b =", a * b)


a =  [1 2 3 4 5]
b =  [5 4 3 2 1]
a + 1 = [2 3 4 5 6]
a * 2 = [ 2  4  6  8 10]
a == 2 [False  True False False False]
a + b = [6 6 6 6 6]
a * b = [5 8 9 8 5]

In [ ]:
# Your turn: compute half-products of a and b elements (i.e. ½ of the products of corresponding elements)
<YOUR CODE>

In [ ]:
# compute elementwise quotient between squared a and (b plus 1)
<YOUR CODE>

How fast is it, Harry?

Let's compare computation time for Python and Numpy

  • Two arrays of $10^6$ elements

    • first one: from 0 to 999 999
    • second one: from 99 to 1 000 098
  • Computing:

    • elementwise sum
    • elementwise product
    • square root of first array
    • sum of all elements in the first array

In [ ]:
%%time
# ^-- this "magic" measures and prints cell computation time

# Option I: pure Python
arr_1 = range(1000000)
arr_2 = range(99, 1000099)


a_sum = []
a_prod = []
sqrt_a1 = []
for i in range(len(arr_1)):
    a_sum.append(arr_1[i]+arr_2[i])
    a_prod.append(arr_1[i]*arr_2[i])
    sqrt_a1.append(arr_1[i]**0.5)

arr_1_sum = sum(arr_1)

In [ ]:
%%time

# Option II: start from Python, convert to numpy
arr_1 = range(1000000)
arr_2 = range(99, 1000099)

arr_1, arr_2 = np.array(arr_1), np.array(arr_2)


a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()

In [ ]:
%%time

# Option III: pure numpy
arr_1 = np.arange(1000000)
arr_2 = np.arange(99, 1000099)

a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()

If you want more serious benchmarks, take a look at this.
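The %%time magic measures a single run, so the numbers jump around between executions. One possible way to get steadier measurements is the standard-library timeit module, which repeats the call several times. The sketch below is an illustration only: the function names and the array size are assumptions, and the timings you get depend on your machine.

```python
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints so the squares cannot overflow


def python_sum_of_squares():
    # interpreted loop: one Python-level multiplication per element
    return sum(x * x for x in py_list)


def numpy_sum_of_squares():
    # vectorized: the loop runs inside numpy's compiled code
    return int((np_arr * np_arr).sum())


# Both versions must agree before their speed is worth comparing.
assert python_sum_of_squares() == numpy_sum_of_squares()

# Run each function 10 times per measurement, take the best of 3 measurements.
t_py = min(timeit.repeat(python_sum_of_squares, number=10, repeat=3))
t_np = min(timeit.repeat(numpy_sum_of_squares, number=10, repeat=3))
print(f"pure Python: {t_py:.4f} s, numpy: {t_np:.4f} s (10 runs each)")
```

Taking the minimum of several repeats reduces the influence of whatever else the machine happened to be doing at the time.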


There's also a bunch of pre-implemented operations including logarithms, trigonometry, vector/matrix products and aggregations.


In [ ]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("numpy.sum(a) =", np.sum(a))
print("numpy.mean(a) =", np.mean(a))
print("numpy.min(a) =",  np.min(a))
print("numpy.argmin(b) =", np.argmin(b))  # index of minimal element
# dot product. Also used for matrix/tensor multiplication
print("numpy.dot(a,b) =", np.dot(a, b))
print(
    "numpy.unique(['male','male','female','female','male']) =",
    np.unique(['male', 'male', 'female', 'female', 'male']))


numpy.sum(a) = 15
numpy.mean(a) = 3.0
numpy.min(a) = 1
numpy.argmin(b) = 4
numpy.dot(a,b) = 35
numpy.unique(['male','male','female','female','male']) = ['female' 'male']

There is a lot more stuff. Check out a Numpy cheat sheet here.
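Two things from that list come up constantly: aggregating along one axis of a 2D array, and matrix-vector products. A short sketch with made-up arrays (m and v are names chosen for this example):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
v = np.array([1, 0, -1])     # shape (3,)

print(m.sum())          # all elements: 21
print(m.sum(axis=0))    # sum down each column: [5 7 9]
print(m.mean(axis=1))   # mean across each row: [2. 5.]
print(np.dot(m, v))     # matrix-vector product: [-2 -2]
print(m @ v)            # the @ operator computes the same product
```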

The important part: all this functionality works with dataframes:


In [ ]:
print("Max ticket price: ", np.max(data["Fare"]))
print("\nThe guy who paid the most:\n", data.iloc[np.argmax(data["Fare"])])


Max ticket price:  512.3292

The guy who paid the most:
 Survived                   1
Pclass                     1
Name        Ward, Miss. Anna
Sex                   female
Age                       35
SibSp                      0
Parch                      0
Ticket              PC 17755
Fare                 512.329
Cabin                    NaN
Embarked                   C
Name: 259, dtype: object

In [ ]:
# your code: compute mean passenger age and the oldest guy on the ship
<YOUR CODE>

In [ ]:
print("Boolean operations")

print('a =', a)
print('b =', b)
print("a > 2", a > 2)
print("numpy.logical_not(a>2) =", np.logical_not(a > 2))
print("numpy.logical_and(a>2,b>2) =", np.logical_and(a > 2, b > 2))
print("numpy.logical_or(a>2,b<3) =", np.logical_or(a > 2, b < 3))

print()

print("shortcuts")
print("~(a > 2) =", ~(a > 2))  # logical_not(a > 2)
print("(a > 2) & (b > 2) =", (a > 2) & (b > 2))  # logical_and
print("(a > 2) | (b < 3) =", (a > 2) | (b < 3))  # logical_or


Boolean operations
a =  [1 2 3 4 5]
b =  [5 4 3 2 1]
a > 2 [False False  True  True  True]
numpy.logical_not(a>2) =  [ True  True False False False]
numpy.logical_and(a>2,b>2) =  [False False  True False False]
numpy.logical_or(a>2,b<3) =  [False False  True  True  True]

 shortcuts
~(a > 2) =  [ True  True False False False]
(a > 2) & (b > 2) =  [False False  True False False]
(a > 2) | (b < 3) =  [False False  True  True  True]
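Two pitfalls are worth knowing when you use the shortcut operators: the parentheses are not optional, and Python's own and / or do not work elementwise. A minimal sketch (the array x is made up for this example):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Parentheses matter: & binds tighter than comparisons,
# so x > 2 & x < 5 would be parsed as x > (2 & x) < 5.
mask = (x > 2) & (x < 5)
print(mask)  # [False False  True  True False]

# `and` asks for a single truth value, which is ambiguous for a whole array.
and_failed = False
try:
    (x > 2) and (x < 5)
except ValueError as err:
    and_failed = True
    print("ValueError:", err)
```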

The final Numpy feature we'll need is indexing: selecting elements from an array. Aside from Python indexes and slices (e.g. a[1:4]), Numpy also allows you to select several elements at once.


In [ ]:
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1, 2, 5])
print("a =", a)
print("Select by element index")
print("a[[1,2,5]] =", a[ix])

print("\nSelect by boolean mask")
# select all elements in a that are greater than 5
print("a[a > 5] =", a[a > 5])
print("(a % 2 == 0) =", a % 2 == 0)  # True for even, False for odd
print("a[a % 2 == 0] =", a[a % 2 == 0])  # select all elements in a that are even


# select male children
print("data[(data['Age'] < 18) & (data['Sex'] == 'male')] = (below)")
data[(data['Age'] < 18) & (data['Sex'] == 'male')]


a =  [ 0  1  4  9 16 25]
Select by element index
a[[1,2,5]] =  [ 1  4 25]

Select by boolean mask
a[a > 5] =  [ 9 16 25]
(a % 2 == 0) = [ True False  True False  True False]
a[a % 2 == 0] = [ 0  4 16]
data[(data['Age'] < 18) & (data['Sex'] == 'male')] = (below)
Out[ ]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
8 0 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 NaN S
17 0 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 NaN Q
51 0 3 Panula, Master. Juha Niilo male 7.00 4 1 3101295 39.6875 NaN S
60 0 3 Goodwin, Master. William Frederick male 11.00 5 2 CA 2144 46.9000 NaN S
64 0 3 Skoog, Master. Harald male 4.00 3 2 347088 27.9000 NaN S
79 1 2 Caldwell, Master. Alden Gates male 0.83 0 2 248738 29.0000 NaN S
87 0 3 Ford, Mr. William Neal male 16.00 1 3 W./C. 6608 34.3750 NaN S
126 1 3 Nicola-Yarred, Master. Elias male 12.00 1 0 2651 11.2417 NaN C
139 0 3 Osen, Mr. Olaf Elon male 16.00 0 0 7534 9.2167 NaN S
164 0 3 Calic, Mr. Jovo male 17.00 0 0 315093 8.6625 NaN S
165 0 3 Panula, Master. Eino Viljami male 1.00 4 1 3101295 39.6875 NaN S
166 1 3 Goldsmith, Master. Frank John William "Frankie" male 9.00 0 2 363291 20.5250 NaN S
172 0 3 Rice, Master. Arthur male 4.00 4 1 382652 29.1250 NaN Q
183 0 3 Asplund, Master. Clarence Gustaf Hugo male 9.00 4 2 347077 31.3875 NaN S
184 1 2 Becker, Master. Richard F male 1.00 2 1 230136 39.0000 F4 S
194 1 2 Navratil, Master. Michel M male 3.00 1 1 230080 26.0000 F2 S
221 1 3 Sunderland, Mr. Victor Francis male 16.00 0 0 SOTON/OQ 392089 8.0500 NaN S
262 1 3 Asplund, Master. Edvin Rojj Felix male 3.00 4 2 347077 31.3875 NaN S
267 0 3 Panula, Mr. Ernesti Arvid male 16.00 4 1 3101295 39.6875 NaN S
279 0 3 Rice, Master. Eric male 7.00 4 1 382652 29.1250 NaN Q
283 0 3 de Pelsmaeker, Mr. Alfons male 16.00 0 0 345778 9.5000 NaN S
306 1 1 Allison, Master. Hudson Trevor male 0.92 1 2 113781 151.5500 C22 C26 S
334 0 3 Vander Planke, Mr. Leo Edmondus male 16.00 2 0 345764 18.0000 NaN S
341 1 2 Navratil, Master. Edmond Roger male 2.00 1 1 230080 26.0000 F2 S
349 1 3 Coutts, Master. William Loch "William" male 3.00 1 1 C.A. 37671 15.9000 NaN S
353 0 3 Elias, Mr. Tannous male 15.00 1 1 2695 7.2292 NaN C
387 0 3 Goodwin, Master. Sidney Leonard male 1.00 5 2 CA 2144 46.9000 NaN S
408 1 2 Richards, Master. William Rowe male 3.00 1 1 29106 18.7500 NaN S
434 0 3 Kallio, Mr. Nikolai Erland male 17.00 0 0 STON/O 2. 3101274 7.1250 NaN S
446 1 1 Dodge, Master. Washington male 4.00 0 2 33638 81.8583 A34 S
481 0 3 Goodwin, Master. Harold Victor male 9.00 5 2 CA 2144 46.9000 NaN S
490 1 3 Coutts, Master. Eden Leslie "Neville" male 9.00 1 1 C.A. 37671 15.9000 NaN S
501 0 3 Calic, Mr. Petar male 17.00 0 0 315086 8.6625 NaN S
533 0 3 Elias, Mr. Joseph Jr male 17.00 1 1 2690 7.2292 NaN C
550 1 2 Davies, Master. John Morgan Jr male 8.00 1 1 C.A. 33112 36.7500 NaN S
551 1 1 Thayer, Mr. John Borland Jr male 17.00 0 2 17421 110.8833 C70 C
575 0 3 Rush, Mr. Alfred George John male 16.00 0 0 A/4. 20589 8.0500 NaN S
684 0 3 Goodwin, Mr. Charles Edward male 14.00 5 2 CA 2144 46.9000 NaN S
687 0 3 Panula, Mr. Jaako Arnold male 14.00 4 1 3101295 39.6875 NaN S
722 0 3 Jensen, Mr. Svend Lauritz male 17.00 1 0 350048 7.0542 NaN S
732 0 3 Hassan, Mr. Houssein G N male 11.00 0 0 2699 18.7875 NaN C
747 0 3 Abbott, Mr. Rossmore Edward male 16.00 1 1 C.A. 2673 20.2500 NaN S
752 1 3 Moor, Master. Meier male 6.00 0 1 392096 12.4750 E121 S
756 1 2 Hamalainen, Master. Viljo male 0.67 1 1 250649 14.5000 NaN S
765 0 3 Eklund, Mr. Hans Linus male 16.00 0 0 347074 7.7750 NaN S
788 0 3 Rice, Master. George Hugh male 8.00 4 1 382652 29.1250 NaN Q
789 1 3 Dean, Master. Bertram Vere male 1.00 1 2 C.A. 2315 20.5750 NaN S
792 0 2 Gaskell, Mr. Alfred male 16.00 0 0 239865 26.0000 NaN S
803 1 1 Carter, Master. William Thornton II male 11.00 1 2 113760 120.0000 B96 B98 S
804 1 3 Thomas, Master. Assad Alexander male 0.42 0 1 2625 8.5167 NaN C
820 0 3 Skoog, Master. Karl Thorsten male 10.00 3 2 347088 27.9000 NaN S
825 0 3 Panula, Master. Urho Abraham male 2.00 4 1 3101295 39.6875 NaN S
828 1 2 Mallet, Master. Andre male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
832 1 2 Richards, Master. George Sibley male 0.83 1 1 29106 18.7500 NaN S
842 0 2 Mudd, Mr. Thomas Charles male 16.00 0 0 S.O./P.P. 3 10.5000 NaN S
845 0 3 Culumovic, Mr. Jeso male 17.00 0 0 315090 8.6625 NaN S
851 0 3 Andersson, Master. Sigvard Harald Elias male 4.00 4 2 347082 31.2750 NaN S
870 1 3 Johnson, Master. Harold Theodor male 4.00 1 1 347742 11.1333 NaN S
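A boolean mask on a DataFrame is itself a pandas Series, so it can be stored, negated, combined and even averaged. The sketch below uses a made-up table (pets, species and weight_kg are assumptions for illustration) to show the usual patterns:

```python
import pandas as pd

pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "cat"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 3.0],
})

is_cat = pets["species"] == "cat"     # boolean Series, one entry per row
print(pets[is_cat])                   # only the rows where the mask is True

# .loc takes a row mask and a column name at once
mean_cat_weight = pets.loc[is_cat, "weight_kg"].mean()
print(mean_cat_weight)                # 4.0

# True counts as 1 and False as 0, so the mean of a mask is a fraction
cat_fraction = is_cat.mean()
print(cat_fraction)                   # 0.6

# masks combine with ~, & and |
heavy_dogs = pets[~is_cat & (pets["weight_kg"] > 25)]
print(heavy_dogs)
```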

Your turn

Use numpy and pandas to answer a few questions about the data.


In [ ]:
# who on average paid more for their ticket, men or women?

mean_fare_men = <YOUR CODE>
mean_fare_women = <YOUR CODE>

print(mean_fare_men, mean_fare_women)

In [ ]:
# who is more likely to survive: a child (<18 yo) or an adult?

child_survival_rate = <YOUR CODE>
adult_survival_rate = <YOUR CODE>

print(child_survival_rate, adult_survival_rate)

Part IV: plots and matplotlib

Using Python to visualize the data is covered by yet another library: matplotlib.

Just like Python itself, matplotlib has an awesome tendency of keeping simple things simple while still allowing you to write complicated stuff with convenience (e.g. super-detailed plots or custom animations).


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- this "magic" tells all future matplotlib plots to be drawn inside notebook and not in a separate window.

# line plot
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])


Out[ ]:
[<matplotlib.lines.Line2D at 0x7f2fec9370f0>]

In [ ]:
# scatter-plot
plt.scatter([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

plt.show()  # show the first plot and begin drawing next one



In [ ]:
# draw a scatter plot with custom markers and colors
plt.scatter([1, 1, 2, 3, 4, 4.5], [3, 2, 2, 5, 15, 24],
            c=["red", "blue", "orange", "green", "cyan", "gray"], marker="x")

# without .show(), several plots will be drawn on top of one another
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], c="black")

# adding more sugar
plt.title("Conspiracy theory proven!!!")
plt.xlabel("Per capita alcohol consumption")
plt.ylabel("# Layers in state of the art image classifier")

# fun with correlations: http://bit.ly/1FcNnWF


Out[ ]:
<matplotlib.text.Text at 0x7f2fe8fcaf28>

In [ ]:
# histogram - showing data density
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10])
plt.show()

plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10], bins=5)


Out[ ]:
(array([ 4.,  7.,  5.,  3.,  3.]),
 array([  0.,   2.,   4.,   6.,   8.,  10.]),
 <a list of 5 Patch objects>)
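The output above shows what plt.hist returns: the count in each bin, the bin edges, and the drawn patches. If you only need the numbers, numpy.histogram computes the first two without drawing anything. A small sketch on the same list of values:

```python
import numpy as np

values = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10]

counts, edges = np.histogram(values, bins=5)
print(counts)  # [4 7 5 3 3]
print(edges)   # [ 0.  2.  4.  6.  8. 10.]

# Each bin covers edges[i] <= value < edges[i + 1];
# only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

With bins=5 there are six edges, because every bin needs a left and a right boundary.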

In [ ]:
# plot a histogram of age and a histogram of ticket fares on separate plots

<YOUR CODE>

# bonus: use tab shift-tab to see if there is a way to draw a 2D histogram of age vs fare.

In [ ]:
# make a scatter plot of passenger age vs ticket fare

<YOUR CODE>

# kudos if you add separate colors for men and women

Part V (final): machine learning with scikit-learn

Scikit-learn is the tool for simple machine learning pipelines.

It's a single library that unites a whole bunch of models under a common interface:

  • Create: model = sklearn.whatever.ModelNameHere(parameters_if_any)
  • Train: model.fit(X, y)
  • Predict: model.predict(X_test)

It also contains utilities for feature extraction, quality estimation or cross-validation.


In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

features = data[["Fare", "SibSp"]].copy()
answers = data["Survived"]

model = RandomForestClassifier(n_estimators=100)
model.fit(features[:-100], answers[:-100])

test_predictions = model.predict(features[-100:])
print("Test accuracy:", accuracy_score(answers[-100:], test_predictions))


Test accuracy: 0.66
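The accuracy above comes from a single split (the last 100 rows), so it can move noticeably if the split changes. Cross-validation, one of the utilities mentioned earlier, averages over several splits. The sketch below is an assumption-laden illustration: it runs on synthetic data from make_classification rather than on the Titanic table, and the names X_toy, y_toy and clf are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic table): 300 rows, 5 numeric features.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: train on 4/5 of the rows, score on the remaining 1/5,
# and repeat so that every row is used for scoring exactly once.
scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(2))
print("mean accuracy:", round(scores.mean(), 2))
```

On your own data you would pass your feature table and the target column in place of X_toy and y_toy.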

Final quest: add more features to achieve accuracy of at least 0.80

Hint: for string features like "Sex" or "Embarked" you will have to compute some kind of numeric representation. For example, 1 if male and 0 if female, or vice versa.
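Two common encodings are sketched below on a made-up table (the lowercase column names sex and port are assumptions and deliberately differ from the real columns): a single 0/1 column for a feature with two values, and one indicator column per value for a feature with more.

```python
import pandas as pd

toy_passengers = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "port": ["S", "C", "S"],
})

# Two categories: a single 0/1 column is enough.
toy_passengers["is_male"] = (toy_passengers["sex"] == "male").astype(int)

# More than two possible categories: one indicator column per category.
port_onehot = pd.get_dummies(toy_passengers["port"], prefix="port")

encoded = pd.concat([toy_passengers[["is_male"]], port_onehot], axis=1)
print(encoded)
```

The resulting table contains only numeric or boolean columns, so it can be passed to model.fit.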

Hint II: you can use model.feature_importances_ to see how much the model relied on each of your features.
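As a rough illustration of how to read those numbers, the sketch below trains a forest on synthetic data where the target depends on only one of two columns. Everything here (signal, noise, forest) is made up for the example and is not part of the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=500)      # the target is computed from this column
noise = rng.normal(size=500)       # this column is unrelated to the target
X_synth = np.column_stack([signal, noise])
y_synth = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_synth, y_synth)

# One importance per input column, in column order; the values sum to 1.
importances = forest.feature_importances_
for name, importance in zip(["signal", "noise"], importances):
    print(f"{name}: {importance:.2f}")
```

A feature with an importance near zero contributed little to the splits, which makes it a candidate for removal or replacement.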



Okay, here's what we've learned: to survive a shipwreck you need to become an underaged girl with parents on the ship. Be sure to use this helpful advice next time you find yourself in a shipwreck.