A crash course in

Surviving Titanic

(with numpy and matplotlib)

This notebook teaches you the basic data science stack for Python: Jupyter, pandas, NumPy, matplotlib, and scikit-learn.

Part I: Jupyter notebooks in a nutshell

You are reading this line in a Jupyter notebook.
A notebook consists of cells. A cell can contain either code or hypertext.
- This cell contains hypertext. The next cell contains code.
You can run a code cell by selecting it (click) and pressing Ctrl + Enter to execute the code and display the output (if any).
If you're running this on a device with no keyboard, ~~you are doing it wrong~~ use the top bar (esp. play/stop/restart buttons) to run code.
Behind the curtains, there's a Python interpreter that runs that code and remembers anything you defined.

Run these cells to get started



In [ ]:

    
a = 5



In [ ]:

    
print(a * 2)

Ctrl + S to save changes (or use the button that looks like a floppy disk)
Top menu → Kernel → Interrupt (or Stop button) if you want to stop a running cell midway.
Top menu → Kernel → Restart (or cyclic arrow button) if interrupt doesn't fix the problem (you will lose all variables).
For shortcut junkies like us: Top menu → Help → Keyboard Shortcuts

More: Hacker's guide, Beginner's guide, Datacamp tutorial

Now the most important feature of jupyter notebooks for this course:

If you're typing something, press Tab to see automatic suggestions, then use the arrow keys + Enter to pick one.
If you move your cursor inside some function and press Shift + Tab, you'll get a help window. Shift + (Tab, Tab) (press Tab twice) will expand it.



In [ ]:

    
# run this first
import math



In [ ]:

    
# place your cursor at the end of the unfinished line below and press Tab to find a function
# that computes arctangent from two parameters (it should have a 2 in its name)
# once you have chosen it, press Shift + Tab, then Tab again, to see the docs

math.a  # <---

Part II: Loading data with Pandas

Pandas is a library that helps you load the data, prepare it and perform some lightweight analysis. The god object here is the pandas.DataFrame - a 2D table with batteries included.
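As a minimal sketch of what a DataFrame looks like, the snippet below builds one by hand from a dictionary of columns. The three passengers and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# toy table: invented values, only to show the structure of a DataFrame
toy = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Carol"],
        "Age": [31.0, 45.0, 8.0],
        "Survived": [1, 0, 1],
    },
    index=pd.Index([101, 102, 103], name="PassengerId"),
)

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column labels
print(toy.dtypes)          # each column has its own type
```

Each column keeps its own type, which is why a single table can mix strings, floats and integers.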

In the cells below we use it to read the data on the infamous Titanic shipwreck.

Please keep running all the code cells as you read.



In [ ]:

    
# If you are running in Google Colab, this cell will download the dataset from our repository.
# Otherwise, this cell will do nothing.

import sys
if 'google.colab' in sys.modules:
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week01_intro/primer_python_for_ml/train.csv



In [ ]:

    
import pandas as pd
# this yields a pandas.DataFrame
data = pd.read_csv("train.csv", index_col='PassengerId')



In [ ]:

    
# Selecting rows
head = data[:10]

head  # if you leave an expression at the end of a cell, jupyter will "display" it automatically









    Out[ ]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
About the data

Here are some of the columns:

Name - a string with person's full name
Survived - 1 if a person survived the shipwreck, 0 otherwise.
Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
Sex - a person's gender (in those good ol' times when there were just 2 of them)
Age - age in years, if available
SibSp - number of siblings and spouses aboard
Parch - number of parents and children aboard
Fare - ticket cost
Embarked - port where the passenger embarked
- C = Cherbourg; Q = Queenstown; S = Southampton



In [ ]:

    
# table dimensions
print("len(data) =", len(data))
print("data.shape =", data.shape)









    



len(data) = 891
data.shape = (891, 11)



In [ ]:

    
# select a single row by PassengerId (using .loc)
print(data.loc[4])









    



Survived                                               1
Pclass                                                 1
Name        Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex                                               female
Age                                                   35
SibSp                                                  1
Parch                                                  0
Ticket                                            113803
Fare                                                53.1
Cabin                                               C123
Embarked                                               S
Name: 4, dtype: object



In [ ]:

    
# select a single row by index (using .iloc)
print(data.iloc[3])









    



Survived                                               1
Pclass                                                 1
Name        Futrelle, Mrs. Jacques Heath (Lily May Peel)
Sex                                               female
Age                                                   35
SibSp                                                  1
Parch                                                  0
Ticket                                            113803
Fare                                                53.1
Cabin                                               C123
Embarked                                               S
Name: 4, dtype: object
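Both cells print the same passenger because PassengerId starts at 1 while positions start at 0. The sketch below, on a made-up table (labels and values invented here), shows the difference between label-based .loc and position-based .iloc. It also shows that a .loc slice includes its end label, while an .iloc slice excludes its end position.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10]},
    index=pd.Index([10, 20, 30, 40], name="PassengerId"),
)

by_label = toy.loc[20]        # row whose index label is 20
by_position = toy.iloc[1]     # second row (positions start at 0)
print(by_label["Fare"], by_position["Fare"])

label_slice = toy.loc[20:30]     # labels 20 and 30, end label included
position_slice = toy.iloc[1:2]   # position 1 only, end position excluded
print(len(label_slice), len(position_slice))
```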



In [ ]:

    
# select a single column.
ages = data["Age"]
print(ages[:10])  # alternatively: data.Age









    



PassengerId
1     22.0
2     38.0
3     26.0
4     35.0
5     35.0
6      NaN
7     54.0
8      2.0
9     27.0
10    14.0
Name: Age, dtype: float64



In [ ]:

    
# select several columns and rows at once
# alternatively: data[["Fare", "Pclass"]].loc[5:10]
# (a list of column labels is the unambiguous way to pick several columns)
data.loc[5:10, ["Fare", "Pclass"]]









    Out[ ]:

	Fare	Pclass
PassengerId
5	8.0500	3
6	8.4583	3
7	51.8625	1
8	21.0750	3
9	11.1333	3
10	30.0708	2

Your turn:



In [ ]:

    
# Select passengers number 13 and 666 (with these PassengerId values). Did they survive?

<YOUR CODE>



In [ ]:

    
# Compute the overall survival rate: what fraction of passengers survived the shipwreck?

<YOUR CODE>

Pandas also has some basic data analysis tools. For example, you can quickly display statistical aggregates for each numeric column using .describe().



In [ ]:

    
data.describe()

Some columns contain NaN values - this means that there is no data there. For example, passenger #6 has unknown age. To simplify the future data analysis, we'll replace NaN values using the pandas fillna method.

Note: we do this so easily because it's a tutorial. In general, you should think twice before you modify data like this.
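One way to think twice is to count the missing values before replacing them. The sketch below uses a made-up Series rather than the Titanic data. Filling with the mean is the approach used in this notebook; the median is shown only as one possible alternative.

```python
import numpy as np
import pandas as pd

# invented ages, two of them missing
ages = pd.Series([22.0, np.nan, 35.0, 54.0, np.nan, 2.0], name="Age")

n_missing = ages.isna().sum()
print("missing values:", n_missing)

filled_mean = ages.fillna(ages.mean())      # mean skips NaN by default
filled_median = ages.fillna(ages.median())  # less sensitive to outliers

print("mean used for filling:", ages.mean())
print("median used for filling:", ages.median())
print("missing after filling:", filled_mean.isna().sum())
```

Note that fillna returns a new Series, so the original stays untouched until you assign the result back.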



In [ ]:

    
data.loc[6]



In [ ]:

    
data['Age'] = data['Age'].fillna(value=data['Age'].mean())
data['Fare'] = data['Fare'].fillna(value=data['Fare'].mean())



In [ ]:

    
data.loc[6]

More pandas:

A neat tutorial from pydata
Official tutorials, including this 10 minutes to pandas
A bunch of cheat sheets is just one Google query away (e.g. basics, combining datasets and so on).

Part III: Numpy and vectorized computing

Almost any machine learning model requires some computational heavy lifting, usually involving linear algebra. Unfortunately, raw Python is terrible at this because each operation is interpreted at runtime.

So instead, we'll use numpy - a library that lets you run blazing fast computation with vectors, matrices and other tensors. Again, the god object here is numpy.ndarray:



In [ ]:

    
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print("a =", a)
print("b =", b)

# math and boolean operations can be applied to each element of an array
print("a + 1 =", a + 1)
print("a * 2 =", a * 2)
print("a == 2", a == 2)
# ... or corresponding elements of two (or more) arrays
print("a + b =", a + b)
print("a * b =", a * b)









    



a =  [1 2 3 4 5]
b =  [5 4 3 2 1]
a + 1 = [2 3 4 5 6]
a * 2 = [ 2  4  6  8 10]
a == 2 [False  True False False False]
a + b = [6 6 6 6 6]
a * b = [5 8 9 8 5]



In [ ]:

    
# Your turn: compute half-products of a and b elements (i.e. ½ of the products of corresponding elements)
<YOUR CODE>



In [ ]:

    
# compute elementwise quotient between squared a and (b plus 1)
<YOUR CODE>

How fast is it, Harry?

Let's compare computation time for Python and Numpy

Two arrays of $10^6$ elements
- first one: from 0 to 1 000 000
- second one: from 99 to 1 000 099
Computing:
- elementwise sum
- elementwise product
- square root of first array
- sum of all elements in the first array



In [ ]:

    
%%time
# ^-- this "magic" measures and prints cell computation time

# Option I: pure Python
arr_1 = range(1000000)
arr_2 = range(99, 1000099)


a_sum = []
a_prod = []
sqrt_a1 = []
for i in range(len(arr_1)):
    a_sum.append(arr_1[i]+arr_2[i])
    a_prod.append(arr_1[i]*arr_2[i])
    sqrt_a1.append(arr_1[i]**0.5)  # square roots go into their own list

arr_1_sum = sum(arr_1)



In [ ]:

    
%%time

# Option II: start from Python, convert to numpy
arr_1 = range(1000000)
arr_2 = range(99, 1000099)

arr_1, arr_2 = np.array(arr_1), np.array(arr_2)


a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()



In [ ]:

    
%%time

# Option III: pure numpy
arr_1 = np.arange(1000000)
arr_2 = np.arange(99, 1000099)

a_sum = arr_1 + arr_2
a_prod = arr_1 * arr_2
sqrt_a1 = arr_1 ** .5
arr_1_sum = arr_1.sum()

If you want more serious benchmarks, take a look at this.
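%%time measures a single run, so the numbers can jump around between runs. As a rough sketch (not the benchmark linked above), the standard timeit module can repeat a measurement several times. The array size and repeat counts below are arbitrary choices.

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints to avoid overflow

def python_sum_of_squares():
    return sum(x * x for x in py_list)

def numpy_sum_of_squares():
    return int((np_arr * np_arr).sum())

# both versions must agree before timing means anything
assert python_sum_of_squares() == numpy_sum_of_squares()

# best of 3 repeats, 5 calls each
t_python = min(timeit.repeat(python_sum_of_squares, number=5, repeat=3))
t_numpy = min(timeit.repeat(numpy_sum_of_squares, number=5, repeat=3))
print(f"pure Python: {t_python:.4f} s, numpy: {t_numpy:.4f} s")
```

Taking the minimum over repeats is a common convention: slower runs are usually caused by other processes, not by the code being measured.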

There's also a bunch of pre-implemented operations including logarithms, trigonometry, vector/matrix products and aggregations.



In [ ]:

    
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("numpy.sum(a) =", np.sum(a))
print("numpy.mean(a) =", np.mean(a))
print("numpy.min(a) =",  np.min(a))
print("numpy.argmin(b) =", np.argmin(b))  # index of minimal element
# dot product. Also used for matrix/tensor multiplication
print("numpy.dot(a,b) =", np.dot(a, b))
print(
    "numpy.unique(['male','male','female','female','male']) =",
    np.unique(['male', 'male', 'female', 'female', 'male']))









    



numpy.sum(a) = 15
numpy.mean(a) = 3.0
numpy.min(a) = 1
numpy.argmin(b) = 4
numpy.dot(a,b) = 35
numpy.unique(['male','male','female','female','male']) = ['female' 'male']
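A few of the other operations mentioned above, as a short sketch. The arrays here are made up for the example.

```python
import numpy as np

x = np.array([1.0, np.e, np.e ** 2])
print("np.log(x) =", np.log(x))             # natural logarithm, elementwise

angles = np.array([0.0, np.pi / 2, np.pi])
print("np.sin(angles) =", np.sin(angles))   # trigonometry, elementwise

m = np.array([[1, 2],
              [3, 4]])
v = np.array([1, 1])
print("m @ v =", m @ v)                     # matrix-vector product
print("m.sum(axis=0) =", m.sum(axis=0))     # column sums
print("m.sum(axis=1) =", m.sum(axis=1))     # row sums
```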

There is a lot more stuff. Check out a Numpy cheat sheet here.

The important part: all this functionality works with dataframes:



In [ ]:

    
print("Max ticket price: ", np.max(data["Fare"]))
print("\nThe guy who paid the most:\n", data.iloc[np.argmax(data["Fare"])])









    



Max ticket price:  512.3292

The guy who paid the most:
 Survived                   1
Pclass                     1
Name        Ward, Miss. Anna
Sex                   female
Age                       35
SibSp                      0
Parch                      0
Ticket              PC 17755
Fare                 512.329
Cabin                    NaN
Embarked                   C
Name: 259, dtype: object



In [ ]:

    
# your code: compute mean passenger age and the oldest guy on the ship
<YOUR CODE>



In [ ]:

    
print("Boolean operations")

print('a =', a)
print('b =', b)
print("a > 2", a > 2)
print("numpy.logical_not(a>2) =", np.logical_not(a > 2))
print("numpy.logical_and(a>2,b>2) =", np.logical_and(a > 2, b > 2))
print("numpy.logical_or(a>2,b<3) =", np.logical_or(a > 2, b < 3))

print()

print("shortcuts")
print("~(a > 2) =", ~(a > 2))  # logical_not(a > 2)
print("(a > 2) & (b > 2) =", (a > 2) & (b > 2))  # logical_and
print("(a > 2) | (b < 3) =", (a > 2) | (b < 3))  # logical_or



Boolean operations
a =  [1 2 3 4 5]
b =  [5 4 3 2 1]
a > 2 [False False  True  True  True]
numpy.logical_not(a>2) =  [ True  True False False False]
numpy.logical_and(a>2,b>2) =  [False False  True False False]
numpy.logical_or(a>2,b<3) =  [False False  True  True  True]

 shortcuts
~(a > 2) =  [ True  True False False False]
(a > 2) & (b > 2) =  [False False  True False False]
(a > 2) | (b < 3) =  [False False  True  True  True]
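The parentheses in the shortcut forms are not optional: in Python, & and | bind tighter than comparisons such as >, so a > 2 & b > 2 is parsed as a > (2 & b) > 2. The sketch below shows the resulting error. It also uses the fact that True counts as 1 to count the matching elements.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

try:
    wrong = a > 2 & b > 2      # parsed as a > (2 & b) > 2
except ValueError as err:
    wrong = None
    print("ValueError:", err)

right = (a > 2) & (b > 2)
print("mask:", right)
print("how many match:", right.sum())     # True counts as 1
print("fraction matching:", right.mean())
```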

The final Numpy feature we'll need is indexing: selecting elements from an array. Aside from Python indexes and slices (e.g. a[1:4]), Numpy also allows you to select several elements at once.



In [ ]:

    
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1, 2, 5])
print("a =", a)
print("Select by element index")
print("a[[1,2,5]] =", a[ix])

print("\nSelect by boolean mask")
# select all elements in a that are greater than 5
print("a[a > 5] =", a[a > 5])
print("(a % 2 == 0) =", a % 2 == 0)  # True for even, False for odd
print("a[a % 2 == 0] =", a[a % 2 == 0])  # select all elements in a that are even


# select male children
print("data[(data['Age'] < 18) & (data['Sex'] == 'male')] = (below)")
data[(data['Age'] < 18) & (data['Sex'] == 'male')]









    



a =  [ 0  1  4  9 16 25]
Select by element index
a[[1,2,5]] =  [ 1  4 25]

Select by boolean mask
a[a > 5] =  [ 9 16 25]
(a % 2 == 0) = [ True False  True False  True False]
a[a % 2 == 0] = [ 0  4 16]
data[(data['Age'] < 18) & (data['Sex'] == 'male')] = (below)






    Out[ ]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
8	0	3	Palsson, Master. Gosta Leonard	male	2.00	3	1	349909	21.0750	NaN	S
17	0	3	Rice, Master. Eugene	male	2.00	4	1	382652	29.1250	NaN	Q
51	0	3	Panula, Master. Juha Niilo	male	7.00	4	1	3101295	39.6875	NaN	S
60	0	3	Goodwin, Master. William Frederick	male	11.00	5	2	CA 2144	46.9000	NaN	S
64	0	3	Skoog, Master. Harald	male	4.00	3	2	347088	27.9000	NaN	S
79	1	2	Caldwell, Master. Alden Gates	male	0.83	0	2	248738	29.0000	NaN	S
87	0	3	Ford, Mr. William Neal	male	16.00	1	3	W./C. 6608	34.3750	NaN	S
126	1	3	Nicola-Yarred, Master. Elias	male	12.00	1	0	2651	11.2417	NaN	C
139	0	3	Osen, Mr. Olaf Elon	male	16.00	0	0	7534	9.2167	NaN	S
164	0	3	Calic, Mr. Jovo	male	17.00	0	0	315093	8.6625	NaN	S
165	0	3	Panula, Master. Eino Viljami	male	1.00	4	1	3101295	39.6875	NaN	S
166	1	3	Goldsmith, Master. Frank John William "Frankie"	male	9.00	0	2	363291	20.5250	NaN	S
172	0	3	Rice, Master. Arthur	male	4.00	4	1	382652	29.1250	NaN	Q
183	0	3	Asplund, Master. Clarence Gustaf Hugo	male	9.00	4	2	347077	31.3875	NaN	S
184	1	2	Becker, Master. Richard F	male	1.00	2	1	230136	39.0000	F4	S
194	1	2	Navratil, Master. Michel M	male	3.00	1	1	230080	26.0000	F2	S
221	1	3	Sunderland, Mr. Victor Francis	male	16.00	0	0	SOTON/OQ 392089	8.0500	NaN	S
262	1	3	Asplund, Master. Edvin Rojj Felix	male	3.00	4	2	347077	31.3875	NaN	S
267	0	3	Panula, Mr. Ernesti Arvid	male	16.00	4	1	3101295	39.6875	NaN	S
279	0	3	Rice, Master. Eric	male	7.00	4	1	382652	29.1250	NaN	Q
283	0	3	de Pelsmaeker, Mr. Alfons	male	16.00	0	0	345778	9.5000	NaN	S
306	1	1	Allison, Master. Hudson Trevor	male	0.92	1	2	113781	151.5500	C22 C26	S
334	0	3	Vander Planke, Mr. Leo Edmondus	male	16.00	2	0	345764	18.0000	NaN	S
341	1	2	Navratil, Master. Edmond Roger	male	2.00	1	1	230080	26.0000	F2	S
349	1	3	Coutts, Master. William Loch "William"	male	3.00	1	1	C.A. 37671	15.9000	NaN	S
353	0	3	Elias, Mr. Tannous	male	15.00	1	1	2695	7.2292	NaN	C
387	0	3	Goodwin, Master. Sidney Leonard	male	1.00	5	2	CA 2144	46.9000	NaN	S
408	1	2	Richards, Master. William Rowe	male	3.00	1	1	29106	18.7500	NaN	S
434	0	3	Kallio, Mr. Nikolai Erland	male	17.00	0	0	STON/O 2. 3101274	7.1250	NaN	S
446	1	1	Dodge, Master. Washington	male	4.00	0	2	33638	81.8583	A34	S
481	0	3	Goodwin, Master. Harold Victor	male	9.00	5	2	CA 2144	46.9000	NaN	S
490	1	3	Coutts, Master. Eden Leslie "Neville"	male	9.00	1	1	C.A. 37671	15.9000	NaN	S
501	0	3	Calic, Mr. Petar	male	17.00	0	0	315086	8.6625	NaN	S
533	0	3	Elias, Mr. Joseph Jr	male	17.00	1	1	2690	7.2292	NaN	C
550	1	2	Davies, Master. John Morgan Jr	male	8.00	1	1	C.A. 33112	36.7500	NaN	S
551	1	1	Thayer, Mr. John Borland Jr	male	17.00	0	2	17421	110.8833	C70	C
575	0	3	Rush, Mr. Alfred George John	male	16.00	0	0	A/4. 20589	8.0500	NaN	S
684	0	3	Goodwin, Mr. Charles Edward	male	14.00	5	2	CA 2144	46.9000	NaN	S
687	0	3	Panula, Mr. Jaako Arnold	male	14.00	4	1	3101295	39.6875	NaN	S
722	0	3	Jensen, Mr. Svend Lauritz	male	17.00	1	0	350048	7.0542	NaN	S
732	0	3	Hassan, Mr. Houssein G N	male	11.00	0	0	2699	18.7875	NaN	C
747	0	3	Abbott, Mr. Rossmore Edward	male	16.00	1	1	C.A. 2673	20.2500	NaN	S
752	1	3	Moor, Master. Meier	male	6.00	0	1	392096	12.4750	E121	S
756	1	2	Hamalainen, Master. Viljo	male	0.67	1	1	250649	14.5000	NaN	S
765	0	3	Eklund, Mr. Hans Linus	male	16.00	0	0	347074	7.7750	NaN	S
788	0	3	Rice, Master. George Hugh	male	8.00	4	1	382652	29.1250	NaN	Q
789	1	3	Dean, Master. Bertram Vere	male	1.00	1	2	C.A. 2315	20.5750	NaN	S
792	0	2	Gaskell, Mr. Alfred	male	16.00	0	0	239865	26.0000	NaN	S
803	1	1	Carter, Master. William Thornton II	male	11.00	1	2	113760	120.0000	B96 B98	S
804	1	3	Thomas, Master. Assad Alexander	male	0.42	0	1	2625	8.5167	NaN	C
820	0	3	Skoog, Master. Karl Thorsten	male	10.00	3	2	347088	27.9000	NaN	S
825	0	3	Panula, Master. Urho Abraham	male	2.00	4	1	3101295	39.6875	NaN	S
828	1	2	Mallet, Master. Andre	male	1.00	0	2	S.C./PARIS 2079	37.0042	NaN	C
832	1	2	Richards, Master. George Sibley	male	0.83	1	1	29106	18.7500	NaN	S
842	0	2	Mudd, Mr. Thomas Charles	male	16.00	0	0	S.O./P.P. 3	10.5000	NaN	S
845	0	3	Culumovic, Mr. Jeso	male	17.00	0	0	315090	8.6625	NaN	S
851	0	3	Andersson, Master. Sigvard Harald Elias	male	4.00	4	2	347082	31.2750	NaN	S
870	1	3	Johnson, Master. Harold Theodor	male	4.00	1	1	347742	11.1333	NaN	S
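A boolean mask can be combined with a column selection and an aggregate such as .mean(). The sketch below shows the pattern on a small made-up table (values invented, not taken from the dataset).

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2],
    "Fare": [80.0, 8.0, 10.0, 60.0, 20.0],
    "Survived": [1, 0, 1, 1, 0],
})

first_class = toy["Pclass"] == 1                     # boolean mask, one value per row
mean_fare_first = toy[first_class]["Fare"].mean()    # average over the selected rows
mean_fare_other = toy[~first_class]["Fare"].mean()   # ~ inverts the mask

print(mean_fare_first, mean_fare_other)

# the mean of a 0/1 column is the fraction of ones
survival_first = toy[first_class]["Survived"].mean()
print(survival_first)
```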

Your turn

Use numpy and pandas to answer a few questions about the data.



In [ ]:

    
# who on average paid more for their ticket, men or women?

mean_fare_men = <YOUR CODE>
mean_fare_women = <YOUR CODE>

print(mean_fare_men, mean_fare_women)



In [ ]:

    
# who is more likely to survive: a child (<18 yo) or an adult?

child_survival_rate = <YOUR CODE>
adult_survival_rate = <YOUR CODE>

print(child_survival_rate, adult_survival_rate)

Part IV: plots and matplotlib

Using Python to visualize the data is covered by yet another library: matplotlib.

Just like Python itself, matplotlib has an awesome tendency to keep simple things simple while still letting you write complicated stuff conveniently (e.g. super-detailed plots or custom animations).



In [ ]:

    
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- this "magic" tells all future matplotlib plots to be drawn inside notebook and not in a separate window.

# line plot
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])









    Out[ ]:





[<matplotlib.lines.Line2D at 0x7f2fec9370f0>]



In [ ]:

    
# scatter-plot
plt.scatter([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

plt.show()  # show the first plot and begin drawing next one



In [ ]:

    
# draw a scatter plot with custom markers and colors
plt.scatter([1, 1, 2, 3, 4, 4.5], [3, 2, 2, 5, 15, 24],
            c=["red", "blue", "orange", "green", "cyan", "gray"], marker="x")

# without .show(), several plots will be drawn on top of one another
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], c="black")

# adding more sugar
plt.title("Conspiracy theory proven!!!")
plt.xlabel("Per capita alcohol consumption")
plt.ylabel("# Layers in state of the art image classifier")

# fun with correlations: http://bit.ly/1FcNnWF









    Out[ ]:





<matplotlib.text.Text at 0x7f2fe8fcaf28>



In [ ]:

    
# histogram - showing data density
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10])
plt.show()

plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10], bins=5)









    












    Out[ ]:





(array([ 4.,  7.,  5.,  3.,  3.]),
 array([  0.,   2.,   4.,   6.,   8.,  10.]),
 <a list of 5 Patch objects>)
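The tuple printed above holds the bar heights and the bin edges. As a sketch, the same numbers can be obtained without drawing anything via numpy.histogram, which makes it easier to see what the bins argument does.

```python
import numpy as np

values = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10]

# 5 equal-width bins between min(values) and max(values)
counts, edges = np.histogram(values, bins=5)
print("counts:", counts)
print("edges:", edges)

# every bin is half-open [left, right), except the last one,
# which also includes its right edge
print("total:", counts.sum(), "== number of values:", len(values))
```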



In [ ]:

    
# plot a histogram of age and a histogram of ticket fares on separate plots

<YOUR CODE>

# bonus: use Tab and Shift + Tab to see if there is a way to draw a 2D histogram of age vs fare.



In [ ]:

    
# make a scatter plot of passenger age vs ticket fare

<YOUR CODE>

# kudos if you add separate colors for men and women

Extended tutorial
A cheat sheet
Other libraries for more sophisticated stuff: Plotly and Bokeh

Part V (final): machine learning with scikit-learn

Scikit-learn is the tool for simple machine learning pipelines.

It's a single library that unites a whole bunch of models under a common interface:

Create: model = sklearn.whatever.ModelNameHere(parameters_if_any)
Train: model.fit(X, y)
Predict: model.predict(X_test)

It also contains utilities for feature extraction, quality estimation or cross-validation.
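The create / fit / predict pattern, sketched on synthetic data so that it runs without the Titanic file. The choice of LogisticRegression, the dataset size and the random seeds are arbitrary assumptions for this example. train_test_split and cross_val_score are two of the utilities mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic binary classification problem (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression()          # create
model.fit(X_train, y_train)           # train
predictions = model.predict(X_test)   # predict

test_accuracy = accuracy_score(y_test, predictions)
print("Test accuracy:", test_accuracy)

# 5-fold cross-validation: five train/test splits instead of one
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("CV accuracy per fold:", scores)
```

Swapping in another model only changes the line that creates it; fit and predict stay the same.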



In [ ]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

features = data[["Fare", "SibSp"]].copy()
answers = data["Survived"]

model = RandomForestClassifier(n_estimators=100)
model.fit(features[:-100], answers[:-100])

test_predictions = model.predict(features[-100:])
print("Test accuracy:", accuracy_score(answers[-100:], test_predictions))









    



Test accuracy: 0.66

Final quest: add more features to achieve accuracy of at least 0.80

Hint: for string features like "Sex" or "Embarked" you will have to compute some kind of numeric representation. For example, 1 if male and 0 if female or vice versa
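A sketch of two common ways to turn string columns into numbers, on a made-up table: a 0/1 flag for a two-valued column, and one indicator column per value (pd.get_dummies) for a column with several values. The column name is_male is chosen here for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# two-valued column: a single 0/1 flag is enough
toy["is_male"] = (toy["Sex"] == "male").astype(int)

# several values: one indicator column per value ("one-hot" encoding)
embarked_flags = pd.get_dummies(toy["Embarked"], prefix="Embarked").astype(int)

numeric = pd.concat([toy[["is_male"]], embarked_flags], axis=1)
print(numeric)
```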

Hint II: you can use model.feature_importances_ to see how much the model relied on each of your features.
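A sketch of reading feature_importances_ on synthetic data (the feature names and the way the target is generated are assumptions made for this example): one column determines the answer, the other is pure noise, and the importances reflect that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
signal = rng.normal(size=500)          # determines the target
noise = rng.normal(size=500)           # unrelated to the target
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# one non-negative number per feature; together they sum to 1
importances = model.feature_importances_
for name, importance in zip(["signal", "noise"], importances):
    print(f"{name}: {importance:.3f}")
```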

Here are more resources for sklearn:

Okay, here's what we've learned: to survive a shipwreck you need to become an underaged girl with parents on the ship. Be sure to use this helpful advice next time you find yourself in a shipwreck.
