Welcome to the Jupyter Notebook

I might slip and call it the "IPython Notebook" sometimes, because it was originally just for interactive Python sessions. But it supports many more languages now, including R and Julia.

What is it?

This "notebook" is my go-to version of the Lab Book you might have seen in bench science. It captures a fluid mix of text (in "markdown" format, with decent support for $\LaTeX$) and computation, in this case in Python.


In [1]:
# this is a python comment
# this cell contains python code

# executing the cell yields the results of the python command
2+2


Out[1]:
4

Why does it work so well for me?

Donald Knuth had a dream of "literate programming" which captivated me years ago, but I could never really do it until I had the notebook technology.

The notebook works really well for including plots and other graphics as well.


In [2]:
# live code some graphics here

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot([3,1,4,1,5])


Out[2]:
[<matplotlib.lines.Line2D at 0x7f6bb124ea50>]

In [3]:
plt.style.use("fivethirtyeight")

In [4]:
plt.plot([3,1,4,1,5])


Out[4]:
[<matplotlib.lines.Line2D at 0x7f6baed87d50>]

In [5]:
# your turn: plot some additional digits of pi

import sympy

In [6]:
# convert pi to a list of digits and then plot
pi_str = str(sympy.N(sympy.pi, n=100))
pi_digits = [int(x) for x in pi_str if x != '.']

In [7]:
plt.plot(pi_digits)


Out[7]:
[<matplotlib.lines.Line2D at 0x7f6badb90810>]
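
If the list comprehension in cell 6 looks mysterious, here is a minimal sketch of the same idea on a short, hard-coded string. The string below is just the first few digits of pi typed in by hand for illustration, so this version does not need sympy at all.

```python
# A short, hard-coded prefix of pi, used purely for illustration.
pi_str = "3.14159265358979"

# Keep every character except the decimal point and convert each one to an int.
pi_digits = [int(ch) for ch in pi_str if ch != "."]

print(pi_digits)
print(len(pi_digits))
```

The result is an ordinary Python list of integers, which is exactly the kind of thing `plt.plot` accepts.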

We will use two "packages" for the hands-on portion of this tutorial

Pandas

This is a panel data package with a cute name.


In [8]:
# live code an example of loading the va data csv with pandas here

import pandas as pd

In [9]:
df = pd.read_csv('../3-data/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv', low_memory=False)

In [10]:
# DataFrame.iloc method selects row and columns by "integer location"

df.iloc[5:10, 5:10]


Out[10]:
   gs_code46      gs_text46  va46  gs_code55      gs_text55
5        X09          Fires    19        X09          Fires
6        N17  Renal Failure    40        N17  Renal Failure
7        C30   AIDS with TB     2        C30   AIDS with TB
8        C34    Lung Cancer    27        C34    Lung Cancer
9        S85         Sepsis    42        S85         Sepsis

In [11]:
# If you are new to this sort of thing, what do you think this does?

df.iloc[5:10, :10]


Out[11]:
    site  module  gs_code34      gs_text34  va34  gs_code46      gs_text46  va46  gs_code55      gs_text55
5     UP   Adult        X09          Fires    15        X09          Fires    19        X09          Fires
6    Dar   Adult        N17  Renal Failure    29        N17  Renal Failure    40        N17  Renal Failure
7    Dar   Adult        B20           AIDS     1        C30   AIDS with TB     2        C30   AIDS with TB
8  Bohol   Adult        C34    Lung Cancer    19        C34    Lung Cancer    27        C34    Lung Cancer
9     UP   Adult        O67       Maternal    21        S85         Sepsis    42        S85         Sepsis
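
In case the slicing in the last two cells is new to you, here is a small sketch of how `iloc` slices behave. The `toy` table, its column names, and its values are made up for illustration and are not part of the VA data.

```python
import pandas as pd

# Hypothetical toy table: 4 rows, 3 columns (not the VA data).
toy = pd.DataFrame({
    "site": ["UP", "Dar", "Bohol", "Pemba"],
    "age": [34, 51, 28, 60],
    "cause": ["Fires", "AIDS", "Stroke", "Falls"],
})

# Rows at positions 1 and 2 (the stop position 3 is excluded),
# columns at positions 0 and 1.
block = toy.iloc[1:3, 0:2]

# Leaving out the start of a slice means "from the beginning",
# so :2 selects the same columns as 0:2.
first_two_cols = toy.iloc[1:3, :2]

print(block)
```

So `df.iloc[5:10, :10]` keeps the same five rows as before but takes the first ten columns instead of columns 5 through 9.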

In [12]:
# I don't have time to show you the details now, but I find that
# pandas DataFrames make common tasks very convenient.  For example:

df.gs_text34


Out[12]:
0                             Cirrhosis
1                              Epilepsy
2                             Pneumonia
3                                  COPD
4           Acute Myocardial Infarction
5                                 Fires
6                         Renal Failure
7                                  AIDS
8                           Lung Cancer
9                              Maternal
10                             Maternal
11                             Drowning
12                        Renal Failure
13        Other Cardiovascular Diseases
14                        Renal Failure
15                                 AIDS
16      Other Non-communicable Diseases
17                                Falls
18                                Fires
19                            Cirrhosis
20                                 AIDS
21                               Stroke
22                         Road Traffic
23                            Cirrhosis
24              Bite of Venomous Animal
25                             Diabetes
26                                 COPD
27        Other Cardiovascular Diseases
28        Other Cardiovascular Diseases
29            Other Infectious Diseases
                     ...               
7811                     Stomach Cancer
7812                          Cirrhosis
7813                      Renal Failure
7814          Other Infectious Diseases
7815                             Stroke
7816                               AIDS
7817                             Stroke
7818      Other Cardiovascular Diseases
7819                           Drowning
7820                    Prostate Cancer
7821                           Diabetes
7822                          Cirrhosis
7823                        Lung Cancer
7824                       Road Traffic
7825                             Stroke
7826                 Diarrhea/Dysentery
7827                          Cirrhosis
7828                          Cirrhosis
7829                           Diabetes
7830                           Maternal
7831                           Maternal
7832                      Breast Cancer
7833                               AIDS
7834                          Cirrhosis
7835                           Homicide
7836                    Cervical Cancer
7837      Other Cardiovascular Diseases
7838                         Poisonings
7839                              Fires
7840                  Esophageal Cancer
Name: gs_text34, dtype: object

In [13]:
df.gs_text34.value_counts()


Out[13]:
Stroke                             630
Other Non-communicable Diseases    599
Pneumonia                          540
AIDS                               502
Maternal                           468
Renal Failure                      416
Other Cardiovascular Diseases      416
Diabetes                           414
Acute Myocardial Infarction        400
Cirrhosis                          313
TB                                 276
Other Infectious Diseases          263
Diarrhea/Dysentery                 228
Road Traffic                       202
Breast Cancer                      195
Falls                              173
COPD                               171
Homicide                           167
Leukemia/Lymphomas                 156
Cervical Cancer                    155
Suicide                            124
Fires                              122
Drowning                           106
Lung Cancer                        106
Other Injuries                     103
Malaria                            100
Colorectal Cancer                   99
Poisonings                          86
Bite of Venomous Animal             66
Stomach Cancer                      62
Epilepsy                            48
Prostate Cancer                     48
Asthma                              47
Esophageal Cancer                   40
Name: gs_text34, dtype: int64
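
To get a feel for what `value_counts` is doing, here is a rough plain-Python sketch of the same tally using `collections.Counter`. The `causes` list is a made-up handful of labels, not the real column, and this is only an analogy rather than how pandas implements it.

```python
from collections import Counter

# Hypothetical handful of cause labels (not the real gs_text34 column).
causes = ["Stroke", "AIDS", "Stroke", "Fires", "AIDS", "Stroke"]

# Tally each distinct label, then list them from most to least common.
counts = Counter(causes).most_common()

for label, n in counts:
    print(label, n)
```

As in the pandas output above, the most frequent label comes first.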

Scikit-Learn

This is a Python-based machine learning library that has a lot of great methods in a common framework.


In [14]:
# you can guess what the next line does, 
# even if you have never used python before:

import sklearn.neighbors

In [15]:
# here is how sklearn creates a "classifier":

clf = sklearn.neighbors.KNeighborsClassifier()

In [16]:
# I didn't mention `numpy` before, but this is "the fundamental
# package for scientific computing with Python"

import numpy as np

In [17]:
# sklearn can get mixed up with Pandas DataFrames and Series
# (at least in the version used here), so to be safe,
# turn things into np.arrays:

X = np.array(df.loc[:, ['va46']])
y = np.array(df.gs_text34)

In [18]:
# one nice thing about sklearn is that it has all different
# fancy machine learning methods, but they all follow a
# common pattern:

clf.fit(X, y)


Out[18]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [19]:
clf.predict([[19]])


Out[19]:
array(['Fires'], dtype=object)
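
If you are curious what a k-nearest-neighbors classifier does under the hood, here is a minimal pure-Python sketch for a single numeric feature. The helper `knn_predict` and the toy training data are made up for illustration. It assumes uniform weights and k=5, matching the defaults shown in Out[18], and it uses the absolute difference as the distance, which is what the Euclidean metric reduces to with one feature. sklearn's real implementation is much more efficient and may break ties differently.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Predict a label for `query` by majority vote among the k nearest training points."""
    # Order the training indices by distance to the query.
    # With a single feature, the distance is just the absolute difference.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    # Collect the labels of the k closest points and take the most common one.
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy training data: one feature value and one label per row.
train_x = [1, 2, 2, 19, 19, 20, 21, 40, 41]
train_y = ["AIDS", "AIDS", "AIDS", "Fires", "Fires", "Fires",
           "Falls", "Renal Failure", "Renal Failure"]

print(knn_predict(train_x, train_y, 19))
```

With this toy data, three of the five points closest to 19 are labeled "Fires", so that label wins the vote.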

We will see plenty more of sklearn so I'll leave things here for now.