Welcome to the Jupyter Notebook

I might slip and call it the "IPython Notebook" sometimes, because it was originally just for interactive Python sessions. But it now supports many other languages as well (including R and Julia).

What is it?

This "notebook" is my go-to version of the Lab Book you might have seen in bench science. It captures a fluid mix of text (in "markdown" format, with decent support for $\LaTeX$) and computation, in this case in Python.



In [1]:

    
# this is a python comment
# this cell contains python code

# executing the cell yields the results of the python command
2+2









    Out[1]:





4

Why does it work so well for me?

Donald Knuth had a dream of "literate programming" which captivated me years ago, but I could never really do it until I had the notebook technology.

The notebook also works really well for including plots and other graphics.



In [2]:

    
# live code some graphics here

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot([3,1,4,1,5])









    Out[2]:





[<matplotlib.lines.Line2D at 0x7f6bb124ea50>]



In [3]:

    
plt.style.use("fivethirtyeight")



In [4]:

    
plt.plot([3,1,4,1,5])









    Out[4]:





[<matplotlib.lines.Line2D at 0x7f6baed87d50>]



In [5]:

    
# your turn: plot some additional digits of pi

import sympy



In [6]:

    
# convert pi to a list of digits, and then plot
pi_str = str(sympy.N(sympy.pi, n=100))
pi_digits = [int(x) for x in pi_str if x != '.']



In [7]:

    
plt.plot(pi_digits)









    Out[7]:





[<matplotlib.lines.Line2D at 0x7f6badb90810>]
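If the digit-extraction step went by too fast, here is a minimal sketch of it on a short string, with no sympy needed. The hard-coded string is my stand-in for the `sympy.N` output and is only for illustration.

```python
# A short, hard-coded stand-in for str(sympy.N(sympy.pi, n=100));
# this literal is an assumption chosen only for illustration.
pi_str = "3.14159"

# Keep every character except the decimal point, and convert each to an int.
pi_digits = [int(ch) for ch in pi_str if ch != "."]

print(pi_digits)  # [3, 1, 4, 1, 5, 9]
```

The full version works the same way, just with 100 significant digits instead of 6.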

We will use two "packages" for the hands-on portion of this tutorial

Pandas

This is a panel data package with a cute name.



In [8]:

    
# live code an example of loading the va data csv with pandas here

import pandas as pd



In [9]:

    
df = pd.read_csv('../3-data/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv', low_memory=False)



In [10]:

    
# The DataFrame.iloc indexer selects rows and columns by "integer location"

df.iloc[5:10, 5:10]









    Out[10]:






  
    
      
   gs_code46      gs_text46  va46  gs_code55      gs_text55
5        X09          Fires    19        X09          Fires
6        N17  Renal Failure    40        N17  Renal Failure
7        C30   AIDS with TB     2        C30   AIDS with TB
8        C34    Lung Cancer    27        C34    Lung Cancer
9        S85         Sepsis    42        S85         Sepsis



In [11]:

    
# If you are new to this sort of thing, what do you think this does?

df.iloc[5:10, :10]









    Out[11]:






  
    
      
    site  module  gs_code34      gs_text34  va34  gs_code46      gs_text46  va46  gs_code55      gs_text55
5     UP   Adult        X09          Fires    15        X09          Fires    19        X09          Fires
6    Dar   Adult        N17  Renal Failure    29        N17  Renal Failure    40        N17  Renal Failure
7    Dar   Adult        B20           AIDS     1        C30   AIDS with TB     2        C30   AIDS with TB
8  Bohol   Adult        C34    Lung Cancer    19        C34    Lung Cancer    27        C34    Lung Cancer
9     UP   Adult        O67       Maternal    21        S85         Sepsis    42        S85         Sepsis
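If slicing is new to you, here is a minimal sketch on a tiny table. The table `toy` is made up for illustration (I borrowed a few values from the output above), and it is not the VA data. The key point is that `iloc` follows Python's half-open slice convention: the start position is included, the stop position is not, and leaving out the start means "from the beginning".

```python
import pandas as pd

# Hypothetical toy table: 4 rows, 3 columns (not the VA data).
toy = pd.DataFrame({
    "site": ["UP", "Dar", "Dar", "Bohol"],
    "module": ["Adult", "Adult", "Adult", "Adult"],
    "va34": [15, 29, 1, 19],
})

# Rows at positions 1 and 2 (the stop, 3, is excluded),
# and columns at positions 1 and 2.
middle = toy.iloc[1:3, 1:3]

# Omitting the start of a slice means "from position 0":
# same rows, but the first two columns.
first_cols = toy.iloc[1:3, :2]

print(middle)
print(first_cols)
```

So `df.iloc[5:10, :10]` selects the rows at positions 5 through 9 and the first ten columns.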



In [12]:

    
# I don't have time to show you the details now, but I find that
# pandas DataFrames have really done things well.  For example:

df.gs_text34









    Out[12]:





0                             Cirrhosis
1                              Epilepsy
2                             Pneumonia
3                                  COPD
4           Acute Myocardial Infarction
5                                 Fires
6                         Renal Failure
7                                  AIDS
8                           Lung Cancer
9                              Maternal
10                             Maternal
11                             Drowning
12                        Renal Failure
13        Other Cardiovascular Diseases
14                        Renal Failure
15                                 AIDS
16      Other Non-communicable Diseases
17                                Falls
18                                Fires
19                            Cirrhosis
20                                 AIDS
21                               Stroke
22                         Road Traffic
23                            Cirrhosis
24              Bite of Venomous Animal
25                             Diabetes
26                                 COPD
27        Other Cardiovascular Diseases
28        Other Cardiovascular Diseases
29            Other Infectious Diseases
                     ...               
7811                     Stomach Cancer
7812                          Cirrhosis
7813                      Renal Failure
7814          Other Infectious Diseases
7815                             Stroke
7816                               AIDS
7817                             Stroke
7818      Other Cardiovascular Diseases
7819                           Drowning
7820                    Prostate Cancer
7821                           Diabetes
7822                          Cirrhosis
7823                        Lung Cancer
7824                       Road Traffic
7825                             Stroke
7826                 Diarrhea/Dysentery
7827                          Cirrhosis
7828                          Cirrhosis
7829                           Diabetes
7830                           Maternal
7831                           Maternal
7832                      Breast Cancer
7833                               AIDS
7834                          Cirrhosis
7835                           Homicide
7836                    Cervical Cancer
7837      Other Cardiovascular Diseases
7838                         Poisonings
7839                              Fires
7840                  Esophageal Cancer
Name: gs_text34, dtype: object



In [13]:

    
df.gs_text34.value_counts()









    Out[13]:





Stroke                             630
Other Non-communicable Diseases    599
Pneumonia                          540
AIDS                               502
Maternal                           468
Renal Failure                      416
Other Cardiovascular Diseases      416
Diabetes                           414
Acute Myocardial Infarction        400
Cirrhosis                          313
TB                                 276
Other Infectious Diseases          263
Diarrhea/Dysentery                 228
Road Traffic                       202
Breast Cancer                      195
Falls                              173
COPD                               171
Homicide                           167
Leukemia/Lymphomas                 156
Cervical Cancer                    155
Suicide                            124
Fires                              122
Drowning                           106
Lung Cancer                        106
Other Injuries                     103
Malaria                            100
Colorectal Cancer                   99
Poisonings                          86
Bite of Venomous Animal             66
Stomach Cancer                      62
Epilepsy                            48
Prostate Cancer                     48
Asthma                              47
Esophageal Cancer                   40
Name: gs_text34, dtype: int64
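As a rough sketch of what `value_counts()` computes, the same kind of tally can be written with `collections.Counter` from the standard library. The short list below is a made-up sample, and this is not how pandas does it internally.

```python
from collections import Counter

# Made-up sample of cause labels (not the full gs_text34 column).
causes = ["Stroke", "AIDS", "Stroke", "Fires", "AIDS", "Stroke"]

# most_common() returns (label, count) pairs sorted by descending count,
# which mirrors the default ordering of value_counts().
counts = Counter(causes).most_common()

for label, n in counts:
    print(f"{label:10s}{n:3d}")
```

The pandas version is one method call and hands back a Series, which is a big part of why I reach for it.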

Scikit-Learn

This is a Python-based machine learning library that has a lot of great methods in a common framework.



In [14]:

    
# you can guess what the next line does, 
# even if you have never used python before:

import sklearn.neighbors



In [15]:

    
# here is how sklearn creates a "classifier":

clf = sklearn.neighbors.KNeighborsClassifier()



In [16]:

    
# I didn't mention `numpy` before, but this is "the fundamental
# package for scientific computing with Python"

import numpy as np



In [17]:

    
# sklearn gets mixed up with Pandas DataFrames and Series,
# so you need to turn things into np.arrays:

X = np.array(df.loc[:, ['va46']])
y = np.array(df.gs_text34)



In [18]:

    
# one nice thing about sklearn is that it has all different
# fancy machine learning methods, but they all follow a
# common pattern:

clf.fit(X, y)









    Out[18]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')



In [19]:

    
clf.predict([[19]])









    Out[19]:





array(['Fires'], dtype=object)
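To give a feel for what this classifier is doing, here is a minimal pure-Python sketch of k-nearest-neighbors voting on a single numeric feature. The helper `knn_predict` and the training pairs are hypothetical, made up for illustration. sklearn's actual implementation is far more efficient and may break ties differently.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training indices by absolute distance to the query value.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    # Collect the labels of the k closest points and take the most common one.
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up training data: one numeric feature and a text label per row.
train_x = [19, 19, 20, 18, 40, 41, 2]
train_y = ["Fires", "Fires", "Fires", "Falls",
           "Renal Failure", "Renal Failure", "AIDS"]

print(knn_predict(train_x, train_y, 19))  # Fires
```

The default `n_neighbors=5` shown in the `fit` output above corresponds to `k=5` in this sketch.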

We will see plenty more of sklearn, so I'll leave things here for now.
