Analyzing patient data

Words are useful, but what’s more useful are the sentences and stories we build with them.
A lot of powerful tools are built into languages like Python, but even more live in the libraries built with them.
We need to import a library called NumPy.
Use this library when working with numbers, especially matrices or arrays.



In [2]:

    
import numpy

Importing a library is akin to getting lab equipment out of a locker and setting it up on the bench.
Libraries provide additional functionality.
With NumPy loaded we can read the CSV file into Python.



In [3]:

    
#assuming the data file is in the data/ folder
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')









    Out[3]:





array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])

numpy.loadtxt() is a function call: it runs the loadtxt function that belongs to numpy
dot notation accesses a component of a thing: thing.component
two parameters: fname (the file name) and delimiter, both character strings (in quotes)
we didn't save the result in memory using a variable, so it was only displayed
variable names in Python must start with a letter or underscore & are case sensitive
the assignment operator is =
let's look at assigning this inflammation data to a variable



In [4]:

    
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')



In [5]:

    
print(data)









    



[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
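
The load-and-assign step above can be tried without the data file. This is a minimal sketch, assuming a made-up four-column CSV held in memory (csv_text and sample are hypothetical names, not part of the lesson data); loadtxt accepts a file-like object as well as a file name.

```python
import io
import numpy

# Hypothetical stand-in for a small CSV file (not the real inflammation data).
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,3\n"

# fname can be a file name or a file-like object such as StringIO.
sample = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',')

# Assigning to a variable keeps the array in memory so we can reuse it.
print(sample)
print(sample.shape)
```

Without the assignment, the array would only be displayed and then lost.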



In [7]:

    
weight_kg = 55 #assigns value 55 to weight_kg



In [8]:

    
print(weight_kg) #we can print to the screen



In [9]:

    
print("weight in kg", weight_kg)









    



weight in kg 55



In [15]:

    
weight_kg = 70



In [11]:

    
print("weight in kg", weight_kg)









    



weight in kg 70

print above shows several things at once when we separate them with commas
think of a variable as a sticky note put on a value
this means assigning a value to one variable does not change the values of other variables
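
A small sketch of the sticky-note idea, assuming a second variable named saved_kg (a hypothetical name used only for illustration):

```python
weight_kg = 55
saved_kg = weight_kg   # put a second sticky note on the value 55

weight_kg = 70         # move the weight_kg note to a new value

# saved_kg still sticks to 55; it did not follow weight_kg.
print("weight_kg:", weight_kg)
print("saved_kg:", saved_kg)
```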



In [12]:

    
weight_kg * 2









    Out[12]:





140



In [16]:

    
weight_lb = weight_kg * 2.2



In [17]:

    
print('weight in lb:', weight_lb)



weight in lb: 154.0



In [18]:

    
print("weight in lb:", weight_kg*2.2)









    



weight in lb: 154.0



In [19]:

    
print(data)









    



[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

whos is an IPython command that lists the variables and modules you have defined



In [20]:

    
whos









    



Variable    Type       Data/Info
--------------------------------
data        ndarray    60x40: 2400 elems, type `float64`, 19200 bytes
numpy       module     <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
weight_kg   int        70
weight_lb   float      154.0

What does the following program print out?

first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)



In [21]:

    
print(data)









    



[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]



In [22]:

    
print(type(data)) #we can get type of object









    



<class 'numpy.ndarray'>

data refers to an N-dimensional array (numpy.ndarray)
data corresponds to the patients' inflammation measurements
let's look at the shape of the data



In [23]:

    
print(data.shape)

data has 60 rows and 40 columns
when we created data, NumPy also created members (attributes) that hold information about the array
this extra information describes data the way an adjective describes a noun
we use dot notation to access members
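
To see a few of these members on something small, here is a sketch assuming a hypothetical 3x5 array of zeros called small (not the inflammation data):

```python
import numpy

# Hypothetical 3x5 array, used only to illustrate attributes.
small = numpy.zeros((3, 5))

# Attributes are read with dot notation and no parentheses.
print(small.shape)   # number of rows and columns
print(small.ndim)    # number of axes
print(small.size)    # total number of elements
print(small.dtype)   # type of each element
```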



In [24]:

    
print('first value in data', data[0,0]) #use index in square brackets









    



first value in data 0.0



In [26]:

    
print('4th value in data', data[0,3]) #use index in square brackets









    



4th value in data 3.0



In [32]:

    
print('first value in 4th row of data', data[3,0]) #index 3 is the 4th row because counting starts at 0



first value in 4th row of data 0.0



In [31]:

    
!head -3 data/inflammation-01.csv









    



0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1



In [33]:

    
print('middle value in data', data[30,20]) # a value near the middle of the 60x40 array









    



middle value in data 13.0

programming languages like MATLAB and R start counting at 1
languages in the C family (C++, Java, Perl & Python) start counting at 0
if we have an MxN array in Python, indices go from 0 to M-1 on the first axis and 0 to N-1 on the second
indices are (row, column)
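
A minimal sketch of the index range on a small array, assuming a hypothetical 3x4 array named grid filled with the numbers 0 to 11:

```python
import numpy

# Hypothetical 3x4 array: rows are indexed 0..2, columns 0..3.
grid = numpy.arange(12).reshape(3, 4)

first = grid[0, 0]   # top-left element
last = grid[2, 3]    # bottom-right element, at (M-1, N-1)

# Index 3 on the first axis is one past the end, so NumPy raises IndexError.
try:
    grid[3, 0]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, out_of_range)
```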



In [16]:

    
data[0:4, 0:10] #select whole sections of matrix, 1st 10 days & 4 patients









    Out[16]:





array([[ 0.,  0.,  1.,  3.,  1.,  2.,  4.,  7.,  8.,  3.],
       [ 0.,  1.,  2.,  1.,  2.,  1.,  3.,  2.,  2.,  6.],
       [ 0.,  1.,  1.,  3.,  3.,  2.,  6.,  2.,  5.,  9.],
       [ 0.,  0.,  2.,  0.,  4.,  2.,  2.,  1.,  6.,  7.]])

slice 0:4 means start at 0 and go up to but not include 4
up-to-but-not-including takes a bit of getting used to
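
One way to get used to the rule is to notice that a slice start:stop always has stop - start items, and that neighbouring slices join up with no overlap. A sketch with a plain Python list (days is a hypothetical name):

```python
# Hypothetical list of day numbers 0..9.
days = list(range(10))

first_four = days[0:4]   # items 0, 1, 2, 3 (4 is not included)
next_three = days[4:7]   # items 4, 5, 6

print(first_four)
print(next_three)

# The two slices meet at 4 without repeating or skipping anything.
print(first_four + next_three == days[0:7])
```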



In [17]:

    
data[5:10,0:10]









    Out[17]:





array([[ 0.,  0.,  1.,  2.,  2.,  4.,  2.,  1.,  6.,  4.],
       [ 0.,  0.,  2.,  2.,  4.,  2.,  2.,  5.,  5.,  8.],
       [ 0.,  0.,  1.,  2.,  3.,  1.,  2.,  3.,  5.,  3.],
       [ 0.,  0.,  0.,  3.,  1.,  5.,  6.,  5.,  5.,  8.],
       [ 0.,  1.,  1.,  2.,  1.,  3.,  5.,  3.,  5.,  8.]])

we don't have to include the upper and lower bounds
Python uses 0 by default if we don't include the lower bound
with no upper bound, the slice runs to the end of the axis
: on its own includes everything



In [18]:

    
data[:3, 36:]









    Out[18]:





array([[ 2.,  3.,  0.,  0.],
       [ 1.,  1.,  0.,  1.],
       [ 2.,  2.,  1.,  1.]])
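
The same defaults can be checked on an array small enough to read in full. A sketch, assuming a hypothetical 3x4 array named grid:

```python
import numpy

# Hypothetical 3x4 array holding 0..11, row by row.
grid = numpy.arange(12).reshape(3, 4)

top_right = grid[:2, 2:]   # rows 0-1 (lower bound omitted), columns 2-3 (upper bound omitted)
everything = grid[:, :]    # a bare : keeps the whole axis
first_column = grid[:, 0]  # every row, column 0

print(top_right)
print(first_column)
```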

A section of an array is called a slice. We can take slices of character strings as well:

element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])
first three characters: oxy
last three characters: gen

What is the value of element[:4]? What about element[4:]? Or element[:]?

What is element[-1]? What is element[-2]? Given those answers, explain what element[1:-1] does.



In [1]:

    
element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])









    



first three characters: oxy
last three characters: gen



In [6]:

    
print(element[:4])
print(element[4:])









    



oxyg
en



In [7]:

    
print(element[:])



oxygen



In [14]:

    
#oxygen
print(element[-1])
print(element[-2])
print(element[2:-1])









    



n
e
yge



In [13]:

    
doubledata = data * 2.0 #we can perform math on array (data must already be loaded, otherwise this raises a NameError)

operations on arrays are done on each individual element of the array



In [20]:

    
doubledata









    Out[20]:





array([[ 0.,  0.,  2., ...,  6.,  0.,  0.],
       [ 0.,  2.,  4., ...,  2.,  0.,  2.],
       [ 0.,  2.,  2., ...,  4.,  2.,  2.],
       ..., 
       [ 0.,  2.,  2., ...,  2.,  2.,  2.],
       [ 0.,  0.,  0., ...,  0.,  4.,  0.],
       [ 0.,  0.,  2., ...,  2.,  2.,  0.]])



In [21]:

    
data[:3, 36:]









    Out[21]:





array([[ 2.,  3.,  0.,  0.],
       [ 1.,  1.,  0.,  1.],
       [ 2.,  2.,  1.,  1.]])



In [22]:

    
doubledata[:3, 36:]









    Out[22]:





array([[ 4.,  6.,  0.,  0.],
       [ 2.,  2.,  0.,  2.],
       [ 4.,  4.,  2.,  2.]])

we can also do arithmetic with another array of the same shape (same dimensions)



In [23]:

    
tripledata = doubledata + data



In [24]:

    
print('tripledata:')
print(tripledata[:3, 36:])









    



tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
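
The element-by-element behaviour is easier to verify by hand on a tiny array. A sketch, assuming a hypothetical 2x2 array named small:

```python
import numpy

# Hypothetical 2x2 array, small enough to check by hand.
small = numpy.array([[1.0, 2.0],
                     [3.0, 4.0]])

doubled = small * 2.0      # every element is multiplied by 2
tripled = doubled + small  # elements in matching positions are added

print(doubled)
print(tripled)
```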

we can do more than simple arithmetic
let's take average inflammation for patients



In [25]:

    
print(data.mean())

mean is a method of the array (a function that belongs to it)
variables are nouns, methods are verbs: they are what the thing knows how to do
for mean we need the empty parentheses () even if we aren't passing in parameters, to tell Python to go and do something
data.shape doesn't need () because it's just a description
NumPy arrays have lots of useful methods:



In [26]:

    
print('maximum inflammation: ', data.max())
print('minimum inflammation: ', data.min())
print('standard deviation:', data.std())









    



maximum inflammation:  20.0
minimum inflammation:  0.0
standard deviation: 4.61383319712
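
The difference between a method and an attribute can be seen on a small array. A sketch, assuming a hypothetical 2x2 array named small:

```python
import numpy

# Hypothetical 2x2 array with values that are easy to average by hand.
small = numpy.array([[1.0, 2.0],
                     [3.0, 4.0]])

# Methods need () because we are asking the array to do something.
print(small.mean())
print(small.max(), small.min())

# An attribute is just a description, so no () is used.
print(small.shape)

# Without (), small.mean is the method itself, not its result.
print(callable(small.mean))
```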

however, we are usually more interested in partial stats, e.g. max value per patient or the avg value per day
we can create a new subset array of the data we want



In [34]:

    
%matplotlib inline



In [36]:

    
import matplotlib.pyplot as plt



In [37]:

    
data









    Out[37]:





array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])

let's visualize this data with the matplotlib library
first we import the pyplot module from matplotlib (done above as plt)



In [38]:

    
plt.imshow(data)









    Out[38]:





<matplotlib.image.AxesImage at 0x10c26a2e8>



In [39]:

    
image = plt.imshow(data)
plt.savefig('timsheatmap.png')

nice, but IPython/Jupyter provides us with 'magic' functions, and one of them (%matplotlib inline) lets us display our plot inline
% indicates an IPython magic function

what if we need max inflammation for all patients, or the average for each day?
most array methods let us specify the axis we want to work on



In [44]:

    
avg_inflam = data.mean(axis=0) #axis 0 averages down the rows (patients), giving one value per day



In [45]:

    
print(data.mean(axis=0))









    



[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]



In [46]:

    
print(data.mean(axis=0).shape) #1-D array of 40 averages, one per day









    



(40,)



In [47]:

    
print(data.mean(axis=1)) #avg inflam per patient across all days









    



[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]



In [48]:

    
print(data.mean(axis=1).shape)









    



(60,)
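
A sketch of what the axis argument does, assuming a hypothetical array named small with 2 patients (rows) and 3 days (columns):

```python
import numpy

# Hypothetical data: 2 patients (rows) x 3 days (columns).
small = numpy.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0]])

per_day = small.mean(axis=0)      # average down the rows: one value per column
per_patient = small.mean(axis=1)  # average across the columns: one value per row

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The axis we name is the one that disappears from the shape: axis=0 turns (2, 3) into (3,), and axis=1 turns it into (2,).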

now let's look at avg inflammation over days (columns)



In [49]:

    
print(avg_inflam)









    



[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]



In [51]:

    
day_avg_plot = plt.plot(avg_inflam)

avg_inflam holds the average per day across all patients; day_avg_plot holds what plt.plot returns
matplotlib creates and displays a line graph of those values



In [52]:

    
data.mean(axis=0).shape









    Out[52]:





(40,)



In [53]:

    
data.shape









    Out[53]:





(60, 40)



In [54]:

    
data.mean(axis=1).shape









    Out[54]:





(60,)



In [55]:

    
max_plot = plt.plot(data.max(axis=0))

Create a plot showing the standard deviation (numpy.std) of the inflammation data for each day across all patients.
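
As a starting point, here is a sketch of the calculation only, assuming a hypothetical 2x3 array named small in place of the inflammation data. Passing the result to plt.plot, as was done for avg_inflam, would then draw it.

```python
import numpy

# Hypothetical data: 2 patients (rows) x 3 days (columns).
small = numpy.array([[1.0, 2.0, 3.0],
                     [3.0, 2.0, 7.0]])

# axis=0 works down the rows, so we get one standard deviation per day.
std_per_day = numpy.std(small, axis=0)

print(std_per_day)
print(std_per_day.shape)
```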



In [ ]: