Analyzing patient data

  • Words are useful, but what’s more useful are the sentences and stories we build with them.
  • A lot of powerful tools are built into languages like Python, but even more live in the libraries built on top of them.
  • We need to import a library called NumPy
  • Use this library to do fancy things with numbers (e.g. if you have matrices or arrays).

In [2]:
import numpy
  • Importing a library is akin to getting lab equipment out of a locker and setting it up on the bench
  • Libraries provide additional functionality
  • With NumPy loaded we can read the CSV file into Python.

In [3]:
#assuming the data file is in the data/ folder
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')


Out[3]:
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
  • numpy.loadtxt() is a function call; it runs the loadtxt function that belongs to numpy
  • uses dot notation to access thing.component
  • two parameters: the file name and the delimiter, both character strings (written in quotes)
  • we didn't save the result in memory by assigning it to a variable
  • variable names in Python cannot start with a digit and are case sensitive
  • assignment operator is =
  • let's look at assigning this inflammation data to a variable
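
The points above can be tried without the data file. The sketch below is only an illustration: it assumes a tiny in-memory CSV (the name csv_text and its contents are made up) in place of data/inflammation-01.csv, and relies on numpy.loadtxt accepting a file-like object as well as a file name.

```python
import io
import numpy

# Hypothetical stand-in for a CSV file: two rows of three comma-separated values.
csv_text = io.StringIO("0,0,1\n0,1,2\n")

# A function call with two parameters; assigning the result to a variable
# keeps the array in memory instead of only displaying it.
small_data = numpy.loadtxt(fname=csv_text, delimiter=',')

print(small_data)
```

Because the result is stored in small_data, later cells can reuse it without reading the text again.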

In [4]:
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')

In [5]:
print(data)


[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

In [7]:
weight_kg = 55 #assigns value 55 to weight_kg

In [8]:
print(weight_kg) #we can print to the screen


55

In [9]:
print("weight in kg", weight_kg)


weight in kg 55

In [15]:
weight_kg = 70

In [11]:
print("weight in kg", weight_kg)


weight in kg 70
  • print above shows several things at once when we separate them with commas
  • think of a variable as a sticky note put on a value
  • this means assigning a value to one variable does not change the values of other variables.

In [12]:
weight_kg * 2


Out[12]:
140

In [16]:
weight_lb = weight_kg * 2.2

In [17]:
print('weight in lb:', weight_lb)


weight in lb: 154.0

In [18]:
print("weight in lb:", weight_kg*2.2)


weight in lb: 154.0
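
The sticky-note idea from the bullets earlier can be checked directly. This is a minimal sketch that reuses the variable names from the cells above; the order of the assignments is chosen for illustration.

```python
weight_kg = 55
weight_lb = 2.2 * weight_kg   # computed from the current value of weight_kg

weight_kg = 70                # move the weight_kg sticky note to a new value

# weight_lb still holds the number computed from 55; it is not recalculated.
print("weight in kg:", weight_kg)
print("weight_lb unchanged:", weight_lb == 2.2 * 55)
```

To get the weight in pounds for the new value, the multiplication has to be run again.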

In [19]:
print(data)


[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

whos #IPython command to see which variables & modules you have


In [20]:
whos


Variable    Type       Data/Info
--------------------------------
data        ndarray    60x40: 2400 elems, type `float64`, 19200 bytes
numpy       module     <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
weight_kg   int        70
weight_lb   float      154.0

What does the following program print out?

first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)

In [21]:
print(data)


[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

In [22]:
print(type(data)) #we can get type of object


<class 'numpy.ndarray'>
  • data refers to N-dimensional array
  • data corresponds to the patients' inflammation
  • let's look at the shape of the data

In [23]:
print(data.shape)


(60, 40)
  • data has 60 rows and 40 columns
  • when we created data, NumPy also created members (attributes) that describe it
  • this extra info describes data the way an adjective describes a noun
  • dot notation to access members
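
As a rough illustration of attributes, the sketch below builds a stand-in array of zeros with the same shape as the inflammation data (the name example is made up) and reads a few attributes that NumPy arrays carry.

```python
import numpy

# Stand-in array with the same shape as the inflammation data (all zeros).
example = numpy.zeros((60, 40))

# Attributes are read with dot notation and no parentheses.
print(example.shape)   # number of rows and columns
print(example.ndim)    # number of axes
print(example.dtype)   # type of the stored elements
print(example.size)    # total number of elements
```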

In [24]:
print('first value in data', data[0,0]) #use index in square brackets


first value in data 0.0

In [26]:
print('4th value in data', data[0,3]) #use index in square brackets


4th value in data 3.0

In [32]:
print('first value in 4th row data', data[3,0]) #use index in square brackets


first value in 4th row data 0.0

In [31]:
!head -3 data/inflammation-01.csv


0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1

In [33]:
print('middle value in data', data[30,20]) # get the middle value


middle value in data 13.0
  • programming languages like MATLAB and R start counting at 1
  • languages in the C family (C++, Java, Perl & Python) start counting at 0
  • if we have an MxN array in Python, indices go from 0 to M-1 on the first axis and 0 to N-1 on the second
  • indices are (row, column)
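
A small made-up array makes the index range easy to see. This sketch assumes a 3x4 array named grid, so M = 3 and N = 4.

```python
import numpy

# Small 3x4 example array (M = 3 rows, N = 4 columns) holding the values 0..11.
grid = numpy.arange(12).reshape(3, 4)

first = grid[0, 0]   # first row, first column
last = grid[2, 3]    # last valid position: row M-1, column N-1

# Index M (here 3) is one past the end of the first axis, so it raises IndexError.
try:
    grid[3, 0]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, out_of_range)
```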

In [16]:
data[0:4, 0:10] #select whole sections of matrix, 1st 10 days & 4 patients


Out[16]:
array([[ 0.,  0.,  1.,  3.,  1.,  2.,  4.,  7.,  8.,  3.],
       [ 0.,  1.,  2.,  1.,  2.,  1.,  3.,  2.,  2.,  6.],
       [ 0.,  1.,  1.,  3.,  3.,  2.,  6.,  2.,  5.,  9.],
       [ 0.,  0.,  2.,  0.,  4.,  2.,  2.,  1.,  6.,  7.]])
  • slice 0:4 means start at 0 and go up to but not include 4
  • up-to-but-not-including takes a bit of getting used to

In [17]:
data[5:10,0:10]


Out[17]:
array([[ 0.,  0.,  1.,  2.,  2.,  4.,  2.,  1.,  6.,  4.],
       [ 0.,  0.,  2.,  2.,  4.,  2.,  2.,  5.,  5.,  8.],
       [ 0.,  0.,  1.,  2.,  3.,  1.,  2.,  3.,  5.,  3.],
       [ 0.,  0.,  0.,  3.,  1.,  5.,  6.,  5.,  5.,  8.],
       [ 0.,  1.,  1.,  2.,  1.,  3.,  5.,  3.,  5.,  8.]])
  • we don't have to include both the upper and the lower bound
  • Python uses 0 by default if we don't include the lower bound
  • if we don't include the upper bound, the slice runs to the end of the axis
  • : on its own will include everything
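
The default bounds can be compared against the explicit ones. The sketch below uses a hypothetical 3x4 array named grid rather than the inflammation data.

```python
import numpy

grid = numpy.arange(12).reshape(3, 4)

# A missing lower bound means 0; a missing upper bound means the end of the axis.
same = numpy.array_equal(grid[:2, 1:], grid[0:2, 1:4])

# A bare colon keeps the whole axis, so this selects every row of column 0.
first_column = grid[:, 0]

# The slice 0:2 covers indices 0 and 1, so it has 2 - 0 = 2 rows.
print(same)
print(first_column)
print(grid[0:2, :].shape)
```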

In [18]:
data[:3, 36:]


Out[18]:
array([[ 2.,  3.,  0.,  0.],
       [ 1.,  1.,  0.,  1.],
       [ 2.,  2.,  1.,  1.]])

A section of an array is called a slice. We can take slices of character strings as well:

element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])
first three characters: oxy
last three characters: gen

What is the value of element[:4]? What about element[4:]? Or element[:]?

What is element[-1]? What is element[-2]? Given those answers, explain what element[1:-1] does.


In [1]:
element = 'oxygen'
print('first three characters:', element[0:3])
print('last three characters:', element[3:6])


first three characters: oxy
last three characters: gen

In [6]:
print(element[:4])
print(element[4:])


oxyg
en

In [7]:
print(element[:])


oxygen

In [14]:
#oxygen
print(element[-1])
print(element[-2])
print(element[2:-1])


n
e
yge
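
For the element[1:-1] question in the exercise, one possible way to reason about it is sketched below; the helper names last_char, middle and same_slice are made up for illustration.

```python
element = 'oxygen'

# Negative indices count back from the end: -1 is the last character.
last_char = element[-1]

# element[1:-1] starts at index 1 and stops before the last character,
# so it drops the first and the last character.
middle = element[1:-1]

# For this 6-character string, -1 refers to the same position as index 5.
same_slice = element[1:len(element) - 1]

print(last_char, middle, middle == same_slice)
```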

In [13]:
doubledata = data * 2.0 #we can perform math on array


  • operation on arrays is done on each individual element of the array

In [20]:
doubledata


Out[20]:
array([[ 0.,  0.,  2., ...,  6.,  0.,  0.],
       [ 0.,  2.,  4., ...,  2.,  0.,  2.],
       [ 0.,  2.,  2., ...,  4.,  2.,  2.],
       ..., 
       [ 0.,  2.,  2., ...,  2.,  2.,  2.],
       [ 0.,  0.,  0., ...,  0.,  4.,  0.],
       [ 0.,  0.,  2., ...,  2.,  2.,  0.]])

In [21]:
data[:3, 36:]


Out[21]:
array([[ 2.,  3.,  0.,  0.],
       [ 1.,  1.,  0.,  1.],
       [ 2.,  2.,  1.,  1.]])

In [22]:
doubledata[:3, 36:]


Out[22]:
array([[ 4.,  6.,  0.,  0.],
       [ 2.,  2.,  0.,  2.],
       [ 4.,  4.,  2.,  2.]])
  • we can also do arithmetic operation with another array of same shape (same dims)
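
A tiny example makes the element-by-element behaviour visible in full. This sketch uses a made-up 2x3 array named a, and also shows what happens when two shapes cannot be combined position by position.

```python
import numpy

a = numpy.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

doubled = a * 2.0        # every element is multiplied by 2
tripled = doubled + a    # elements at matching positions are added

# Shapes (2, 3) and (3, 2) cannot be paired up, so NumPy raises ValueError.
try:
    a + a.T
    shapes_ok = True
except ValueError:
    shapes_ok = False

print(tripled)
print(shapes_ok)
```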

In [23]:
tripledata = doubledata + data

In [24]:
print('tripledata:')
print(tripledata[:3, 36:])


tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
  • we can do more than simple arithmetic
  • let's take average inflammation for patients

In [25]:
print(data.mean())


6.14875
  • mean is a method of the array (function)
  • variables are nouns, methods are verbs - they are what the thing knows how to do
  • for mean we need the empty parentheses () even when we pass no parameters; they tell Python to go and do something
  • data.shape doesn't need () because it's just a description
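
The difference between a method and an attribute can be probed with Python's built-in callable function. The array a below is a made-up example, not the inflammation data.

```python
import numpy

a = numpy.array([[1.0, 2.0], [3.0, 4.0]])

# With parentheses the method runs and returns a number.
average = a.mean()

# Without parentheses we only get a reference to the method; nothing is computed.
method_only = a.mean

print(average)
print(callable(method_only))   # a method can be called
print(callable(a.shape))       # a plain attribute (a tuple) cannot
```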
  • NumPy arrays have lots of useful methods:

In [26]:
print('maximum inflammation: ', data.max())
print('minimum inflammation: ', data.min())
print('standard deviation:', data.std())


maximum inflammation:  20.0
minimum inflammation:  0.0
standard deviation: 4.61383319712
  • however, we are usually more interested in partial stats, e.g. max value per patient or the avg value per day
  • we can create a new subset array of the data we want

In [34]:
%matplotlib inline

In [36]:
import matplotlib.pyplot as plt

In [37]:
data


Out[37]:
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ..., 
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
  • let's visualize this data with the matplotlib library
  • first we import the pyplot module from matplotlib

In [38]:
plt.imshow(data)


Out[38]:
<matplotlib.image.AxesImage at 0x10c26a2e8>

In [39]:
image = plt.imshow(data)
plt.savefig('timsheatmap.png')


  • nice, but IPython/Jupyter provides us with 'magic' functions, and one of them lets us display our plot inline
  • % indicates an IPython magic function
  • what if we need max inflammation for all patients, or the average for each day?
  • most array methods let us specify the axis we want to work on
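
To see which axis does what, it can help to try a small array first. The sketch below assumes made-up data for 2 patients (rows) over 3 days (columns).

```python
import numpy

# Hypothetical data: 2 patients (rows) observed over 3 days (columns).
small = numpy.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0]])

per_day = small.mean(axis=0)      # collapse the rows: one average per column
per_patient = small.mean(axis=1)  # collapse the columns: one average per row

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The length of each result matches the axis that was kept: 3 days for axis=0 and 2 patients for axis=1.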

In [44]:
avg_inflam = data.mean(axis=0) #axis zero gives the average for each day

In [45]:
print(data.mean(axis=0))


[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]

In [46]:
print(data.mean(axis=0).shape) #1-D array with one average per day


(40,)

In [47]:
print(data.mean(axis=1)) #avg inflam per patient across all days


[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]

In [48]:
print(data.mean(axis=1).shape)


(60,)

now let's look at avg inflammation over days (columns)


In [49]:
print(avg_inflam)


[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]

In [51]:
day_avg_plot = plt.plot(avg_inflam)


  • the avg per day across all patients is in avg_inflam; day_avg_plot holds what plt.plot returns
  • matplotlib creates and displays a line graph of those values

In [52]:
data.mean(axis=0).shape


Out[52]:
(40,)

In [53]:
data.shape


Out[53]:
(60, 40)

In [54]:
data.mean(axis=1).shape


Out[54]:
(60,)

In [55]:
max_plot = plt.plot(data.max(axis=0))


Create a plot showing the standard deviation (numpy.std) of the inflammation data for each day across all patients.
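
One possible starting point for this exercise is the numeric part, sketched here on a made-up 3x4 array (3 patients, 4 days) named sample instead of the real data.

```python
import numpy

# Hypothetical stand-in for the inflammation data: 3 patients, 4 days.
sample = numpy.array([[0.0, 1.0, 2.0, 3.0],
                      [0.0, 3.0, 2.0, 1.0],
                      [0.0, 2.0, 2.0, 5.0]])

# axis=0 works down the rows, giving one standard deviation per day.
std_per_day = numpy.std(sample, axis=0)

# The array method gives the same result as the numpy.std function.
same = numpy.allclose(std_per_day, sample.std(axis=0))

print(std_per_day.shape, same)
```

With the real data, passing data.std(axis=0) to plt.plot, as was done for the maximum above, should draw the requested line.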


In [ ]: