Learning Programming Through Examples
This is the IPython notebook. You access it through a browser window, but it is actually a program running on the super-computer on Will's desk. I have done some work to make sure that nothing you do here will screw up the computer, so feel free to experiment! This is a 'real' Python interpreter and can do anything that Python can do. Evaluate the next set of code blocks (use 'shift-enter') to run the code in a cell.
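The original cell is not shown here, but something along these lines produces the outputs below (the literals 500 and 300 are a guess reconstructed from the results):

print 'Subtraction', 500 - 300
print 'Multiplication', 500 * 300
print 'Raise to the Power', 500 ** 30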
Subtraction 200
Multiplication 150000
Raise to the Power 931322574615478515625000000000000000000000000000000000000000000000000000000000000
While having a $10,000 calculator is pretty cool, real programming involves using variables. Think of variables as just NAMES. Names can be anything WITHOUT SPACES, dashes, &, or other weird characters.
Then you can do anything with a variable that you can with a 'literal' (something you typed out).
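For example (the variable names and values here are a guess consistent with the outputs below):

this_is_a_number = 5
another_number = 500
print 'Subtraction', this_is_a_number - another_number
print 'Multiplication', this_is_a_number * another_number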
Subtraction -495
Multiplication 2500
You can also do interesting things, like multiplying a string (a variable holding text) by a number. Can you guess what will happen?
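Something like this (the name ThisIsAName comes from the traceback further down; the value 'John Doe' is inferred from the output):

ThisIsAName = 'John Doe'
print ThisIsAName * this_is_a_number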
John DoeJohn DoeJohn DoeJohn DoeJohn Doe
Oooooo ... repeated concatenation! Think of it this way: what would you expect to happen if you had "5 times John Doe"? What do you think would happen if you added a number and a string?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-f6f20bfb61c6> in <module>()
----> 1 print ThisIsAName+this_is_a_number
TypeError: cannot concatenate 'str' and 'int' objects
Whoops, it doesn't understand. Python thinks we are trying to concatenate a string and a number. But numbers and strings aren't the same "Type" (duh). So it raises a TypeError and gives you a little description of what went wrong. This is called a Traceback ... when you're learning programming you'll see a lot of these. We'll go over them in more detail later ... but the important thing to know now is "read them from the bottom to the top".
Single variables are for chumps! Lists, tuples (pronounced 'two-pulls'), sets, and dictionaries are ways to store multiple items under one name. Each has advantages and disadvantages.
Items can be retrieved from a list or tuple by their index (position in the list). Just remember, computers start counting at ZERO!!!!
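A sketch of the kind of cell that produces the output below (the values match the patient table later on; the ages go in a tuple and the weights in a list, which matters for the mutability demo that follows, and 2.2 converts kilos to pounds):

patient_ages = (23, 45, 89, 17, 56, 94)     # a tuple
patient_weights = [66, 76, 54, 78, 61, 60]  # a list
print "Patient 4's age:", patient_ages[3]   # counting starts at ZERO
print "Patient 4's weight:", patient_weights[3], 'in kilos'
print 'In pounds:', patient_weights[3] * 2.2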
Patient 4's age: 17
Patient 4's weight: 78 in kilos
In pounds: 171.6
There is one big difference between lists and tuples. Items in a list can be changed, while items within a tuple cannot be. This is called 'mutability'.
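Changing the list works (a sketch matching the output below):

print 'Original:', patient_weights
patient_weights[4] = 5000                 # lists are mutable
print 'After changing:', patient_weights
patient_weights[4] = 61
print 'After changing back', patient_weights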
Original: [66, 76, 54, 78, 61, 60]
After changing: [66, 76, 54, 78, 5000, 60]
After changing back [66, 76, 54, 78, 61, 60]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-af836c102b16> in <module>()
1 #This does not
----> 2 patient_ages[3] = 12
3 print patient_ages
TypeError: 'tuple' object does not support item assignment
Dictionaries (pronounced and typed as 'dict') are UNORDERED mappings between keys and values. Just like in a normal dictionary, a key (kinda like a word) references one (and only one) value (kinda like a definition). Here are some examples.
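A sketch (the names and numbers are taken from the outputs below):

patient_numbers = {'John': 0, 'Jamie': 1, 'Jack': 2, 'Jill': 3, 'Jeb': 4, 'Jess': 4}
print 'Jack is patient #', patient_numbers['Jack']
print 'Jeb is patient #', patient_numbers['Jeb']
print patient_numbers                   # note: NOT in the order we typed it
print sorted(patient_numbers.keys())    # the keys in alphabetical order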
Jack is patient # 2
Jeb is patient # 4
{'Jamie': 1, 'Jeb': 4, 'Jess': 4, 'Jill': 3, 'Jack': 2, 'John': 0}
['Jack', 'Jamie', 'Jeb', 'Jess', 'Jill', 'John']
So this lets us do some cool things:
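Like combining the dict with the tuple and list from before (a sketch consistent with the outputs):

print "Jack's Age:", patient_ages[patient_numbers['Jack']]
print "John's Weight in Kilos:", patient_weights[patient_numbers['John']]
print "John's Weight in pounds:", patient_weights[patient_numbers['John']] * 2.2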
Jack's Age: 89
John's Weight in Kilos: 66
John's Weight in pounds: 145.2
Lists, dicts, and tuples allow us to do a cool thing called 'iteration'. This is one of the fundamental concepts in programming: defining a set of rules for handling an 'item' and then iterating through multiple 'items', doing the same thing to each. So as an example, let's print the weight in pounds of each patient.
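A sketch (in Python 2, looping over a dict walks through its keys):

for name in patient_numbers:
    kilos = patient_weights[patient_numbers[name]]
    print name, 'weighs', kilos * 2.2, 'in pounds'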
Jamie weighs 167.2 in pounds
Jeb weighs 134.2 in pounds
Jess weighs 134.2 in pounds
Jill weighs 171.6 in pounds
Jack weighs 118.8 in pounds
John weighs 145.2 in pounds
However, keeping all of these things (age, weight, name, etc.) in separate variables gets annoying, especially as we start adding the hundreds of other things we could measure. So we're going to put them in a 'DataFrame'. Think of a DataFrame like an Excel worksheet ... it has rows and columns, and each position has a value (or is empty, something we'll cover later). However, Python doesn't have a native DataFrame (or anything like it); thankfully, people much smarter than anyone here have developed THOUSANDS of tools that can be accessed through the 'import' command. We'll cover them as we go, but in general there is a Python module that can do almost anything you can think of!
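A sketch of the cell that builds the table below (the variable name patient_data is my own):

import pandas as pd

patient_data = pd.DataFrame({'Age': patient_ages, 'Weight': patient_weights})
print patient_data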
Age Weight
0 23 66
1 45 76
2 89 54
3 17 78
4 56 61
5 94 60
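And the summary numbers below come from the built-in column methods (a sketch):

print 'Average age', patient_data['Age'].mean()
print 'Smallest weight', patient_data['Weight'].min()
print 'Oldest Age', patient_data['Age'].max()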
Average age 54.0
Smallest weight 54
Oldest Age 94
Out[23]:
   Age  Weight
0   23      66
1   45      76
2   89      54
3   17      78
4   56      61
5   94      60
Once you 'grok' the concept of a DataFrame and all of the cool things you can do then you'll have almost limitless ability in dealing with 'tabular' data.
For example, what if we wanted to find the number of patients with an Age > 50?
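A sketch using a boolean 'mask' (old_mask is my own name):

old_mask = patient_data['Age'] > 50                     # a Series of True/False, one per row
print 'Number of patients over 50:', old_mask.sum()     # True counts as 1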
We can also use that 'mask' to index back into the DataFrame to answer a question like "Of all the patients over 50, what is their average weight?"
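For instance (a sketch; the ~ flips the mask):

print 'Average weight of the over-50s:', patient_data['Weight'][old_mask].mean()
print 'Average weight of the rest:', patient_data['Weight'][~old_mask].mean()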
Hmm, it seems like young people weigh (on average) more than old people. Wouldn't it be cool if we could do a quick t-test to see if the difference is significant? As I said before, Python has a module to do everything! Statistical functions like that are in the module "scipy.stats". You can see a full list of stat tests here: http://docs.scipy.org/doc/scipy/reference/stats.html (scroll about half-way down to find ttest_ind)
There is a LOT of information in there. It takes practice to learn what the important things are but I'll walk you through this one.
"Definition: ttest_ind(a, b, axis=0, equal_var=True)"
This tells us how to 'call' the function. It takes 2 'positional arguments', "a" and "b", and some optional arguments. We'll ignore the optional ones for now.
"Docstring:
Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
This is a two-sided test for the null hypothesis that 2 independent samples
have identical average (expected) values. This test assumes that the
populations have identical variances."
This tells us what the function does. All good libraries have some documentation telling what each function does.
"
Parameters
----------
"
This tells what each input 'means'. You can see the first two are 'array-like', which is just math/programming speak for "list of numbers". It tells us what the 'axis' argument does, but it's just Greek for now, so ignore it. It also tells us that we can tell the function whether we assume the populations have equal variance. For now we'll assume 'yes', which is the default.
"
Returns
-------
t : float or array
The calculated t-statistic.
prob : float or array
The two-tailed p-value.
"
This tells us that it returns two values, the t-statistic and the two-tailed p-value.
So we use it like this:
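A sketch, splitting the two groups with the mask from above:

from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(patient_data['Weight'][~old_mask],
                          patient_data['Weight'][old_mask])
print t_stat, p_val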
Wow, pretty cool! We can say that (in this made-up data) old people have a statistically significantly (p<0.05) lower body weight than younger patients. But this is made-up, fake data. Would this be true in real data? Like the stuff we have from real patients? Well, let's see!
I have done a lot of work to get data from REDCap into Python as a DataFrame so we can analyze questions like this. So you can load the REDCap data into this Python workspace and start asking. How I did this is beyond the scope of this tutorial, and there is a little bit of 'magic' required to get it in. I'll explain as I go.
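Assuming the 'magic' loading cell has produced a DataFrame called real_pat_data (the name that shows up in the docstrings below), a summary of the column types looks something like:

print real_pat_data.get_dtype_counts()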
From this you can see that there are 425 columns in the table: 115 of them are numbers (float), 171 are True/False (boolean), and 139 are something else (object) [usually these are strings]. But that still doesn't give us much information about WHAT is in there. This will give us a better idea:
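A sketch:

for col in real_pat_data.columns:
    print col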
From that giant block of text we can see all the columns that we have. To answer our question we only need a small subset of them. So we'll pull out the Age and Weight columns.
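A sketch (real_data is my own name, and I'm assuming the columns are literally called 'Age' and 'Weight'):

real_data = real_pat_data[['Age', 'Weight']]
real_data.info()   # prints a summary, including how many non-null values each column has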
Now you can start to see some of the hidden messiness of dealing with REAL data. Not all of the data is there: we have 1419 entries, but only 1397 of them have an Age and only 544 of them have a Weight. We'll work with what we have.
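First, compare the group means with the same mask trick as before (real_old_mask is the variable name this tutorial uses further down):

real_old_mask = real_pat_data['Age'] > 50
print 'Over-50 average weight:', real_data['Weight'][real_old_mask].mean()
print 'Under-50 average weight:', real_data['Weight'][~real_old_mask].mean()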
Looks promising. I wonder what the p-value is on that.
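A sketch:

print ttest_ind(real_data['Weight'][~real_old_mask],
                real_data['Weight'][real_old_mask])
# -> (nan, nan): the missing values poison the calculation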
Oooo, here we get into more of the problems with missing data. The t-test function doesn't know what to do with missing data (NaN), so it choked and returned a NaN. We need to think for a second about what to do with missing data. It really depends on your situation, but in this one the "right" answer is to just drop it. In this case we need to drop the patients who are missing an Age, a Weight, or BOTH. DataFrames have an easy way to deal with this, called "dropna". Let's look at the docstring.
Again, lots of stuff but only a little is important.
"Definition: real_pat_data.dropna(self, axis=0, how='any', thresh=None, subset=None)"
How to call the function. The 'self' argument is added automatically due to "magic", so you don't need to supply it. As you get more programming experience it will make more sense why, but for right now just know: skip anything that says 'self'.
"
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing"
This tells us that it will drop items from the dataset if there are missing values.
Among the parameters, we need to pay attention to a few:
axis : {0, 1}, or tuple/list thereof
Pass tuple or list to drop on multiple axes
This lets us change which 'axis' it will work on. Think of axis Zero as the 'rows' and axis One as the 'columns'. In our case we want to drop ROWS which have missing values not COLUMNS. So we'll use "axis = 0".
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
This lets us control how stringent the criterion is. Will it drop a row if ANY values are missing, or only if ALL are missing?
thresh : int, default None
int value : require that many non-NA values
If ANY or ALL don't fit, we can instead require a minimum number of non-missing values. For example, thresh=45 on a 50-column DataFrame drops any row with more than 5 empty columns.
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
This is for when we only want it to consider a SUBSET of the columns for checking missing values.
So in our case we want to drop the rows of the dataset in which any of the columns are empty.
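A sketch (full_data is the name referenced later in this tutorial):

full_data = real_data.dropna(axis=0, how='any')
full_data.info()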
So, we have 543 rows in which we have both an Age and a Weight. Now we can do something.
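A sketch:

print ttest_ind(full_data['Weight'][~real_old_mask],
                full_data['Weight'][real_old_mask])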
Wow, it worked! Old people do weigh less than young people (p<0.05). We also did something else REALLY cool behind the scenes ... you didn't even see it, did you?
We used the 'mask' (real_old_mask) calculated on the original dataset (real_pat_data) on the filtered dataset (full_data). If you look at the "Index" in the print-out from each dataset you can get a hint as to how this happened.
Pandas was smart enough to go through and match the rows which have the same row labels. This can be good and bad. It makes life much easier but can sometimes lead to errors if you're not careful!
But there's an error in our statistics logic ... the code is right, but our analysis is wrong. You know that the dataset has patients with multiple visits (that whole longitudinal thingy). So some patients are counted more than once, and their ages/weights are likely to be similar across visits. If our cohort has a bias (intentional or accidental) in the age or weight selection for returning patients, then it will bias our test. So we should probably consider only one Age and one Weight for each patient. There are COUNTLESS ways to do this; I'm going to show a few and use them as ways to show you other cool Python-y things.
First let's look at each patient and find their average age and average weight. To do that we're going to use the 'groupby' function to group rows by Patient ID and then calculate the average age and weight.
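A sketch (I'm assuming the identifier column is literally called 'Patient ID'):

mean_data = real_pat_data.groupby('Patient ID').mean()   # one row per patient
mean_data.info()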
You can see that it also changed the 'index' of the mean_data DataFrame. It is now indexed by Patient ID ("Index: 507 entries, A0001 to A0514"), and we can also see that many of the patients do not have Ages and Weights for ANY of their visits. So we'll work with the data we've got and see what happens.
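A sketch of the cell that blows up (full_mean_data is the name used in the discussion below):

full_mean_data = mean_data.dropna(how='any', subset=['Age', 'Weight'])
print ttest_ind(full_mean_data['Weight'][~real_old_mask],
                full_mean_data['Weight'][real_old_mask])
# raises: ValueError: cannot index with vector containing NA / NaN values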
Whoops, an error! Hmmm, what's happening? It says: "ValueError: cannot index with vector containing NA / NaN values". That doesn't make much sense ... we dropped all the NaN values from "full_mean_data" and we've used this "real_old_mask" variable before and it worked just fine. What could it mean? Let's deconstruct the code above and see where the error is.
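A more verbose version of the same thing (a sketch; the exact original code is a guess):

young_weights = full_mean_data['Weight'][~real_old_mask]
old_weights = full_mean_data['Weight'][real_old_mask]
print ttest_ind(young_weights, old_weights)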
Still an error (duh, I wrote the same code, just more 'verbose'). But if we read it bottom-to-top, you can see something: the error happens when we try to pull out the weights. Hmmmm. Let's look at the data and see if there's something fishy.
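A sketch:

print full_mean_data.head()
print real_old_mask.head()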
Eureka! Do you see it? The first column in both tables is the Index. The data is indexed by Patient ID, and the mask is indexed by numbers. Those numbers refer to the rows in the 'real_pat_data' matrix ... so they can't be mapped to Patient IDs. So we just need to make a new mask! Remember how I said above that Pandas is smart ... well, sometimes it's just smrt.
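A sketch (mean_old_mask is my own name):

mean_old_mask = full_mean_data['Age'] > 50   # same trick, but on the Patient-ID-indexed data
print mean_old_mask.head()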
There we go! Now we know which patients have an average age > 50! Now we can go back to our original question.
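A sketch:

print ttest_ind(full_mean_data['Weight'][~mean_old_mask],
                full_mean_data['Weight'][mean_old_mask])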
Interesting ... when we correct for the fact that we have patients with multiple visits (and therefore the ages and weights from each row are not TRULY independent) we lose significance. Maybe the 'average' weight and 'average' age is not the correct way to think about it. It would make more sense to look at a single visit for each patient and take the Age and Weight from that visit. But which visit? To be 'fair' we need to do it on the same visit for each patient. One could make a case for any particular visit, so I'm going to say their 'initial visit' (R00), since everyone has an R00.
A little note about the database: it's naive to believe that it's 'sorted' in any particular order. Make sure your tests don't rely on that; always check first.
Notice something? Not every visit has a Visit Number.
There are many ways to do this, but I'll show the simplest.
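A sketch (the names are my own, and I'm assuming the visit column is literally called 'Visit Number'; conveniently, comparing against 'R00' just skips the rows where the Visit Number is missing):

first_visits = real_pat_data[real_pat_data['Visit Number'] == 'R00']
intake_data = first_visits[['Patient ID', 'Age', 'Weight']].dropna(how='any')
intake_data.info()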
Looks like it worked. Let's see what the result is.
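A sketch:

intake_old_mask = intake_data['Age'] > 50
print ttest_ind(intake_data['Weight'][~intake_old_mask],
                intake_data['Weight'][intake_old_mask])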
Wow! Not even CLOSE! That seems weird, doesn't it? When I look at the average weight across all visits and compare young/old I'm almost significant, but when I look only at the intake visit it's way off. I wonder what's happening here. Let's look at the data. We can use some plotting to see what's happening. Why don't we look at the weight at each visit? We'll use a boxplot to do this.
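DataFrames have a boxplot method built in (a sketch, again assuming a 'Visit Number' column):

real_pat_data.boxplot(column='Weight', by='Visit Number')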
It seems like the R09 and R10 visits have different variances and much lower medians. Maybe if we exclude those we'll get answers that match better.
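A sketch of that exclusion:

early_weights = full_data['Weight'][~real_pat_data['Visit Number'].isin(['R09', 'R10'])]
print ttest_ind(early_weights[~real_old_mask],
                early_weights[real_old_mask])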
Hmm, that didn't seem to help. It still looks like the average weight of patients less than 50 years old is different from the average weight of patients older than 50. But why is it SO different from the result where we calculate based on the intake visit? Let's look again.
Hmmmm, I don't see anything else useful here. The variance of the age/weight decreases as we look at later visits, but that's to be expected (fewer patients make it to the later visits). Maybe it's something about the 'averaging'. Maybe young people and old people have different variances in their weights. We could look at that by calculating the standard deviation of the weight of each patient.
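A sketch (weight_std is my own name; the double brackets keep it a DataFrame with a 'Weight' column, which matters for the merge coming up):

weight_std = real_pat_data.groupby('Patient ID')[['Weight']].std()
print weight_std.describe()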
Hmm, so some people have a HUGE variation in weight (a 67 lbs standard deviation is a lot). Let's see if that is correlated with Age. So let's put it in with the Age at their first visit.
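The naive attempt looks something like this (the exact failing code is a guess):

intake_data['Weight STD'] = weight_std['Weight'].values   # 507 values vs. a different number of rows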
Hmmm, it won't let us put them together, saying "Length of values does not match length of index" ... implying that the two items don't match properly. Since this is a common issue, Pandas has a way to deal with it, called "merge".
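A sketch, matching intake_data's 'Patient ID' column against weight_std's Patient ID index:

combined = pd.merge(intake_data, weight_std, left_on='Patient ID', right_index=True)
print combined.columns   # Weight_x and Weight_y show up here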
Hmmm, what's going on with that Weight_x, Weight_y? This happens when you do a 'merge' and the two DataFrames have columns with the same name.
So we can rename the column in one and then it will work fine.
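A sketch:

weight_std = weight_std.rename(columns={'Weight': 'Weight STD'})
combined = pd.merge(intake_data, weight_std, left_on='Patient ID', right_index=True)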
Now we can go back to our earlier question: does the weight vary more or less with age, and does that account for our earlier observations?
Hmm, there's no scatter-plot method on a DataFrame. I wonder if there's another package that does it? Why don't you google 'scatter plot in python'. What is the package in the first few results?
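That package is matplotlib. A sketch:

import matplotlib.pyplot as plt

plt.scatter(combined['Age'], combined['Weight STD'])
plt.xlabel('Age at first visit')
plt.ylabel('Weight STD')
plt.show()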
Hmmm, I don't see much of a trend. However, that one patient is WAY out there in terms of weight variation, and they're right on the cusp of 50. What happens if we drop them from the analysis? First we have to find them.
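A sketch (idxmax finds the row label of the largest value):

outlier = combined['Weight STD'].idxmax()
print combined.loc[outlier]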
If we wanted to drop a patient from a dataset what do you think the method will be?
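Yep, it's called 'drop' (a sketch):

trimmed = combined.drop(outlier)
trimmed_old_mask = trimmed['Age'] > 50
print ttest_ind(trimmed['Weight'][~trimmed_old_mask],
                trimmed['Weight'][trimmed_old_mask])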
Hmm, that seemed to help fix the p-value, but not much. I wonder if there's something else we can find.