In these exercises you'll use a real-life medical dataset to learn how to obtain basic statistics from the data. This dataset comes from Gluegrant, an American project that aims to find which genes are most important for the recovery of severely injured patients!
The dataset covers 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control. The dataset is composed of clinical values:
In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML, Image
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
CSS = """
.output {
flex-direction: row;
}
"""
complete_data = pd.read_csv("../data/Exercises_Summary_Statistics_Data.csv")
complete_data = complete_data.set_index('Patient_id')
The dimensions of the dataset are:
In [2]:
complete_data.shape
Out[2]:
Let's take a look:
In [3]:
complete_data.iloc[:, 0:15].head()
Out[3]:
You can consider that this dataset comes from an online shopping service like Amazon. Imagine that they were conducting an A/B test, where a small part of their website was changed, like the related items suggestions. You have 2 groups: the "control group", which is experiencing the original website (without modifications), and Group 1, which is using the website with the new suggestions.
Consider also that the genes are products or product categories, and that the customers buy a certain amount of each. Your objective now is to find out whether there is a significant difference between the control group and Group 1.
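One possible way to check for a difference between two groups is a permutation test on the difference of means. The sketch below is only an illustration under stated assumptions: the values are made up, and `permutation_test` is a hypothetical helper, not something used elsewhere in these exercises.

```python
import numpy as np

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by permuting the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # The add-one correction keeps the estimated p-value above zero
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up values, for illustration only
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1]
group_1 = [5.0, 5.3, 4.9, 5.1, 5.2, 4.7]

observed, p_value = permutation_test(control, group_1)
print("Observed difference of means:", observed)
print("Estimated p-value:", p_value)
```

A large p-value here would mean that a difference this size is easy to get just by shuffling the group labels, so it would not count as significant.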
In [5]:
male_patients = complete_data[complete_data.Sex == "male"]
female_patients = complete_data[complete_data.Sex == "female"]
# Mean
male_mean_age = male_patients.Age.mean()
female_mean_age = female_patients.Age.mean()
# Median
male_median_age = male_patients.Age.median()
female_median_age = female_patients.Age.median()
# Std
male_std_age = male_patients.Age.std()
female_std_age = female_patients.Age.std()
print("The male mean age is:", male_mean_age, "The median age is:", male_median_age,
      "and the standard dev is:", male_std_age)
print("The female mean age is:", female_mean_age, "The median age is:", female_median_age,
      "and the standard dev is:", female_std_age)
In [6]:
display(male_patients.Age.quantile(q=[0,1/4,1/2,3/4,1]))
display(female_patients.Age.quantile(q=[0,1/4,1/2,3/4,1]))
There is almost no difference between the sexes! Really strange to see such close numbers...
We have a column named Result that has the information of what happened to the patient. It has both happy and tragic information. Let's first check out how many different results there are.
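The per-sex statistics computed above by filtering each group by hand can also be obtained in one pass with `groupby`. This is a minimal sketch on made-up toy data; only the column names `Sex` and `Age` mirror the dataset.

```python
import pandas as pd

# Toy data (made up); only the column names Sex and Age mirror the dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Age": [30, 25, 40, 35, 50, 45],
})

# One row per sex, one column per statistic
age_summary = toy.groupby("Sex")["Age"].agg(["mean", "median", "std"])
print(age_summary)
```

On the real data, the same call on `complete_data` should give the numbers printed above, assuming `Sex` only contains the values "male" and "female".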
In [7]:
# Let's first remove the control patients. Those patients don't have a result since they weren't injured.
patient_data = complete_data[~complete_data.Group.isin(["Control"])]
patient_data.Result.unique()
Out[7]:
Ok, we have 8 types of outcomes for the patients. One of them is control; ignore that one, it's a problem with the dataset.
Let's check the numbers for each of these outcomes.
In [8]:
patient_data.Result.value_counts()
Out[8]:
Ok, so, good news, most of our patients survived the injury! :)
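To turn counts like these into proportions, `value_counts` accepts `normalize=True`. The sketch below uses made-up data; "Survived" is a placeholder label and not one of the dataset's actual outcome names.

```python
import pandas as pd

# Made-up outcomes; "Survived" is a placeholder label, not a real category
results = pd.Series(
    ["Survived", "Survived", "Survived", "09: Death"], name="Result"
)

# Fraction of patients per outcome instead of raw counts
proportions = results.value_counts(normalize=True)
print(proportions)
```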
Next step: is there any gene difference between the patients that survived the injury and those that didn't?
(Optional): This question is very difficult to answer, and in biostatistics we use something called survival analysis to model the patient's outcome according to a set of variables. Here we won't do that, but we will attempt to get a nice result!
Let's check if there are any genes that have very different values in the patients that survived and the ones that didn't!
In [9]:
patients_death = patient_data[patient_data.Result == "09: Death"]
patients_alive = patient_data[patient_data.Result != "09: Death"]
gene_names = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]
display(patients_death[gene_names].describe())
display(patients_alive[gene_names].describe())
HTML('<style>{}</style>'.format(CSS))
Out[9]:
Looking at the mean, Gene4 seems to be a good one to predict the death of the patient, since it is much higher in the dead patients than in the alive ones. But the median (50% in the tables) says otherwise. I smell something fishy, let's see the plot of the data!
In [10]:
display(Image('./profileGraph.png', width=2000, unconfined=True))
So, it seems like our result was caused by an outlier! Therefore, there is no clear difference between the expression of Gene4 in the dead and alive patients. (From the dataset of 55k genes I didn't find any gene with a significant difference between the groups).
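To see why a single outlier can mislead the mean while leaving the median almost untouched, here is a small sketch. The numbers are made up and are not taken from the Gene4 column.

```python
import pandas as pd

# Made-up expression values, for illustration only
typical = pd.Series([1.0, 1.2, 0.9, 1.1, 1.0])
with_outlier = pd.concat([typical, pd.Series([50.0])], ignore_index=True)

print("Without outlier: mean =", typical.mean(), "median =", typical.median())
print("With outlier:    mean =", with_outlier.mean(), "median =", with_outlier.median())
```

The mean jumps by almost an order of magnitude, while the median barely moves. This is the same pattern as in the `describe()` tables above.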