In these exercises you'll use a real-life medical dataset to learn how to obtain basic statistics from data. This dataset comes from Gluegrant, an American project that aims to find which genes are most important for the recovery of severely injured patients! It was slightly edited to remove some complexities, but if you wish to check it out in its full glory, it's available on the website and I can show it to you!
Have fun being a biostatistician for 1 hour! :)
In this exercise the objective is for you to learn how to use Pandas functions to obtain simple statistics of datasets.
The dataset is a medical dataset with 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control. The dataset is composed of clinical values:
and of the gene expression (higher = more expressed):
Don't worry if you are not from a biological background; just consider that these genes are simply numeric values related to the patient. We will not delve into the biological meaning of any of the genes; we'll only try to find out whether there are differences between the gene values for the different groups!
If you are still not comfortable using this dataset, imagine this situation instead:
You can consider that this dataset comes from an online shopping service like Amazon. Imagine that they were conducting an A/B test, where a small part of their website was changed, like the related item suggestions. You have 2 groups: the "control group", which is experiencing the original website (without modifications), and Group 1, which is using the website with the new suggestions.
Consider also that the genes are products or product categories of which the customers buy a certain amount. Your objective now is to find out whether there is a significant difference between the control group and Group 1.
Ok, introductions aside, please have fun being a biostatistician for 45 minutes! :P If you have any doubts, please call me or any of the other professors!
In [4]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
CSS = """
.output {
flex-direction: row;
}
"""
patient_data = pd.read_csv("../data/Exercises_Summary_Statistics_Data.csv")
patient_data.head()
Out[4]:
Ok, first let's take a quick look at who is in each of the groups. In medical studies it's important that the control and patient groups aren't too different from each other, so that we can draw relevant conclusions.
Since we are going to compute multiple statistics on the patient and control groups, we should create a variable for each one of the groups, so that we keep our code readable!
Remember: If you want to subset, the command is: Name_of_dataframe[Name_of_dataframe.column == "Value"]
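As a minimal sketch of the subsetting pattern above, the snippet below builds a small made-up table and selects rows by a column value. The column names `Group` and `Score` and all the values are hypothetical; they are not the columns of the exercise dataset.

```python
import pandas as pd

# Toy data; the column names and values are made up for illustration.
toy = pd.DataFrame({
    "Group": ["A", "B", "A", "B", "A"],
    "Score": [10, 20, 30, 40, 50],
})

# The comparison yields a boolean Series (one True/False per row) ...
is_group_a = toy.Group == "A"

# ... and indexing the DataFrame with it keeps only the True rows.
group_a = toy[is_group_a]

# Statistics can then be computed on a single column of the subset.
group_a_mean = group_a.Score.mean()
group_a_median = group_a.Score.median()
print(group_a_mean, group_a_median)
```

The same two steps (build the mask, index with it) are usually written on one line, as in the reminder above.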
In [5]:
patients = # Subset patient_data to include only patients
control = # Subset patient_data to include only control
In [6]:
patient_mean = # Calculate the mean of the patients
control_mean = # Calculate the mean of the controls
print("The patient mean age is:", patient_mean, "and the control mean age is:", control_mean, "\t")
In [7]:
patient_median = # Calculate the median of the patients
control_median = # Calculate the median of the controls
print("The patient median age is:", patient_median, "and the control median age is:", control_median, "\t")
(Optional): Is there a significant difference between the ages of the patients and the controls? Consider that this dataset is composed mainly of people injured while using power tools or other types of machinery; it is therefore composed mainly of people of working age, 20-ish to 60-ish.
In [8]:
patient_std = # Standard Deviation of the patients
control_std = # Standard Deviation of the controls
print("The patient std is:", patient_std, "and the control std is:", control_std, "\t")
In [9]:
patient_quantiles = # Patient quantiles
control_quantiles = # Control quantiles
print("Patients:\f")
display(pd.DataFrame(patient_quantiles))
print("Control:\f")
display(pd.DataFrame(control_quantiles))
HTML('<style>{}</style>'.format(CSS))
Out[9]:
(Optional): Do the dispersion statistics show a significant difference in the dispersion of the data?
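To make the idea of dispersion concrete, here is a small sketch on made-up numbers (not taken from the exercise dataset): two series with the same mean but very different spread, compared through the standard deviation and the quantiles. The variable names are hypothetical.

```python
import pandas as pd

# Made-up values; not taken from the exercise dataset.
narrow = pd.Series([48, 49, 50, 51, 52])
wide = pd.Series([10, 30, 50, 70, 90])

# Both series have the same mean ...
print(narrow.mean(), wide.mean())

# ... but very different spread. Series.std() computes the sample
# standard deviation (ddof=1) by default.
narrow_std = narrow.std()
wide_std = wide.std()
print(narrow_std, wide_std)

# quantile() with a list returns one value per requested quantile,
# indexed by the quantile itself.
narrow_q = narrow.quantile([0.25, 0.5, 0.75])
wide_q = wide.quantile([0.25, 0.5, 0.75])

# Interquartile range: distance between the 75% and 25% quantiles.
narrow_iqr = narrow_q.loc[0.75] - narrow_q.loc[0.25]
wide_iqr = wide_q.loc[0.75] - wide_q.loc[0.25]
print(narrow_iqr, wide_iqr)
```

Looking only at the mean would suggest the two series are alike; the standard deviation and the interquartile range are what reveal the difference.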
Next, let's try to find out the number of each of the sexes and the percentage of males in each of the groups.
Remember: To get a frequency table, use Name_of_dataframe.column.value_counts(). One way to get the count for a certain group is Name_of_dataframe.column.value_counts()["name_of_group"]
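The frequency-table pattern above can be sketched on a toy column. The column name `Color` and its values are hypothetical and only serve to show how the lookup works.

```python
import pandas as pd

# Toy data; the column name `Color` and its values are hypothetical.
toy = pd.DataFrame({"Color": ["red", "blue", "red", "red", "blue"]})

# Frequency table: one entry per distinct value, with its count.
counts = toy.Color.value_counts()
print(counts)

# Look up the count of a single category by its label.
num_red = counts["red"]
num_blue = counts["blue"]

# The plain ratio is a fraction in [0, 1]; multiply by 100 for a percentage.
pct_red = 100 * num_red / (num_red + num_blue)
print(num_red, num_blue, pct_red)

# value_counts(normalize=True) returns the fractions directly.
fractions = toy.Color.value_counts(normalize=True)
print(fractions)
```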
In [10]:
num_male_patients = # Get the number of male patients
num_female_patients = # Get the number of female patients
print("The number of male patients is:", num_male_patients,
      "\nThe number of female patients is:", num_female_patients,
      "\nAnd the percentage of males is:", 100 * num_male_patients / (num_male_patients + num_female_patients), "\t")
In [11]:
num_male_control = # Get the number of male control
num_female_control = # Get the number of female control
print("The number of male control patients is:", num_male_control,
      "\nThe number of female control patients is:", num_female_control,
      "\nAnd the percentage of males is:", 100 * num_male_control / (num_male_control + num_female_control), "\t")
In [12]:
gene1_patients = patients.Gene1
gene1_control = control.Gene1
# Mean
mean_gene1_patients = # Gene1 mean for patients
mean_gene1_control = # Gene1 mean for control
# Median
median_gene1_patients = # Gene1 median for patients
median_gene1_control = # Gene1 median for control
# Std
std_gene1_patients = # Gene1 std for patients
std_gene1_control = # Gene1 std for control
print("Patients: Mean =", mean_gene1_patients, "Median =", median_gene1_patients, "Std =", std_gene1_patients, "\t")
print("Control: Mean =", mean_gene1_control, "Median =", median_gene1_control, "Std =", std_gene1_control, "\t")
In [13]:
gene2_patients = patients.Gene2
gene2_control = control.Gene2
# Mean
mean_gene2_patients = # Gene2 mean for patients
mean_gene2_control = # Gene2 mean for control
# Median
median_gene2_patients = # Gene2 median for patients
median_gene2_control = # Gene2 median for control
# Std
std_gene2_patients = # Gene2 std for patients
std_gene2_control = # Gene2 std for control
print("Patients: Mean =", mean_gene2_patients, "Median =", median_gene2_patients, "Std =", std_gene2_patients, "\t")
print("Control: Mean =", mean_gene2_control, "Median =", median_gene2_control, "Std =", std_gene2_control, "\t")
In [14]:
gene6_patients = patients.Gene6
gene6_control = control.Gene6
# Mean
mean_gene6_patients = # Gene6 mean for patients
mean_gene6_control = # Gene6 mean for control
# Median
median_gene6_patients = # Gene6 median for patients
median_gene6_control = # Gene6 median for control
# Std
std_gene6_patients = # Gene6 std for patients
std_gene6_control = # Gene6 std for control
print("Patients: Mean =", mean_gene6_patients, "Median =", median_gene6_patients, "Std =", std_gene6_patients, "\t")
print("Control: Mean =", mean_gene6_control, "Median =", median_gene6_control, "Std =", std_gene6_control, "\t")
Of the 3 genes, which ones do you believe are involved in the process of recovery?
Help: Recall that we have 2 groups: a group of patients who are recovering from a severe accident and a control group of people who are fine. You should look at the statistics for the 3 genes (mean, median and standard deviation [this last one is skippable]) and try to find differences!
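One possible way to line up such statistics side by side is a grouped summary. This is only a sketch on made-up data, not the method the exercise requires; the names `Group`, `X` and `Y` are hypothetical. Here `X` was built to differ between the groups and `Y` was built to be identical.

```python
import pandas as pd

# Toy data; `Group`, `X` and `Y` are hypothetical names.
toy = pd.DataFrame({
    "Group": ["case", "case", "case", "ref", "ref", "ref"],
    "X": [9.0, 10.0, 11.0, 4.0, 5.0, 6.0],  # differs between groups
    "Y": [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],    # identical in both groups
})

# One row per group, one (variable, statistic) column per combination.
summary = toy.groupby("Group")[["X", "Y"]].agg(["mean", "median", "std"])
print(summary)

# Difference between the group means for each variable.
means = toy.groupby("Group")[["X", "Y"]].mean()
mean_gap = means.loc["case"] - means.loc["ref"]
print(mean_gap)
```

A gap that is large compared with the standard deviations (as for `X`) hints at a real difference between the groups, while a gap near zero (as for `Y`) does not.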
In [ ]:
In [15]:
gene_names = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]
display(# Get the summary of the gene columns for PATIENTS)
display(# Get the summary of the gene columns for CONTROL)
What if we want a measure of difference for each gene?
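One simple measure, sketched here on made-up data, is the ratio between the column means of two groups. The frames `group_a` and `group_b` and the column names `P` and `Q` are hypothetical.

```python
import pandas as pd

# Toy data; frame and column names are hypothetical.
group_a = pd.DataFrame({"P": [2.0, 4.0], "Q": [5.0, 5.0]})
group_b = pd.DataFrame({"P": [1.0, 2.0], "Q": [4.0, 6.0]})
cols = ["P", "Q"]

# DataFrame.mean() returns a Series with one mean per column.
# Dividing two Series aligns them on their index (the column names).
ratio = group_a[cols].mean() / group_b[cols].mean()
print(ratio)
```

A ratio close to 1 means the two groups have similar means for that column; a ratio far from 1 (in either direction) points to a difference worth a closer look.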
In [16]:
display(# Mean of the PATIENT genes / # Mean of the CONTROL genes)