Summary Statistics - Exercises

In these exercises you'll use a real-life medical dataset to learn how to obtain basic statistics from data. This dataset comes from Gluegrant, an American project that aims to find which genes are most important for the recovery of severely injured patients! It was slightly edited to remove some complexities, but if you wish to check it out in its full glory, it's available on the website and I can show it to you!

Have fun being a biostatistician for 1 hour! :)

Objectives

In this exercise the objective is for you to learn how to use Pandas functions to obtain simple statistics of datasets.

Dataset information

The dataset is a medical dataset with 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control. The dataset is composed of clinical values:

  • Patient_id
  • Age
  • Sex
  • Group (to what group they belong)
  • Result (outcome of the patient)

and of gene expression values (higher = more expressed), stored in the columns Gene1 to Gene6.

Remarks for people without bio background

Don't worry if you are not from a biological background: consider that these genes are simply numeric values related to the patient. We will not delve into the biological meaning of any of the genes; we'll only try to find out whether there are differences between the gene values for the different groups!

If you are still not comfortable using this dataset, imagine this situation instead:

You can consider that this dataset comes from an online shopping service like Amazon. Imagine that they were conducting an A/B test, where a small part of their website was changed, like the related-item suggestions. You have 2 groups: the "control group", which is experiencing the original website (without modifications), and Group 1, which is using the website with the new suggestions.

Consider also that the genes are products or product categories, and that the values are the amounts of each that the customers buy. Your objective now is to find out whether there is a significant difference between the control group and Group 1.

Start

OK, introductions aside, please have fun being a biostatistician for 45 minutes! :P If you have any questions, please call me or any of the other professors!

Import Data


In [4]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML

CSS = """
.output {
    flex-direction: row;
}
"""

patient_data = pd.read_csv("../data/Exercises_Summary_Statistics_Data.csv")

patient_data.head()


Out[4]:
   Patient_id  Age     Sex   Result    Group     Gene1     Gene2    Gene3   Gene4   Gene5   Gene6
0           1   20    male  control  Control   950.444  5609.021  530.861  56.001  38.539  32.496
1           2   34  female  control  Control   728.066  3337.738  271.314  60.238  37.117  30.645
2           3   40  female  control  Control  1208.076  4430.424  520.859  67.374  41.698  29.476
3           4   31    male  control  Control  3426.842  6524.846  842.426  68.772  36.682  32.125
4           5   21  female  control  Control  3781.265  7916.231  574.768  70.522  34.877  27.416

Exercise 1 - Let's get a quick look at the groups

OK, first let's get a quick look at who is in each of the groups. In medical studies it's important that the control and patient groups aren't too different from each other, so that we can draw relevant conclusions.

Separate the patients and control into 2 dataframes

Since we are going to compute multiple statistics on the patient and control groups, we should create a variable for each one of the groups, so that we keep our code readable!

Remember: If you want to subset a dataframe, the command is: Name_of_dataframe[Name_of_dataframe.column == "Value"]
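As a hedged illustration of the subsetting pattern, here is a minimal sketch on a made-up toy dataframe. The column names Name, Team and Score are hypothetical and are not part of the exercise data; the point is only how the boolean comparison selects rows.

```python
import pandas as pd

# Toy data, made up for illustration only (not the exercise dataset).
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Team": ["red", "blue", "red", "blue"],
    "Score": [10, 20, 30, 40],
})

# The comparison builds a boolean Series with one True/False per row...
is_red = toy.Team == "red"

# ...and indexing the dataframe with it keeps only the rows marked True.
red_team = toy[is_red]

# Using != selects every row that does NOT match the value.
other_teams = toy[toy.Team != "red"]

print(red_team)
print(other_teams)
```

The same two-step idea (build the mask, then index with it) is usually written in a single line, as in the reminder.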


In [5]:
patients = # Subset patient_data to include only patients
control = # Subset patient_data to include only control

Find out the Age means for each of the groups

Remember: To find the mean of a dataframe column, just use Name_of_dataframe.column.mean()
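A small sketch of the column mean on toy data, assuming a hypothetical dataframe with Team and Score columns (neither exists in the exercise file):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Team": ["red", "blue", "red", "blue"],
    "Score": [10, 20, 30, 40],
})

# Subset first, then take the mean of a single column.
red_mean = toy[toy.Team == "red"].Score.mean()    # (10 + 30) / 2
blue_mean = toy[toy.Team == "blue"].Score.mean()  # (20 + 40) / 2

print("red:", red_mean, "blue:", blue_mean)
```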


In [6]:
patient_mean = # Calculate the mean of the patients
control_mean = # Calculate the mean of the controls

print("The patient mean age is:", patient_mean, "and the control mean age is:", control_mean, "\t")


The patient mean age is: 33.64556962025316 and the control mean age is: 29.884615384615383 	

Find the Median of each group

As seen in the presentation, the mean can be affected by outliers in the data. Let's check that out with the median.

Remember: To find the median of a dataframe column, just use Name_of_dataframe.column.median()
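To see why the median is the more robust of the two, here is a minimal sketch with made-up numbers: one extreme value is added to a small Series and both statistics are recomputed.

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 12, 500])

# The mean is pulled strongly toward the extreme value...
print("mean without outlier:", values.mean())
print("mean with outlier:   ", with_outlier.mean())

# ...while the median stays where most of the data is.
print("median without outlier:", values.median())
print("median with outlier:   ", with_outlier.median())
```

A large gap between mean and median is therefore a hint that the data may be skewed or contain outliers.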


In [7]:
patient_median = # Calculate the median of the patients
control_median = # Calculate the median of the controls

print("The patient median age is:", patient_median, "and the control median age is:", control_median, "\t")


The patient median age is: 33.0 and the control median age is: 28.0 	

Results - Mean / Median

Is there a significant difference between the mean and the median?

(Optional): Is there a significant difference between the age of the patients and the control? Consider that this dataset is composed mainly of people injured using power tools or other types of machinery; therefore, it's composed mainly of people of working age, 20-ish to 60-ish.

Find the Standard deviation of each group

Let's see if there is a large deviation from the mean in each of the groups.

Remember: The standard deviation is taken as Name_of_dataframe.column.std()
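One detail worth knowing: pandas' .std() computes the sample standard deviation (it divides by n - 1) by default, and the ddof argument switches to the population version. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up values for illustration only; their mean is 5.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Default: sample standard deviation (ddof=1, divides by n - 1).
sample_std = values.std()

# Population standard deviation (ddof=0, divides by n).
population_std = values.std(ddof=0)

print("sample std:    ", sample_std)
print("population std:", population_std)
```

For groups as large as the ones in this exercise the two versions differ only slightly.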


In [8]:
patient_std = # Standard Deviation of the patients
control_std = # Standard Deviation of the controls

print("The patient std is:", patient_std, "and the control std is:", control_std, "\t")


The patient std is: 11.166987259352279 and the control std is: 10.195398660481787 	

Find the quantiles

Let's use the quantiles to obtain the dispersion of the groups. Get the 0, 0.25, 0.5, 0.75 and 1 quantiles.

Remember: The quantiles are obtained using the command Name_of_dataframe.column.quantile(q=[quantiles]), where each quantile is a fraction between 0 and 1
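A minimal sketch of .quantile() on made-up numbers. Passing a list returns a Series indexed by the requested quantiles, so individual values can be looked up by their fraction:

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([1, 2, 3, 4, 5])

# Quantiles are given as fractions between 0 and 1, not as percentages.
quantiles = values.quantile(q=[0, 0.25, 0.5, 0.75, 1])
print(quantiles)

# The interquartile range (Q3 - Q1) is a common measure of dispersion.
iqr = quantiles[0.75] - quantiles[0.25]
print("IQR:", iqr)
```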


In [9]:
patient_quantiles = # Patient quantiles
control_quantiles = # Control quantiles
print("Patients:\f")
display(pd.DataFrame(patient_quantiles))
print("Control:\f")
display(pd.DataFrame(control_quantiles))
HTML('<style>{}</style>'.format(CSS))


Patients:
Age
0.00 16.0
0.25 24.0
0.50 33.0
0.75 43.0
1.00 55.0
Control:
Age
0.00 17.0
0.25 21.5
0.50 28.0
0.75 34.0
1.00 54.0
Out[9]:

Results - Interval Statistics

(Optional): Do the dispersion statistics show a significant difference in the dispersion of the data?

Find out how many patients are male and how many are female

Next, let's try to find out the number of each of the sexes and the percentage of males in each of the groups.

Remember: To get a frequency table, use Name_of_dataframe.column.value_counts(). One way to get the count for a certain group is Name_of_dataframe.column.value_counts()["name_of_group"]
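A minimal sketch of a frequency table on toy data, assuming a hypothetical Color column that is not part of the exercise file. It also shows the normalize=True option, which returns proportions instead of counts:

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({"Color": ["red", "blue", "red", "red"]})

# Frequency table: one count per distinct value.
counts = toy.Color.value_counts()
print(counts)

# Look up a single count by its label.
num_red = counts["red"]

# normalize=True gives proportions between 0 and 1
# (multiply by 100 to express them as percentages).
share_red = toy.Color.value_counts(normalize=True)["red"]

print("red count:", num_red, "red share:", share_red)
```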

Number of male patients


In [10]:
num_male_patients = # Get the number of male patients
num_female_patients = # Get the number of female patients

print("The number of male patients is:", num_male_patients, \
      "\nThe number of female patients is:", num_female_patients, \
      "\nAnd the percentage of males is:", num_male_patients / (num_male_patients + num_female_patients), "\t")


The number of male patients is: 98 
The number of female patients is: 60 
And the percentage of males is: 0.620253164557 	

Number of male control patients


In [11]:
num_male_control = # Get the number of male control
num_female_control = # Get the number of female control

print("The number of male control patients is:", num_male_control, \
      "\nThe number of female control patients is:", num_female_control, \
      "\nAnd the percentage of males is:", num_male_control / (num_male_control + num_female_control), "\t")


The number of male control patients is: 17 
The number of female control patients is: 9 
And the percentage of males is: 0.653846153846 	

Results - Percentage of the sexes

(Optional): Is there a significant difference between the percentage of male patients and male control patients?

Exercise 2 - Let the Biostatistics begin

I have selected 6 genes from a total of ~55000. The objective here is for you to try to find genes that differ between the patient group and the control group, using the tools that you learned in Exercise 1.

Gene 1


In [12]:
gene1_patients = patients.Gene1
gene1_control = control.Gene1

# Mean
mean_gene1_patients = # Gene1 mean for patients
mean_gene1_control = # Gene1 mean for control

# Median
median_gene1_patients = # Gene1 median for patients
median_gene1_control = # Gene1 median for control

# Std
std_gene1_patients = # Gene1 std for patients
std_gene1_control = # Gene1 std for control

print("Patients: Mean =", mean_gene1_patients, "Median =", median_gene1_patients, "Std =", std_gene1_patients, "\t")
print("Control:  Mean =", mean_gene1_control, "Median =", median_gene1_control, "Std =", std_gene1_control, "\t")


Patients: Mean = 13676.395265822784 Median = 13555.230500000001 Std = 3092.69814423062 	
Control:  Mean = 1156.0739230769232 Median = 947.1095 Std = 854.9755207222215 	

Gene 2


In [13]:
gene2_patients = patients.Gene2
gene2_control = control.Gene2

# Mean
mean_gene2_patients = # Gene2 mean for patients
mean_gene2_control = # Gene2 mean for control

# Median
median_gene2_patients = # Gene2 median for patients
median_gene2_control = # Gene2 median for control

# Std
std_gene2_patients = # Gene2 std for patients
std_gene2_control = # Gene2 std for control

print("Patients: Mean =", mean_gene2_patients, "Median =", median_gene2_patients, "Std =", std_gene2_patients, "\t")
print("Control:  Mean =", mean_gene2_control, "Median =", median_gene2_control, "Std =", std_gene2_control, "\t")


Patients: Mean = 16955.432499999995 Median = 17023.491 Std = 2743.730488293748 	
Control:  Mean = 3439.741 Median = 3067.3015 Std = 1549.3355492407961 	

I will just ask for one more gene, since the process is entirely the same!

Gene 6


In [14]:
gene6_patients = patients.Gene6
gene6_control = control.Gene6

# Mean
mean_gene6_patients = # Gene6 mean for patients
mean_gene6_control = # Gene6 mean for control

# Median
median_gene6_patients = # Gene6 median for patients
median_gene6_control = # Gene6 median for control

# Std
std_gene6_patients = # Gene6 std for patients
std_gene6_control = # Gene6 std for control

print("Patients: Mean =", mean_gene6_patients, "Median =", median_gene6_patients, "Std =", std_gene6_patients, "\t")
print("Control:  Mean =", mean_gene6_control, "Median =", median_gene6_control, "Std =", std_gene6_control, "\t")


Patients: Mean = 30.24018987341772 Median = 29.8615 Std = 4.903053008532824 	
Control:  Mean = 30.018538461538462 Median = 29.8885 Std = 3.6831340701176676 	

Results - Genes 1, 2 and 6

Of the 3 genes, which ones do you believe are involved in the process of recovery?

Help: Recall that we have 2 groups: a group of patients that is recovering from a severe accident and a control group that is fine. You should look at the statistics for the 3 genes (mean, median and standard deviation [this last one is skippable]) and try to find differences!


In [ ]:

Can we do this without so much code?

Can we obtain the previous statistics for the 6 genes without all the effort?

Remember: Have you checked out the .describe() method?
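As a hedged sketch of what .describe() gives you, here it is applied to a list of columns of a made-up dataframe (the column names A, B and Label are hypothetical). The percentiles argument controls which percentiles appear in the summary; this is one possible way to use it, not necessarily the intended solution.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [10, 20, 30],
    "Label": ["x", "y", "z"],
})

# Select several columns at once by passing a list of names.
cols = ["A", "B"]

# describe() computes count, mean, std, min, percentiles and max per column.
# percentiles=[0.5] keeps only the median row among the percentiles.
summary = toy[cols].describe(percentiles=[0.5])
print(summary)

# Individual statistics can be read back with .loc[row, column].
print("mean of A:", summary.loc["mean", "A"])
```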


In [15]:
gene_names = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]

display(# Get the summary of the gene columns for PATIENTS)
display(# Get the summary of the gene columns for CONTROL)


              Gene1         Gene2         Gene3        Gene4       Gene5       Gene6
count    158.000000    158.000000    158.000000   158.000000  158.000000  158.000000
mean   13676.395266  16955.432500   8545.115209    88.850810   40.589082   30.240190
std     3092.698144   2743.730488   2468.762672    93.839906    4.982023    4.903053
min     4216.792000   9000.672000   2076.031000    57.577000   29.299000   20.265000
50%    13555.230500  17023.491000   8901.650500    80.864500   40.286500   29.861500
max    21642.619000  23432.793000  13809.735000  1252.313000   54.990000   56.933000

             Gene1        Gene2       Gene3      Gene4      Gene5      Gene6
count    26.000000    26.000000   26.000000  26.000000  26.000000  26.000000
mean   1156.073923  3439.741000  469.361115  67.455538  40.434962  30.018538
std     854.975521  1549.335549  146.219868   6.643712   4.406569   3.683134
min     308.670000  1729.328000  253.831000  52.941000  34.290000  23.870000
50%     947.109500  3067.301500  428.851500  68.073000  39.957500  29.888500
max    3781.265000  7916.231000  842.426000  81.252000  50.098000  38.250000

What if we want a measure of difference for each gene?


In [16]:
display(# Mean of the PATIENT genes / # Mean of the CONTROL genes)


Gene1    11.830035
Gene2     4.929276
Gene3    18.205844
Gene4     1.317176
Gene5     1.003812
Gene6     1.007384
dtype: float64
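A ratio of group means like the one above can be sketched on toy data as follows. The dataframe and its Arm, X and Y columns are made up for illustration; the idea is that .mean() on several columns returns a Series indexed by column name, and dividing two such Series matches them up label by label.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Arm": ["a", "a", "b", "b"],
    "X": [2.0, 4.0, 1.0, 1.0],
    "Y": [10.0, 10.0, 5.0, 15.0],
})
cols = ["X", "Y"]

# Column-wise means for each group: one value per column.
mean_a = toy[toy.Arm == "a"][cols].mean()
mean_b = toy[toy.Arm == "b"][cols].mean()

# Element-wise division, aligned on the column names.
ratio = mean_a / mean_b
print(ratio)
```

A ratio close to 1 means the two groups have similar averages for that column, while a ratio far from 1 points to a column worth a closer look.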