Summary Statistics - Exercises

In these exercises you'll use a real-life medical dataset to learn how to obtain basic statistics from the data. This dataset comes from Gluegrant, an American project that aims to find which genes are most important for the recovery of severely injured patients! It was slightly edited to remove some complexities, but if you wish to check it out in its full glory, it's available on the website and I can show it to you!

Have fun being a biostatistician for 1 hour! :)

Objectives

In this exercise, the objective is for you to learn how to use Pandas functions to obtain simple statistics of datasets.

Dataset information

The dataset is a medical dataset with 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control. The dataset is composed of clinical values:

Patient_id
Age
Sex
Group (to what group they belong)
Result (outcome of the patient)

and of the gene expression (higher = more expressed):

Gene1: MMP9
Gene2: S100A12
Gene3: MCEMP1
Gene4: ACSL1
Gene5: SLC7A2
Gene6: CDC14B

Remarks for people without bio background

Don't worry if you are not from a biological background. Consider that these genes are simply numeric values related to the patient. We will not delve into the biological meaning of any of the genes; we'll only try to find out whether the gene values differ between the groups!

If you are still not comfortable using this dataset, imagine this situation instead:

You can consider that this dataset comes from an online shopping service like Amazon. Imagine that they were conducting an A/B test, where a small part of their website was changed, like the related-item suggestions. You have 2 groups: the "control group", which is experiencing the original website (without modifications), and Group 1, which is using the website with the new suggestions.

Consider also that the genes are products or product categories, of which the customers buy a certain amount. Your objective now is to find if there is a significant difference between the control group and Group 1.

Start

Ok, introductions aside, please have fun being a biostatistician for 45 minutes! :P If you have any questions, please call me or any of the other professors!

Import Data



In [4]:

    
import pandas as pd
import numpy as np
from IPython.display import display, HTML

CSS = """
.output {
    flex-direction: row;
}
"""

patient_data = pd.read_csv("../data/Exercises_Summary_Statistics_Data.csv")

patient_data.head()

Exercise 1 - Let's get a quick look at the groups

Ok, first let's get a quick look at who is in each of the groups. In medical studies it's important that the control and patient groups aren't too different from each other, so that we can draw relevant results.

Separate the patients and control into 2 dataframes

Since we are going to compute multiple statistics on the patient and control groups, we should create a variable for each one of the groups, so that we keep our code readable!

Remember: If you want to subset, the command is: Name_of_dataframe[Name_of_dataframe.column == "Value"]
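The subsetting pattern in the hint can be sketched on a small made-up table. The DataFrame `toy` and its columns `Name`, `Team` and `Score` are purely hypothetical and have nothing to do with the exercise dataset; they only illustrate how a comparison selects rows.

```python
import pandas as pd

# Hypothetical toy data, NOT the exercise dataset.
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo", "Dan"],
    "Team": ["red", "blue", "red", "blue"],
    "Score": [10, 20, 30, 40],
})

# The comparison produces a boolean Series (True where Team is "red").
is_red = toy.Team == "red"

# Indexing the DataFrame with that boolean Series keeps only the True rows.
red_team = toy[is_red]
print(red_team)
```

The two steps are usually written on one line, exactly as in the hint: toy[toy.Team == "red"].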



In [5]:

    
patients = # Subset patient_data to include only patients
control = # Subset patient_data to include only control

Find out the Age means for each of the groups

Remember: To find the mean of a dataframe column, just use Name_of_dataframe.column.mean()



In [6]:

    
patient_mean = # Calculate the mean of the patients
control_mean = # Calculate the mean of the controls

print("The patient mean age is:", patient_mean, "and the control mean age is:", control_mean, "\t")









    



The patient mean age is: 33.64556962025316 and the control mean age is: 29.884615384615383

Find the Median of each group

As seen in the presentation, the mean can be affected by outliers in the data. Let's check that out with the median.

Remember: To find the median of a dataframe column, just use Name_of_dataframe.column.median()
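To see why the median is less sensitive to outliers than the mean, here is a minimal sketch on made-up numbers (the two Series below are hypothetical and are not taken from the exercise data). Only the last value changes between them.

```python
import pandas as pd

# Hypothetical values: identical except for the last entry.
values = pd.Series([20, 22, 24, 26, 28])
with_outlier = pd.Series([20, 22, 24, 26, 280])

# Without the outlier, mean and median agree.
print("mean:", values.mean(), "median:", values.median())

# With the outlier, the mean jumps while the median stays where it was.
print("mean:", with_outlier.mean(), "median:", with_outlier.median())
```

A large gap between mean and median is therefore a hint that a few extreme values may be pulling the mean.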



In [7]:

    
patient_median = # Calculate the median of the patients
control_median = # Calculate the median of the controls

print("The patient median age is:", patient_median, "and the control median age is:", control_median, "\t")









    



The patient median age is: 33.0 and the control median age is: 28.0

Results - Mean / Median

Is there a significant difference between the mean and the median?

(Optional): Is there a significant difference between the age of the patients and the control? Consider that this dataset is composed mainly of people injured using power tools or other types of machinery; therefore, it's composed mainly of people of working age, 20-ish to 60-ish.

Find the Standard deviation of each group

Let's see if there is a large deviation from the mean in each of the groups.

Remember: The standard deviation is taken as Name_of_dataframe.column.std()
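One detail worth knowing: pandas' .std() computes the sample standard deviation (it divides by n - 1, i.e. ddof=1) by default, whereas NumPy's np.std divides by n. A small sketch on hypothetical numbers (not the exercise data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
ages = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = ages.std()              # pandas default: ddof=1, divides by n - 1
population_std = ages.std(ddof=0)    # divides by n
numpy_std = np.std(ages.to_numpy())  # NumPy default: ddof=0

print("sample:", sample_std)
print("population:", population_std)
print("numpy default:", numpy_std)
```

For groups as large as the ones in this exercise the two versions are close, but they are not identical, so it helps to know which one you are reporting.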



In [8]:

    
patient_std = # Standard Deviation of the patients
control_std = # Standard Deviation of the controls

print("The patient std is:", patient_std, "and the control std is:", control_std, "\t")









    



The patient std is: 11.166987259352279 and the control std is: 10.195398660481787

Find the quantiles

Let's use the quantiles to obtain the dispersion of the groups. Get the 0, 0.25, 0.5, 0.75 and 1 quantiles.

Remember: The quantiles are obtained using the command Name_of_dataframe.column.quantile(q=[fractions]), where each fraction is a number between 0 and 1
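As a rough illustration of what .quantile() returns, here is a sketch on a hypothetical Series (`toy_ages` is made up, not the Age column of the exercise). Passing a list to q gives back a Series indexed by the requested fractions.

```python
import pandas as pd

# Hypothetical values, NOT the exercise data.
toy_ages = pd.Series([10, 20, 30, 40, 50])

# q takes fractions between 0 and 1 (0.25 means the 25th percentile).
toy_quantiles = toy_ages.quantile(q=[0, 0.25, 0.5, 0.75, 1])
print(toy_quantiles)

# The interquartile range is the distance between the 0.75 and 0.25 quantiles.
iqr = toy_quantiles.loc[0.75] - toy_quantiles.loc[0.25]
print("IQR:", iqr)
```

The 0 and 1 quantiles are simply the minimum and maximum, and the 0.5 quantile is the median.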



In [9]:

    
patient_quantiles = # Patient quantiles
control_quantiles = # Control quantiles
print("Patients:\f")
display(pd.DataFrame(patient_quantiles))
print("Control:\f")
display(pd.DataFrame(control_quantiles))
HTML('<style>{}</style>'.format(CSS))

Results - Interval Statistics

(Optional): Do the dispersion statistics show a significant difference in the dispersion of the data?

Find out how many patients are male and how many are female

Next, let's try to find out the number of each of the sexes and the percentage of males in each of the groups.

Remember: To get a frequency table, use Name_of_dataframe.column.value_counts(). One way to get the count for a certain group is Name_of_dataframe.column.value_counts()["name_of_group"]
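A minimal sketch of the frequency-table idea, on a hypothetical Series of pet names (made up for illustration, unrelated to the Sex column):

```python
import pandas as pd

# Hypothetical categorical data, NOT the exercise dataset.
pets = pd.Series(["cat", "dog", "cat", "cat", "dog", "bird"])

# value_counts() returns a Series: the index holds the categories,
# the values hold how many times each one appears.
counts = pets.value_counts()
print(counts)

# Looking up one category by its label gives its count.
num_cats = counts["cat"]
print("cats:", num_cats)

# normalize=True returns fractions of the total instead of raw counts.
shares = pets.value_counts(normalize=True)
print("share of cats:", shares["cat"])
```

Note that normalize=True gives a fraction between 0 and 1; multiply by 100 if you want a percentage.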

Number of male patients



In [10]:

    
num_male_patients = # Get the number of male patients
num_female_patients = # Get the number of female patients

print("The number of male patients is:", num_male_patients, \
      "\nThe number of female patients is:", num_female_patients, \
      "\nAnd the percentage of males is:", num_male_patients / (num_male_patients + num_female_patients), "\t")









    



The number of male patients is: 98 
The number of female patients is: 60 
And the percentage of males is: 0.620253164557

Number of male control patients



In [11]:

    
num_male_control = # Get the number of male control
num_female_control = # Get the number of female control

print("The number of male control patients is:", num_male_control, \
      "\nThe number of female control patients is:", num_female_control, \
      "\nAnd the percentage of males is:", num_male_control / (num_male_control + num_female_control), "\t")









    



The number of male control patients is: 17 
The number of female control patients is: 9 
And the percentage of males is: 0.653846153846

Results - Percentage of the sexes

(Optional): Is there a significant difference between the percentage of male patients and male control patients?

Exercise 2 - Let the Biostatistics begin

I have selected 6 genes from a total of ~55000. The objective here is for you to try to find genes that differ between the patient group and the control group, using the tools that you learned in Exercise 1.

Gene 1



In [12]:

    
gene1_patients = patients.Gene1
gene1_control = control.Gene1

# Mean
mean_gene1_patients = # Gene1 mean for patients
mean_gene1_control = # Gene1 mean for control

# Median
median_gene1_patients = # Gene1 median for patients
median_gene1_control = # Gene1 median for control

# Std
std_gene1_patients = # Gene1 std for patients
std_gene1_control = # Gene1 std for control

print("Patients: Mean =", mean_gene1_patients, "Median =", median_gene1_patients, "Std =", std_gene1_patients, "\t")
print("Control:  Mean =", mean_gene1_control, "Median =", median_gene1_control, "Std =", std_gene1_control, "\t")









    



Patients: Mean = 13676.395265822784 Median = 13555.230500000001 Std = 3092.69814423062 	
Control:  Mean = 1156.0739230769232 Median = 947.1095 Std = 854.9755207222215

Gene 2



In [13]:

    
gene2_patients = patients.Gene2
gene2_control = control.Gene2

# Mean
mean_gene2_patients = # Gene2 mean for patients
mean_gene2_control = # Gene2 mean for control

# Median
median_gene2_patients = # Gene2 median for patients
median_gene2_control = # Gene2 median for control

# Std
std_gene2_patients = # Gene2 std for patients
std_gene2_control = # Gene2 std for control

print("Patients: Mean =", mean_gene2_patients, "Median =", median_gene2_patients, "Std =", std_gene2_patients, "\t")
print("Control:  Mean =", mean_gene2_control, "Median =", median_gene2_control, "Std =", std_gene2_control, "\t")









    



Patients: Mean = 16955.432499999995 Median = 17023.491 Std = 2743.730488293748 	
Control:  Mean = 3439.741 Median = 3067.3015 Std = 1549.3355492407961

I will just ask for one more gene, since the process is entirely the same!

Gene 6



In [14]:

    
gene6_patients = patients.Gene6
gene6_control = control.Gene6

# Mean
mean_gene6_patients = # Gene6 mean for patients
mean_gene6_control = # Gene6 mean for control

# Median
median_gene6_patients = # Gene6 median for patients
median_gene6_control = # Gene6 median for control

# Std
std_gene6_patients = # Gene6 std for patients
std_gene6_control = # Gene6 std for control

print("Patients: Mean =", mean_gene6_patients, "Median =", median_gene6_patients, "Std =", std_gene6_patients, "\t")
print("Control:  Mean =", mean_gene6_control, "Median =", median_gene6_control, "Std =", std_gene6_control, "\t")









    



Patients: Mean = 30.24018987341772 Median = 29.8615 Std = 4.903053008532824 	
Control:  Mean = 30.018538461538462 Median = 29.8885 Std = 3.6831340701176676

Results - Genes 1, 2 and 6

Of the 3 genes, which ones do you believe are involved in the process of recovery?

Help: Recall that we have 2 groups: a group of patients who are recovering from a severe accident and a control group of people who are fine. You should look at the statistics for the 3 genes (mean, median and standard deviation [this last one is skippable]) and try to find differences!



In [ ]:

Can we do this without so much code?

Can we obtain the previous statistics for the 6 genes without all the effort?

Remember: Have you checked out the .describe() method?
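As a sketch of what .describe() gives you, here it is applied to a hypothetical DataFrame (`toy`, with made-up columns `A`, `B` and `Label`). Selecting a list of column names first restricts the summary to those columns.

```python
import pandas as pd

# Hypothetical toy data, NOT the exercise dataset.
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
    "Label": ["x", "y", "x", "y"],
})

# Pick the numeric columns of interest by name, then summarize them in one call.
numeric_columns = ["A", "B"]
summary = toy[numeric_columns].describe()
print(summary)

# Individual statistics can be read back with .loc[row_label, column_label].
print("mean of A:", summary.loc["mean", "A"])
```

By default the summary has one row per statistic (count, mean, std, min, quartiles, max) and one column per selected variable.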



In [15]:

    
gene_names = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]

display(# Get the summary of the gene columns for PATIENTS)
display(# Get the summary of the gene columns for CONTROL)









    







  
    
      
	Gene1	Gene2	Gene3	Gene4	Gene5	Gene6
count	158.000000	158.000000	158.000000	158.000000	158.000000	158.000000
mean	13676.395266	16955.432500	8545.115209	88.850810	40.589082	30.240190
std	3092.698144	2743.730488	2468.762672	93.839906	4.982023	4.903053
min	4216.792000	9000.672000	2076.031000	57.577000	29.299000	20.265000
50%	13555.230500	17023.491000	8901.650500	80.864500	40.286500	29.861500
max	21642.619000	23432.793000	13809.735000	1252.313000	54.990000	56.933000
    
  








    







  
    
      
	Gene1	Gene2	Gene3	Gene4	Gene5	Gene6
count	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000
mean	1156.073923	3439.741000	469.361115	67.455538	40.434962	30.018538
std	854.975521	1549.335549	146.219868	6.643712	4.406569	3.683134
min	308.670000	1729.328000	253.831000	52.941000	34.290000	23.870000
50%	947.109500	3067.301500	428.851500	68.073000	39.957500	29.888500
max	3781.265000	7916.231000	842.426000	81.252000	50.098000	38.250000

What if we want a measure of difference for each gene?



In [16]:

    
display(# Mean of the PATIENT genes / # Mean of the CONTROL genes)









    





Gene1    11.830035
Gene2     4.929276
Gene3    18.205844
Gene4     1.317176
Gene5     1.003812
Gene6     1.007384
dtype: float64
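A per-column ratio of group means, like the one printed above, relies on the fact that dividing two pandas Series matches them up by label. Here is a sketch on two hypothetical DataFrames (`group_a` and `group_b`, with made-up columns `X` and `Y`):

```python
import pandas as pd

# Hypothetical toy groups, NOT the exercise dataset.
group_a = pd.DataFrame({"X": [10.0, 30.0], "Y": [5.0, 7.0]})
group_b = pd.DataFrame({"X": [2.0, 2.0], "Y": [6.0, 6.0]})

# DataFrame.mean() returns one mean per column, as a Series indexed by column name.
mean_a = group_a.mean()
mean_b = group_b.mean()

# Dividing the two Series aligns them by column name, giving one ratio per column.
ratio = mean_a / mean_b
print(ratio)
```

A ratio close to 1 means the two groups have similar means for that column; a ratio far from 1 points to a column worth a closer look.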

Patient age quantiles (output of In [9]):

	Age
0.00	16.0
0.25	24.0
0.50	33.0
0.75	43.0
1.00	55.0

Control age quantiles (output of In [9]):

	Age
0.00	17.0
0.25	21.5
0.50	28.0
0.75	34.0
1.00	54.0

First rows of patient_data (output of patient_data.head() in In [4]):

	Patient_id	Age	Sex	Result	Group	Gene1	Gene2	Gene3	Gene4	Gene5	Gene6
0	1	20	male	control	Control	950.444	5609.021	530.861	56.001	38.539	32.496
1	2	34	female	control	Control	728.066	3337.738	271.314	60.238	37.117	30.645
2	3	40	female	control	Control	1208.076	4430.424	520.859	67.374	41.698	29.476
3	4	31	male	control	Control	3426.842	6524.846	842.426	68.772	36.682	32.125
4	5	21	female	control	Control	3781.265	7916.231	574.768	70.522	34.877	27.416
