For this module, we are going to go through the scenario where your PI gives you a dataset and tells that you have to perform the standard filtering steps, analyze it, and return the results (without actually telling you what the dataset represents). Luckily, we learned many of the relevant commands in the first genetic variation module, so you have all the tools you need to impress your demanding PI.
If you look in the '~/34_Data_Variation-II/' directory (which is not in this assigment folder but should be in your home directory), you will see three files with extensions that should look somewhat familiar to you from the last class:
You will also see a file called malmo_v4_mds.mds, which we will investigate in parts 2 and 3.
First, let's run the basic PLINK command to look at this data, and send the output to plink_output/validation
within this assignment folder. Note that to use the binary file, you have to use the '--bfile' flag instead of the '--file' flag.
Record the command you ran:
How many SNPs are in this dataset?
How many cases and controls are in this dataset?
Now let's perform similar filtering to what we did for the first module: add a MAF threshold of 0.01, a SNP genotyping rate of 0.05, and missingness per individual/sample of 0.1. In addition to these filters, we will also add a filter to exclude SNPs that significantly diverge from Hardy-Weinberg equilibrium, which is meant to identify SNPs that are under specific evolutionary forces and may therefore not be appropriate to include in a case/control association study. Look through the PLINK thresholding page to find the flag for filtering by HWE. What is that flag?
Now, let's filter out SNPs that significantly deviate from HWE with a p-value of 0.001 or less, in addition to the other filters described above. Record the code you ran here, and send the output to 'plink_output/filtering' in this directory:
How many individuals were removed for low genotyping?
How many SNPs were filtered out for diverging from HWE?
How many SNPs failed the missingness test?
How many SNPs failed the MAF test?
How many SNPs passed all the filtering?
How many cases and controls passed filtering?
A very common and important analysis in genetic studies is to consider the population stratification of the sample: because different ancestries/populations have different genetic architectures, comparing people without consideration of which population they come from can lead to false results, where a genetic signal may seem to be associated with a trait but really be due to differences between the case and control populations. The standard way to analyze this stratification is to calculate the genotype-based pairwise distances between each individual and each other individual, perform multi-dimensional scaling on these distances, which provides a way to make a 2D plot of the similarity of the individuals in the dataset, and cluster individuals based on this information to account for different populations.
You can read this PLINK page that describes approaches for understanding population stratification. We have generated a multidimensional scaling file for you called 'malmo_v4_mds.mds', also in the ~/34_Data_Variation-II/ directory. Read through the linked PLINK page to find the flags that you would use to generate an MDS file with 3 dimensions (but do not run this command, as it takes a long time!):
Next, download this MDS file, read it into R (noting that there is a header), and plot the first two dimensions against each other (the 3rd and 4th columns of the data file). Create a directory called "images" and save the plot into the images/ folder of this assignment.
Do you see anything suggesting that there is population stratification in this sample?
Given that we have seen population stratification, the question of how to account for this in our association test still remains. Instead of using the basic '--assoc' flag to do our association test, we now have to use a more sophisticated flag, which is the '--logistic' flag. This tells PLINK that we want to run a logistic regression, which is one way to predict a categorical value (i.e. the case or control status of each individual) while taking genotype weights as well as other covariates, in this case the population clusters, into account. In PLINK, we can specify the MDS file as a covariate for the logistic regression model to account for using the '--covar' flag with the malmo_v4_mds.mds
file as the argument.
We will also use two new flags that we haven't seen before: to adjust for multiple testing and sort the output by correct signifiance, we use the '--adjust' flag, and to generate 95% confidence intervals around the estimated parameters, we use the '--ci 0.95' flag.
Write the command with all the filters that we used before plus the --logistic flag, the --covar flag with the MDS file as the argument, and the two flags described above to run this logistic association test, and send the output to plink_output/filtering_and_logistic
:
This may take a few minutes, but now we have our logistic regression results in two files: plink_output/filtering_and_logistic.assoc.logistic
, which contains the logistic regression and the 95% confidence intervals for the odds ratios of each SNP, and plink_output/filtering_and_logistic.assoc.logistic.adjusted
, which contains the SNPs sorted by their adjusted p-values. Give the command to look at the first 30 lines of the adjusted file:
By looking at the relatively conservative Bonferroni corrected p-values (the BONF column), how many SNPs are significant at a p-value cutoff of 0.05?
While we're on the topic of p-values, how should you interpret the Bonferroni corrected p-value (hint: what's the null hypothesis, and what's the alternative hypothesis)?
In [0]:
Now look at the less conservative Benjamini-Hochberg computation of the false discovery rate, the 'FDR_BH' column in the data file. How many SNPs are significant at a cutoff of 0.05 false discovery?
The two output files contain different information, including the actual position of each SNP in filtering_and_logistic.assoc.logistic
. One important analysis that we should do is to figure out if we have many different and independent significant signals or if there are some significant SNPs that are close to each other, suggesting that they are in linkage disequilibrium and may be tagging the same causal mechanism. Since we don't have too many significant SNPs, we can figure this out by simply grepping for the top Bonferroni SNPs in filtering_and_logistic.assoc.logistic and looking at where they reside. Note that each SNP has 4 lines in this file, for each of four tests: additive, which accounts for all the covariates, and one test for each of the three MDS covariates.
Fill in the following cell with the grep code you ran for each SNP and the corresponding output (you only need to include the first line of the result). We have filled out the first one for you including the header line:
By 'clumping' nearby SNPs together based on their positions (the third column here), how many distinct association signals do you think there are?
What are the general odds ratios for each of the distinct signals you identified? Are they in the same direction, and can you generally interpret what these values represent? How do these correspond to the 'STAT' column in the logistic output file?
Finally, go on the genome browser and check these regions of associations, making sure to use the hg18 assembly; what genes do you see in these regions and can you give any general interpretation given that we haven't told you what our actual phenotype is?
Now you can give these results to your PI, who will surely be impressed with your knowledge of the somewhat esoteric art of running PLINK commands!