In this exercise, you will use PLINK to analyze some genetic variation data and use various databases (UCSC Genome Browser) to examine genetic variants. Open a terminal and switch to the "33_Data_Variation-I" directory. You will find a zip file called “data.zip”. Unzip this file for the following analysis. We have already installed the PLINK on the CoCalc for you. You can type "plink" in the terminal run it.
There is no separate homework section for this in-class, the sections below will be graded.
First run PLINK with the basic input (--file) and output (--out) commands on the data in "33_Data_Variation-I". Exmaine the output and answer the following questions.
Q1. Write down your code below.
In [0]:
Q2. How many SNPs are in the dataset?
Q3. How many cases and controls are in the dataset?
Next, let's remove the low quality SNPs and samples. Run PLINK by putting explicit thresholds on SNP minor allele frequency (--maf 0.01), SNP genotyping rate (--geno 0.1) and missingness per individual/sample (--mind 0.1)
Q4. Write down your code below.
In [0]:
Q5. How many SNPs failed the missingness test?
Q6. How many SNPs failed the frequency test?
Q7. How many SNPs passed the filtering?
Q8. How many samples passed the filtering?
With the same filtering criteria, let's now perform an association analysis to identify associations between SNP alleles and the disease phenotype. Write down your code for the association analysis.
In [0]:
Q9. Which 5 SNPs are most strongly associated with the phenotype based on p-value? Hint: use the sort command to sort the pvalues. Write down your code and the top 5 SNPs in the following two cells.
In [0]:
The p-values measure the statistical significance of the association of each SNP with the phenotype of interest. The p-values are calculated based on the chi-square test. This is all interesting, but in plain words, how should you interpret this p-value (hint: what's the null hypothesis, and what's the alternative hypothesis)?
Q10. For rs12302542, write down your answers to the following questions.
What is the genomic location of the SNP?
What is the minor allele and its frequency of this SNP based on the entire sample frequencies?
What is the pvalue for the association of this SNP with the phenotype of interest?
What is the effect size (odds ratio)?
Is this SNP significantly associated with the phenotype of interest?
Q11: Use the UCSC browser to find the gene closest to this SNP. Note, the coordinates in the files used by plink are for hg18, so we'd recommend that you use the rs number to find the nearest gene.
The gene MAPK3 is a member of the MAP kinase family, which is also known as extracellular signal-regulated kinases and acts in signaling cascade that regulates various cellular processes such as proliferation, differentiation, and cell cycle progression in response to a variety of extracellular signals. MAPK3 is activated by upstream kinases, resulting in its translocation to the nucleus where it phosphorylates nuclear targets. We are going to use the UCSC Genome Browser to find SNPs in this gene.
First, go to the UCSC Genome Browser and find the locus of the MAPK3 gene in human (make sure you select the hg19 assembly). You will discover that several genes are called MAPK3. Click on the first isoform of the gene.
Jump to the (chr16:30,128,215-30,128,325) 3rd exon of MAPK3. Scroll down to the Variation and Repeats section, and set the "All SNPs (150)", “ExAC” and “HapMap SNPs” options to "full.” Also, under Comparative Genomics, select “full’ for Conservation. You are welcome to turn off other tracks to have a clear and less cluttered view. Click refresh.
Q12: Focusing on those SNPs reported in dbSNP build 147, how many SNPs do you find?
Q13: Look carefully at the Conservation tracks corresponding to 100-way conservation as measured by PhyloP as well as "Multiz" alignment of 100 vertebrates
(note: you can display the corresponding alignments for different species by clicking on the "Conservation" Link in the "Comparative Genomics" section of track selection. Also Note: you can also change the scale of PhyloP scores as well by clicking on the "Basewise Conservation (PhyloP)" link, setting the verticale viewing range to a larger value).
Compare rs61736376 to rs372702309. Is one more conserved than the other? Why or why not?
Q14: For rs61736376, what biological/biochemical effect(s) might this have for individuals with the non-reference allele?
Q15: Look at the Exome Aggregation Consortium (ExAC) track and the HapMap track. How many variants present in each of them?
Q16: What is the key difference in the experimental design between the ExAC and Hapmap projects? Could this explain the pattern of variation you see?
You will see many SNPs in this region. By looking at “Mammal Cons” scores at these SNPs, you can tell which SNPs are conserved. We can grab all the SNPs in a region using the Tables function. From the "Tools" menu, select "Table Browser".
There, set the group option to "Variation" The track option should be set to "All SNPs (150)" and the region options should be set to "position" and chr16:30,128,215-30,128,325, which forms the boundaries of exon 3 of the MAPK3 gene. Click on the "get output" button at the bottom of the screen. You might want to paste the output into a text file so you can manipulate the columns in terminal.
Q17: How many insertions, deletions and in-dels do you see? Hint: look at the "class" field
Q18: How many variants are predicted to directly affect the translated MAPK3 protein sequence? Hint: look at the "func" column
Now, let's get the actual values for all of the PhyloP scores. Note you may find it helpful to keep the previous table open in a separate tab in your browser, and make a new tab to extract the table below, so that you can compare both tables. You might also find copying the table(s) into Excel helpful in comparing the outputs.
Set the group option to "Comparative Genomics" The track option should be set to "Conservation" and the region options should be set to "position" and chr16:30,128,215-30,128,325, which forms the boundaries of exon 3 of the MAPK3 gene. Under "table" select "100 Vert. Cons (phyloP100wayAll).
Click on the "get output" button at the bottom of the screen. Notice that insertions and deletions are also returned using this technique.
Q19. What is the PhyloP score for rs1143695? How many SNPs achieve the maximum PhyloP score in this region?