LEAP is a framework for genome-wide association studies (GWAS) of ascertained case-control studies. LEAP works by estimating latent liability values for every individual, and then treating these values as continuous phenotypes. Please see the LEAP Github page for installation instructions and for a full documentation of LEAP.
LEAP uses the packages pysnptools and FaSTLMM for data processing and GWAS analysis. Before reading this notebook, it is recommended to read the FaSTLMM Ipython notebook, which demonstrates how one can perform GWAS of continuous phenotypes. This notebook explains how to adapt the analyses described there for ascertained case-control studies.
The LEAP procedure is composed of five stages:
The following Python code demonstrates the LEAP procedure in detail on a small synthetic data set. The dataset was simulated with 50% heritability and 0.1% prevalence. It included 500 cases, 500 controls, 499 causal SNPs and 10000 SNPs differentiated with FST=0.01. Causal SNPs are called csnp[i]. The original liabilities for this file are available in the file dataset1.phe.liab (but this file is not used by LEAP). The code should be run from the "regression" directory in the LEAP package.
We note that it is also possible to run LEAP via executable Python scripts. Please see the LEAP Github page for detailed instructions, and the file "leap_pipeline.sh" in the "regression" directory of the LEAP package for a usage example.
We start by importing all the required packages, setting notebook parameters and pointing to the data.
In [1]:
#import packages
import pandas as pd
import numpy as np
import leap.leapUtils as leapUtils
import leap.leapMain as leapMain
from pysnptools.snpreader.bed import Bed
from fastlmm.util.stats import plotp
import fastlmm.association as association
import pylab
import logging
In [2]:
# set some ipython notebook properties
%matplotlib inline
# set degree of verbosity (adapt to INFO for more verbose output)
logging.basicConfig(level=logging.WARNING)
# set figure sizes
pylab.rcParams['figure.figsize'] = (10.0, 8.0)
# set display width for pandas data frames
pd.set_option('display.width', 1000)
In [3]:
#Define analysis data
bfile = 'dataset1/dataset1'
phenoFile = bfile+'.phe'
chromosomes = xrange(1,11)
prevalence = 0.001
bed = Bed(bfile).read().standardize()
causalSNPs = [s for s in bed.sid if 'csnp' in s] #These are the names of the causal SNPs
We first attempt to perform association testing via a simple linear regression, by treating the case-control phenotype as if it were continuous. We note that it is more appropriate to use a generalized linear model (e.g. logistic regression). However, in this demonstration we use a linear regression model, because it is much faster, and often yields results very similar to logistic regression in balanced case-control studies.
In [4]:
#Analyze the data with simple linear regression
linreg_df = leapUtils.linreg(bfile, phenoFile)
#print the top 10 ranking SNPs
print linreg_df.head(n=10)
In [5]:
# qq plot for non-causal SNPs only
nocsnps_df = linreg_df[~(linreg_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="Simple Linear Regression, non-causal SNPs")
pylab.show()
The Q-Q plot demonstrates that the linear regression model suffers from severe inflation, and cannot control for type I error. We therefore turn to use a linear mixed model (LMM) instead.
We now perform an analysis using an LMM. LMMs are designed for continuous phenotype studies, but are often used to analyze confounded case-control studies as well, because they are highly robust to type I errors due to confounding. We perform a leave-one-chromosome-out (LOCO) analysis to prevent proximal contamination. Thus, in the analysis of each SNP, we control for confounding by conditioning on all SNPs, except for the ones on the tested chromosome.
In [6]:
#Naively analyze the case-control data via an LMM as if it were continuous
naiveLMM_df = association.single_snp_leave_out_one_chrom(bfile, phenoFile)
#print 'Top 10 most associated SNPs:'
print naiveLMM_df.head(n=10)
In [7]:
# qq plot for non-causal SNPs only
nocsnps_df = naiveLMM_df[~(naiveLMM_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="Standard LMM, non-causal SNPs")
pylab.show()
In [8]:
#Power plot for LMM analysis
leapUtils.powerPlot(naiveLMM_df, causalSNPs, title='Standard LMM Power Curve')
The LMM properly controls for type I errors. However, we may lose power by naively treating the case-control status as if were continuous. We now try to use LEAP, which computes continuous liability values for all individuals in the study, and then tests for association with these values via an LMM.
We begin by performing a greedy algorithm to find and exclude related individuals (stage 1 of the 5 LEAP stages) from subsequent stages.
In [9]:
#1. Find individuals to exclude to eliminate relatedness (kinship coeff > 0.05)
indsToKeep = leapUtils.findRelated(bed, cutoff=0.05)
We now run a loop over each chromosome. For each chromosome, we compute an eigendecomposition of the kinship matrix consisting of all SNPs on the other chromosomes (stage 2 of the LEAP procedure), estimate heritability and liabilities using all other chromosomes (stages 3-4 of the LEAP procedure) and then test for association with the estimated liabilities (stage 5 of the procedure).
In [10]:
#Iterate over each chromosome
frame_list = []
for chrom in chromosomes:
print
print 'Analyzing chromosome', chrom, '...'
#Create a bed object excluding SNPs from the current chromosome
bedExclude = leapUtils.getExcludedChromosome(bfile, chrom)
#Create a bed object including only SNPs from the current chromosome
bedTest = leapUtils.getChromosome(bfile, chrom)
#2. Compute eigendecomposition for the data
eigenFile = 'temp_eigen.npz'
eigen = leapMain.eigenDecompose(bedExclude, outFile=eigenFile)
#3. compute heritability explained by this data
h2 = leapMain.calcH2(phenoFile, prevalence, eigen, keepArr=indsToKeep, h2coeff=1.0)
#4. Compute liabilities explained by this data
liabs = leapMain.probit(bedExclude, phenoFile, h2, prevalence, eigen, keepArr=indsToKeep)
#5. perform GWAS, using the liabilities as the observed phenotypes
results_df = leapMain.leapGwas(bedExclude, bedTest, liabs, h2)
frame_list.append(results_df)
We now join the results for all the chromosomes, and display the results.
In [11]:
#Join together the results of all chromosomes, and print the top ranking SNPs
leap_df = pd.concat(frame_list)
leap_df.sort("PValue", inplace=True)
leap_df.index = np.arange(len(leap_df))
print 'Top 10 most associated SNPs:'
print leap_df.head(n=10)
In [12]:
# qq plot for non-causal SNPs only
nocsnps_df = leap_df[~(leap_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="LEAP, non-causal SNPs")
pylab.show()
In [13]:
#Power plot for LEAP and for a standard LMM
leapUtils.powerPlot(naiveLMM_df, causalSNPs, title='LMM')
leapUtils.powerPlot(leap_df, causalSNPs)
pylab.legend(["LMM", "LEAP"], loc='upper left')
pylab.show()
The results demonstrate that LEAP properly controls for type I error on the one hand, but is more powerful than a standard LMM on the other (for significance thresholds < 10^(-3.5)). The advantage here is relatively small due to the small data set size, but it substantially increases for larger data sets, as demonstrated in the paper.