LEAP Manual

December 26, 2014

LEAP is a framework for genome-wide association studies (GWAS) of ascertained case-control studies. LEAP works by estimating latent liability values for every individual, and then treating these values as continuous phenotypes. Please see the LEAP Github page for installation instructions and for a full documentation of LEAP.

LEAP uses the packages pysnptools and FaSTLMM for data processing and GWAS analysis. Before reading this notebook, it is recommended to read the FaSTLMM Ipython notebook, which demonstrates how one can perform GWAS of continuous phenotypes. This notebook explains how to adapt the analyses described there for ascertained case-control studies.

The LEAP procedure is composed of five stages:

  1. Finding a set of individuals who are related to other individuals in the study. LEAP employs a greedy algorithm to find a small subset of such individuals, such that after their exclusion, there are no related individuals in the study. These individuals are excluded from the analysis in stages 3 and 4 below, but after fitting a model in stage 4, their liabilities are estimated along with other indviduals. All individuals are considered in the GWAS stage (stage 5).
  2. Computing an eigendecomposition for every excluded chromosome, consisting of all SNPs on all chromosomes except for the excluded one.
  3. Estimating the heritability explained by each set of chromosomes (with one chromosome excluded), using the method of Golan et al..
  4. Estimating the latent liability value of every individual via Probit regression.
  5. Testing SNPs on the excluded chromosome for association with the estimated liabilities.

The following Python code demonstrates the LEAP procedure in detail on a small synthetic data set. The dataset was simulated with 50% heritability and 0.1% prevalence. It included 500 cases, 500 controls, 499 causal SNPs and 10000 SNPs differentiated with FST=0.01. Causal SNPs are called csnp[i]. The original liabilities for this file are available in the file dataset1.phe.liab (but this file is not used by LEAP). The code should be run from the "regression" directory in the LEAP package.

We note that it is also possible to run LEAP via executable Python scripts. Please see the LEAP Github page for detailed instructions, and the file "leap_pipeline.sh" in the "regression" directory of the LEAP package for a usage example.

Code and Data preparation

We start by importing all the required packages, setting notebook parameters and pointing to the data.


In [1]:
#import packages
import pandas as pd
import numpy as np
import leap.leapUtils as leapUtils
import leap.leapMain as leapMain
from pysnptools.snpreader.bed import Bed
from fastlmm.util.stats import plotp
import fastlmm.association as association
import pylab
import logging


D:\Anaconda\Lib\site-packages\matplotlib\__init__.py:1256: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)

In [2]:
# set some ipython notebook properties
%matplotlib inline

# set degree of verbosity (adapt to INFO for more verbose output)
logging.basicConfig(level=logging.WARNING)

# set figure sizes
pylab.rcParams['figure.figsize'] = (10.0, 8.0)

# set display width for pandas data frames
pd.set_option('display.width', 1000)

In [3]:
#Define analysis data
bfile = 'dataset1/dataset1'
phenoFile = bfile+'.phe'
chromosomes = xrange(1,11)
prevalence = 0.001
bed = Bed(bfile).read().standardize()
causalSNPs = [s for s in bed.sid if 'csnp' in s] #These are the names of the causal SNPs

Simple Linear Regression Analysis

We first attempt to perform association testing via a simple linear regression, by treating the case-control phenotype as if it were continuous. We note that it is more appropriate to use a generalized linear model (e.g. logistic regression). However, in this demonstration we use a linear regression model, because it is much faster, and often yields results very similar to logistic regression in balanced case-control studies.


In [4]:
#Analyze the data with simple linear regression
linreg_df = leapUtils.linreg(bfile, phenoFile)

#print the top 10 ranking SNPs
print linreg_df.head(n=10)


       SNP  Chr  GenDist  ChrPos        PValue
0   csnp48    3        0       0  4.894914e-10
1  csnp191    3      140     140  4.405043e-08
2  csnp253    4      330     330  1.435530e-06
3  csnp309    7      310     310  1.936628e-06
4  csnp366    6      300     300  2.103015e-06
5   csnp81    7       90      90  2.255680e-06
6  csnp345    8      340     340  2.482452e-06
7  csnp389    3      390     390  5.301356e-06
8  csnp244    7      230     230  6.157386e-06
9  csnp250    5      280     280  6.977496e-06

In [5]:
# qq plot for non-causal SNPs only
nocsnps_df = linreg_df[~(linreg_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="Simple Linear Regression, non-causal SNPs")
pylab.show()


lambda=1.5553

The Q-Q plot demonstrates that the linear regression model suffers from severe inflation, and cannot control for type I error. We therefore turn to use a linear mixed model (LMM) instead.

Standard LMM Analysis

We now perform an analysis using an LMM. LMMs are designed for continuous phenotype studies, but are often used to analyze confounded case-control studies as well, because they are highly robust to type I errors due to confounding. We perform a leave-one-chromosome-out (LOCO) analysis to prevent proximal contamination. Thus, in the analysis of each SNP, we control for confounding by conditioning on all SNPs, except for the ones on the tested chromosome.


In [6]:
#Naively analyze the case-control data via an LMM as if it were continuous
naiveLMM_df = association.single_snp_leave_out_one_chrom(bfile, phenoFile) 

#print 'Top 10 most associated SNPs:'
print naiveLMM_df.head(n=10)


       SNP  Chr  GenDist  ChrPos        PValue  SnpWeight  SnpWeightSE  NullLogDelta
0   csnp48    3        0       0  1.779753e-08  -0.081166     0.014293            -5
1  csnp191    3      140     140  3.109763e-07  -0.073400     0.014248            -5
2  csnp132    5      150     150  1.152797e-05   0.063249     0.014347            -5
3  snp1419    8     1330    1330  1.446491e-05   0.061527     0.014117            -5
4  csnp253    4      330     330  1.574784e-05   0.061869     0.014258            -5
5  csnp449    8      450     450  2.150155e-05  -0.060615     0.014198            -5
6  snp3412    9     3210    3210  4.255551e-05   0.057628     0.014017            -5
7  csnp389    3      390     390  5.295321e-05   0.059238     0.014591            -5
8   csnp81    7       90      90  7.491180e-05  -0.057288     0.014406            -5
9  csnp250    5      280     280  8.015306e-05   0.057014     0.014396            -5

In [7]:
# qq plot for non-causal SNPs only
nocsnps_df = naiveLMM_df[~(naiveLMM_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="Standard LMM, non-causal SNPs")
pylab.show()


lambda=1.0178

In [8]:
#Power plot for LMM analysis
leapUtils.powerPlot(naiveLMM_df, causalSNPs, title='Standard LMM Power Curve')


The LMM properly controls for type I errors. However, we may lose power by naively treating the case-control status as if were continuous. We now try to use LEAP, which computes continuous liability values for all individuals in the study, and then tests for association with these values via an LMM.

LEAP

We begin by performing a greedy algorithm to find and exclude related individuals (stage 1 of the 5 LEAP stages) from subsequent stages.


In [9]:
#1. Find individuals to exclude to eliminate relatedness (kinship coeff > 0.05)
indsToKeep = leapUtils.findRelated(bed, cutoff=0.05)


Computing kinship matrix...
Done in 1.34 seconds
Marking 54 individuals to be removed due to high relatedness

We now run a loop over each chromosome. For each chromosome, we compute an eigendecomposition of the kinship matrix consisting of all SNPs on the other chromosomes (stage 2 of the LEAP procedure), estimate heritability and liabilities using all other chromosomes (stages 3-4 of the LEAP procedure) and then test for association with the estimated liabilities (stage 5 of the procedure).


In [10]:
#Iterate over each chromosome
frame_list = []
for chrom in chromosomes:
    print
    print 'Analyzing chromosome', chrom, '...'

    #Create a bed object excluding SNPs from the current chromosome
    bedExclude = leapUtils.getExcludedChromosome(bfile, chrom)

    #Create a bed object including only SNPs from the current chromosome
    bedTest = leapUtils.getChromosome(bfile, chrom)	

    #2. Compute eigendecomposition for the data
    eigenFile = 'temp_eigen.npz'
    eigen = leapMain.eigenDecompose(bedExclude, outFile=eigenFile)	

    #3. compute heritability explained by this data
    h2 = leapMain.calcH2(phenoFile, prevalence, eigen, keepArr=indsToKeep, h2coeff=1.0)

    #4. Compute liabilities explained by this data
    liabs = leapMain.probit(bedExclude, phenoFile, h2, prevalence, eigen, keepArr=indsToKeep)

    #5. perform GWAS, using the liabilities as the observed phenotypes
    results_df = leapMain.leapGwas(bedExclude, bedTest, liabs, h2)
    frame_list.append(results_df)


Analyzing chromosome 1 ...
Computing kinship matrix...
Done in 1.25 seconds
Computing eigendecomposition...
Done in 0.86 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.457940
Beginning Probit regression...
Done in 0.11 seconds
Performing LEAP GWAS...

Analyzing chromosome 2 ...
Computing kinship matrix...
Done in 1.04 seconds
Computing eigendecomposition...
Done in 1.02 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.465129
Beginning Probit regression...
Done in 0.12 seconds
Performing LEAP GWAS...

Analyzing chromosome 3 ...
Computing kinship matrix...
Done in 1.05 seconds
Computing eigendecomposition...
Done in 0.84 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.460831
Beginning Probit regression...
Done in 0.12 seconds
Performing LEAP GWAS...

Analyzing chromosome 4 ...
Computing kinship matrix...
Done in 1.16 seconds
Computing eigendecomposition...
Done in 0.88 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.448058
Beginning Probit regression...
Done in 0.12 seconds
Performing LEAP GWAS...

Analyzing chromosome 5 ...
Computing kinship matrix...
Done in 1.39 seconds
Computing eigendecomposition...
Done in 0.97 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.433674
Beginning Probit regression...
Done in 0.11 seconds
Performing LEAP GWAS...

Analyzing chromosome 6 ...
Computing kinship matrix...
Done in 1.27 seconds
Computing eigendecomposition...
Done in 1.43 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.460130
Beginning Probit regression...
Done in 0.12 seconds
Performing LEAP GWAS...

Analyzing chromosome 7 ...
Computing kinship matrix...
Done in 1.23 seconds
Computing eigendecomposition...
Done in 1.03 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.460162
Beginning Probit regression...
Done in 0.13 seconds
Performing LEAP GWAS...

Analyzing chromosome 8 ...
Computing kinship matrix...
Done in 0.92 seconds
Computing eigendecomposition...
Done in 0.79 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.482879
Beginning Probit regression...
Done in 0.13 seconds
Performing LEAP GWAS...

Analyzing chromosome 9 ...
Computing kinship matrix...
Done in 1.15 seconds
Computing eigendecomposition...
Done in 0.90 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.470872
Beginning Probit regression...
Done in 0.15 seconds
Performing LEAP GWAS...

Analyzing chromosome 10 ...
Computing kinship matrix...
Done in 0.91 seconds
Computing eigendecomposition...
Done in 0.84 seconds
Removing the top 10 PCs from the kinship matrix
Computing h2 for a binary phenotype
h2: 0.492799
Beginning Probit regression...
Done in 0.13 seconds
Performing LEAP GWAS...

We now join the results for all the chromosomes, and display the results.


In [11]:
#Join together the results of all chromosomes, and print the top ranking SNPs
leap_df = pd.concat(frame_list)
leap_df.sort("PValue", inplace=True)
leap_df.index = np.arange(len(leap_df))
print 'Top 10 most associated SNPs:'
print leap_df.head(n=10)


Top 10 most associated SNPs:
       SNP  Chr  GenDist  ChrPos        PValue  SnpWeight  SnpWeightSE  NullLogDelta
0   csnp48    3        0       0  7.548321e-09  -0.362539     0.062202      0.156998
1  csnp191    3      140     140  8.086279e-09  -0.359522     0.061810      0.156998
2  csnp253    4      330     330  1.819815e-06   0.296565     0.061771      0.208521
3  csnp389    3      390     390  1.088394e-05   0.279377     0.063188      0.156998
4  snp1419    8     1330    1330  1.357757e-05   0.274829     0.062855      0.068512
5  csnp449    8      450     450  2.153346e-05  -0.269316     0.063090      0.068512
6  csnp380    2      340     340  2.908695e-05   0.263239     0.062677      0.139710
7  csnp250    5      280     280  3.058698e-05   0.258183     0.061643      0.266875
8  snp1326    8     1270    1270  4.457615e-05   0.257467     0.062789      0.068512
9  csnp366    6      300     300  4.887874e-05   0.256423     0.062867      0.159819

In [12]:
# qq plot for non-causal SNPs only
nocsnps_df = leap_df[~(leap_df['SNP'].str.contains("csnp"))]
plotp.qqplot(nocsnps_df["PValue"].values, xlim=[0,5], ylim=[0,5], title="LEAP, non-causal SNPs")
pylab.show()


lambda=1.0126

In [13]:
#Power plot for LEAP and for a standard LMM
leapUtils.powerPlot(naiveLMM_df, causalSNPs, title='LMM')
leapUtils.powerPlot(leap_df, causalSNPs)
pylab.legend(["LMM", "LEAP"], loc='upper left')
pylab.show()


The results demonstrate that LEAP properly controls for type I error on the one hand, but is more powerful than a standard LMM on the other (for significance thresholds < 10^(-3.5)). The advantage here is relatively small due to the small data set size, but it substantially increases for larger data sets, as demonstrated in the paper.