FaST-LMM, which stands for Factored Spectrally Transformed Linear Mixed Models, is a program for performing both single-SNP and SNP-set genome-wide association studies (GWAS) on extremely large data sets. This release contains the improvements described in Widmer et al., Scientific Reports 2014, and tests for epistasis. See the FaST-LMM annotated bibliography for additional papers that contributed to the FaST-LMM project and associated code.
For help with the software, please contact fastlmm@lists.research.microsoft.com.
This version of FaST-LMM is designed for use with randomly ascertained data with Gaussian residuals. If you have case-control data with substantial ascertainment bias, you should first transform your phenotype(s) using LEAP (Weissbrod et al., arXiv 2014), which is available at https://github.com/omerwe/LEAP. If you are analyzing continuous phenotypes with non-Gaussian residuals, you should first transform your phenotype(s) using Warped-LMM (Fusi et al., Nature Commun 2014), available at https://github.com/MicrosoftGenomics/WarpedLMM.
FaST-LMM uses four input files containing (1) the SNP data to be tested, (2) the SNP data used to determine the genetic similarity matrix (GSM) between individuals (which can be different from 1), (3) the phenotype data, and (4, optionally) a set of covariates.
SNP files can be in PLINK format (ped/map, tped/tfam, bed/bim/fam, or fam/dat/map). For the most speed, use the binary format in SNP major order. See the PLINK manual http://pngu.mgh.harvard.edu/~purcell/plink/ (Purcell et al., Am J Hum Genet 2007) for further details. FaST-LMM also supports Hdf5 file format http://www.hdfgroup.org/HDF5/whatishdf5.html. See https://github.com/MicrosoftGenomics/PySnpTools for more details. Note that each SNP is standardized to have mean zero and standard deviation one across all individuals before processing. Missing values are mean imputed.
The required file containing the phenotype uses the PLINK alternate phenotype format with no header. The covariate file also uses this format (with additional columns for multiple covariates).
To prepare this notebook to run analyses, please run the following script.
In [1]:
# set some ipython notebook properties
%matplotlib inline
# set degree of verbosity (adapt to INFO for more verbose output)
import logging
logging.basicConfig(level=logging.WARNING)
# set figure sizes
import pylab
pylab.rcParams['figure.figsize'] = (10.0, 8.0)
# set display width for pandas data frames
import pandas as pd
pd.set_option('display.width', 1000)
If you would like to run any of the code below from the command line, first copy it into a file (e.g., test.py
), and then run it by typing python text.py
at the command line.
If you would like to see all the options for a function just type ? <function_name>
to an ipython prompt.
First, let's run a standard LMM analysis in which the GSM uses (almost) all available SNPs. The model for this analysis is called LMM(all) in Widmer et al., Scientific Reports 2014. We'll apply this model to the synthetic data in tests\datasets\synth
. The data has 500 samples with 5000 SNPs, and was generated from a Balding-Nichols model with FST=0.05.
When using a linear mixed model for association analysis, it is important to avoid proximal contamination (Lippert et al., Nat Meth 2011). To understand proximal contamination, first note that a LMM with no fixed effects, using a realized relationship matrix for the GSM (as FaST-LMM does), is mathematically equivalent to linear regression of the SNPs on the phenotype, with weights integrated over independent Normal distributions having the same variance (e.g., Hayes et al., Genet Res 2009). That is, a LMM using a given set of SNPs for the GSM is equivalent to a form of linear regression using those SNPs as covariates to correct for confounding. This equivalence implies that, when testing a given SNP, that SNP (and SNPs physically close to it) should be excluded from the computation of the GSM. If not, when testing a particular SNP, we would also be using that same SNP as a covariate, making the log likelihood of the null model higher than it should be, thus leading to deflation of the test statistic and loss of power.
Excluding the SNP you are testing and those SNPs in close proximity to it from the GSM in a naïve way is extremely computationally expensive. A computationally efficient approach for performing the exclusion is to use a GSM computed from all but chromosome $i$ when testing SNPs on chromosome $i$ (Lippert et al., Nat Meth 2011). We call this approach leave out one chromosome (LOOC). The analysis here does this.
In [2]:
# import the algorithm
from fastlmm.association import single_snp
# set up data
##############################
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
# run gwas
###################################################################
results_df = single_snp(bed_fn, pheno_fn, covar=cov_fn, count_A1=True)
# manhattan plot
import pylab
import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()
# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["PValue"].values, xlim=[0,5], ylim=[0,5])
# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)
Out[2]:
In Widmer et al., Scientific Reports 2014, we have shown that power can be increased while still maintaining control of type I error by using two GSMs: one based on all SNPs (G0
) and one based on selected SNPs that are highly correlated with the phenotype (G1
). The model is called LMM(select + all). This approach has greater computational demands (we recommend using a cluster computer when analyzing large data sets). Here is an example of how to apply this model to the synthetic data.
In [3]:
# example for two kernel feature selection
# this takes a couple of minutes to run on a 20-proc machine.
from pysnptools.snpreader import Bed
from fastlmm.association import single_snp_all_plus_select
from fastlmm.util.runner import LocalMultiProc
runner = LocalMultiProc(20,mkl_num_threads=5)
# define file names
snp_reader = Bed("../../tests/datasets/synth/all", count_A1=True)
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
# find the chr5 SNPs
test_snps = snp_reader[:,snp_reader.pos[:,0] == 5]
#select the 2nd kernel and run GWAS
results_df = single_snp_all_plus_select(test_snps=test_snps,G=snp_reader,pheno=pheno_fn,GB_goal=2,do_plot=True,runner=runner)
import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()
results_df.head()
Out[3]:
Aside: In some applications, you may want to use two kernels constructed from two pre-specified sets of SNPs (i.e., with no feature selection). Here we show how to do that and how to simultaneously find h2 and the mixing weight between the kernels.
In [4]:
# example script for two kernel without feature selection
import numpy as np
import pysnptools.util
from pysnptools.snpreader import Bed
from fastlmm.association import single_snp
# define file names
bed_fn = "../../tests/datasets/synth/all"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
# select data
###################################################################
snp_reader = Bed(bed_fn,count_A1=True)
# partition snps on chr6 vs rest
G1_chr = 6
G0 = snp_reader[:,snp_reader.pos[:,0] != G1_chr]
G1 = snp_reader[:,snp_reader.pos[:,0] == G1_chr]
# test on chr5
test_snps = snp_reader[:,snp_reader.pos[:,0] == 5]
results_df = single_snp(test_snps, pheno_fn, K0=G0, K1=G1, covar=cov_fn, GB_goal=2, count_A1=True)
import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()
results_df.head()
Out[4]:
In the same publication, we have shown that a simpler, more computationally efficient model can be used when the data is confounded only by population structure and not by family structure or cryptic relatedness. Under these circumstances, we have found that a model with a single GSM based on selected SNPs in combination with principle components as fixed-effect covariates yields good control of type I error and power.
This model, called LMM(select)+PCs, should be used with caution. Even if you explicitly remove closely related individuals from your data, cryptic relatedness may remain.
To use this model, first identify the principle components to be used with the PCgeno algorithm as described in Widmer et al., Scientific Reports 2014.
Then you can call single_snp_select with these PCs as covariates.
In [5]:
from pysnptools.snpreader import Bed
from fastlmm.util import compute_auto_pcs
from fastlmm.association import single_snp_select
# define file names
bed_fn = "../../tests/datasets/synth/all"
snp_reader = Bed(bed_fn,count_A1=True)
phen_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
# find number of PCs
pcs = compute_auto_pcs(bed_fn,count_A1=True)
print "selected number of PCs:", pcs["vals"].shape[1]
# test on chr5
test_snps = snp_reader[:,snp_reader.pos[:,0] == 5]
results_df = single_snp_select(test_snps=test_snps, G=snp_reader, pheno=phen_fn, covar=pcs, GB_goal=2)
import fastlmm.util.util as flutil
flutil.manhattan_plot(results_df[["Chr", "ChrPos", "PValue"]].values,pvalue_line=1e-5,xaxis_unit_bp=False)
pylab.show()
results_df.head()
Out[5]:
You can test for epistatic interactions between pairs of SNPs as well. Here is an example analysis applied to the same synthetic data. Note that this version of the code uses a likelihood ratio test based on maximum-likelihood estimates. A REML-based version is in the works.
In [6]:
# import the algorithm and reader
from fastlmm.association import epistasis
from pysnptools.snpreader import Bed
# define file names
bed_reader = Bed("../../tests/datasets/synth/all", count_A1=True)
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
# partition data into the first 50 SNPs on chr1 and all but chr1
G0 = bed_reader[:,bed_reader.pos[:,0] != 1]
test_snps = bed_reader[:,bed_reader.pos[:,0] == 1][:,0:50]
# run epistasis analysis
results_df = epistasis(test_snps, pheno_fn, G0=G0, covar=cov_fn)
# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["PValue"].values, xlim=[0,5], ylim=[0,5])
# print head of results data frame
pd.set_option('display.width', 1000)
results_df.head(n=10)
Out[6]:
SNP-set association testing is performed similarly to single-SNP testing, except we test sets of SNPs by putting them together into one GSM, separate from any (optional) background GSM that corrects for confounding. Both LRT ("lrt") and score ("sc_davies") tests are supported. The LRT is computed by default, but you can switch to the score test by setting test=“sc_davies”. The score test can be more conservative in some settings (and hence can have less power), but it is closed form and often can be computed much faster. The algorithms in their current form are described in Lippert et al., Bioinformatics 2014.
Here is an example that uses the LRT and no background GSM--note the inflation in the results.
In [7]:
# this will take a few minutes to run
# import the algorithm and reader
from fastlmm.association import snp_set
from pysnptools.snpreader import Bed
import datetime
# define file names
test_snps_fn = "../../tests/datasets/synth/chr1"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
cov_fn = "../../tests/datasets/synth/cov.txt"
set_list_fn = "../../tests/datasets/synth/chr1.sets.txt"
G0_fn = None
# run SNP-set analysis
results_df = snp_set(test_snps=test_snps_fn, G0=G0_fn, set_list=set_list_fn, pheno=pheno_fn, covar=cov_fn, test="lrt")
# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["P-value"].values, xlim=[0,5], ylim=[0,5], addlambda=False, legend="QQ")
# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)
Out[7]:
As just mentioned, we see inflation of the test statistic because no background GSM was used. When you include a background GSM (using G0
) type I error is better controlled. As in the single SNP case, proximal contamination is avoided using LOOC.
In [8]:
#Here we give a G0, the GSM, created via Bed file with all the chroms except chrom 1 (the test snps)
#WARNING: Even on a fast machine, this takes 24 minutes (but, running with test="sc_davies" takes seconds)
bed_reader = Bed("../../tests/datasets/synth/all.bed",count_A1=True)
chrNot1 = bed_reader[:,bed_reader.pos[:,0] != 1]
G0_fn = "../../tests/datasets/synth/chrNot1.bed"
Bed.write(G0_fn,chrNot1.read(),count_A1=True)
# run SNP-set analysis
results_df = snp_set(test_snps=test_snps_fn, G0=G0_fn, set_list=set_list_fn, pheno=pheno_fn, covar=cov_fn, test="lrt")
# qq plot
from fastlmm.util.stats import plotp
plotp.qqplot(results_df["P-value"].values, xlim=[0,5], ylim=[0,5], addlambda=False, legend="QQ")
# print head of results data frame
import pandas as pd
pd.set_option('display.width', 1000)
results_df.head(n=10)
Out[8]:
The G1
option available for single-marker testing is not yet supported for set testing. With the same caveats as previously described, however, you can use PC covariates and a G0
based on selected SNPs when your data is confounded, for example, by population structure.
In [9]:
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np
# define file names
snp_reader = Bed("../../tests/datasets/synth/all",count_A1=True)
cov_fn = "../../tests/datasets/synth/cov.txt"
pheno_fn = "../../tests/datasets/synth/pheno_10_causals.txt"
# Divide the data into train (all but the last 10 individuals) and test (the last 10 individuals)
# (the cov and pheno will automatially be divided to match)
train = snp_reader[:-10,:]
test = snp_reader[-10:,:]
# In the style of scikit-learn, create a predictor and then train it.
fastlmm = FastLMM(GB_goal=2)
fastlmm.fit(K0_train=train,X=cov_fn,y=pheno_fn)
# Now predict with it
mean, covariance = fastlmm.predict(K0_whole_test=test,X=cov_fn)
print "Predicted means and stdevs"
print mean.val[:,0]
print np.sqrt(np.diag(covariance.val))
#Plot actual phenotype and predicted phenotype
whole_pheno = Pheno(pheno_fn)
actual_pheno = whole_pheno[whole_pheno.iid_to_index(mean.iid),:].read()
pylab.plot(actual_pheno.val,"r.")
pylab.plot(mean.val,"b.")
pylab.errorbar(np.arange(mean.iid_count),mean.val,yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
pylab.xlabel('testing examples')
pylab.ylabel('phenotype, actual (red) and predicted (blue with stdev)')
pylab.show()
Columns in the output are as follows:
SNP
or Set
The SNP or set identifier tested.
PValue
The P value computed for the SNP tested.
NullLogLike
The log likelihood of the null model.
AltLogLike
The log likelihood of the alternative model.
Global statistics, such as sample size and the number of SNPs tested are printed to stdout at the end of the run.
Chr
The chromosome identifier for the SNP tested or 0 if unplaced. Taken from the PLINK file.
GenDist
The genetic distance of the SNP on the chromosome. Taken from the PLINK file. Any units are allowed, but typically centimorgans or morgans are used.
ChrPos
The base-pair position of the SNP on the chromosome (bp units). Taken from the PLINK file.
SnpWeight
The fixed-effect weight of the SNP.
SnpWeightSE
The standard error of the SnpWeight.
Nullh2
The narrow sense heritability given by h2 $=σ_g^2/(σ_g^2+σ_e^2)$ on the null model.
#SNPs_in_Set
The number of markers in the set.
Alt_h2
The value found in the alt. model, given by $\sigma^2(h2(a2*K_S+(1-a2) * K_C)+(1-h2)*I)$, where $K_C$ is the NxN matrix to correct for confounding, and $K_S$ is the matrix containing SNPs to be tested.
Alt_a2
See Alt_h2.
Null_h2
The value found in the null model, given by $\sigma^2(h2*K_C+1-h2)*I)$, where $K_C$ is the $N \times N$ matrix to correct for confounding.