eQTL Processing

This notebook contains code to read the results from "Run eQTL Analysis" and parse them into useful formats. There is a cell or cells in this notebook for each eQTL Analysis often followed by something like 2 +. This prevents the notebook from being executed all at once. Basically, after running the first few cells to set things up, you probably just want to execute the other cells individually as you need to.

Some cells can be executed while the eQTL is running. This allows you to see how many significant hits you are getting etc. as things are running. This code (specifically process_eqtl_results is designed to run efficiently by skipping genes that were already parsed and using the already calculated results. However, if some genes' results have changed, you should delete the processing directory (os.path.join(outdir, 'eqtls01') for instance) and let process_eqtl_results re-process all of the data. This is also sometimes necessary of process_eqtl_results hits an error. Often deleting the processing output directory and re-running process_eqtl_results will fix the problem.

TODO: Add some descriptions below of each eQTL analysis and when the cells should be run (during or after eQTL analysis).


In [1]:
import glob
import os
import random
random.seed(20151226)
import subprocess

import cdpybio as cpb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import pybedtools as pbt
import seaborn as sns
import statsmodels.stats.multitest as smm
import vcf as pyvcf

import cardipspy as cpy
import ciepy

%matplotlib inline
%load_ext rpy2.ipython

dy_name = 'eqtl_processing'

import socket
if socket.gethostname() == 'fl-hn1' or socket.gethostname() == 'fl-hn2':
    dy = os.path.join(ciepy.root, 'sandbox', 'tmp', dy_name)
    cpy.makedir(dy)
    pbt.set_tempdir(dy)
    
outdir = os.path.join(ciepy.root, 'output', dy_name)
cpy.makedir(outdir)

private_outdir = os.path.join(ciepy.root, 'private_output', dy_name)
cpy.makedir(private_outdir)

In [2]:
transcript_to_gene = pd.read_table(cpy.gencode_transcript_gene, header=None, 
                                   squeeze=True, index_col=0)
gene_info = pd.read_table(cpy.gencode_gene_info, index_col=0)

exp = pd.read_table(os.path.join(
        ciepy.root, 'output', 'eqtl_input', 
        'tpm_log_filtered_phe_std_norm_peer_resid.tsv'), 
                    index_col=0)

snpsnap = None

Methods

Given a set of eQTL results, I want to create several files:

  • lead_variants.tsv - This file will include the lead variants for each gene including variants that equally significant.
  • lead_variants_single.tsv - This file will include a single lead variant for each gene. I'll choose the most significant randomly if there are ties.
  • lead_variants_single_snv.tsv - This file will include the most significant SNV per gene. If there are ties I'll choose randomly.
  • pvalues.tsv - This file has the permutation p-value for each gene.
  • qvalues.tsv - This file has the permutation p-value, the q-value, and whether the gene is significant.
  • sig_snv_independent.tsv - This file will be created by LD pruning the variants in most_sig_single_snv.tsv.
  • all_snv_results_sorted.tsv.gz - This file has the results for all SNVs in all genes. It will be sorted by position.

In [3]:
def calculate_permutation_pvalues(eqtl_dy, out_dy):
    """
    Calculate empirical p-values based on the p-values from permutations. This function
    is pretty slow but it will read the existing output file (if it exists) and not recalculate
    p-values for genes that are already done so if you run this repeatedly as the eQTL analysis
    is running, it won't take too long at the end.
    
    Parameters:
    -----------
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    out_dy : str
        Path to directory to write output file pvalues.tsv. This file has 
        gene IDs in the first column and permutation p-values in the second column.
        
    Returns:
    --------
    pvals : pandas.Series
        Series with gene IDs as index and permutation p-values as values.
        
    new_lead_vars : pandas.DataFrame
        Dataframe with lead variants for each gene that we've just calculated
        permutation p-values for. Note that this is not all genes, just those 
    
    """
    cpy.makedir(out_dy)
    min_fns = glob.glob(os.path.join(eqtl_dy, '*', 'minimum_pvalues.tsv'))
    min_fns = pd.Series(min_fns, index=[fn.split(os.path.sep)[-2] for fn in min_fns])
    fn = os.path.join(out_dy, 'pvalues.tsv')
    if os.path.exists(fn):
        pvals = pd.read_table(fn, index_col=0,
                              header=None, squeeze=True)
    else:
        pvals = pd.Series()
    new_pvals = []
    new_genes = []
    new_lead_vars = []

    min_fns = min_fns[set(min_fns.index) - set(pvals.index)]
    if min_fns.shape[0] > 0:
        for fn in min_fns.values:
            gene_id = fn.split(os.path.sep)[-2]
            new_genes.append(gene_id)
            res_fn = os.path.join(os.path.split(fn)[0], '{}.tsv'.format(gene_id))
            with open(res_fn) as f:
                if len(f.readlines()) == 0:
                    print(res_fn)
            res = ciepy.read_emmax_output(res_fn)
            min_pvals = pd.read_table(fn, header=None, squeeze=True)
            p = (1 + sum(min_pvals <= res.PVALUE.min())) / float(min_pvals.shape[0] + 1)
            new_pvals.append(p)
            t = res[res.PVALUE == res.PVALUE.min()]
            t['gene_id'] = gene_id
            new_lead_vars.append(t)
        new_pvals = pd.Series(new_pvals, index=new_genes)
        pvals = pd.concat([pvals, new_pvals])
        new_lead_vars = pd.concat(new_lead_vars)
        new_lead_vars = parse_emmax_results(new_lead_vars)
        return pvals, new_lead_vars
    else:
        return pvals, None
    
def parse_emmax_results(df, add_af=True):
    """
    Take a dataframe of results from EMMAX and parse into a more useful format.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Pandas dataframe of EMMAX results read with ciepy.read_emmax_output(). Extra
        columns can be present but they may be overwritten. The dataframe must also
        include a "gene_id" column with the gene ID.
        
    Returns:
    --------
    out : pandas.DataFrame
        Parsed version of input df including (1) more reasonable column names, (2) chr
        chromosomes, (3) variant type column, 
    
    """
    # Lowercase column names.
    df.columns = [x.lower() for x in df.columns]
    # chr chromosome names.
    df['chrom'] = 'chr' + df.chrom.astype(int).astype(str)
    # Make cooordinates zero-based, half-open (like a bed file).
    df['beg'] -= 1
    df.columns = ['chrom', 'start'] + list(df.columns[2:])
    df.start = df.start.astype(int)
    # Add ref/alt alleles. CNVs will just have N for both ref and alt.
    df['ref'] = df.marker_id.apply(lambda x: x.split('_')[1].split('/')[0])
    df['alt'] = df.marker_id.apply(lambda x: x.split('_')[1].split('/')[1])
    # Variant type
    df['variant_type'] = 'snv'
    df.ix[df.ref.apply(lambda x: len(x)) > df.alt.apply(lambda x: len(x)), 'variant_type'] = 'del'
    df.ix[df.ref.apply(lambda x: len(x)) < df.alt.apply(lambda x: len(x)), 'variant_type'] = 'ins'
    df.ix[df.marker_id.apply(lambda x: 'CNV' in x), 'variant_type'] = 'cnv'
    df.ix[df.marker_id.apply(lambda x: 'DUP' in x), 'variant_type'] = 'cnv'
    df.ix[df.marker_id.apply(lambda x: 'DEL' in x), 'variant_type'] = 'cnv'
    # Annotate with variant caller
    df['variant_caller'] = 'gatk'
    df.ix[df.variant_type == 'cnv', 'variant_caller'] = 'genomestrip'
    df.ix[df.marker_id.apply(lambda x: 'DUP' in x), 'variant_caller'] = 'lumpy'
    df.ix[df.marker_id.apply(lambda x: 'DEL' in x), 'variant_caller'] = 'lumpy'
    # CNVs don't have the correct end because I didn't encode that info in the VCF so I'll
    # fix that here.
    cnv_ends = df.ix[df.variant_type == 'cnv', 'marker_id'].apply(lambda x: int(x.split('_')[-1]))
    df.ix[df.variant_type == 'cnv', 'end'] = cnv_ends.values
    df.end = df.end.astype(int)
    # Add location column.
    df['location'] = (df.chrom + ':' + df.start.astype(str) + '-' + df.end.astype(str))
    # Create unique index.
    df.index = df.location + ':' + df.gene_id
    # Add RSIDs. Not all variants have RSIDs.
    t = df.marker_id.apply(lambda x: x.split('_')[-1][0:2])
    t = t[t == 'rs']
    rsids = df.ix[t.index, 'marker_id'].apply(lambda x: x.split('_')[-1])
    df.ix[t.index, 'rsid'] = rsids
    # Add lengths of variants. I'll calculate lengths separately for different types of variants.
    # I won't give SNVs a length.
    df['length'] = np.nan
    t = df[df.variant_type == 'ins']
    df.ix[t.index, 'length'] = t.alt.apply(lambda x: len(x)) - t.ref.apply(lambda x: len(x))
    t = df[df.variant_type == 'del']
    df.ix[t.index, 'length'] = t.ref.apply(lambda x: len(x)) - t.alt.apply(lambda x: len(x))
    t = df[df.variant_type == 'cnv']
    df.ix[t.index, 'length'] = t.end - t.start
    # Distance to TSS.
    df = tss_to_eqtl_gene(df)
    # Add some info about the gene.
    df = df.merge(gene_info[['gene_name', 'gene_type']], left_on='gene_id', right_index=True)
    if add_af:
        # Add 1KGP allele frequencies for SNVs.
        afs = ['AF', 'EUR_AF', 'SAS_AF', 'AFR_AF', 'AMR_AF', 'EAS_AF']
        for af in afs:
            df[af] = np.nan
        fn = ('/publicdata/1KGP_20151103/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.'
              '20130502.sites.vcf.gz')
        kgp_vcf_reader = pyvcf.Reader(open(fn))
        for i in df[df.variant_type == 'snv'].index:
            kgp = None
            res = kgp_vcf_reader.fetch(df.ix[i, 'chrom'][3:], 
                                       df.ix[i, 'start'] + 1, 
                                       df.ix[i, 'start'] + 1)
            while True:
                try:
                    kgp = res.next()
                    if kgp.INFO['VT'] == ['SNP']:
                        break
                except StopIteration:
                    break
            if kgp:
                for af in afs:
                    df.ix[i, af] = kgp.INFO[af][0]
            else:
                for af in afs:
                    df.ix[i, af] = 0
    return df
    
def tss_to_eqtl_gene(df):
    """
    Take a dataframe of lead variants and add the distance from the variants
    to the nearest TSS for the eQTL gene.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Dataframe with lead variants.
        
    Returns:
    --------
    out : pandas.DataFrame
        Dataframe with lead variants with distance to TSS added.
    
    """
    tss = pbt.BedTool(cpy.gencode_tss_bed)
    tss_df = tss.to_dataframe()
    tss_df['gene'] = tss_df.name.apply(lambda x: transcript_to_gene[x.split('_')[0]])
    s = '\n'.join(tss_df.gene + '\t' + tss_df.start.astype(str) + '\t' +
                  tss_df.end.astype(str) + '\t' + tss_df.name + '\t' + 
                  tss_df.score + '\t' + tss_df.strand) + '\n'
    fake_tss = pbt.BedTool(s, from_string=True)
    fake_tss = fake_tss.sort()

    s = '\n'.join(df.gene_id + '\t' + df.start.astype(str) + '\t' + 
                  df.end.astype(str) + '\t.\t' + df.chrom) + '\n'
    fake_df_bt = pbt.BedTool(s, from_string=True)
    fake_df_bt = fake_df_bt.sort()

    res = fake_df_bt.closest(fake_tss, D='b', sorted=True)
    res_df = res.to_dataframe()
    res_df.index = (res_df.score + ':' + res_df.start.astype(str) + '-' + 
                    res_df.end.astype(str) + ':' + res_df.chrom)

    res_df['variant_gene'] = res_df.index
    res_df = res_df.drop_duplicates(subset=['variant_gene'])

    df['tss_dist'] = res_df.ix[df.index, 'blockStarts']
    df['tss_dist_abs'] = df.tss_dist.abs()
    return df

def qvalue(pvals, summary=True, plot=False):
    """Use the R qvalue package to adjust pvalues. pvals should be a pandas
    Series with gene names as the index and pvalues as the values."""
    import rpy2.robjects as ro
    ro.r('suppressMessages(library(qvalue))')
    ro.globalenv['pvals'] = pvals
    ro.r('qobj = qvalue(p=pvals, fdr.level=0.05)')
    ro.r('qvalues <- qobj$qvalues')
    ro.r('pi0 <- qobj$pi0')
    ro.r('lfdr <- qobj$lfdr')
    ro.r('sig <- qobj$significant')
    qvalues = ro.globalenv['qvalues']
    pi0 = ro.globalenv['pi0']
    lfdr = ro.globalenv['lfdr']
    sig = ro.globalenv['sig']
    qvalue_res = pd.DataFrame([list(pvals), list(qvalues), list(sig)], 
                              index=['perm_pvalue', 'perm_qvalue', 'perm_sig'],
                              columns=pvals.index).T
    qvalue_res['perm_sig'] = qvalue_res.perm_sig.astype(bool)
    qvalue_res = qvalue_res.sort_values(['perm_qvalue'])
    qvalues = pd.Series(list(qvalues), index=pvals.index)
    qvalue_res.index.name = None
    if summary:
        ro.r('summary(qobj)')
    if plot:
        ro.r('plot(qobj)')
    return qvalue_res

def make_single_lead_variant(df):
    """
    Take a dataframe of variants and randomly choose one single lead variant
    per gene if there are ties.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Dataframe with results from EMMAX.
        
    Returns:
    --------
    out : pandas.DataFrame
        Dataframe with a single lead variant per gene.
    
    """
    # To choose randomly, I'll assign a random number to each row, then sort
    # by gene and row.
    random.seed(204050)
    out = df.copy(deep=True)
    out['random'] = [random.random() for x in out.index]
    # Because I sort by pvalue in the next line, the input data frame
    # can contain non-lead variants. The row at the top of each gene's
    # results will have the minimum p-value.
    out = out.sort_values(by=['gene_id', 'pvalue', 'random'])
    out = out.drop_duplicates(subset=['gene_id'])
    out = out.drop(['random'], axis=1)
    return out

def process_eqtl_results(eqtl_dy, out_dy):
    """
    This method parses and annotates results from EMMAX, calculates the permutation p-values,
    and performs multiple-testing correction. Some of the steps are slow, but
    the method will read output files already available in out_dy and not re-analyze genes
    that are already present in the output file, so that means you can run this method repeatedly
    as the eQTL analysis proceeds. If you want all results reanalyzed, delete the output files 
    in out_dy before running.
    
    Parameters:
    -----------
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    out_dy : str
        Path to directory to write output files. The output files are (1) pvalues.tsv:
        gene IDs in the first column and permutation p-values in the second column, (2)
        qvalues.tsv: results from multiple hypothesis testing, (3) lead_variants.tsv:
        the most significant variants per gene, (4) lead_variants_single.tsv: one most
        significant variant per gene (ties broken randomly).
    
    """
    # Calculate permutation p-values and find lead variants for new genes.
    pvals, new_lead_vars = calculate_permutation_pvalues(eqtl_dy, out_dy)
    pvals.to_csv(os.path.join(out_dy, 'pvalues.tsv'), sep='\t')
    # Merge new lead variants with lead variants we already calculated if
    # they exist.
    fn = os.path.join(out_dy, 'lead_variants.tsv')
    if os.path.exists(fn):
        lead_vars = pd.read_table(fn, index_col=0)
    else:
        lead_vars = None
    if new_lead_vars is not None and lead_vars is not None:
        lead_vars = pd.concat([lead_vars, new_lead_vars])
    elif new_lead_vars is not None:
        lead_vars = new_lead_vars
    # Correct for multiple testing.
    qvalue_res = qvalue(pvals)
    qvalue_res.to_csv(os.path.join(out_dy, 'qvalues.tsv'), sep='\t')
    lead_vars = lead_vars.merge(qvalue_res, left_on='gene_id', right_index=True)
    lead_vars.sort_values(by=['gene_id', 'pvalue'], inplace=True)
    lead_vars.to_csv(os.path.join(out_dy, 'lead_variants.tsv'), sep='\t')
    # Make dataframe with single lead variant per gene.
    lead_vars_single = make_single_lead_variant(lead_vars)
    lead_vars_single.to_csv(os.path.join(out_dy, 'lead_variants_single.tsv'), sep='\t')

def make_gene_variant_pairs(qvalue_res, eqtl_dy):
    """
    Make a dataframe with all significant variants per gene. 
    
    Parameters:
    -----------
    qvalue_res : pandas.DataFrame
        Dataframe with columns perm_pvalue, perm_qvalue, and perm_sig. This is the 
        output from the qvalue() method.
        
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    Returns:
    --------
    gene_variant_pairs : pandas.DataFrame
        Dataframe with all significant gene-variant pairs.
    
    """
    # Get permutation p-value cutoff that maps to q = 0.05. I'll do this by finding
    # the largest permutation p-value that is still significant.
    # It's necessary to round this p-value because the floating point numbers get 
    # a little messed up. I found that some significant p-values weren't included
    # because the max permutation p-value was slightly smaller than it should be
    # due to floating point errors.
    qvalue_res = qvalue_res[qvalue_res.perm_sig]
    gene_variant_pairs = []
    for gene_id in qvalue_res.index:
        gene_variant_pairs.append(get_all_significant_variants(gene_id, eqtl_dy))
    gene_variant_pairs = pd.concat(gene_variant_pairs)
    gene_variant_pairs = parse_emmax_results(gene_variant_pairs, add_af=False)
    gene_variant_pairs.sort_values(by=['gene_id', 'pvalue'], inplace=True)
    return gene_variant_pairs

def get_all_significant_variants(gene_id, eqtl_dy):
    """
    Calculate permutation p-values for all variants tested for a given gene gene_id
    and return a dataframe with variants whose permutation p-value is less than
    perm_pval.
    
    Parameters:
    -----------
    gene_id : str
        Gene ID to calculate permutation p-values for.
        
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    Returns:
    --------
    res : pandas.DataFrame
        Dataframe with all significant variants for this gene.
    
    """
    fn = os.path.join(eqtl_dy, gene_id, 'minimum_pvalues.tsv')
    min_pvals = pd.read_table(fn, header=None, names=['pvalue'])
    res_fn = os.path.join(os.path.split(fn)[0], '{}.tsv'.format(gene_id))
    res = ciepy.read_emmax_output(res_fn).dropna(subset=['PVALUE'])
    min_pvals['null'] = True
    df = pd.DataFrame({'pvalue':res.PVALUE, 'null':False}).drop_duplicates().dropna()
    t = pd.concat([df, min_pvals])
    t.sort_values(by=['pvalue'], inplace=True)
    t['num_null'] = t.null.cumsum()
    t['perm_pval'] = (t.num_null + 1) / t.null.sum()
    t = t[t.null == False]
    t = t.drop(['null', 'num_null'], axis=1)
    res = res.merge(t, left_on='PVALUE', right_on='pvalue').drop('pvalue', axis=1)
    res = res[res.perm_pval == res.perm_pval.min()]
    res['gene_id'] = gene_id
    return res

def post_process_eqtl_results(eqtl_dy, out_dy):
    """
    This method creates several more output files. Unlike process_eqtl_results, it's better
    to wait and run this method after the eQTLs are all done.
    
    Parameters:
    -----------
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    out_dy : str
        Path to directory to write output files. The output files are TODO
    
    """
    qvalue_res = pd.read_table(os.path.join(out_dy, 'qvalues.tsv'), index_col=0)
    # Make file with all significant gene-variant pairs.
    gvp = make_gene_variant_pairs(qvalue_res, eqtl_dy)
    gvp.to_csv(os.path.join(out_dy, 'gene_variant_pairs.tsv'), sep='\t')
    # Make file with lead SNV for each significant gene (if there is a significant
    # SNV for the gene). Note this only has genes that were significant.
    sig_lead_snvs = make_single_lead_variant(gvp[gvp.variant_type == 'snv'])
    sig_lead_snvs.to_csv(os.path.join(out_dy, 'sig_lead_snvs_single.tsv'), sep='\t')
    # LD prune lead SNVs.
    indep = get_independent_snvs(sig_lead_snvs)
    indep.to_csv(os.path.join(out_dy, 'independent_lead_snvs.tsv'), sep='\t')
    s = '\n'.join(indep[['chrom', 'start', 'end']].apply(
            lambda x: '\t'.join([str(y) for y in x.values]), axis=1)) + '\n'
    bt = pbt.BedTool(s, from_string=True)
    bt = bt.sort()
    bt.saveas(os.path.join(out_dy, 'independent_lead_snvs.bed'))
    # Make pseudoheritability file.
    fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENS*.reml'))
    h2 = []
    for fn in fns:
        with open(fn) as f:
            h2.append(f.readlines()[-1].split()[1])
    h2 = pd.Series(h2, index=[x.split('/')[-2] for x in fns])
    h2 = h2.astype(float)
    h2.to_csv(os.path.join(out_dy, 'h2.tsv'), sep='\t')

def get_snpsnap():
    snpsnap_fns = glob.glob('/publicdata/SNPsnap_20151104/EUR_parse/*.tab')
    dfs = []
    for tab in snpsnap_fns:
        df = pd.read_table(tab, index_col=0, low_memory=False)
        tdf = df[['snp_maf', 'dist_nearest_gene_snpsnap_protein_coding',
                  'friends_ld08']]
        tdf.index = 'chr' + tdf.index
        dfs.append(tdf)
    snps = pd.concat(dfs)
    snps['maf_bin'] = pd.cut(snps.snp_maf, np.arange(0, 0.55, 0.05))
    snps['ld_bin'] = pd.cut(np.log10(snps.friends_ld08.replace(np.nan, 0) + 1), 10)
    snps['dist_bin'] = pd.cut(np.log10(snps.dist_nearest_gene_snpsnap_protein_coding
                                       + 1), 10)
    snps = snps[['maf_bin', 'ld_bin', 'dist_bin']]
    return snps

def get_independent_snvs(df):
    ld_beds = glob.glob('/publicdata/1KGP_20151103/LD_20151110/tabix/*EUR*.bed.gz')
    ld_beds = dict(zip([os.path.split(x)[1].split('_')[0] for x in ld_beds], ld_beds))
    df = df.drop_duplicates(subset=['location'])
    tdf = df[['chrom', 'start', 'end', 'pvalue']]
    tdf.index = tdf.chrom + ':' + tdf.end.astype(str)
    indep = cpb.analysis.ld_prune(tdf, ld_beds, snvs=list(snpsnap.index)).drop('pvalue', axis=1)
    return indep

def pseudo(eqtl_dy, out_dy):
    cpy.makedir(out_dy)
    fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENS*.reml'))
    h2 = []
    for fn in fns:
        with open(fn) as f:
            h2.append(f.readlines()[-1].split()[1])
    h2 = pd.Series(h2, index=[x.split('/')[-2] for x in fns])
    h2 = h2.astype(float)
    h2.to_csv(os.path.join(out_dy, 'h2.tsv'), sep='\t')

In [4]:
2 +


  File "<ipython-input-4-9371e255fe97>", line 1
    2 +
       ^
SyntaxError: invalid syntax

Processing during eQTL run

First eQTLs

Note: when running process_eqtl_results, sometimes I get an error about "plan shapes not aligned." In this case, just delete the output directory (out_dy = os.path.join(outdir, 'eqtls01')) and then run the command.

It seems like there is a bug right now for adding new genes to genes that have already finished. When a given analysis is finished, you should just delete the out_dy below and run process_eqtl_results for all the results at once.


In [119]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls01/gene_results'
out_dy = os.path.join(outdir, 'eqtls01')
process_eqtl_results(eqtl_dy, out_dy)


Call:
qvalue(p = pvals, fdr.level = 0.05)

pi0:	0.3879203	

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1    <1
p-value     2732   3491  4501   5226  5986 7157 17796
q-value        0   3162  4201   4911  5746 7266 17805
local FDR      0   2732  3359   3646  3906 4458 17805

Second eQTLs


In [ ]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls02/gene_results'
out_dy = os.path.join(outdir, 'eqtls02')
process_eqtl_results(eqtl_dy, out_dy)


Call:
qvalue(p = pvals, fdr.level = 0.05)

pi0:	0.46176	

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1   <1
p-value      243    383   655    890  1184 1638 5743
q-value        0      0   417    535   709 1097 5746
local FDR      0      0   301    369   431  556 5716

Third eQTLs

a = pd.read_table(os.path.join(outdir, 'eqtls02', 'qvalues.tsv'), index_col=0) b = pd.read_table(os.path.join(outdir, 'eqtls03', 'qvalues.tsv'), index_col=0) set(a[a.perm_sig].index) - set(b.index)

In [146]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls03/gene_results'
out_dy = os.path.join(outdir, 'eqtls03')
process_eqtl_results(eqtl_dy, out_dy)


Call:
qvalue(p = pvals, fdr.level = 0.05)

pi0:	0.3201686	

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1  <1
p-value       43     67   116    152   189  261 709
q-value        0     47    89    121   175  270 709
local FDR      0      0    63     76   104  142 709

Unrelateds


In [144]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output'
           '/run_eqtl_analysis/unrelated_eqtls01/gene_results')
out_dy = os.path.join(outdir, 'unrelated_eqtls01')
process_eqtl_results(eqtl_dy, out_dy)


Call:
qvalue(p = pvals, fdr.level = 0.05)

pi0:	0.5791135	

Cumulative number of significant calls:

          <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1    <1
p-value     1720   2258  3092   3780  4484 5574 17794
q-value        0   1720  2525   2928  3434 4266 17805
local FDR      0      0  2032   2195  2394 2674 17805

Post-processing


In [ ]:
if snpsnap is None:
    snpsnap = get_snpsnap()

In [125]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/eqtls01/gene_results/')
out_dy = os.path.join(outdir, 'eqtls01')
gvp = post_process_eqtl_results(eqtl_dy, out_dy)

In [ ]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/eqtls02/gene_results/')
out_dy = os.path.join(outdir, 'eqtls02')
gvp = post_process_eqtl_results(eqtl_dy, out_dy)

In [147]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/eqtls03/gene_results')
out_dy = os.path.join(outdir, 'eqtls03')
post_process_eqtl_results(eqtl_dy, out_dy)

In [145]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output'
           '/run_eqtl_analysis/unrelated_eqtls01/gene_results')
out_dy = os.path.join(outdir, 'unrelated_eqtls01')
post_process_eqtl_results(eqtl_dy, out_dy)

In [ ]:
2  +

No PEER factors


In [21]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/no_peer01/gene_results')
out_dy = os.path.join(outdir, 'no_peer01')
pseudo(eqtl_dy, out_dy)

In [25]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/no_peer_no_std_norm01/gene_results')
out_dy = os.path.join(outdir, 'no_peer_no_std_norm01')
pseudo(eqtl_dy, out_dy)

Unrelateds


In [4]:
def get_min_pvals(eqtl_dy, out_dy):
    """
    Find the minimum p-value for each gene.
    
    Parameters:
    -----------
    eqtl_dy : str
        Path to directory with eQTL results. This directory should contain one 
        subdirectory for each gene tested. Each subdirectory should be named with
        the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
        [gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
        contains the minimum EMMAX p-value for each permutation.
        
    out_dy : str
        Path to directory to write output file pvalues.tsv. This file has 
        gene IDs in the first column and permutation p-values in the second column.
        
    Returns:
    --------
    min_pvals : pandas.Series
        Series with gene IDs as index and permutation p-values as values.
        
    """
    cpy.makedir(out_dy)
    fn = os.path.join(out_dy, 'min_pvalues.tsv')
    genes = []
    min_pvals = []

    fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENSG*.tsv'))
    for fn in fns:
        gene_id = fn.split(os.path.sep)[-2]
        genes.append(gene_id)
        res = ciepy.read_emmax_output(fn)
        min_pvals.append(res.PVALUE.min())
    min_pvals = pd.Series(min_pvals, index=genes)
    min_pvals.to_csv(os.path.join(out_dy, 'min_pvalues.tsv'), sep='\t')

In [ ]:
i = 40
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [95]:
i = 50
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [101]:
i = 60
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [102]:
i = 70
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [103]:
i = 80
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [104]:
i = 90
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [ ]:
i = 100
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
           'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))

fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
    get_min_pvals(eqtl_dy, out_dy)
    pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'), 
                          index_col=0, header=None, names=['min_pval'])
    corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
    pvals['sig'] = corrected[0]
    pvals['min_pval_bf'] = corrected[1]
    pvals.to_csv(fn, sep='\t')

In [ ]:
2 +

Combined Results

I want to make a file with all EMMAX results combined. I am going to sort this file by position and $p$-value. This will allow me to collect some stats like the smallest $p$-value observed for each SNV, how many times a SNV was tested, etc.

I'll use the IPython cluster to sort the individual output files then merge them using sort -m. This is much faster than concatenating and sorting though the merging still takes a few hours.


In [22]:
def make_combined_results(eqtl_dy, out_dy):
    out = os.path.join(out_dy, 'all_results_sorted.tsv.gz')
    dy = os.path.join(outdir, 'tmp')
    cpy.makedir(dy)
    fns = glob.glob(os.path.join(eqtl_dy, 'ENS*', 'ENS*.tsv'))
    commands = ['sort -k1,1 -k2,2n -k3,3n -k11,11n {} | grep -v ^# | '
                'awk \'{{print $0"\t{}"}}\' > {}'.format(x, x.split('/')[-2], os.path.join(dy, os.path.split(x)[1]))
                for x in fns]
    from ipyparallel import Client
    parallel_client = Client(profile='parallel')
    dview = parallel_client[:]
    print('Cluster has {} engines.'.format(len(parallel_client.ids)))
    with dview.sync_imports():
        import subprocess
    dview.scatter('commands', commands)
    %px y = [subprocess.check_call(i, shell=True) for i in commands]
    c = 'sort -m -k1,1 -k2,2n -k3,3n -k11,11n {} | bgzip -c > {}'.format(os.path.join(dy, '*'), out)
    subprocess.check_call(c, shell=True)
    c = 'tabix -p bed {}'.format(out)
    subprocess.check_call(c, shell=True)
    c = 'rm -r {}'.format(dy)
    subprocess.check_call(c, shell=True)
    
    out2 = os.path.join(out_dy, 'top_results_sorted.tsv.gz')
    c = ("zcat {} | awk 'a!~$1 || b!~$2 || c!~$3 ; {{a=$1}} {{b=$2}} {{c=$3}}' | "
         "bgzip -c > {}".format(out, out2))
    subprocess.check_call(c, shell=True)
    c = 'tabix -p bed {}'.format(out2)
    subprocess.check_call(c, shell=True)

In [ ]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls01/gene_results'
out_dy = os.path.join(outdir, 'eqtls01')
make_combined_results(eqtl_dy, out_dy)

In [ ]:
for i in [40, 50, 60, 70, 80, 90, 100]:
    eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
               'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
    out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
    make_combined_results(eqtl_dy, out_dy)