This notebook contains code to read the results from "Run eQTL Analysis"
and parse them into useful formats. There is a cell or cells in this notebook
for each eQTL Analysis often followed by something like 2 +. This prevents
the notebook from being executed all at once. Basically, after running the first
few cells to set things up, you probably just want to execute the other cells
individually as you need to.
Some cells can be executed while the eQTL is running.
This allows you to see how many significant hits you are getting etc. as things
are running. This code (specifically process_eqtl_results
is designed to run efficiently by skipping genes that were already
parsed and using the already calculated results. However, if some genes' results have
changed, you should delete the processing directory (os.path.join(outdir, 'eqtls01')
for instance) and let process_eqtl_results re-process all of the data. This is also
sometimes necessary of process_eqtl_results hits an error. Often deleting the processing
output directory and re-running process_eqtl_results will fix the problem.
TODO: Add some descriptions below of each eQTL analysis and when the cells should be run (during or after eQTL analysis).
In [1]:
import glob
import os
import random
random.seed(20151226)
import subprocess
import cdpybio as cpb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
import pybedtools as pbt
import seaborn as sns
import statsmodels.stats.multitest as smm
import vcf as pyvcf
import cardipspy as cpy
import ciepy
%matplotlib inline
%load_ext rpy2.ipython
dy_name = 'eqtl_processing'
import socket
if socket.gethostname() == 'fl-hn1' or socket.gethostname() == 'fl-hn2':
dy = os.path.join(ciepy.root, 'sandbox', 'tmp', dy_name)
cpy.makedir(dy)
pbt.set_tempdir(dy)
outdir = os.path.join(ciepy.root, 'output', dy_name)
cpy.makedir(outdir)
private_outdir = os.path.join(ciepy.root, 'private_output', dy_name)
cpy.makedir(private_outdir)
In [2]:
transcript_to_gene = pd.read_table(cpy.gencode_transcript_gene, header=None,
squeeze=True, index_col=0)
gene_info = pd.read_table(cpy.gencode_gene_info, index_col=0)
exp = pd.read_table(os.path.join(
ciepy.root, 'output', 'eqtl_input',
'tpm_log_filtered_phe_std_norm_peer_resid.tsv'),
index_col=0)
snpsnap = None
Given a set of eQTL results, I want to create several files:
lead_variants.tsv - This file will include the lead variants for each gene including
variants that equally significant.lead_variants_single.tsv - This file will include a single lead variant for each gene. I'll
choose the most significant randomly if there are ties.lead_variants_single_snv.tsv - This file will include the most significant SNV per gene. If there
are ties I'll choose randomly.pvalues.tsv - This file has the permutation p-value for each gene.qvalues.tsv - This file has the permutation p-value, the q-value, and whether the gene is significant.sig_snv_independent.tsv - This file will be created by LD pruning the variants in most_sig_single_snv.tsv.all_snv_results_sorted.tsv.gz - This file has the results for all SNVs in all genes.
It will be sorted by position.
In [3]:
def calculate_permutation_pvalues(eqtl_dy, out_dy):
"""
Calculate empirical p-values based on the p-values from permutations. This function
is pretty slow but it will read the existing output file (if it exists) and not recalculate
p-values for genes that are already done so if you run this repeatedly as the eQTL analysis
is running, it won't take too long at the end.
Parameters:
-----------
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
out_dy : str
Path to directory to write output file pvalues.tsv. This file has
gene IDs in the first column and permutation p-values in the second column.
Returns:
--------
pvals : pandas.Series
Series with gene IDs as index and permutation p-values as values.
new_lead_vars : pandas.DataFrame
Dataframe with lead variants for each gene that we've just calculated
permutation p-values for. Note that this is not all genes, just those
"""
cpy.makedir(out_dy)
min_fns = glob.glob(os.path.join(eqtl_dy, '*', 'minimum_pvalues.tsv'))
min_fns = pd.Series(min_fns, index=[fn.split(os.path.sep)[-2] for fn in min_fns])
fn = os.path.join(out_dy, 'pvalues.tsv')
if os.path.exists(fn):
pvals = pd.read_table(fn, index_col=0,
header=None, squeeze=True)
else:
pvals = pd.Series()
new_pvals = []
new_genes = []
new_lead_vars = []
min_fns = min_fns[set(min_fns.index) - set(pvals.index)]
if min_fns.shape[0] > 0:
for fn in min_fns.values:
gene_id = fn.split(os.path.sep)[-2]
new_genes.append(gene_id)
res_fn = os.path.join(os.path.split(fn)[0], '{}.tsv'.format(gene_id))
with open(res_fn) as f:
if len(f.readlines()) == 0:
print(res_fn)
res = ciepy.read_emmax_output(res_fn)
min_pvals = pd.read_table(fn, header=None, squeeze=True)
p = (1 + sum(min_pvals <= res.PVALUE.min())) / float(min_pvals.shape[0] + 1)
new_pvals.append(p)
t = res[res.PVALUE == res.PVALUE.min()]
t['gene_id'] = gene_id
new_lead_vars.append(t)
new_pvals = pd.Series(new_pvals, index=new_genes)
pvals = pd.concat([pvals, new_pvals])
new_lead_vars = pd.concat(new_lead_vars)
new_lead_vars = parse_emmax_results(new_lead_vars)
return pvals, new_lead_vars
else:
return pvals, None
def parse_emmax_results(df, add_af=True):
"""
Take a dataframe of results from EMMAX and parse into a more useful format.
Parameters:
-----------
df : pandas.DataFrame
Pandas dataframe of EMMAX results read with ciepy.read_emmax_output(). Extra
columns can be present but they may be overwritten. The dataframe must also
include a "gene_id" column with the gene ID.
Returns:
--------
out : pandas.DataFrame
Parsed version of input df including (1) more reasonable column names, (2) chr
chromosomes, (3) variant type column,
"""
# Lowercase column names.
df.columns = [x.lower() for x in df.columns]
# chr chromosome names.
df['chrom'] = 'chr' + df.chrom.astype(int).astype(str)
# Make cooordinates zero-based, half-open (like a bed file).
df['beg'] -= 1
df.columns = ['chrom', 'start'] + list(df.columns[2:])
df.start = df.start.astype(int)
# Add ref/alt alleles. CNVs will just have N for both ref and alt.
df['ref'] = df.marker_id.apply(lambda x: x.split('_')[1].split('/')[0])
df['alt'] = df.marker_id.apply(lambda x: x.split('_')[1].split('/')[1])
# Variant type
df['variant_type'] = 'snv'
df.ix[df.ref.apply(lambda x: len(x)) > df.alt.apply(lambda x: len(x)), 'variant_type'] = 'del'
df.ix[df.ref.apply(lambda x: len(x)) < df.alt.apply(lambda x: len(x)), 'variant_type'] = 'ins'
df.ix[df.marker_id.apply(lambda x: 'CNV' in x), 'variant_type'] = 'cnv'
df.ix[df.marker_id.apply(lambda x: 'DUP' in x), 'variant_type'] = 'cnv'
df.ix[df.marker_id.apply(lambda x: 'DEL' in x), 'variant_type'] = 'cnv'
# Annotate with variant caller
df['variant_caller'] = 'gatk'
df.ix[df.variant_type == 'cnv', 'variant_caller'] = 'genomestrip'
df.ix[df.marker_id.apply(lambda x: 'DUP' in x), 'variant_caller'] = 'lumpy'
df.ix[df.marker_id.apply(lambda x: 'DEL' in x), 'variant_caller'] = 'lumpy'
# CNVs don't have the correct end because I didn't encode that info in the VCF so I'll
# fix that here.
cnv_ends = df.ix[df.variant_type == 'cnv', 'marker_id'].apply(lambda x: int(x.split('_')[-1]))
df.ix[df.variant_type == 'cnv', 'end'] = cnv_ends.values
df.end = df.end.astype(int)
# Add location column.
df['location'] = (df.chrom + ':' + df.start.astype(str) + '-' + df.end.astype(str))
# Create unique index.
df.index = df.location + ':' + df.gene_id
# Add RSIDs. Not all variants have RSIDs.
t = df.marker_id.apply(lambda x: x.split('_')[-1][0:2])
t = t[t == 'rs']
rsids = df.ix[t.index, 'marker_id'].apply(lambda x: x.split('_')[-1])
df.ix[t.index, 'rsid'] = rsids
# Add lengths of variants. I'll calculate lengths separately for different types of variants.
# I won't give SNVs a length.
df['length'] = np.nan
t = df[df.variant_type == 'ins']
df.ix[t.index, 'length'] = t.alt.apply(lambda x: len(x)) - t.ref.apply(lambda x: len(x))
t = df[df.variant_type == 'del']
df.ix[t.index, 'length'] = t.ref.apply(lambda x: len(x)) - t.alt.apply(lambda x: len(x))
t = df[df.variant_type == 'cnv']
df.ix[t.index, 'length'] = t.end - t.start
# Distance to TSS.
df = tss_to_eqtl_gene(df)
# Add some info about the gene.
df = df.merge(gene_info[['gene_name', 'gene_type']], left_on='gene_id', right_index=True)
if add_af:
# Add 1KGP allele frequencies for SNVs.
afs = ['AF', 'EUR_AF', 'SAS_AF', 'AFR_AF', 'AMR_AF', 'EAS_AF']
for af in afs:
df[af] = np.nan
fn = ('/publicdata/1KGP_20151103/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.'
'20130502.sites.vcf.gz')
kgp_vcf_reader = pyvcf.Reader(open(fn))
for i in df[df.variant_type == 'snv'].index:
kgp = None
res = kgp_vcf_reader.fetch(df.ix[i, 'chrom'][3:],
df.ix[i, 'start'] + 1,
df.ix[i, 'start'] + 1)
while True:
try:
kgp = res.next()
if kgp.INFO['VT'] == ['SNP']:
break
except StopIteration:
break
if kgp:
for af in afs:
df.ix[i, af] = kgp.INFO[af][0]
else:
for af in afs:
df.ix[i, af] = 0
return df
def tss_to_eqtl_gene(df):
"""
Take a dataframe of lead variants and add the distance from the variants
to the nearest TSS for the eQTL gene.
Parameters:
-----------
df : pandas.DataFrame
Dataframe with lead variants.
Returns:
--------
out : pandas.DataFrame
Dataframe with lead variants with distance to TSS added.
"""
tss = pbt.BedTool(cpy.gencode_tss_bed)
tss_df = tss.to_dataframe()
tss_df['gene'] = tss_df.name.apply(lambda x: transcript_to_gene[x.split('_')[0]])
s = '\n'.join(tss_df.gene + '\t' + tss_df.start.astype(str) + '\t' +
tss_df.end.astype(str) + '\t' + tss_df.name + '\t' +
tss_df.score + '\t' + tss_df.strand) + '\n'
fake_tss = pbt.BedTool(s, from_string=True)
fake_tss = fake_tss.sort()
s = '\n'.join(df.gene_id + '\t' + df.start.astype(str) + '\t' +
df.end.astype(str) + '\t.\t' + df.chrom) + '\n'
fake_df_bt = pbt.BedTool(s, from_string=True)
fake_df_bt = fake_df_bt.sort()
res = fake_df_bt.closest(fake_tss, D='b', sorted=True)
res_df = res.to_dataframe()
res_df.index = (res_df.score + ':' + res_df.start.astype(str) + '-' +
res_df.end.astype(str) + ':' + res_df.chrom)
res_df['variant_gene'] = res_df.index
res_df = res_df.drop_duplicates(subset=['variant_gene'])
df['tss_dist'] = res_df.ix[df.index, 'blockStarts']
df['tss_dist_abs'] = df.tss_dist.abs()
return df
def qvalue(pvals, summary=True, plot=False):
"""Use the R qvalue package to adjust pvalues. pvals should be a pandas
Series with gene names as the index and pvalues as the values."""
import rpy2.robjects as ro
ro.r('suppressMessages(library(qvalue))')
ro.globalenv['pvals'] = pvals
ro.r('qobj = qvalue(p=pvals, fdr.level=0.05)')
ro.r('qvalues <- qobj$qvalues')
ro.r('pi0 <- qobj$pi0')
ro.r('lfdr <- qobj$lfdr')
ro.r('sig <- qobj$significant')
qvalues = ro.globalenv['qvalues']
pi0 = ro.globalenv['pi0']
lfdr = ro.globalenv['lfdr']
sig = ro.globalenv['sig']
qvalue_res = pd.DataFrame([list(pvals), list(qvalues), list(sig)],
index=['perm_pvalue', 'perm_qvalue', 'perm_sig'],
columns=pvals.index).T
qvalue_res['perm_sig'] = qvalue_res.perm_sig.astype(bool)
qvalue_res = qvalue_res.sort_values(['perm_qvalue'])
qvalues = pd.Series(list(qvalues), index=pvals.index)
qvalue_res.index.name = None
if summary:
ro.r('summary(qobj)')
if plot:
ro.r('plot(qobj)')
return qvalue_res
def make_single_lead_variant(df):
"""
Take a dataframe of variants and randomly choose one single lead variant
per gene if there are ties.
Parameters:
-----------
df : pandas.DataFrame
Dataframe with results from EMMAX.
Returns:
--------
out : pandas.DataFrame
Dataframe with a single lead variant per gene.
"""
# To choose randomly, I'll assign a random number to each row, then sort
# by gene and row.
random.seed(204050)
out = df.copy(deep=True)
out['random'] = [random.random() for x in out.index]
# Because I sort by pvalue in the next line, the input data frame
# can contain non-lead variants. The row at the top of each gene's
# results will have the minimum p-value.
out = out.sort_values(by=['gene_id', 'pvalue', 'random'])
out = out.drop_duplicates(subset=['gene_id'])
out = out.drop(['random'], axis=1)
return out
def process_eqtl_results(eqtl_dy, out_dy):
"""
This method parses and annotates results from EMMAX, calculates the permutation p-values,
and performs multiple-testing correction. Some of the steps are slow, but
the method will read output files already available in out_dy and not re-analyze genes
that are already present in the output file, so that means you can run this method repeatedly
as the eQTL analysis proceeds. If you want all results reanalyzed, delete the output files
in out_dy before running.
Parameters:
-----------
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
out_dy : str
Path to directory to write output files. The output files are (1) pvalues.tsv:
gene IDs in the first column and permutation p-values in the second column, (2)
qvalues.tsv: results from multiple hypothesis testing, (3) lead_variants.tsv:
the most significant variants per gene, (4) lead_variants_single.tsv: one most
significant variant per gene (ties broken randomly).
"""
# Calculate permutation p-values and find lead variants for new genes.
pvals, new_lead_vars = calculate_permutation_pvalues(eqtl_dy, out_dy)
pvals.to_csv(os.path.join(out_dy, 'pvalues.tsv'), sep='\t')
# Merge new lead variants with lead variants we already calculated if
# they exist.
fn = os.path.join(out_dy, 'lead_variants.tsv')
if os.path.exists(fn):
lead_vars = pd.read_table(fn, index_col=0)
else:
lead_vars = None
if new_lead_vars is not None and lead_vars is not None:
lead_vars = pd.concat([lead_vars, new_lead_vars])
elif new_lead_vars is not None:
lead_vars = new_lead_vars
# Correct for multiple testing.
qvalue_res = qvalue(pvals)
qvalue_res.to_csv(os.path.join(out_dy, 'qvalues.tsv'), sep='\t')
lead_vars = lead_vars.merge(qvalue_res, left_on='gene_id', right_index=True)
lead_vars.sort_values(by=['gene_id', 'pvalue'], inplace=True)
lead_vars.to_csv(os.path.join(out_dy, 'lead_variants.tsv'), sep='\t')
# Make dataframe with single lead variant per gene.
lead_vars_single = make_single_lead_variant(lead_vars)
lead_vars_single.to_csv(os.path.join(out_dy, 'lead_variants_single.tsv'), sep='\t')
def make_gene_variant_pairs(qvalue_res, eqtl_dy):
"""
Make a dataframe with all significant variants per gene.
Parameters:
-----------
qvalue_res : pandas.DataFrame
Dataframe with columns perm_pvalue, perm_qvalue, and perm_sig. This is the
output from the qvalue() method.
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
Returns:
--------
gene_variant_pairs : pandas.DataFrame
Dataframe with all significant gene-variant pairs.
"""
# Get permutation p-value cutoff that maps to q = 0.05. I'll do this by finding
# the largest permutation p-value that is still significant.
# It's necessary to round this p-value because the floating point numbers get
# a little messed up. I found that some significant p-values weren't included
# because the max permutation p-value was slightly smaller than it should be
# due to floating point errors.
qvalue_res = qvalue_res[qvalue_res.perm_sig]
gene_variant_pairs = []
for gene_id in qvalue_res.index:
gene_variant_pairs.append(get_all_significant_variants(gene_id, eqtl_dy))
gene_variant_pairs = pd.concat(gene_variant_pairs)
gene_variant_pairs = parse_emmax_results(gene_variant_pairs, add_af=False)
gene_variant_pairs.sort_values(by=['gene_id', 'pvalue'], inplace=True)
return gene_variant_pairs
def get_all_significant_variants(gene_id, eqtl_dy):
"""
Calculate permutation p-values for all variants tested for a given gene gene_id
and return a dataframe with variants whose permutation p-value is less than
perm_pval.
Parameters:
-----------
gene_id : str
Gene ID to calculate permutation p-values for.
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
Returns:
--------
res : pandas.DataFrame
Dataframe with all significant variants for this gene.
"""
fn = os.path.join(eqtl_dy, gene_id, 'minimum_pvalues.tsv')
min_pvals = pd.read_table(fn, header=None, names=['pvalue'])
res_fn = os.path.join(os.path.split(fn)[0], '{}.tsv'.format(gene_id))
res = ciepy.read_emmax_output(res_fn).dropna(subset=['PVALUE'])
min_pvals['null'] = True
df = pd.DataFrame({'pvalue':res.PVALUE, 'null':False}).drop_duplicates().dropna()
t = pd.concat([df, min_pvals])
t.sort_values(by=['pvalue'], inplace=True)
t['num_null'] = t.null.cumsum()
t['perm_pval'] = (t.num_null + 1) / t.null.sum()
t = t[t.null == False]
t = t.drop(['null', 'num_null'], axis=1)
res = res.merge(t, left_on='PVALUE', right_on='pvalue').drop('pvalue', axis=1)
res = res[res.perm_pval == res.perm_pval.min()]
res['gene_id'] = gene_id
return res
def post_process_eqtl_results(eqtl_dy, out_dy):
"""
This method creates several more output files. Unlike process_eqtl_results, it's better
to wait and run this method after the eQTLs are all done.
Parameters:
-----------
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
out_dy : str
Path to directory to write output files. The output files are TODO
"""
qvalue_res = pd.read_table(os.path.join(out_dy, 'qvalues.tsv'), index_col=0)
# Make file with all significant gene-variant pairs.
gvp = make_gene_variant_pairs(qvalue_res, eqtl_dy)
gvp.to_csv(os.path.join(out_dy, 'gene_variant_pairs.tsv'), sep='\t')
# Make file with lead SNV for each significant gene (if there is a significant
# SNV for the gene). Note this only has genes that were significant.
sig_lead_snvs = make_single_lead_variant(gvp[gvp.variant_type == 'snv'])
sig_lead_snvs.to_csv(os.path.join(out_dy, 'sig_lead_snvs_single.tsv'), sep='\t')
# LD prune lead SNVs.
indep = get_independent_snvs(sig_lead_snvs)
indep.to_csv(os.path.join(out_dy, 'independent_lead_snvs.tsv'), sep='\t')
s = '\n'.join(indep[['chrom', 'start', 'end']].apply(
lambda x: '\t'.join([str(y) for y in x.values]), axis=1)) + '\n'
bt = pbt.BedTool(s, from_string=True)
bt = bt.sort()
bt.saveas(os.path.join(out_dy, 'independent_lead_snvs.bed'))
# Make pseudoheritability file.
fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENS*.reml'))
h2 = []
for fn in fns:
with open(fn) as f:
h2.append(f.readlines()[-1].split()[1])
h2 = pd.Series(h2, index=[x.split('/')[-2] for x in fns])
h2 = h2.astype(float)
h2.to_csv(os.path.join(out_dy, 'h2.tsv'), sep='\t')
def get_snpsnap():
snpsnap_fns = glob.glob('/publicdata/SNPsnap_20151104/EUR_parse/*.tab')
dfs = []
for tab in snpsnap_fns:
df = pd.read_table(tab, index_col=0, low_memory=False)
tdf = df[['snp_maf', 'dist_nearest_gene_snpsnap_protein_coding',
'friends_ld08']]
tdf.index = 'chr' + tdf.index
dfs.append(tdf)
snps = pd.concat(dfs)
snps['maf_bin'] = pd.cut(snps.snp_maf, np.arange(0, 0.55, 0.05))
snps['ld_bin'] = pd.cut(np.log10(snps.friends_ld08.replace(np.nan, 0) + 1), 10)
snps['dist_bin'] = pd.cut(np.log10(snps.dist_nearest_gene_snpsnap_protein_coding
+ 1), 10)
snps = snps[['maf_bin', 'ld_bin', 'dist_bin']]
return snps
def get_independent_snvs(df):
ld_beds = glob.glob('/publicdata/1KGP_20151103/LD_20151110/tabix/*EUR*.bed.gz')
ld_beds = dict(zip([os.path.split(x)[1].split('_')[0] for x in ld_beds], ld_beds))
df = df.drop_duplicates(subset=['location'])
tdf = df[['chrom', 'start', 'end', 'pvalue']]
tdf.index = tdf.chrom + ':' + tdf.end.astype(str)
indep = cpb.analysis.ld_prune(tdf, ld_beds, snvs=list(snpsnap.index)).drop('pvalue', axis=1)
return indep
def pseudo(eqtl_dy, out_dy):
cpy.makedir(out_dy)
fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENS*.reml'))
h2 = []
for fn in fns:
with open(fn) as f:
h2.append(f.readlines()[-1].split()[1])
h2 = pd.Series(h2, index=[x.split('/')[-2] for x in fns])
h2 = h2.astype(float)
h2.to_csv(os.path.join(out_dy, 'h2.tsv'), sep='\t')
In [4]:
2 +
Note: when running process_eqtl_results, sometimes I get an error about "plan
shapes not aligned." In this case, just delete the output directory
(out_dy = os.path.join(outdir, 'eqtls01'))
and then run the command.
It seems like there is a bug right now for adding new genes to genes that have
already finished. When a given analysis is finished, you should just delete the
out_dy below and run process_eqtl_results for all the results at once.
In [119]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls01/gene_results'
out_dy = os.path.join(outdir, 'eqtls01')
process_eqtl_results(eqtl_dy, out_dy)
In [ ]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls02/gene_results'
out_dy = os.path.join(outdir, 'eqtls02')
process_eqtl_results(eqtl_dy, out_dy)
In [146]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls03/gene_results'
out_dy = os.path.join(outdir, 'eqtls03')
process_eqtl_results(eqtl_dy, out_dy)
In [144]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output'
'/run_eqtl_analysis/unrelated_eqtls01/gene_results')
out_dy = os.path.join(outdir, 'unrelated_eqtls01')
process_eqtl_results(eqtl_dy, out_dy)
In [ ]:
if snpsnap is None:
snpsnap = get_snpsnap()
In [125]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/eqtls01/gene_results/')
out_dy = os.path.join(outdir, 'eqtls01')
gvp = post_process_eqtl_results(eqtl_dy, out_dy)
In [ ]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/eqtls02/gene_results/')
out_dy = os.path.join(outdir, 'eqtls02')
gvp = post_process_eqtl_results(eqtl_dy, out_dy)
In [147]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/eqtls03/gene_results')
out_dy = os.path.join(outdir, 'eqtls03')
post_process_eqtl_results(eqtl_dy, out_dy)
In [145]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output'
'/run_eqtl_analysis/unrelated_eqtls01/gene_results')
out_dy = os.path.join(outdir, 'unrelated_eqtls01')
post_process_eqtl_results(eqtl_dy, out_dy)
In [ ]:
2 +
In [21]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/no_peer01/gene_results')
out_dy = os.path.join(outdir, 'no_peer01')
pseudo(eqtl_dy, out_dy)
In [25]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/no_peer_no_std_norm01/gene_results')
out_dy = os.path.join(outdir, 'no_peer_no_std_norm01')
pseudo(eqtl_dy, out_dy)
In [4]:
def get_min_pvals(eqtl_dy, out_dy):
"""
Find the minimum p-value for each gene.
Parameters:
-----------
eqtl_dy : str
Path to directory with eQTL results. This directory should contain one
subdirectory for each gene tested. Each subdirectory should be named with
the gene ID and contain the files [gene ID].tsv and minimum_pvalues.tsv.
[gene ID].tsv is the EMMAX output for the "real" data and minimum_pvalues.tsv
contains the minimum EMMAX p-value for each permutation.
out_dy : str
Path to directory to write output file pvalues.tsv. This file has
gene IDs in the first column and permutation p-values in the second column.
Returns:
--------
min_pvals : pandas.Series
Series with gene IDs as index and permutation p-values as values.
"""
cpy.makedir(out_dy)
fn = os.path.join(out_dy, 'min_pvalues.tsv')
genes = []
min_pvals = []
fns = glob.glob(os.path.join(eqtl_dy, '*', 'ENSG*.tsv'))
for fn in fns:
gene_id = fn.split(os.path.sep)[-2]
genes.append(gene_id)
res = ciepy.read_emmax_output(fn)
min_pvals.append(res.PVALUE.min())
min_pvals = pd.Series(min_pvals, index=genes)
min_pvals.to_csv(os.path.join(out_dy, 'min_pvalues.tsv'), sep='\t')
In [ ]:
i = 40
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [95]:
i = 50
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [101]:
i = 60
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [102]:
i = 70
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [103]:
i = 80
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [104]:
i = 90
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [ ]:
i = 100
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
fn = os.path.join(out_dy, 'min_pvalues_corrected.tsv')
if not os.path.exists(fn):
get_min_pvals(eqtl_dy, out_dy)
pvals = pd.read_table(os.path.join(out_dy, 'min_pvalues.tsv'),
index_col=0, header=None, names=['min_pval'])
corrected = smm.multipletests(pvals.min_pval, method='bonferroni')
pvals['sig'] = corrected[0]
pvals['min_pval_bf'] = corrected[1]
pvals.to_csv(fn, sep='\t')
In [ ]:
2 +
I want to make a file with all EMMAX results combined. I am going to sort this file by position and $p$-value. This will allow me to collect some stats like the smallest $p$-value observed for each SNV, how many times a SNV was tested, etc.
I'll use the IPython cluster to sort the individual output files then
merge them using sort -m. This is much faster than concatenating and
sorting though the merging still takes a few hours.
In [22]:
def make_combined_results(eqtl_dy, out_dy):
out = os.path.join(out_dy, 'all_results_sorted.tsv.gz')
dy = os.path.join(outdir, 'tmp')
cpy.makedir(dy)
fns = glob.glob(os.path.join(eqtl_dy, 'ENS*', 'ENS*.tsv'))
commands = ['sort -k1,1 -k2,2n -k3,3n -k11,11n {} | grep -v ^# | '
'awk \'{{print $0"\t{}"}}\' > {}'.format(x, x.split('/')[-2], os.path.join(dy, os.path.split(x)[1]))
for x in fns]
from ipyparallel import Client
parallel_client = Client(profile='parallel')
dview = parallel_client[:]
print('Cluster has {} engines.'.format(len(parallel_client.ids)))
with dview.sync_imports():
import subprocess
dview.scatter('commands', commands)
%px y = [subprocess.check_call(i, shell=True) for i in commands]
c = 'sort -m -k1,1 -k2,2n -k3,3n -k11,11n {} | bgzip -c > {}'.format(os.path.join(dy, '*'), out)
subprocess.check_call(c, shell=True)
c = 'tabix -p bed {}'.format(out)
subprocess.check_call(c, shell=True)
c = 'rm -r {}'.format(dy)
subprocess.check_call(c, shell=True)
out2 = os.path.join(out_dy, 'top_results_sorted.tsv.gz')
c = ("zcat {} | awk 'a!~$1 || b!~$2 || c!~$3 ; {{a=$1}} {{b=$2}} {{c=$3}}' | "
"bgzip -c > {}".format(out, out2))
subprocess.check_call(c, shell=True)
c = 'tabix -p bed {}'.format(out2)
subprocess.check_call(c, shell=True)
In [ ]:
eqtl_dy = '/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/run_eqtl_analysis/eqtls01/gene_results'
out_dy = os.path.join(outdir, 'eqtls01')
make_combined_results(eqtl_dy, out_dy)
In [ ]:
for i in [40, 50, 60, 70, 80, 90, 100]:
eqtl_dy = ('/projects/CARDIPS/analysis/cardips-ipsc-eqtl/private_output/'
'run_eqtl_analysis/unrelated_eqtls_{}/gene_results'.format(i))
out_dy = os.path.join(outdir, 'unrelated_eqtls_{}'.format(i))
make_combined_results(eqtl_dy, out_dy)