Sergey and Lauren developed a set of equtions found here:
Nuzhdin, S. V, Friesen, M. L., & McIntyre, L. M. (2012). Genotype-phenotype mapping in a post-GWAS world. Trends in Genetics : TIG, 28(9), 421–6. doi:10.1016/j.tig.2012.06.003
which potentially allow for the identificqtion of cis- and trans-effects. Here I try using these qeustions and test if they give reasonable results.
Basics: For a given gene the expression level of $E_{ii}$ of allele i in F1 genotype i.
$E_{ii} = \mu + C_i + (T_i + T_t)/2$
$E_{ti} = \mu + C_t + (T_i + T_t)/2$
For each allele the cis- and trans-effects are deviations from the population means, we expect that they will sum to zero:
$\sum^n_{i=1}C_i = 0$
$\sum^n_{i=1}T_i = 0$
Then the expected difference in expression between the Line and Tester allele over the entire population is:
$\sum^n_{i=1} \frac{E_{ti} - E_{ii}}{n}$
Which can be re-written as
$\sum^n_{i=1} \frac{C_{t} - C_{i}}{n} = C_t$
The cis-effect of allele i can be estimated by:
$\hat C_i = \hat E_{ii} - \hat E_{ti} + \hat C_t$
In [93]:
# Set-up default environment
%run '../ipython_startup.py'
# Import additional libraries
import sas7bdat as sas
import cPickle as pickle
from ase_cisEq import marenEq
from ase_cisEq import marenPrintTable
from ase_normalization import meanCenter
from ase_normalization import q3Norm
from ase_normalization import meanStd
In [94]:
def denPlt(df, values='cis_line', columns='line', index='fusion_id', layout=(7, 7)):
axes = df.pivot_table(values=values, columns=columns, index=index).plot(kind='kde', subplots=True, layout=layout, figsize=(10,10), rot=90, xlim=(-800, 800))
for ax in axes.ravel():
ax.get_yaxis().set_visible(False)
h, lbl = ax.get_legend_handles_labels()
ax.set_title(lbl[0])
ax.get_legend().remove()
plt.tight_layout()
return plt.gcf()
In [95]:
def scatPlt(df, line='sum_line', tester='sum_tester', title=None):
# Plot the cis-line effects x proportion by fusion
df['prop'] = df[tester] / (df[line] + df[tester])
# Create 5x5 panel plot
fig, axes = plt.subplots(5, 5, figsize=(20, 20))
fig.suptitle(title)
axes = axes.ravel()
# Group by fusion_id
grp = df.groupby('fusion_id')
for i, (n, gdf) in enumerate(grp):
gdf.plot(kind='scatter', x='cis_line', y='prop', ax=axes[i], title=n)
if i == 24:
break
return fig
This data set was created by: ase_summarize_ase_filters.sas
The data has had the following droped:
In [96]:
# Import clean dataset
with sas.SAS7BDAT(os.path.join(PROJ, 'sas_data/clean_ase_stack.sas7bdat')) as FH:
df = FH.to_data_frame()
dfClean = df[['line', 'mating_status', 'fusion_id', 'flag_AI_combined', 'q5_mean_theta', 'sum_both', 'sum_line', 'sum_tester', 'sum_total', 'mean_apn']]
For the maren equations, I am also going to drop exonic regions with less than 10 genotypes. The maren equations make some assumptions about the population level sums. Obvisouly the more genotypes that are present for each fusions the better, but I am comfortable with as few as 10 genotypes.
In [97]:
# Drop groups with less than 10 lines per fusion
grp = dfClean.groupby(['mating_status', 'fusion_id'])
dfGt10 = grp.filter(lambda x: x['line'].count() >= 10).copy()
print 'Rows ' + str(dfGt10.shape[0])
print 'Columns ' + str(dfGt10.shape[1])
Raw counts seem to have some issues. The magnitude of line cis and tester trans effects are very different, while tester cis and trans are the same number just different signs. Also of concern is that the estimated cis and trans effects for the line are not centered at 0, which is a major assumption of the equations.
In [98]:
# Calculate Maren TIG equations by mating status and exonic region
marenRawCounts = marenEq(dfGt10, Eii='sum_line', Eti='sum_tester', group=['mating_status', 'fusion_id'])
marenRawCounts['mag_cis'] = abs(marenRawCounts['cis_line'])
marenPrintTable(marenRawCounts)
In [99]:
# Plot the cis-line effects x proportion by fusion
fig = scatPlt(marenRawCounts, line='sum_line', tester='sum_tester', title='Raw Counts')
In [100]:
# Plot the distribution of cis-line effects by genotype.
fig = denPlt(marenRawCounts)
In [101]:
# Plot the distribution of cis-line effects by fusion
# Group by fusion_id
fusions = marenRawCounts.groupby('fusion_id').groups.keys()
# Get a randome set of fusions
sample = np.random.choice(fusions, size=25, replace=False)
# Plot density of 25 random fusions
fig = denPlt(marenRawCounts[marenRawCounts['fusion_id'].isin(sample)], columns='fusion_id', index='line', layout=(5, 5))
#fig.savefig(os.path.join(PROJ, 'pipeline_output/cis_effects/exonic_region_distributions_of_cis_line.png'))
In [102]:
# Center allele specific counts by taking the proportion and multiplying by 1000
dfGt10['tot_allele_specific'] = dfGt10[['sum_line', 'sum_tester']].sum(axis=1)
dfGt10['line_adj'] = dfGt10['sum_line'] / dfGt10['tot_allele_specific'] * 1000
dfGt10['tester_adj'] = dfGt10['sum_tester'] / dfGt10['tot_allele_specific'] * 1000
In [103]:
# Calculate Maren TIG equations by mating status and exonic region
marenRawCountsAdj = marenEq(dfGt10, Eii='line_adj', Eti='tester_adj', group=['mating_status', 'fusion_id'])
marenRawCountsAdj['mag_cis'] = abs(marenRawCountsAdj['cis_line'])
marenPrintTable(marenRawCountsAdj, line='line_adj', tester='tester_adj')
In [104]:
# Plot the cis-line effects x proportion by fusion
fig = scatPlt(marenRawCountsAdj, line='line_adj', tester='tester_adj', title='Coverage Centered')
In [105]:
# Plot the distribution of cis-line effects by genotype
fig = denPlt(marenRawCountsAdj)
fig.savefig(os.path.join(PROJ, 'pipeline_output/cis_effects/genotype_distributions_of_cis_line.png'))
In [106]:
# Plot the distribution of cis-tester effects by genotype
fig = denPlt(marenRawCountsAdj, values='cis_tester')
fig.savefig(os.path.join(PROJ, 'pipeline_output/cis_effects/genotype_distributions_of_cis_tester.png'))
In [107]:
# Plot the distribution of cis-line effects by fusion
fusions = marenRawCountsAdj.groupby('fusion_id').groups.keys()
sample = np.random.choice(fusions, size=25, replace=False)
fig = denPlt(marenRawCountsAdj[marenRawCountsAdj['fusion_id'].isin(sample)], columns='fusion_id', index='line', layout=(5, 5))
fig.savefig(os.path.join(PROJ, 'pipeline_output/cis_effects/exonic_region_distributions_of_cis_line.png'))
I am concerned about the calculation of $\mu$. Mean centering the raw counts will allow $\mu = 0$ and I can effectively ignore it. For each fusion_id I take the mean of all the raw counts (line and tester), then subtract this mean value from each count.
This had no affect on the results.
In [108]:
# Mean center
dfMeanCenter = meanCenter(dfGt10, columns=['sum_line', 'sum_tester'], group=['mating_status', 'fusion_id'])
In [109]:
# Calculate Maren TIG equations by mating status and exonic region for mean centered data.
marenMeanCenter = marenEq(dfMeanCenter, Eii='mean_center_sum_line', Eti='mean_center_sum_tester', group=['mating_status', 'fusion_id'])
marenMeanCenter['mag_cis'] = abs(marenMeanCenter['cis_line'])
marenPrintTable(marenMeanCenter, line='mean_center_sum_line', tester='mean_center_sum_tester')
In [110]:
# Plot the cis-line effects x proportion by fusion
fig = scatPlt(marenMeanCenter, line='mean_center_sum_line', tester='mean_center_sum_tester', title='Mean Centered')
In [111]:
fig = denPlt(marenMeanCenter)
In [112]:
# Mean standardization
dfMeanStd = dfGt10.copy()
meanStd(dfMeanStd, column='sum_line', group=['mating_status', 'fusion_id'])
meanStd(dfMeanStd, column='sum_tester', group=['mating_status', 'fusion_id'])
In [113]:
# Calculate Maren TIG equations by mating status and exonic region
marenMeanStd = marenEq(dfMeanStd, Eii='mean_std_sum_line', Eti='mean_std_sum_tester', group=['mating_status', 'fusion_id'])
marenMeanStd['mag_cis'] = abs(marenMeanStd['cis_line'])
marenPrintTable(marenMeanStd, line='mean_std_sum_line', tester='mean_std_sum_tester')
In [114]:
# Plot the cis-line effects x proportion by fusion
fig = scatPlt(marenMeanStd, line='mean_std_sum_line', tester='mean_std_sum_tester', title='Mean Standardized')
In [115]:
fig = denPlt(marenMeanStd)
For most of the CEGS projects, we have used a q3 normalization. Here I am taking the count value / the upper quartile for the line * median of the overall upper quartile.
$\frac{\text{sum_line}_{mgf}}{q3_mg} * \widetilde {q3_m}$
Where m is the mating status ('M' or 'V'), g is genotypes 1...G and f is exonic region 1...F.
In [116]:
dfQ3 = dfGt10.set_index(['mating_status', 'line', 'fusion_id'])
dfQ3['q3_norm_sum_line'] = q3Norm(dfQ3['sum_line'])
dfQ3['q3_norm_sum_tester'] = q3Norm(dfQ3['sum_tester'])
dfQ3.reset_index(inplace=True)
In [117]:
# Calculate Maren TIG equations by mating status and exonic region
marenQ3Norm = marenEq(dfQ3, Eii='q3_norm_sum_line', Eti='q3_norm_sum_tester', group=['mating_status', 'fusion_id'])
marenQ3Norm['mag_cis'] = abs(marenQ3Norm['cis_line'])
marenPrintTable(marenQ3Norm, line='q3_norm_sum_line', tester='q3_norm_sum_tester')
In [118]:
# Plot the cis-line effects x proportion by fusion
fig = scatPlt(marenQ3Norm, line='q3_norm_sum_line', tester='q3_norm_sum_tester', title='Q3 Normalized')
In [ ]:
fig = denPlt(marenQ3Norm)
In [208]:
fus = 'F10005_SI'
shapes = {'M': '+', 'V': '^'}
cnt = 0
colors = {}
for x in df['line']:
colors[x]= cnt
cnt+=1
In [216]:
df = marenRawCounts[marenRawCounts['fusion_id'] == fus]
x = df['cis_line'].values
xlab = 'cis_line'
y = (1 - df['prop']).values
ylab = 'prop'
s = df['mating_status'].map(shapes)
c = df['line'].map(colors)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(x=x, y=y, s=40, c=c, cmap='Set1')
ax.plot([0, 1], [0, 1], transform=ax.transAxes)
ax.set_title(fus)
ax.set_xlabel(xlab)
ax.set_ylabel(ylab)
Out[216]:
In [206]:
df['trans_tester1'] = df['sum_line'].sum() / df.shape[0]
df['trans_tester2'] = df['sum_tester'].sum() / df.shape[0] - df['cis_tester']
df
Out[206]:
In [207]:
df['trans_line1'] = df['sum_line'] - df['cis_line'] - df['trans_tester1']
df['trans_line2'] = df['sum_tester'] - df['cis_tester'] - df['trans_tester1']
df['trans_line3'] = 2 * (df['sum_tester'] - df['cis_tester'] - df['trans_tester1'] / 2)
df
Out[207]:
In [ ]: