Sergey and Lauren developed a set of equtions found here:
Nuzhdin, S. V, Friesen, M. L., & McIntyre, L. M. (2012). Genotype-phenotype mapping in a post-GWAS world. Trends in Genetics : TIG, 28(9), 421–6. doi:10.1016/j.tig.2012.06.003
which potentially allow for the identificqtion of cis- and trans-effects. Here I try using these qeustions and test if they give reasonable results.
Basics: For a given gene the expression level of $E_{ii}$ of allele i in F1 genotype i.
$E_{ii} = \mu + C_i + (T_i + T_t)/2$
$E_{ti} = \mu + C_t + (T_i + T_t)/2$
For each allele the cis- and trans-effects are deviations from the population means, we expect that they will sum to zero:
$\sum^n_{i=1}C_i = 0$
$\sum^n_{i=1}T_i = 0$
Then the expected difference in expression between the Line and Tester allele over the entire population is:
$\sum^n_{i=1} \frac{E_{ti} - E_{ii}}{n}$
Which can be re-written as
$\sum^n_{i=1} \frac{C_{t} - C_{i}}{n} = C_t$
And
$T_t = 2(\frac{\sum^n_{i=1}E_{ti}}{n} - \mu - C_t)$
The cis-effect of allele i can be estimated by:
$\hat C_i = \hat E_{ii} - \hat E_{ti} + \hat C_t$
and trans-effects of allele i can be estimated by:
$\hat T_i = 2(\hat E_{ti} - \hat \mu - \hat C_t - \frac{\hat T_{ti}}{2})$
In [1]:
# Set-up default environment
%run '../ipython_startup.py'
# Import additional libraries
import sas7bdat as sas
import cPickle as pickle
from ase_cisEq import marenEq
from ase_cisEq import marenPrintTable
from ase_normalization import meanStd
from ase_plotting import dfPanelScatter
This data set was created by: ase_summarize_ase_filters.sas
The data has had the following droped:
In [2]:
# Import clean dataset
with sas.SAS7BDAT(os.path.join(PROJ, 'sas_data/clean_ase_stack.sas7bdat')) as FH:
df = FH.to_data_frame()
dfClean = df[['line', 'mating_status', 'fusion_id', 'flag_AI_combined', 'q5_mean_theta', 'sum_both', 'sum_line', 'sum_tester', 'sum_total', 'mean_apn']]
For the maren equations, I am also going to drop exonic regions with less than 10 genotypes. The maren equations make some assumptions about the population level sums. Obvisouly the more genotypes that are present for each fusions the better, but I am comfortable with as few as 10 genotypes.
In [3]:
# Drop groups with less than 10 lines per fusion
grp = dfClean.groupby(['mating_status', 'fusion_id'])
dfGt10 = grp.filter(lambda x: x['line'].count() >= 10).copy()
print 'Rows ' + str(dfGt10.shape[0])
print 'Columns ' + str(dfGt10.shape[1])
In [4]:
# Function to make a panel plot of kde for the maren equation output
def marenKDE(df, value1='cis_line', value2='trans_line', label1='cis_line', label2='trans_line'):
# pivot for easy plotting
line = pd.pivot_table(df, columns='fusion_id', index=('line'), values=value1)
tester = pd.pivot_table(df, columns='fusion_id', index=('line'), values=value2)
# Plot only the first 30 fusions
axes = line.iloc[:, :30].plot(kind='kde', subplots=True, layout=(6, 5), figsize=(20, 15), sharex=False, rot=90, color='b')
tester.iloc[:, :30].plot(kind='kde', subplots=True, ax=axes, sharex=False, color='g', rot=90)
# Add a vline to the plots and remove legend
for ax in axes.ravel():
ax.axvline(0, color='r', lw=2)
handles, labels = ax.get_legend_handles_labels()
ax.set_title(labels[0])
ax.legend().remove()
ax.get_yaxis().set_visible(False)
fig = plt.gcf()
plt.legend(handles, [label1, label2], bbox_transform=fig.transFigure, bbox_to_anchor=(0.65, 1.03), ncol=2, fontsize=18)
plt.tight_layout()
return fig
Raw counts seem to have some issues. The magnitude of line cis and tester trans effects are very different, while tester cis and trans are the same number just different signs. Also of concern is that the estimated cis and trans effects for the line are not centered at 0, which is a major assumption of the equations.
In [5]:
# Calculate Maren TIG equations by mating status and exonic region
marenRawCounts = marenEq(dfGt10, Eii='sum_line', Eti='sum_tester', group=['mating_status', 'fusion_id'])
marenRawCounts['mag_cis'] = abs(marenRawCounts['cis_line'])
marenPrintTable(marenRawCounts)
In [6]:
fig = marenKDE(marenRawCounts)
I am concerned about the calculation of $\mu$. Mean centering the raw counts will allow $\mu = 0$ and I can effectively ignore it. For each fusion_id I take the mean of all the raw counts (line and tester), then subtract this mean value from each count.
This had no affect on the results.
In [7]:
# Mean Center raw counts
def meanCenter(x):
"""Mean centering.
Mean center allele specific counts by using both line and tester counts.
"""
# Get the mean for combined counts
cntMean = np.mean(x[['sum_line', 'sum_tester']].values)
# Center values using mean
x['sum_line_center'] = x['sum_line'] - cntMean
x['sum_tester_center'] = x['sum_tester'] - cntMean
return x
# Group by mating status and fusions id
grp = dfGt10.groupby(['mating_status', 'fusion_id'])
# For each ms*fusion_id do the mean centering
meanCentered = grp.apply(meanCenter)
meanCentered.reset_index(inplace=True)
In [8]:
# Calculate Maren TIG equations by mating status and exonic region for mean centered data.
marenMeanCenter = marenEq(meanCentered, Eii='sum_line_center', Eti='sum_tester_center', group=['mating_status', 'fusion_id'])
marenMeanCenter['mag_cis'] = abs(marenMeanCenter['cis_line'])
marenPrintTable(marenMeanCenter, line='sum_line_center', tester='sum_tester_center')
In [9]:
fig = marenKDE(marenMeanCenter)
For most of the CEGS projects, we have used a q3 normalization. Here I am taking the count value / the upper quartile for the line * median of the overall upper quartile.
$\frac{\text{sum_line}_{gf}}{q3_g} * \widetilde {q3}$
Where g is genotypes 1...G and f is exonic region 1...F.
In [ ]:
# Functiont to calc the upper quartile normalization
def calcQ3(df, column):
# Calculate upper quantile for each mating_status*fusion
q3 = df.groupby(['mating_status', 'fusion_id'])[column].quantile(q=0.75)
q3.name = 'q3'
q3 = q3.reset_index()
# Calculate median q3 by mating_status
medQ3 = q3.groupby('mating_status').median()
medQ3.columns = ['medQ3']
# Merge q3 and medQ3 together
dfQ3 = q3.merge(medQ3, left_on='mating_status', right_index=True)
# Combine with original data
merged = df.merge(dfQ3, on=['mating_status', 'fusion_id'])
# Calcuate q3 norm and add to original dataset
# value / q3 * med(q3)
df['q3_norm_' + column] = merged[column] / merged['q3'] * merged['medQ3']
In [ ]:
# run q3 norm
dfQ3 = dfGt10.copy()
calcQ3(dfQ3, 'sum_line')
calcQ3(dfQ3, 'sum_tester')
In [ ]:
# Calculate Maren TIG equations by mating status and exonic region
marenQ3Norm = marenEq(dfQ3, Eii='q3_norm_sum_line', Eti='q3_norm_sum_tester', group=['mating_status', 'fusion_id'])
marenQ3Norm['mag_cis'] = abs(marenQ3Norm['cis_line'])
marenPrintTable(marenQ3Norm, line='q3_norm_sum_line', tester='q3_norm_sum_tester')
In [ ]:
fig = marenKDE(marenQ3Norm)
In [ ]:
# Mean standardization
meanStd(df=q3Line, colName='q3_norm_sum_line', group='fusion_id')
meanStd(df=q3Tester, colName='q3_norm_sum_tester', group='fusion_id')
# Merge everything together
q3Merge = q3Line.merge(q3Tester, how='inner', on=['line', 'mating_status', 'fusion_id'])
dfQ3Std = dfClean.merge(q3Merge, how='left', on=['line', 'mating_status', 'fusion_id'])
In [ ]:
# Calculate Maren TIG equations by mating status and exonic region
marenQ3NormStd = marenEq(dfQ3Std, Eii='mean_std_q3_norm_sum_line', Eti='mean_std_q3_norm_sum_tester', group=['mating_status', 'fusion_id'])
marenQ3NormStd['mag_cis'] = abs(marenQ3NormStd['cis_line'])
marenQ3NormStd.head()
In [ ]:
grp = marenQ3NormStd.groupby(['mating_status', 'fusion_id'])
tp = grp.get_group(('M', 'F10005_SI'))
In [ ]:
tp[['cis_line', 'trans_line']].plot(kind='kde')
In [ ]:
# Figure out the 25 highest expressed fusion
## group fusion by id and env
fusGrp = dfGt10.groupby(['mating_status', 'fusion_id'])
## Calculate the mean apn for each fusion across genotypes
mApn = fusGrp['mean_apn'].mean()
mApnI = mApn.reset_index()
mApnI.set_index('fusion_id', inplace=True)
# Get the 25 highest expressed fusions for mated and virgin
fusGrp2 = mApnI.groupby('mating_status')
m = fusGrp2.get_group('M').rank(ascending=False)
mFusHi = m[m['mean_apn'] <= 25].index
v = fusGrp2.get_group('V').rank(ascending=False)
vFusHi = v[v['mean_apn'] <= 25].index
# Get the 25 lowest expressed fusions for mated and virgin
m = fusGrp2.get_group('M').rank(ascending=True)
mFusLow = m[m['mean_apn'] <= 25].index
v = fusGrp2.get_group('V').rank(ascending=True)
vFusLow = v[v['mean_apn'] <= 25].index
In [ ]:
rawMHi = marenRawCounts[marenRawCounts['fusion_id'].isin(mFusHi) & (marenRawCounts['mating_status'] == 'M')]
rawVHi = marenRawCounts[marenRawCounts['fusion_id'].isin(vFusHi) & (marenRawCounts['mating_status'] == 'V')]
dfPanelScatter(df=rawMHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nHigh Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=rawVHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nHigh Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=rawMHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Mated\nHigh Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=rawVHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Virgin\nHigh Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
In [ ]:
rawMLow = marenRawCounts[marenRawCounts['fusion_id'].isin(mFusLow) & (marenRawCounts['mating_status'] == 'M')]
rawVLow = marenRawCounts[marenRawCounts['fusion_id'].isin(vFusLow) & (marenRawCounts['mating_status'] == 'V')]
dfPanelScatter(df=rawMLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nLow Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=rawVLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nLow Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=rawMLow, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Mated\nLow Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=rawVLow, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Virgin\nLow Expression Exonic regions\nRaw Counts', colorCol='flag_AI_combined')
In [ ]:
q3MHi = marenQ3Norm[marenQ3Norm['fusion_id'].isin(mFusHi) & (marenQ3Norm['mating_status'] == 'M')]
q3VHi = marenQ3Norm[marenQ3Norm['fusion_id'].isin(vFusHi) & (marenQ3Norm['mating_status'] == 'V')]
dfPanelScatter(df=q3MHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nHigh Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3VHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nHigh Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=q3MHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Mated\nHigh Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3VHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Virgin\nHigh Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
In [ ]:
q3MLow = marenQ3Norm[marenQ3Norm['fusion_id'].isin(mFusLow) & (marenQ3Norm['mating_status'] == 'M')]
q3VLow = marenQ3Norm[marenQ3Norm['fusion_id'].isin(vFusLow) & (marenQ3Norm['mating_status'] == 'V')]
dfPanelScatter(df=q3MLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nLow Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3VLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nLow Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=q3MLow, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Mated\nLow Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3VLow, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Virgin\nLow Expression Exonic regions\nQ3 Normalized Counts', colorCol='flag_AI_combined')
In [ ]:
q3StdMHi = marenQ3NormStd[marenQ3NormStd['fusion_id'].isin(mFusHi) & (marenQ3NormStd['mating_status'] == 'M')]
q3StdVHi = marenQ3NormStd[marenQ3NormStd['fusion_id'].isin(vFusHi) & (marenQ3NormStd['mating_status'] == 'V')]
dfPanelScatter(df=q3StdMHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nHigh Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3StdVHi, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nHigh Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=q3StdMHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Mated\nHigh Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3StdVHi, x='mean_apn', y='mag_cis', group='fusion_id', plot_title='Virgin\nHigh Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
In [ ]:
q3StdMLow = marenQ3NormStd[marenQ3NormStd['fusion_id'].isin(mFusLow) & (marenQ3NormStd['mating_status'] == 'M')]
q3StdVLow = marenQ3NormStd[marenQ3NormStd['fusion_id'].isin(vFusLow) & (marenQ3NormStd['mating_status'] == 'V')]
dfPanelScatter(df=q3StdMLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Mated\nLow Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3StdVLow, x='q5_mean_theta', y='cis_line', group='fusion_id', vline=0.5, plot_title='Virgin\nLow Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
In [ ]:
dfPanelScatter(df=q3StdMLow, x='mean_apn', y='cis_line', group='fusion_id', plot_title='Mated\nLow Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
dfPanelScatter(df=q3StdVLow, x='mean_apn', y='cis_line', group='fusion_id', plot_title='Virgin\nLow Expression Exonic regions\nQ3 Normalized and Mean Standardized Counts', colorCol='flag_AI_combined')
In [ ]: