This is a (slightly modified) reproduction of Simon's quantification comparison analysis, but in an iPython Notebook



In [1]:

    
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Some constants and useful variables



In [2]:

    
MEAN_FL = 250
small = 1e-2 # for log things (use same value as Harold)

Some basic functions



In [3]:

    
# Truncate values in colname < val to 0
def filterExpression(df, colname="TPM", val=0.01):
    df[colname][df[colname] < val] = 0



In [4]:

    
# Join a list of DataFrames together, merging on the index and renaming conflicting columns
def joinFrames(frames):
    current = frames[0]
    for i, frame in enumerate(frames[1:], 2):
        current = current.join(frame, rsuffix='{}'.format(i))
    return current

Load all of the simulation files for Salmon & Kallisto



In [5]:

    
salmon_files = ["sim_test_salmon_{}/quant.sf".format(i) for i in xrange(1,11)]
salmonDataFrames = []
for fn in salmon_files:
    # Rename #Name => Name
    df = pd.read_csv(fn, sep='\t', skiprows=11)
    df = df.rename(columns={'# Name': 'Name'})
    df = df.set_index('Name')
    salmonDataFrames.append(df)



In [6]:

    
kallisto_files = ["sim_test_kallisto_bs/bs_abundance_{}.txt".format(i) for i in xrange(10)]
kallistoDataFrames = []
for fn in kallisto_files:
    # Rename #Name => Name
    df = pd.read_csv(fn, sep='\t')
    df = df.rename(columns={'target_id' : 'Name', 'length' : 'Length', 'est_counts' : 'NumReads', 'tpm' : 'TPM'})
    df = df.set_index('Name')
    kallistoDataFrames.append(df)

For all the simulations for each method, merge the results into a single data frame for easier analysis



In [7]:

    
mergedSalmon = joinFrames(salmonDataFrames)
mergedSalmon["NumReadsAvg"] = mergedSalmon[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedSalmon["TPMAvg"] = mergedSalmon[["TPM"] + ["TPM{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedSalmon["NumReadsStdDev"] = mergedSalmon[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].std(axis=1)



In [8]:

    
mergedKallisto = joinFrames(kallistoDataFrames)
mergedKallisto["NumReadsAvg"] = mergedKallisto[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedKallisto["TPMAvg"] = mergedKallisto[["TPM"] + ["TPM{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedKallisto["NumReadsStdDev"] = mergedKallisto[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].std(axis=1)

Load the ground truth data, the results of simulation 1 for Salmon, and the results on the full data for Kallisto



In [9]:

    
salmon = pd.read_csv('sim_test_salmon_1/quant.sf', sep='\t', skiprows=11)
salmon = salmon.rename(columns={'# Name': 'Name'})
salmon = salmon.set_index('Name')



In [10]:

    
kallisto = pd.read_csv('sim_test_kallisto/abundance.txt', sep='\t')
kallisto = kallisto.rename(columns={'target_id' : 'Name', 'length' : 'Length', 'est_counts' : 'NumReads', 'tpm' : 'TPM'})
kallisto = kallisto.set_index('Name')

Apparently there are some transcripts that are in the simulation, but they are not in the ground truth file ...

This code drops expressions for ground truth transcripts that were not included in the simulation, and sets "ground truth" values of 0 for those transcripts (mentioned above) that are in the simulation but not the expression ground truth file.

The original TPMs were on a larger dataset, so re-normalize them (but do it according to rounded counts --- thanks for pointing this out Harold!)



In [11]:

    
# Load the ground truth files
truth = pd.read_csv('data/quant_bias_corrected.sf', sep='\t', skiprows=12)
# Rename #Name => Name
truth = truth.rename(columns={'# Name':'Name'})
truth = truth.set_index('Name')

# There were transcripts that *were* selected for the simulation, but they *were not* in the
truth = truth.loc[kallisto.index]
truth.fillna(0, inplace=True)
# Since we filled all NaN's with 0, replace the length here
truth["Length"] = kallisto["Length"]
# Round the counts
roundedCounts = truth["NumReads"].round()

# The transcript fraction 
truth["EffectiveLen"] = truth["Length"] - MEAN_FL
# Don't allow a negative effective length
truth["EffectiveLen"][truth["EffectiveLen"] < 0] = 0

# Any reads from transcripts with no effective length
strangeReads = roundedCounts[truth["EffectiveLen"] <= 0]
print("Number of 'Strange' reads = {}".format(len(strangeReads[strangeReads > 0])))

# From the true transcript fractions, compute the true TPM with effective lengths
transcriptFracs = (roundedCounts / truth["EffectiveLen"])
transcriptFracs[transcriptFracs == np.inf] = 0
trueTPMEffectiveLen = 1000000.0 * (transcriptFracs / sum(transcriptFracs.values))
truth["TPM_EL_truth"] = trueTPMEffectiveLen

# From the true transcript fractions, compute the true TPM with actual lengths
transcriptFracs = (roundedCounts / truth["Length"])
transcriptFracs[transcriptFracs == np.inf] = 0
trueTPMRealLen = 1000000.0 * (transcriptFracs / sum(transcriptFracs.values))
truth["TPM_RL_truth"] = trueTPMRealLen

# Now, let the true counts be our rounded counts
truth["NumReads"] = roundedCounts









    



Number of 'Strange' reads = 0

Create some data frames for Salmon and Kallisto that contain the ground truth values as columns



In [12]:

    
truthSalmon = truth.join(mergedSalmon, lsuffix='_truth', rsuffix='_salmon')



In [13]:

    
truthKallisto = truth.join(kallisto, lsuffix='_truth', rsuffix='_kallisto')

Some simple correlations to start off



In [14]:

    
# correlation with the average salmon TPM 
print("TPM correlation Salmon (Avg) vs. Truth (RealLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPMAvg"]))
print("TPM correlation Salmon (Avg) vs. Truth (EffLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPMAvg"]))









    



TPM correlation Salmon (Avg) vs. Truth (RealLen) = 0.83495978804
TPM correlation Salmon (Avg) vs. Truth (EffLen) = 0.834077786529



In [15]:

    
# and with the first replicate
print("TPM correlation Salmon (Sim1) vs. Truth (RealLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]))
print("TPM correlation Salmon (Sim1) vs. Truth (EffLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]))









    



TPM correlation Salmon (Sim1) vs. Truth (RealLen) = 0.835845618235
TPM correlation Salmon (Sim1) vs. Truth (EffLen) = 0.834965112553



In [16]:

    
# correlation with Kallisto
print("TPM correlation Kallisto vs. Truth (RealLen) = {}".format(truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]))
print("TPM correlation Kallisto vs. Truth (EffLen) = {}".format(truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]))









    



TPM correlation Kallisto vs. Truth (RealLen) = 0.83485831708
TPM correlation Kallisto vs. Truth (EffLen) = 0.835781563454

Coefficient of Variation Plots



In [17]:

    
p = sns.regplot(mergedSalmon["NumReadsAvg"], mergedSalmon["NumReadsStdDev"] / mergedSalmon["NumReadsAvg"], fit_reg=False)
p.set_ylabel("CV")
p.set_title("Variability Salmon")









    Out[17]:





<matplotlib.text.Text at 0x10febe1d0>



In [18]:

    
p = sns.regplot(mergedKallisto["NumReadsAvg"], mergedKallisto["NumReadsStdDev"] / mergedKallisto["NumReadsAvg"], fit_reg=False)
p.set_ylabel("CV")
p.set_title("Variability Kallisto")









    Out[18]:





<matplotlib.text.Text at 0x10fbc6210>

TPM correlation plots



In [20]:

    
p = sns.regplot(np.log(truthSalmon["TPM_RL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.835845618235



In [21]:

    
p = sns.regplot(np.log(truthSalmon["TPM_EL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.834965112553



In [22]:

    
p = sns.regplot(np.log(truthKallisto["TPM_RL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.83485831708



In [23]:

    
p = sns.regplot(np.log(truthKallisto["TPM_EL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.835781563454

Correlations between numbers of reads

This is useful because it "sidesteps" the effective length issue



In [24]:

    
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthSalmon["NumReads_salmon"]+1), truthSalmon)
p.set_xlabel("true NumReads")
p.set_ylabel("Salmon (Sim1) NumReads")
p.set_title("Salmon NumReads correlation")
sr = truthSalmon.corr(method='spearman')["NumReads_truth"]["NumReads_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.834552169711



In [25]:

    
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthKallisto["NumReads_kallisto"]+1), truthKallisto)
p.set_xlabel("true NumReads")
p.set_ylabel("Kallisto (Sim1) NumReads")
p.set_title("Kallisto NumReads correlation")
sr = truthKallisto.corr(method="spearman")["NumReads_truth"]["NumReads_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.836134057877

No big changes, but let's see how those correlations looks if we apply some "reasonable" cutoffs



In [26]:

    
# No big changes, but let's see how everything looks if we apply some "reasonable" cutoffs
filterExpression(truthSalmon, colname="NumReads_salmon", val=1)
filterExpression(truthSalmon, colname="TPM_salmon", val=0.01)
filterExpression(truthSalmon, colname="TPMAvg", val=0.01)
filterExpression(truthSalmon, colname="NumReads_truth", val=1)
filterExpression(truthSalmon, colname="TPM_RL_truth", val=0.01)
filterExpression(truthSalmon, colname="TPM_EL_truth", val=0.01)

filterExpression(truthKallisto, colname="NumReads_kallisto", val=1)
filterExpression(truthKallisto, colname="TPM_kallisto", val=0.01)
filterExpression(truthKallisto, colname="NumReads_truth", val=1)
filterExpression(truthKallisto, colname="TPM_RL_truth", val=0.01)
filterExpression(truthKallisto, colname="TPM_EL_truth", val=0.01)

(Filtered) TPM correlation plots



In [28]:

    
p = sns.regplot(np.log(truthSalmon["TPM_RL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.84651452142



In [29]:

    
p = sns.regplot(np.log(truthSalmon["TPM_EL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.845740329822



In [30]:

    
p = sns.regplot(np.log(truthKallisto["TPM_RL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.833023060229



In [31]:

    
p = sns.regplot(np.log(truthKallisto["TPM_EL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.834136306562

(Filtered) Read Count correlation plots



In [32]:

    
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthSalmon["NumReads_salmon"]+1), truthSalmon)
p.set_xlabel("true NumReads")
p.set_ylabel("Salmon (Sim1) NumReads")
p.set_title("Salmon NumReads correlation")
sr = truthSalmon.corr(method='spearman')["NumReads_truth"]["NumReads_salmon"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.839419059657



In [33]:

    
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthKallisto["NumReads_kallisto"]+1), truthKallisto)
p.set_xlabel("true NumReads")
p.set_ylabel("Kallisto (Sim1) NumReads")
p.set_title("Kallisto NumReads correlation")
sr = truthKallisto.corr(method="spearman")["NumReads_truth"]["NumReads_kallisto"]
print("Spearman r = {}".format(sr))









    



Spearman r = 0.837290633981