This is a (slightly modified) reproduction of Simon's quantification comparison analysis, but in an iPython Notebook


In [1]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Some constants and useful variables


In [2]:
MEAN_FL = 250
small = 1e-2 # for log things (use same value as Harold)

Some basic functions


In [3]:
# Truncate values in colname < val to 0
def filterExpression(df, colname="TPM", val=0.01):
    df[colname][df[colname] < val] = 0

In [4]:
# Join a list of DataFrames together, merging on the index and renaming conflicting columns
def joinFrames(frames):
    current = frames[0]
    for i, frame in enumerate(frames[1:], 2):
        current = current.join(frame, rsuffix='{}'.format(i))
    return current

Load all of the simulation files for Salmon & Kallisto


In [5]:
salmon_files = ["sim_test_salmon_{}/quant.sf".format(i) for i in xrange(1,11)]
salmonDataFrames = []
for fn in salmon_files:
    # Rename #Name => Name
    df = pd.read_csv(fn, sep='\t', skiprows=11)
    df = df.rename(columns={'# Name': 'Name'})
    df = df.set_index('Name')
    salmonDataFrames.append(df)

In [6]:
kallisto_files = ["sim_test_kallisto_bs/bs_abundance_{}.txt".format(i) for i in xrange(10)]
kallistoDataFrames = []
for fn in kallisto_files:
    # Rename #Name => Name
    df = pd.read_csv(fn, sep='\t')
    df = df.rename(columns={'target_id' : 'Name', 'length' : 'Length', 'est_counts' : 'NumReads', 'tpm' : 'TPM'})
    df = df.set_index('Name')
    kallistoDataFrames.append(df)

For all the simulations for each method, merge the results into a single data frame for easier analysis


In [7]:
mergedSalmon = joinFrames(salmonDataFrames)
mergedSalmon["NumReadsAvg"] = mergedSalmon[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedSalmon["TPMAvg"] = mergedSalmon[["TPM"] + ["TPM{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedSalmon["NumReadsStdDev"] = mergedSalmon[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].std(axis=1)

In [8]:
mergedKallisto = joinFrames(kallistoDataFrames)
mergedKallisto["NumReadsAvg"] = mergedKallisto[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedKallisto["TPMAvg"] = mergedKallisto[["TPM"] + ["TPM{}".format(i) for i in xrange(2,11)]].mean(axis=1)
mergedKallisto["NumReadsStdDev"] = mergedKallisto[["NumReads"] + ["NumReads{}".format(i) for i in xrange(2,11)]].std(axis=1)

Load the ground truth data, the results of simulation 1 for Salmon, and the results on the full data for Kallisto


In [9]:
salmon = pd.read_csv('sim_test_salmon_1/quant.sf', sep='\t', skiprows=11)
salmon = salmon.rename(columns={'# Name': 'Name'})
salmon = salmon.set_index('Name')

In [10]:
kallisto = pd.read_csv('sim_test_kallisto/abundance.txt', sep='\t')
kallisto = kallisto.rename(columns={'target_id' : 'Name', 'length' : 'Length', 'est_counts' : 'NumReads', 'tpm' : 'TPM'})
kallisto = kallisto.set_index('Name')

Apparently there are some transcripts that are in the simulation, but they are not in the ground truth file ...

This code drops expressions for ground truth transcripts that were not included in the simulation, and sets "ground truth" values of 0 for those transcripts (mentioned above) that are in the simulation but not the expression ground truth file.

The original TPMs were on a larger dataset, so re-normalize them (but do it according to rounded counts --- thanks for pointing this out Harold!)


In [11]:
# Load the ground truth files
truth = pd.read_csv('data/quant_bias_corrected.sf', sep='\t', skiprows=12)
# Rename #Name => Name
truth = truth.rename(columns={'# Name':'Name'})
truth = truth.set_index('Name')

# There were transcripts that *were* selected for the simulation, but they *were not* in the
truth = truth.loc[kallisto.index]
truth.fillna(0, inplace=True)
# Since we filled all NaN's with 0, replace the length here
truth["Length"] = kallisto["Length"]
# Round the counts
roundedCounts = truth["NumReads"].round()

# The transcript fraction 
truth["EffectiveLen"] = truth["Length"] - MEAN_FL
# Don't allow a negative effective length
truth["EffectiveLen"][truth["EffectiveLen"] < 0] = 0

# Any reads from transcripts with no effective length
strangeReads = roundedCounts[truth["EffectiveLen"] <= 0]
print("Number of 'Strange' reads = {}".format(len(strangeReads[strangeReads > 0])))

# From the true transcript fractions, compute the true TPM with effective lengths
transcriptFracs = (roundedCounts / truth["EffectiveLen"])
transcriptFracs[transcriptFracs == np.inf] = 0
trueTPMEffectiveLen = 1000000.0 * (transcriptFracs / sum(transcriptFracs.values))
truth["TPM_EL_truth"] = trueTPMEffectiveLen

# From the true transcript fractions, compute the true TPM with actual lengths
transcriptFracs = (roundedCounts / truth["Length"])
transcriptFracs[transcriptFracs == np.inf] = 0
trueTPMRealLen = 1000000.0 * (transcriptFracs / sum(transcriptFracs.values))
truth["TPM_RL_truth"] = trueTPMRealLen

# Now, let the true counts be our rounded counts
truth["NumReads"] = roundedCounts


Number of 'Strange' reads = 0

Create some data frames for Salmon and Kallisto that contain the ground truth values as columns


In [12]:
truthSalmon = truth.join(mergedSalmon, lsuffix='_truth', rsuffix='_salmon')

In [13]:
truthKallisto = truth.join(kallisto, lsuffix='_truth', rsuffix='_kallisto')

Some simple correlations to start off


In [14]:
# correlation with the average salmon TPM 
print("TPM correlation Salmon (Avg) vs. Truth (RealLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPMAvg"]))
print("TPM correlation Salmon (Avg) vs. Truth (EffLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPMAvg"]))


TPM correlation Salmon (Avg) vs. Truth (RealLen) = 0.83495978804
TPM correlation Salmon (Avg) vs. Truth (EffLen) = 0.834077786529

In [15]:
# and with the first replicate
print("TPM correlation Salmon (Sim1) vs. Truth (RealLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]))
print("TPM correlation Salmon (Sim1) vs. Truth (EffLen) = {}".format(truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]))


TPM correlation Salmon (Sim1) vs. Truth (RealLen) = 0.835845618235
TPM correlation Salmon (Sim1) vs. Truth (EffLen) = 0.834965112553

In [16]:
# correlation with Kallisto
print("TPM correlation Kallisto vs. Truth (RealLen) = {}".format(truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]))
print("TPM correlation Kallisto vs. Truth (EffLen) = {}".format(truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]))


TPM correlation Kallisto vs. Truth (RealLen) = 0.83485831708
TPM correlation Kallisto vs. Truth (EffLen) = 0.835781563454

Coefficient of Variation Plots


In [17]:
p = sns.regplot(mergedSalmon["NumReadsAvg"], mergedSalmon["NumReadsStdDev"] / mergedSalmon["NumReadsAvg"], fit_reg=False)
p.set_ylabel("CV")
p.set_title("Variability Salmon")


Out[17]:
<matplotlib.text.Text at 0x10febe1d0>

In [18]:
p = sns.regplot(mergedKallisto["NumReadsAvg"], mergedKallisto["NumReadsStdDev"] / mergedKallisto["NumReadsAvg"], fit_reg=False)
p.set_ylabel("CV")
p.set_title("Variability Kallisto")


Out[18]:
<matplotlib.text.Text at 0x10fbc6210>

TPM correlation plots


In [20]:
p = sns.regplot(np.log(truthSalmon["TPM_RL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.835845618235

In [21]:
p = sns.regplot(np.log(truthSalmon["TPM_EL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.834965112553

In [22]:
p = sns.regplot(np.log(truthKallisto["TPM_RL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.83485831708

In [23]:
p = sns.regplot(np.log(truthKallisto["TPM_EL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.835781563454

Correlations between numbers of reads

This is useful because it "sidesteps" the effective length issue


In [24]:
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthSalmon["NumReads_salmon"]+1), truthSalmon)
p.set_xlabel("true NumReads")
p.set_ylabel("Salmon (Sim1) NumReads")
p.set_title("Salmon NumReads correlation")
sr = truthSalmon.corr(method='spearman')["NumReads_truth"]["NumReads_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.834552169711

In [25]:
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthKallisto["NumReads_kallisto"]+1), truthKallisto)
p.set_xlabel("true NumReads")
p.set_ylabel("Kallisto (Sim1) NumReads")
p.set_title("Kallisto NumReads correlation")
sr = truthKallisto.corr(method="spearman")["NumReads_truth"]["NumReads_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.836134057877

No big changes, but let's see how those correlations looks if we apply some "reasonable" cutoffs


In [26]:
# No big changes, but let's see how everything looks if we apply some "reasonable" cutoffs
filterExpression(truthSalmon, colname="NumReads_salmon", val=1)
filterExpression(truthSalmon, colname="TPM_salmon", val=0.01)
filterExpression(truthSalmon, colname="TPMAvg", val=0.01)
filterExpression(truthSalmon, colname="NumReads_truth", val=1)
filterExpression(truthSalmon, colname="TPM_RL_truth", val=0.01)
filterExpression(truthSalmon, colname="TPM_EL_truth", val=0.01)

filterExpression(truthKallisto, colname="NumReads_kallisto", val=1)
filterExpression(truthKallisto, colname="TPM_kallisto", val=0.01)
filterExpression(truthKallisto, colname="NumReads_truth", val=1)
filterExpression(truthKallisto, colname="TPM_RL_truth", val=0.01)
filterExpression(truthKallisto, colname="TPM_EL_truth", val=0.01)

(Filtered) TPM correlation plots


In [28]:
p = sns.regplot(np.log(truthSalmon["TPM_RL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_RL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.84651452142

In [29]:
p = sns.regplot(np.log(truthSalmon["TPM_EL_truth"].values+1), np.log(truthSalmon["TPM_salmon"]+1), truthSalmon)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Salmon (Sim1) TPM")
p.set_title("Salmon TPM (Sim1) correlation")
sr = truthSalmon.corr(method='spearman')["TPM_EL_truth"]["TPM_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.845740329822

In [30]:
p = sns.regplot(np.log(truthKallisto["TPM_RL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (RealLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_RL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.833023060229

In [31]:
p = sns.regplot(np.log(truthKallisto["TPM_EL_truth"].values+1), np.log(truthKallisto["TPM_kallisto"].values+1), truthKallisto)
p.set_xlabel("true TPM (EffLen)")
p.set_ylabel("Kallisto (FullData) TPM")
p.set_title("Kallisto TPM correlation")
sr = truthKallisto.corr(method='spearman')["TPM_EL_truth"]["TPM_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.834136306562

(Filtered) Read Count correlation plots


In [32]:
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthSalmon["NumReads_salmon"]+1), truthSalmon)
p.set_xlabel("true NumReads")
p.set_ylabel("Salmon (Sim1) NumReads")
p.set_title("Salmon NumReads correlation")
sr = truthSalmon.corr(method='spearman')["NumReads_truth"]["NumReads_salmon"]
print("Spearman r = {}".format(sr))


Spearman r = 0.839419059657

In [33]:
p = sns.regplot(np.log(truthSalmon["NumReads_truth"].values+1), np.log(truthKallisto["NumReads_kallisto"]+1), truthKallisto)
p.set_xlabel("true NumReads")
p.set_ylabel("Kallisto (Sim1) NumReads")
p.set_title("Kallisto NumReads correlation")
sr = truthKallisto.corr(method="spearman")["NumReads_truth"]["NumReads_kallisto"]
print("Spearman r = {}".format(sr))


Spearman r = 0.837290633981