Analysis of the Coalescent Simulation



In [ ]:

    
library(ggplot2)
library(plyr)
library(reshape2)

Dataset

This is the summary of Spearman's $\rho$ over 10 replicates of the "coalescent" experiment



In [ ]:

    
stats = read.csv("overall.csv")
stats$rep = as.factor(sort(rep(1:10, times=44)))



In [ ]:

    
summary(stats)
table(stats$scale)
table(stats$coverage)

We compare average genome coverage and the scale of varaition againsnt accuracy (i.e. Spearman's $\rho$) (over the 10 reps).

We compare the effects of coverage and scale independently.

Coverage vs $\rho$ at 1/100 variation

A series of average coverages was run at the scale of 0.01 (i.e. an average of 1 variant in 100 bases across all pairwise comparisions of samples)



In [ ]:

    
coverage = stats[stats$scale==0.01, ]
coverage$scale = NULL
summary(coverage)



In [ ]:

    
ggplot(coverage, aes(x=coverage, y=spearman, linetype=measure, color=rep)) +
    geom_line(aes(linetype=measure, color=rep),size=1.5) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

Here we summarise the replicates to averages $\pm$ SD. Note that we exclude replicate 8 as it is an outlier for both IP and WIP metrics (see above).



In [ ]:

    
coverage = coverage[coverage$rep != 8,]
csumm = ddply(coverage, .(coverage, measure), summarise,
                spearman_m=mean(spearman),
                spearman_sd=sd(spearman))
summary(csumm)

You can see below that WIP marginally outperforms IP, at low coverage. Above about 20x, I would say that WIP and IP have equivalent performance.

The ribbon is 2 SD, so there is certainly no signficant difference.



In [ ]:

    
ggplot(csumm, aes(x=coverage, y=spearman_m, ymin=spearman_m-2*spearman_sd, ymax=spearman_m+2*spearman_sd, group=measure)) +
    geom_line(aes(linetype=measure)) +
    geom_ribbon(aes(fill=measure), alpha=0.2) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

The differnce between WIP and IP is calculated here



In [ ]:

    
cdiff = dcast(coverage, coverage * rep~ measure, value.var="spearman")
#cdiff = ddply(cdiff, .(coverage, rep), summarise, spearman_d=wip - ip)
cdiff = ddply(cdiff, .(coverage), summarise, diff_m=mean(wip - ip), diff_sd=sd(wip - ip))

summary(cdiff)



In [ ]:

    
ggplot(cdiff, aes(x=coverage, y=diff_m, ymin=diff_m-diff_sd, ymax=diff_m+diff_sd)) +
    geom_line() +
    geom_ribbon(alpha=0.4) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    scale_x_log10()+
    theme_bw()

There is no signficant difference between the two methods (0 is never outside the confidence interval).

Coverage vs $\rho$ at the lower 1/1000 varants

A series of average coverages was run at the scale of 0.001 (i.e. an average of 1 variant in 1000 bases across all pairwise comparisions of samples)



In [ ]:

    
coverage = stats[stats$scale==0.001, ]
coverage$scale = NULL
summary(coverage)



In [ ]:

    
ggplot(coverage, aes(x=coverage, y=spearman, linetype=measure, color=rep)) +
    geom_line(aes(linetype=measure, color=rep),size=1.5) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

Here we summarise the replicates to averages $\pm$ SD.



In [ ]:

    
csumm = ddply(coverage, .(coverage, measure), summarise,
                spearman_m=mean(spearman),
                spearman_sd=sd(spearman))
summary(csumm)

You can see below that WIP marginally outperforms IP, at low coverage. Above about 20x, I would say that WIP and IP have equivalent performance.

The ribbon is 2 SD, so this time there could be signficant differences.



In [ ]:

    
p = ggplot(csumm, aes(x=coverage, y=spearman_m, ymin=spearman_m-spearman_sd, ymax=spearman_m+spearman_sd, group=measure)) +
    geom_line(aes(linetype=measure)) +
    geom_ribbon(aes(fill=measure), alpha=0.2) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()
pdf("cov_vs_rho_1000.pdf")
print(p)
dev.off()
p

The differnce between WIP and IP is calculated here



In [ ]:

    
cdiff = dcast(coverage, coverage * rep~ measure, value.var="spearman")
#cdiff = ddply(cdiff, .(coverage, rep), summarise, spearman_d=wip - ip)
cdiff = ddply(cdiff, .(coverage), summarise, diff_m=mean(wip - ip), diff_sd=sd(wip - ip))

summary(cdiff)



In [ ]:

    
ggplot(cdiff, aes(x=coverage, y=diff_m, ymin=diff_m-2*diff_sd, ymax=diff_m+2*diff_sd)) +
    geom_line() +
    geom_ribbon(alpha=0.4) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    scale_x_log10()+
    theme_bw()

Coverage vs $\rho$ at 1/1000

A series of average coverages was run at the scale of 0.001 (i.e. an average of 1 variant in 1000 bases across all pairwise comparisions of samples)



In [ ]:

    
coverage = stats[stats$scale==0.001, ]
coverage$scale = NULL
summary(coverage)



In [ ]:

    
ggplot(coverage, aes(x=coverage, y=spearman, linetype=measure, color=rep)) +
    geom_line(aes(linetype=measure, color=rep),size=1.5) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

Here we summarise the replicates to averages $\pm$ SD. Note that we exclude replicate 8 as it is an outlier for both IP and WIP metrics (see above).



In [ ]:

    
coverage = coverage[coverage$rep != 8,]
csumm = ddply(coverage, .(coverage, measure), summarise,
                spearman_m=mean(spearman),
                spearman_sd=sd(spearman))
summary(csumm)

You can see below that WIP marginally outperforms IP, at low coverage. Above about 20x, I would say that WIP and IP have equivalent performance.

The ribbon is 1 SD, so there is certainly no signficant difference.



In [ ]:

    
p <- ggplot(csumm, aes(x=coverage, y=spearman_m, ymin=spearman_m-spearman_sd, ymax=spearman_m+spearman_sd, group=measure)) +
    geom_line(aes(linetype=measure)) +
    geom_ribbon(aes(fill=measure), alpha=0.2) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()
p



In [ ]:

    
pdf('coverage-accuracy.pdf')
print(p)
dev.off()

The differnce between WIP and IP is calculated here



In [ ]:

    
cdiff = dcast(coverage, coverage * rep~ measure, value.var="spearman")
#cdiff = ddply(cdiff, .(coverage, rep), summarise, spearman_d=wip - ip)
cdiff = ddply(cdiff, .(coverage), summarise, diff_m=mean(wip - ip), diff_sd=sd(wip - ip))

summary(cdiff)



In [ ]:

    
p = ggplot(cdiff, aes(x=coverage, y=diff_m, ymin=diff_m-2*diff_sd, ymax=diff_m+2*diff_sd)) +
    geom_line() +
    geom_ribbon(alpha=0.2) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Pair-wise difference in Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

p



In [ ]:

    
pdf('diff-coverage.pdf')
print(p)
dev.off()

So in terms of pairwise differences in performance between WIP and IP, there is a signfincant improvement above about 2x (below which they are both pretty crap)

Conclusions

As expected

I think there might be an issue with the way I normalise trees. I think that we are probably at a higher level of divergence than I expect if we use the mean. I will do a run with a couple of reps using the maximum distance set to 1.0, i.e. that the entire tree scale is 0.5 (from root to tip, and then back again =1.0).



In [ ]: