Analysis of the Coalescent Simulation



In [ ]:

    
library(ggplot2)
library(plyr)
library(reshape2)

Dataset

This is the summary of Spearman's $\rho$ over 10 replicates of the "coalescent" experiment



In [ ]:

    
stats = read.csv("overall.csv")
stats$rep = as.factor(sort(rep(1:10, times=28)))



In [ ]:

    
summary(stats)

We compare average genome coverage and the scale of varaition againsnt accuracy (i.e. Spearman's $\rho$) (over the 10 reps).

We compare the effects of coverage and scale independently.

Coverage vs $\rho$

A series of average coverages was run at the scale of 0.01 (i.e. an average of 1 variant in 100 bases across all pairwise comparisions of samples)



In [ ]:

    
coverage = stats[stats$scale==0.01, ]
coverage$scale = NULL
summary(coverage)



In [ ]:

    
ggplot(coverage, aes(x=coverage, y=spearman, linetype=measure, color=rep)) +
    geom_line(aes(linetype=measure, color=rep),size=1.5) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

Here we summarise the replicates to averages $\pm$ SD. Note that we exclude replicate 8 as it is an outlier for both IP and WIP metrics (see above).



In [ ]:

    
coverage = coverage[coverage$rep != 8,]
csumm = ddply(coverage, .(coverage, measure), summarise,
                spearman_m=mean(spearman),
                spearman_sd=sd(spearman))
summary(csumm)

You can see below that WIP marginally outperforms IP, at low coverage. Above about 20x, I would say that WIP and IP have equivalent performance.

The ribbon is 1 SD, so there is certainly no signficant difference.



In [ ]:

    
ggplot(csumm, aes(x=coverage, y=spearman_m, ymin=spearman_m-spearman_sd, ymax=spearman_m+spearman_sd, group=measure)) +
    geom_line(aes(linetype=measure)) +
    geom_ribbon(aes(fill=measure), alpha=0.2) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

The differnce between WIP and IP is calculated here



In [ ]:

    
cdiff = dcast(coverage, coverage * rep~ measure, value.var="spearman")
#cdiff = ddply(cdiff, .(coverage, rep), summarise, spearman_d=wip - ip)
cdiff = ddply(cdiff, .(coverage), summarise, diff_m=mean(wip - ip), diff_sd=sd(wip - ip))

summary(cdiff)



In [ ]:

    
ggplot(cdiff, aes(x=coverage, y=diff_m, ymin=diff_m-diff_sd, ymax=diff_m+diff_sd)) +
    geom_line() +
    geom_ribbon(alpha=0.4) +
    xlab('Genome Coverage') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10()+
    theme_bw()

Scale vs $\rho$

Like coverage, we investigate the effect of variation at a constant coverage, in this case 30x. I also convert the scale into its inverse, as this is how some people prefer to think of it (i.e. one variant in X bases, as opposed to 0.0x variants per base on average. Each to their own...)



In [ ]:

    
scale = stats[stats$coverage==30 & stats$scale <= 0.01,]
scale$var = as.factor(as.character(1/scale$scale))
scale$var = reorder(scale$var, 1/scale$scale, min)
scale$scale = 1/scale$scale
str(scale)



In [ ]:

    
ssumm = ddply(scale, .(measure, scale), summarise,
              sp_avg=mean(spearman),
              sp_max=(mean(spearman) + sd(spearman)),
              sp_min=(mean(spearman) - sd(spearman)))



In [ ]:

    
p = ggplot(ssumm, aes(x=scale,  y=sp_avg, ymin=sp_min, ymax=sp_max, group=measure)) +
    geom_line(aes(linetype=measure)) +
    geom_ribbon(aes(fill=measure), alpha=0.2) +
    xlab('Divergence (1 variant per x bases)') +
    ylab(expression(paste("Spearman's ", rho))) +
    #scale_x_log10() +
    scale_x_reverse() +
    theme_bw()
pdf("variation_vs_accuracy.pdf")
print(p)
dev.off()
p

Conclusions

I think there might be an issue with the way I normalise trees. I think that we are probably at a higher level of divergence than I expect if we use the mean. I will do a run with a couple of reps using the maximum distance set to 1.0, i.e. that the entire tree scale is 0.5 (from root to tip, and then back again =1.0).
I'd like to re-do the coverage sweep at a scale of 0.005 or 0.002 or even 0.001. I think that this might be more inline with our rice experiment. My take home from this is that WIP is only important when your signal:noise ratio is low, like when you have a small amount of variation. Otherwise, they are equivalent (neither is signficantly worse on average). Norman, can you comment?



In [ ]: