In [ ]:
library(ggplot2)
library(plyr)
library(reshape2)
In [ ]:
stats = read.csv("overall.csv")
stats$rep = as.factor(sort(rep(1:10, times=28)))
In [ ]:
summary(stats)
We compare average genome coverage and the scale of varaition againsnt accuracy (i.e. Spearman's $\rho$) (over the 10 reps).
We compare the effects of coverage and scale independently.
In [ ]:
coverage = stats[stats$scale==0.01, ]
coverage$scale = NULL
summary(coverage)
In [ ]:
ggplot(coverage, aes(x=coverage, y=spearman, linetype=measure, color=rep)) +
geom_line(aes(linetype=measure, color=rep),size=1.5) +
xlab('Genome Coverage') +
ylab(expression(paste("Spearman's ", rho))) +
#scale_x_log10()+
theme_bw()
Here we summarise the replicates to averages $\pm$ SD. Note that we exclude replicate 8 as it is an outlier for both IP and WIP metrics (see above).
In [ ]:
coverage = coverage[coverage$rep != 8,]
csumm = ddply(coverage, .(coverage, measure), summarise,
spearman_m=mean(spearman),
spearman_sd=sd(spearman))
summary(csumm)
You can see below that WIP marginally outperforms IP, at low coverage. Above about 20x, I would say that WIP and IP have equivalent performance.
The ribbon is 1 SD, so there is certainly no signficant difference.
In [ ]:
ggplot(csumm, aes(x=coverage, y=spearman_m, ymin=spearman_m-spearman_sd, ymax=spearman_m+spearman_sd, group=measure)) +
geom_line(aes(linetype=measure)) +
geom_ribbon(aes(fill=measure), alpha=0.2) +
xlab('Genome Coverage') +
ylab(expression(paste("Spearman's ", rho))) +
#scale_x_log10()+
theme_bw()
The differnce between WIP and IP is calculated here
In [ ]:
cdiff = dcast(coverage, coverage * rep~ measure, value.var="spearman")
#cdiff = ddply(cdiff, .(coverage, rep), summarise, spearman_d=wip - ip)
cdiff = ddply(cdiff, .(coverage), summarise, diff_m=mean(wip - ip), diff_sd=sd(wip - ip))
summary(cdiff)
In [ ]:
ggplot(cdiff, aes(x=coverage, y=diff_m, ymin=diff_m-diff_sd, ymax=diff_m+diff_sd)) +
geom_line() +
geom_ribbon(alpha=0.4) +
xlab('Genome Coverage') +
ylab(expression(paste("Spearman's ", rho))) +
#scale_x_log10()+
theme_bw()
Like coverage, we investigate the effect of variation at a constant coverage, in this case 30x. I also convert the scale into its inverse, as this is how some people prefer to think of it (i.e. one variant in X bases, as opposed to 0.0x variants per base on average. Each to their own...)
In [ ]:
scale = stats[stats$coverage==30 & stats$scale <= 0.01,]
scale$var = as.factor(as.character(1/scale$scale))
scale$var = reorder(scale$var, 1/scale$scale, min)
scale$scale = 1/scale$scale
str(scale)
In [ ]:
ssumm = ddply(scale, .(measure, scale), summarise,
sp_avg=mean(spearman),
sp_max=(mean(spearman) + sd(spearman)),
sp_min=(mean(spearman) - sd(spearman)))
In [ ]:
p = ggplot(ssumm, aes(x=scale, y=sp_avg, ymin=sp_min, ymax=sp_max, group=measure)) +
geom_line(aes(linetype=measure)) +
geom_ribbon(aes(fill=measure), alpha=0.2) +
xlab('Divergence (1 variant per x bases)') +
ylab(expression(paste("Spearman's ", rho))) +
#scale_x_log10() +
scale_x_reverse() +
theme_bw()
pdf("variation_vs_accuracy.pdf")
print(p)
dev.off()
p
In [ ]: