Downstream analysis of HPLC-Orbitrap data

This notebook replicates the analysis workflow created by Marc Rurik. The workflow, wrapped in KNIME nodes, can be found here and is illustrated in the picture below. It was developed to analyze HPLC-Orbitrap measurements split into separate or alternating m/z scan ranges. The expected input consists of text files exported from OpenMS consensusXML by the OpenMS TextExporter. The workflow combines the scan ranges, filters them using blank samples and QC measurements, and then performs statistical analysis to retrieve reliable features.

The aim of this notebook is to provide a user-friendly and interactive environment where you can get on-the-fly information about the processing of your data. After each step of the pipeline, we provide live feedback in the form of informative plots.

Illustration by: Marc Rurik

Read input files

  • Please run the following snippet and insert the paths to your input files containing the low and high mass ranges

In [ ]:
sessionInfo()

In [ ]:
# Read the TextExporter output for each scan range.
# skip = 7 drops the first seven lines of the export, fill = TRUE together with
# 227 generic column names (V1..V227) accommodates rows of differing length,
# and sep = "" splits on any whitespace.
low = read.csv("../results/alternate_pos_low_mr.csv", 
               header = FALSE, skip=7,
               stringsAsFactors = FALSE, 
               fill=TRUE, 
               col.names= paste0("V",seq_len(227)), 
               sep="")
high = read.csv("../results/alternate_pos_high_mr.csv", 
                header = FALSE, skip=7,
                stringsAsFactors = FALSE, 
                fill=TRUE, 
                col.names= paste0("V",seq_len(227)), 
                sep="")

In [ ]:
print("Head and tail of low mass range")
head(low)
tail(low)
print("Head and tail of high mass range")
head(high)
tail(high)
  • Source the R scripts

In [ ]:
source('Functions.R')
  • Parse the files

In [ ]:
low_parsed <- Parse(low)
high_parsed <- Parse(high)
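
Parse() is defined in Functions.R, which is not shown in this notebook. As a rough orientation only, a minimal sketch of what such a parser could look like for TextExporter output is given below; it assumes that consensus feature rows are flagged with "CONSENSUS" in the first column and that the matching header row starts with "#CONSENSUS" (both are assumptions about the export format, not taken from Functions.R).

# Illustrative sketch only -- the actual Parse() is sourced from Functions.R.
# Assumption: TextExporter writes one header line starting with "#CONSENSUS"
# and one line per consensus feature starting with "CONSENSUS".
Parse_sketch <- function(raw) {
    header <- raw[raw$V1 == "#CONSENSUS", ][1, ]   # column labels of the consensus block
    feats  <- raw[raw$V1 == "CONSENSUS", ]         # one row per consensus feature
    names(feats) <- as.character(unlist(header))
    feats[, -1]                                    # drop the row-type column
}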

In [ ]:
# Replace literal "NaN" entries with proper NA values
low_parsed[low_parsed=="NaN"] <- NA
high_parsed[high_parsed=="NaN"] <- NA
cat("Size of low mass range matrix:", dim(low_parsed), "\n")
head(low_parsed)
cat("Size of high mass range matrix:", dim(high_parsed), "\n")
head(high_parsed)
  • Rename columns

In [ ]:
low_renamed <- renameColumns(low_parsed)
high_renamed <- renameColumns(high_parsed)

if (ncol(low_renamed)!=ncol(high_renamed)) {
    if (ncol(low_renamed) > ncol(high_renamed)) {
        warning("Samples are missing in the high mass range")
    } else {
        warning("Samples are missing in the low mass range")
    }
}

print(names(low_renamed))
  • Concatenate the scan ranges

In [ ]:
# Stack the two scan ranges; the high-range columns are reordered to match the low range
data <- rbind(low_renamed, high_renamed[, names(low_renamed)])
names(data) <- gsub("_", ".", names(data))   # replace underscores with dots in the column names
cat("Number of features: ", nrow(data))

Blank Filter


In [ ]:
blankFilterPassed = 20 # Percentage: if the median blank intensity is more than X% of the sample intensity, the feature is treated as a contaminant
data.bf <- blankFilter(data,blankFilterPassed)
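
blankFilter() itself comes from Functions.R. The sketch below only illustrates the idea described in the comment above: a feature is kept when the median intensity in the blank injections stays below the given percentage of the median sample intensity. The "Blank" column pattern and the rt/mz annotation columns are assumptions about the table layout, not the actual implementation.

# Minimal sketch of the blank-filter logic (assumed layout: "rt"/"mz" annotation
# columns, blank injections identified by "Blank" in their column name).
blankFilter_sketch <- function(df, threshold) {
    intensity  <- apply(df[, setdiff(names(df), c("rt", "mz"))], 2, as.numeric)
    blanks     <- grep("Blank", colnames(intensity))
    blank_med  <- apply(intensity[, blanks,  drop = FALSE], 1, median, na.rm = TRUE)
    sample_med <- apply(intensity[, -blanks, drop = FALSE], 1, median, na.rm = TRUE)
    df[which(blank_med <= threshold / 100 * sample_med), ]
}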

ConsensusMap Normalization


In [ ]:
# set parameters
ignoreColsPattern = c("TCA", "Blank")
method = "mean"
outlier = c(0.68, 1/0.68)
verbose = TRUE

data.bf.norm <- consensusMapNormalization(data.bf, ignoreColsPattern, method, outlier, verbose)
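
consensusMapNormalization() is part of Functions.R; its return value is used below via $df (the normalized table) and $ignoredCols (the columns excluded from normalization, here those matching "TCA" or "Blank"). As a hedged sketch only, a "mean" scaling with an outlier window could look roughly like this (not the actual implementation):

# Sketch: scale every sample column so that its mean intensity matches the grand
# mean; scaling factors outside the `outlier` interval are merely reported here.
normalize_sketch <- function(df, ignoreColsPattern, outlier = c(0.68, 1 / 0.68)) {
    ignored <- grep(paste(ignoreColsPattern, collapse = "|"), names(df), value = TRUE)
    keep    <- setdiff(names(df), c(ignored, "rt", "mz"))   # rt/mz assumed annotation columns
    mat     <- apply(df[, keep], 2, as.numeric)
    factors <- mean(colMeans(mat, na.rm = TRUE)) / colMeans(mat, na.rm = TRUE)
    flagged <- names(factors)[factors < outlier[1] | factors > outlier[2]]
    df[, keep] <- sweep(mat, 2, factors, `*`)
    list(df = df, ignoredCols = ignored, factors = factors, outliers = flagged)
}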
  • Plot the log2 intensity and TIC distribution before and after normalization

In [ ]:
# Total ion current (summed intensities per sample) before and after normalization
TIC     <- colSums(apply(data.bf[, -which(names(data.bf) %in% data.bf.norm$ignoredCols)], 2, as.numeric), na.rm = TRUE)
TICnorm <- colSums(data.bf.norm$df[, -which(names(data.bf.norm$df) %in% data.bf.norm$ignoredCols)], na.rm = TRUE)

# Boxplots before and after normalization
n <- ncol(data.matrix(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)])) + 7

cc <- rep("black", length(names(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)])))
cc[grep("Pool", names(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)]))] <- topo.colors(5)[1]
cc[grep("C", names(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)]))] <- topo.colors(5)[2]
cc[grep("L", names(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)]))] <- topo.colors(5)[3]
cc[grep("H", names(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)]))] <- topo.colors(5)[4]

layout(matrix(c(1, 1, 2, 3, 3, 4), nrow = 2, byrow = TRUE))
boxplot(log2(data.matrix(data.bf[,-which(names(data.bf) %in% data.bf.norm$ignoredCols)])), 
        main="Before normalization",
        col=cc,las=2, cex.axis=0.9, ylab="log2(intensity)")
boxplot(TIC,lwd = 2, frame=F, main="TIC distribution")
stripchart(TIC, vertical = TRUE, 
    method = "jitter", add = TRUE, pch = 20, col = 'blue')
boxplot(log2(data.matrix(data.bf.norm$df[,-which(names(data.bf.norm$df) %in% data.bf.norm$ignoredCols)])), 
        main="After normalization",
        col=cc, las=2, cex.axis=0.9, ylab="log2(intensity)")
boxplot(TICnorm,lwd = 2, frame=F, main="TIC distribution")
stripchart(TICnorm, vertical = TRUE, 
    method = "jitter", add = TRUE, pch = 20, col = 'blue')
par(mfrow=c(1,1))

Filtering on pools, RSD and biological replicates


In [ ]:
# set parameters
poolFilterCount = 5
biolReplFilterCount = 2
numReplicates = 3
maxRSD = 25

data.bf.norm.filt <- QAFilter(data.bf.norm$df, poolFilterCount, biolReplFilterCount, numReplicates, maxRSD)
cat("Number of features left:", nrow(data.bf.norm.filt))
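
QAFilter() is defined in Functions.R. Judging from the parameter names, poolFilterCount and biolReplFilterCount are the minimum number of pool (QC) measurements and biological replicates a feature must be detected in, numReplicates the number of replicates per group, and maxRSD the maximum relative standard deviation (in %) tolerated across the pool injections. The snippet below only illustrates the RSD part of such a filter; the "Pool" column pattern is an assumption.

# Illustrative RSD check on the pooled QC columns (not the actual QAFilter()).
pool_int <- apply(data.bf.norm$df[, grep("Pool", names(data.bf.norm$df)), drop = FALSE], 2, as.numeric)
pool_rsd <- 100 * apply(pool_int, 1, sd, na.rm = TRUE) / rowMeans(pool_int, na.rm = TRUE)
summary(pool_rsd)   # features with an RSD above maxRSD would be discarded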

Handle missing values


In [ ]:
data.bf.norm.filt.imputed <- missingValues(data.bf.norm.filt)
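
missingValues() comes from Functions.R and its imputation rule is not shown here. A commonly used fallback in metabolomics, shown purely as an example, is to replace the missing intensities of a feature with half of its smallest observed value:

# Example imputation rule (half-minimum per feature); not necessarily what
# missingValues() does.
impute_halfmin <- function(mat) {
    t(apply(mat, 1, function(x) {
        x[is.na(x)] <- min(x, na.rm = TRUE) / 2
        x
    }))
}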

Plot figure 3 in the paper


In [ ]:
low_mass  <- sum(as.numeric(data.bf.norm.filt.imputed$mz) < 200)
mid_mass  <- sum(as.numeric(data.bf.norm.filt.imputed$mz) >= 200 & as.numeric(data.bf.norm.filt.imputed$mz) < 400)
high_mass <- sum(as.numeric(data.bf.norm.filt.imputed$mz) >= 400 & as.numeric(data.bf.norm.filt.imputed$mz) < 1000)

# use a separate data frame so the combined feature table in `data` is not overwritten
mass_counts <- data.frame(x = c("50-200", "200-400", "400-1000"),
                          y = c(low_mass, mid_mass, high_mass))

ylimit <- c(0, 1.2 * max(mass_counts$y))
par(oma = c(2, 5, 3, 5))
xx <- barplot(mass_counts$y,
              names.arg = mass_counts$x,
              ylim = ylimit,
              ylab = "No. of features",
              xlab = "m/z range",
              col = topo.colors(6))
text(x = xx,
     y = mass_counts$y,
     labels = mass_counts$y,
     pos = 3)
title("Number of features detected in selected mass segment", outer = TRUE)

Statistical Analysis


In [ ]:
stat_features <- statisticalAnalysis(data.bf.norm.filt.imputed)
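
statisticalAnalysis() is also sourced from Functions.R; downstream, its result is only accessed through columns whose names contain "fc" (fold change) and end in "_pval" (p-value). A hedged sketch of what one such pairwise comparison could look like, with hypothetical group name patterns, is given below (the actual tests may differ):

# Sketch of a single group comparison: ratio of group means plus a Welch t-test
# p-value per feature. patternA/patternB are hypothetical column name patterns.
compare_groups <- function(df, patternA, patternB) {
    a <- apply(df[, grep(patternA, names(df)), drop = FALSE], 2, as.numeric)
    b <- apply(df[, grep(patternB, names(df)), drop = FALSE], 2, as.numeric)
    data.frame(
        fc   = rowMeans(a, na.rm = TRUE) / rowMeans(b, na.rm = TRUE),
        pval = sapply(seq_len(nrow(a)), function(i) t.test(a[i, ], b[i, ])$p.value)
    )
}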

In [ ]:
source('multiplot.R')
FoldChange <- data.frame(stat_features[,grep("fc", names(stat_features))])
pvalues <- data.frame(stat_features[,grep("pval", names(stat_features))])
header <- gsub("_pval", "",names(pvalues))
par(mfrow=c(2,length(FoldChange)/2), oma= c(0,0,11,0), cex.main= 2)
for (i in 1:length(FoldChange)) {
    plot(log2(FoldChange[,i]),
         -log10(pvalues[,i]),
         pch=20,
         #xlim=c(-max(log2(FoldChange[,i])),max(log2(FoldChange[,i]))),
         xlim=c(-4,4),
         ylim=c(0,6),
         main = header[i],
         cex.main= 1.5,
         xlab = "log2(FoldChange)",
         ylab = "-log10(pvalue)")

    fc <- which(abs(log2(FoldChange[,i]))>1)
    p <- which(-log10(pvalues[,i])>-log10(0.01))
    fc_p <- intersect(fc,p)

    abline(h=-log10(0.01), col="red", lty=2)
    abline(v=1, col="orange", lty=2)
    abline(v=-1, col="orange", lty=2)

    points(log2(FoldChange[fc,i]),    # orange = high fold change
           -log10(pvalues[fc,i]),
           col="orange", pch=20)

    points(log2(FoldChange[p,i]),     # red = low p-value
           -log10(pvalues[p,i]),
           col="red", pch=20)

    points(log2(FoldChange[fc_p,i]),  # green = high fold change and low p-value
           -log10(pvalues[fc_p,i]),
           col="darkgreen", pch=20)
}
title("Volcano plots", outer=TRUE)

Export result as .csv


In [ ]:
write.csv(stat_features,"../results/stat_result.csv", row.names = F)