MLSeq walkthrough



In [2]:
source("https://bioconductor.org/biocLite.R")


Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help

In [3]:
biocLite("MLSeq")


BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.1 (BiocInstaller 1.18.4), R version 3.2.2.
Installing package(s) ‘MLSeq’
also installing the dependencies ‘pbkrtest’, ‘quantreg’, ‘lme4’, ‘brglm’, ‘GenomeInfoDb’, ‘XVector’, ‘futile.logger’, ‘AnnotationDbi’, ‘annotate’, ‘gridExtra’, ‘ggplot2’, ‘car’, ‘reshape2’, ‘foreach’, ‘BradleyTerry2’, ‘S4Vectors’, ‘IRanges’, ‘GenomicRanges’, ‘BiocParallel’, ‘genefilter’, ‘geneplotter’, ‘Hmisc’, ‘caret’, ‘DESeq2’, ‘Biobase’, ‘edgeR’

The downloaded source packages are in
	‘/tmp/RtmpWM8WqY/downloaded_packages’
Old packages: 'curl', 'devtools', 'evaluate', 'git2r', 'jsonlite', 'mime',
  'R6', 'Rcpp', 'xml2', 'XML', 'spatial'

previous command took 5 minutes to complete


In [5]:
citation("MLSeq")


Out[5]:
To cite package ‘MLSeq’ in publications use:

  Gokmen Zararsiz, Dincer Goksuluk, Selcuk Korkmaz, Vahap Eldem, Izzet
  Parug Duru, Turgay Unver and Ahmet Ozturk (2015). MLSeq: Machine
  learning interface for RNA-Seq data. R package version 1.6.0.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {MLSeq: Machine learning interface for RNA-Seq data},
    author = {Gokmen Zararsiz and Dincer Goksuluk and Selcuk Korkmaz and Vahap Eldem and Izzet Parug Duru and Turgay Unver and Ahmet Ozturk},
    year = {2015},
    note = {R package version 1.6.0},
  }

ATTENTION: This citation information has been auto-generated from the
package DESCRIPTION file and may need manual editing, see
‘help("citation")’.

In [4]:
library(MLSeq)


Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2
Loading required package: DESeq2
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, sort, table, tapply, union, unique,
    unlist, unsplit

Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: Rcpp
Loading required package: RcppArmadillo
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: limma

Attaching package: ‘limma’

The following object is masked from ‘package:DESeq2’:

    plotMA

The following object is masked from ‘package:BiocGenerics’:

    plotMA

Loading required package: randomForest
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:Biobase’:

    combine

The following object is masked from ‘package:BiocGenerics’:

    combine

Loading required package: edgeR

In [6]:
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
filepath


Out[6]:
'/home/daniel/R/x86_64-pc-linux-gnu-library/3.2/MLSeq/extdata/cervical.txt'

In [7]:
cervical = read.table(filepath, header = TRUE)

In [8]:
head(cervical[, 1:5])


Out[8]:
N1N2N3N4N5
let-7a865810550566921456
let-7a*31230736
let-7b97527904912242861759
let-7b*15182711911
let-7c828125129736413713
let-7c*00010

In [9]:
class(cervical)


Out[9]:
'data.frame'

In [10]:
dim(cervical)


Out[10]:
  1. 714
  2. 58

First 29 columns of the data contain the miRNA mapped counts of non-tumor samples, while the last 29 comlumns contain the count information of tumor samples. We need to create a class label in order to apply classification models.


In [11]:
class = data.frame(condition = factor(rep(c("N", "T"), c(29, 29))))
as.factor(class[, 1])


Out[11]:
  1. N
  2. N
  3. N
  4. N
  5. N
  6. N
  7. N
  8. N
  9. N
  10. N
  11. N
  12. N
  13. N
  14. N
  15. N
  16. N
  17. N
  18. N
  19. N
  20. N
  21. N
  22. N
  23. N
  24. N
  25. N
  26. N
  27. N
  28. N
  29. N
  30. T
  31. T
  32. T
  33. T
  34. T
  35. T
  36. T
  37. T
  38. T
  39. T
  40. T
  41. T
  42. T
  43. T
  44. T
  45. T
  46. T
  47. T
  48. T
  49. T
  50. T
  51. T
  52. T
  53. T
  54. T
  55. T
  56. T
  57. T
  58. T

For simplicity, we can work with a subset of cervical data using the first 150 features.


In [12]:
data = cervical[c(1:150), ]

Now we can split the data into training and test sets. The thraining set can be used to build classification models, and the test set can be used to assess the performance of each model. We use the set.seed function to specify initiala values of a random-number seed and use the sample function for selection.


In [13]:
nTest = ceiling(ncol(data) * 0.2)
set.seed(12345)
ind = sample(ncol(data), nTest, FALSE)

Now, training and test sets can be created based on this sampling process.


In [18]:
data.train = data[, -ind]
data.train = as.matrix(data.train + 1)
classtr = data.frame(condition = class[-ind, ])

data.test = data[, ind]
data.test = as.matrix(data.test + 1)
classts = data.frame(condition = class[ind, ])

Now we have 46 samples which will be used to train the classification models and have 12 remaining samples to test model performance.


In [19]:
dim(data.train)


Out[19]:
  1. 150
  2. 46

In [20]:
dim(data.test)


Out[20]:
  1. 150
  2. 12

In [21]:
data.trainS4 = DESeqDataSetFromMatrix(countData = data.train, colData = classtr, formula(~condition))
data.trainS4 = DESeq(data.trainS4, fitType = "local")
data.trainS4


converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 15 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Out[21]:
class: DESeqDataSet 
dim: 150 46 
exptData(0):
assays(5): counts mu cooks replaceCounts replaceCooks
rownames(150): let-7a let-7a* ... miR-18a* miR-18b
rowRanges metadata column names(28): baseMean baseVar ... maxCooks
  replace
colnames(46): 1 2 ... 45 46
colData names(3): condition sizeFactor replaceable

In [22]:
data.testS4 = DESeqDataSetFromMatrix(countData = data.test, colData = classts, formula(~condition))
data.testS4 = DESeq(data.testS4, fitType = "local")
data.testS4


converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Out[22]:
class: DESeqDataSet 
dim: 150 12 
exptData(0):
assays(3): counts mu cooks
rownames(150): let-7a let-7a* ... miR-18a* miR-18b
rowRanges metadata column names(27): baseMean baseVar ... deviance
  maxCooks
colnames(12): 1 2 ... 11 12
colData names(2): condition sizeFactor

In [23]:
svm = classify(data = data.trainS4, method = "svm", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")

svm


found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
1 package is needed for this model and is not installed. (kernlab). Would you like to try to install it now?
Error in checkInstall(models$library): 
Error in eval(expr, envir, enclos): object 'svm' not found

In [25]:
install.packages("kernlab", repos='http://archive.linux.duke.edu/cran/')


Installing package into ‘/home/daniel/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
The downloaded source packages are in
	‘/tmp/RtmpWM8WqY/downloaded_packages’

In [27]:
install.packages("e1071", repos='http://archive.linux.duke.edu/cran/')


Installing package into ‘/home/daniel/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
The downloaded source packages are in
	‘/tmp/RtmpWM8WqY/downloaded_packages’

In [28]:
svm = classify(data = data.trainS4, method = "svm", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")

svm


found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Out[28]:
  An object of class  MLSeq 

            Method  :  svm 

       Accuracy(%)  :  93.48 
    Sensitivity(%)  :  100 
    Specificity(%)  :  86.96 

  Reference Class   :  T 

After running the code given above, we obtain the results in MLSeq class. SVM successfully fits a model with 97.8% true classifcation accuracy by misclassifying only one non-tumor sample.

"svm" object also stores information about model training and the parameters used to build this model.


In [29]:
getSlots("MLSeq")


Out[29]:
method
'character'
deseqTransform
'character'
normalization
'character'
confusionMat
'confusionMatrix'
trained
'train'
ref
'character'

In [30]:
trained(svm)


Out[30]:
Support Vector Machines with Radial Basis Function Kernel 

 46 samples
150 predictors
  2 classes: 'T', 'N' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 37, 37, 37, 37, 36, 37, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa      Accuracy SD  Kappa SD 
  0.25  0.8748148  0.7534359  0.11792344   0.2261951
  0.50  0.9044444  0.8114907  0.10164433   0.1945290
  1.00  0.8985185  0.8002087  0.09811869   0.1872383

Tuning parameter 'sigma' was held constant at a value of 0.003875898
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.003875898 and C = 0.5. 

Now let's build a Random Forest model


In [32]:
rf <- classify(data = data.trainS4, method = "randomforest", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt= 3, ref = "T")

rf


found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Out[32]:
  An object of class  MLSeq 

            Method  :  randomforest 

       Accuracy(%)  :  100 
    Sensitivity(%)  :  100 
    Specificity(%)  :  100 

  Reference Class   :  T 

Prediction of new samples from the test set


In [33]:
pred.svm = predictClassify(svm, data.testS4)

pred.svm


found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Out[33]:
  1. T
  2. T
  3. T
  4. T
  5. T
  6. T
  7. N
  8. N
  9. T
  10. T
  11. N
  12. N

In [34]:
pred.rf = predictClassify(rf, data.testS4)

pred.rf


found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Out[34]:
  1. T
  2. T
  3. T
  4. T
  5. N
  6. T
  7. N
  8. N
  9. T
  10. T
  11. N
  12. N

In [35]:
table(pred.svm, relevel(data.testS4$condition, 2))

table(pred.rf, relevel(data.testS4$condition, 2))


Out[35]:
        
pred.svm T N
       T 6 2
       N 0 4
Out[35]:
       
pred.rf T N
      T 6 1
      N 0 5

In [36]:
sessionInfo()


Out[36]:
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] kernlab_0.9-22            MLSeq_1.6.0              
 [3] edgeR_3.10.2              randomForest_4.6-10      
 [5] limma_3.24.15             Biobase_2.28.0           
 [7] DESeq2_1.8.1              RcppArmadillo_0.5.500.2.0
 [9] Rcpp_0.12.0               GenomicRanges_1.20.6     
[11] GenomeInfoDb_1.4.2        IRanges_2.2.7            
[13] S4Vectors_0.6.5           BiocGenerics_0.14.0      
[15] caret_6.0-52              ggplot2_1.0.1            
[17] lattice_0.20-33           BiocInstaller_1.18.4     

loaded via a namespace (and not attached):
 [1] jsonlite_0.9.16      splines_3.2.2        foreach_1.4.2       
 [4] gtools_3.5.0         Formula_1.2-1        latticeExtra_0.6-26 
 [7] RSQLite_1.0.0        quantreg_5.19        uuid_0.1-2          
[10] digest_0.6.8         RColorBrewer_1.1-2   XVector_0.8.0       
[13] minqa_1.2.4          colorspace_1.2-6     rzmq_0.7.7          
[16] Matrix_1.2-2         plyr_1.8.3           XML_3.98-1.1        
[19] BradleyTerry2_1.0-6  SparseM_1.7          genefilter_1.50.0   
[22] xtable_1.7-4         scales_0.3.0         brglm_0.5-9         
[25] BiocParallel_1.2.21  lme4_1.1-9           MatrixModels_0.4-1  
[28] annotate_1.46.1      mgcv_1.8-7           car_2.1-0           
[31] repr_0.3             nnet_7.3-11          pbkrtest_0.4-2      
[34] proto_0.3-10         survival_2.38-3      magrittr_1.5        
[37] evaluate_0.7         nlme_3.1-122         MASS_7.3-44         
[40] class_7.3-14         foreign_0.8-66       tools_3.2.2         
[43] stringr_1.0.0        munsell_0.4.2        locfit_1.5-9.1      
[46] cluster_2.0.3        AnnotationDbi_1.30.1 lambda.r_1.1.7      
[49] compiler_3.2.2       e1071_1.6-7          futile.logger_1.4.1 
[52] grid_3.2.2           nloptr_1.0.4         iterators_1.0.7     
[55] IRkernel_0.4         base64enc_0.1-3      gtable_0.1.2        
[58] codetools_0.2-14     DBI_0.3.1            reshape2_1.4.1      
[61] gridExtra_2.0.0      Hmisc_3.16-0         futile.options_1.0.0
[64] stringi_0.5-5        IRdisplay_0.3        geneplotter_1.46.0  
[67] rpart_4.1-10         acepack_1.3-3.3     

In [37]:
confusionMatrix(table(pred.svm, relevel(data.testS4$condition, 2)))


Out[37]:
Confusion Matrix and Statistics

        
pred.svm T N
       T 6 2
       N 0 4
                                          
               Accuracy : 0.8333          
                 95% CI : (0.5159, 0.9791)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 0.01929         
                                          
                  Kappa : 0.6667          
 Mcnemar's Test P-Value : 0.47950         
                                          
            Sensitivity : 1.0000          
            Specificity : 0.6667          
         Pos Pred Value : 0.7500          
         Neg Pred Value : 1.0000          
             Prevalence : 0.5000          
         Detection Rate : 0.5000          
   Detection Prevalence : 0.6667          
      Balanced Accuracy : 0.8333          
                                          
       'Positive' Class : T