In [ ]:
knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(warning = FALSE)

RML Notes

The MicrosoftRML (or RML for short) package is a state-of-the-art package of machine learning algorithms developed by Microsoft's Algorithms Development team and Microsoft Research. It provides a suite of transformers and learners that make it easy to analyze high-dimensional datasets, such as those arising from text.

Installation Instructions

  • If you have corpnet access, review the installation instructions here.

Using RML

The MicrosoftRML package provides new, highly performant implementations of machine learning algorithms for classification, regression, and anomaly detection that are especially well-equipped for handling large datasets. In addition to these fast learning algorithms (called learners), the RML package also provides transformers for feature engineering. We outline the various learners and transformers in the following sections.

Transformers

The transformers in the RML package are labelled with the prefix mt. These can be used inside any of the mxTransforms calls of the learners we describe in the following section.

We outline most of the transformers in the table below:

transformer     Use                                                             Additional Parameters
mtText          bag of counts of n-grams                                        ngramLength
mtCat           create a separate variable for each distinct string value       maxNumTerms
mtCatHash       same as mtCat but with hashing                                  hashBits
mtWordBag       bag of counts of n-grams                                        ngramLength
mtWordHashBag   same as mtWordBag but with hashing                              hashBits
mtConcat        concatenation of multiple text columns into a single vector     none

The hash equivalents of the text transforms map each term to a column index with a hash function rather than building a dictionary of the observed terms. Hashing is typically faster because it does not require an initial pass over the data to determine the dictionary, and it avoids the memory problems that mtCat can run into when the dictionary becomes huge. However, caution must be taken in specifying the number of hashBits: if too small, collisions may occur; if too large, you may end up with lots of redundant features.
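To make the hashBits trade-off concrete, here is a minimal base R sketch. The helper hash_token() is a hypothetical toy polynomial hash written for illustration; it is not the hash function that mtCatHash or mtWordHashBag use internally. It only shows that a small number of bits forces distinct tokens into the same bucket.

```r
# Toy polynomial hash: maps a token to a bucket in [0, 2^hash_bits - 1].
# hash_token() is a hypothetical helper, NOT the hash used by the RML transformers.
hash_token <- function(token, hash_bits) {
  n_buckets <- 2^hash_bits
  h <- 0
  for (code in utf8ToInt(token)) {
    # reduce at every step so the running value stays small and exact
    h <- (h * 31 + code) %% n_buckets
  }
  h
}

tokens <- c("great", "terrible", "boring", "funny", "plot", "acting")

buckets_small <- vapply(tokens, hash_token, numeric(1), hash_bits = 2)
buckets_large <- vapply(tokens, hash_token, numeric(1), hash_bits = 16)

# With 2 bits there are only 4 buckets, so 6 tokens must share some of them.
length(unique(buckets_small))
length(unique(buckets_large))
```

Comparing the two counts shows how many distinct feature columns survive at each setting; any shortfall relative to the number of tokens is due to collisions.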

Learners

In addition to the fast feature engineering functions listed in the table above, RML adds a number of new learning algorithms for regression, classification and anomaly detection. The algorithms we'll take a look at today are listed in the table below, along with some of their important parameters:

learner         Use                                                                     Additional Parameters
mxFastForest    fast random forest                                                      nTree
mxFastTree      fast decision tree                                                      numBins
mxLogisticReg   elastic-net logistic regression                                         l1Weight, l2Weight
mxFastLinear    SDCA linear binary classifier and regression                            l1Weight, l2Weight
mxNeuralNet     classification and regression neural networks, with GPU acceleration    acceleration, numHiddenNodes, optimizer
mxOneClassSvm   one-class support vector machine (anomaly detection)                    kernel

Getting Started with RML


In [ ]:
packageVersion("RevoScaleR")
packageVersion("MicrosoftML")

If you are missing either of the above packages, please go back and refer to the installation instructions.

Natural Language Processing with RML

Let's take a look at using RML to estimate a model that would be very hard to fit with RevoScaleR.

In particular, there is virtually no functionality in RevoScaleR for handling large text data. We will use RML to transform text data into useful features that we can use in a logistic regression learner. In order to deal with the high cardinality of text data, we will use the penalized regression models in RML.
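As a rough illustration of what "penalized" means here, the sketch below computes an elastic-net style objective in base R: the average logistic loss plus an L1 and an L2 penalty on the weights. The function penalized_logloss() and the toy data are hypothetical, and the exact scaling that RML applies to l1Weight and l2Weight is an assumption; this is not the package's implementation.

```r
# Average logistic loss plus L1 and L2 penalties on the weight vector.
# penalized_logloss() is a hypothetical helper; the penalty scaling is assumed.
penalized_logloss <- function(w, X, y, l1_weight, l2_weight) {
  p <- 1 / (1 + exp(-drop(X %*% w)))                    # predicted probabilities
  log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))  # data-fit term
  log_loss + l1_weight * sum(abs(w)) + l2_weight * sum(w^2)
}

# Tiny made-up design matrix (two features) and binary labels.
X <- cbind(c(1, 0, 1, 0), c(0, 1, 1, 0))
y <- c(1, 0, 1, 0)

loss_zero   <- penalized_logloss(c(0, 0),  X, y, l1_weight = 0.05, l2_weight = 0.05)
loss_w      <- penalized_logloss(c(2, -2), X, y, l1_weight = 0.05, l2_weight = 0.05)
loss_w_free <- penalized_logloss(c(2, -2), X, y, l1_weight = 0,    l2_weight = 0)
```

The difference between loss_w and loss_w_free is the price paid for large weights. With thousands of sparse text features, that price is what keeps most coefficients small or exactly zero.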

IMDB Data

For this example, we will analyze IMDB movie reviews and the sentiment associated with each review. The data are available here.

I've also saved the data on a public-facing Azure Blob Container here.

The data are saved as separate text files per review, and are separated into train and test sets, and further by positive and negative sentiments:

Data Hierarchy

  • train
    • pos
    • neg
  • test
    • pos
    • neg

Let's use the readLines function in R to convert these datasets into R data.frames.


In [ ]:
# load imdb data ---

cwd <- getwd()

options(stringsAsFactors = FALSE)

imdb_dir <- "/datadrive/aclImdb/"

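read_reviews <- function(path, sentiment) {

  # collapse each file into a single string so that one file yields one row,
  # even if a review happens to span several lines
  reviews <- vapply(path,
                    function(p) paste(readLines(p, warn = FALSE), collapse = " "),
                    character(1), USE.NAMES = FALSE)

  reviews_df <- data.frame(review = reviews, stringsAsFactors = FALSE)
  reviews_df$sentiment <- sentiment

  return(reviews_df)

}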

setwd(imdb_dir)

make_df <- function(path = "train") {

  # positive reviews are labelled 1, negative reviews 0
  pos_files <- list.files(paste(path, "pos", sep = "/"), full.names = TRUE)
  positive <- read_reviews(pos_files, 1)

  neg_files <- list.files(paste(path, "neg", sep = "/"), full.names = TRUE)
  negative <- read_reviews(neg_files, 0)

  # return the stacked data.frame explicitly
  return(rbind(positive, negative))

}


# training sets -----------------------------------------------------------

train_df <- make_df("train")



# test sets ---------------------------------------------------------------

test_df <- make_df("test")

setwd(cwd)
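
The loading logic can be tried without downloading the full dataset. The sketch below builds a throwaway directory that mimics the train/test and pos/neg layout described above, then reads it back into a data.frame. The directory name, the file name 0_1.txt, the review text, and the helper load_split() are all made up for illustration. It also uses file.path() instead of setwd(), which is one way to avoid changing the working directory.

```r
# Build a temporary directory with the same layout as the IMDB data.
demo_dir <- file.path(tempdir(), "imdb_demo")
for (split in c("train", "test")) {
  for (label in c("pos", "neg")) {
    dir.create(file.path(demo_dir, split, label),
               recursive = TRUE, showWarnings = FALSE)
    writeLines(paste("a", label, "review for", split),
               file.path(demo_dir, split, label, "0_1.txt"))
  }
}

# load_split() is a hypothetical helper: one row per file, 1 = pos, 0 = neg.
load_split <- function(root, split) {
  labels <- c(pos = 1, neg = 0)
  parts <- lapply(names(labels), function(label) {
    files <- list.files(file.path(root, split, label), full.names = TRUE)
    text <- vapply(files,
                   function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                   character(1), USE.NAMES = FALSE)
    data.frame(review = text,
               sentiment = rep(labels[[label]], length(text)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, parts)
}

demo_train <- load_split(demo_dir, "train")
str(demo_train)
```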

Applying Transformers to Create Text Features

Our compiled data.frame of IMDB reviews looks rather simple. It is a data.frame of two columns, one containing the raw review, and the second containing the binary sentiment variable: positive or negative.

By itself, the raw text isn't a very helpful feature variable for predicting the sentiment value. However, we can engineer a large number of feature variables from the text column.

As a first pass, we might treat the text as a collection of words and use each word as its own column. The columns will be the union of all the words that appear in any review, which yields a very high-dimensional feature matrix with large sparsity (i.e., any given review will only contain a small subset of all the words in the reviews "dictionary").
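A small base R sketch of this word-per-column idea is shown below, using three made-up reviews. It is only meant to illustrate the shape and sparsity of the resulting matrix; it says nothing about how the RML transformers tokenize text or store features internally.

```r
# Three made-up reviews, split into lower-case word tokens.
reviews <- c("a great movie",
             "a boring movie",
             "great acting and a great plot")
tokens <- strsplit(tolower(reviews), "[^a-z]+")

# The dictionary is the union of all words seen in any review.
vocab <- sort(unique(unlist(tokens)))

# One row per review, one column per dictionary word, entries are word counts.
bow <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(bow) <- vocab

bow
sparsity <- mean(bow == 0)  # share of entries that are zero
```

Even with three short reviews, a sizeable share of the entries is zero. With tens of thousands of reviews the dictionary grows much faster than the length of any single review, so the matrix becomes far sparser.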

Next, we can use the mxLogisticReg function in RML, whose arguments include the hyperparameter weights for each of the penalty terms. We will also use an mxTransforms call to add a list of featurizers/transformers for feature engineering. While this feature engineering step might require multiple iterations and cross-validation to pick the best choice, we will start with a text transformation that creates n-grams of length 3, i.e., contiguous sequences of three words that can then be used as predictors. This is a simple way to let interactions between neighboring words act as predictors of our sentiment response.
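To show what contiguous three-word n-grams look like, here is a minimal base R sketch. The helper make_ngrams() and the example sentence are hypothetical. How mtText tokenizes text, and whether it also keeps shorter n-grams when ngramLength = 3, is not covered here.

```r
# Return all contiguous n-word sequences from a piece of text.
# make_ngrams() is a hypothetical helper written for illustration only.
make_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]           # drop empty tokens
  if (length(words) < n) {
    return(character(0))                  # too short to form a single n-gram
  }
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- make_ngrams("This movie was not very good")
trigrams
```

Note how "not very good" becomes a single feature. That is the kind of local word interaction that individual word counts cannot capture.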


In [ ]:
library(MicrosoftML)
library(dplyr)

train_sample <- train_df %>% sample_n(1000, replace = FALSE)

system.time(logit_model <- logisticRegression(sentiment ~ reviewTran,
                              data = train_sample,
                              l1Weight = 0.05,
                              l2Weight = 0.05,
                              mlTransforms = list(featurizeText(vars = c(reviewTran = "review"),
                                                         language = "English",
                                                         stopwordsRemover = stopwordsDefault(),
                                                         keepPunctuations = FALSE)))
)



system.time(fast_linear <- mxFastLinear(sentiment ~ reviewTran,
                              data = train_sample,
                              l1Weight = 0.05,
                              l2Weight = 0.05,
                              mxTransforms = list(mtText(vars = c(reviewTran = "review"),
                                                         language = "English",
                                                         stopwordsRemover = maPredefinedStopwords(),
                                                         keepPunctuations = FALSE,
                                                         ngramLength = 3)))
)

Using the Pipeline API


In [ ]:
library(magrittr)

review_logit <- train_df %>%
    featurize(mtText(vars = c(reviewTran = "review"),
                        stopwordsRemover = maPredefinedStopwords(),
                        keepPunctuations = FALSE,
                        ngramLength = 3)) %>%
    train(formula = sentiment ~ reviewTran,
          lr = LogisticReg(l2Weight = 0.05, l1Weight = 0.05)) %>% run

Testing the Logit Model

To score our classifier on the test data, we will use the mxPredict function from the RML package.


In [ ]:
predictions <- mxPredict(logit_model, data = test_df, extraVarsToWrite = "sentiment")
roc_results <- rxRoc(actualVarName = "sentiment", predVarNames = "Probability.1", data = predictions)
roc_results$predVarName <- factor(roc_results$predVarName)
plot(roc_results)

The Pipeline API

Why not use pipes? The pipeline API is still a work in progress, and the example below is just to show some of its features. The API will support modifying pipelines and additional featurization modules.


In [ ]:
options(stringsAsFactors = TRUE)

predictions_pipeline <- logit_model %>%
  mxPredict(data = test_df, extraVarsToWrite = "sentiment") %>%
  rxRocCurve(actualVarName = "sentiment", predVarNames = "Probability.1", data = .)

Testing the SDCA Model


In [ ]:
# score the SDCA model (fast_linear) fitted earlier, not the logit model
predictions <- mxPredict(fast_linear, data = test_df, extraVarsToWrite = "sentiment")
roc_results <- rxRoc(actualVarName = "sentiment", predVarNames = "Probability.1", data = predictions)
roc_results$predVarName <- factor(roc_results$predVarName)
plot(roc_results)

Neural Networks

Let's try to estimate another binary classifier from this dataset, but with a Neural Network architecture rather than a logistic regression model.

In the following chunk, we call our neural network model, and set the optimizer to be a stochastic gradient descent optimizer with a learning rate of 0.2. Furthermore, we use the type argument to ensure we are learning a binary classifier. By default our network architecture will have 100 hidden nodes.


In [ ]:
nn_sentiment <- mxNeuralNet(sentiment ~ reviewTran,
                            data = train_df,
                            type = "binary",
                            mxTransforms = list(mtText(vars = c(reviewTran = "review"),
                                                       stopwordsRemover = maPredefinedStopwords(),
                                                       keepPunctuations = FALSE,
                                                       ngramLength = 3)),
                            optimizer = maOptimizerSgd(learningRate = 0.2))
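
To make the role of the learning rate concrete, the sketch below runs plain stochastic gradient descent on a one-feature logistic model in base R, using simulated data. It is a deliberately simplified stand-in: sgd_logistic() is a hypothetical helper, the data are made up, and this is not how maOptimizerSgd or mxNeuralNet are implemented.

```r
set.seed(42)

# Simulated data: one numeric feature and a noisy binary label.
n <- 200
x <- rnorm(n)
y <- as.numeric(x + rnorm(n, sd = 0.5) > 0)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Plain SGD: visit observations in random order and take a step of size
# learning_rate against the gradient of the log-loss for that observation.
sgd_logistic <- function(x, y, learning_rate = 0.2, epochs = 20) {
  w <- 0
  b <- 0
  for (epoch in seq_len(epochs)) {
    for (i in sample(length(y))) {
      err <- sigmoid(w * x[i] + b) - y[i]  # gradient w.r.t. the linear score
      w <- w - learning_rate * err * x[i]
      b <- b - learning_rate * err
    }
  }
  c(weight = w, bias = b)
}

fit <- sgd_logistic(x, y)
accuracy <- mean((sigmoid(fit[["weight"]] * x + fit[["bias"]]) > 0.5) == y)
```

A larger learning rate moves the weights further on every observation, which speeds up early progress but makes the final estimates noisier. A smaller one does the opposite.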

Scoring the Neural Net

We can similarly score the test data with the neural network model:


In [ ]:
predictions <- mxPredict(nn_sentiment, data = test_df, extraVarsToWrite = "sentiment")
roc_results <- rxRoc(actualVarName = "sentiment", predVarNames = "Probability.1", data = predictions)
roc_results$predVarName <- factor(roc_results$predVarName)
plot(roc_results)
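
The ROC curves plotted above are often summarised by the area under the curve (AUC). As a sketch of what that number means, the base R snippet below computes the AUC from ranks, i.e. the share of positive/negative pairs in which the positive example gets the higher score. The helper auc() and the labels and scores are made up for illustration. It does not use the RML or RevoScaleR scoring functions.

```r
# AUC via the rank-sum (Mann-Whitney) formula; ties receive average ranks.
# auc() is a hypothetical helper written for illustration only.
auc <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and predicted probabilities.
labels <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)

auc_value <- auc(labels, scores)
auc_value
```

An AUC of 0.5 corresponds to random ranking and 1 to a perfect separation of positive and negative reviews.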