In [ ]:
knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(warning = FALSE)
The MicrosoftML package is a state-of-the-art collection of machine learning algorithms developed by Microsoft's Algorithms Development team and Microsoft Research. It provides a suite of transformers and learners that make it easy to analyze high-dimensional datasets, such as those arising from text data.
The MicrosoftML package provides new, highly performant implementations of machine learning algorithms for classification, regression, and anomaly detection that are especially well suited to large datasets. In addition to these fast learning algorithms (called learners), the package also provides transformers for feature engineering. We outline the various learners and transformers in the following sections.
The transformers in the MicrosoftML package are labelled with the prefix mt. They can be used inside the mxTransforms argument of any of the learners we describe in the next section.
We outline most of the transformers in the table below:
| transformer | Use | Additional Parameters |
|---|---|---|
| `mtText` | text featurization: bag of counts of n-grams, with tokenization and stopword removal | `ngramLength` |
| `mtCat` | indicator variables for each distinct value of a string (categorical) variable | `maxNumTerms` |
| `mtCatHash` | same as `mtCat`, but with hashing | `hashBits` |
| `mtWordBag` | bag of counts of n-grams | `ngramLength` |
| `mtWordHashBag` | same as `mtWordBag`, but with hashing | `hashBits` |
| `mtConcat` | concatenation of multiple text columns into a single vector | none |
The hash variants of these transforms build their dictionaries by hashing rather than by counting. Hashing is typically more performant because it does not require an initial pass over the data to determine the dictionary, and it avoids the memory problems that mtCat can hit when the dictionary grows very large. However, take care when specifying hashBits: with too few bits, distinct terms collide into the same feature; with too many, you end up with a very large feature space containing many empty or redundant features.
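To make the distinction concrete, here is a minimal, hypothetical sketch of a dictionary-based versus a hashed categorical transform inside a learner call. The data frame movie_df and its genre column are placeholders (not part of the dataset used later), and the calls simply follow the vars = c(newName = "oldName") and mxTransforms conventions used in the rest of this notebook.
In [ ]:
# Hypothetical sketch: dictionary-based vs. hashed categorical features.
# movie_df and its genre column are illustrative placeholders.
cat_model <- mxLogisticReg(sentiment ~ genreFeat,
                           data = movie_df,
                           mxTransforms = list(mtCat(vars = c(genreFeat = "genre"),
                                                     maxNumTerms = 10000)))

hash_model <- mxLogisticReg(sentiment ~ genreFeat,
                            data = movie_df,
                            mxTransforms = list(mtCatHash(vars = c(genreFeat = "genre"),
                                                          hashBits = 16)))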
In addition to the fast feature engineering functions listed in the table above, MicrosoftML adds a number of new learning algorithms for regression, classification, and anomaly detection. The algorithms we'll look at today are listed in the table below, along with some of their important parameters:
| learner | Use | Additional Parameters |
|---|---|---|
| `mxFastForest` | fast random forest | `nTree` |
| `mxFastTree` | fast decision trees | `numBins` |
| `mxLogisticReg` | elastic-net logistic regression | `l1Weight`, `l2Weight` |
| `mxFastLinear` | SDCA linear binary classifier and regression | `l1Weight`, `l2Weight` |
| `mxNeuralNet` | classification and regression neural networks, with GPU acceleration | `acceleration`, `numHiddenNodes`, `optimizer` |
| `mxOneClassSvm` | one-class support vector machine for anomaly detection | `kernel` |
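As a quick illustration of how these learners are called, here is a hedged sketch of fitting a fast random forest. The data frame my_df and the columns y, x1, and x2 are placeholders, and the nTree value is arbitrary; only the formula/data pattern, which matches the other learner calls in this notebook, is the point.
In [ ]:
# Hypothetical sketch: a fast random forest with 100 trees.
# my_df, y, x1, and x2 are placeholders, not real data from this notebook.
rf_model <- mxFastForest(y ~ x1 + x2,
                         data = my_df,
                         nTree = 100)
# Predictions would then come from mxPredict(rf_model, data = new_df).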
In [ ]:
packageVersion("RevoScaleR")
packageVersion("MicrosoftML")
If you are missing either of the above packages, please go back and refer to the installation instructions.
Let's take a look at using MicrosoftML to estimate a model that would be very hard to fit with RevoScaleR.
In particular, RevoScaleR has virtually no functionality for handling large text data. We will use MicrosoftML to transform text data into useful features that we can feed to a logistic regression learner, and to deal with the high cardinality of text data we will use its penalized (elastic-net) regression models.
For this example, we will analyze IMDB movies reviews and the sentiment associated with the review. The data are available here.
I've also saved the data in a public-facing Azure blob container here.
The data are saved as separate text files per review, and are separated into train and test sets, and further by positive and negative sentiments:
Data Hierarchy
Let's use the readLines function in R to convert these datasets into R data.frames.
In [ ]:
# load imdb data ---
cwd <- getwd()
options(stringsAsFactors = FALSE)
imdb_dir <- "/datadrive/aclImdb/"

# Read each review file into a single-column data.frame and attach its label.
read_reviews <- function(path, sentiment) {
    reviews <- lapply(path, readLines)
    reviews <- as.vector(unlist(reviews))
    reviews_df <- as.data.frame(matrix(reviews, ncol = 1))
    reviews_df$sentiment <- sentiment
    names(reviews_df)[1] <- "review"
    return(reviews_df)
}

setwd(imdb_dir)

# Combine the positive (sentiment = 1) and negative (sentiment = 0) reviews
# under a given directory ("train" or "test") into one data.frame.
make_df <- function(path = "train") {
    pos_files <- list.files(paste(path, "pos", sep = "/"), full.names = TRUE)
    positive <- read_reviews(pos_files, 1)
    neg_files <- list.files(paste(path, "neg", sep = "/"), full.names = TRUE)
    negative <- read_reviews(neg_files, 0)
    return(rbind(positive, negative))
}

# training sets -----------------------------------------------------------
train_df <- make_df("train")
# test sets ---------------------------------------------------------------
test_df <- make_df("test")
setwd(cwd)
Our compiled data.frame of IMDB reviews looks rather simple. It has two columns: one containing the raw review text, and the second containing the binary sentiment variable (positive or negative).
By itself, the raw text column isn't a very useful feature for predicting sentiment. However, we can engineer a large number of feature variables from it.
As a first pass, we might treat each review as a collection of words and use each word individually as its own column. The columns form the union of all the words that appear in any review, so this yields a very high-cardinality, high-dimensional feature matrix with large sparsity (any given review contains only a small subset of the full review "dictionary"); a toy sketch of this representation is shown below.
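The following toy example (plain base R, not part of the MicrosoftML pipeline, using made-up mini-reviews) illustrates what this bag-of-words matrix looks like: the vocabulary is the union of all words, and most entries are zero.
In [ ]:
# Toy illustration: a bag-of-words matrix for two made-up mini-reviews.
toy_reviews <- c("great movie great cast", "terrible movie")
tokens <- strsplit(tolower(toy_reviews), "\\s+")
vocab <- sort(unique(unlist(tokens)))           # union of all words across reviews
bow <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))
bow
#      cast great movie terrible
# [1,]    1     2     1        0
# [2,]    0     0     1        1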
Next, we can use MicrosoftML's logistic regression learner (mxLogisticReg in the table above, logisticRegression in the code below), which has arguments for the hyperparameter weights of each penalty term (l1Weight and l2Weight). Moreover, we pass a list of featurizers/transformers through its transforms argument (mlTransforms/mxTransforms) to do the feature engineering. While this step might take several iterations and cross-validation to pick the best configuration, we will start with a text transformation that creates n-grams of length 3: contiguous runs of three words that can then be used as predictors. This is a simple way of capturing interactions between words as predictors of the sentiment response; the small sketch below shows what such trigrams look like.
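Here is a small base-R sketch (again, not part of the MicrosoftML pipeline, using a made-up sentence) of the word trigrams such a transform extracts. The MicrosoftML transforms in the next cell build these n-gram features automatically inside the learner call.
In [ ]:
# Toy sketch: word trigrams (contiguous runs of three words) from one made-up sentence.
sentence <- "this movie was surprisingly good fun"
words <- strsplit(tolower(sentence), "\\s+")[[1]]
trigrams <- vapply(seq_len(length(words) - 2),
                   function(i) paste(words[i:(i + 2)], collapse = " "),
                   character(1))
trigrams
# "this movie was" "movie was surprisingly" "was surprisingly good" "surprisingly good fun"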
In [ ]:
library(MicrosoftML)
library(dplyr)
train_sample <- train_df %>% sample_n(1000, replace = FALSE)
# Elastic-net logistic regression: text is featurized on the fly via the transforms list.
system.time(
    logit_model <- logisticRegression(sentiment ~ reviewTran,
                                      data = train_sample,
                                      l1Weight = 0.05,
                                      l2Weight = 0.05,
                                      mlTransforms = list(featurizeText(vars = c(reviewTran = "review"),
                                                                        language = "English",
                                                                        stopwordsRemover = stopwordsDefault(),
                                                                        keepPunctuations = FALSE)))
)

# A second model with the SDCA fast linear learner, adding trigram features via ngramLength = 3.
system.time(
    fast_linear <- mxFastLinear(sentiment ~ reviewTran,
                                data = train_sample,
                                l1Weight = 0.05,
                                l2Weight = 0.05,
                                mxTransforms = list(mtText(vars = c(reviewTran = "review"),
                                                           language = "English",
                                                           stopwordsRemover = maPredefinedStopwords(),
                                                           keepPunctuations = FALSE,
                                                           ngramLength = 3)))
)
In [ ]:
library(magrittr)
review_logit <- train_df %>%
    featurize(mtText(vars = c(reviewTran = "review"),
                     stopwordsRemover = maPredefinedStopwords(),
                     keepPunctuations = FALSE,
                     ngramLength = 3)) %>%
    train(formula = sentiment ~ reviewTran,
          lr = LogisticReg(l2Weight = 0.05, l1Weight = 0.05)) %>% run
In [ ]:
# Score the test set, keeping the true sentiment alongside the predicted probabilities.
predictions <- mxPredict(logit_model, data = test_df, extraVarsToWrite = "sentiment")
# Compute and plot the ROC curve for the predicted probability of the positive class.
roc_results <- rxRoc(actualVarName = "sentiment", predVarNames = "Probability.1", data = predictions)
roc_results$predVarName <- factor(roc_results$predVarName)
plot(roc_results)
In [ ]:
options(stringsAsFactors = TRUE)
predictions_pipeline <- logit_model %>%
mxPredict(data = test_df, extraVarsToWrite = "sentiment") %>%
rxRocCurve(actualVarName = "sentiment", predVarNames = "Probability.1", data = .)
Let's estimate another binary classifier from this dataset, this time with a neural network rather than a logistic regression model.
In the following chunk, we call our neural network learner and set the optimizer to stochastic gradient descent with a learning rate of 0.2. We also use the type argument to make sure we are training a binary classifier. By default, the network has 100 hidden nodes.
In [ ]:
# Binary neural network classifier on trigram text features.
# numHiddenNodes is left at its default of 100 hidden nodes.
nn_sentiment <- mxNeuralNet(sentiment ~ reviewTran,
                            data = train_df,
                            type = "binary",
                            mxTransforms = list(mtText(vars = c(reviewTran = "review"),
                                                       stopwordsRemover = maPredefinedStopwords(),
                                                       keepPunctuations = FALSE,
                                                       ngramLength = 3)),
                            optimizer = maOptimizerSgd(learningRate = 0.2))
In [ ]:
predictions <- mxPredict(nn_sentiment, data = test_df, extraVarsToWrite = "sentiment")
roc_results <- rxRoc(actualVarName = "sentiment", predVarNames = "Probability.1", data = predictions)
roc_results$predVarName <- factor(roc_results$predVarName)
plot(roc_results)