Predicting breast cancer using data from AzureML

This notebook demonstrates a simple machine learning process to predict breast cancer incidence. The data resides in Azure Machine Learning Studio; this example downloads the data to the notebook and then fits a simple logistic regression model.

The AzureML package allows you to import datasets from AzureML Studio into a local R session or a notebook.

About the data

The breast cancer data is one of three cancer-related datasets provided by the Oncology Institute that appear frequently in machine learning literature. It combines diagnostic information with features from laboratory analysis of about 300 tissue samples.

Usage: Classify the type of cancer, based on 9 attributes, some of which are linear and some categorical.

Related Research: Wolberg, W.H., Street, W.N., & Mangasarian, O.L. (1995). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Importing data from AzureML

By default, the AzureML package is installed on the Jupyter server.

You use the workspace() function to configure a connection to your AzureML Studio workspace.

Note that the Jupyter workspace in AzureML already contains a file at ~/.azureml/settings.json that holds your workspace credentials. Because the credentials live in that file rather than in the notebook, you won't reveal them when sharing the notebook.

Thus, when you use workspace() with a Jupyter notebook, you don't have to provide your credentials.

Use the function download.datasets() to download the data from AzureML Studio to your Jupyter notebook.



In [ ]:

    
library("AzureML")
ws <- workspace()
dat <- download.datasets(ws, "Breast cancer data")

Once the data is downloaded to your Jupyter session, you can use any R function to inspect or manipulate the data.

For example, inspect the structure of your dataset using str():



In [ ]:

    
str(dat)

The IRKernel displays data frames in a nice tabular format:



In [ ]:

    
head(dat)

Plotting the correlation matrix

Notebooks allow you to plot data, and the plot is displayed directly in the output.

Note that you can install packages from MRAN (the Microsoft CRAN mirror). This example installs the corrgram package, if it is not already available, and uses it to plot the correlation matrix.



In [ ]:

    
# Change plot size
options(jupyter.plot_mimetypes = 'image/png') 
options(repr.plot.width = 6, repr.plot.height = 6)

if(!require("corrgram", quietly = TRUE)) install.packages("corrgram")
library(corrgram, quietly = TRUE)
corrgram(dat, order = TRUE, 
         lower.panel = panel.ellipse,
         upper.panel = panel.shade, 
         text.panel = panel.txt,
         main = "Breast cancer data in PC2/PC1 Order",
         cex.labels = 0.7)
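
A correlogram is a picture of the pairwise correlation matrix. As a minimal sketch, the numbers behind such a plot can be computed with base R's cor(). The example below uses the built-in mtcars data as a stand-in (an assumption for illustration, because the breast cancer data is only available after download) and keeps only the numeric columns.

```r
# Stand-in data: built-in mtcars (assumption; use `dat` in the notebook instead)
df <- mtcars

# cor() needs numeric input, so keep only the numeric columns
num_cols <- vapply(df, is.numeric, logical(1))
cor_mat <- cor(df[, num_cols], use = "pairwise.complete.obs")

# Show the first few rows and columns, rounded for readability
print(round(cor_mat[1:4, 1:4], 2))
```

Values near 1 or -1 correspond to the narrow ellipses and strongly shaded cells in a correlogram, while values near 0 correspond to the round ellipses and pale cells.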

Creating a binary classifier model

The column Class in the breast cancer data indicates whether a person had breast cancer or not. Logistic regression is an algorithm that allows you to fit a binary classifier to data. A binary classifier predicts an outcome with two classes, for example TRUE or FALSE, or 1 or 0.
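
To make the idea concrete, here is a minimal sketch of how a logistic regression turns a linear combination of predictors into a class prediction. The intercept, slope, and predictor values are hypothetical numbers chosen for illustration, not estimates from the breast cancer data, and the 0.5 cut-off is a common convention rather than a requirement.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients for a one-predictor model (illustrative only)
intercept <- -4
slope <- 0.8

x <- c(1, 5, 9)                        # example predictor values
p <- logistic(intercept + slope * x)   # predicted probability of class 1
predicted_class <- as.integer(p > 0.5) # classify with a 0.5 threshold

print(data.frame(x = x, probability = round(p, 3), predicted_class))
```

Base R provides the same transformation as plogis(), and glm() with family = binomial estimates the intercept and slopes from data.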

In R, you can fit a logistic regression model using the glm() function with family = binomial.

But first, separate the data into a training and test sample.



In [ ]:

    
set.seed(1)
idx <- sample.int(nrow(dat), nrow(dat) * 0.8) # create an 80% sample index
train <- dat[idx, ]  # training set: the sampled 80%
test  <- dat[-idx, ] # test set: the remaining 20%

# fit the model on the training set only, so the test set stays unseen
model <- glm(Class ~ ., data = train, family = binomial)

Now inspect the model using summary().



In [ ]:

    
summary(model)
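
The coefficients reported by summary() are on the log-odds scale, which can be hard to read directly. As a sketch of one common way to interpret them, the example below fits a logistic regression to simulated stand-in data (an assumption, so that the code runs without the downloaded dataset) and exponentiates the coefficients to obtain odds ratios. The names sim, fit, coefs, and odds_ratios are illustrative.

```r
set.seed(42)

# Simulated stand-in data (assumption): one predictor and a binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = sim, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value
coefs <- coef(summary(fit))

# Exponentiate the log-odds coefficients to get odds ratios
odds_ratios <- exp(coef(fit))

print(round(coefs, 3))
print(round(odds_ratios, 3))
```

An odds ratio above 1 means that the odds of the positive class increase as the predictor increases; the same two calls can be applied to the model fitted in this notebook.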

Evaluating model accuracy

To evaluate the model, you can use the ROCR package to plot ROC (receiver operating characteristic) curves. A ROC curve plots the true positive rate against the false positive rate across classification thresholds, and is widely used in machine learning to visualize classifier performance. The higher the area under the curve, the better the model. This is why this type of plot is sometimes also called an AUC plot (Area Under the Curve).



In [ ]:

    
if(!require(ROCR, quietly = TRUE)) install.packages("ROCR")
library(ROCR, quietly = TRUE)

# First, create predictions using the holdout (test) set
predictions <- predict(model, test, type = "response")

# Using ROCR functions to produce a simple ROC plot:
pred <- prediction(predictions, test$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr") 
    
options(repr.plot.width = 5, repr.plot.height = 4)
plot(perf, col = rainbow(10), main = "Model performance")

The model performs very well using the test data.
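
To put a single number on a ROC curve, you can compute the area under it. The sketch below does this in base R using the rank-based (Mann-Whitney) formulation, applied to simulated labels and scores that stand in for test$Class and predictions (an assumption, so that the code runs on its own). The helper name auc_rank is hypothetical.

```r
set.seed(7)

# Simulated stand-in (assumption): true 0/1 labels and model scores
labels <- rbinom(300, size = 1, prob = 0.4)
scores <- plogis(rnorm(300, mean = ifelse(labels == 1, 1, -1)))

# AUC is the probability that a random positive is scored above a random
# negative: the Mann-Whitney U statistic divided by n_pos * n_neg
auc_rank <- function(scores, labels) {
  r <- rank(scores)            # average ranks, so ties count as one half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation. With ROCR, the same quantity should be available through performance(pred, measure = "auc"), with the value stored in the y.values slot of the result.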