Introduction to Julia for Statistics

Paul Stey, PhD

Brown Center for Biomedical Informatics

April 5, 2017

AWS Server 1 (odd-numbered birthday)

http://ec2-52-90-118-171.compute-1.amazonaws.com:8889

AWS Server 2 (even-numbered birthday)

http://ec2-184-73-77-58.compute-1.amazonaws.com:8888

Question 1

Using the data in chronic_kidney_disease.csv, determine whether or not the oldest patient in the sample has chronic kidney disease. Note that the class variable indicates a patient's CKD status.

As a hint, you will probably want to use the maximum() function and the find() function.

And you'll need to consider how to handle NA values; there is a function dropna() that will be useful.
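The pattern behind these hints can be sketched on a few made-up values rather than the CKD file itself. This sketch assumes Julia 1.x, where find() has been replaced by findall() and missing values are skipped with skipmissing() instead of dropna(). The age and class vectors below are invented for illustration.

```julia
# Toy data (hypothetical values, not taken from chronic_kidney_disease.csv)
age   = [48, missing, 90, 62]
class = ["ckd", "notckd", "ckd", "notckd"]

# Largest observed age, ignoring missing entries
oldest_age = maximum(skipmissing(age))

# Row indices whose age equals the maximum; isequal(missing, 90) is false,
# so missing entries never match
oldest_idx = findall(a -> isequal(a, oldest_age), age)

# CKD status of the oldest patient(s)
oldest_status = class[oldest_idx]
println(oldest_status)
```

Using findall() rather than a single index keeps the result correct if several patients share the maximum age.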

Question 2

Using the chronic_kidney_disease.csv dataset from above, use an appropriate statistical test to determine if patients with chronic kidney disease (CKD) have significantly higher blood urea than patients without CKD.

As a hint, you will likely want to use the HypothesisTests package.
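One possible shape for this analysis is a one-sided Welch t-test, sketched below on simulated numbers. The two vectors are made up and only stand in for the blood urea values of each group; the choice of UnequalVarianceTTest is an assumption, and a rank-based test such as MannWhitneyUTest may suit skewed data better.

```julia
using HypothesisTests
using Random

Random.seed!(2017)

# Simulated blood urea values (made up for illustration only)
urea_ckd    = 70 .+ 30 .* randn(60)
urea_notckd = 35 .+ 10 .* randn(40)

# Welch's t-test: does not assume equal variances in the two groups
ttest = UnequalVarianceTTest(urea_ckd, urea_notckd)

# One-sided p-value for H1: mean(urea_ckd) > mean(urea_notckd)
p = pvalue(ttest; tail = :right)
println("one-sided p-value: ", p)
```

The tail = :right keyword matches the directional question ("significantly higher"); the default is a two-sided test.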

Question 3

Using the chronic_kidney_disease.csv data, determine which of the following predictors (if any) are related to chronic kidney disease (CKD):

blood_urea,

hemoglobin,

red_blood_cell_count,

white_blood_cell_count.

Hints:

There are a few ways this could be done; let's use a single regression model of some kind.
Our outcome variable is class in the CKD data; it needs to be re-coded as 0/1.
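The two hints can be sketched together as a re-coding step followed by a logistic regression. This assumes Julia 1.x with recent DataFrames and GLM releases, and it assumes the class column holds the strings "ckd" and "notckd". The data frame below is simulated and only two of the four predictors are shown.

```julia
using DataFrames
using GLM
using Random

Random.seed!(1)
n = 200

# Simulated stand-in for the CKD data (values are made up)
df = DataFrame(
    class      = rand(["ckd", "notckd"], n),
    blood_urea = 50 .+ 20 .* randn(n),
    hemoglobin = 13 .+ 2 .* randn(n),
)

# Re-code the outcome: 1 = CKD, 0 = no CKD (stored as floats for the model)
df.ckd = [c == "ckd" ? 1.0 : 0.0 for c in df.class]

# Logistic regression with both predictors in a single model
model = glm(@formula(ckd ~ blood_urea + hemoglobin), df, Binomial(), LogitLink())
println(coeftable(model))
```

The coefficient table reports an estimate, standard error, and p-value for each predictor, which is what the question asks you to interpret.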

Question 4

Using the stagec data from above, fit several random forest models predicting the G2 score. Experiment with different numbers of trees and different numbers of variables sampled as split candidates (i.e., m_try, the third argument to the build_forest() function).

In order to evaluate the quality of the models on training data, write a function that calculates the mean-squared error of the fitted model. The function should take 3 arguments: (1) the fitted model, (2) the vector with the outcome variable and (3) the matrix of predictors.

What was the mean-squared error of your best-fitting model?

Data can be loaded with the code below.

In [ ]:

using DecisionTree
using RDatasets

stagec = dataset("rpart", "stagec")
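A minimal sketch of the three-argument error function described above, tried on simulated data. It assumes a recent DecisionTree.jl, in which apply_forest(model, X) returns the forest's predictions. The matrix X and outcome y are made up; the real stagec data would first need to be converted to a numeric matrix with missing rows handled.

```julia
using DecisionTree
using Random
using Statistics

# Mean-squared error of a fitted forest on outcome `y` and predictor matrix `X`
function mse(model, y, X)
    predictions = apply_forest(model, X)
    return mean((y .- predictions) .^ 2)
end

Random.seed!(1)

# Simulated regression data (made up; stands in for the stagec predictors)
X = randn(150, 4)
y = 2 .* X[:, 1] .- X[:, 2] .+ 0.5 .* randn(150)

# Compare forests with different m_try (3rd argument) and tree counts (4th)
for (m_try, n_trees) in [(1, 10), (2, 50), (4, 100)]
    forest = build_forest(y, X, m_try, n_trees)
    println("m_try = ", m_try, ", n_trees = ", n_trees, ", MSE = ", mse(forest, y, X))
end
```

Because this error is computed on the training data, it will tend to look better than the error on unseen observations.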

Question 5

Using the aldh2 dataset from the gap package in R, try fitting a few random forest (or bagged tree) models using Julia to predict whether a given patient is an alcoholic using their genetic information.

What is the prediction accuracy of your best model? What were the meta-parameters of your best-fitting model?

The data can be loaded using the code below.

In [ ]:

using RDatasets
using DecisionTree

aldh2 = dataset("gap", "aldh2")
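One way to compare meta-parameter settings is to compute accuracy on held-out rows, as sketched below. The predictors and labels are simulated genotype-like codes, not the aldh2 data, and the accuracy() helper and the 200/100 split are illustrative choices. Setting m_try equal to the number of predictors corresponds to bagged trees.

```julia
using DecisionTree
using Random
using Statistics

# Proportion of observations whose predicted label matches the true label
accuracy(model, labels, X) = mean(apply_forest(model, X) .== labels)

Random.seed!(1)

# Simulated genotype-like predictors coded 0/1/2 (made up, not the aldh2 data)
X = Float64.(rand(0:2, 300, 6))
labels = [x1 + x2 + rand() > 2.5 ? "yes" : "no" for (x1, x2) in zip(X[:, 1], X[:, 2])]

# Hold out the last 100 rows so accuracy is not measured on the training data
train_rows, test_rows = 1:200, 201:300

for (m_try, n_trees) in [(2, 25), (3, 100), (6, 100)]
    forest = build_forest(labels[train_rows], X[train_rows, :], m_try, n_trees)
    acc = accuracy(forest, labels[test_rows], X[test_rows, :])
    println("m_try = ", m_try, ", n_trees = ", n_trees, ", accuracy = ", acc)
end
```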

Question 6

Use R and the randomForest package via the RCall.jl package in Julia to fit a random forest model on the chronic_kidney_disease.csv data set. In particular, fit a model with 5000 trees to predict whether or not patients have chronic kidney disease.

After fitting the model, extract the variable importance estimates (mean Gini decrease) from the fitted model. Pass a data frame back to Julia that has two columns: (1) the name of the predictor and (2) the mean Gini decrease for that predictor.

Sort the returned data frame such that more important predictors (i.e., larger values) are at the top.

The following steps should serve as a general guide to completing this:

Read the data into Julia
Ensure RCall.jl is loaded
Pass the dataframe from Julia to R
Fit the model in R using the randomForest() function with argument ntree = 5000
Use the importance() function to extract the estimates of variable importance from the fitted model
Pass the estimates back to Julia

Some Hints:

The importance() function in R returns a matrix whose row names are the variable names; the last column is the mean Gini decrease
The documentation for the randomForest package in R can be found here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
The sortperm() function will be useful for sorting in the last step
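The round trip between Julia and R can be sketched as follows. This assumes a working R installation with the randomForest package, plus RCall.jl and DataFrames.jl on Julia 1.x. The data frame is simulated, and the names ckd, imp_df, predictor, and mean_gini are illustrative. The real file also contains NA values, which would need handling before fitting.

```julia
using DataFrames
using RCall
using Random

Random.seed!(1)
n = 100

# Simulated stand-in for the CKD data (made up; only two predictors shown)
ckd = DataFrame(
    class      = rand(["ckd", "notckd"], n),
    blood_urea = 50 .+ 20 .* randn(n),
    hemoglobin = 13 .+ 2 .* randn(n),
)

# Copy the Julia data frame into the R session under the same name
@rput ckd

R"""
library(randomForest)
# A factor outcome makes randomForest() fit a classification forest
ckd[["class"]] <- as.factor(ckd[["class"]])
fit <- randomForest(class ~ ., data = ckd, ntree = 5000)
# importance() returns a matrix; row names are the predictors
imp <- importance(fit)
imp_df <- data.frame(predictor = rownames(imp),
                     mean_gini = imp[, ncol(imp)],
                     stringsAsFactors = FALSE)
"""

# Copy the result back to Julia and sort by importance, largest first
@rget imp_df
imp_sorted = imp_df[sortperm(imp_df.mean_gini, rev = true), :]
println(imp_sorted)
```

Building the two-column data frame on the R side keeps the predictor names, which would otherwise be lost as row names when the matrix is converted.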