Using the data in chronic_kidney_disease.csv, determine whether or not the oldest patient in the sample has chronic kidney disease. Note that the class variable indicates a patient's CKD status.
As a hint, you will probably want to use the maximum() function and the find() function.
You'll also need to consider how to handle NA values; the dropna() function will be useful.
In [37]:
using DataFrames
ckd = readtable("../data/chronic_kidney_disease.csv");
In [10]:
maxage = maximum(dropna(ckd[:age]))
oldest_idx = find(ckd[:age] .== maxage)
ckd[oldest_idx, :class]
Out[10]:
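One wrinkle worth noting: find() returns every index at which the comparison holds, so if more than one patient is tied at the maximum age, the last line above returns several class values. A quick (optional) sanity check:
length(oldest_idx)  # 1 if the oldest patient is unique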
Using the chronic_kidney_disease.csv dataset from above, use an appropriate statistical test to determine whether patients with chronic kidney disease (CKD) have significantly higher blood urea than patients without CKD.
As a hint, you will likely want to use the HypothesisTests package.
In [10]:
using DataFrames
using HypothesisTests
ckd = readtable("../data/chronic_kidney_disease.csv")
# get row indices
ckd_idcs = find(ckd[:class] .== "ckd")
nonckd_idcs = find(ckd[:class] .== "notckd")
# get `blood_urea` scores
ckd_grp = ckd[ckd_idcs, :blood_urea]
nonckd_grp = ckd[nonckd_idcs, :blood_urea]
UnequalVarianceTTest(dropna(ckd_grp), dropna(nonckd_grp))
Out[10]:
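Note that UnequalVarianceTTest reports a two-sided p-value by default, while the question asks specifically whether the CKD group is higher. A sketch of extracting the one-sided p-value with HypothesisTests' pvalue() function (assuming the CKD group is passed as the first sample, so the relevant tail is :right):
t = UnequalVarianceTTest(dropna(ckd_grp), dropna(nonckd_grp))
pvalue(t, tail = :right)  # one-sided: mean blood urea in CKD group > non-CKD group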
Using the chronic_kidney_disease.csv data, determine which of the following predictors (if any) are related to chronic kidney disease (CKD):
blood_urea,
hemoglobin,
red_blood_cell_count,
white_blood_cell_count.
Hints:
The outcome variable is class in the CKD data; it needs to be re-coded as 0/1 before fitting a logistic regression.
In [6]:
using GLM
using DataFrames
ckd = readtable("../data/chronic_kidney_disease.csv", makefactors = true)
ckd[:has_ckd] = [x == "ckd" ? 1 : 0 for x in ckd[:class]]
mod3 = @formula(has_ckd ~ 1 + blood_urea + hemoglobin + red_blood_cell_count + white_blood_cell_count)
fm3 = glm(mod3, ckd, Binomial())
Out[6]:
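Because the logistic-regression coefficients are on the logit scale, they can be hard to read directly. One optional follow-up (a sketch, using GLM's coef() accessor on the fitted model above) is to exponentiate them into odds ratios:
# exponentiate the coefficients to get odds ratios per unit change in each predictor
exp.(coef(fm3))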
Using the stagec data from above, fit several random forest models predicting the G2 score. Try experimenting with different numbers of trees and different numbers of variables sampled as candidates at each split (i.e., m_try, the third argument to the build_forest() function).
In order to evaluate the quality of the models on training data, write a function that calculates the mean-squared error of the fitted model. The function should take three arguments: (1) the fitted model, (2) the vector with the outcome variable, and (3) the matrix of predictors.
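Recall that for $n$ observations the mean-squared error is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, the average squared difference between observed and fitted values.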
What was the mean-squared error of your best-fitting model?
Data can be loaded with the code below.
In [11]:
# This is a quick function to obtain the mean-squared
# error of a fitted random forest (or bagged tree) model.
function mse(fitted, y, X)
    # predictions from the fitted model (apply_forest is from DecisionTree)
    yhat = apply_forest(fitted, X)
    # average squared residual
    sqerr = (y .- yhat).^2
    out = mean(sqerr)
    return out
end
Out[11]:
In [14]:
# Load the data
using DecisionTree
using RDatasets
stagec = dataset("rpart", "stagec")
stagec_comp = stagec[completecases(stagec), :];
In [15]:
# Clean up data and fit model
is_tetraploid = stagec_comp[:Ploidy] .== "tetraploid"
stagec_comp[:tetra] = is_tetraploid
# must convert to Array
y = convert(Array{Float64,1}, stagec_comp[:G2])
X = convert(Array{Float64,2}, stagec_comp[[:Age, :Grade, :Gleason, :EET, :tetra]])
fm4a = build_forest(y, X, 5, 100)
fm4b = build_forest(y, X, 3, 100)
fm4c = build_forest(y, X, 5, 500)
Out[15]:
In [16]:
@show mse(fm4a, y, X)
@show mse(fm4b, y, X)
@show mse(fm4c, y, X)
Out[16]:
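To search more systematically than three hand-picked fits, you could loop over a small grid of meta-parameters. A sketch under the same setup (y, X, and mse() as defined above; the grid values here are arbitrary choices):
# try every combination of m_try and number of trees in a small grid
for m_try in (2, 3, 5), n_trees in (100, 500)
    fm = build_forest(y, X, m_try, n_trees)
    println("m_try = $m_try, n_trees = $n_trees, mse = ", mse(fm, y, X))
end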
Using the aldh2 dataset from the gap package in R, try fitting a few random forest (or bagged tree) models in Julia to predict whether a given patient is an alcoholic from their genetic information.
What is the prediction accuracy of your best model? What were the meta-parameters of your best-fitting model?
The data can be loaded using the code below.
In [30]:
using RDatasets
using DecisionTree
aldh2 = dataset("gap", "aldh2");
In [33]:
# must convert to Array
y = convert(Array{Float64,1}, aldh2[:Y])
X = convert(Array{Float64,2}, aldh2[:, 3:end])
# fit model
fm10 = build_forest(y, X, 10, 500)
Out[33]:
In [34]:
# get prediction accuracy
yhat_bool = apply_forest(fm10, X) .> 0.5
ybool = y .== 1
mean(ybool .== yhat_bool)
Out[34]:
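Keep in mind that this is accuracy on the training data, so it will be optimistic; random forests in particular can fit training data nearly perfectly. To compare meta-parameters, a sketch that re-fits over a small (arbitrarily chosen) grid and reports training accuracy for each combination:
for m_try in (5, 10, 20), n_trees in (100, 500)
    fm = build_forest(y, X, m_try, n_trees)
    # threshold the averaged tree predictions at 0.5 and compare to the 0/1 outcome
    acc = mean((apply_forest(fm, X) .> 0.5) .== (y .== 1))
    println("m_try = $m_try, n_trees = $n_trees, accuracy = $acc")
end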
Use R and the randomForest package via the RCall.jl package in Julia to fit a random forest model on the chronic_kidney_disease.csv data set. In particular, fit a model with 5000 trees to predict whether or not patients have chronic kidney disease.
After fitting the model, extract the variable importance estimates (mean Gini decrease) from the fitted model. Pass a data frame back to Julia that has two columns: (1) the name of the predictor, and (2) the mean Gini decrease for that predictor.
Sort the returned data frame such that larger values are at the top.
The following steps should serve as a general guide to completing this:
Read the data in to Julia
Ensure RCall.jl is loaded
Pass the dataframe from Julia to R
Fit the model in R using the randomForest() function with the argument ntree = 5000
Use the importance() function to extract the estimates of variable importance from the fitted model
Pass the estimates back to Julia
Some Hints:
The importance() function in R returns a matrix whose row names are the variable names and whose last column is the mean Gini decrease
The documentation for the randomForest package in R can be found here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
The sortperm() function in Julia will be useful for sorting in the last step
In [35]:
using DataFrames
using RCall
ckd = readtable("../data/chronic_kidney_disease.csv", makefactors = true)
@rput ckd
R"
library(randomForest)
ckd2 <- ckd[complete.cases(ckd), ]
fm6 <- randomForest(class ~ ., ckd2, importance = TRUE, ntree = 5000)
imp_df <- importance(fm6)
gini_decrease <- imp_df[, ncol(imp_df)]
preds <- row.names(imp_df)
var_imp_df <- data.frame(preds, gini_decrease)
"
@rget var_imp_df;
In [36]:
indcs = sortperm(var_imp_df[:, 2], rev = true)
res = var_imp_df[indcs, :]
Out[36]:
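Equivalently, the sorting could be done with DataFrames' own sort() (a sketch, assuming the importance column kept the name gini_decrease on its way back from R):
res = sort(var_imp_df, cols = [:gini_decrease], rev = true)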