In [1]:
version
To upgrade R on Windows, see: https://www.r-statistics.com/2015/06/a-step-by-step-screenshots-tutorial-for-upgrading-r-on-windows/
In [2]:
# install the required packages if they are missing, then load them
# library() is used after installing so that a failed install raises an error
if (!require(data.table)) {
    install.packages('data.table', repos='https://cloud.r-project.org/')
    library(data.table)
}
if (!require(ggplot2)) {
    install.packages('ggplot2', repos='https://cloud.r-project.org/')
    library(ggplot2)
}
On a Mac, data.table may need an OpenMP-enabled compiler; see: https://github.com/Rdatatable/data.table/wiki/Installation#openmp-enabled-compiler-for-mac
In [3]:
# shows the current working directory (wd)
getwd()
In [1]:
'C:\Users\me\github\data.csv'
In [2]:
'C:\\Users\\me\\github\\data.csv'
In [3]:
'C:/Users/me/github/data.csv'
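The three strings above show the usual ways of writing a Windows path. In R, the first form does not parse, because `\U` is read as the start of an escape sequence; doubled backslashes and forward slashes both work. A minimal sketch using the same made-up path, with only base R functions (`file.path()` and `gsub()`):

```r
# Doubled backslashes: each '\\' in the source is one literal backslash.
win_path <- 'C:\\Users\\me\\github\\data.csv'

# Forward slashes also work on Windows and need no escaping.
fwd_path <- 'C:/Users/me/github/data.csv'

# file.path() joins the pieces with '/' on every platform.
built_path <- file.path('C:', 'Users', 'me', 'github', 'data.csv')

# Convert backslashes to forward slashes (fixed = TRUE avoids regex escaping).
converted <- gsub('\\', '/', win_path, fixed = TRUE)

identical(built_path, fwd_path)
identical(converted, fwd_path)
```

Building paths with `file.path()` avoids escaping altogether, which is why it is often the easiest choice in scripts shared across systems.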
In [4]:
# Path to the data directory; adapt it for your system.
filepath <- 'C:/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
# Loading takes a while. Empty fields are read as NA.
accepted_def <- read.csv(gzfile(paste0(filepath, 'accepted_2007_to_2016.csv.gz')), na.strings='')
acc_dt <- as.data.table(accepted_def)
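To see what `gzfile()` and `na.strings=''` do without waiting for the full file to load, here is a small round trip on a temporary file. The toy rows and the names `tmp` and `toy_back` are invented for illustration; only base R is used.

```r
# Write a tiny gzipped CSV to a temporary file (toy data, not Lending Club data).
tmp <- tempfile(fileext = '.csv.gz')
con <- gzfile(tmp, 'w')
writeLines(c('grade,int_rate',
             'A,7.1',
             ',10.6',     # empty grade field
             'C,13.9'), con)
close(con)

# Read it back the same way as above: gzfile() decompresses on the fly,
# and na.strings = '' turns empty fields into NA.
toy_back <- read.csv(gzfile(tmp), na.strings = '')

nrow(toy_back)             # number of data rows read
is.na(toy_back$grade)      # which grades were empty in the file
unlink(tmp)                # remove the temporary file
```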
In [5]:
# that's a lot of observations
dim(acc_dt)
In [6]:
# and a lot of columns
names(acc_dt)
In [7]:
str(acc_dt, list.len=ncol(acc_dt))
In [8]:
# a few extreme outliers distort the histogram
hist(acc_dt[, dti])
In [9]:
# from here: http://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset
remove_outliers <- function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    y
}
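The function above applies the common 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are replaced with NA. The arithmetic can be traced on a small made-up vector (the names `x`, `lower`, and `upper` are illustrative only):

```r
x <- c(1, 2, 3, 4, 5, 100)    # toy data with one obvious outlier

qnt <- quantile(x, probs = c(.25, .75))   # first and third quartiles
H   <- 1.5 * IQR(x)                       # 1.5 times the interquartile range

lower <- unname(qnt[1] - H)   # anything below this counts as an outlier
upper <- unname(qnt[2] + H)   # anything above this counts as an outlier

# Same masking step as in remove_outliers(): outliers become NA.
y <- x
y[x < lower | x > upper] <- NA
y
```

Because outliers become NA rather than being dropped, the result keeps the same length as the input, which makes it easy to put back next to the other columns.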
In [10]:
dti_no_outliers <- remove_outliers(acc_dt[, dti])
In [11]:
hist(dti_no_outliers)
Expected grades, in sorted order: A B C D E F G
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html
In [12]:
sort(unique(acc_dt[, grade]))
grade | Avg_Interest_Rate |
---|---|
A | 7.129947 |
B | 10.626637 |
C | 13.918715 |
D | 17.502870 |
E | 20.574477 |
F | 24.230820 |
G | 26.653138 |
http://www.statmethods.net/management/sorting.html
http://www.r-tutor.com/elementary-statistics/numerical-measures/mean
In [13]:
avg_gr_int <- acc_dt[, .(Avg_Interest_Rate=mean(int_rate)), by=grade]
avg_gr_int
In [14]:
# this is one way to sort things in the data.table package
# http://stackoverflow.com/questions/12353820/sort-rows-in-data-table
avg_gr_int[order(grade)]
In [15]:
# here's another: setkey() sorts the table in place (by reference) and marks 'grade' as its key,
# so the new order persists without reassignment. The result is returned invisibly. See the cheat sheet:
# https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf
setkey(avg_gr_int, grade)
avg_gr_int
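The grouped mean and the two sorting approaches can be compared on a small made-up table. This sketch assumes the data.table package is installed; `toy`, `avg`, and `sorted_copy` are illustrative names, not part of the original analysis.

```r
library(data.table)

# Toy loans: grades deliberately out of order.
toy <- data.table(grade    = c('C', 'A', 'B', 'A', 'C'),
                  int_rate = c(14, 7, 10, 8, 13))

# .() builds the output columns; by= computes them once per grade.
# Groups appear in order of first appearance, not alphabetically.
avg <- toy[, .(Avg_Interest_Rate = mean(int_rate)), by = grade]
order_before <- paste(avg$grade, collapse = ' ')

# order() inside [ ] returns a sorted copy and leaves avg untouched.
sorted_copy <- avg[order(grade)]
order_copy  <- paste(sorted_copy$grade, collapse = ' ')

# setkey() reorders avg itself and records 'grade' as the key.
setkey(avg, grade)
order_after <- paste(avg$grade, collapse = ' ')

order_before
order_copy
order_after
key(avg)
```

Use `order()` for a one-off sorted view, and `setkey()` when the table should stay sorted for later steps.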
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
In [16]:
barplot(height=avg_gr_int[, Avg_Interest_Rate], names.arg=avg_gr_int[, grade], ylab='Average Interest Rate')
https://www.r-bloggers.com/summarising-data-using-box-and-whisker-plots/
In [17]:
boxplot(int_rate ~ grade, data=acc_dt, xlab='Grade', ylab='Interest Rate')
http://www.statmethods.net/stats/anova.html
http://www.gardenersown.co.uk/education/lectures/r/anova.htm
In [18]:
fit <- aov(int_rate ~ grade, data=acc_dt)
In [19]:
plot(fit)
In [20]:
summary(fit)
http://www.gardenersown.co.uk/education/lectures/r/anova.htm
In [21]:
TukeyHSD(fit)
# The 'p adj' column is the p-value for each pairwise comparison, adjusted for multiple comparisons.
# If it is below 0.05, we treat the difference between that pair as statistically significant.
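To pull the significant pairs out of a `TukeyHSD()` result instead of reading them off the printout, index the result by the factor name. The sketch below uses R's built-in `PlantGrowth` dataset as a stand-in, since it loads instantly; `fit_toy`, `comparisons`, and `significant` are illustrative names.

```r
# One-way ANOVA on a small built-in dataset (plant weight by treatment group).
fit_toy   <- aov(weight ~ group, data = PlantGrowth)
tukey_toy <- TukeyHSD(fit_toy)

# The result is a list with one matrix per factor; rows are pairs of levels,
# columns are diff, lwr, upr and 'p adj'.
comparisons <- tukey_toy$group
p_adj       <- comparisons[, 'p adj']

# Keep only the pairs whose adjusted p-value is below 0.05.
significant <- names(p_adj)[p_adj < 0.05]
significant
```

For the loan data the same pattern would use `TukeyHSD(fit)$grade`, since `grade` is the factor in that model.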
In [22]:
# I choose to look at home ownership
boxplot(int_rate ~ home_ownership, data=acc_dt, xlab='home_ownership', ylab='interest rate')
# hypothesis: the differences are small, and any statistical significance comes mostly from the very large sample size
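The idea that a very large sample can make a tiny difference statistically significant can be checked with simulated data. This is only an illustration with invented numbers (two groups whose true means differ by 0.05), not the loan data; Cohen's d is used here as one common measure of effect size.

```r
set.seed(42)
n <- 200000                                  # observations per group

# Two simulated groups whose true means differ by only 0.05.
group_a <- rnorm(n, mean = 13.00, sd = 1)
group_b <- rnorm(n, mean = 13.05, sd = 1)

test <- t.test(group_a, group_b)             # Welch two-sample t-test

# Effect size: difference in means relative to the pooled standard deviation.
mean_diff <- mean(group_b) - mean(group_a)
pooled_sd <- sqrt((var(group_a) + var(group_b)) / 2)
cohens_d  <- mean_diff / pooled_sd

test$p.value < 0.05    # statistically significant?
abs(cohens_d) < 0.1    # yet the effect is very small
```

With samples this large, the p-value mainly says the difference is not exactly zero; the size of the difference has to be judged separately.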
In [23]:
# the differences are too small to see for many of them. Let's look at the hard numbers
acc_dt[, .(Avg_Interest_Rate=mean(int_rate)), by=home_ownership]
In [24]:
fit <- aov(int_rate ~ home_ownership, data=acc_dt)
In [25]:
summary(fit)
In [26]:
TukeyHSD(fit)
# very interesting: only a few pairs differ significantly
# comparing this with the box plots above is useful
# on average, renters pay about 0.4 percentage points more interest than owners