In [1]:
version


               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.3                         
year           2017                        
month          03                          
day            06                          
svn rev        72310                       
language       R                           
version.string R version 3.3.3 (2017-03-06)
nickname       Another Canoe               

If your R version isn't the latest (3.3.3 at the time of this writing), upgrade it:

For Windows: https://www.r-statistics.com/2015/06/a-step-by-step-screenshots-tutorial-for-upgrading-r-on-windows/


In [2]:
# this will install (if necessary) and load the required packages
if (!require(data.table)) {
    install.packages('data.table', repos='https://cloud.r-project.org/')
    require(data.table)
}

if (!require(ggplot2)) {
    install.packages('ggplot2', repos='https://cloud.r-project.org/')
    require(ggplot2)
}


Loading required package: data.table
Loading required package: ggplot2

If you're on a Mac and get an error mentioning OpenMP, follow these instructions to fix it:

https://github.com/Rdatatable/data.table/wiki/Installation#openmp-enabled-compiler-for-mac

If you don't fix it, your code will run slowly because data.table will only be using one of your processor cores.
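To check how many cores data.table will actually use, recent versions of the package provide `getDTthreads()` (a quick diagnostic; if your data.table version is older this function may not exist):

```r
library(data.table)

# number of threads data.table will use for parallel operations;
# on a Mac with the OpenMP problem above this will report 1
getDTthreads()
```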


In [3]:
# shows the current working directory (wd)
getwd()


'C:/Users/ngeorge/Documents/GitHub/lending_club_EDA'

Getting started: finding filepaths on your computer

Some people struggled to find the filepath to the data and couldn't load it as a result. Basically, you'll want to google 'copy filepath in windows', 'find filepath in mac', or 'copy filepath in ubuntu', or something similar for whatever operating system you're using.

Windows: http://www.pcworld.com/article/251406/windows_tips_copy_a_file_path_show_or_hide_extensions.html (you can also get the path from properties, or from the address bar)

Mac: https://www.cnet.com/how-to/how-to-copy-a-file-path-in-os-x/

Ubuntu: http://askubuntu.com/questions/23413/how-can-you-quickly-get-the-complete-path-to-a-file-for-use-in-terminal
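If you'd rather not leave R at all, one option (a sketch using base-R functions, interactive sessions only) is to pick the file in a dialog and let R report its full path:

```r
# opens a file-chooser dialog and returns the selected file's full path;
# this only works in an interactive R session, not in a script run non-interactively
path <- file.choose()
print(path)
```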

A note on strings in R:

The backslash (\) character is special in strings (strings are blocks of text in quotes, like 'this is a string'): it 'escapes' the character that follows it:

http://stackoverflow.com/questions/11806501/how-to-escape-backslashes-in-r-string

So if you're copying the path 'C:\Users\me\github\data.csv', it should be either:

'C:\\Users\\me\\github\\data.csv'

or

'C:/Users/me/github/data.csv'

Otherwise, you'll get an error.


In [1]:
'C:\Users\me\github\data.csv'


Error: '\U' used without hex digits in character string starting "'C:\U"
Traceback:

In [2]:
'C:\\Users\\me\\github\\data.csv'


'C:\Users\me\github\data.csv'

In [3]:
'C:/Users/me/github/data.csv'


'C:/Users/me/github/data.csv'

In [4]:
# loading takes a while...

# this should be the path to the data.  Adapt for your system
filepath <- 'C:/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'

accepted_def <- read.csv(gzfile(paste(filepath, 'accepted_2007_to_2016.csv.gz', sep='')), na.strings='')
acc_dt <- as.data.table(accepted_def)
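As an aside, if you decompress the .csv.gz file first, data.table's `fread()` is usually much faster than `read.csv()` and returns a data.table directly, skipping the `as.data.table()` step (a sketch; it assumes you've already extracted the .csv and that `filepath` is set as above):

```r
library(data.table)

# fread() auto-detects the separator and column types and is much faster
# than read.csv(); na.strings='' treats empty fields as NA, as above
acc_dt <- fread(paste0(filepath, 'accepted_2007_to_2016.csv'), na.strings='')
```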

In [5]:
# that's a lot of observations
dim(acc_dt)


  1. 1321847
  2. 110

In [6]:
# and a lot of columns
names(acc_dt)


  1. 'id'
  2. 'member_id'
  3. 'loan_amnt'
  4. 'funded_amnt'
  5. 'funded_amnt_inv'
  6. 'term'
  7. 'int_rate'
  8. 'installment'
  9. 'grade'
  10. 'sub_grade'
  11. 'emp_title'
  12. 'emp_length'
  13. 'home_ownership'
  14. 'annual_inc'
  15. 'verification_status'
  16. 'issue_d'
  17. 'loan_status'
  18. 'pymnt_plan'
  19. 'desc'
  20. 'purpose'
  21. 'title'
  22. 'zip_code'
  23. 'addr_state'
  24. 'dti'
  25. 'delinq_2yrs'
  26. 'earliest_cr_line'
  27. 'inq_last_6mths'
  28. 'mths_since_last_delinq'
  29. 'mths_since_last_record'
  30. 'open_acc'
  31. 'pub_rec'
  32. 'revol_bal'
  33. 'revol_util'
  34. 'total_acc'
  35. 'initial_list_status'
  36. 'out_prncp'
  37. 'out_prncp_inv'
  38. 'total_pymnt'
  39. 'total_pymnt_inv'
  40. 'total_rec_prncp'
  41. 'total_rec_int'
  42. 'total_rec_late_fee'
  43. 'recoveries'
  44. 'collection_recovery_fee'
  45. 'last_pymnt_d'
  46. 'last_pymnt_amnt'
  47. 'next_pymnt_d'
  48. 'last_credit_pull_d'
  49. 'collections_12_mths_ex_med'
  50. 'mths_since_last_major_derog'
  51. 'policy_code'
  52. 'application_type'
  53. 'annual_inc_joint'
  54. 'dti_joint'
  55. 'verification_status_joint'
  56. 'acc_now_delinq'
  57. 'tot_coll_amt'
  58. 'tot_cur_bal'
  59. 'open_acc_6m'
  60. 'open_il_6m'
  61. 'open_il_12m'
  62. 'open_il_24m'
  63. 'mths_since_rcnt_il'
  64. 'total_bal_il'
  65. 'il_util'
  66. 'open_rv_12m'
  67. 'open_rv_24m'
  68. 'max_bal_bc'
  69. 'all_util'
  70. 'total_rev_hi_lim'
  71. 'inq_fi'
  72. 'total_cu_tl'
  73. 'inq_last_12m'
  74. 'acc_open_past_24mths'
  75. 'avg_cur_bal'
  76. 'bc_open_to_buy'
  77. 'bc_util'
  78. 'chargeoff_within_12_mths'
  79. 'delinq_amnt'
  80. 'mo_sin_old_il_acct'
  81. 'mo_sin_old_rev_tl_op'
  82. 'mo_sin_rcnt_rev_tl_op'
  83. 'mo_sin_rcnt_tl'
  84. 'mort_acc'
  85. 'mths_since_recent_bc'
  86. 'mths_since_recent_bc_dlq'
  87. 'mths_since_recent_inq'
  88. 'mths_since_recent_revol_delinq'
  89. 'num_accts_ever_120_pd'
  90. 'num_actv_bc_tl'
  91. 'num_actv_rev_tl'
  92. 'num_bc_sats'
  93. 'num_bc_tl'
  94. 'num_il_tl'
  95. 'num_op_rev_tl'
  96. 'num_rev_accts'
  97. 'num_rev_tl_bal_gt_0'
  98. 'num_sats'
  99. 'num_tl_120dpd_2m'
  100. 'num_tl_30dpd'
  101. 'num_tl_90g_dpd_24m'
  102. 'num_tl_op_past_12m'
  103. 'pct_tl_nvr_dlq'
  104. 'percent_bc_gt_75'
  105. 'pub_rec_bankruptcies'
  106. 'tax_liens'
  107. 'tot_hi_cred_lim'
  108. 'total_bal_ex_mort'
  109. 'total_bc_limit'
  110. 'total_il_high_credit_limit'

In [7]:
str(acc_dt, list.len=ncol(acc_dt))


Classes 'data.table' and 'data.frame':	1321847 obs. of  110 variables:
 $ id                            : int  1077501 1077430 1077175 1076863 1075358 1075269 1069639 1072053 1071795 1071570 ...
 $ member_id                     : num  1296599 1314167 1313524 1277178 1311748 ...
 $ loan_amnt                     : num  5000 2500 2400 10000 3000 ...
 $ funded_amnt                   : num  5000 2500 2400 10000 3000 ...
 $ funded_amnt_inv               : num  4975 2500 2400 10000 3000 ...
 $ term                          : Factor w/ 2 levels " 36 months"," 60 months": 1 2 1 1 2 1 2 1 2 2 ...
 $ int_rate                      : num  10.6 15.3 16 13.5 12.7 ...
 $ installment                   : num  162.9 59.8 84.3 339.3 67.8 ...
 $ grade                         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 3 2 1 3 5 6 2 ...
 $ sub_grade                     : Factor w/ 35 levels "A1","A2","A3",..: 7 14 15 11 10 4 15 21 27 10 ...
 $ emp_title                     : Factor w/ 381900 levels "'Property Manager",..: NA 286944 NA 12233 361022 365337 315798 217775 NA 329499 ...
 $ emp_length                    : Factor w/ 12 levels "< 1 year","1 year",..: 3 1 3 3 2 5 10 11 6 1 ...
 $ home_ownership                : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 6 6 6 5 6 ...
 $ annual_inc                    : num  24000 30000 12252 49200 80000 ...
 $ verification_status           : Factor w/ 3 levels "Not Verified",..: 3 2 1 2 2 2 1 2 2 3 ...
 $ issue_d                       : Factor w/ 115 levels "Apr-2008","Apr-2009",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ loan_status                   : Factor w/ 10 levels "Charged Off",..: 6 1 6 6 6 6 6 6 1 1 ...
 $ pymnt_plan                    : Factor w/ 1 level "n": 1 1 1 1 1 1 1 1 1 1 ...
 $ desc                          : Factor w/ 124502 levels "- Pay off Dell Financial: $ 1300.00 - Pay off IRS for 2005: $ 1400.00 - Pay off Mac Comp    : $ 1700.00 - Pay off Bill Me Later"| __truncated__,..: 113416 113421 NA 113272 113248 NA 112361 111645 113246 111648 ...
 $ purpose                       : Factor w/ 14 levels "car","credit_card",..: 2 1 12 10 10 14 3 1 12 10 ...
 $ title                         : Factor w/ 63156 levels "'08 & '09 Roth IRA Investments",..: 10498 4975 52509 50883 50276 42604 36956 7265 24375 6112 ...
 $ zip_code                      : Factor w/ 948 levels "007xx","008xx",..: 820 300 581 867 921 813 271 850 908 739 ...
 $ addr_state                    : Factor w/ 51 levels "AK","AL","AR",..: 4 11 15 5 38 4 28 5 5 44 ...
 $ dti                           : num  27.65 1 8.72 20 17.94 ...
 $ delinq_2yrs                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ earliest_cr_line              : Factor w/ 718 levels "Apr-1955","Apr-1958",..: 272 42 589 216 283 592 351 294 47 709 ...
 $ inq_last_6mths                : num  1 5 2 1 0 3 1 2 2 0 ...
 $ mths_since_last_delinq        : num  NA NA NA 35 38 NA NA NA NA NA ...
 $ mths_since_last_record        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_acc                      : num  3 3 2 10 15 9 7 4 11 2 ...
 $ pub_rec                       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ revol_bal                     : num  13648 1687 2956 5598 27783 ...
 $ revol_util                    : num  83.7 9.4 98.5 21 53.9 28.3 85.6 87.5 32.6 36.5 ...
 $ total_acc                     : num  9 4 10 37 38 12 11 4 13 3 ...
 $ initial_list_status           : Factor w/ 2 levels "f","w": 1 1 1 1 1 1 1 1 1 1 ...
 $ out_prncp                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ out_prncp_inv                 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ total_pymnt                   : num  5863 1015 3006 12232 4067 ...
 $ total_pymnt_inv               : num  5834 1015 3006 12232 4067 ...
 $ total_rec_prncp               : num  5000 456 2400 10000 3000 ...
 $ total_rec_int                 : num  863 435 606 2215 1067 ...
 $ total_rec_late_fee            : num  0 0 0 17 0 ...
 $ recoveries                    : num  0 123 0 0 0 ...
 $ collection_recovery_fee       : num  0 1.11 0 0 0 0 0 0 2.09 2.52 ...
 $ last_pymnt_d                  : Factor w/ 110 levels "Apr-2008","Apr-2009",..: 45 6 63 45 47 45 83 45 5 88 ...
 $ last_pymnt_amnt               : num  171.6 119.7 649.9 357.5 67.3 ...
 $ next_pymnt_d                  : Factor w/ 103 levels "Apr-2008","Apr-2009",..: NA NA NA NA NA NA NA NA NA NA ...
 $ last_credit_pull_d            : Factor w/ 115 levels "Apr-2009","Apr-2010",..: 47 105 47 8 47 46 115 26 105 28 ...
 $ collections_12_mths_ex_med    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mths_since_last_major_derog   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ policy_code                   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ application_type              : Factor w/ 3 levels "DIRECT_PAY","INDIVIDUAL",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ annual_inc_joint              : num  NA NA NA NA NA NA NA NA NA NA ...
 $ dti_joint                     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ verification_status_joint     : Factor w/ 3 levels "Not Verified",..: NA NA NA NA NA NA NA NA NA NA ...
 $ acc_now_delinq                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_coll_amt                  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ tot_cur_bal                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_acc_6m                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_il_6m                    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_il_12m                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_il_24m                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mths_since_rcnt_il            : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_bal_il                  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ il_util                       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_rv_12m                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ open_rv_24m                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ max_bal_bc                    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ all_util                      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_rev_hi_lim              : num  NA NA NA NA NA NA NA NA NA NA ...
 $ inq_fi                        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_cu_tl                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ inq_last_12m                  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ acc_open_past_24mths          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ avg_cur_bal                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ bc_open_to_buy                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ bc_util                       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ chargeoff_within_12_mths      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ delinq_amnt                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mo_sin_old_il_acct            : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mo_sin_old_rev_tl_op          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mo_sin_rcnt_rev_tl_op         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mo_sin_rcnt_tl                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mort_acc                      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mths_since_recent_bc          : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mths_since_recent_bc_dlq      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mths_since_recent_inq         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mths_since_recent_revol_delinq: num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_accts_ever_120_pd         : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_actv_bc_tl                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_actv_rev_tl               : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_bc_sats                   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_bc_tl                     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_il_tl                     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_op_rev_tl                 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_rev_accts                 : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_rev_tl_bal_gt_0           : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_sats                      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_tl_120dpd_2m              : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_tl_30dpd                  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_tl_90g_dpd_24m            : num  NA NA NA NA NA NA NA NA NA NA ...
 $ num_tl_op_past_12m            : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pct_tl_nvr_dlq                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ percent_bc_gt_75              : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pub_rec_bankruptcies          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tax_liens                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_hi_cred_lim               : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_bal_ex_mort             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_bc_limit                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ total_il_high_credit_limit    : num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 

In [8]:
# outliers are screwing it up!
hist(acc_dt[, dti])



In [9]:
# from here: http://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
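A quick sanity check of the function on a toy vector: anything beyond 1.5 * IQR outside the quartiles becomes NA:

```r
# quartiles of this vector are 3.5 and 8.5, so IQR = 5 and the
# fences are 3.5 - 7.5 = -4 and 8.5 + 7.5 = 16
x <- c(1:10, 100)

# 100 falls outside the fences and is replaced with NA;
# the values 1 through 10 are kept as-is
remove_outliers(x)
```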

In [10]:
dti_no_outliers <- remove_outliers(acc_dt[, dti])

In [11]:
hist(dti_no_outliers)


Let's look at interest rates for the different loan grades. Are they statistically significantly different from each other?

For this we can use one-way ANOVA to test whether the mean interest rate differs between loan grades in a statistically significant way.

This data.table cheat sheet will come in handy here: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf

First things first: can you show the unique values in a column? Let's look at the unique values for the 'grade' column. Print them out in alphabetical order so it looks like:

A B C D E F G

Hints:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html

https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html


In [12]:
sort(unique(acc_dt[, grade]))


  1. A
  2. B
  3. C
  4. D
  5. E
  6. F
  7. G

Now, can you show the grades and their average interest rates? You will need to group by the grades and calculate the mean to answer this question. It should look like this:

grade Avg_Interest_Rate
A 7.129947
B 10.626637
C 13.918715
D 17.502870
E 20.574477
F 24.230820
G 26.653138

Hints:

http://www.statmethods.net/management/sorting.html

http://www.r-tutor.com/elementary-statistics/numerical-measures/mean


In [13]:
avg_gr_int <- acc_dt[, .(Avg_Interest_Rate=mean(int_rate)), by=grade]
avg_gr_int


grade Avg_Interest_Rate
B 10.626637
C 13.918715
A 7.129947
E 20.574477
F 24.230820
D 17.502870
G 26.653138

In [14]:
# this is one way to sort things in the data.table package
# http://stackoverflow.com/questions/12353820/sort-rows-in-data-table
avg_gr_int[order(grade)]


grade Avg_Interest_Rate
A 7.129947
B 10.626637
C 13.918715
D 17.502870
E 20.574477
F 24.230820
G 26.653138

In [15]:
# here's another:
setkey(avg_gr_int, grade) # this 'invisibly' sets the order in the data table, so that it is permanent.  see the cheat sheet: 
# https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf
avg_gr_int


grade Avg_Interest_Rate
A 7.129947
B 10.626637
C 13.918715
D 17.502870
E 20.574477
F 24.230820
G 26.653138

Now make a bar plot with that data; it should look like this:

Hints:

https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html


In [16]:
barplot(height=avg_gr_int[, Avg_Interest_Rate], names.arg=avg_gr_int[, grade], ylab='Average Interest Rate')


That's nice, but it's hard to tell how the spread of the interest rates differs, and how many outliers there are. Instead, let's make a box and whisker plot.

See this for an explanation of box plots: http://www.physics.csbsju.edu/stats/box2.html

Hints:

https://www.r-bloggers.com/summarising-data-using-box-and-whisker-plots/


In [17]:
boxplot(int_rate ~ grade, data=acc_dt, xlab='Grade', ylab='Interest Rate')


Now let's do one-way ANOVA. All this test will tell us is whether the loan's grade has any statistically significant effect on the interest rate. What do you hypothesize? That it will, or won't?

Hints:

http://www.statmethods.net/stats/anova.html

http://www.gardenersown.co.uk/education/lectures/r/anova.htm


In [18]:
fit <- aov(int_rate ~ grade, data=acc_dt)

Warning: this next cell takes some time to run, at least a minute or two, and it will make several plots.

We're looking for homoscedasticity here: https://en.wikipedia.org/wiki/Homoscedasticity

If the residuals are strongly biased to one side or the other, the analysis may not be trustworthy.


In [19]:
plot(fit)



In [20]:
summary(fit)


                 Df   Sum Sq Mean Sq F value Pr(>F)    
grade             6 25166007 4194334 2264787 <2e-16 ***
Residuals   1321840  2448017       2                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that Pr(>F), i.e. the probability of an F-statistic at least as large as the value we found, is very small. Another way to think about this: the area under the F-distribution curve to the right of our F value is very small.

We usually use 0.05 as the threshold for statistical significance; here Pr(>F) is much smaller than 0.05, so the effect of grade is statistically significant.
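For intuition, Pr(>F) can be computed directly from the F-distribution with `pf()`, plugging in the F value and degrees of freedom reported in the summary above:

```r
# upper-tail probability of the F-distribution with 6 and 1321840
# degrees of freedom; effectively 0 for an F value this large
pf(2264787, df1 = 6, df2 = 1321840, lower.tail = FALSE)
```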

Let's see which loan grades are statistically different from the others. For this, we can do a Tukey HSD test.

Hints:

http://www.gardenersown.co.uk/education/lectures/r/anova.htm


In [21]:
TukeyHSD(fit)
# here we're looking at the p adj column, which is the p-value of the pairwise comparisons.  If this is less than 0.05, 
# we can say it's a statistically significant difference


  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = int_rate ~ grade, data = acc_dt)

$grade
         diff       lwr       upr p adj
B-A  3.496690  3.485973  3.507408     0
C-A  6.788768  6.777994  6.799542     0
D-A 10.372923 10.360493 10.385353     0
E-A 13.444530 13.429029 13.460032     0
F-A 17.100874 17.076986 17.124761     0
G-A 19.523191 19.477630 19.568752     0
C-B  3.292078  3.282915  3.301241     0
D-B  6.876233  6.865170  6.887295     0
E-B  9.947840  9.933412  9.962268     0
F-B 13.604183 13.580977 13.627389     0
G-B 16.026501 15.981294 16.071708     0
D-C  3.584155  3.573038  3.595272     0
E-C  6.655762  6.641292  6.670232     0
F-C 10.312106 10.288874 10.335337     0
G-C 12.734423 12.689203 12.779643     0
E-D  3.071607  3.055865  3.087349     0
F-D  6.727950  6.703906  6.751995     0
G-D  9.150268  9.104625  9.195911     0
F-E  3.656343  3.630576  3.682110     0
G-E  6.078661  6.032087  6.125234     0
G-F  2.422318  2.372323  2.472312     0
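If you want the adjusted p-values programmatically rather than reading them off the printout, the object returned by `TukeyHSD()` is a list with one matrix per model term (a sketch; the column name 'p adj' matches the printout above):

```r
tuk <- TukeyHSD(fit)

# tuk$grade is a matrix of pairwise comparisons with columns
# 'diff', 'lwr', 'upr', and 'p adj'
p_values <- tuk$grade[, 'p adj']

# which pairs are NOT significant at the 0.05 level?
# for the grade model every p adj is 0, so this returns an empty vector
names(p_values[p_values >= 0.05])
```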

Now find a variable other than grade, and see if the interest rate depends on that variable in a statistically significant way. Go through a similar process as we did above -- first plot the data to visually inspect the differences, then do statistical tests to check for them.

Make sure to choose a variable that is a 'Factor' (you can check this in the str() output near the top), and don't use sub_grade since we already used grade (unless you want to get fancy and subset one of the grades, then check the sub-grades within that grade).

Also make sure not to choose one of the factors with hundreds of levels; R mistakenly parsed those columns as factors when auto-detecting column types.
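To see which columns are factors, and how many levels each has, without scrolling through the str() output, one approach is:

```r
# logical vector: which columns of the data.table are factors?
is_fac <- sapply(acc_dt, is.factor)

# number of levels per factor column; avoid the ones with hundreds
# or thousands of levels (emp_title, desc, title, zip_code, ...)
sapply(acc_dt[, names(acc_dt)[is_fac], with=FALSE], nlevels)
```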


In [22]:
# I choose to look at home ownership
boxplot(int_rate ~ home_ownership, data=acc_dt, xlab='home_ownership', ylab='interest rate')
# I hypothesize these are barely statistically significant due to the large number of samples



In [23]:
# many of the differences are too small to see, so let's look at the hard numbers
acc_dt[, .(Avg_Interest_Rate=mean(int_rate)), by=home_ownership]


home_ownership Avg_Interest_Rate
RENT     13.56629
OWN      13.18450
MORTGAGE 12.86813
OTHER    13.46214
NONE     14.47420
ANY      12.56823

In [24]:
fit <- aov(int_rate ~ home_ownership, data=acc_dt)

In [25]:
summary(fit)


                    Df   Sum Sq Mean Sq F value Pr(>F)    
home_ownership       5   142330   28466    1370 <2e-16 ***
Residuals      1321841 27471694      21                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In [26]:
TukeyHSD(fit)
# very interesting...only a few pairs are statistically different
# comparing this with the box plots is useful. take a look above
# essentially, if you rent vs own, you will pay about 0.4% higher interest rates on average.


  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = int_rate ~ home_ownership, data = acc_dt)

$home_ownership
                     diff        lwr       upr     p adj
MORTGAGE-ANY    0.2999016 -0.9223238 1.5221270 0.9820786
NONE-ANY        1.9059699 -0.3006253 4.1125651 0.1357009
OTHER-ANY       0.8939128 -0.6620141 2.4498396 0.5736969
OWN-ANY         0.6162652 -0.6063461 1.8388766 0.7046996
RENT-ANY        0.9980556 -0.2241957 2.2203070 0.1831198
NONE-MORTGAGE   1.6060683 -0.2312511 3.4433877 0.1266351
OTHER-MORTGAGE  0.5940112 -0.3691029 1.5571252 0.4934814
OWN-MORTGAGE    0.3163636  0.2781702 0.3545571 0.0000000
RENT-MORTGAGE   0.6981540  0.6741007 0.7222074 0.0000000
OTHER-NONE     -1.0120571 -3.0863801 1.0622658 0.7329472
OWN-NONE       -1.2897047 -3.1272808 0.5478714 0.3421405
RENT-NONE      -0.9079143 -2.7452509 0.9294223 0.7221368
OWN-OTHER      -0.2776475 -1.2412512 0.6859562 0.9637495
RENT-OTHER      0.1041429 -0.8590040 1.0672898 0.9996310
RENT-OWN        0.3817904  0.3427770 0.4208039 0.0000000