In [1]:
install.packages(c('data.table', 'ggplot2'), repos='http://cran.us.r-project.org')
library(data.table)
library(ggplot2)
package 'data.table' successfully unpacked and MD5 sums checked
package 'ggplot2' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\ngeorge\AppData\Local\Temp\RtmpugfXS7\downloaded_packages
In [2]:
# shows the current working directory (wd)
getwd()
'C:/Users/ngeorge/Documents/GitHub/lending_club_EDA'
In [3]:
# loading takes a while...
# this should be the path to the data. Adapt for your system
win_filepath <- 'C:/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
# same path from virtualbox shared folder
linux_filepath <- '/media/sf_C_DRIVE/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
accepted_def <- read.csv(gzfile(paste(win_filepath, 'accepted_2007_to_2016.csv.gz', sep='')), na.strings='')
acc_dt <- as.data.table(accepted_def)
rejected_def <- read.csv(gzfile(paste(win_filepath, 'rejected_2007_to_2016.csv.gz', sep='')), na.strings='')
rej_dt <- as.data.table(accepted_def)
In [4]:
# that's a lot of observations
dim(acc_dt)
- 1321847
- 111
In [5]:
names(acc_dt)
- 'id'
- 'member_id'
- 'loan_amnt'
- 'funded_amnt'
- 'funded_amnt_inv'
- 'term'
- 'int_rate'
- 'installment'
- 'grade'
- 'sub_grade'
- 'emp_title'
- 'emp_length'
- 'home_ownership'
- 'annual_inc'
- 'verification_status'
- 'issue_d'
- 'loan_status'
- 'pymnt_plan'
- 'url'
- 'desc'
- 'purpose'
- 'title'
- 'zip_code'
- 'addr_state'
- 'dti'
- 'delinq_2yrs'
- 'earliest_cr_line'
- 'inq_last_6mths'
- 'mths_since_last_delinq'
- 'mths_since_last_record'
- 'open_acc'
- 'pub_rec'
- 'revol_bal'
- 'revol_util'
- 'total_acc'
- 'initial_list_status'
- 'out_prncp'
- 'out_prncp_inv'
- 'total_pymnt'
- 'total_pymnt_inv'
- 'total_rec_prncp'
- 'total_rec_int'
- 'total_rec_late_fee'
- 'recoveries'
- 'collection_recovery_fee'
- 'last_pymnt_d'
- 'last_pymnt_amnt'
- 'next_pymnt_d'
- 'last_credit_pull_d'
- 'collections_12_mths_ex_med'
- 'mths_since_last_major_derog'
- 'policy_code'
- 'application_type'
- 'annual_inc_joint'
- 'dti_joint'
- 'verification_status_joint'
- 'acc_now_delinq'
- 'tot_coll_amt'
- 'tot_cur_bal'
- 'open_acc_6m'
- 'open_il_6m'
- 'open_il_12m'
- 'open_il_24m'
- 'mths_since_rcnt_il'
- 'total_bal_il'
- 'il_util'
- 'open_rv_12m'
- 'open_rv_24m'
- 'max_bal_bc'
- 'all_util'
- 'total_rev_hi_lim'
- 'inq_fi'
- 'total_cu_tl'
- 'inq_last_12m'
- 'acc_open_past_24mths'
- 'avg_cur_bal'
- 'bc_open_to_buy'
- 'bc_util'
- 'chargeoff_within_12_mths'
- 'delinq_amnt'
- 'mo_sin_old_il_acct'
- 'mo_sin_old_rev_tl_op'
- 'mo_sin_rcnt_rev_tl_op'
- 'mo_sin_rcnt_tl'
- 'mort_acc'
- 'mths_since_recent_bc'
- 'mths_since_recent_bc_dlq'
- 'mths_since_recent_inq'
- 'mths_since_recent_revol_delinq'
- 'num_accts_ever_120_pd'
- 'num_actv_bc_tl'
- 'num_actv_rev_tl'
- 'num_bc_sats'
- 'num_bc_tl'
- 'num_il_tl'
- 'num_op_rev_tl'
- 'num_rev_accts'
- 'num_rev_tl_bal_gt_0'
- 'num_sats'
- 'num_tl_120dpd_2m'
- 'num_tl_30dpd'
- 'num_tl_90g_dpd_24m'
- 'num_tl_op_past_12m'
- 'pct_tl_nvr_dlq'
- 'percent_bc_gt_75'
- 'pub_rec_bankruptcies'
- 'tax_liens'
- 'tot_hi_cred_lim'
- 'total_bal_ex_mort'
- 'total_bc_limit'
- 'total_il_high_credit_limit'
In [6]:
str(acc_dt)
Classes 'data.table' and 'data.frame': 1321847 obs. of 111 variables:
$ id : int 1077501 1077430 1077175 1076863 1075358 1075269 1069639 1072053 1071795 1071570 ...
$ member_id : num 1296599 1314167 1313524 1277178 1311748 ...
$ loan_amnt : num 5000 2500 2400 10000 3000 ...
$ funded_amnt : num 5000 2500 2400 10000 3000 ...
$ funded_amnt_inv : num 4975 2500 2400 10000 3000 ...
$ term : Factor w/ 2 levels " 36 months"," 60 months": 1 2 1 1 2 1 2 1 2 2 ...
$ int_rate : Factor w/ 593 levels " 5.32%"," 5.42%",..: 110 290 315 211 177 43 315 409 480 177 ...
$ installment : num 162.9 59.8 84.3 339.3 67.8 ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 3 2 1 3 5 6 2 ...
$ sub_grade : Factor w/ 35 levels "A1","A2","A3",..: 7 14 15 11 10 4 15 21 27 10 ...
$ emp_title : Factor w/ 381900 levels "'Property Manager",..: NA 286944 NA 12233 361022 365337 315798 217775 NA 329499 ...
$ emp_length : Factor w/ 12 levels "< 1 year","1 year",..: 3 1 3 3 2 5 10 11 6 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 6 6 6 5 6 ...
$ annual_inc : num 24000 30000 12252 49200 80000 ...
$ verification_status : Factor w/ 3 levels "Not Verified",..: 3 2 1 2 2 2 1 2 2 3 ...
$ issue_d : Factor w/ 115 levels "Apr-2008","Apr-2009",..: 24 24 24 24 24 24 24 24 24 24 ...
$ loan_status : Factor w/ 10 levels "Charged Off",..: 6 1 6 6 6 6 6 6 1 1 ...
$ pymnt_plan : Factor w/ 1 level "n": 1 1 1 1 1 1 1 1 1 1 ...
$ url : Factor w/ 1321847 levels "https://lendingclub.com/browse/loanDetail.action?loan_id=1000007",..: 21302 21266 21252 21230 20701 20693 19198 19819 19804 19665 ...
$ desc : Factor w/ 124502 levels "- Pay off Dell Financial: $ 1300.00 - Pay off IRS for 2005: $ 1400.00 - Pay off Mac Comp : $ 1700.00 - Pay off Bill Me Later"| __truncated__,..: 113416 113421 NA 113272 113248 NA 112361 111645 113246 111648 ...
$ purpose : Factor w/ 14 levels "car","credit_card",..: 2 1 12 10 10 14 3 1 12 10 ...
$ title : Factor w/ 63156 levels "'08 & '09 Roth IRA Investments",..: 10498 4975 52509 50883 50276 42604 36956 7265 24375 6112 ...
$ zip_code : Factor w/ 948 levels "007xx","008xx",..: 820 300 581 867 921 813 271 850 908 739 ...
$ addr_state : Factor w/ 51 levels "AK","AL","AR",..: 4 11 15 5 38 4 28 5 5 44 ...
$ dti : num 27.65 1 8.72 20 17.94 ...
$ delinq_2yrs : num 0 0 0 0 0 0 0 0 0 0 ...
$ earliest_cr_line : Factor w/ 718 levels "Apr-1955","Apr-1958",..: 272 42 589 216 283 592 351 294 47 709 ...
$ inq_last_6mths : num 1 5 2 1 0 3 1 2 2 0 ...
$ mths_since_last_delinq : num NA NA NA 35 38 NA NA NA NA NA ...
$ mths_since_last_record : num NA NA NA NA NA NA NA NA NA NA ...
$ open_acc : num 3 3 2 10 15 9 7 4 11 2 ...
$ pub_rec : num 0 0 0 0 0 0 0 0 0 0 ...
$ revol_bal : num 13648 1687 2956 5598 27783 ...
$ revol_util : Factor w/ 1386 levels "0%","0.01%","0.03%",..: 1209 1279 1372 456 860 542 1228 1247 601 646 ...
$ total_acc : num 9 4 10 37 38 12 11 4 13 3 ...
$ initial_list_status : Factor w/ 2 levels "f","w": 1 1 1 1 1 1 1 1 1 1 ...
$ out_prncp : num 0 0 0 0 0 0 0 0 0 0 ...
$ out_prncp_inv : num 0 0 0 0 0 0 0 0 0 0 ...
$ total_pymnt : num 5863 1015 3006 12232 4067 ...
$ total_pymnt_inv : num 5834 1015 3006 12232 4067 ...
$ total_rec_prncp : num 5000 456 2400 10000 3000 ...
$ total_rec_int : num 863 435 606 2215 1067 ...
$ total_rec_late_fee : num 0 0 0 17 0 ...
$ recoveries : num 0 123 0 0 0 ...
$ collection_recovery_fee : num 0 1.11 0 0 0 0 0 0 2.09 2.52 ...
$ last_pymnt_d : Factor w/ 110 levels "Apr-2008","Apr-2009",..: 45 6 63 45 47 45 83 45 5 88 ...
$ last_pymnt_amnt : num 171.6 119.7 649.9 357.5 67.3 ...
$ next_pymnt_d : Factor w/ 103 levels "Apr-2008","Apr-2009",..: NA NA NA NA NA NA NA NA NA NA ...
$ last_credit_pull_d : Factor w/ 115 levels "Apr-2009","Apr-2010",..: 47 105 47 8 47 46 115 26 105 28 ...
$ collections_12_mths_ex_med : num 0 0 0 0 0 0 0 0 0 0 ...
$ mths_since_last_major_derog : num NA NA NA NA NA NA NA NA NA NA ...
$ policy_code : num 1 1 1 1 1 1 1 1 1 1 ...
$ application_type : Factor w/ 3 levels "DIRECT_PAY","INDIVIDUAL",..: 2 2 2 2 2 2 2 2 2 2 ...
$ annual_inc_joint : num NA NA NA NA NA NA NA NA NA NA ...
$ dti_joint : num NA NA NA NA NA NA NA NA NA NA ...
$ verification_status_joint : Factor w/ 3 levels "Not Verified",..: NA NA NA NA NA NA NA NA NA NA ...
$ acc_now_delinq : num 0 0 0 0 0 0 0 0 0 0 ...
$ tot_coll_amt : num NA NA NA NA NA NA NA NA NA NA ...
$ tot_cur_bal : num NA NA NA NA NA NA NA NA NA NA ...
$ open_acc_6m : num NA NA NA NA NA NA NA NA NA NA ...
$ open_il_6m : num NA NA NA NA NA NA NA NA NA NA ...
$ open_il_12m : num NA NA NA NA NA NA NA NA NA NA ...
$ open_il_24m : num NA NA NA NA NA NA NA NA NA NA ...
$ mths_since_rcnt_il : num NA NA NA NA NA NA NA NA NA NA ...
$ total_bal_il : num NA NA NA NA NA NA NA NA NA NA ...
$ il_util : num NA NA NA NA NA NA NA NA NA NA ...
$ open_rv_12m : num NA NA NA NA NA NA NA NA NA NA ...
$ open_rv_24m : num NA NA NA NA NA NA NA NA NA NA ...
$ max_bal_bc : num NA NA NA NA NA NA NA NA NA NA ...
$ all_util : num NA NA NA NA NA NA NA NA NA NA ...
$ total_rev_hi_lim : num NA NA NA NA NA NA NA NA NA NA ...
$ inq_fi : num NA NA NA NA NA NA NA NA NA NA ...
$ total_cu_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ inq_last_12m : num NA NA NA NA NA NA NA NA NA NA ...
$ acc_open_past_24mths : num NA NA NA NA NA NA NA NA NA NA ...
$ avg_cur_bal : num NA NA NA NA NA NA NA NA NA NA ...
$ bc_open_to_buy : num NA NA NA NA NA NA NA NA NA NA ...
$ bc_util : num NA NA NA NA NA NA NA NA NA NA ...
$ chargeoff_within_12_mths : num 0 0 0 0 0 0 0 0 0 0 ...
$ delinq_amnt : num 0 0 0 0 0 0 0 0 0 0 ...
$ mo_sin_old_il_acct : num NA NA NA NA NA NA NA NA NA NA ...
$ mo_sin_old_rev_tl_op : num NA NA NA NA NA NA NA NA NA NA ...
$ mo_sin_rcnt_rev_tl_op : num NA NA NA NA NA NA NA NA NA NA ...
$ mo_sin_rcnt_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ mort_acc : num NA NA NA NA NA NA NA NA NA NA ...
$ mths_since_recent_bc : num NA NA NA NA NA NA NA NA NA NA ...
$ mths_since_recent_bc_dlq : num NA NA NA NA NA NA NA NA NA NA ...
$ mths_since_recent_inq : num NA NA NA NA NA NA NA NA NA NA ...
$ mths_since_recent_revol_delinq: num NA NA NA NA NA NA NA NA NA NA ...
$ num_accts_ever_120_pd : num NA NA NA NA NA NA NA NA NA NA ...
$ num_actv_bc_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ num_actv_rev_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ num_bc_sats : num NA NA NA NA NA NA NA NA NA NA ...
$ num_bc_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ num_il_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ num_op_rev_tl : num NA NA NA NA NA NA NA NA NA NA ...
$ num_rev_accts : num NA NA NA NA NA NA NA NA NA NA ...
$ num_rev_tl_bal_gt_0 : num NA NA NA NA NA NA NA NA NA NA ...
$ num_sats : num NA NA NA NA NA NA NA NA NA NA ...
[list output truncated]
- attr(*, ".internal.selfref")=<externalptr>
In [7]:
# outliers are screwing it up!
hist(acc_dt[, dti])
In [8]:
# from here: http://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
In [9]:
dti_no_outliers <- remove_outliers(acc_dt[, dti])
In [10]:
hist(dti_no_outliers)
Content source: nateGeorge/lending_club_EDA
Similar notebooks: