Check out the keyboard shortcuts by going to help -> keyboard shortcuts.
I frequently use esc to exit a cell, a to add a cell above, b to add a cell below, enter to edit a cell, shift+enter to run a cell, arrow keys to go up and down between cells, ctrl+m to change a cell to markdown (like this cell), and sometimes shift+m to merge cells.
Two very important libraries in Python for data science are pandas and numpy. Pandas was made by Wes McKinney, initially for financial stuff. He also wrote a book on it in 2012 (now getting kind of dated).
Be careful of the order in which you run things (cells) in Jupyter notebooks. If you set a variable in one cell, then go back to a cell above it and change the variable in a different way, it can make things confusing. After completing a notebook, it's usually a good idea to go to Kernel -> Restart and Run All in the menu, to make sure everything is still working like you expect.
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
# "magic" command to make plots show up in the notebook
%matplotlib inline
In [2]:
# this should be the path to the data. Adapt for your system
win_filepath = 'C:/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
# same path from virtualbox shared folder
linux_filepath = '/media/sf_C_DRIVE/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
In [3]:
# gzip is a type of compression...it cuts the file size in less than half in this case.
# pandas will automatically detect the compression type
# check out the rediculous amount of options for reading csvs:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
# we're looking at some lendingclub data here. It's a peer-to-peer loan company
acc_df = pd.read_csv(linux_filepath + 'accepted_2007_to_2016.csv.gz')
# this is a dataset with rejected loans from lendingclub
# rej_df = pd.read_csv(linux_filepath + 'rejected_2007_to_2016.csv.gz')
# the files are read into a dataframe
/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (18,46,54) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
In [4]:
print type(acc_df)
<class 'pandas.core.frame.DataFrame'>
In [5]:
# this is how we can see how many entries are in the dataframe (df)
# it also works for numpy arrays
acc_df.shape
Out[5]:
(1321847, 110)
In [6]:
# this is how many rows pandas will show by default with methods like pd.dataframe.head()
pd.options.display.max_rows
Out[6]:
60
In [7]:
# we want to increase it, because in this case there are a lot of column names
pd.options.display.max_rows = 1000
In [8]:
# the .T is transposing the matrix.
# We do this so the 111 column dataframe is easier to read (easier to scroll down than sideways)
# .head() shows the first few rows of the data
acc_df.head().T
Out[8]:
0
1
2
3
4
id
1077501
1077430
1077175
1076863
1075358
member_id
1.2966e+06
1.31417e+06
1.31352e+06
1.27718e+06
1.31175e+06
loan_amnt
5000
2500
2400
10000
3000
funded_amnt
5000
2500
2400
10000
3000
funded_amnt_inv
4975
2500
2400
10000
3000
term
36 months
60 months
36 months
36 months
60 months
int_rate
10.65
15.27
15.96
13.49
12.69
installment
162.87
59.83
84.33
339.31
67.79
grade
B
C
C
C
B
sub_grade
B2
C4
C5
C1
B5
emp_title
NaN
Ryder
NaN
AIR RESOURCES BOARD
University Medical Group
emp_length
10+ years
< 1 year
10+ years
10+ years
1 year
home_ownership
RENT
RENT
RENT
RENT
RENT
annual_inc
24000
30000
12252
49200
80000
verification_status
Verified
Source Verified
Not Verified
Source Verified
Source Verified
issue_d
Dec-2011
Dec-2011
Dec-2011
Dec-2011
Dec-2011
loan_status
Fully Paid
Charged Off
Fully Paid
Fully Paid
Fully Paid
pymnt_plan
n
n
n
n
n
desc
Borrower added on 12/22/11 > I need to upgra...
Borrower added on 12/22/11 > I plan to use t...
NaN
Borrower added on 12/21/11 > to pay for prop...
Borrower added on 12/21/11 > I plan on combi...
purpose
credit_card
car
small_business
other
other
title
Computer
bike
real estate business
personel
Personal
zip_code
860xx
309xx
606xx
917xx
972xx
addr_state
AZ
GA
IL
CA
OR
dti
27.65
1
8.72
20
17.94
delinq_2yrs
0
0
0
0
0
earliest_cr_line
Jan-1985
Apr-1999
Nov-2001
Feb-1996
Jan-1996
inq_last_6mths
1
5
2
1
0
mths_since_last_delinq
NaN
NaN
NaN
35
38
mths_since_last_record
NaN
NaN
NaN
NaN
NaN
open_acc
3
3
2
10
15
pub_rec
0
0
0
0
0
revol_bal
13648
1687
2956
5598
27783
revol_util
83.7
9.4
98.5
21
53.9
total_acc
9
4
10
37
38
initial_list_status
f
f
f
f
f
out_prncp
0
0
0
0
0
out_prncp_inv
0
0
0
0
0
total_pymnt
5863.16
1014.53
3005.67
12231.9
4066.91
total_pymnt_inv
5833.84
1014.53
3005.67
12231.9
4066.91
total_rec_prncp
5000
456.46
2400
10000
3000
total_rec_int
863.16
435.17
605.67
2214.92
1066.91
total_rec_late_fee
0
0
0
16.97
0
recoveries
0
122.9
0
0
0
collection_recovery_fee
0
1.11
0
0
0
last_pymnt_d
Jan-2015
Apr-2013
Jun-2014
Jan-2015
Jan-2017
last_pymnt_amnt
171.62
119.66
649.91
357.48
67.3
next_pymnt_d
NaN
NaN
NaN
NaN
NaN
last_credit_pull_d
Jan-2017
Oct-2016
Jan-2017
Apr-2016
Jan-2017
collections_12_mths_ex_med
0
0
0
0
0
mths_since_last_major_derog
NaN
NaN
NaN
NaN
NaN
policy_code
1
1
1
1
1
application_type
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
annual_inc_joint
NaN
NaN
NaN
NaN
NaN
dti_joint
NaN
NaN
NaN
NaN
NaN
verification_status_joint
NaN
NaN
NaN
NaN
NaN
acc_now_delinq
0
0
0
0
0
tot_coll_amt
NaN
NaN
NaN
NaN
NaN
tot_cur_bal
NaN
NaN
NaN
NaN
NaN
open_acc_6m
NaN
NaN
NaN
NaN
NaN
open_il_6m
NaN
NaN
NaN
NaN
NaN
open_il_12m
NaN
NaN
NaN
NaN
NaN
open_il_24m
NaN
NaN
NaN
NaN
NaN
mths_since_rcnt_il
NaN
NaN
NaN
NaN
NaN
total_bal_il
NaN
NaN
NaN
NaN
NaN
il_util
NaN
NaN
NaN
NaN
NaN
open_rv_12m
NaN
NaN
NaN
NaN
NaN
open_rv_24m
NaN
NaN
NaN
NaN
NaN
max_bal_bc
NaN
NaN
NaN
NaN
NaN
all_util
NaN
NaN
NaN
NaN
NaN
total_rev_hi_lim
NaN
NaN
NaN
NaN
NaN
inq_fi
NaN
NaN
NaN
NaN
NaN
total_cu_tl
NaN
NaN
NaN
NaN
NaN
inq_last_12m
NaN
NaN
NaN
NaN
NaN
acc_open_past_24mths
NaN
NaN
NaN
NaN
NaN
avg_cur_bal
NaN
NaN
NaN
NaN
NaN
bc_open_to_buy
NaN
NaN
NaN
NaN
NaN
bc_util
NaN
NaN
NaN
NaN
NaN
chargeoff_within_12_mths
0
0
0
0
0
delinq_amnt
0
0
0
0
0
mo_sin_old_il_acct
NaN
NaN
NaN
NaN
NaN
mo_sin_old_rev_tl_op
NaN
NaN
NaN
NaN
NaN
mo_sin_rcnt_rev_tl_op
NaN
NaN
NaN
NaN
NaN
mo_sin_rcnt_tl
NaN
NaN
NaN
NaN
NaN
mort_acc
NaN
NaN
NaN
NaN
NaN
mths_since_recent_bc
NaN
NaN
NaN
NaN
NaN
mths_since_recent_bc_dlq
NaN
NaN
NaN
NaN
NaN
mths_since_recent_inq
NaN
NaN
NaN
NaN
NaN
mths_since_recent_revol_delinq
NaN
NaN
NaN
NaN
NaN
num_accts_ever_120_pd
NaN
NaN
NaN
NaN
NaN
num_actv_bc_tl
NaN
NaN
NaN
NaN
NaN
num_actv_rev_tl
NaN
NaN
NaN
NaN
NaN
num_bc_sats
NaN
NaN
NaN
NaN
NaN
num_bc_tl
NaN
NaN
NaN
NaN
NaN
num_il_tl
NaN
NaN
NaN
NaN
NaN
num_op_rev_tl
NaN
NaN
NaN
NaN
NaN
num_rev_accts
NaN
NaN
NaN
NaN
NaN
num_rev_tl_bal_gt_0
NaN
NaN
NaN
NaN
NaN
num_sats
NaN
NaN
NaN
NaN
NaN
num_tl_120dpd_2m
NaN
NaN
NaN
NaN
NaN
num_tl_30dpd
NaN
NaN
NaN
NaN
NaN
num_tl_90g_dpd_24m
NaN
NaN
NaN
NaN
NaN
num_tl_op_past_12m
NaN
NaN
NaN
NaN
NaN
pct_tl_nvr_dlq
NaN
NaN
NaN
NaN
NaN
percent_bc_gt_75
NaN
NaN
NaN
NaN
NaN
pub_rec_bankruptcies
0
0
0
0
0
tax_liens
0
0
0
0
0
tot_hi_cred_lim
NaN
NaN
NaN
NaN
NaN
total_bal_ex_mort
NaN
NaN
NaN
NaN
NaN
total_bc_limit
NaN
NaN
NaN
NaN
NaN
total_il_high_credit_limit
NaN
NaN
NaN
NaN
NaN
In [9]:
# .tail() shows the last few rows
acc_df.tail().T
Out[9]:
1321842
1321843
1321844
1321845
1321846
id
89885898
88977788
88985880
88224441
88215728
member_id
9.62797e+07
9.53207e+07
9.53287e+07
9.45252e+07
9.45164e+07
loan_amnt
24000
24000
40000
24000
14000
funded_amnt
24000
24000
40000
24000
14000
funded_amnt_inv
24000
24000
40000
24000
14000
term
60 months
60 months
60 months
60 months
60 months
int_rate
12.79
10.49
10.49
14.49
14.49
installment
543.5
515.74
859.56
564.56
329.33
grade
C
B
B
C
C
sub_grade
C1
B3
B3
C4
C4
emp_title
Unit Operator
Database Administrator
Vice President
Program Manager
Customer Service Technician
emp_length
7 years
10+ years
9 years
6 years
10+ years
home_ownership
MORTGAGE
MORTGAGE
MORTGAGE
RENT
MORTGAGE
annual_inc
95000
108000
227000
110000
95000
verification_status
Source Verified
Not Verified
Verified
Not Verified
Verified
issue_d
Oct-2016
Oct-2016
Oct-2016
Oct-2016
Oct-2016
loan_status
Current
Current
Current
Current
Current
pymnt_plan
n
n
n
n
n
desc
NaN
NaN
NaN
NaN
NaN
purpose
home_improvement
debt_consolidation
debt_consolidation
debt_consolidation
debt_consolidation
title
Home improvement
Debt consolidation
NaN
Debt consolidation
NaN
zip_code
356xx
840xx
907xx
334xx
770xx
addr_state
AL
UT
CA
FL
TX
dti
19.61
34.94
12.75
18.3
23.36
delinq_2yrs
0
0
7
0
0
earliest_cr_line
Dec-1999
Feb-1991
Feb-1995
Jul-1999
Jun-1996
inq_last_6mths
0
1
1
0
1
mths_since_last_delinq
NaN
60
9
67
37
mths_since_last_record
NaN
69
NaN
72
NaN
open_acc
5
24
5
10
8
pub_rec
0
1
0
1
0
revol_bal
49431
21665
8633
17641
7662
revol_util
84.4
39
64.9
68.1
54
total_acc
54
58
37
31
22
initial_list_status
f
f
f
f
w
out_prncp
23127.7
23074.1
38456.9
23165.7
13499.3
out_prncp_inv
23127.7
23074.1
38456.9
23165.7
13499.3
total_pymnt
1613.45
1533.23
2555.37
1674.36
990.39
total_pymnt_inv
1613.45
1533.23
2555.37
1674.36
990.39
total_rec_prncp
872.33
925.87
1543.09
834.27
500.67
total_rec_int
741.12
607.36
1012.28
840.09
489.72
total_rec_late_fee
0
0
0
0
0
recoveries
0
0
0
0
0
collection_recovery_fee
0
0
0
0
0
last_pymnt_d
Jan-2017
Jan-2017
Jan-2017
Jan-2017
Jan-2017
last_pymnt_amnt
543.5
515.74
859.56
564.56
329.33
next_pymnt_d
Feb-2017
Feb-2017
Feb-2017
Feb-2017
Feb-2017
last_credit_pull_d
Jan-2017
Jan-2017
Jan-2017
Jan-2017
Jan-2017
collections_12_mths_ex_med
0
0
0
0
0
mths_since_last_major_derog
NaN
60
10
67
37
policy_code
1
1
1
1
1
application_type
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
INDIVIDUAL
annual_inc_joint
NaN
NaN
NaN
NaN
NaN
dti_joint
NaN
NaN
NaN
NaN
NaN
verification_status_joint
NaN
NaN
NaN
NaN
NaN
acc_now_delinq
0
0
0
0
0
tot_coll_amt
0
0
0
0
0
tot_cur_bal
104422
140118
28398
62426
143066
open_acc_6m
2
0
0
0
0
open_il_6m
2
4
2
2
1
open_il_12m
1
0
0
0
1
open_il_24m
3
5
1
2
2
mths_since_rcnt_il
3
16
15
20
7
total_bal_il
54991
118453
19765
44785
32071
il_util
106
77
46
78
94
open_rv_12m
2
0
0
1
0
open_rv_24m
2
2
0
5
1
max_bal_bc
0
4090
5141
6172
4093
all_util
103
59
51
73
84
total_rev_hi_lim
50500
55600
13300
25900
14200
inq_fi
1
1
3
0
3
total_cu_tl
19
12
0
0
1
inq_last_12m
1
4
2
1
2
acc_open_past_24mths
5
7
2
7
3
avg_cur_bal
26106
6369
5680
6243
17883
bc_open_to_buy
NaN
15307
4070
4660
1838
bc_util
NaN
45.1
66.9
77.5
80.7
chargeoff_within_12_mths
0
0
0
0
0
delinq_amnt
0
0
0
0
0
mo_sin_old_il_acct
201
166
154
132
149
mo_sin_old_rev_tl_op
88
307
258
206
243
mo_sin_rcnt_rev_tl_op
3
15
33
9
22
mo_sin_rcnt_tl
3
15
15
9
7
mort_acc
0
4
3
2
1
mths_since_recent_bc
NaN
15
41
9
75
mths_since_recent_bc_dlq
NaN
60
9
NaN
39
mths_since_recent_inq
4
4
1
9
6
mths_since_recent_revol_delinq
NaN
60
9
NaN
37
num_accts_ever_120_pd
0
12
6
1
4
num_actv_bc_tl
0
5
2
5
3
num_actv_rev_tl
2
12
3
7
3
num_bc_sats
0
7
2
5
4
num_bc_tl
3
18
15
15
12
num_il_tl
43
17
9
4
3
num_op_rev_tl
3
19
3
8
5
num_rev_accts
11
37
23
24
17
num_rev_tl_bal_gt_0
2
12
3
7
3
num_sats
5
24
5
10
8
num_tl_120dpd_2m
0
0
0
0
0
num_tl_30dpd
0
0
0
0
0
num_tl_90g_dpd_24m
0
0
7
0
0
num_tl_op_past_12m
3
0
0
1
1
pct_tl_nvr_dlq
100
75.9
75.7
96.2
81.8
percent_bc_gt_75
NaN
42.9
50
40
50
pub_rec_bankruptcies
0
0
0
1
0
tax_liens
0
1
0
0
0
tot_hi_cred_lim
102517
227883
55970
84664
163804
total_bal_ex_mort
104422
140118
28398
62426
44215
total_bc_limit
0
27900
12300
20700
9500
total_il_high_credit_limit
52017
172283
42670
58764
34169
In [10]:
# .info() tells us the datatype(int64, `object` is a string)
# and will also tell us the number of non-null (not missing) data points for each column
# because this dataframe is so large, we have to force it to show the datatypes and non-null numbers with the arguments
acc_df.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1321847 entries, 0 to 1321846
Data columns (total 110 columns):
id 1321847 non-null int64
member_id 1321847 non-null float64
loan_amnt 1321847 non-null float64
funded_amnt 1321847 non-null float64
funded_amnt_inv 1321847 non-null float64
term 1321847 non-null object
int_rate 1321847 non-null float64
installment 1321847 non-null float64
grade 1321847 non-null object
sub_grade 1321847 non-null object
emp_title 1241887 non-null object
emp_length 1321847 non-null object
home_ownership 1321847 non-null object
annual_inc 1321843 non-null float64
verification_status 1321847 non-null object
issue_d 1321847 non-null object
loan_status 1321847 non-null object
pymnt_plan 1321847 non-null object
desc 126068 non-null object
purpose 1321847 non-null object
title 1298523 non-null object
zip_code 1321847 non-null object
addr_state 1321847 non-null object
dti 1321847 non-null float64
delinq_2yrs 1321818 non-null float64
earliest_cr_line 1321818 non-null object
inq_last_6mths 1321817 non-null float64
mths_since_last_delinq 662677 non-null float64
mths_since_last_record 219977 non-null float64
open_acc 1321818 non-null float64
pub_rec 1321818 non-null float64
revol_bal 1321847 non-null float64
revol_util 1321081 non-null float64
total_acc 1321818 non-null float64
initial_list_status 1321847 non-null object
out_prncp 1321847 non-null float64
out_prncp_inv 1321847 non-null float64
total_pymnt 1321847 non-null float64
total_pymnt_inv 1321847 non-null float64
total_rec_prncp 1321847 non-null float64
total_rec_int 1321847 non-null float64
total_rec_late_fee 1321847 non-null float64
recoveries 1321847 non-null float64
collection_recovery_fee 1321847 non-null float64
last_pymnt_d 1314178 non-null object
last_pymnt_amnt 1321847 non-null float64
next_pymnt_d 829089 non-null object
last_credit_pull_d 1321770 non-null object
collections_12_mths_ex_med 1321702 non-null float64
mths_since_last_major_derog 347331 non-null float64
policy_code 1321847 non-null float64
application_type 1321847 non-null object
annual_inc_joint 9300 non-null float64
dti_joint 9296 non-null float64
verification_status_joint 9300 non-null object
acc_now_delinq 1321818 non-null float64
tot_coll_amt 1251571 non-null float64
tot_cur_bal 1251571 non-null float64
open_acc_6m 455717 non-null float64
open_il_6m 455718 non-null float64
open_il_12m 455718 non-null float64
open_il_24m 455718 non-null float64
mths_since_rcnt_il 443459 non-null float64
total_bal_il 455718 non-null float64
il_util 395615 non-null float64
open_rv_12m 455718 non-null float64
open_rv_24m 455718 non-null float64
max_bal_bc 455718 non-null float64
all_util 455695 non-null float64
total_rev_hi_lim 1251571 non-null float64
inq_fi 455718 non-null float64
total_cu_tl 455717 non-null float64
inq_last_12m 455717 non-null float64
acc_open_past_24mths 1271817 non-null float64
avg_cur_bal 1251559 non-null float64
bc_open_to_buy 1259244 non-null float64
bc_util 1258527 non-null float64
chargeoff_within_12_mths 1321702 non-null float64
delinq_amnt 1321818 non-null float64
mo_sin_old_il_acct 1214312 non-null float64
mo_sin_old_rev_tl_op 1251570 non-null float64
mo_sin_rcnt_rev_tl_op 1251570 non-null float64
mo_sin_rcnt_tl 1251571 non-null float64
mort_acc 1271817 non-null float64
mths_since_recent_bc 1260066 non-null float64
mths_since_recent_bc_dlq 318765 non-null float64
mths_since_recent_inq 1139670 non-null float64
mths_since_recent_revol_delinq 448529 non-null float64
num_accts_ever_120_pd 1251571 non-null float64
num_actv_bc_tl 1251571 non-null float64
num_actv_rev_tl 1251571 non-null float64
num_bc_sats 1263257 non-null float64
num_bc_tl 1251571 non-null float64
num_il_tl 1251571 non-null float64
num_op_rev_tl 1251571 non-null float64
num_rev_accts 1251570 non-null float64
num_rev_tl_bal_gt_0 1251571 non-null float64
num_sats 1263257 non-null float64
num_tl_120dpd_2m 1202848 non-null float64
num_tl_30dpd 1251571 non-null float64
num_tl_90g_dpd_24m 1251571 non-null float64
num_tl_op_past_12m 1251571 non-null float64
pct_tl_nvr_dlq 1251418 non-null float64
percent_bc_gt_75 1258823 non-null float64
pub_rec_bankruptcies 1320482 non-null float64
tax_liens 1321742 non-null float64
tot_hi_cred_lim 1251571 non-null float64
total_bal_ex_mort 1271817 non-null float64
total_bc_limit 1271817 non-null float64
total_il_high_credit_limit 1251571 non-null float64
dtypes: float64(87), int64(1), object(22)
memory usage: 1.1+ GB
In [11]:
# shows some common summary statistics
# again, transposing with .T to make it easier to read
acc_df.describe().T
Out[11]:
count
mean
std
min
25%
50%
75%
max
id
1321847.0
4.872486e+07
3.024669e+07
54734.00
1.858618e+07
54363305.00
7.457674e+07
9.645316e+07
member_id
1321847.0
5.236894e+07
3.218029e+07
70473.00
2.071726e+07
57934813.00
7.995784e+07
1.035709e+08
loan_amnt
1321847.0
1.474822e+04
8.622143e+03
500.00
8.000000e+03
12900.00
2.000000e+04
4.000000e+04
funded_amnt
1321847.0
1.473922e+04
8.618488e+03
500.00
8.000000e+03
12850.00
2.000000e+04
4.000000e+04
funded_amnt_inv
1321847.0
1.471071e+04
8.625729e+03
0.00
8.000000e+03
12800.00
2.000000e+04
4.000000e+04
int_rate
1321847.0
1.317963e+01
4.570613e+00
5.32
9.750000e+00
12.79
1.580000e+01
3.099000e+01
installment
1321847.0
4.391420e+02
2.532324e+02
15.67
2.559000e+02
380.26
5.783100e+02
1.584900e+03
annual_inc
1321843.0
7.649594e+04
6.911192e+04
0.00
4.600000e+04
65000.00
9.000000e+04
9.573072e+06
dti
1321847.0
1.887239e+01
7.098519e+01
-1.00
1.204000e+01
17.83
2.424000e+01
9.999000e+03
delinq_2yrs
1321818.0
3.295938e-01
8.920428e-01
0.00
0.000000e+00
0.00
0.000000e+00
3.900000e+01
inq_last_6mths
1321817.0
6.508745e-01
9.576629e-01
0.00
0.000000e+00
0.00
1.000000e+00
3.300000e+01
mths_since_last_delinq
662677.0
3.389481e+01
2.189243e+01
0.00
1.500000e+01
31.00
4.900000e+01
1.950000e+02
mths_since_last_record
219977.0
6.898570e+01
2.686579e+01
0.00
5.100000e+01
70.00
8.800000e+01
1.290000e+02
open_acc
1321818.0
1.165590e+01
5.469410e+00
0.00
8.000000e+00
11.00
1.400000e+01
9.700000e+01
pub_rec
1321818.0
2.150281e-01
6.171629e-01
0.00
0.000000e+00
0.00
0.000000e+00
8.600000e+01
revol_bal
1321847.0
1.692801e+04
2.293390e+04
0.00
6.295000e+03
11652.00
2.058400e+04
2.904836e+06
revol_util
1321081.0
5.362472e+01
2.404374e+01
0.00
3.580000e+01
54.20
7.220000e+01
8.923000e+02
total_acc
1321818.0
2.502217e+01
1.190355e+01
1.00
1.600000e+01
23.00
3.200000e+01
1.760000e+02
out_prncp
1321847.0
6.755341e+03
7.938161e+03
0.00
0.000000e+00
4124.87
1.120028e+04
4.156034e+04
out_prncp_inv
1321847.0
6.752559e+03
7.935202e+03
0.00
0.000000e+00
4123.36
1.119756e+04
4.156034e+04
total_pymnt
1321847.0
9.250698e+03
8.558047e+03
0.00
2.996065e+03
6662.34
1.271974e+04
5.982952e+04
total_pymnt_inv
1321847.0
9.223675e+03
8.545527e+03
0.00
2.978780e+03
6635.78
1.268514e+04
5.970131e+04
total_rec_prncp
1321847.0
7.139344e+03
7.236062e+03
0.00
1.942560e+03
4705.53
1.000000e+04
4.000000e+04
total_rec_int
1321847.0
2.040949e+03
2.304573e+03
0.00
5.811850e+02
1296.70
2.617710e+03
2.635950e+04
total_rec_late_fee
1321847.0
5.746782e-01
5.254665e+00
0.00
0.000000e+00
0.00
0.000000e+00
3.676000e+02
recoveries
1321847.0
6.982961e+01
4.927118e+02
0.00
0.000000e+00
0.00
0.000000e+00
3.944437e+04
collection_recovery_fee
1321847.0
9.962762e+00
8.154662e+01
0.00
0.000000e+00
0.00
0.000000e+00
7.002190e+03
last_pymnt_amnt
1321847.0
2.619154e+03
5.356282e+03
0.00
2.886600e+02
489.45
1.072560e+03
4.214853e+04
collections_12_mths_ex_med
1321702.0
1.678518e-02
1.467218e-01
0.00
0.000000e+00
0.00
0.000000e+00
2.000000e+01
mths_since_last_major_derog
347331.0
4.441145e+01
2.228005e+01
0.00
2.700000e+01
44.00
6.200000e+01
1.970000e+02
policy_code
1321847.0
1.000000e+00
0.000000e+00
1.00
1.000000e+00
1.00
1.000000e+00
1.000000e+00
annual_inc_joint
9300.0
1.115636e+05
5.019977e+04
11000.00
7.849250e+04
103000.00
1.340000e+05
1.050000e+06
dti_joint
9296.0
1.829942e+01
7.020419e+00
0.32
1.332000e+01
18.01
2.290000e+01
6.949000e+01
acc_now_delinq
1321818.0
5.581706e-03
8.119766e-02
0.00
0.000000e+00
0.00
0.000000e+00
1.400000e+01
tot_coll_amt
1251571.0
2.411676e+02
8.444208e+03
0.00
0.000000e+00
0.00
0.000000e+00
9.152545e+06
tot_cur_bal
1251571.0
1.408428e+05
1.563637e+05
0.00
3.001700e+04
80847.00
2.097875e+05
8.000078e+06
open_acc_6m
455717.0
1.008288e+00
1.188975e+00
0.00
0.000000e+00
1.00
2.000000e+00
1.800000e+01
open_il_6m
455718.0
2.831942e+00
3.052855e+00
0.00
1.000000e+00
2.00
3.000000e+00
4.800000e+01
open_il_12m
455718.0
7.279414e-01
9.787083e-01
0.00
0.000000e+00
0.00
1.000000e+00
2.500000e+01
open_il_24m
455718.0
1.629345e+00
1.657855e+00
0.00
0.000000e+00
1.00
2.000000e+00
5.100000e+01
mths_since_rcnt_il
443459.0
2.127134e+01
2.662149e+01
0.00
7.000000e+00
13.00
2.400000e+01
5.110000e+02
total_bal_il
455718.0
3.561987e+04
4.257060e+04
0.00
9.495000e+03
23781.00
4.636400e+04
1.547285e+06
il_util
395615.0
7.098847e+01
2.317377e+01
0.00
5.800000e+01
74.00
8.700000e+01
1.000000e+03
open_rv_12m
455718.0
1.373542e+00
1.536616e+00
0.00
0.000000e+00
1.00
2.000000e+00
2.800000e+01
open_rv_24m
455718.0
2.903416e+00
2.631446e+00
0.00
1.000000e+00
2.00
4.000000e+00
6.000000e+01
max_bal_bc
455718.0
5.777328e+03
5.619377e+03
0.00
2.362000e+03
4385.00
7.460000e+03
7.768430e+05
all_util
455695.0
6.022548e+01
2.015962e+01
0.00
4.700000e+01
61.00
7.400000e+01
2.040000e+02
total_rev_hi_lim
1251571.0
3.276469e+04
3.669014e+04
0.00
1.410000e+04
24100.00
4.060000e+04
9.999999e+06
inq_fi
455718.0
9.650091e-01
1.497837e+00
0.00
0.000000e+00
0.00
1.000000e+00
4.800000e+01
total_cu_tl
455717.0
1.498709e+00
2.722067e+00
0.00
0.000000e+00
0.00
2.000000e+00
1.110000e+02
inq_last_12m
455717.0
2.166066e+00
2.459050e+00
0.00
0.000000e+00
1.00
3.000000e+00
4.900000e+01
acc_open_past_24mths
1271817.0
4.551259e+00
3.120103e+00
0.00
2.000000e+00
4.00
6.000000e+00
6.400000e+01
avg_cur_bal
1251559.0
1.336583e+04
1.600496e+04
0.00
3.149000e+03
7424.00
1.851800e+04
9.580840e+05
bc_open_to_buy
1259244.0
9.547611e+03
1.458259e+04
0.00
1.345000e+03
4359.00
1.152000e+04
7.111400e+05
bc_util
1258527.0
6.210613e+01
2.751823e+01
0.00
4.160000e+01
65.80
8.620000e+01
3.396000e+02
chargeoff_within_12_mths
1321702.0
8.962686e-03
1.083899e-01
0.00
0.000000e+00
0.00
0.000000e+00
1.000000e+01
delinq_amnt
1321818.0
1.561729e+01
8.128777e+02
0.00
0.000000e+00
0.00
0.000000e+00
1.854080e+05
mo_sin_old_il_acct
1214312.0
1.270696e+02
5.201166e+01
0.00
1.000000e+02
130.00
1.530000e+02
7.240000e+02
mo_sin_old_rev_tl_op
1251570.0
1.843689e+02
9.484415e+01
2.00
1.190000e+02
167.00
2.340000e+02
9.010000e+02
mo_sin_rcnt_rev_tl_op
1251570.0
1.344386e+01
1.680114e+01
0.00
4.000000e+00
8.00
1.600000e+01
4.380000e+02
mo_sin_rcnt_tl
1251571.0
8.086934e+00
9.065880e+00
0.00
3.000000e+00
6.00
1.000000e+01
3.140000e+02
mort_acc
1271817.0
1.689627e+00
2.013544e+00
0.00
0.000000e+00
1.00
3.000000e+00
6.100000e+01
mths_since_recent_bc
1260066.0
2.461952e+01
3.149733e+01
0.00
6.000000e+00
14.00
2.900000e+01
6.390000e+02
mths_since_recent_bc_dlq
318765.0
3.941078e+01
2.275385e+01
0.00
2.000000e+01
38.00
5.800000e+01
1.950000e+02
mths_since_recent_inq
1139670.0
6.833985e+00
5.923765e+00
0.00
2.000000e+00
5.00
1.000000e+01
2.500000e+01
mths_since_recent_revol_delinq
448529.0
3.546095e+01
2.241687e+01
0.00
1.700000e+01
32.00
5.200000e+01
1.970000e+02
num_accts_ever_120_pd
1251571.0
5.053049e-01
1.309245e+00
0.00
0.000000e+00
0.00
0.000000e+00
5.100000e+01
num_actv_bc_tl
1251571.0
3.714223e+00
2.264803e+00
0.00
2.000000e+00
3.00
5.000000e+00
4.700000e+01
num_actv_rev_tl
1251571.0
5.778005e+00
3.334688e+00
0.00
3.000000e+00
5.00
7.000000e+00
5.900000e+01
num_bc_sats
1263257.0
4.745405e+00
2.942177e+00
0.00
3.000000e+00
4.00
6.000000e+00
7.100000e+01
num_bc_tl
1251571.0
8.150501e+00
4.788671e+00
0.00
5.000000e+00
7.00
1.100000e+01
7.900000e+01
num_il_tl
1251571.0
8.514671e+00
7.352151e+00
0.00
3.000000e+00
7.00
1.100000e+01
1.590000e+02
num_op_rev_tl
1251571.0
8.319066e+00
4.541131e+00
0.00
5.000000e+00
7.00
1.100000e+01
9.100000e+01
num_rev_accts
1251570.0
1.468883e+01
8.080868e+00
0.00
9.000000e+00
13.00
1.900000e+01
1.180000e+02
num_rev_tl_bal_gt_0
1251571.0
5.731843e+00
3.253593e+00
0.00
3.000000e+00
5.00
7.000000e+00
5.900000e+01
num_sats
1263257.0
1.170843e+01
5.475699e+00
0.00
8.000000e+00
11.00
1.400000e+01
9.700000e+01
num_tl_120dpd_2m
1202848.0
8.646146e-04
3.117603e-02
0.00
0.000000e+00
0.00
0.000000e+00
6.000000e+00
num_tl_30dpd
1251571.0
3.875929e-03
6.640046e-02
0.00
0.000000e+00
0.00
0.000000e+00
4.000000e+00
num_tl_90g_dpd_24m
1251571.0
9.073636e-02
5.063076e-01
0.00
0.000000e+00
0.00
0.000000e+00
3.900000e+01
num_tl_op_past_12m
1251571.0
2.109639e+00
1.804914e+00
0.00
1.000000e+00
2.00
3.000000e+00
3.200000e+01
pct_tl_nvr_dlq
1251418.0
9.406750e+01
8.790950e+00
0.00
9.110000e+01
97.60
1.000000e+02
1.000000e+02
percent_bc_gt_75
1258823.0
4.745467e+01
3.583138e+01
0.00
1.670000e+01
50.00
7.500000e+01
1.000000e+02
pub_rec_bankruptcies
1320482.0
1.267870e-01
3.726020e-01
0.00
0.000000e+00
0.00
0.000000e+00
1.200000e+01
tax_liens
1321742.0
5.661468e-02
4.183959e-01
0.00
0.000000e+00
0.00
0.000000e+00
8.500000e+01
tot_hi_cred_lim
1251571.0
1.736727e+05
1.768078e+05
0.00
4.978900e+04
112389.00
2.508545e+05
9.999999e+06
total_bal_ex_mort
1271817.0
5.029927e+04
4.779373e+04
0.00
2.143000e+04
37888.00
6.325100e+04
2.921551e+06
total_bc_limit
1271817.0
2.148790e+04
2.130429e+04
0.00
7.800000e+03
15000.00
2.800000e+04
1.105500e+06
total_il_high_credit_limit
1251571.0
4.221178e+04
4.330529e+04
0.00
1.482500e+04
31728.00
5.671200e+04
2.101913e+06
We find ourselves doing the same types of things often in pandas. For example, masking.
Maybe we only want to see one column. We do this like
acc_df['int_rate']
We can also subset a dataframe based on some critereon. Let's look at only the highest interest rates
acc_df[acc_df['int_rate'] > 20]
We can take the mean, median, get the standard deviation of columns:
acc_df['int_rate'].mean()
In [12]:
acc_df[acc_df['int_rate'] > 20]['int_rate'].mean()
# thats a bit of a complicated statement. Breaking it down:
# acc_df['int_rate'] > 20 returns a mask: an array of True/False values
# putting this array into acc_df[] returns the dataframe rows where the interest rates are greater than 20
# so this first part: acc_df[acc_df['int_rate'] > 20]
# gives us a dataframe
# we select a column with ['int_rate'] at the end. Then we get the average value with .mean()
Out[12]:
23.147697456492637
In [13]:
acc_df['grade'].unique()
Out[13]:
array(['B', 'C', 'A', 'E', 'F', 'D', 'G'], dtype=object)
In [15]:
# selecting only grade A loans:
acc_df[acc_df['grade'] == 'A'].describe().T
Out[15]:
count
mean
std
min
25%
50%
75%
max
id
219054.0
4.832391e+07
2.946658e+07
56121.00
2.017032e+07
5.445476e+07
7.367424e+07
9.642933e+07
member_id
219054.0
5.191378e+07
3.136139e+07
70473.00
2.242162e+07
5.803389e+07
7.899566e+07
1.035410e+08
loan_amnt
219054.0
1.412557e+04
7.840254e+03
500.00
8.000000e+03
1.200000e+04
2.000000e+04
4.000000e+04
funded_amnt
219054.0
1.411501e+04
7.841662e+03
500.00
8.000000e+03
1.200000e+04
2.000000e+04
4.000000e+04
funded_amnt_inv
219054.0
1.409732e+04
7.848511e+03
0.00
8.000000e+03
1.200000e+04
2.000000e+04
4.000000e+04
int_rate
219054.0
7.129947e+00
1.017642e+00
5.32
6.490000e+00
7.260000e+00
7.890000e+00
9.630000e+00
installment
219054.0
4.278599e+02
2.387787e+02
15.67
2.458500e+02
3.726700e+02
5.995800e+02
1.264380e+03
annual_inc
219050.0
8.942210e+04
9.121994e+04
0.00
5.400000e+04
7.500000e+04
1.050000e+05
9.573072e+06
dti
219054.0
1.585561e+01
4.331526e+01
0.00
1.012000e+01
1.513000e+01
2.081000e+01
9.999000e+03
delinq_2yrs
219043.0
2.042110e-01
6.567539e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.000000e+01
inq_last_6mths
219043.0
3.743055e-01
6.946100e-01
0.00
0.000000e+00
0.000000e+00
1.000000e+00
1.000000e+01
mths_since_last_delinq
86975.0
3.576146e+01
2.161122e+01
0.00
1.800000e+01
3.300000e+01
5.100000e+01
1.700000e+02
mths_since_last_record
19140.0
6.988130e+01
2.796503e+01
0.00
5.200000e+01
7.100000e+01
9.000000e+01
1.210000e+02
open_acc
219043.0
1.198380e+01
5.467297e+00
1.00
8.000000e+00
1.100000e+01
1.500000e+01
8.600000e+01
pub_rec
219043.0
1.111471e-01
4.541204e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.700000e+01
revol_bal
219054.0
1.920072e+04
2.840560e+04
0.00
6.805000e+03
1.264750e+04
2.244000e+04
2.568995e+06
revol_util
218987.0
4.086190e+01
2.287719e+01
0.00
2.320000e+01
3.890000e+01
5.690000e+01
1.720000e+02
total_acc
219043.0
2.646038e+01
1.157362e+01
1.00
1.800000e+01
2.500000e+01
3.300000e+01
1.370000e+02
out_prncp
219054.0
5.846108e+03
6.782759e+03
0.00
0.000000e+00
3.971180e+03
9.321020e+03
4.156034e+04
out_prncp_inv
219054.0
5.843938e+03
6.780002e+03
0.00
0.000000e+00
3.971065e+03
9.320880e+03
4.156034e+04
total_pymnt
219054.0
9.089074e+03
7.441278e+03
0.00
3.432205e+03
6.987388e+03
1.275515e+04
4.202719e+04
total_pymnt_inv
219054.0
9.072305e+03
7.438505e+03
0.00
3.417060e+03
6.967070e+03
1.273452e+04
4.202719e+04
total_rec_prncp
219054.0
8.083666e+03
6.852942e+03
0.00
2.889110e+03
6.000000e+03
1.139958e+04
4.000000e+04
total_rec_int
219054.0
9.908640e+02
7.894928e+02
0.00
4.065600e+02
7.915350e+02
1.370910e+03
5.752190e+03
total_rec_late_fee
219054.0
2.201795e-01
2.991340e+00
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.128500e+02
recoveries
219054.0
1.432414e+01
2.179274e+02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.156742e+04
collection_recovery_fee
219054.0
2.007859e+00
3.679182e+01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
5.774800e+03
last_pymnt_amnt
219054.0
2.357615e+03
4.473266e+03
0.00
2.948725e+02
4.903600e+02
1.095330e+03
4.033284e+04
collections_12_mths_ex_med
219019.0
9.049443e-03
1.153875e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.000000e+01
mths_since_last_major_derog
37505.0
4.521227e+01
2.113264e+01
0.00
2.900000e+01
4.500000e+01
6.200000e+01
1.700000e+02
policy_code
219054.0
1.000000e+00
0.000000e+00
1.00
1.000000e+00
1.000000e+00
1.000000e+00
1.000000e+00
annual_inc_joint
832.0
1.293609e+05
5.792891e+04
25800.00
9.300000e+04
1.183540e+05
1.540000e+05
4.800000e+05
dti_joint
832.0
1.562950e+01
6.026367e+00
2.07
1.137000e+01
1.526500e+01
1.931250e+01
4.478000e+01
acc_now_delinq
219043.0
1.867213e-03
4.573841e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
4.000000e+00
tot_coll_amt
202308.0
1.516021e+02
1.587768e+03
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.241070e+05
tot_cur_bal
202308.0
1.777121e+05
1.779848e+05
0.00
3.926100e+04
1.358080e+05
2.602600e+05
4.035706e+06
open_acc_6m
74545.0
7.127373e-01
9.433867e-01
0.00
0.000000e+00
0.000000e+00
1.000000e+00
1.200000e+01
open_il_6m
74545.0
2.675887e+00
2.895929e+00
0.00
1.000000e+00
2.000000e+00
3.000000e+00
3.800000e+01
open_il_12m
74545.0
4.518076e-01
7.392276e-01
0.00
0.000000e+00
0.000000e+00
1.000000e+00
1.100000e+01
open_il_24m
74545.0
1.223006e+00
1.310623e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
2.000000e+01
mths_since_rcnt_il
73115.0
2.567870e+01
2.782954e+01
0.00
9.000000e+00
1.700000e+01
3.000000e+01
5.030000e+02
total_bal_il
74545.0
3.394634e+04
4.367126e+04
0.00
7.895000e+03
2.146600e+04
4.331300e+04
1.044916e+06
il_util
63289.0
6.402236e+01
2.373237e+01
0.00
4.900000e+01
6.600000e+01
8.100000e+01
2.750000e+02
open_rv_12m
74545.0
1.000322e+00
1.203798e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
1.600000e+01
open_rv_24m
74545.0
2.199209e+00
2.124902e+00
0.00
1.000000e+00
2.000000e+00
3.000000e+00
3.800000e+01
max_bal_bc
74545.0
6.816120e+03
6.369973e+03
0.00
2.880000e+03
5.208000e+03
8.959000e+03
4.575210e+05
all_util
74542.0
4.878659e+01
1.987433e+01
0.00
3.500000e+01
4.800000e+01
6.200000e+01
1.720000e+02
total_rev_hi_lim
202308.0
4.804376e+04
5.372299e+04
0.00
2.190000e+04
3.660000e+04
5.980000e+04
9.999999e+06
inq_fi
74545.0
6.542089e-01
1.097309e+00
0.00
0.000000e+00
0.000000e+00
1.000000e+00
1.600000e+01
total_cu_tl
74545.0
1.460970e+00
2.573779e+00
0.00
0.000000e+00
0.000000e+00
2.000000e+00
4.900000e+01
inq_last_12m
74545.0
1.435603e+00
1.774784e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
2.800000e+01
acc_open_past_24mths
206693.0
3.645382e+00
2.605672e+00
0.00
2.000000e+00
3.000000e+00
5.000000e+00
3.900000e+01
avg_cur_bal
202308.0
1.679908e+04
1.837788e+04
0.00
3.925000e+03
1.142200e+04
2.349200e+04
4.636980e+05
bc_open_to_buy
205655.0
1.998775e+04
2.217775e+04
0.00
5.240500e+03
1.271100e+04
2.682900e+04
4.974450e+05
bc_util
205556.0
4.628191e+01
2.631821e+01
0.00
2.540000e+01
4.460000e+01
6.650000e+01
1.870000e+02
chargeoff_within_12_mths
219019.0
4.894553e-03
8.002961e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
6.000000e+00
delinq_amnt
219043.0
6.655173e+00
5.446757e+02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
6.565200e+04
mo_sin_old_il_acct
198167.0
1.339263e+02
4.874892e+01
0.00
1.100000e+02
1.350000e+02
1.570000e+02
7.240000e+02
mo_sin_old_rev_tl_op
202308.0
2.098849e+02
9.790199e+01
5.00
1.390000e+02
1.890000e+02
2.630000e+02
8.510000e+02
mo_sin_rcnt_rev_tl_op
202308.0
1.674689e+01
1.927746e+01
0.00
5.000000e+00
1.100000e+01
2.100000e+01
4.380000e+02
mo_sin_rcnt_tl
202308.0
1.002895e+01
1.032866e+01
0.00
4.000000e+00
7.000000e+00
1.300000e+01
2.260000e+02
mort_acc
206693.0
2.078014e+00
2.110272e+00
0.00
0.000000e+00
2.000000e+00
3.000000e+00
3.100000e+01
mths_since_recent_bc
205755.0
2.911749e+01
3.526430e+01
0.00
8.000000e+00
1.700000e+01
3.600000e+01
6.390000e+02
mths_since_recent_bc_dlq
39518.0
3.944850e+01
2.231340e+01
0.00
2.100000e+01
3.800000e+01
5.800000e+01
1.700000e+02
mths_since_recent_inq
172696.0
8.471204e+00
6.297372e+00
0.00
3.000000e+00
7.000000e+00
1.300000e+01
2.500000e+01
mths_since_recent_revol_delinq
58226.0
3.598997e+01
2.190405e+01
0.00
1.800000e+01
3.300000e+01
5.100000e+01
1.700000e+02
num_accts_ever_120_pd
202308.0
3.329231e-01
1.057903e+00
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.300000e+01
num_actv_bc_tl
202308.0
3.680388e+00
2.264157e+00
0.00
2.000000e+00
3.000000e+00
5.000000e+00
4.700000e+01
num_actv_rev_tl
202308.0
5.270538e+00
3.120475e+00
0.00
3.000000e+00
5.000000e+00
7.000000e+00
5.900000e+01
num_bc_sats
204483.0
5.219818e+00
3.159926e+00
0.00
3.000000e+00
5.000000e+00
7.000000e+00
6.100000e+01
num_bc_tl
202308.0
9.122402e+00
5.078673e+00
0.00
5.000000e+00
8.000000e+00
1.200000e+01
6.800000e+01
num_il_tl
202308.0
8.566557e+00
7.013565e+00
0.00
4.000000e+00
7.000000e+00
1.100000e+01
1.130000e+02
num_op_rev_tl
202308.0
8.599141e+00
4.687915e+00
0.00
5.000000e+00
8.000000e+00
1.100000e+01
7.900000e+01
num_rev_accts
202308.0
1.577117e+01
8.360803e+00
1.00
1.000000e+01
1.400000e+01
2.000000e+01
1.040000e+02
num_rev_tl_bal_gt_0
202308.0
5.237954e+00
3.033714e+00
0.00
3.000000e+00
5.000000e+00
7.000000e+00
5.900000e+01
num_sats
204483.0
1.212833e+01
5.504708e+00
1.00
8.000000e+00
1.100000e+01
1.500000e+01
8.500000e+01
num_tl_120dpd_2m
197437.0
3.038944e-04
1.828095e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.000000e+00
num_tl_30dpd
202308.0
1.369199e-03
3.803182e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.000000e+00
num_tl_90g_dpd_24m
202308.0
4.810487e-02
3.469449e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
2.000000e+01
num_tl_op_past_12m
202308.0
1.549350e+00
1.448455e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
2.000000e+01
pct_tl_nvr_dlq
202292.0
9.601827e+01
6.999248e+00
18.20
9.440000e+01
1.000000e+02
1.000000e+02
1.000000e+02
percent_bc_gt_75
205650.0
2.793310e+01
3.069985e+01
0.00
0.000000e+00
2.000000e+01
5.000000e+01
1.000000e+02
pub_rec_bankruptcies
218886.0
5.868808e-02
2.536086e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
8.000000e+00
tax_liens
219024.0
3.726532e-02
3.351089e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.700000e+01
tot_hi_cred_lim
202308.0
2.292160e+05
2.075162e+05
1600.00
7.658500e+04
1.839230e+05
3.211510e+05
9.999999e+06
total_bal_ex_mort
206693.0
5.271477e+04
5.219592e+04
0.00
2.203700e+04
3.883700e+04
6.548100e+04
2.688920e+06
total_bc_limit
206693.0
3.387474e+04
2.841031e+04
0.00
1.430000e+04
2.600000e+04
4.500000e+04
7.600000e+05
total_il_high_credit_limit
202308.0
4.463529e+04
4.674582e+04
0.00
1.547175e+04
3.320700e+04
5.946825e+04
2.101913e+06
In [16]:
acc_df['loan_status'].unique()
Out[16]:
array(['Fully Paid', 'Charged Off', 'Late (31-120 days)', 'Current',
'Late (16-30 days)', 'In Grace Period', 'Default',
'Does not meet the credit policy. Status:Fully Paid',
'Does not meet the credit policy. Status:Charged Off', 'Issued'], dtype=object)
In [19]:
# looking at only defaulted loans:
default_categories = ['Default', 'Charged Off', 'Does not meet the credit policy. Status:Charged Off']
# .isin() is a trick for checking if something is in a list
# it's a pandas-specific function
acc_df[acc_df['loan_status'].isin(default_categories)].describe().T
# check out the average interest rate and dti (debt-to-income)
Out[19]:
count
mean
std
min
25%
50%
75%
max
id
98930.0
2.699739e+07
2.301416e+07
56413.00
6.310303e+06
2.013981e+07
4.568428e+07
91330911.00
member_id
98930.0
2.923833e+07
2.437432e+07
70694.00
7.610162e+06
2.239203e+07
4.878197e+07
98011682.00
loan_amnt
98930.0
1.508133e+04
8.500336e+03
500.00
8.500000e+03
1.377500e+04
2.000000e+04
40000.00
funded_amnt
98930.0
1.505748e+04
8.489354e+03
500.00
8.500000e+03
1.375000e+04
2.000000e+04
40000.00
funded_amnt_inv
98930.0
1.497194e+04
8.521248e+03
0.00
8.400000e+03
1.350000e+04
2.000000e+04
40000.00
int_rate
98930.0
1.587433e+01
4.382355e+00
5.32
1.299000e+01
1.561000e+01
1.855000e+01
30.99
installment
98930.0
4.515717e+02
2.493054e+02
15.91
2.733900e+02
3.975600e+02
5.845900e+02
1527.00
annual_inc
98930.0
6.769918e+04
6.799978e+04
0.00
4.200000e+04
5.900000e+04
8.000000e+04
8900060.00
dti
98930.0
1.958989e+01
3.282763e+01
0.00
1.329000e+01
1.931000e+01
2.550000e+01
9999.00
delinq_2yrs
98927.0
3.204383e-01
8.769726e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
20.00
inq_last_6mths
98927.0
9.264306e-01
1.169960e+00
0.00
0.000000e+00
1.000000e+00
1.000000e+00
33.00
mths_since_last_delinq
48079.0
3.381019e+01
2.209809e+01
0.00
1.500000e+01
3.000000e+01
5.000000e+01
152.00
mths_since_last_record
16060.0
7.284732e+01
2.910663e+01
0.00
5.400000e+01
7.500000e+01
9.700000e+01
129.00
open_acc
98927.0
1.158510e+01
5.297463e+00
0.00
8.000000e+00
1.100000e+01
1.400000e+01
76.00
pub_rec
98927.0
1.992783e-01
6.243241e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
86.00
revol_bal
98930.0
1.543650e+04
1.967720e+04
0.00
6.121000e+03
1.122700e+04
1.937575e+04
1746716.00
revol_util
98848.0
5.782778e+01
2.355960e+01
0.00
4.120000e+01
5.930000e+01
7.630000e+01
366.60
total_acc
98927.0
2.494847e+01
1.194085e+01
1.00
1.600000e+01
2.300000e+01
3.200000e+01
151.00
out_prncp
98930.0
1.866649e+02
1.760048e+03
0.00
0.000000e+00
0.000000e+00
0.000000e+00
35000.00
out_prncp_inv
98930.0
1.865517e+02
1.758890e+03
0.00
0.000000e+00
0.000000e+00
0.000000e+00
35000.00
total_pymnt
98930.0
6.836523e+03
5.737630e+03
0.00
2.799695e+03
5.276090e+03
9.125892e+03
52956.86
total_pymnt_inv
98930.0
6.793016e+03
5.729176e+03
0.00
2.761983e+03
5.232555e+03
9.087438e+03
52956.86
total_rec_prncp
98930.0
3.586177e+03
3.505095e+03
0.00
1.229388e+03
2.529110e+03
4.763060e+03
34983.14
total_rec_int
98930.0
2.314493e+03
2.400856e+03
0.00
7.488125e+02
1.554030e+03
2.994255e+03
23448.10
total_rec_late_fee
98930.0
2.829853e+00
1.171959e+01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
358.68
recoveries
98930.0
9.330239e+02
1.561515e+03
0.00
0.000000e+00
9.717500e+01
1.408938e+03
39444.37
collection_recovery_fee
98930.0
1.331168e+02
2.691809e+02
0.00
0.000000e+00
1.900000e+00
1.740271e+02
7002.19
last_pymnt_amnt
98930.0
4.665552e+02
6.263558e+02
0.00
2.563000e+02
3.882800e+02
5.844800e+02
34833.35
collections_12_mths_ex_med
98902.0
1.513620e-02
1.347703e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
6.00
mths_since_last_major_derog
24667.0
4.402315e+01
2.232647e+01
0.00
2.600000e+01
4.400000e+01
6.200000e+01
152.00
policy_code
98930.0
1.000000e+00
0.000000e+00
1.00
1.000000e+00
1.000000e+00
1.000000e+00
1.00
annual_inc_joint
73.0
8.860393e+04
3.698016e+04
17950.00
7.113600e+04
8.400000e+04
1.035000e+05
200000.00
dti_joint
73.0
2.156918e+01
8.538098e+00
4.50
1.445000e+01
2.281000e+01
2.676000e+01
38.45
acc_now_delinq
98927.0
5.357486e-03
8.138018e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
6.00
tot_coll_amt
87903.0
1.902361e+02
1.593015e+03
0.00
0.000000e+00
0.000000e+00
0.000000e+00
126297.00
tot_cur_bal
87903.0
1.160545e+05
1.304739e+05
0.00
2.708250e+04
6.099400e+04
1.718290e+05
3437283.00
open_acc_6m
5822.0
1.405187e+00
1.414041e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
12.00
open_il_6m
5822.0
3.027825e+00
3.232541e+00
0.00
1.000000e+00
2.000000e+00
4.000000e+00
38.00
open_il_12m
5822.0
1.154243e+00
1.250012e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
10.00
open_il_24m
5822.0
2.250086e+00
2.022544e+00
0.00
1.000000e+00
2.000000e+00
3.000000e+00
20.00
mths_since_rcnt_il
5673.0
1.547858e+01
2.228752e+01
0.00
4.000000e+00
8.000000e+00
1.700000e+01
245.00
total_bal_il
5822.0
3.736793e+04
4.051580e+04
0.00
1.214100e+04
2.723800e+04
4.968300e+04
796104.00
il_util
5186.0
7.642653e+01
2.081345e+01
0.00
6.500000e+01
8.000000e+01
9.100000e+01
208.00
open_rv_12m
5822.0
1.635349e+00
1.674451e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
18.00
open_rv_24m
5822.0
3.463586e+00
2.859950e+00
0.00
1.000000e+00
3.000000e+00
5.000000e+00
30.00
max_bal_bc
5822.0
5.351811e+03
5.308727e+03
0.00
1.927000e+03
3.945500e+03
7.024000e+03
84988.00
all_util
5822.0
6.464875e+01
1.984519e+01
0.00
5.300000e+01
6.600000e+01
7.900000e+01
161.00
total_rev_hi_lim
87903.0
2.742641e+04
2.703635e+04
0.00
1.250000e+04
2.100000e+04
3.440000e+04
1998700.00
inq_fi
5822.0
1.325318e+00
1.879731e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
23.00
total_cu_tl
5822.0
1.692546e+00
2.899439e+00
0.00
0.000000e+00
0.000000e+00
2.000000e+00
28.00
inq_last_12m
5822.0
2.847303e+00
2.912727e+00
0.00
1.000000e+00
2.000000e+00
4.000000e+00
31.00
acc_open_past_24mths
91295.0
5.151826e+00
3.226986e+00
0.00
3.000000e+00
5.000000e+00
7.000000e+00
41.00
avg_cur_bal
87899.0
1.083432e+04
1.291830e+04
0.00
2.794000e+03
5.784000e+03
1.481450e+04
276369.00
bc_open_to_buy
90264.0
6.547262e+03
1.071069e+04
0.00
8.330000e+02
2.851000e+03
7.629000e+03
282091.00
bc_util
90198.0
6.735443e+01
2.665687e+01
0.00
4.940000e+01
7.290000e+01
9.030000e+01
255.20
chargeoff_within_12_mths
98902.0
8.564033e-03
1.059282e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
5.00
delinq_amnt
98927.0
1.101499e+01
6.290389e+02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
65000.00
mo_sin_old_il_acct
84975.0
1.225154e+02
5.352688e+01
1.00
9.200000e+01
1.260000e+02
1.500000e+02
720.00
mo_sin_old_rev_tl_op
87902.0
1.707054e+02
9.069158e+01
4.00
1.090000e+02
1.540000e+02
2.140000e+02
842.00
mo_sin_rcnt_rev_tl_op
87902.0
1.114201e+01
1.383892e+01
0.00
3.000000e+00
7.000000e+00
1.400000e+01
372.00
mo_sin_rcnt_tl
87903.0
6.797504e+00
7.642844e+00
0.00
2.000000e+00
5.000000e+00
9.000000e+00
197.00
mort_acc
91295.0
1.492174e+00
1.966648e+00
0.00
0.000000e+00
1.000000e+00
2.000000e+00
23.00
mths_since_recent_bc
90347.0
2.078892e+01
2.675381e+01
0.00
5.000000e+00
1.200000e+01
2.500000e+01
538.00
mths_since_recent_bc_dlq
22672.0
3.977135e+01
2.278040e+01
0.00
2.000000e+01
3.900000e+01
5.900000e+01
152.00
mths_since_recent_inq
84998.0
5.998176e+00
5.587802e+00
0.00
1.000000e+00
4.000000e+00
9.000000e+00
25.00
mths_since_recent_revol_delinq
31417.0
3.567467e+01
2.245821e+01
0.00
1.600000e+01
3.300000e+01
5.300000e+01
152.00
num_accts_ever_120_pd
87903.0
4.785957e-01
1.200960e+00
0.00
0.000000e+00
0.000000e+00
0.000000e+00
24.00
num_actv_bc_tl
87903.0
3.784399e+00
2.263239e+00
0.00
2.000000e+00
3.000000e+00
5.000000e+00
26.00
num_actv_rev_tl
87903.0
6.057768e+00
3.345165e+00
0.00
4.000000e+00
5.000000e+00
8.000000e+00
40.00
num_bc_sats
89932.0
4.706389e+00
2.806615e+00
0.00
3.000000e+00
4.000000e+00
6.000000e+00
35.00
num_bc_tl
87903.0
8.398325e+00
4.892832e+00
0.00
5.000000e+00
8.000000e+00
1.100000e+01
52.00
num_il_tl
87903.0
8.542791e+00
7.470562e+00
0.00
3.000000e+00
7.000000e+00
1.100000e+01
117.00
num_op_rev_tl
87903.0
8.473875e+00
4.409085e+00
0.00
5.000000e+00
8.000000e+00
1.100000e+01
62.00
num_rev_accts
87903.0
1.500139e+01
8.060028e+00
0.00
9.000000e+00
1.400000e+01
1.900000e+01
82.00
num_rev_tl_bal_gt_0
87903.0
6.025995e+00
3.300846e+00
0.00
4.000000e+00
5.000000e+00
8.000000e+00
39.00
num_sats
89932.0
1.175865e+01
5.304950e+00
0.00
8.000000e+00
1.100000e+01
1.400000e+01
76.00
num_tl_120dpd_2m
85106.0
8.695039e-04
3.594113e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
6.00
num_tl_30dpd
87903.0
3.788267e-03
6.503084e-02
0.00
0.000000e+00
0.000000e+00
0.000000e+00
3.00
num_tl_90g_dpd_24m
87903.0
9.226079e-02
4.905162e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
20.00
num_tl_op_past_12m
87903.0
2.436618e+00
1.873113e+00
0.00
1.000000e+00
2.000000e+00
3.000000e+00
25.00
pct_tl_nvr_dlq
87885.0
9.456235e+01
8.153480e+00
12.50
9.200000e+01
1.000000e+02
1.000000e+02
100.00
percent_bc_gt_75
90229.0
5.493363e+01
3.466952e+01
0.00
2.500000e+01
5.000000e+01
8.570000e+01
100.00
pub_rec_bankruptcies
98596.0
1.289505e-01
3.703407e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
8.00
tax_liens
98919.0
4.356089e-02
4.450575e-01
0.00
0.000000e+00
0.000000e+00
0.000000e+00
85.00
tot_hi_cred_lim
87903.0
1.412272e+05
1.437184e+05
0.00
4.275050e+04
8.540800e+04
2.018460e+05
3867129.00
total_bal_ex_mort
91295.0
4.693377e+04
4.247270e+04
0.00
2.095100e+04
3.633800e+04
5.937600e+04
1896461.00
total_bc_limit
91295.0
1.738864e+04
1.731728e+04
0.00
6.400000e+03
1.220000e+04
2.250000e+04
560800.00
total_il_high_credit_limit
87903.0
3.906871e+04
3.904861e+04
0.00
1.383700e+04
2.990400e+04
5.278000e+04
757807.00
In [ ]:
# TODO: compare the dti, interest rate of defaulted and non-defaulted loans
In [9]:
# seaborn is a recently-created Python library for easily making nice-looking plots
# you will have to install it with `conda install seaborn` or `pip install seaborn`, etc
# the docs are here for this function: http://seaborn.pydata.org/generated/seaborn.distplot.html
# found by Googling 'seaborn histogram'
f = sns.distplot(acc_df['dti'])
In [10]:
# outliers are screwing up the histogram... remove them
# adapted from http://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list
# we're using interquartile range to determine outliers
def reject_outliers(sr, iq_range=0.5, side='left', return_mask=False):
"""
Takes an array (or pandas series) and returns an array with outliers excluded, according to the
interquartile range.
Parameters:
-----------
sr: array
array of numeric values
iq_range: float
percent to calculate quartiles by, 0.5 will yield 25% and 75%ile quartiles
side: string
if 'left', will return everything below the highest quartile
if 'right', will return everything above the lowest quartile
if 'both', will return everything between the high and low quartiles
"""
pcnt = (1 - iq_range) / 2
qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
iqr = qhigh - qlow
if side=='both':
mask = (sr - median).abs() <= iqr
elif side=='left':
mask = (sr - median) <= iqr
elif side=='right':
mask = (sr - median) >= iqr
else:
print 'options for side are left, right, or both'
if return_mask:
return mask
return sr[mask]
In [11]:
# sweeeeeeeetttt....
dti_no_outliers = reject_outliers(acc_df['dti'], iq_range=0.85) # arrived at 0.85 via trial and error
f = sns.distplot(dti_no_outliers)
# other types of plot examples:
# http://seaborn.pydata.org/examples/
In [12]:
# sets the xkcd style if you want to make the plots look funny...may neet to install some fonts
plt.xkcd()
Out[12]:
<matplotlib.rc_context at 0x7f107bfd9150>
In [13]:
f = sns.distplot(dti_no_outliers)
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'xkcd', u'Humor Sans', u'Comic Sans MS'] not found. Falling back to DejaVu Sans
(prop.get_family(), self.defaultFamily[fontext]))
In [14]:
# back to default
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
In [15]:
f = sns.distplot(dti_no_outliers)
In [16]:
# other styles:
# http://matplotlib.org/users/style_sheets.html
# this is the R ggplot style, which some people really like
plt.style.use('ggplot')
In [17]:
# http://seaborn.pydata.org/generated/seaborn.regplot.html
# takes a long time because there are a lot of points, but works for smaller datasets
# f = sns.regplot(data=acc_df, x='dti', y='int_rate', fit_reg=False)
# instead, lets make a heatmap:
# http://stackoverflow.com/questions/2369492/generate-a-heatmap-in-matplotlib-using-a-scatter-data-set
# http://seaborn.pydata.org/generated/seaborn.jointplot.html
mask = reject_outliers(acc_df['dti'], iq_range=0.85, return_mask=True)
f = sns.jointplot(data=acc_df.ix[mask, :], x='dti', y='int_rate', kind='hex', joint_kws=dict(gridsize=50))
In [ ]:
# suggetion for assignment:
# explore the data and see what differences you can find between defaulted and paid-back loans
# Are there any variables that look like they might predict if a loan will default?
# feel free to use any other dataset you want for the assignment though
# kaggle.com/datasets and http://archive.ics.uci.edu/ml/ have a lot of datasets, or data.gov
Content source: nateGeorge/lending_club_EDA
Similar notebooks: