Intro to Jupyter notebooks and Python for data science

Check out the keyboard shortcuts by going to help -> keyboard shortcuts.

I frequently use esc to exit a cell, a to add a cell above, b to add a cell below, enter to edit a cell, shift+enter to run a cell, arrow keys to go up and down between cells, ctrl+m to change a cell to markdown (like this cell), and sometimes shift+m to merge cells.

Pandas and NumPy

Two very important libraries in Python for data science are pandas and numpy. Pandas was made by Wes McKinney, initially for financial stuff. He also wrote a book on it in 2012 (now getting kind of dated).

Jupyter notebooks

Be careful of the order in which you run things (cells) in Jupyter notebooks. If you set a variable in one cell, then go back to a cell above it and change the variable in a different way, it can make things confusing. After completing a notebook, it's usually a good idea to go to Kernel -> Restart and Run All in the menu, to make sure everything is still working like you expect.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
# "magic" command to make plots show up in the notebook
%matplotlib inline

In [2]:
# this should be the path to the data.  Adapt for your system
win_filepath = 'C:/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'
# same path from virtualbox shared folder
linux_filepath = '/media/sf_C_DRIVE/Users/ngeorge/Documents/GitHub/preprocess_lending_club_data/full_data/'

In [3]:
# gzip is a type of compression...it cuts the file size in less than half in this case.
# pandas will automatically detect the compression type
# check out the rediculous amount of options for reading csvs:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

# we're looking at some lendingclub data here.  It's a peer-to-peer loan company
acc_df = pd.read_csv(linux_filepath + 'accepted_2007_to_2016.csv.gz')

# this is a dataset with rejected loans from lendingclub
# rej_df = pd.read_csv(linux_filepath + 'rejected_2007_to_2016.csv.gz')
# the files are read into a dataframe


/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (18,46,54) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [4]:
print type(acc_df)


<class 'pandas.core.frame.DataFrame'>

In [5]:
# this is how we can see how many entries are in the dataframe (df)
# it also works for numpy arrays
acc_df.shape


Out[5]:
(1321847, 110)

In [6]:
# this is how many rows pandas will show by default with methods like pd.dataframe.head()
pd.options.display.max_rows


Out[6]:
60

In [7]:
# we want to increase it, because in this case there are a lot of column names
pd.options.display.max_rows = 1000

In [8]:
# the .T is transposing the matrix.
# We do this so the 111 column dataframe is easier to read (easier to scroll down than sideways) 
# .head() shows the first few rows of the data
acc_df.head().T


Out[8]:
0 1 2 3 4
id 1077501 1077430 1077175 1076863 1075358
member_id 1.2966e+06 1.31417e+06 1.31352e+06 1.27718e+06 1.31175e+06
loan_amnt 5000 2500 2400 10000 3000
funded_amnt 5000 2500 2400 10000 3000
funded_amnt_inv 4975 2500 2400 10000 3000
term 36 months 60 months 36 months 36 months 60 months
int_rate 10.65 15.27 15.96 13.49 12.69
installment 162.87 59.83 84.33 339.31 67.79
grade B C C C B
sub_grade B2 C4 C5 C1 B5
emp_title NaN Ryder NaN AIR RESOURCES BOARD University Medical Group
emp_length 10+ years < 1 year 10+ years 10+ years 1 year
home_ownership RENT RENT RENT RENT RENT
annual_inc 24000 30000 12252 49200 80000
verification_status Verified Source Verified Not Verified Source Verified Source Verified
issue_d Dec-2011 Dec-2011 Dec-2011 Dec-2011 Dec-2011
loan_status Fully Paid Charged Off Fully Paid Fully Paid Fully Paid
pymnt_plan n n n n n
desc Borrower added on 12/22/11 > I need to upgra... Borrower added on 12/22/11 > I plan to use t... NaN Borrower added on 12/21/11 > to pay for prop... Borrower added on 12/21/11 > I plan on combi...
purpose credit_card car small_business other other
title Computer bike real estate business personel Personal
zip_code 860xx 309xx 606xx 917xx 972xx
addr_state AZ GA IL CA OR
dti 27.65 1 8.72 20 17.94
delinq_2yrs 0 0 0 0 0
earliest_cr_line Jan-1985 Apr-1999 Nov-2001 Feb-1996 Jan-1996
inq_last_6mths 1 5 2 1 0
mths_since_last_delinq NaN NaN NaN 35 38
mths_since_last_record NaN NaN NaN NaN NaN
open_acc 3 3 2 10 15
pub_rec 0 0 0 0 0
revol_bal 13648 1687 2956 5598 27783
revol_util 83.7 9.4 98.5 21 53.9
total_acc 9 4 10 37 38
initial_list_status f f f f f
out_prncp 0 0 0 0 0
out_prncp_inv 0 0 0 0 0
total_pymnt 5863.16 1014.53 3005.67 12231.9 4066.91
total_pymnt_inv 5833.84 1014.53 3005.67 12231.9 4066.91
total_rec_prncp 5000 456.46 2400 10000 3000
total_rec_int 863.16 435.17 605.67 2214.92 1066.91
total_rec_late_fee 0 0 0 16.97 0
recoveries 0 122.9 0 0 0
collection_recovery_fee 0 1.11 0 0 0
last_pymnt_d Jan-2015 Apr-2013 Jun-2014 Jan-2015 Jan-2017
last_pymnt_amnt 171.62 119.66 649.91 357.48 67.3
next_pymnt_d NaN NaN NaN NaN NaN
last_credit_pull_d Jan-2017 Oct-2016 Jan-2017 Apr-2016 Jan-2017
collections_12_mths_ex_med 0 0 0 0 0
mths_since_last_major_derog NaN NaN NaN NaN NaN
policy_code 1 1 1 1 1
application_type INDIVIDUAL INDIVIDUAL INDIVIDUAL INDIVIDUAL INDIVIDUAL
annual_inc_joint NaN NaN NaN NaN NaN
dti_joint NaN NaN NaN NaN NaN
verification_status_joint NaN NaN NaN NaN NaN
acc_now_delinq 0 0 0 0 0
tot_coll_amt NaN NaN NaN NaN NaN
tot_cur_bal NaN NaN NaN NaN NaN
open_acc_6m NaN NaN NaN NaN NaN
open_il_6m NaN NaN NaN NaN NaN
open_il_12m NaN NaN NaN NaN NaN
open_il_24m NaN NaN NaN NaN NaN
mths_since_rcnt_il NaN NaN NaN NaN NaN
total_bal_il NaN NaN NaN NaN NaN
il_util NaN NaN NaN NaN NaN
open_rv_12m NaN NaN NaN NaN NaN
open_rv_24m NaN NaN NaN NaN NaN
max_bal_bc NaN NaN NaN NaN NaN
all_util NaN NaN NaN NaN NaN
total_rev_hi_lim NaN NaN NaN NaN NaN
inq_fi NaN NaN NaN NaN NaN
total_cu_tl NaN NaN NaN NaN NaN
inq_last_12m NaN NaN NaN NaN NaN
acc_open_past_24mths NaN NaN NaN NaN NaN
avg_cur_bal NaN NaN NaN NaN NaN
bc_open_to_buy NaN NaN NaN NaN NaN
bc_util NaN NaN NaN NaN NaN
chargeoff_within_12_mths 0 0 0 0 0
delinq_amnt 0 0 0 0 0
mo_sin_old_il_acct NaN NaN NaN NaN NaN
mo_sin_old_rev_tl_op NaN NaN NaN NaN NaN
mo_sin_rcnt_rev_tl_op NaN NaN NaN NaN NaN
mo_sin_rcnt_tl NaN NaN NaN NaN NaN
mort_acc NaN NaN NaN NaN NaN
mths_since_recent_bc NaN NaN NaN NaN NaN
mths_since_recent_bc_dlq NaN NaN NaN NaN NaN
mths_since_recent_inq NaN NaN NaN NaN NaN
mths_since_recent_revol_delinq NaN NaN NaN NaN NaN
num_accts_ever_120_pd NaN NaN NaN NaN NaN
num_actv_bc_tl NaN NaN NaN NaN NaN
num_actv_rev_tl NaN NaN NaN NaN NaN
num_bc_sats NaN NaN NaN NaN NaN
num_bc_tl NaN NaN NaN NaN NaN
num_il_tl NaN NaN NaN NaN NaN
num_op_rev_tl NaN NaN NaN NaN NaN
num_rev_accts NaN NaN NaN NaN NaN
num_rev_tl_bal_gt_0 NaN NaN NaN NaN NaN
num_sats NaN NaN NaN NaN NaN
num_tl_120dpd_2m NaN NaN NaN NaN NaN
num_tl_30dpd NaN NaN NaN NaN NaN
num_tl_90g_dpd_24m NaN NaN NaN NaN NaN
num_tl_op_past_12m NaN NaN NaN NaN NaN
pct_tl_nvr_dlq NaN NaN NaN NaN NaN
percent_bc_gt_75 NaN NaN NaN NaN NaN
pub_rec_bankruptcies 0 0 0 0 0
tax_liens 0 0 0 0 0
tot_hi_cred_lim NaN NaN NaN NaN NaN
total_bal_ex_mort NaN NaN NaN NaN NaN
total_bc_limit NaN NaN NaN NaN NaN
total_il_high_credit_limit NaN NaN NaN NaN NaN

In [9]:
# .tail() shows the last few rows
acc_df.tail().T


Out[9]:
1321842 1321843 1321844 1321845 1321846
id 89885898 88977788 88985880 88224441 88215728
member_id 9.62797e+07 9.53207e+07 9.53287e+07 9.45252e+07 9.45164e+07
loan_amnt 24000 24000 40000 24000 14000
funded_amnt 24000 24000 40000 24000 14000
funded_amnt_inv 24000 24000 40000 24000 14000
term 60 months 60 months 60 months 60 months 60 months
int_rate 12.79 10.49 10.49 14.49 14.49
installment 543.5 515.74 859.56 564.56 329.33
grade C B B C C
sub_grade C1 B3 B3 C4 C4
emp_title Unit Operator Database Administrator Vice President Program Manager Customer Service Technician
emp_length 7 years 10+ years 9 years 6 years 10+ years
home_ownership MORTGAGE MORTGAGE MORTGAGE RENT MORTGAGE
annual_inc 95000 108000 227000 110000 95000
verification_status Source Verified Not Verified Verified Not Verified Verified
issue_d Oct-2016 Oct-2016 Oct-2016 Oct-2016 Oct-2016
loan_status Current Current Current Current Current
pymnt_plan n n n n n
desc NaN NaN NaN NaN NaN
purpose home_improvement debt_consolidation debt_consolidation debt_consolidation debt_consolidation
title Home improvement Debt consolidation NaN Debt consolidation NaN
zip_code 356xx 840xx 907xx 334xx 770xx
addr_state AL UT CA FL TX
dti 19.61 34.94 12.75 18.3 23.36
delinq_2yrs 0 0 7 0 0
earliest_cr_line Dec-1999 Feb-1991 Feb-1995 Jul-1999 Jun-1996
inq_last_6mths 0 1 1 0 1
mths_since_last_delinq NaN 60 9 67 37
mths_since_last_record NaN 69 NaN 72 NaN
open_acc 5 24 5 10 8
pub_rec 0 1 0 1 0
revol_bal 49431 21665 8633 17641 7662
revol_util 84.4 39 64.9 68.1 54
total_acc 54 58 37 31 22
initial_list_status f f f f w
out_prncp 23127.7 23074.1 38456.9 23165.7 13499.3
out_prncp_inv 23127.7 23074.1 38456.9 23165.7 13499.3
total_pymnt 1613.45 1533.23 2555.37 1674.36 990.39
total_pymnt_inv 1613.45 1533.23 2555.37 1674.36 990.39
total_rec_prncp 872.33 925.87 1543.09 834.27 500.67
total_rec_int 741.12 607.36 1012.28 840.09 489.72
total_rec_late_fee 0 0 0 0 0
recoveries 0 0 0 0 0
collection_recovery_fee 0 0 0 0 0
last_pymnt_d Jan-2017 Jan-2017 Jan-2017 Jan-2017 Jan-2017
last_pymnt_amnt 543.5 515.74 859.56 564.56 329.33
next_pymnt_d Feb-2017 Feb-2017 Feb-2017 Feb-2017 Feb-2017
last_credit_pull_d Jan-2017 Jan-2017 Jan-2017 Jan-2017 Jan-2017
collections_12_mths_ex_med 0 0 0 0 0
mths_since_last_major_derog NaN 60 10 67 37
policy_code 1 1 1 1 1
application_type INDIVIDUAL INDIVIDUAL INDIVIDUAL INDIVIDUAL INDIVIDUAL
annual_inc_joint NaN NaN NaN NaN NaN
dti_joint NaN NaN NaN NaN NaN
verification_status_joint NaN NaN NaN NaN NaN
acc_now_delinq 0 0 0 0 0
tot_coll_amt 0 0 0 0 0
tot_cur_bal 104422 140118 28398 62426 143066
open_acc_6m 2 0 0 0 0
open_il_6m 2 4 2 2 1
open_il_12m 1 0 0 0 1
open_il_24m 3 5 1 2 2
mths_since_rcnt_il 3 16 15 20 7
total_bal_il 54991 118453 19765 44785 32071
il_util 106 77 46 78 94
open_rv_12m 2 0 0 1 0
open_rv_24m 2 2 0 5 1
max_bal_bc 0 4090 5141 6172 4093
all_util 103 59 51 73 84
total_rev_hi_lim 50500 55600 13300 25900 14200
inq_fi 1 1 3 0 3
total_cu_tl 19 12 0 0 1
inq_last_12m 1 4 2 1 2
acc_open_past_24mths 5 7 2 7 3
avg_cur_bal 26106 6369 5680 6243 17883
bc_open_to_buy NaN 15307 4070 4660 1838
bc_util NaN 45.1 66.9 77.5 80.7
chargeoff_within_12_mths 0 0 0 0 0
delinq_amnt 0 0 0 0 0
mo_sin_old_il_acct 201 166 154 132 149
mo_sin_old_rev_tl_op 88 307 258 206 243
mo_sin_rcnt_rev_tl_op 3 15 33 9 22
mo_sin_rcnt_tl 3 15 15 9 7
mort_acc 0 4 3 2 1
mths_since_recent_bc NaN 15 41 9 75
mths_since_recent_bc_dlq NaN 60 9 NaN 39
mths_since_recent_inq 4 4 1 9 6
mths_since_recent_revol_delinq NaN 60 9 NaN 37
num_accts_ever_120_pd 0 12 6 1 4
num_actv_bc_tl 0 5 2 5 3
num_actv_rev_tl 2 12 3 7 3
num_bc_sats 0 7 2 5 4
num_bc_tl 3 18 15 15 12
num_il_tl 43 17 9 4 3
num_op_rev_tl 3 19 3 8 5
num_rev_accts 11 37 23 24 17
num_rev_tl_bal_gt_0 2 12 3 7 3
num_sats 5 24 5 10 8
num_tl_120dpd_2m 0 0 0 0 0
num_tl_30dpd 0 0 0 0 0
num_tl_90g_dpd_24m 0 0 7 0 0
num_tl_op_past_12m 3 0 0 1 1
pct_tl_nvr_dlq 100 75.9 75.7 96.2 81.8
percent_bc_gt_75 NaN 42.9 50 40 50
pub_rec_bankruptcies 0 0 0 1 0
tax_liens 0 1 0 0 0
tot_hi_cred_lim 102517 227883 55970 84664 163804
total_bal_ex_mort 104422 140118 28398 62426 44215
total_bc_limit 0 27900 12300 20700 9500
total_il_high_credit_limit 52017 172283 42670 58764 34169

In [10]:
# .info() tells us the datatype(int64, `object` is a string)
# and will also tell us the number of non-null (not missing) data points for each column
# because this dataframe is so large, we have to force it to show the datatypes and non-null numbers with the arguments
acc_df.info(verbose=True, null_counts=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1321847 entries, 0 to 1321846
Data columns (total 110 columns):
id                                1321847 non-null int64
member_id                         1321847 non-null float64
loan_amnt                         1321847 non-null float64
funded_amnt                       1321847 non-null float64
funded_amnt_inv                   1321847 non-null float64
term                              1321847 non-null object
int_rate                          1321847 non-null float64
installment                       1321847 non-null float64
grade                             1321847 non-null object
sub_grade                         1321847 non-null object
emp_title                         1241887 non-null object
emp_length                        1321847 non-null object
home_ownership                    1321847 non-null object
annual_inc                        1321843 non-null float64
verification_status               1321847 non-null object
issue_d                           1321847 non-null object
loan_status                       1321847 non-null object
pymnt_plan                        1321847 non-null object
desc                              126068 non-null object
purpose                           1321847 non-null object
title                             1298523 non-null object
zip_code                          1321847 non-null object
addr_state                        1321847 non-null object
dti                               1321847 non-null float64
delinq_2yrs                       1321818 non-null float64
earliest_cr_line                  1321818 non-null object
inq_last_6mths                    1321817 non-null float64
mths_since_last_delinq            662677 non-null float64
mths_since_last_record            219977 non-null float64
open_acc                          1321818 non-null float64
pub_rec                           1321818 non-null float64
revol_bal                         1321847 non-null float64
revol_util                        1321081 non-null float64
total_acc                         1321818 non-null float64
initial_list_status               1321847 non-null object
out_prncp                         1321847 non-null float64
out_prncp_inv                     1321847 non-null float64
total_pymnt                       1321847 non-null float64
total_pymnt_inv                   1321847 non-null float64
total_rec_prncp                   1321847 non-null float64
total_rec_int                     1321847 non-null float64
total_rec_late_fee                1321847 non-null float64
recoveries                        1321847 non-null float64
collection_recovery_fee           1321847 non-null float64
last_pymnt_d                      1314178 non-null object
last_pymnt_amnt                   1321847 non-null float64
next_pymnt_d                      829089 non-null object
last_credit_pull_d                1321770 non-null object
collections_12_mths_ex_med        1321702 non-null float64
mths_since_last_major_derog       347331 non-null float64
policy_code                       1321847 non-null float64
application_type                  1321847 non-null object
annual_inc_joint                  9300 non-null float64
dti_joint                         9296 non-null float64
verification_status_joint         9300 non-null object
acc_now_delinq                    1321818 non-null float64
tot_coll_amt                      1251571 non-null float64
tot_cur_bal                       1251571 non-null float64
open_acc_6m                       455717 non-null float64
open_il_6m                        455718 non-null float64
open_il_12m                       455718 non-null float64
open_il_24m                       455718 non-null float64
mths_since_rcnt_il                443459 non-null float64
total_bal_il                      455718 non-null float64
il_util                           395615 non-null float64
open_rv_12m                       455718 non-null float64
open_rv_24m                       455718 non-null float64
max_bal_bc                        455718 non-null float64
all_util                          455695 non-null float64
total_rev_hi_lim                  1251571 non-null float64
inq_fi                            455718 non-null float64
total_cu_tl                       455717 non-null float64
inq_last_12m                      455717 non-null float64
acc_open_past_24mths              1271817 non-null float64
avg_cur_bal                       1251559 non-null float64
bc_open_to_buy                    1259244 non-null float64
bc_util                           1258527 non-null float64
chargeoff_within_12_mths          1321702 non-null float64
delinq_amnt                       1321818 non-null float64
mo_sin_old_il_acct                1214312 non-null float64
mo_sin_old_rev_tl_op              1251570 non-null float64
mo_sin_rcnt_rev_tl_op             1251570 non-null float64
mo_sin_rcnt_tl                    1251571 non-null float64
mort_acc                          1271817 non-null float64
mths_since_recent_bc              1260066 non-null float64
mths_since_recent_bc_dlq          318765 non-null float64
mths_since_recent_inq             1139670 non-null float64
mths_since_recent_revol_delinq    448529 non-null float64
num_accts_ever_120_pd             1251571 non-null float64
num_actv_bc_tl                    1251571 non-null float64
num_actv_rev_tl                   1251571 non-null float64
num_bc_sats                       1263257 non-null float64
num_bc_tl                         1251571 non-null float64
num_il_tl                         1251571 non-null float64
num_op_rev_tl                     1251571 non-null float64
num_rev_accts                     1251570 non-null float64
num_rev_tl_bal_gt_0               1251571 non-null float64
num_sats                          1263257 non-null float64
num_tl_120dpd_2m                  1202848 non-null float64
num_tl_30dpd                      1251571 non-null float64
num_tl_90g_dpd_24m                1251571 non-null float64
num_tl_op_past_12m                1251571 non-null float64
pct_tl_nvr_dlq                    1251418 non-null float64
percent_bc_gt_75                  1258823 non-null float64
pub_rec_bankruptcies              1320482 non-null float64
tax_liens                         1321742 non-null float64
tot_hi_cred_lim                   1251571 non-null float64
total_bal_ex_mort                 1271817 non-null float64
total_bc_limit                    1271817 non-null float64
total_il_high_credit_limit        1251571 non-null float64
dtypes: float64(87), int64(1), object(22)
memory usage: 1.1+ GB

In [11]:
# shows some common summary statistics
# again, transposing with .T to make it easier to read
acc_df.describe().T


Out[11]:
count mean std min 25% 50% 75% max
id 1321847.0 4.872486e+07 3.024669e+07 54734.00 1.858618e+07 54363305.00 7.457674e+07 9.645316e+07
member_id 1321847.0 5.236894e+07 3.218029e+07 70473.00 2.071726e+07 57934813.00 7.995784e+07 1.035709e+08
loan_amnt 1321847.0 1.474822e+04 8.622143e+03 500.00 8.000000e+03 12900.00 2.000000e+04 4.000000e+04
funded_amnt 1321847.0 1.473922e+04 8.618488e+03 500.00 8.000000e+03 12850.00 2.000000e+04 4.000000e+04
funded_amnt_inv 1321847.0 1.471071e+04 8.625729e+03 0.00 8.000000e+03 12800.00 2.000000e+04 4.000000e+04
int_rate 1321847.0 1.317963e+01 4.570613e+00 5.32 9.750000e+00 12.79 1.580000e+01 3.099000e+01
installment 1321847.0 4.391420e+02 2.532324e+02 15.67 2.559000e+02 380.26 5.783100e+02 1.584900e+03
annual_inc 1321843.0 7.649594e+04 6.911192e+04 0.00 4.600000e+04 65000.00 9.000000e+04 9.573072e+06
dti 1321847.0 1.887239e+01 7.098519e+01 -1.00 1.204000e+01 17.83 2.424000e+01 9.999000e+03
delinq_2yrs 1321818.0 3.295938e-01 8.920428e-01 0.00 0.000000e+00 0.00 0.000000e+00 3.900000e+01
inq_last_6mths 1321817.0 6.508745e-01 9.576629e-01 0.00 0.000000e+00 0.00 1.000000e+00 3.300000e+01
mths_since_last_delinq 662677.0 3.389481e+01 2.189243e+01 0.00 1.500000e+01 31.00 4.900000e+01 1.950000e+02
mths_since_last_record 219977.0 6.898570e+01 2.686579e+01 0.00 5.100000e+01 70.00 8.800000e+01 1.290000e+02
open_acc 1321818.0 1.165590e+01 5.469410e+00 0.00 8.000000e+00 11.00 1.400000e+01 9.700000e+01
pub_rec 1321818.0 2.150281e-01 6.171629e-01 0.00 0.000000e+00 0.00 0.000000e+00 8.600000e+01
revol_bal 1321847.0 1.692801e+04 2.293390e+04 0.00 6.295000e+03 11652.00 2.058400e+04 2.904836e+06
revol_util 1321081.0 5.362472e+01 2.404374e+01 0.00 3.580000e+01 54.20 7.220000e+01 8.923000e+02
total_acc 1321818.0 2.502217e+01 1.190355e+01 1.00 1.600000e+01 23.00 3.200000e+01 1.760000e+02
out_prncp 1321847.0 6.755341e+03 7.938161e+03 0.00 0.000000e+00 4124.87 1.120028e+04 4.156034e+04
out_prncp_inv 1321847.0 6.752559e+03 7.935202e+03 0.00 0.000000e+00 4123.36 1.119756e+04 4.156034e+04
total_pymnt 1321847.0 9.250698e+03 8.558047e+03 0.00 2.996065e+03 6662.34 1.271974e+04 5.982952e+04
total_pymnt_inv 1321847.0 9.223675e+03 8.545527e+03 0.00 2.978780e+03 6635.78 1.268514e+04 5.970131e+04
total_rec_prncp 1321847.0 7.139344e+03 7.236062e+03 0.00 1.942560e+03 4705.53 1.000000e+04 4.000000e+04
total_rec_int 1321847.0 2.040949e+03 2.304573e+03 0.00 5.811850e+02 1296.70 2.617710e+03 2.635950e+04
total_rec_late_fee 1321847.0 5.746782e-01 5.254665e+00 0.00 0.000000e+00 0.00 0.000000e+00 3.676000e+02
recoveries 1321847.0 6.982961e+01 4.927118e+02 0.00 0.000000e+00 0.00 0.000000e+00 3.944437e+04
collection_recovery_fee 1321847.0 9.962762e+00 8.154662e+01 0.00 0.000000e+00 0.00 0.000000e+00 7.002190e+03
last_pymnt_amnt 1321847.0 2.619154e+03 5.356282e+03 0.00 2.886600e+02 489.45 1.072560e+03 4.214853e+04
collections_12_mths_ex_med 1321702.0 1.678518e-02 1.467218e-01 0.00 0.000000e+00 0.00 0.000000e+00 2.000000e+01
mths_since_last_major_derog 347331.0 4.441145e+01 2.228005e+01 0.00 2.700000e+01 44.00 6.200000e+01 1.970000e+02
policy_code 1321847.0 1.000000e+00 0.000000e+00 1.00 1.000000e+00 1.00 1.000000e+00 1.000000e+00
annual_inc_joint 9300.0 1.115636e+05 5.019977e+04 11000.00 7.849250e+04 103000.00 1.340000e+05 1.050000e+06
dti_joint 9296.0 1.829942e+01 7.020419e+00 0.32 1.332000e+01 18.01 2.290000e+01 6.949000e+01
acc_now_delinq 1321818.0 5.581706e-03 8.119766e-02 0.00 0.000000e+00 0.00 0.000000e+00 1.400000e+01
tot_coll_amt 1251571.0 2.411676e+02 8.444208e+03 0.00 0.000000e+00 0.00 0.000000e+00 9.152545e+06
tot_cur_bal 1251571.0 1.408428e+05 1.563637e+05 0.00 3.001700e+04 80847.00 2.097875e+05 8.000078e+06
open_acc_6m 455717.0 1.008288e+00 1.188975e+00 0.00 0.000000e+00 1.00 2.000000e+00 1.800000e+01
open_il_6m 455718.0 2.831942e+00 3.052855e+00 0.00 1.000000e+00 2.00 3.000000e+00 4.800000e+01
open_il_12m 455718.0 7.279414e-01 9.787083e-01 0.00 0.000000e+00 0.00 1.000000e+00 2.500000e+01
open_il_24m 455718.0 1.629345e+00 1.657855e+00 0.00 0.000000e+00 1.00 2.000000e+00 5.100000e+01
mths_since_rcnt_il 443459.0 2.127134e+01 2.662149e+01 0.00 7.000000e+00 13.00 2.400000e+01 5.110000e+02
total_bal_il 455718.0 3.561987e+04 4.257060e+04 0.00 9.495000e+03 23781.00 4.636400e+04 1.547285e+06
il_util 395615.0 7.098847e+01 2.317377e+01 0.00 5.800000e+01 74.00 8.700000e+01 1.000000e+03
open_rv_12m 455718.0 1.373542e+00 1.536616e+00 0.00 0.000000e+00 1.00 2.000000e+00 2.800000e+01
open_rv_24m 455718.0 2.903416e+00 2.631446e+00 0.00 1.000000e+00 2.00 4.000000e+00 6.000000e+01
max_bal_bc 455718.0 5.777328e+03 5.619377e+03 0.00 2.362000e+03 4385.00 7.460000e+03 7.768430e+05
all_util 455695.0 6.022548e+01 2.015962e+01 0.00 4.700000e+01 61.00 7.400000e+01 2.040000e+02
total_rev_hi_lim 1251571.0 3.276469e+04 3.669014e+04 0.00 1.410000e+04 24100.00 4.060000e+04 9.999999e+06
inq_fi 455718.0 9.650091e-01 1.497837e+00 0.00 0.000000e+00 0.00 1.000000e+00 4.800000e+01
total_cu_tl 455717.0 1.498709e+00 2.722067e+00 0.00 0.000000e+00 0.00 2.000000e+00 1.110000e+02
inq_last_12m 455717.0 2.166066e+00 2.459050e+00 0.00 0.000000e+00 1.00 3.000000e+00 4.900000e+01
acc_open_past_24mths 1271817.0 4.551259e+00 3.120103e+00 0.00 2.000000e+00 4.00 6.000000e+00 6.400000e+01
avg_cur_bal 1251559.0 1.336583e+04 1.600496e+04 0.00 3.149000e+03 7424.00 1.851800e+04 9.580840e+05
bc_open_to_buy 1259244.0 9.547611e+03 1.458259e+04 0.00 1.345000e+03 4359.00 1.152000e+04 7.111400e+05
bc_util 1258527.0 6.210613e+01 2.751823e+01 0.00 4.160000e+01 65.80 8.620000e+01 3.396000e+02
chargeoff_within_12_mths 1321702.0 8.962686e-03 1.083899e-01 0.00 0.000000e+00 0.00 0.000000e+00 1.000000e+01
delinq_amnt 1321818.0 1.561729e+01 8.128777e+02 0.00 0.000000e+00 0.00 0.000000e+00 1.854080e+05
mo_sin_old_il_acct 1214312.0 1.270696e+02 5.201166e+01 0.00 1.000000e+02 130.00 1.530000e+02 7.240000e+02
mo_sin_old_rev_tl_op 1251570.0 1.843689e+02 9.484415e+01 2.00 1.190000e+02 167.00 2.340000e+02 9.010000e+02
mo_sin_rcnt_rev_tl_op 1251570.0 1.344386e+01 1.680114e+01 0.00 4.000000e+00 8.00 1.600000e+01 4.380000e+02
mo_sin_rcnt_tl 1251571.0 8.086934e+00 9.065880e+00 0.00 3.000000e+00 6.00 1.000000e+01 3.140000e+02
mort_acc 1271817.0 1.689627e+00 2.013544e+00 0.00 0.000000e+00 1.00 3.000000e+00 6.100000e+01
mths_since_recent_bc 1260066.0 2.461952e+01 3.149733e+01 0.00 6.000000e+00 14.00 2.900000e+01 6.390000e+02
mths_since_recent_bc_dlq 318765.0 3.941078e+01 2.275385e+01 0.00 2.000000e+01 38.00 5.800000e+01 1.950000e+02
mths_since_recent_inq 1139670.0 6.833985e+00 5.923765e+00 0.00 2.000000e+00 5.00 1.000000e+01 2.500000e+01
mths_since_recent_revol_delinq 448529.0 3.546095e+01 2.241687e+01 0.00 1.700000e+01 32.00 5.200000e+01 1.970000e+02
num_accts_ever_120_pd 1251571.0 5.053049e-01 1.309245e+00 0.00 0.000000e+00 0.00 0.000000e+00 5.100000e+01
num_actv_bc_tl 1251571.0 3.714223e+00 2.264803e+00 0.00 2.000000e+00 3.00 5.000000e+00 4.700000e+01
num_actv_rev_tl 1251571.0 5.778005e+00 3.334688e+00 0.00 3.000000e+00 5.00 7.000000e+00 5.900000e+01
num_bc_sats 1263257.0 4.745405e+00 2.942177e+00 0.00 3.000000e+00 4.00 6.000000e+00 7.100000e+01
num_bc_tl 1251571.0 8.150501e+00 4.788671e+00 0.00 5.000000e+00 7.00 1.100000e+01 7.900000e+01
num_il_tl 1251571.0 8.514671e+00 7.352151e+00 0.00 3.000000e+00 7.00 1.100000e+01 1.590000e+02
num_op_rev_tl 1251571.0 8.319066e+00 4.541131e+00 0.00 5.000000e+00 7.00 1.100000e+01 9.100000e+01
num_rev_accts 1251570.0 1.468883e+01 8.080868e+00 0.00 9.000000e+00 13.00 1.900000e+01 1.180000e+02
num_rev_tl_bal_gt_0 1251571.0 5.731843e+00 3.253593e+00 0.00 3.000000e+00 5.00 7.000000e+00 5.900000e+01
num_sats 1263257.0 1.170843e+01 5.475699e+00 0.00 8.000000e+00 11.00 1.400000e+01 9.700000e+01
num_tl_120dpd_2m 1202848.0 8.646146e-04 3.117603e-02 0.00 0.000000e+00 0.00 0.000000e+00 6.000000e+00
num_tl_30dpd 1251571.0 3.875929e-03 6.640046e-02 0.00 0.000000e+00 0.00 0.000000e+00 4.000000e+00
num_tl_90g_dpd_24m 1251571.0 9.073636e-02 5.063076e-01 0.00 0.000000e+00 0.00 0.000000e+00 3.900000e+01
num_tl_op_past_12m 1251571.0 2.109639e+00 1.804914e+00 0.00 1.000000e+00 2.00 3.000000e+00 3.200000e+01
pct_tl_nvr_dlq 1251418.0 9.406750e+01 8.790950e+00 0.00 9.110000e+01 97.60 1.000000e+02 1.000000e+02
percent_bc_gt_75 1258823.0 4.745467e+01 3.583138e+01 0.00 1.670000e+01 50.00 7.500000e+01 1.000000e+02
pub_rec_bankruptcies 1320482.0 1.267870e-01 3.726020e-01 0.00 0.000000e+00 0.00 0.000000e+00 1.200000e+01
tax_liens 1321742.0 5.661468e-02 4.183959e-01 0.00 0.000000e+00 0.00 0.000000e+00 8.500000e+01
tot_hi_cred_lim 1251571.0 1.736727e+05 1.768078e+05 0.00 4.978900e+04 112389.00 2.508545e+05 9.999999e+06
total_bal_ex_mort 1271817.0 5.029927e+04 4.779373e+04 0.00 2.143000e+04 37888.00 6.325100e+04 2.921551e+06
total_bc_limit 1271817.0 2.148790e+04 2.130429e+04 0.00 7.800000e+03 15000.00 2.800000e+04 1.105500e+06
total_il_high_credit_limit 1251571.0 4.221178e+04 4.330529e+04 0.00 1.482500e+04 31728.00 5.671200e+04 2.101913e+06

Some common pandas functions

We find ourselves doing the same types of things often in pandas. For example, masking.

Masking dataframes

Maybe we only want to see one column. We do this like

acc_df['int_rate']

We can also subset a dataframe based on some critereon. Let's look at only the highest interest rates

acc_df[acc_df['int_rate'] > 20]

We can take the mean, median, get the standard deviation of columns:

acc_df['int_rate'].mean()


In [12]:
acc_df[acc_df['int_rate'] > 20]['int_rate'].mean()
# thats a bit of a complicated statement.  Breaking it down:
# acc_df['int_rate'] > 20 returns a mask: an array of True/False values
# putting this array into acc_df[] returns the dataframe rows where the interest rates are greater than 20

# so this first part: acc_df[acc_df['int_rate'] > 20]
# gives us a dataframe

# we select a column with ['int_rate'] at the end.  Then we get the average value with .mean()


Out[12]:
23.147697456492637

Finding unique values

Numpy has a function for finding the unique values in an array:

np.unique(array)

This is built into pandas:

acc_df['grade'].unique()

shows us the unique values in that column.


In [13]:
acc_df['grade'].unique()


Out[13]:
array(['B', 'C', 'A', 'E', 'F', 'D', 'G'], dtype=object)

In [15]:
# selecting only grade A loans:
acc_df[acc_df['grade'] == 'A'].describe().T


Out[15]:
count mean std min 25% 50% 75% max
id 219054.0 4.832391e+07 2.946658e+07 56121.00 2.017032e+07 5.445476e+07 7.367424e+07 9.642933e+07
member_id 219054.0 5.191378e+07 3.136139e+07 70473.00 2.242162e+07 5.803389e+07 7.899566e+07 1.035410e+08
loan_amnt 219054.0 1.412557e+04 7.840254e+03 500.00 8.000000e+03 1.200000e+04 2.000000e+04 4.000000e+04
funded_amnt 219054.0 1.411501e+04 7.841662e+03 500.00 8.000000e+03 1.200000e+04 2.000000e+04 4.000000e+04
funded_amnt_inv 219054.0 1.409732e+04 7.848511e+03 0.00 8.000000e+03 1.200000e+04 2.000000e+04 4.000000e+04
int_rate 219054.0 7.129947e+00 1.017642e+00 5.32 6.490000e+00 7.260000e+00 7.890000e+00 9.630000e+00
installment 219054.0 4.278599e+02 2.387787e+02 15.67 2.458500e+02 3.726700e+02 5.995800e+02 1.264380e+03
annual_inc 219050.0 8.942210e+04 9.121994e+04 0.00 5.400000e+04 7.500000e+04 1.050000e+05 9.573072e+06
dti 219054.0 1.585561e+01 4.331526e+01 0.00 1.012000e+01 1.513000e+01 2.081000e+01 9.999000e+03
delinq_2yrs 219043.0 2.042110e-01 6.567539e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+01
inq_last_6mths 219043.0 3.743055e-01 6.946100e-01 0.00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+01
mths_since_last_delinq 86975.0 3.576146e+01 2.161122e+01 0.00 1.800000e+01 3.300000e+01 5.100000e+01 1.700000e+02
mths_since_last_record 19140.0 6.988130e+01 2.796503e+01 0.00 5.200000e+01 7.100000e+01 9.000000e+01 1.210000e+02
open_acc 219043.0 1.198380e+01 5.467297e+00 1.00 8.000000e+00 1.100000e+01 1.500000e+01 8.600000e+01
pub_rec 219043.0 1.111471e-01 4.541204e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.700000e+01
revol_bal 219054.0 1.920072e+04 2.840560e+04 0.00 6.805000e+03 1.264750e+04 2.244000e+04 2.568995e+06
revol_util 218987.0 4.086190e+01 2.287719e+01 0.00 2.320000e+01 3.890000e+01 5.690000e+01 1.720000e+02
total_acc 219043.0 2.646038e+01 1.157362e+01 1.00 1.800000e+01 2.500000e+01 3.300000e+01 1.370000e+02
out_prncp 219054.0 5.846108e+03 6.782759e+03 0.00 0.000000e+00 3.971180e+03 9.321020e+03 4.156034e+04
out_prncp_inv 219054.0 5.843938e+03 6.780002e+03 0.00 0.000000e+00 3.971065e+03 9.320880e+03 4.156034e+04
total_pymnt 219054.0 9.089074e+03 7.441278e+03 0.00 3.432205e+03 6.987388e+03 1.275515e+04 4.202719e+04
total_pymnt_inv 219054.0 9.072305e+03 7.438505e+03 0.00 3.417060e+03 6.967070e+03 1.273452e+04 4.202719e+04
total_rec_prncp 219054.0 8.083666e+03 6.852942e+03 0.00 2.889110e+03 6.000000e+03 1.139958e+04 4.000000e+04
total_rec_int 219054.0 9.908640e+02 7.894928e+02 0.00 4.065600e+02 7.915350e+02 1.370910e+03 5.752190e+03
total_rec_late_fee 219054.0 2.201795e-01 2.991340e+00 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.128500e+02
recoveries 219054.0 1.432414e+01 2.179274e+02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.156742e+04
collection_recovery_fee 219054.0 2.007859e+00 3.679182e+01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 5.774800e+03
last_pymnt_amnt 219054.0 2.357615e+03 4.473266e+03 0.00 2.948725e+02 4.903600e+02 1.095330e+03 4.033284e+04
collections_12_mths_ex_med 219019.0 9.049443e-03 1.153875e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+01
mths_since_last_major_derog 37505.0 4.521227e+01 2.113264e+01 0.00 2.900000e+01 4.500000e+01 6.200000e+01 1.700000e+02
policy_code 219054.0 1.000000e+00 0.000000e+00 1.00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
annual_inc_joint 832.0 1.293609e+05 5.792891e+04 25800.00 9.300000e+04 1.183540e+05 1.540000e+05 4.800000e+05
dti_joint 832.0 1.562950e+01 6.026367e+00 2.07 1.137000e+01 1.526500e+01 1.931250e+01 4.478000e+01
acc_now_delinq 219043.0 1.867213e-03 4.573841e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00
tot_coll_amt 202308.0 1.516021e+02 1.587768e+03 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.241070e+05
tot_cur_bal 202308.0 1.777121e+05 1.779848e+05 0.00 3.926100e+04 1.358080e+05 2.602600e+05 4.035706e+06
open_acc_6m 74545.0 7.127373e-01 9.433867e-01 0.00 0.000000e+00 0.000000e+00 1.000000e+00 1.200000e+01
open_il_6m 74545.0 2.675887e+00 2.895929e+00 0.00 1.000000e+00 2.000000e+00 3.000000e+00 3.800000e+01
open_il_12m 74545.0 4.518076e-01 7.392276e-01 0.00 0.000000e+00 0.000000e+00 1.000000e+00 1.100000e+01
open_il_24m 74545.0 1.223006e+00 1.310623e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 2.000000e+01
mths_since_rcnt_il 73115.0 2.567870e+01 2.782954e+01 0.00 9.000000e+00 1.700000e+01 3.000000e+01 5.030000e+02
total_bal_il 74545.0 3.394634e+04 4.367126e+04 0.00 7.895000e+03 2.146600e+04 4.331300e+04 1.044916e+06
il_util 63289.0 6.402236e+01 2.373237e+01 0.00 4.900000e+01 6.600000e+01 8.100000e+01 2.750000e+02
open_rv_12m 74545.0 1.000322e+00 1.203798e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 1.600000e+01
open_rv_24m 74545.0 2.199209e+00 2.124902e+00 0.00 1.000000e+00 2.000000e+00 3.000000e+00 3.800000e+01
max_bal_bc 74545.0 6.816120e+03 6.369973e+03 0.00 2.880000e+03 5.208000e+03 8.959000e+03 4.575210e+05
all_util 74542.0 4.878659e+01 1.987433e+01 0.00 3.500000e+01 4.800000e+01 6.200000e+01 1.720000e+02
total_rev_hi_lim 202308.0 4.804376e+04 5.372299e+04 0.00 2.190000e+04 3.660000e+04 5.980000e+04 9.999999e+06
inq_fi 74545.0 6.542089e-01 1.097309e+00 0.00 0.000000e+00 0.000000e+00 1.000000e+00 1.600000e+01
total_cu_tl 74545.0 1.460970e+00 2.573779e+00 0.00 0.000000e+00 0.000000e+00 2.000000e+00 4.900000e+01
inq_last_12m 74545.0 1.435603e+00 1.774784e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 2.800000e+01
acc_open_past_24mths 206693.0 3.645382e+00 2.605672e+00 0.00 2.000000e+00 3.000000e+00 5.000000e+00 3.900000e+01
avg_cur_bal 202308.0 1.679908e+04 1.837788e+04 0.00 3.925000e+03 1.142200e+04 2.349200e+04 4.636980e+05
bc_open_to_buy 205655.0 1.998775e+04 2.217775e+04 0.00 5.240500e+03 1.271100e+04 2.682900e+04 4.974450e+05
bc_util 205556.0 4.628191e+01 2.631821e+01 0.00 2.540000e+01 4.460000e+01 6.650000e+01 1.870000e+02
chargeoff_within_12_mths 219019.0 4.894553e-03 8.002961e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 6.000000e+00
delinq_amnt 219043.0 6.655173e+00 5.446757e+02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 6.565200e+04
mo_sin_old_il_acct 198167.0 1.339263e+02 4.874892e+01 0.00 1.100000e+02 1.350000e+02 1.570000e+02 7.240000e+02
mo_sin_old_rev_tl_op 202308.0 2.098849e+02 9.790199e+01 5.00 1.390000e+02 1.890000e+02 2.630000e+02 8.510000e+02
mo_sin_rcnt_rev_tl_op 202308.0 1.674689e+01 1.927746e+01 0.00 5.000000e+00 1.100000e+01 2.100000e+01 4.380000e+02
mo_sin_rcnt_tl 202308.0 1.002895e+01 1.032866e+01 0.00 4.000000e+00 7.000000e+00 1.300000e+01 2.260000e+02
mort_acc 206693.0 2.078014e+00 2.110272e+00 0.00 0.000000e+00 2.000000e+00 3.000000e+00 3.100000e+01
mths_since_recent_bc 205755.0 2.911749e+01 3.526430e+01 0.00 8.000000e+00 1.700000e+01 3.600000e+01 6.390000e+02
mths_since_recent_bc_dlq 39518.0 3.944850e+01 2.231340e+01 0.00 2.100000e+01 3.800000e+01 5.800000e+01 1.700000e+02
mths_since_recent_inq 172696.0 8.471204e+00 6.297372e+00 0.00 3.000000e+00 7.000000e+00 1.300000e+01 2.500000e+01
mths_since_recent_revol_delinq 58226.0 3.598997e+01 2.190405e+01 0.00 1.800000e+01 3.300000e+01 5.100000e+01 1.700000e+02
num_accts_ever_120_pd 202308.0 3.329231e-01 1.057903e+00 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.300000e+01
num_actv_bc_tl 202308.0 3.680388e+00 2.264157e+00 0.00 2.000000e+00 3.000000e+00 5.000000e+00 4.700000e+01
num_actv_rev_tl 202308.0 5.270538e+00 3.120475e+00 0.00 3.000000e+00 5.000000e+00 7.000000e+00 5.900000e+01
num_bc_sats 204483.0 5.219818e+00 3.159926e+00 0.00 3.000000e+00 5.000000e+00 7.000000e+00 6.100000e+01
num_bc_tl 202308.0 9.122402e+00 5.078673e+00 0.00 5.000000e+00 8.000000e+00 1.200000e+01 6.800000e+01
num_il_tl 202308.0 8.566557e+00 7.013565e+00 0.00 4.000000e+00 7.000000e+00 1.100000e+01 1.130000e+02
num_op_rev_tl 202308.0 8.599141e+00 4.687915e+00 0.00 5.000000e+00 8.000000e+00 1.100000e+01 7.900000e+01
num_rev_accts 202308.0 1.577117e+01 8.360803e+00 1.00 1.000000e+01 1.400000e+01 2.000000e+01 1.040000e+02
num_rev_tl_bal_gt_0 202308.0 5.237954e+00 3.033714e+00 0.00 3.000000e+00 5.000000e+00 7.000000e+00 5.900000e+01
num_sats 204483.0 1.212833e+01 5.504708e+00 1.00 8.000000e+00 1.100000e+01 1.500000e+01 8.500000e+01
num_tl_120dpd_2m 197437.0 3.038944e-04 1.828095e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00
num_tl_30dpd 202308.0 1.369199e-03 3.803182e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00
num_tl_90g_dpd_24m 202308.0 4.810487e-02 3.469449e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+01
num_tl_op_past_12m 202308.0 1.549350e+00 1.448455e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 2.000000e+01
pct_tl_nvr_dlq 202292.0 9.601827e+01 6.999248e+00 18.20 9.440000e+01 1.000000e+02 1.000000e+02 1.000000e+02
percent_bc_gt_75 205650.0 2.793310e+01 3.069985e+01 0.00 0.000000e+00 2.000000e+01 5.000000e+01 1.000000e+02
pub_rec_bankruptcies 218886.0 5.868808e-02 2.536086e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 8.000000e+00
tax_liens 219024.0 3.726532e-02 3.351089e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.700000e+01
tot_hi_cred_lim 202308.0 2.292160e+05 2.075162e+05 1600.00 7.658500e+04 1.839230e+05 3.211510e+05 9.999999e+06
total_bal_ex_mort 206693.0 5.271477e+04 5.219592e+04 0.00 2.203700e+04 3.883700e+04 6.548100e+04 2.688920e+06
total_bc_limit 206693.0 3.387474e+04 2.841031e+04 0.00 1.430000e+04 2.600000e+04 4.500000e+04 7.600000e+05
total_il_high_credit_limit 202308.0 4.463529e+04 4.674582e+04 0.00 1.547175e+04 3.320700e+04 5.946825e+04 2.101913e+06

In [16]:
acc_df['loan_status'].unique()


Out[16]:
array(['Fully Paid', 'Charged Off', 'Late (31-120 days)', 'Current',
       'Late (16-30 days)', 'In Grace Period', 'Default',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off', 'Issued'], dtype=object)

In [19]:
# looking at only defaulted loans:
default_categories = ['Default', 'Charged Off', 'Does not meet the credit policy. Status:Charged Off']
# .isin() is a trick for checking if something is in a list
# it's a pandas-specific function
acc_df[acc_df['loan_status'].isin(default_categories)].describe().T
# check out the average interest rate and dti (debt-to-income)


Out[19]:
count mean std min 25% 50% 75% max
id 98930.0 2.699739e+07 2.301416e+07 56413.00 6.310303e+06 2.013981e+07 4.568428e+07 91330911.00
member_id 98930.0 2.923833e+07 2.437432e+07 70694.00 7.610162e+06 2.239203e+07 4.878197e+07 98011682.00
loan_amnt 98930.0 1.508133e+04 8.500336e+03 500.00 8.500000e+03 1.377500e+04 2.000000e+04 40000.00
funded_amnt 98930.0 1.505748e+04 8.489354e+03 500.00 8.500000e+03 1.375000e+04 2.000000e+04 40000.00
funded_amnt_inv 98930.0 1.497194e+04 8.521248e+03 0.00 8.400000e+03 1.350000e+04 2.000000e+04 40000.00
int_rate 98930.0 1.587433e+01 4.382355e+00 5.32 1.299000e+01 1.561000e+01 1.855000e+01 30.99
installment 98930.0 4.515717e+02 2.493054e+02 15.91 2.733900e+02 3.975600e+02 5.845900e+02 1527.00
annual_inc 98930.0 6.769918e+04 6.799978e+04 0.00 4.200000e+04 5.900000e+04 8.000000e+04 8900060.00
dti 98930.0 1.958989e+01 3.282763e+01 0.00 1.329000e+01 1.931000e+01 2.550000e+01 9999.00
delinq_2yrs 98927.0 3.204383e-01 8.769726e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 20.00
inq_last_6mths 98927.0 9.264306e-01 1.169960e+00 0.00 0.000000e+00 1.000000e+00 1.000000e+00 33.00
mths_since_last_delinq 48079.0 3.381019e+01 2.209809e+01 0.00 1.500000e+01 3.000000e+01 5.000000e+01 152.00
mths_since_last_record 16060.0 7.284732e+01 2.910663e+01 0.00 5.400000e+01 7.500000e+01 9.700000e+01 129.00
open_acc 98927.0 1.158510e+01 5.297463e+00 0.00 8.000000e+00 1.100000e+01 1.400000e+01 76.00
pub_rec 98927.0 1.992783e-01 6.243241e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 86.00
revol_bal 98930.0 1.543650e+04 1.967720e+04 0.00 6.121000e+03 1.122700e+04 1.937575e+04 1746716.00
revol_util 98848.0 5.782778e+01 2.355960e+01 0.00 4.120000e+01 5.930000e+01 7.630000e+01 366.60
total_acc 98927.0 2.494847e+01 1.194085e+01 1.00 1.600000e+01 2.300000e+01 3.200000e+01 151.00
out_prncp 98930.0 1.866649e+02 1.760048e+03 0.00 0.000000e+00 0.000000e+00 0.000000e+00 35000.00
out_prncp_inv 98930.0 1.865517e+02 1.758890e+03 0.00 0.000000e+00 0.000000e+00 0.000000e+00 35000.00
total_pymnt 98930.0 6.836523e+03 5.737630e+03 0.00 2.799695e+03 5.276090e+03 9.125892e+03 52956.86
total_pymnt_inv 98930.0 6.793016e+03 5.729176e+03 0.00 2.761983e+03 5.232555e+03 9.087438e+03 52956.86
total_rec_prncp 98930.0 3.586177e+03 3.505095e+03 0.00 1.229388e+03 2.529110e+03 4.763060e+03 34983.14
total_rec_int 98930.0 2.314493e+03 2.400856e+03 0.00 7.488125e+02 1.554030e+03 2.994255e+03 23448.10
total_rec_late_fee 98930.0 2.829853e+00 1.171959e+01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 358.68
recoveries 98930.0 9.330239e+02 1.561515e+03 0.00 0.000000e+00 9.717500e+01 1.408938e+03 39444.37
collection_recovery_fee 98930.0 1.331168e+02 2.691809e+02 0.00 0.000000e+00 1.900000e+00 1.740271e+02 7002.19
last_pymnt_amnt 98930.0 4.665552e+02 6.263558e+02 0.00 2.563000e+02 3.882800e+02 5.844800e+02 34833.35
collections_12_mths_ex_med 98902.0 1.513620e-02 1.347703e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 6.00
mths_since_last_major_derog 24667.0 4.402315e+01 2.232647e+01 0.00 2.600000e+01 4.400000e+01 6.200000e+01 152.00
policy_code 98930.0 1.000000e+00 0.000000e+00 1.00 1.000000e+00 1.000000e+00 1.000000e+00 1.00
annual_inc_joint 73.0 8.860393e+04 3.698016e+04 17950.00 7.113600e+04 8.400000e+04 1.035000e+05 200000.00
dti_joint 73.0 2.156918e+01 8.538098e+00 4.50 1.445000e+01 2.281000e+01 2.676000e+01 38.45
acc_now_delinq 98927.0 5.357486e-03 8.138018e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 6.00
tot_coll_amt 87903.0 1.902361e+02 1.593015e+03 0.00 0.000000e+00 0.000000e+00 0.000000e+00 126297.00
tot_cur_bal 87903.0 1.160545e+05 1.304739e+05 0.00 2.708250e+04 6.099400e+04 1.718290e+05 3437283.00
open_acc_6m 5822.0 1.405187e+00 1.414041e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 12.00
open_il_6m 5822.0 3.027825e+00 3.232541e+00 0.00 1.000000e+00 2.000000e+00 4.000000e+00 38.00
open_il_12m 5822.0 1.154243e+00 1.250012e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 10.00
open_il_24m 5822.0 2.250086e+00 2.022544e+00 0.00 1.000000e+00 2.000000e+00 3.000000e+00 20.00
mths_since_rcnt_il 5673.0 1.547858e+01 2.228752e+01 0.00 4.000000e+00 8.000000e+00 1.700000e+01 245.00
total_bal_il 5822.0 3.736793e+04 4.051580e+04 0.00 1.214100e+04 2.723800e+04 4.968300e+04 796104.00
il_util 5186.0 7.642653e+01 2.081345e+01 0.00 6.500000e+01 8.000000e+01 9.100000e+01 208.00
open_rv_12m 5822.0 1.635349e+00 1.674451e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 18.00
open_rv_24m 5822.0 3.463586e+00 2.859950e+00 0.00 1.000000e+00 3.000000e+00 5.000000e+00 30.00
max_bal_bc 5822.0 5.351811e+03 5.308727e+03 0.00 1.927000e+03 3.945500e+03 7.024000e+03 84988.00
all_util 5822.0 6.464875e+01 1.984519e+01 0.00 5.300000e+01 6.600000e+01 7.900000e+01 161.00
total_rev_hi_lim 87903.0 2.742641e+04 2.703635e+04 0.00 1.250000e+04 2.100000e+04 3.440000e+04 1998700.00
inq_fi 5822.0 1.325318e+00 1.879731e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 23.00
total_cu_tl 5822.0 1.692546e+00 2.899439e+00 0.00 0.000000e+00 0.000000e+00 2.000000e+00 28.00
inq_last_12m 5822.0 2.847303e+00 2.912727e+00 0.00 1.000000e+00 2.000000e+00 4.000000e+00 31.00
acc_open_past_24mths 91295.0 5.151826e+00 3.226986e+00 0.00 3.000000e+00 5.000000e+00 7.000000e+00 41.00
avg_cur_bal 87899.0 1.083432e+04 1.291830e+04 0.00 2.794000e+03 5.784000e+03 1.481450e+04 276369.00
bc_open_to_buy 90264.0 6.547262e+03 1.071069e+04 0.00 8.330000e+02 2.851000e+03 7.629000e+03 282091.00
bc_util 90198.0 6.735443e+01 2.665687e+01 0.00 4.940000e+01 7.290000e+01 9.030000e+01 255.20
chargeoff_within_12_mths 98902.0 8.564033e-03 1.059282e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 5.00
delinq_amnt 98927.0 1.101499e+01 6.290389e+02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 65000.00
mo_sin_old_il_acct 84975.0 1.225154e+02 5.352688e+01 1.00 9.200000e+01 1.260000e+02 1.500000e+02 720.00
mo_sin_old_rev_tl_op 87902.0 1.707054e+02 9.069158e+01 4.00 1.090000e+02 1.540000e+02 2.140000e+02 842.00
mo_sin_rcnt_rev_tl_op 87902.0 1.114201e+01 1.383892e+01 0.00 3.000000e+00 7.000000e+00 1.400000e+01 372.00
mo_sin_rcnt_tl 87903.0 6.797504e+00 7.642844e+00 0.00 2.000000e+00 5.000000e+00 9.000000e+00 197.00
mort_acc 91295.0 1.492174e+00 1.966648e+00 0.00 0.000000e+00 1.000000e+00 2.000000e+00 23.00
mths_since_recent_bc 90347.0 2.078892e+01 2.675381e+01 0.00 5.000000e+00 1.200000e+01 2.500000e+01 538.00
mths_since_recent_bc_dlq 22672.0 3.977135e+01 2.278040e+01 0.00 2.000000e+01 3.900000e+01 5.900000e+01 152.00
mths_since_recent_inq 84998.0 5.998176e+00 5.587802e+00 0.00 1.000000e+00 4.000000e+00 9.000000e+00 25.00
mths_since_recent_revol_delinq 31417.0 3.567467e+01 2.245821e+01 0.00 1.600000e+01 3.300000e+01 5.300000e+01 152.00
num_accts_ever_120_pd 87903.0 4.785957e-01 1.200960e+00 0.00 0.000000e+00 0.000000e+00 0.000000e+00 24.00
num_actv_bc_tl 87903.0 3.784399e+00 2.263239e+00 0.00 2.000000e+00 3.000000e+00 5.000000e+00 26.00
num_actv_rev_tl 87903.0 6.057768e+00 3.345165e+00 0.00 4.000000e+00 5.000000e+00 8.000000e+00 40.00
num_bc_sats 89932.0 4.706389e+00 2.806615e+00 0.00 3.000000e+00 4.000000e+00 6.000000e+00 35.00
num_bc_tl 87903.0 8.398325e+00 4.892832e+00 0.00 5.000000e+00 8.000000e+00 1.100000e+01 52.00
num_il_tl 87903.0 8.542791e+00 7.470562e+00 0.00 3.000000e+00 7.000000e+00 1.100000e+01 117.00
num_op_rev_tl 87903.0 8.473875e+00 4.409085e+00 0.00 5.000000e+00 8.000000e+00 1.100000e+01 62.00
num_rev_accts 87903.0 1.500139e+01 8.060028e+00 0.00 9.000000e+00 1.400000e+01 1.900000e+01 82.00
num_rev_tl_bal_gt_0 87903.0 6.025995e+00 3.300846e+00 0.00 4.000000e+00 5.000000e+00 8.000000e+00 39.00
num_sats 89932.0 1.175865e+01 5.304950e+00 0.00 8.000000e+00 1.100000e+01 1.400000e+01 76.00
num_tl_120dpd_2m 85106.0 8.695039e-04 3.594113e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 6.00
num_tl_30dpd 87903.0 3.788267e-03 6.503084e-02 0.00 0.000000e+00 0.000000e+00 0.000000e+00 3.00
num_tl_90g_dpd_24m 87903.0 9.226079e-02 4.905162e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 20.00
num_tl_op_past_12m 87903.0 2.436618e+00 1.873113e+00 0.00 1.000000e+00 2.000000e+00 3.000000e+00 25.00
pct_tl_nvr_dlq 87885.0 9.456235e+01 8.153480e+00 12.50 9.200000e+01 1.000000e+02 1.000000e+02 100.00
percent_bc_gt_75 90229.0 5.493363e+01 3.466952e+01 0.00 2.500000e+01 5.000000e+01 8.570000e+01 100.00
pub_rec_bankruptcies 98596.0 1.289505e-01 3.703407e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 8.00
tax_liens 98919.0 4.356089e-02 4.450575e-01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 85.00
tot_hi_cred_lim 87903.0 1.412272e+05 1.437184e+05 0.00 4.275050e+04 8.540800e+04 2.018460e+05 3867129.00
total_bal_ex_mort 91295.0 4.693377e+04 4.247270e+04 0.00 2.095100e+04 3.633800e+04 5.937600e+04 1896461.00
total_bc_limit 91295.0 1.738864e+04 1.731728e+04 0.00 6.400000e+03 1.220000e+04 2.250000e+04 560800.00
total_il_high_credit_limit 87903.0 3.906871e+04 3.904861e+04 0.00 1.383700e+04 2.990400e+04 5.278000e+04 757807.00

In [ ]:
# TODO: compare the dti, interest rate of defaulted and non-defaulted loans

In [9]:
# seaborn is a recently-created Python library for easily making nice-looking plots
# you will have to install it with `conda install seaborn` or `pip install seaborn`, etc

# the docs are here for this function: http://seaborn.pydata.org/generated/seaborn.distplot.html
# found by Googling 'seaborn histogram'
f = sns.distplot(acc_df['dti'])



In [10]:
# outliers are screwing up the histogram... remove them
# adapted from http://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list
# we're using interquartile range to determine outliers
def reject_outliers(sr, iq_range=0.5, side='left', return_mask=False):
    """
    Takes an array (or pandas series) and returns an array with outliers excluded, according to the
    interquartile range.
    
    Parameters:
    -----------
    sr: array
        array of numeric values
    iq_range: float
        percent to calculate quartiles by, 0.5 will yield 25% and 75%ile quartiles
    side: string
        if 'left', will return everything below the highest quartile
        if 'right', will return everything above the lowest quartile
        if 'both', will return everything between the high and low quartiles
    """
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    if side=='both':
        mask = (sr - median).abs() <= iqr
    elif side=='left':
        mask = (sr - median) <= iqr
    elif side=='right':
        mask = (sr - median) >= iqr
    else:
        print 'options for side are left, right, or both'
    
    if return_mask:
        return mask
    
    return sr[mask]

In [11]:
# sweeeeeeeetttt....
dti_no_outliers = reject_outliers(acc_df['dti'], iq_range=0.85) # arrived at 0.85 via trial and error
f = sns.distplot(dti_no_outliers)
# other types of plot examples:
# http://seaborn.pydata.org/examples/



In [12]:
# sets the xkcd style if you want to make the plots look funny...may neet to install some fonts
plt.xkcd()


Out[12]:
<matplotlib.rc_context at 0x7f107bfd9150>

In [13]:
f = sns.distplot(dti_no_outliers)


/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'xkcd', u'Humor Sans', u'Comic Sans MS'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [14]:
# back to default
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)

In [15]:
f = sns.distplot(dti_no_outliers)



In [16]:
# other styles:
# http://matplotlib.org/users/style_sheets.html
# this is the R ggplot style, which some people really like
plt.style.use('ggplot')

In [17]:
# http://seaborn.pydata.org/generated/seaborn.regplot.html
# takes a long time because there are a lot of points, but works for smaller datasets
# f = sns.regplot(data=acc_df, x='dti', y='int_rate', fit_reg=False)
# instead, lets make a heatmap:
# http://stackoverflow.com/questions/2369492/generate-a-heatmap-in-matplotlib-using-a-scatter-data-set
# http://seaborn.pydata.org/generated/seaborn.jointplot.html
mask = reject_outliers(acc_df['dti'], iq_range=0.85, return_mask=True)
f = sns.jointplot(data=acc_df.ix[mask, :], x='dti', y='int_rate', kind='hex', joint_kws=dict(gridsize=50))



In [ ]:
# suggetion for assignment:
# explore the data and see what differences you can find between defaulted and paid-back loans
# Are there any variables that look like they might predict if a loan will default?
# feel free to use any other dataset you want for the assignment though
# kaggle.com/datasets and http://archive.ics.uci.edu/ml/ have a lot of datasets, or data.gov