Regression - Frame, Acquire & Refine

Raw Data

You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns

Application Attributes:

  • years: Number of years the applicant has been employed
  • ownership: Whether the applicant owns a house or not
  • income: Annual income of the applicant
  • age: Age of the applicant

Behavioural Attributes:

  • grade: Credit grade of the applicant

Outcome Variable:

  • amount : Amount of Loan provided to the applicant
  • default : Whether the applicant has defaulted or not
  • interest: Interest rate charged for the applicant

Frame the Problem

  • What are the features?
  • What is the target?

Discuss?


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Acquire the Data


In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Defualt Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [7]:
#Load the dataset
df = pd.read_csv('data/loan_data.csv')

In [8]:
#View the first few rows of train
df.head()


Out[8]:
default amount interest grade years ownership income age
0 0 5000 10.65 B 10.00 RENT 24000.00 33
1 0 2400 nan C 25.00 RENT 12252.00 31
2 0 10000 13.49 C 13.00 RENT 49200.00 24
3 0 5000 nan A 3.00 RENT 36000.00 39
4 0 3000 nan E 9.00 RENT 48000.00 24

In [10]:
#View the columns of the train dataset
df.columns


Out[10]:
Index(['default', 'amount', 'interest', 'grade', 'years', 'ownership',
       'income', 'age'],
      dtype='object')

In [12]:
#View the data types of the train dataset
df.dtypes


Out[12]:
default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object

In [13]:
#View the number of records in the data
df.shape


Out[13]:
(29092, 8)

In [14]:
#View summary of raw data 
df.describe()


Out[14]:
default amount interest years income age
count 29092.00 29092.00 26316.00 28283.00 29092.00 29092.00
mean 0.11 9593.51 11.00 6.15 67168.83 27.70
std 0.31 6323.42 3.24 6.68 63606.52 6.23
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 7.90 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56424.00 26.00
75% 0.00 12250.00 13.47 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 6000000.00 144.00

Refine the Data

Lets check the dataset for quality and compeleteness

  1. Missing Values
  2. Outliers

Check for Missing Values


In [16]:
# Find if df has missing values. Hint: There is a isnull() function
df.isnull().head()


Out[16]:
default amount interest grade years ownership income age
0 False False False False False False False False
1 False False True False False False False False
2 False False False False False False False False
3 False False True False False False False False
4 False False True False False False False False

One consideration we check here is the number of observations with missing values for those columns that have missing values. If a column has too many missing values, it might make sense to drop the column.


In [41]:
#let's see how many missing values are present
df.isnull().sum()


Out[41]:
default         0
amount          0
interest     2776
grade           0
years         809
ownership       0
income          0
age             0
dtype: int64

In [42]:
df.isnull()


Out[42]:
default amount interest grade years ownership income age
0 False False False False False False False False
1 False False True False False False False False
2 False False False False False False False False
3 False False True False False False False False
4 False False True False False False False False
5 False False False False False False False False
6 False False False False False False False False
7 False False False False False False False False
8 False False False False False False False False
9 False False False False False False False False
10 False False True False False False False False
11 False False False False False False False False
12 False False True False False False False False
13 False False False False False False False False
14 False False False False False False False False
15 False False False False False False False False
16 False False False False False False False False
17 False False False False False False False False
18 False False False False False False False False
19 False False False False False False False False
20 False False False False False False False False
21 False False False False False False False False
22 False False False False False False False False
23 False False False False False False False False
24 False False False False False False False False
25 False False False False False False False False
26 False False False False False False False False
27 False False False False False False False False
28 False False False False False False False False
29 False False False False False False False False
... ... ... ... ... ... ... ... ...
29062 False False False False False False False False
29063 False False True False False False False False
29064 False False False False False False False False
29065 False False False False False False False False
29066 False False False False False False False False
29067 False False False False False False False False
29068 False False False False False False False False
29069 False False True False False False False False
29070 False False False False False False False False
29071 False False False False False False False False
29072 False False False False False False False False
29073 False False False False False False False False
29074 False False False False False False False False
29075 False False False False False False False False
29076 False False False False False False False False
29077 False False True False False False False False
29078 False False False False False False False False
29079 False False False False False False False False
29080 False False False False False False False False
29081 False False False False False False False False
29082 False False False False False False False False
29083 False False True False False False False False
29084 False False False False False False False False
29085 False False False False False False False False
29086 False False False False False False False False
29087 False False False False False False False False
29088 False False False False False False False False
29089 False False False False False False False False
29090 False False False False False False False False
29091 False False True False False False False False

29092 rows × 8 columns


In [43]:
df[df.isnull().any(axis=1)].head()


Out[43]:
default amount interest grade years ownership income age
1 0 2400 nan C 25.00 RENT 12252.00 31
3 0 5000 nan A 3.00 RENT 36000.00 39
4 0 3000 nan E 9.00 RENT 48000.00 24
10 0 10000 nan C 4.00 RENT 42000.00 23
12 0 6000 nan B 1.00 MORTGAGE 84000.00 30

So, we see that two columns have missing values: interest and years. Both the columns are numeric. We have three options for dealing with this missing values

Options to treat Missing Values

  • REMOVE - NAN rows
  • IMPUTATION - Replace them with something??
    • Mean
    • Median
    • Fixed Number - Domain Relevant
    • High Number (999) - Issue with modelling
  • BINNING - Categorical variable and "Missing becomes a number*
  • DOMAIN SPECIFIC - Entry error, pipeline, etc.

In [44]:
#Let's replace missing values with the median of the column
df.interest.median()


Out[44]:
10.99

In [52]:
?df.fillna

In [58]:
df[df.isnull().any(axis=1)].head(20)


Out[58]:
default amount interest grade years ownership income age
1 0 2400 nan C 25.00 RENT 12252.00 31
3 0 5000 nan A 3.00 RENT 36000.00 39
4 0 3000 nan E 9.00 RENT 48000.00 24
10 0 10000 nan C 4.00 RENT 42000.00 23
12 0 6000 nan B 1.00 MORTGAGE 84000.00 30
35 0 14000 nan B 6.00 RENT 80000.00 26
38 0 10000 nan A 11.00 RENT 60000.00 37
41 0 25600 nan B 9.00 RENT 110000.00 29
48 0 8000 nan D 0.00 RENT 62000.00 29
58 0 15000 nan A 1.00 MORTGAGE 52000.00 37
76 0 6000 nan C 30.00 RENT 80000.00 32
87 0 12000 nan B 1.00 RENT 50000.00 26
91 0 18000 nan A 12.00 RENT 56964.00 29
100 0 6375 nan A 4.00 RENT 44000.00 25
115 0 12000 nan C 0.00 RENT 57000.00 23
125 1 2500 7.51 A nan OWN 12000.00 21
128 0 10000 nan C 2.00 RENT 45900.00 65
138 1 15000 nan D 8.00 RENT 75000.00 50
140 0 9000 nan B 9.00 MORTGAGE 74000.00 22
146 1 12000 nan C 8.00 MORTGAGE 70000.00 33

In [64]:
#there's a fillna function
df.fillna(df.median(), inplace=True)


Out[64]:
default amount interest grade years ownership income age
0 0 5000 10.65 B 10.00 RENT 24000.00 33
1 0 2400 10.99 C 25.00 RENT 12252.00 31
2 0 10000 13.49 C 13.00 RENT 49200.00 24
3 0 5000 10.99 A 3.00 RENT 36000.00 39
4 0 3000 10.99 E 9.00 RENT 48000.00 24
5 0 12000 12.69 B 11.00 OWN 75000.00 28
6 1 9000 13.49 C 0.00 RENT 30000.00 22
7 0 3000 9.91 B 3.00 RENT 15000.00 22
8 1 10000 10.65 B 3.00 RENT 100000.00 28
9 0 1000 16.29 D 0.00 RENT 28000.00 22
10 0 10000 10.99 C 4.00 RENT 42000.00 23
11 0 3600 6.03 A 13.00 MORTGAGE 110000.00 27
12 0 6000 10.99 B 1.00 MORTGAGE 84000.00 30
13 0 9200 6.03 A 6.00 RENT 77385.19 24
14 1 21000 12.42 B 17.00 RENT 105000.00 29
15 0 10000 11.71 B 13.00 OWN 50000.00 23
16 0 10000 11.71 B 5.00 RENT 50000.00 22
17 1 6000 11.71 B 1.00 RENT 76000.00 31
18 0 15000 9.91 B 2.00 MORTGAGE 92000.00 23
19 1 15000 14.27 C 9.00 RENT 60000.00 30
20 0 4000 11.71 B 19.00 MORTGAGE 106000.00 27
21 0 8500 11.71 B 0.00 RENT 25000.00 24
22 0 4375 7.51 A 7.00 MORTGAGE 17108.00 38
23 0 31825 7.90 A 5.00 MORTGAGE 75000.00 23
24 0 5000 8.90 A 2.00 RENT 24044.00 22
25 0 7000 15.96 C 7.00 RENT 34000.00 27
26 0 12400 10.65 B 18.00 RENT 41000.00 36
27 0 10800 9.91 B 2.00 RENT 55596.00 35
28 0 15000 7.90 A 1.00 RENT 45000.00 23
29 0 6000 12.42 B 1.00 RENT 36852.00 25
... ... ... ... ... ... ... ... ...
29062 0 12000 12.49 D 9.00 OWN 75000.00 23
29063 1 20000 10.99 E 6.00 MORTGAGE 145000.00 28
29064 0 1200 11.54 C 0.00 RENT 20000.00 29
29065 0 12250 10.59 C 3.00 RENT 60000.00 23
29066 0 2700 15.96 F 1.00 RENT 52200.00 25
29067 0 5000 9.01 B 1.00 MORTGAGE 250000.00 30
29068 0 3500 9.96 B 0.00 RENT 13500.00 21
29069 0 5000 10.99 A 8.00 MORTGAGE 150000.00 30
29070 0 5000 7.43 A 7.00 MORTGAGE 95000.00 25
29071 0 25000 10.59 C 2.00 RENT 48000.00 27
29072 0 10000 10.28 C 0.00 RENT 27376.00 22
29073 0 5000 9.01 B 1.00 RENT 73000.00 24
29074 0 5100 8.38 A 4.00 RENT 32000.00 25
29075 0 5400 7.75 A 0.00 RENT 8000.00 22
29076 0 5300 8.38 A 7.00 MORTGAGE 74000.00 42
29077 0 5000 10.99 C 1.00 OWN 30000.00 25
29078 0 3000 10.28 C 2.00 MORTGAGE 200000.00 67
29079 0 3000 8.70 B 0.00 MORTGAGE 25000.00 36
29080 0 5000 7.43 A 8.00 MORTGAGE 120000.00 29
29081 0 5000 8.07 A 0.00 MORTGAGE 180000.00 21
29082 0 5000 9.33 B 2.00 MORTGAGE 180000.00 23
29083 0 5000 10.99 B 4.00 MORTGAGE 48000.00 33
29084 0 5000 11.22 C 0.00 OWN 80000.00 39
29085 0 5000 7.43 A 1.00 OWN 85000.00 46
29086 0 5000 8.70 B 5.00 MORTGAGE 75000.00 35
29087 0 2500 8.07 A 4.00 MORTGAGE 110000.00 27
29088 0 8500 10.28 C 3.00 RENT 18000.00 25
29089 0 5000 8.07 A 0.00 MORTGAGE 100000.00 27
29090 0 5000 7.43 A 0.00 MORTGAGE 200000.00 23
29091 0 7500 10.99 E 0.00 OWN 22000.00 25

29092 rows × 8 columns


In [65]:
#Now, let's check if train has missing values or not
df.isnull().sum()


Out[65]:
default      0
amount       0
interest     0
grade        0
years        0
ownership    0
income       0
age          0
dtype: int64

Check for Outlier Values

Let us check first the categorical variables


In [66]:
# Which variables are Categorical?
df.dtypes


Out[66]:
default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object

In [69]:
df.grade.value_counts()


Out[69]:
A    9649
B    9329
C    5748
D    3231
E     868
F     211
G      56
Name: grade, dtype: int64

In [68]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.grade, df.default)


Out[68]:
default 0 1
grade
A 9084 565
B 8344 985
C 4904 844
D 2651 580
E 692 176
F 155 56
G 35 21

In [70]:
# Create a Crosstab of those variables with another variable
df.ownership.value_counts()


Out[70]:
RENT        14692
MORTGAGE    12002
OWN          2301
OTHER          97
Name: ownership, dtype: int64

Let us check outliers in the continuous variable

  • Plotting
    • Histogram
    • Box-Plot
  • Measuring
    • Z-score > 3
    • Modified Z-score > 3.5 where modified Z-score = 0.6745 * (x - x_median) / MAD

In [71]:
# Describe the data set continuous values
df.describe()


Out[71]:
default amount interest years income age
count 29092.00 29092.00 29092.00 29092.00 29092.00 29092.00
mean 0.11 9593.51 11.00 6.09 67168.83 27.70
std 0.31 6323.42 3.08 6.59 63606.52 6.23
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 8.49 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56424.00 26.00
75% 0.00 12250.00 13.11 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 6000000.00 144.00

Clearly the age variable looks like it has an outlier - Age cannot be greater 100! Also the income variable looks like it may also have an outlier.


In [87]:
?plt.yscale

In [88]:
# Make a histogram of age

df.age.hist()
#plt.xlim(60,80)
#plt.ylim(0,100)
plt.yscale('log')



In [74]:
import seaborn as sns

In [75]:
sns.distplot(df.age)


/Users/amitkaps/miniconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x1136b4828>

In [89]:
# Make a histogram of income
df.income.hist()
plt.yscale('log')



In [91]:
# Make Histograms for all other 
df.years.hist()
plt.yscale('log')



In [95]:
plt.boxplot(df.income)


Out[95]:
{'boxes': [<matplotlib.lines.Line2D at 0x1147d0cc0>],
 'caps': [<matplotlib.lines.Line2D at 0x1147be780>,
  <matplotlib.lines.Line2D at 0x1147be8d0>],
 'fliers': [<matplotlib.lines.Line2D at 0x114bf1be0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x114793da0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x1147d07f0>,
  <matplotlib.lines.Line2D at 0x11476c240>]}

In [98]:
plt.boxplot(df.years)


Out[98]:
{'boxes': [<matplotlib.lines.Line2D at 0x114da7ba8>],
 'caps': [<matplotlib.lines.Line2D at 0x114da1fd0>,
  <matplotlib.lines.Line2D at 0x114d9aef0>],
 'fliers': [<matplotlib.lines.Line2D at 0x1147d5fd0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x1147d50f0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x114da7e10>,
  <matplotlib.lines.Line2D at 0x114da1e10>]}

In [99]:
# Make a scatter of age and income
plt.scatter(df.income, df.age)


Out[99]:
<matplotlib.collections.PathCollection at 0x114dec6d8>

Find the observation which has age = 144 and remove it from the dataframe


In [102]:
# Find the observation 
df[df.age == df.age.max()]


Out[102]:
default amount interest grade years ownership income age
19485 0 5000 12.73 C 12.00 MORTGAGE 6000000.00 144

In [105]:
df[df.age == df.age.max()].index


Out[105]:
Int64Index([19485], dtype='int64')

In [108]:
df.drop?

In [110]:
# Use drop to remove the observation inplace
df.drop(19485, inplace=True)

In [111]:
# Find the shape of the df
df.shape


Out[111]:
(29091, 8)

In [112]:
# Check again for outliers
plt.scatter(df.age, df.income)


Out[112]:
<matplotlib.collections.PathCollection at 0x113d20710>

In [113]:
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)

In [ ]:
#We are good to go to the next step

In [ ]: