Frame, Acquire & Refine

Raw Data

You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns

Application Attributes:

  • years: Number of years the applicant has been employed
  • ownership: Whether the applicant owns a house or not
  • income: Annual income of the applicant
  • age: Age of the applicant

Behavioural Attributes:

  • grade: Credit grade of the applicant

Outcome Variable:

  • amount : Amount of Loan provided to the applicant
  • default : Whether the applicant has defaulted or not
  • interest: Interest rate charged for the applicant

Frame the Problem

  • What are the features?
  • What is the target?

Discuss?


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Acquire the Data


In [2]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
#Defualt Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [4]:
#Load the dataset
df = pd.read_csv("data/loan_data.csv")

In [5]:
#View the first few rows of train
df.head()


Out[5]:
default amount interest grade years ownership income age
0 0 5000 10.65 B 10.00 RENT 24000.00 33
1 0 2400 nan C 25.00 RENT 12252.00 31
2 0 10000 13.49 C 13.00 RENT 49200.00 24
3 0 5000 nan A 3.00 RENT 36000.00 39
4 0 3000 nan E 9.00 RENT 48000.00 24

In [6]:
#View the columns of the train dataset
df.columns


Out[6]:
Index(['default', 'amount', 'interest', 'grade', 'years', 'ownership',
       'income', 'age'],
      dtype='object')

In [7]:
#View the data types of the train dataset
df.dtypes


Out[7]:
default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object

In [8]:
#View the number of records in the data
df.shape


Out[8]:
(29092, 8)

In [9]:
#View summary of raw data 
df.describe()


Out[9]:
default amount interest years income age
count 29092.00 29092.00 26316.00 28283.00 29092.00 29092.00
mean 0.11 9593.51 11.00 6.15 67168.83 27.70
std 0.31 6323.42 3.24 6.68 63606.52 6.23
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 7.90 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56424.00 26.00
75% 0.00 12250.00 13.47 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 6000000.00 144.00

Refine the Data

Lets check the dataset for quality and compeleteness

  1. Missing Values
  2. Outliers

Check for Missing Values


In [10]:
# Find if df has missing values. Hint: There is a isnull() function
df.isnull().head()


Out[10]:
default amount interest grade years ownership income age
0 False False False False False False False False
1 False False True False False False False False
2 False False False False False False False False
3 False False True False False False False False
4 False False True False False False False False

One consideration we check here is the number of observations with missing values for those columns that have missing values. If a column has too many missing values, it might make sense to drop the column.


In [11]:
#let's see how many missing values are present
df.isnull().sum()


Out[11]:
default         0
amount          0
interest     2776
grade           0
years         809
ownership       0
income          0
age             0
dtype: int64

So, we see that two columns have missing values: interest and years. Both the columns are numeric. We have three options for dealing with this missing values

Options to treat Missing Values

  • REMOVE - NAN rows
  • IMPUTATION - Replace them with something??
    • Mean
    • Median
    • Fixed Number - Domain Relevant
    • High Number (999) - Issue with modelling
  • BINNING - Categorical variable and "Missing becomes a number*
  • DOMAIN SPECIFIC - Entry error, pipeline, etc.

In [12]:
#Let's replace missing values with the median of the column
df.describe()


Out[12]:
default amount interest years income age
count 29092.00 29092.00 26316.00 28283.00 29092.00 29092.00
mean 0.11 9593.51 11.00 6.15 67168.83 27.70
std 0.31 6323.42 3.24 6.68 63606.52 6.23
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 7.90 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56424.00 26.00
75% 0.00 12250.00 13.47 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 6000000.00 144.00

In [13]:
#there's a fillna function
df = df.fillna(df.median())

In [50]:
#Now, let's check if train has missing values or not
df.isnull().any()


Out[50]:
default      False
amount       False
interest     False
grade        False
years        False
ownership    False
income       False
age          False
dtype: bool

Check for Outlier Values

Let us check first the categorical variables


In [15]:
# Which variables are Categorical?
df.dtypes


Out[15]:
default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object

In [16]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.grade)


Out[16]:
grade A B C D E F G
default
0 9084 8344 4904 2651 692 155 35
1 565 985 844 580 176 56 21

In [17]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.ownership)


Out[17]:
ownership MORTGAGE OTHER OWN RENT
default
0 10821 80 2049 12915
1 1181 17 252 1777

Let us check outliers in the continuous variable

  • Plotting
    • Histogram
    • Box-Plot
  • Measuring
    • Z-score > 3
    • Modified Z-score > 3.5 where modified Z-score = 0.6745 * (x - x_median) / MAD

In [54]:
# Describe the data set continuous values
df.describe()


Out[54]:
default amount interest years income age
count 29092.00 29092.00 29092.00 29092.00 29092.00 29092.00
mean 0.11 9593.51 11.00 6.09 67168.83 27.70
std 0.31 6323.42 3.08 6.59 63606.52 6.23
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 8.49 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56424.00 26.00
75% 0.00 12250.00 13.11 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 6000000.00 144.00

Clearly the age variable looks like it has an outlier - Age cannot be greater 100! Also the income variable looks like it may also have an outlier.


In [55]:
# Make a histogram of age
df.age.hist(bins=100)


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x11224eac8>

In [56]:
# Make a histogram of income
df.income.hist(bins=100)


Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cba3be0>

In [18]:
# Make Histograms for all other variables

In [ ]:


In [ ]:


In [ ]:


In [67]:
# Make a scatter of age and income
plt.scatter(df.age, df.income)


Out[67]:
<matplotlib.collections.PathCollection at 0x1127a5f28>

Find the observation which has age = 144 and remove it from the dataframe


In [57]:
# Find the observation 
df[df.age == 144]


Out[57]:
default amount interest grade years ownership income age
19485 0 5000 12.73 C 12.00 MORTGAGE 6000000.00 144

In [60]:
df[df.age == 144].index


Out[60]:
Int64Index([19485], dtype='int64')

In [64]:
# Use drop to remove the observation inplace
df.drop(df[df.age == 144].index, axis=0, inplace=True)

In [68]:
# Find the shape of the df
df.shape


Out[68]:
(29091, 8)

In [70]:
# Check again for outliers
df.describe()


Out[70]:
default amount interest years income age
count 29091.00 29091.00 29091.00 29091.00 29091.00 29091.00
mean 0.11 9593.66 11.00 6.09 66964.89 27.70
std 0.31 6323.47 3.08 6.59 53253.18 6.19
min 0.00 500.00 5.42 0.00 4000.00 20.00
25% 0.00 5000.00 8.49 2.00 40000.00 23.00
50% 0.00 8000.00 10.99 4.00 56400.00 26.00
75% 0.00 12250.00 13.11 8.00 80000.00 30.00
max 1.00 35000.00 23.22 62.00 2039784.00 94.00

In [72]:
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)

In [19]:
#We are good to go to the next step

In [ ]: