Frame, Acquire & Refine

Raw Data

You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns

Application Attributes:

years: Number of years the applicant has been employed
ownership: Whether the applicant owns a house or not
income: Annual income of the applicant
age: Age of the applicant

Behavioural Attributes:

grade: Credit grade of the applicant

Outcome Variable:

amount : Amount of Loan provided to the applicant
default : Whether the applicant has defaulted or not
interest: Interest rate charged for the applicant

Frame the Problem

What are the features?
What is the target?

Discuss?



In [ ]:



In [ ]:



In [ ]:



In [ ]:

Acquire the Data



In [2]:

    
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [3]:

    
#Defualt Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)



In [4]:

    
#Load the dataset
df = pd.read_csv("data/loan_data.csv")



In [5]:

    
#View the first few rows of train
df.head()



In [6]:

    
#View the columns of the train dataset
df.columns









    Out[6]:





Index(['default', 'amount', 'interest', 'grade', 'years', 'ownership',
       'income', 'age'],
      dtype='object')



In [7]:

    
#View the data types of the train dataset
df.dtypes









    Out[7]:





default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object



In [8]:

    
#View the number of records in the data
df.shape









    Out[8]:





(29092, 8)



In [9]:

    
#View summary of raw data 
df.describe()

Refine the Data

Lets check the dataset for quality and compeleteness

Missing Values
Outliers

Check for Missing Values



In [10]:

    
# Find if df has missing values. Hint: There is a isnull() function
df.isnull().head()









    Out[10]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      0
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      1
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      2
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      3
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      4
      False
      False
      True
      False
      False
      False
      False
      False

One consideration we check here is the number of observations with missing values for those columns that have missing values. If a column has too many missing values, it might make sense to drop the column.



In [11]:

    
#let's see how many missing values are present
df.isnull().sum()









    Out[11]:





default         0
amount          0
interest     2776
grade           0
years         809
ownership       0
income          0
age             0
dtype: int64

So, we see that two columns have missing values: interest and years. Both the columns are numeric. We have three options for dealing with this missing values

Options to treat Missing Values

REMOVE - NAN rows
IMPUTATION - Replace them with something??
- Mean
- Median
- Fixed Number - Domain Relevant
- High Number (999) - Issue with modelling
BINNING - Categorical variable and "Missing becomes a number*
DOMAIN SPECIFIC - Entry error, pipeline, etc.



In [12]:

    
#Let's replace missing values with the median of the column
df.describe()



In [13]:

    
#there's a fillna function
df = df.fillna(df.median())



In [50]:

    
#Now, let's check if train has missing values or not
df.isnull().any()









    Out[50]:





default      False
amount       False
interest     False
grade        False
years        False
ownership    False
income       False
age          False
dtype: bool

Check for Outlier Values

Let us check first the categorical variables



In [15]:

    
# Which variables are Categorical?
df.dtypes









    Out[15]:





default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object



In [16]:

    
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.grade)



In [17]:

    
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.ownership)

Let us check outliers in the continuous variable

Plotting
- Histogram
- Box-Plot
Measuring
- Z-score > 3
- Modified Z-score > 3.5 where modified Z-score = 0.6745 * (x - x_median) / MAD



In [54]:

    
# Describe the data set continuous values
df.describe()

Clearly the age variable looks like it has an outlier - Age cannot be greater 100! Also the income variable looks like it may also have an outlier.



In [55]:

    
# Make a histogram of age
df.age.hist(bins=100)









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x11224eac8>



In [56]:

    
# Make a histogram of income
df.income.hist(bins=100)









    Out[56]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cba3be0>



In [18]:

    
# Make Histograms for all other variables



In [ ]:



In [ ]:



In [ ]:



In [67]:

    
# Make a scatter of age and income
plt.scatter(df.age, df.income)









    Out[67]:





<matplotlib.collections.PathCollection at 0x1127a5f28>

Find the observation which has age = 144 and remove it from the dataframe



In [57]:

    
# Find the observation 
df[df.age == 144]









    Out[57]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      19485
      0
      5000
      12.73
      C
      12.00
      MORTGAGE
      6000000.00
      144



In [60]:

    
df[df.age == 144].index









    Out[60]:





Int64Index([19485], dtype='int64')



In [64]:

    
# Use drop to remove the observation inplace
df.drop(df[df.age == 144].index, axis=0, inplace=True)



In [68]:

    
# Find the shape of the df
df.shape









    Out[68]:





(29091, 8)



In [70]:

    
# Check again for outliers
df.describe()



In [72]:

    
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)



In [19]:

    
#We are good to go to the next step



In [ ]:

	amount	interest	grade	years	ownership	income	age
0	5000	10.65	B	10.00	RENT	24000.00	33
1	2400	nan	C	25.00	RENT	12252.00	31
2	10000	13.49	C	13.00	RENT	49200.00	24
3	5000	nan	A	3.00	RENT	36000.00	39
4	3000	nan	E	9.00	RENT	48000.00	24

	default	amount	interest	years	income	age
count	29092.00	29092.00	26316.00	28283.00	29092.00	29092.00
mean	0.11	9593.51	11.00	6.15	67168.83	27.70
std	0.31	6323.42	3.24	6.68	63606.52	6.23
min	0.00	500.00	5.42	0.00	4000.00	20.00
25%	0.00	5000.00	7.90	2.00	40000.00	23.00
50%	0.00	8000.00	10.99	4.00	56424.00	26.00
75%	0.00	12250.00	13.47	8.00	80000.00	30.00
max	1.00	35000.00	23.22	62.00	6000000.00	144.00

	default	amount	interest	years	income	age
count	29092.00	29092.00	26316.00	28283.00	29092.00	29092.00
mean	0.11	9593.51	11.00	6.15	67168.83	27.70
std	0.31	6323.42	3.24	6.68	63606.52	6.23
min	0.00	500.00	5.42	0.00	4000.00	20.00
25%	0.00	5000.00	7.90	2.00	40000.00	23.00
50%	0.00	8000.00	10.99	4.00	56424.00	26.00
75%	0.00	12250.00	13.47	8.00	80000.00	30.00
max	1.00	35000.00	23.22	62.00	6000000.00	144.00

grade	A	B	C	D	E	F	G
default
0	9084	8344	4904	2651	692	155	35
1	565	985	844	580	176	56	21

ownership	MORTGAGE	OTHER	OWN	RENT
default
0	10821	80	2049	12915
1	1181	17	252	1777

	default	amount	interest	grade	years	ownership	income	age
0	False	False	False	False	False	False	False	False
1	False	False	True	False	False	False	False	False
2	False	False	False	False	False	False	False	False
3	False	False	True	False	False	False	False	False
4	False	False	True	False	False	False	False	False

	default	amount	interest	years	income	age
count	29091.00	29091.00	29091.00	29091.00	29091.00	29091.00
mean	0.11	9593.66	11.00	6.09	66964.89	27.70
std	0.31	6323.47	3.08	6.59	53253.18	6.19
min	0.00	500.00	5.42	0.00	4000.00	20.00
25%	0.00	5000.00	8.49	2.00	40000.00	23.00
50%	0.00	8000.00	10.99	4.00	56400.00	26.00
75%	0.00	12250.00	13.11	8.00	80000.00	30.00
max	1.00	35000.00	23.22	62.00	2039784.00	94.00