You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns:
Application Attributes:
years: Number of years the applicant has been employed
ownership: Whether the applicant owns a house or not
income: Annual income of the applicant
age: Age of the applicant
Behavioural Attributes:
grade: Credit grade of the applicant
Outcome Variable:
amount: Amount of loan provided to the applicant
default: Whether the applicant has defaulted or not
interest: Interest rate charged for the applicant
Discuss?
In [2]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [3]:
#Default Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In [4]:
#Load the dataset
df = pd.read_csv("data/loan_data.csv")
In [5]:
#View the first few rows of the dataset
df.head()
Out[5]:
In [6]:
#View the columns of the dataset
df.columns
Out[6]:
In [7]:
#View the data types of the dataset
df.dtypes
Out[7]:
In [8]:
#View the number of records in the data
df.shape
Out[8]:
In [9]:
#View summary of raw data
df.describe()
Out[9]:
In [10]:
# Find if df has missing values. Hint: There is an isnull() function
df.isnull().head()
Out[10]:
One thing to check here is how many observations are missing in each column that has missing values. If a column has too many missing values, it might make sense to drop the column.
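To make "too many missing values" concrete, the share of missing values per column can be compared against a cut-off. The following is a minimal sketch on a toy frame; the helper name `missing_share`, the toy values, and the 0.4 threshold are assumptions for illustration, not part of the bank's data or of this notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame):
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

# Toy data (hypothetical values) standing in for loan_data.csv
toy = pd.DataFrame({
    "years": [1.0, np.nan, 3.0, np.nan],
    "interest": [10.5, 11.0, np.nan, 12.0],
    "age": [25, 40, 31, 52],
})

share = missing_share(toy)

# Hypothetical cut-off: columns missing in more than 40% of rows
threshold = 0.4
to_drop = share[share > threshold].index.tolist()

print(share)
print(to_drop)
```

The right threshold depends on how important the column is and how many rows are available, so treat the cut-off as a judgment call rather than a rule.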
In [11]:
#let's see how many missing values are present
df.isnull().sum()
Out[11]:
So, we see that two columns have missing values: interest and years. Both columns are numeric. We have three options for dealing with these missing values.
Options to treat Missing Values
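The cells that follow fill the gaps with the column median. As a rough illustration of what that choice means, the sketch below contrasts it with simply dropping incomplete rows; dropping rows is shown here only as a point of comparison and is an assumption, as are the toy values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); grade is categorical, the rest numeric
toy = pd.DataFrame({
    "years": [2.0, np.nan, 10.0, 4.0],
    "interest": [9.0, 13.0, np.nan, 11.0],
    "grade": ["A", "B", "C", "A"],
})

# Alternative for comparison: drop every row that has a missing value
dropped = toy.dropna()

# Median imputation: compute medians on numeric columns only, then fill
medians = toy.median(numeric_only=True)
filled = toy.fillna(medians)

print(len(dropped), len(filled))
print(filled)
```

Dropping rows loses half of this toy frame, while median imputation keeps every row at the cost of inventing plausible values for the gaps.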
In [12]:
#Let's replace missing values with the median of the column
df.describe()
Out[12]:
In [13]:
#there's a fillna function
df = df.fillna(df.median(numeric_only=True))
In [50]:
#Now, let's check if df has missing values or not
df.isnull().any()
Out[50]:
In [15]:
# Which variables are Categorical?
df.dtypes
Out[15]:
In [16]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.grade)
Out[16]:
In [17]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.default, df.ownership)
Out[17]:
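Raw counts in a crosstab can be hard to compare when the groups differ in size. A possible follow-up, sketched here on toy data with hypothetical values, is to normalise each column so that the cells read as the share of defaulters within each grade.

```python
import pandas as pd

# Toy data (hypothetical values): 1 means the applicant defaulted
toy = pd.DataFrame({
    "default": [0, 1, 0, 0, 1, 1],
    "grade": ["A", "A", "A", "A", "B", "B"],
})

# normalize="columns" divides each cell by its column total,
# so every grade column sums to 1
rates = pd.crosstab(toy["default"], toy["grade"], normalize="columns")

print(rates)
```

The row for `default == 1` then gives the default rate per grade directly.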
Let us check for outliers in the continuous variables
In [54]:
# Describe the data set continuous values
df.describe()
Out[54]:
Clearly the age variable looks like it has an outlier: age cannot be greater than 100!
The income variable looks like it may also have an outlier.
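Reading outliers off the summary table works for an obvious case like age. For a less clear-cut variable such as income, one commonly used rule of thumb is the interquartile range (IQR) fence. The sketch below is not the method used in this notebook; the helper `iqr_outliers`, the multiplier `k=1.5`, and the toy ages are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy ages (hypothetical values) including one implausible entry
ages = pd.Series([22, 25, 27, 30, 33, 36, 41, 144])

mask = iqr_outliers(ages)
flagged = ages[mask].tolist()

print(flagged)
```

A flagged value is only a candidate: whether to remove it still depends on domain knowledge, as with the age check above.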
In [55]:
# Make a histogram of age
df.age.hist(bins=100)
Out[55]:
In [56]:
# Make a histogram of income
df.income.hist(bins=100)
Out[56]:
In [18]:
# Make Histograms for all other variables
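One possible way to approach this exercise is to let pandas draw a histogram for every numeric column in one call. This is a minimal sketch on toy data: the helper `plot_numeric_histograms` and the values are hypothetical, and the `Agg` backend line is only there so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

def plot_numeric_histograms(frame, bins=20):
    """Draw one histogram per numeric column and return the axes array."""
    numeric = frame.select_dtypes(include="number")
    return numeric.hist(bins=bins, figsize=(16, 9))

# Toy data (hypothetical values); grade is categorical and is skipped
toy = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "income": [40000, 52000, 61000, 38000, 75000],
    "amount": [5000, 12000, 8000, 3000, 20000],
    "grade": ["A", "B", "A", "C", "B"],
})

axes = plot_numeric_histograms(toy)

# Each used subplot is titled with its column name
titles = sorted(ax.get_title() for ax in axes.ravel() if ax.get_title())
print(titles)
plt.close("all")
```

On the real data the same call would be `plot_numeric_histograms(df, bins=100)` to match the bin count used for age and income.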
In [67]:
# Make a scatter of age and income
plt.scatter(df.age, df.income)
Out[67]:
Find the observation which has age = 144 and remove it from the dataframe
In [57]:
# Find the observation
df[df.age == 144]
Out[57]:
In [60]:
df[df.age == 144].index
Out[60]:
In [64]:
# Use drop to remove the observation inplace
df.drop(df[df.age == 144].index, axis=0, inplace=True)
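An alternative to dropping by index in place is to keep only the rows that pass a boolean filter and assign the result to a new name, which leaves the original frame untouched. A minimal sketch on toy data with hypothetical values:

```python
import pandas as pd

# Toy data (hypothetical values) with one implausible age
toy = pd.DataFrame({
    "age": [23, 144, 35],
    "income": [40000, 6000000, 52000],
})

# Keep every row whose age is not 144, then renumber the index
cleaned = toy[toy["age"] != 144].reset_index(drop=True)

print(toy.shape, cleaned.shape)
```

Which style to use is mostly a matter of taste; the filtered copy makes it easy to compare the data before and after the removal.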
In [68]:
# Find the shape of the df
df.shape
Out[68]:
In [70]:
# Check again for outliers
df.describe()
Out[70]:
In [72]:
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)
In [19]:
#We are good to go to the next step