You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns:
Application Attributes:
years: Number of years the applicant has been employed
ownership: Whether the applicant owns a house or not
income: Annual income of the applicant
age: Age of the applicant
Behavioural Attributes:
grade: Credit grade of the applicant
Outcome Variable:
amount: Amount of loan provided to the applicant
default: Whether the applicant has defaulted or not
interest: Interest rate charged for the applicant
Discuss?
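Before exploring the data, it can help to confirm that the loaded frame actually contains the columns described above. The following is a minimal sketch: the helper `missing_columns` and the tiny `sample` frame are hypothetical stand-ins, and the sample values (including the ownership labels) are made up for illustration.

```python
import pandas as pd

# Column names taken from the description above
EXPECTED_COLUMNS = ["years", "ownership", "income", "age",
                    "grade", "amount", "default", "interest"]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from frame."""
    return [col for col in expected if col not in frame.columns]

# Tiny made-up frame standing in for loan_data.csv (values are illustrative only)
sample = pd.DataFrame({
    "years": [2.0, 5.0],
    "ownership": ["RENT", "OWN"],
    "income": [40000.0, 65000.0],
    "age": [25, 41],
    "grade": ["A", "C"],
    "amount": [5000, 12000],
    "default": [0, 1],
    "interest": [10.5, 14.2],
})

print(missing_columns(sample))                          # nothing is missing
print(missing_columns(sample.drop(columns=["grade"])))  # 'grade' is reported
```

With the real data, the same helper could be called on `df` right after `pd.read_csv`.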
In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
#Default Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In [7]:
#Load the dataset
df = pd.read_csv('data/loan_data.csv')
In [8]:
#View the first few rows of the dataset
df.head()
Out[8]:
In [10]:
#View the columns of the dataset
df.columns
Out[10]:
In [12]:
#View the data types of the dataset
df.dtypes
Out[12]:
In [13]:
#View the number of records in the data
df.shape
Out[13]:
In [14]:
#View summary of raw data
df.describe()
Out[14]:
In [16]:
# Check whether df has missing values. Hint: there is an isnull() function
df.isnull().head()
Out[16]:
One thing to check here is how many observations are missing in each column that has missing values. If a column has too many missing values, it might make sense to drop the column.
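To make "too many missing values" concrete, the share of missing values per column can be compared against a cut-off. This is only a sketch: the helpers `missing_share` and `columns_to_drop`, the 0.5 threshold, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame):
    """Fraction of missing values in each column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

def columns_to_drop(frame, threshold=0.5):
    """Columns whose missing share exceeds threshold (hypothetical cut-off)."""
    share = missing_share(frame)
    return list(share[share > threshold].index)

# Made-up data: the values are illustrative only
toy = pd.DataFrame({
    "interest": [10.5, np.nan, 12.0, np.nan],   # 2 of 4 missing
    "years": [np.nan, np.nan, np.nan, 3.0],     # 3 of 4 missing
    "age": [25, 41, 33, 52],                    # none missing
})

print(missing_share(toy))
print(columns_to_drop(toy, threshold=0.5))
```

The right threshold depends on the problem; a column that is mostly empty but highly predictive may still be worth keeping.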
In [41]:
#let's see how many missing values are present
df.isnull().sum()
Out[41]:
In [42]:
df.isnull()
Out[42]:
In [43]:
df[df.isnull().any(axis=1)].head()
Out[43]:
So, we see that two columns have missing values: interest and years. Both columns are numeric. We have three options for dealing with these missing values.
Options to treat Missing Values
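A minimal sketch of three commonly used treatments follows: dropping rows, filling with the median, and filling with the mean. These three are assumed for illustration and may not be the exact options the notebook has in mind; the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each numeric column
toy = pd.DataFrame({
    "interest": [10.0, np.nan, 12.0, 20.0],
    "years": [1.0, 4.0, np.nan, 7.0],
    "grade": ["A", "B", "B", "C"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each numeric column with its own median
median_filled = toy.fillna(toy.median(numeric_only=True))

# Option 3: fill each numeric column with its own mean
mean_filled = toy.fillna(toy.mean(numeric_only=True))

print(len(dropped))
print(median_filled)
print(mean_filled)
```

Dropping rows loses data, while the median is less sensitive to extreme values than the mean, which matters when a column contains outliers.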
In [44]:
#Let's replace missing values with the median of the column
df.interest.median()
Out[44]:
In [52]:
?df.fillna
In [58]:
df[df.isnull().any(axis=1)].head(20)
Out[58]:
In [64]:
#There's a fillna function; fill the numeric columns with their medians
df.fillna(df.median(numeric_only=True), inplace=True)
Out[64]:
In [65]:
#Now, let's check whether df still has missing values
df.isnull().sum()
Out[65]:
In [66]:
# Which variables are Categorical?
df.dtypes
Out[66]:
In [69]:
df.grade.value_counts()
Out[69]:
In [68]:
# Create a Crosstab of those variables with another variable
pd.crosstab(df.grade, df.default)
Out[68]:
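Raw counts in a crosstab are hard to compare when the grades have very different sizes. As a sketch, `pd.crosstab` can normalize each row so that it reads as a rate. The toy data below is made up, and it assumes `default` is coded as 0 (no default) and 1 (default), which the notebook does not state.

```python
import pandas as pd

# Made-up data; assumes default is coded 0/1
toy = pd.DataFrame({
    "grade":   ["A", "A", "A", "A", "B", "B", "C", "C"],
    "default": [0,   0,   0,   1,   0,   1,   1,   1],
})

# normalize="index" divides each row by its row total,
# so column 1 is the default rate within each grade
rates = pd.crosstab(toy.grade, toy.default, normalize="index")
print(rates)
```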
In [70]:
# Count the applicants in each ownership category
df.ownership.value_counts()
Out[70]:
Let us check for outliers in the continuous variables
In [71]:
# Describe the data set continuous values
df.describe()
Out[71]:
Clearly the age variable looks like it has an outlier - an age greater than 100 is not plausible for a loan applicant!
The income variable looks like it may also have an outlier.
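Reading outliers off `describe()` works for an obvious case like this one. A rule-based check is one alternative; the sketch below uses the interquartile-range (IQR) rule, which is a swapped-in technique and not something the notebook itself applies. The helper `iqr_outliers`, the multiplier `k=1.5`, and the sample ages are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages with one implausible value
ages = pd.Series([22, 25, 27, 30, 31, 35, 40, 144])
mask = iqr_outliers(ages)
print(ages[mask])
```

For heavily skewed variables such as income, this rule tends to flag many legitimate high values, so its output is better treated as a list of candidates to inspect than as rows to delete automatically.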
In [87]:
?plt.yscale
In [88]:
# Make a histogram of age
df.age.hist()
#plt.xlim(60,80)
#plt.ylim(0,100)
plt.yscale('log')
In [74]:
import seaborn as sns
In [75]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df.age, kde=True)
Out[75]:
In [89]:
# Make a histogram of income
df.income.hist()
plt.yscale('log')
In [91]:
# Make a histogram of years
df.years.hist()
plt.yscale('log')
In [95]:
plt.boxplot(df.income)
Out[95]:
In [98]:
plt.boxplot(df.years)
Out[98]:
In [99]:
# Make a scatter of age and income
plt.scatter(df.income, df.age)
Out[99]:
Find the observation which has age = 144 and remove it from the dataframe
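One way to do this without typing the row label by hand is to look up the index of the maximum age and pass it to `drop`. This is a sketch on made-up data; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up data with one implausible age
toy = pd.DataFrame({
    "age": [25, 144, 41],
    "income": [40000, 52000, 65000],
})

# Index labels of the row(s) holding the maximum age
oldest = toy.index[toy.age == toy.age.max()]

# drop returns a new frame; toy itself is left unchanged
cleaned = toy.drop(oldest)
print(cleaned)
```

Because the label is computed rather than hardcoded, the same lines keep working if the rows are reordered or the file is reloaded with a different index.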
In [102]:
# Find the observation
df[df.age == df.age.max()]
Out[102]:
In [105]:
df[df.age == df.age.max()].index
Out[105]:
In [108]:
df.drop?
In [110]:
# Use drop to remove the observation in place
df.drop(19485, inplace=True)
In [111]:
# Find the shape of the df
df.shape
Out[111]:
In [112]:
# Check again for outliers
plt.scatter(df.age, df.income)
Out[112]:
In [113]:
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)
In [ ]:
#We are good to go to the next step