Based on this great tutorial: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
From the challenge hosted at: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
Dream Housing Finance company deals in all home loans. They have a presence across all urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.
The company wants to automate the loan eligibility process (in real time) based on the customer details provided while filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the problem of identifying the customer segments that are eligible for a loan amount, so that they can specifically target these customers. Here they have provided a partial data set.
Variable | Description |
---|---|
Loan_ID | Unique Loan ID |
Gender | Male/ Female |
Married | Applicant married (Y/N) |
Dependents | Number of dependents |
Education | Applicant Education (Graduate/ Under Graduate) |
Self_Employed | Self employed (Y/N) |
ApplicantIncome | Applicant income |
CoapplicantIncome | Coapplicant income |
LoanAmount | Loan amount in thousands |
Loan_Amount_Term | Term of loan in months |
Credit_History | credit history meets guidelines |
Property_Area | Urban/ Semi Urban/ Rural |
Loan_Status | Loan approved (Y/N) |
The evaluation metric is accuracy, i.e. the percentage of loan approvals you correctly predict.
You may upload the solution in the format of "sample_submission.csv"
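To make the metric concrete, here is a minimal sketch of how accuracy is computed. The labels below are made up for illustration and do not come from the contest data.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only.
y_true = np.array(["Y", "N", "Y", "Y", "N"])
y_pred = np.array(["Y", "Y", "Y", "N", "N"])

# Accuracy is the fraction of predictions that match the true label.
accuracy = (y_true == y_pred).mean()
print("Accuracy: {:.1%}".format(accuracy))
```

Three of the five predictions match here, so the accuracy is 60%.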
To begin, start iPython interface in Inline Pylab mode by typing following on your terminal / windows command prompt:
In [1]:
%pylab inline
This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):
In [2]:
plot(arange(5))
Out[2]:
Following are the libraries we will use during this task:
Please note that you do not need to import matplotlib and numpy because of Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
After importing the library, you read the dataset using function read_csv(). The file is assumed to be downloaded from the moodle to the data folder in your working directory.
In [4]:
df = pd.read_csv("./data/train.csv") #Reading the dataset in a dataframe using Pandas
test_df = pd.read_csv("./data/test.csv") #Reading the dataset in a dataframe using Pandas
Once you have read the dataset, you can have a look at few top rows by using the function head()
In [5]:
df.head(10)
Out[5]:
This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at summary of numerical fields by using describe() function
In [6]:
df.describe() # get the summary of numerical variables
Out[6]:
describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output
We can also try to learn from the test set:
In [7]:
test_df.describe()
Out[7]:
As we can see, there is no significant difference between the test and train sets, so we will apply the same preprocessing to the numeric values in both. It is therefore not necessary to combine the train and test data for filling missing values.
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distribution to understand whether they make sense or not. The frequency table can be printed by following command:
In [8]:
df['Property_Area'].value_counts()
Out[8]:
In [9]:
test_df['Property_Area'].value_counts()
Out[9]:
Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount
Let's start by plotting the histogram of ApplicantIncome using the following commands:
In [10]:
df['ApplicantIncome'].hist(bins=50)
Out[10]:
In [11]:
test_df['ApplicantIncome'].hist(bins=50)
Out[11]:
Here we observe that there are few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. Box plot can be plotted by:
In [12]:
df.boxplot(column='ApplicantIncome')
Out[12]:
This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
In [13]:
df.boxplot(column='ApplicantIncome', by = 'Education')
Out[13]:
We can see that there is no substantial difference between the median income of graduates and non-graduates. But there is a higher number of graduates with very high incomes, which appear as the outliers.
Plot the histogram and boxplot of LoanAmount
In [14]:
df['LoanAmount'].hist(bins=50)
Out[14]:
In [15]:
test_df['LoanAmount'].hist(bins=50)
Out[15]:
In [16]:
df.boxplot(column='LoanAmount')
Out[16]:
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in the coming sections.
Frequency Table for Credit History:
In [17]:
temp1 = df['Credit_History'].value_counts(ascending=True)
temp1
Out[17]:
In [18]:
temp2 = test_df['Credit_History'].value_counts(ascending=True)
temp2
Out[18]:
Probability of getting loan for each Credit History class:
In [19]:
temp3 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())
temp3
Out[19]:
This can be plotted as a bar chart using the “matplotlib” library with following code:
In [20]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar', ax=ax1)
ax2 = fig.add_subplot(122)
temp3.plot(kind='bar', ax=ax2, legend=False) # temp3 holds the approval probability per Credit_History class
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
Out[20]:
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
In [21]:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
Out[21]:
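The same cross-tabulation can also give the approval rate per credit-history class directly, by normalizing each row. The sketch below uses a small made-up frame (the values are an assumption for illustration, not the contest data).

```python
import pandas as pd

# Toy data, made up for illustration; not the contest data.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N", "Y"],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the "Y" column is the share of approved loans in that class.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

On the toy frame, the "Y" share is 0.75 for applicants with a credit history and 0 for those without.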
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now – given the amount of help, the library can provide you in analyzing datasets.
Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.
During our exploration of the data, we found a few problems in the data set, which need to be solved before the data is ready for a good model. This exercise is typically referred to as “Data Munging”. The problems we are already aware of are the missing values in some variables and the extreme values in ApplicantIncome and LoanAmount.
In addition to these problems with numerical fields, we should also look at the non-numerical fields i.e. Gender, Property_Area, Married, Education and Dependents to see, if they contain any useful information.
Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset.
The following command should tell us the number of missing values in each column, as isnull() returns True if the value is null and summing counts the True values.
In [22]:
df.apply(lambda x: sum(x.isnull()),axis=0)
Out[22]:
In [23]:
test_df.apply(lambda x: sum(x.isnull()),axis=0)
Out[23]:
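An equivalent and slightly shorter way to get the same counts is to call isnull() on the whole frame and sum the result column by column. A minimal sketch on a made-up toy frame (the values are an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few holes, made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 66.0],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
missing = toy.isnull().sum()
print(missing)
```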
Though the missing values are not very high in number, many variables have them, and each one of these should be estimated and added to the data.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing and you’re right. So we should check for values which are impractical.
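One possible way to handle such impractical values is to convert them to NaN first, so that they are imputed together with the real missing values. The sketch below assumes that a term of 0 months should count as missing; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration.
toy = pd.DataFrame({"Loan_Amount_Term": [360.0, 0.0, 180.0, np.nan, 0.0]})

# Assumption: a loan term of 0 months is not a real value, so mark it as missing.
toy["Loan_Amount_Term"] = toy["Loan_Amount_Term"].replace(0, np.nan)

n_missing = toy["Loan_Amount_Term"].isnull().sum()
print(n_missing)
```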
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by following code:
In [24]:
# df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
The other extreme could be to build a supervised learning model to predict the loan amount on the basis of the other variables.
Since the purpose now is to bring out the steps in data munging, I’ll rather take an approach which lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of the loan amount.
But first, we have to ensure that neither the Self_Employed nor the Education variable has missing values.
As we saw earlier, Self_Employed has some missing values. Let’s look at the frequency table:
In [25]:
df['Self_Employed'].value_counts()
Out[25]:
Since ~86% values are “No”, it is safe to impute the missing values as “No” as there is a high probability of success. This can be done using the following code:
In [26]:
#df['Self_Employed'].fillna('No',inplace=True)
In [27]:
from numpy.random import choice
draw = choice(["Yes","No"], 1, p=[0.14,0.86])[0]
df['Self_Employed'] = df['Self_Employed'].fillna(draw) # assign back; inplace=True on a selected column is not reliable in recent pandas
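Note that the cell above draws a single value and uses it for every missing entry. If you would rather draw independently for each missing row, so that the filled values follow the observed ratio, a minimal sketch could look like this. The toy frame and the fixed seed are assumptions for illustration; the 0.14/0.86 probabilities are the ones used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy column with three missing entries, made up for illustration.
toy = pd.DataFrame({"Self_Employed": ["No", None, "Yes", None, "No", None]})

# Draw one value per missing row instead of one value for all of them.
mask = toy["Self_Employed"].isnull()
draws = rng.choice(["Yes", "No"], size=mask.sum(), p=[0.14, 0.86])
toy.loc[mask, "Self_Employed"] = draws
print(toy)
```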
Now, we will create a Pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount:
In [28]:
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
table
Out[28]:
Define function to return value of this pivot_table:
In [29]:
def fage(x):
    # Look up the median LoanAmount for this row's Self_Employed / Education group
    return table.loc[x['Self_Employed'],x['Education']]
Replace missing values:
In [30]:
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1)) # assign back instead of inplace=True
This should provide you a good way to impute missing values of loan amount.
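The same group-wise median imputation can also be sketched without a pivot table and a helper function, by using groupby with transform. This is an alternative formulation, not the approach used above, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "Education": ["Graduate"] * 6,
    "LoanAmount": [100.0, 120.0, np.nan, 150.0, np.nan, 170.0],
})

# transform("median") returns, for every row, the median of that row's group
# (missing values are ignored when the median is computed).
group_median = toy.groupby(["Self_Employed", "Education"])["LoanAmount"].transform("median")
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_median)
print(toy)
```

In the toy frame, the missing amount in the "No" group becomes 110 and the one in the "Yes" group becomes 160.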
Let’s analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to reduce their effect:
In [31]:
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
Out[31]:
Now the distribution looks much closer to normal, and the effect of the extreme values has subsided significantly.
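To see why the log helps, we can compare the skewness before and after the transformation. The sketch below uses a handful of made-up amounts with one extreme value; it is only an illustration, not the contest data.

```python
import numpy as np
import pandas as pd

# Made-up loan amounts with a single extreme value.
amounts = pd.Series([90.0, 100.0, 110.0, 120.0, 130.0, 150.0, 700.0])

# Sample skewness of the raw values and of their logarithms.
skew_raw = amounts.skew()
skew_log = np.log(amounts).skew()
print("raw skew: {:.2f}, log skew: {:.2f}".format(skew_raw, skew_log))
```

The log compresses large values more than small ones, so the extreme value sits relatively closer to the rest and the skewness drops.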
Coming to ApplicantIncome: one intuition can be that some applicants have a lower income but strong support from co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.
In [32]:
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
Out[32]:
Now we see that the distribution is much better than before.
In [33]:
df['Loan_Amount_Term'].value_counts()
Out[33]:
In [34]:
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(360) # assign back instead of inplace=True
df['Credit_History'] = df['Credit_History'].fillna(1)
In [35]:
df['Dependents'].value_counts()
Out[35]:
In [36]:
df['Dependents'] = df['Dependents'].fillna(0) # assign back instead of inplace=True
In [37]:
df['Married'].value_counts()
Out[37]:
In [38]:
#df['Married'].fillna('Yes', inplace=True)
#As in the self employed
draw = choice(["Yes","No"], 1, p=[0.66,0.34])[0]
df['Married'] = df['Married'].fillna(draw) # assign back instead of inplace=True
In [39]:
df['Gender'].value_counts()
Out[39]:
In [40]:
#df['Gender'].fillna('Male', inplace=True)
#Save the ratio
draw = choice(["Male","Female"], 1, p=[0.82,0.18])[0]
df['Gender'] = df['Gender'].fillna(draw) # assign back instead of inplace=True
Next, we will look at making predictive models.
Now that we have made the data useful for modeling, let’s look at the Python code to create a predictive model on our data set. Scikit-learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail.
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric ones by encoding the categories. This can be done using the following code:
In [41]:
df.dtypes
Out[41]:
In [42]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i].astype(str))
In [43]:
df.dtypes
Out[43]:
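It is worth checking which number each category was mapped to. LabelEncoder sorts the classes, so for Loan_Status the value 'N' becomes 0 and 'Y' becomes 1, which is what the write_predict helper near the end of this notebook relies on. A minimal sketch on a made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, for illustration only.
status = ["Y", "N", "Y", "Y", "N"]

le = LabelEncoder()
codes = le.fit_transform(status)

# classes_ holds the sorted categories; the position of a class is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```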
Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores.
In [44]:
#Import models from scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold #For K-fold cross-validation (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors],data[outcome])

    #Make predictions on training set:
    predictions = model.predict(data[predictors])

    #Print accuracy
    accuracy = metrics.accuracy_score(predictions,data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    #Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = (data[predictors].iloc[train,:])
        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]
        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)
        #Record the accuracy score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    #Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors],data[outcome])
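scikit-learn also ships a helper, cross_val_score, that runs the fold loop for us. The sketch below shows it on synthetic data generated with make_classification (an assumption for illustration; it is not the loan data). Note that for classifiers an integer cv uses stratified folds, so the scores can differ slightly from the plain KFold loop above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# One accuracy score per fold; the mean is the cross-validation score.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print("Cross-Validation Score : {:.3%}".format(scores.mean()))
```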
Let’s make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting. In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well.
We can easily make some intuitive hypotheses to set the ball rolling. For instance, as the exploration above suggested, the chances of getting a loan should be higher for applicants with a valid credit history.
So let’s make our first model with ‘Credit_History’.
In [45]:
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)
We can try different combinations of variables:
In [46]:
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
#write_predict(model.predict(test_df[predictor_var]), test_df)
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:
A decision tree is another method for making a predictive model. It can provide higher accuracy than a logistic regression model, although this is not guaranteed.
In [47]:
model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)
Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let’s try a few numerical variables:
In [48]:
#We can try different combination of variables:
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,predictor_var,outcome_var)
The result is not much better according to the cross-validation score. Here we observed that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let’s try an even more sophisticated algorithm and see if it helps:
Random forest is another algorithm for solving the classification problem.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.
In [49]:
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways: reducing the number of predictors, or tuning the model parameters.
Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.
In [50]:
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
Let’s use the most important variables for creating a model (the cell below keeps eight predictors). Also, we will modify the parameters of the random forest model a little bit:
In [51]:
model = RandomForestClassifier(n_estimators=100, min_samples_split=25)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Gender','Married','Self_Employed','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
#write_predict(model.predict(test_df[predictor_var]), test_df)
Notice that although the accuracy dropped, the cross-validation score is improving, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.
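If you need repeatable results, you can fix the random_state parameter of the model. The sketch below fits two forests with the same seed on synthetic data (made up with make_classification for illustration; the seed value 42 is arbitrary) and checks that they give identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Same seed, same data: both forests are built identically.
first = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
second = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

same = np.array_equal(first.predict_proba(X), second.predict_proba(X))
print(same)
```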
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:
Be proud of yourself for getting this far! You are invited to improve your result and submit to the site to test your place in the leaderboard.
In [52]:
from numpy.random import choice
from sklearn.preprocessing import LabelEncoder

def preprocess(df, is_test):
    # Fill missing Self_Employed with one random draw that follows the observed ratio
    draw = choice(["Yes","No"], 1, p=[0.14,0.86])[0]
    df['Self_Employed'] = df['Self_Employed'].fillna(draw)
    # Fill missing LoanAmount from the median pivot table (uses fage and table defined earlier)
    df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))
    df['LoanAmount_log'] = np.log(df['LoanAmount'])
    df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
    df['TotalIncome_log'] = np.log(df['TotalIncome'])
    # Fill missing terms with the mean of the distinct term values
    term_list = df['Loan_Amount_Term'].value_counts().index.tolist()
    term_mean = np.mean(term_list)
    df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(term_mean)
    draw = choice([1,0], 1, p=[0.14,0.86])[0] # drawn but not used: Credit_History is filled with 1 below
    df['Credit_History'] = df['Credit_History'].fillna(1)
    df['Dependents'] = df['Dependents'].fillna(0)
    draw = choice(["Yes","No"], 1, p=[0.66,0.34])[0]
    df['Married'] = df['Married'].fillna(draw)
    draw = choice(["Male","Female"], 1, p=[0.82,0.18])[0]
    df['Gender'] = df['Gender'].fillna(draw)
    # The test set has no Loan_Status column to encode
    if is_test:
        var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area']
    else:
        var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
    le = LabelEncoder()
    for i in var_mod:
        df[i] = le.fit_transform(df[i].astype(str))

def write_predict(result, df):
    # Map the encoded predictions back to Y/N and write the submission file
    df["Loan_Status"] = ['Y' if x==1 else 'N' for x in result]
    df.to_csv('./data/result.csv',columns=['Loan_ID','Loan_Status'],index=False)
In [53]:
test_df = pd.read_csv("./data/test.csv") #Reading the dataset in a dataframe using Pandas
preprocess(test_df,True)
df = pd.read_csv("./data/train.csv") #Reading the dataset in a dataframe using Pandas
preprocess(df, False)
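One thing to keep in mind: preprocess fits a fresh LabelEncoder on each data set, so the train and test codes only agree when both sets contain the same categories. A more defensive variant, sketched below on made-up toy frames, fits the encoder on the training column once and reuses it for the test column. This is an alternative, not what the function above does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames, made up for illustration; the test frame has no "Rural" rows.
train = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})
test = pd.DataFrame({"Property_Area": ["Semiurban", "Urban"]})

# Fit on the training column only, then reuse the same mapping for the test column.
le = LabelEncoder().fit(train["Property_Area"])
train_codes = le.transform(train["Property_Area"])
test_codes = le.transform(test["Property_Area"])
print(list(train_codes), list(test_codes))
```

With a shared encoder, "Urban" gets the same code in both frames even though the test frame is missing one category.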
According to articles in this domain, the square of the number of trees should equal the number of features. In addition, there is a relation between the dataset size, which is small here, and the number of trees: enlarging the number of trees can affect how much the model improves. We also ran some empirical experiments and found that the best value for our case is 20-100 trees.
In [54]:
model = ExtraTreesClassifier(n_estimators=20)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)
In [55]:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
In [56]:
model = GradientBoostingClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)
#With different loss function
predictor_var = ['Self_Employed', 'Credit_History', 'Property_Area']
model = GradientBoostingClassifier(n_estimators=10,loss='exponential')
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)
In [57]:
model = RandomForestClassifier(n_estimators=30,min_samples_split=15)
predictor_var = ['TotalIncome_log','LoanAmount_log','Loan_Amount_Term','Credit_History','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)