Exercise 2: Data Analysis with Python

Based on this great tutorial: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

Our Task: Loan Prediction Practice Problem

From the challange hosted at: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

Dream Housing Finance company deals in all home loans. They have presence across all urban, semi urban and rural areas. Customer first apply for home loan after that company validates the customer eligibility for loan.

The company wants to automate the loan eligibility process (real time) based on customer detail provided while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have given a problem to identify the customers segments, those are eligible for loan amount so that they can specifically target these customers. Here they have provided a partial data set.

The Data

Variable	Description
Loan_ID	Unique Loan ID
Gender	Male/ Female
Married	Applicant married (Y/N)
Dependents	Number of dependents
Education	Applicant Education (Graduate/ Under Graduate)
Self_Employed	Self employed (Y/N)
ApplicantIncome	Applicant income
CoapplicantIncome	Coapplicant income
LoanAmount	Loan amount in thousands
Loan_Amount_Term	Term of loan in months
Credit_History	credit history meets guidelines
Property_Area	Urban/ Semi Urban/ Rural
Loan_Status	Loan approved (Y/N)

Evaluation Metric is accuracy i.e. percentage of loan approval you correctly predict.

You may upload the solution in the format of "sample_submission.csv"

Setups

To begin, start iPython interface in Inline Pylab mode by typing following on your terminal / windows command prompt:



In [1]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):



In [2]:

    
plot(arange(5))









    Out[2]:





[<matplotlib.lines.Line2D at 0x88a3a50>]

Following are the libraries we will use during this task:

numpy
matplotlib
pandas

Please note that you do not need to import matplotlib and numpy because of Pylab environment. I have still kept them in the code, in case you use the code in a different environment.



In [3]:

    
import pandas as pd
import numpy as np
import matplotlib as plt

After importing the library, you read the dataset using function read_csv(). The file is assumed to be downloaded from the moodle to the data folder in your working directory.



In [4]:

    
df = pd.read_csv("./data/train.csv") #Reading the dataset in a dataframe using Pandas
test_df = pd.read_csv("./data/test.csv") #Reading the dataset in a dataframe using Pandas

Let’s begin with exploration

Quick Data Exploration

Once you have read the dataset, you can have a look at few top rows by using the function head()



In [5]:

    
df.head(10)









    Out[5]:







  
    
      
      Loan_ID
      Gender
      Married
      Dependents
      Education
      Self_Employed
      ApplicantIncome
      CoapplicantIncome
      LoanAmount
      Loan_Amount_Term
      Credit_History
      Property_Area
      Loan_Status
    
  
  
    
      0
      LP001002
      Male
      No
      0
      Graduate
      No
      5849
      0.0
      NaN
      360.0
      1.0
      Urban
      Y
    
    
      1
      LP001003
      Male
      Yes
      1
      Graduate
      No
      4583
      1508.0
      128.0
      360.0
      1.0
      Rural
      N
    
    
      2
      LP001005
      Male
      Yes
      0
      Graduate
      Yes
      3000
      0.0
      66.0
      360.0
      1.0
      Urban
      Y
    
    
      3
      LP001006
      Male
      Yes
      0
      Not Graduate
      No
      2583
      2358.0
      120.0
      360.0
      1.0
      Urban
      Y
    
    
      4
      LP001008
      Male
      No
      0
      Graduate
      No
      6000
      0.0
      141.0
      360.0
      1.0
      Urban
      Y
    
    
      5
      LP001011
      Male
      Yes
      2
      Graduate
      Yes
      5417
      4196.0
      267.0
      360.0
      1.0
      Urban
      Y
    
    
      6
      LP001013
      Male
      Yes
      0
      Not Graduate
      No
      2333
      1516.0
      95.0
      360.0
      1.0
      Urban
      Y
    
    
      7
      LP001014
      Male
      Yes
      3+
      Graduate
      No
      3036
      2504.0
      158.0
      360.0
      0.0
      Semiurban
      N
    
    
      8
      LP001018
      Male
      Yes
      2
      Graduate
      No
      4006
      1526.0
      168.0
      360.0
      1.0
      Urban
      Y
    
    
      9
      LP001020
      Male
      Yes
      1
      Graduate
      No
      12841
      10968.0
      349.0
      360.0
      1.0
      Semiurban
      N

This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.

Next, you can look at summary of numerical fields by using describe() function



In [6]:

    
df.describe() # get the summary of numerical variables









    Out[6]:







  
    
      
      ApplicantIncome
      CoapplicantIncome
      LoanAmount
      Loan_Amount_Term
      Credit_History
    
  
  
    
      count
      614.000000
      614.000000
      592.000000
      600.00000
      564.000000
    
    
      mean
      5403.459283
      1621.245798
      146.412162
      342.00000
      0.842199
    
    
      std
      6109.041673
      2926.248369
      85.587325
      65.12041
      0.364878
    
    
      min
      150.000000
      0.000000
      9.000000
      12.00000
      0.000000
    
    
      25%
      2877.500000
      0.000000
      100.000000
      360.00000
      1.000000
    
    
      50%
      3812.500000
      1188.500000
      128.000000
      360.00000
      1.000000
    
    
      75%
      5795.000000
      2297.250000
      168.000000
      360.00000
      1.000000
    
    
      max
      81000.000000
      41667.000000
      700.000000
      480.00000
      1.000000

describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output

Try to learn also from test set



In [7]:

    
test_df.describe()









    Out[7]:







  
    
      
      ApplicantIncome
      CoapplicantIncome
      LoanAmount
      Loan_Amount_Term
      Credit_History
    
  
  
    
      count
      367.000000
      367.000000
      362.000000
      361.000000
      338.000000
    
    
      mean
      4805.599455
      1569.577657
      136.132597
      342.537396
      0.825444
    
    
      std
      4910.685399
      2334.232099
      61.366652
      65.156643
      0.380150
    
    
      min
      0.000000
      0.000000
      28.000000
      6.000000
      0.000000
    
    
      25%
      2864.000000
      0.000000
      100.250000
      360.000000
      1.000000
    
    
      50%
      3786.000000
      1025.000000
      125.000000
      360.000000
      1.000000
    
    
      75%
      5060.000000
      2430.500000
      158.000000
      360.000000
      1.000000
    
    
      max
      72529.000000
      24000.000000
      550.000000
      480.000000
      1.000000

as we can see, there is no significant difference between test and train, so for the numeric value we will do the same preprocessing So it is not necessary to combine train and test data for filling missing value.

For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distribution to understand whether they make sense or not. The frequency table can be printed by following command:



In [8]:

    
df['Property_Area'].value_counts()









    Out[8]:





Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64



In [9]:

    
test_df['Property_Area'].value_counts()









    Out[9]:





Urban        140
Semiurban    116
Rural        111
Name: Property_Area, dtype: int64

Distribution analysis

Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount

Lets start by plotting the histogram of ApplicantIncome using the following commands:



In [10]:

    
df['ApplicantIncome'].hist(bins=50)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x9130550>



In [11]:

    
test_df['ApplicantIncome'].hist(bins=50)









    Out[11]:





<matplotlib.axes._subplots.AxesSubplot at 0x91f5fb0>

Here we observe that there are few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.

Next, we look at box plots to understand the distributions. Box plot can be plotted by:



In [12]:

    
df.boxplot(column='ApplicantIncome')









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x93e3510>

This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:



In [13]:

    
df.boxplot(column='ApplicantIncome', by = 'Education')









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x9616df0>

We can see that there is no substantial different between the mean income of graduate and non-graduates. But there are a higher number of graduates with very high incomes, which are appearing to be the outliers.

Task 2: Distribution Analysis

Plot the histogram and boxplot of LoanAmount

Check yourself:



In [14]:

    
df['LoanAmount'].hist(bins=50)









    Out[14]:





<matplotlib.axes._subplots.AxesSubplot at 0xa4069f0>



In [15]:

    
test_df['LoanAmount'].hist(bins=50)









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x87af750>



In [16]:

    
df.boxplot(column='LoanAmount')









    Out[16]:





<matplotlib.axes._subplots.AxesSubplot at 0xa871f50>

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing and well as extreme values values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.

Categorical variable analysis

Frequency Table for Credit History:



In [17]:

    
temp1 = df['Credit_History'].value_counts(ascending=True)
temp1









    Out[17]:





0.0     89
1.0    475
Name: Credit_History, dtype: int64



In [18]:

    
temp2 = test_df['Credit_History'].value_counts(ascending=True)
temp2









    Out[18]:





0.0     59
1.0    279
Name: Credit_History, dtype: int64

Probability of getting loan for each Credit History class:



In [19]:

    
temp3 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())
temp3









    Out[19]:







  
    
      
      Loan_Status
    
    
      Credit_History
      
    
  
  
    
      0.0
      0.078652
    
    
      1.0
      0.795789

This can be plotted as a bar chart using the “matplotlib” library with following code:



In [20]:

    
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')

ax2 = fig.add_subplot(122)
temp2.plot(kind = 'bar')
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")









    Out[20]:





<matplotlib.text.Text at 0xac0d450>

This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.

Alternately, these two plots can also be visualized by combining them in a stacked chart::



In [21]:

    
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)









    Out[21]:





<matplotlib.axes._subplots.AxesSubplot at 0xa8c29b0>

We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now – given the amount of help, the library can provide you in analyzing datasets.

Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

Data Munging in Python : Using Pandas

Data munging – recap of the need

While our exploration of the data, we found a few problems in the data set, which needs to be solved before the data is ready for a good model. This exercise is typically referred as “Data Munging”. Here are the problems, we are already aware of:

There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.
While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, but should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields i.e. Gender, Property_Area, Married, Education and Dependents to see, if they contain any useful information.

Check missing values in the dataset

Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset.

This command should tell us the number of missing values in each column as isnull() returns 1, if the value is null.



In [22]:

    
df.apply(lambda x: sum(x.isnull()),axis=0)









    Out[22]:





Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64



In [23]:

    
test_df.apply(lambda x: sum(x.isnull()),axis=0)









    Out[23]:





Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

Though the missing values are not very high in number, but many variables have them and each one of these should be estimated and added in the data.

Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it makes sense or would you consider that missing? I suppose your answer is missing and you’re right. So we should check for values which are unpractical.

How to fill missing values in LoanAmount?

There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by following code:



In [24]:

    
# df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables and then use age along with other variables to predict survival.

Since, the purpose now is to bring out the steps in data munging, I’ll rather take an approach, which lies some where in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.

But first, we have to ensure that each of Self_Employed and Education variables should not have a missing values.

As we say earlier, Self_Employed has some missing values. Let’s look at the frequency table:



In [25]:

    
df['Self_Employed'].value_counts()









    Out[25]:





No     500
Yes     82
Name: Self_Employed, dtype: int64

Since ~86% values are “No”, it is safe to impute the missing values as “No” as there is a high probability of success. This can be done using the following code:



In [26]:

    
#df['Self_Employed'].fillna('No',inplace=True)

Self Employe missing value

If we replace all na with "No" the ratio will be more than 86% for "No". So we want to preserve this ratio, so radomly set 86% to No and the remaining to Yes



In [27]:

    
from numpy.random import choice
draw = choice(["Yes","No"], 1, p=[0.14,0.86])[0]
df['Self_Employed'].fillna(draw,inplace=True)

Now, we will create a Pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount:



In [28]:

    
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
table









    Out[28]:







  
    
      Education
      Graduate
      Not Graduate
    
    
      Self_Employed
      
      
    
  
  
    
      No
      130.0
      113.0
    
    
      Yes
      157.5
      130.0

Define function to return value of this pivot_table:



In [29]:

    
def fage(x):
 return table.loc[x['Self_Employed'],x['Education']]

Replace missing values:



In [30]:

    
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

This should provide you a good way to impute missing values of loan amount.

How to treat for extreme values in distribution of LoanAmount and ApplicantIncome?

Let’s analyze LoanAmount first. Since the extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to nullify their effect:



In [31]:

    
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0xace0050>

Now the distribution looks much closer to normal and effect of extreme values has been significantly subsided.

Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong support Co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.



In [32]:

    
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['LoanAmount_log'].hist(bins=20)









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0xad521d0>

Now we see that the distribution is much better than before.

Task 3: Remove nulls

Impute the missing values for Gender, Married, Dependents, Loan_Amount_Term, Credit_History
Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back his loan.

check Yourself



In [33]:

    
df['Loan_Amount_Term'].value_counts()









    Out[33]:





360.0    512
180.0     44
480.0     15
300.0     13
84.0       4
240.0      4
120.0      3
36.0       2
60.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64



In [34]:

    
df['Loan_Amount_Term'].fillna(360, inplace=True)
df['Credit_History'].fillna(1, inplace=True)



In [35]:

    
df['Dependents'].value_counts()









    Out[35]:





0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64



In [36]:

    
df['Dependents'].fillna(0, inplace=True)



In [37]:

    
df['Married'].value_counts()









    Out[37]:





Yes    398
No     213
Name: Married, dtype: int64



In [38]:

    
#df['Married'].fillna('Yes', inplace=True)
#As in the self employed
draw = choice(["Yes","No"], 1, p=[0.66,0.34])[0]
df['Married'].fillna(draw,inplace=True)



In [39]:

    
df['Gender'].value_counts()









    Out[39]:





Male      489
Female    112
Name: Gender, dtype: int64



In [40]:

    
#df['Gender'].fillna('Male', inplace=True)
#Save the ratio
draw = choice(["Male","Female"], 1, p=[0.82,0.18])[0]
df['Gender'].fillna(draw,inplace=True)

Next, we will look at making predictive models.

Building a Predictive Model in Python

After, we have made the data useful for modeling, let’s now look at the python code to create a predictive model on our data set. Skicit-Learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail.

Since, sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. This can be done using the following code:



In [41]:

    
df.dtypes









    Out[41]:





Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
dtype: object



In [42]:

    
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i].astype(str))



In [43]:

    
df.dtypes









    Out[43]:





Loan_ID               object
Gender                 int32
Married                int32
Dependents             int32
Education              int32
Self_Employed          int32
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int32
Loan_Status            int32
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
dtype: object

Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores.



In [44]:

    
#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz 
from sklearn import metrics

#Generic function for making a classification model and accessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(data.shape[0], n_folds=5)
  error = []
  for train, test in kf:
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record error from each cross-validation run
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
 
  print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

  #Fit the model again so that it can be refered outside the function:
  model.fit(data[predictors],data[outcome])









    



c:\python27\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Logistic Regression

Let’s make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting. In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well.

We can easily make some intuitive hypothesis to set the ball rolling. The chances of getting a loan will be higher for:

Applicants having a credit history (remember we observed this in exploration?)
Applicants with higher applicant and co-applicant incomes
Applicants with higher education level
Properties in urban areas with high growth perspectives

So let’s make our first model with ‘Credit_History’.



In [45]:

    
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)









    



Accuracy : 80.945%
Cross-Validation Score : 80.946%

We can try different combination of variables:



In [46]:

    
predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
#write_predict(model.predict(test_df[predictor_var]), test_df)









    



Accuracy : 80.945%
Cross-Validation Score : 80.946%

Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the mode. We have two options now:

Feature Engineering: dereive new information and try to predict those. I will leave this to your creativity.
Better modeling techniques. Let’s explore this next.

Decision Tree

Decision tree is another method for making a predictive model. It is known to provide higher accuracy than logistic regression model.



In [47]:

    
model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)









    



Accuracy : 80.945%
Cross-Validation Score : 80.946%

Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let’s try a few numerical variables:

Task 4: Modeling the Data

Train a decision tree classifier that accepts the following attributes as input: Credit_History, Loan_Amount_Term, LoanAmount_log
Is it better that the previous decision tree?

Check Yourself:



In [48]:

    
#We can try different combination of variables:
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,predictor_var,outcome_var)









    



Accuracy : 88.925%
Cross-Validation Score : 69.374%

2.Not much better according to the test score

Here we observed that although the accuracy went up on adding variables, the cross-validation error went down. This is the result of model over-fitting the data. Let’s try an even more sophisticated algorithm and see if it helps:

Random Forest

Random forest is another algorithm for solving the classification problem.

An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.



In [49]:

    
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)









    



Accuracy : 100.000%
Cross-Validation Score : 77.529%

Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

Reducing the number of predictors
Tuning the model parameters

Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.



In [50]:

    
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)









    



TotalIncome_log     0.270649
Credit_History      0.259280
LoanAmount_log      0.226463
Property_Area       0.053353
Dependents          0.053034
Loan_Amount_Term    0.046285
Married             0.027708
Education           0.023442
Gender              0.020318
Self_Employed       0.019467
dtype: float64

Let’s use the top 5 variables for creating a model. Also, we will modify the parameters of random forest model a little bit:



In [51]:

    
model = RandomForestClassifier(n_estimators=100, min_samples_split=25)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Gender','Married','Self_Employed','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
#write_predict(model.predict(test_df[predictor_var]), test_df)









    



Accuracy : 83.062%
Cross-Validation Score : 79.807%

Notice that although accuracy reduced, but the cross-validation score is improving showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.

You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:

Using a more sophisticated model does not guarantee better results.
Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting thus making your models less interpretable
Feature Engineering is the key to success. Everyone can use an Xgboost models but the real art and creativity lies in enhancing your features to better suit the model.

Be proud of yourself for getting this far! You are invited to improve your result and submit to the site to test your place in the leaderboard.

Data preprocessing

Same as above, just in a function



In [52]:

    
from numpy.random import choice
from sklearn.preprocessing import LabelEncoder
def preprocess(df, is_test):   
    draw = choice(["Yes","No"], 1, p=[0.14,0.86])[0]
    df['Self_Employed'].fillna(draw,inplace=True)
    df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
    df['LoanAmount_log'] = np.log(df['LoanAmount'])
    df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
    df['TotalIncome_log'] = np.log(df['TotalIncome'])
    term_list = df['Loan_Amount_Term'].value_counts().index.tolist()
    term_mean = np.mean(term_list)
    df['Loan_Amount_Term'].fillna(term_mean, inplace=True)
    draw = choice([1,0], 1, p=[0.14,0.86])[0]
    df['Credit_History'].fillna(1, inplace=True)
    df['Dependents'].fillna(0, inplace=True)
    draw = choice(["Yes","No"], 1, p=[0.66,0.34])[0]
    df['Married'].fillna(draw,inplace=True)
    draw = choice(["Male","Female"], 1, p=[0.82,0.18])[0]
    df['Gender'].fillna(draw,inplace=True)    
    if is_test:
        var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area']
    else:
        var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
    le = LabelEncoder()
    for i in var_mod:
        df[i] = le.fit_transform(df[i].astype(str))
        
def write_predict(result, df):    
    df["Loan_Status"] = ['Y' if x==1 else 'N' for x in result]
    df.to_csv('./data/result.csv',columns=['Loan_ID','Loan_Status'],index=False)

Test Set

read again and do the same preprocessing for train and test before build the classification model



In [53]:

    
test_df = pd.read_csv("./data/test.csv") #Reading the dataset in a dataframe using Pandas
preprocess(test_df,True)
df = pd.read_csv("./data/train.csv") #Reading the dataset in a dataframe using Pandas
preprocess(df, False)

Model classifier parameters:

n_estimators

according to articles on this domain, the squre of number of trees needs to equals number of features. In addition, there is a relation between dataset size, which is small, and the number of trees that can be enlarged and affects on the improvement of the model. Also we did some empirical experiments and we found the best value for our case is 20-100

Extra Tree

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.



In [54]:

    
model = ExtraTreesClassifier(n_estimators=20)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)









    



Accuracy : 100.000%
Cross-Validation Score : 73.291%



In [55]:

    
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)









    



TotalIncome_log    0.367410
LoanAmount_log     0.313690
Credit_History     0.290704
Property_Area      0.028196
dtype: float64

Gradient Boosting

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.



In [56]:

    
model = GradientBoostingClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)
#With different loss function
predictor_var = ['Self_Employed', 'Credit_History', 'Property_Area']
model = GradientBoostingClassifier(n_estimators=10,loss='exponential')
classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)









    



Accuracy : 88.762%
Cross-Validation Score : 77.693%
Accuracy : 80.945%
Cross-Validation Score : 80.946%



In [57]:

    
model = RandomForestClassifier(n_estimators=30,min_samples_split=15)
predictor_var = ['TotalIncome_log','LoanAmount_log','Loan_Amount_Term','Credit_History','Property_Area']

classification_model(model, df,predictor_var,outcome_var)
write_predict(model.predict(test_df[predictor_var]), test_df)









    



Accuracy : 86.319%
Cross-Validation Score : 79.150%

Submissions

Leaderboard

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area	Loan_Status
0	LP001002	Male	No	0	Graduate	No	5849	0.0	NaN	360.0	1.0	Urban	Y
1	LP001003	Male	Yes	1	Graduate	No	4583	1508.0	128.0	360.0	1.0	Rural	N
2	LP001005	Male	Yes	0	Graduate	Yes	3000	0.0	66.0	360.0	1.0	Urban	Y
3	LP001006	Male	Yes	0	Not Graduate	No	2583	2358.0	120.0	360.0	1.0	Urban	Y
4	LP001008	Male	No	0	Graduate	No	6000	0.0	141.0	360.0	1.0	Urban	Y
5	LP001011	Male	Yes	2	Graduate	Yes	5417	4196.0	267.0	360.0	1.0	Urban	Y
6	LP001013	Male	Yes	0	Not Graduate	No	2333	1516.0	95.0	360.0	1.0	Urban	Y
7	LP001014	Male	Yes	3+	Graduate	No	3036	2504.0	158.0	360.0	0.0	Semiurban	N
8	LP001018	Male	Yes	2	Graduate	No	4006	1526.0	168.0	360.0	1.0	Urban	Y
9	LP001020	Male	Yes	1	Graduate	No	12841	10968.0	349.0	360.0	1.0	Semiurban	N

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	614.000000	614.000000	592.000000	600.00000	564.000000
mean	5403.459283	1621.245798	146.412162	342.00000	0.842199
std	6109.041673	2926.248369	85.587325	65.12041	0.364878
min	150.000000	0.000000	9.000000	12.00000	0.000000
25%	2877.500000	0.000000	100.000000	360.00000	1.000000
50%	3812.500000	1188.500000	128.000000	360.00000	1.000000
75%	5795.000000	2297.250000	168.000000	360.00000	1.000000
max	81000.000000	41667.000000	700.000000	480.00000	1.000000

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	367.000000	367.000000	362.000000	361.000000	338.000000
mean	4805.599455	1569.577657	136.132597	342.537396	0.825444
std	4910.685399	2334.232099	61.366652	65.156643	0.380150
min	0.000000	0.000000	28.000000	6.000000	0.000000
25%	2864.000000	0.000000	100.250000	360.000000	1.000000
50%	3786.000000	1025.000000	125.000000	360.000000	1.000000
75%	5060.000000	2430.500000	158.000000	360.000000	1.000000
max	72529.000000	24000.000000	550.000000	480.000000	1.000000