Sophie Frank Data Bootcamp Research Project

Predicters of loan returns in women's microfinance

Raw Data from Madura Micorfinance Data Team Winter 2016

Spring 2017

This project uses data acquired from a contact at Madura microfinance in India. I was particularly interest in exploring this topic because I think that microfinance is one of the most important keys to economic development, particularly in India. Mohammad Yunus conceived of the concept for microfinance in Bangladesh. The concept is centered around a group of women offering joint collateral for small-scale loans so they are able to run businesses. This data set includes 997 randomly selected memebers. I hypothesized that women with greater access to education would be more likely to successfully return their loan. This data set indicated that members are either "in arrears" or have a "regular" return status. I altered these to be indicated by a zero when a member is in arrears and a 1 when the member is regular. I created several descriptive graphs with the data and ran a logistic regression to see how different factors predict return status.



In [274]:

    
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
%matplotlib inline 
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dnt
import statistics as stats
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)



In [275]:

    
# read the data in
df = pd.read_excel("/Users/Sophie/Dropbox/Senior_Year/Honors/Research/Data/Madura_Raw_Data.xlsx")

df.head()









    Out[275]:






  
    
      
      Member_Id
      Id _2
      name_2
      Id _3
      name_3
      Id _4
      name_4
      Pop_2011
      Pop_2001
      NoHshlds
      ...
      Latitude
      Longitude
      Years residing at the above address
      Age
      Count of phone no
      Current Residence Status - Own property Or Leased/Rented Property
      Household Annual income
      Education Details of Self
      Education Details of Father
      Default Status
    
  
  
    
      0
      1
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      50
      1
      2
      NaN
      2
      0
      Arrear
    
    
      1
      2
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      42
      1
      2
      NaN
      3
      0
      Arrear
    
    
      2
      3
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      2
      0
      Arrear
    
    
      3
      4
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      4
      0
      Arrear
    
    
      4
      5
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      2
      0
      Arrear
    
  

5 rows × 84 columns



In [276]:

    
df.shape









    Out[276]:





(997, 84)



In [295]:

    
df = df.rename(index=str, columns={"Member_Id": "Member ID", "Id _2": "Region ID", "name_2": "Region", 
                              "Id _3": "District ID", "name_3": 
                             "District", "Id _4": "Village ID", "name_4": "Village", 
                             "Pop_2011": "Village Population 2011", "Pop_2001": "Village Population 2001",
                             "NoHshlds": "Number of Households in Village", "literacy_rate": "literacy rate in village",
                             "Agri_ratio": "Percent of Business in Agriculture in Vilalge", 
                             "Distance to Tertiary Road (Km)": "Km between village and closest tertiary road",
                             "Bank Branches 10Km": "Private Bank Branches Within 10Km of Village",
                             "Age": "Age of Member", "Education Details of Self": "Number of Years of Education for Member",
                             "Education Details of Father": "Number of Years of Education for Member's Father",
                             "Default Status": "Default Status of Member Loans at Madura"})



In [296]:

    
df.head()









    Out[296]:






  
    
      
      Member ID
      Region ID
      Region
      District ID
      District
      Village ID
      Village
      Village Population 2011
      Village Population 2001
      Number of Households in Village
      ...
      Latitude
      Longitude
      Years residing at the above address
      Age of Member
      Count of phone no
      Current Residence Status - Own property Or Leased/Rented Property
      Household Annual income
      Number of Years of Education for Member
      Number of Years of Education for Member's Father
      Default Status of Member Loans at Madura
    
  
  
    
      0
      1
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      50
      1
      2
      NaN
      2
      0
      Arrear
    
    
      1
      2
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      42
      1
      2
      NaN
      3
      0
      Arrear
    
    
      2
      3
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      2
      0
      Arrear
    
    
      3
      4
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      4
      0
      Arrear
    
    
      4
      5
      608
      Salem
      5743
      Attur
      634290
      Sadasivapuram
      3430
      3108.0
      912
      ...
      11.57598
      78.67694
      3.0
      26
      1
      2
      NaN
      2
      0
      Arrear
    
  

5 rows × 84 columns



In [297]:

    
df_1 = df[['Member ID', 'Village Population 2011',
                             'literacy rate in village',
                             'Percent of Business in Agriculture in Vilalge', 
                             'Km between village and closest tertiary road',
                             'Private Bank Branches Within 10Km of Village', 'Age of Member', 
                             'Number of Years of Education for Member',
                             "Number of Years of Education for Member's Father",
                             'Default Status of Member Loans at Madura']]
df_1 = df_1.replace(['Arrear', 'Regular'], [0, 1])



In [298]:

    
df_1.head()









    Out[298]:






  
    
      
      Member ID
      Village Population 2011
      literacy rate in village
      Percent of Business in Agriculture in Vilalge
      Km between village and closest tertiary road
      Private Bank Branches Within 10Km of Village
      Age of Member
      Number of Years of Education for Member
      Number of Years of Education for Member's Father
      Default Status of Member Loans at Madura
    
  
  
    
      0
      1
      3430
      74
      0.81
      7.05
      23
      50
      2
      0
      0
    
    
      1
      2
      3430
      74
      0.81
      7.05
      23
      42
      3
      0
      0
    
    
      2
      3
      3430
      74
      0.81
      7.05
      23
      26
      2
      0
      0
    
    
      3
      4
      3430
      74
      0.81
      7.05
      23
      26
      4
      0
      0
    
    
      4
      5
      3430
      74
      0.81
      7.05
      23
      26
      2
      0
      0



In [299]:

    
fig, ax = plt.subplots()
df_1['Number of Years of Education for Member'].plot(ax=ax, kind='hist', title = 'Number of Members per each Education Level')









    Out[299]:





<matplotlib.axes._subplots.AxesSubplot at 0x129184828>

It is easy to observe that most members have less than five years of formal schooling.



In [303]:

    
df_1.plot('literacy rate in village', 'Number of Years of Education for Member', kind='scatter', 
                                                     title = 'Literacy and Education')









    Out[303]:





<matplotlib.axes._subplots.AxesSubplot at 0x11cb1c5f8>

This graph does not offer any useful information because there are no true cluster patterns, indicating there is little to no relationship between the overall literacy rate of the village and the member's education.



In [304]:

    
df_1.plot('Private Bank Branches Within 10Km of Village', 'Km between village and closest tertiary road', kind='scatter', 
                                                     title = 'Factors of Access')









    Out[304]:





<matplotlib.axes._subplots.AxesSubplot at 0x129456c18>

The majority of members have few, if any banks near by, but there is seemingly no relationship between the ditance to the closest tertiary road and the number of bank branches near the village. There seem to be many data points on the lower end of each scale, however.



In [282]:

    
df_1.describe()









    Out[282]:






  
    
      
      Member ID
      Village Population 2011
      literacy rate in village
      Percent of Business in Agriculture in Vilalge
      Km between village and closest tertiary road
      Private Bank Branches Within 10Km of Village
      Age of Member
      Number of Years of Education for Member
      Number of Years of Education for Member's Father
      Default Status of Member Loans at Madura
    
  
  
    
      count
      997.000000
      9.970000e+02
      997.000000
      997.000000
      997.000000
      997.000000
      997.000000
      997.000000
      997.000000
      997.000000
    
    
      mean
      499.000000
      7.582116e+04
      80.064193
      0.352728
      2.366098
      37.278837
      36.795386
      3.123370
      0.543631
      0.936810
    
    
      std
      287.953411
      1.988713e+05
      8.601502
      0.283688
      2.628567
      61.188369
      7.878486
      1.694494
      1.093680
      0.243426
    
    
      min
      1.000000
      1.240000e+02
      58.000000
      0.010000
      0.010000
      2.000000
      20.000000
      1.000000
      0.000000
      0.000000
    
    
      25%
      250.000000
      3.549000e+03
      73.000000
      0.080000
      0.280000
      7.000000
      30.000000
      2.000000
      0.000000
      1.000000
    
    
      50%
      499.000000
      8.774000e+03
      81.000000
      0.330000
      1.390000
      13.000000
      36.000000
      3.000000
      0.000000
      1.000000
    
    
      75%
      748.000000
      2.546900e+04
      88.000000
      0.610000
      3.740000
      31.000000
      42.000000
      4.000000
      1.000000
      1.000000
    
    
      max
      997.000000
      1.017865e+06
      94.000000
      0.980000
      11.740000
      260.000000
      64.000000
      8.000000
      7.000000
      1.000000



In [283]:

    
# relationship between literacy rate and agricultural business

mydata = df_1[["literacy rate in village", "Percent of Business in Agriculture in Vilalge"]].dropna(how="any")
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])









    Out[283]:





<matplotlib.collections.PathCollection at 0x1292a4d68>

I wanted to see if there is any apparent correlation in a scatterplot between the emphasis on agriculture in a community and the literacy rate of that area. There does not appear to be much of a relationship.



In [286]:

    
df_1.head()









    Out[286]:






  
    
      
      Member ID
      Village Population 2011
      literacy rate in village
      Percent of Business in Agriculture in Vilalge
      Km between village and closest tertiary road
      Private Bank Branches Within 10Km of Village
      Age of Member
      Number of Years of Education for Member
      Number of Years of Education for Member's Father
      Default Status of Member Loans at Madura
    
  
  
    
      0
      1
      3430
      74
      0.81
      7.05
      23
      50
      2
      0
      0
    
    
      1
      2
      3430
      74
      0.81
      7.05
      23
      42
      3
      0
      0
    
    
      2
      3
      3430
      74
      0.81
      7.05
      23
      26
      2
      0
      0
    
    
      3
      4
      3430
      74
      0.81
      7.05
      23
      26
      4
      0
      0
    
    
      4
      5
      3430
      74
      0.81
      7.05
      23
      26
      2
      0
      0



In [287]:

    
np.asarray(df_1)

#Checking if the logisitc regression will run, or if there are objects in the data set









    Out[287]:





array([[  1.00000000e+00,   3.43000000e+03,   7.40000000e+01, ...,
          2.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.00000000e+00,   3.43000000e+03,   7.40000000e+01, ...,
          3.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  3.00000000e+00,   3.43000000e+03,   7.40000000e+01, ...,
          2.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  9.95000000e+02,   1.47900000e+03,   8.20000000e+01, ...,
          4.00000000e+00,   0.00000000e+00,   1.00000000e+00],
       [  9.96000000e+02,   2.68300000e+04,   8.00000000e+01, ...,
          3.00000000e+00,   0.00000000e+00,   1.00000000e+00],
       [  9.97000000e+02,   2.44500000e+03,   7.20000000e+01, ...,
          4.00000000e+00,   0.00000000e+00,   0.00000000e+00]])



In [288]:

    
train_cols = df_1.columns[1:9]

# Index(['Member ID', 'Region','District', 'Village'], dtype=str)

logit = sm.Logit(df_1['Default Status of Member Loans at Madura'], df_1[train_cols])



In [289]:

    
result = logit.fit()









    



Optimization terminated successfully.
         Current function value: 0.213775
         Iterations 10



In [290]:

    
print(result.summary())









    



                                      Logit Regression Results                                      
====================================================================================================
Dep. Variable:     Default Status of Member Loans at Madura   No. Observations:                  997
Model:                                                Logit   Df Residuals:                      989
Method:                                                 MLE   Df Model:                            7
Date:                                      Wed, 26 Apr 2017   Pseudo R-squ.:                 0.09285
Time:                                              20:12:37   Log-Likelihood:                -213.13
converged:                                             True   LL-Null:                       -234.95
                                                              LLR p-value:                 2.521e-07
====================================================================================================================
                                                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------------------------------
Village Population 2011                           3.124e-06   4.55e-06      0.687      0.492     -5.79e-06   1.2e-05
literacy rate in village                             0.0283      0.009      3.023      0.003         0.010     0.047
Percent of Business in Agriculture in Vilalge       -0.8310      0.517     -1.609      0.108        -1.843     0.181
Km between village and closest tertiary road        -0.0695      0.046     -1.509      0.131        -0.160     0.021
Private Bank Branches Within 10Km of Village         0.0097      0.009      1.136      0.256        -0.007     0.026
Age of Member                                        0.0034      0.017      0.201      0.841        -0.029     0.036
Number of Years of Education for Member              0.2238      0.095      2.351      0.019         0.037     0.410
Number of Years of Education for Member's Father     0.1155      0.177      0.653      0.513        -0.231     0.462
====================================================================================================================



In [291]:

    
print(result.conf_int())









    



                                                         0         1
Village Population 2011                          -0.000006  0.000012
literacy rate in village                          0.009966  0.046723
Percent of Business in Agriculture in Vilalge    -1.843427  0.181486
Km between village and closest tertiary road     -0.159890  0.020793
Private Bank Branches Within 10Km of Village     -0.007038  0.026434
Age of Member                                    -0.029429  0.036160
Number of Years of Education for Member           0.037220  0.410470
Number of Years of Education for Member's Father -0.230852  0.461759



In [292]:

    
params = result.params
  conf = result.conf_int()
  conf['OR'] = params
  conf.columns = ['2.5%', '97.5%', 'OR']
  print(np.exp(conf))









    



                                                      2.5%     97.5%        OR
Village Population 2011                           0.999994  1.000012  1.000003
literacy rate in village                          1.010016  1.047832  1.028750
Percent of Business in Agriculture in Vilalge     0.158274  1.198998  0.435626
Km between village and closest tertiary road      0.852237  1.021011  0.932815
Private Bank Branches Within 10Km of Village      0.992987  1.026787  1.009746
Age of Member                                     0.970999  1.036822  1.003371
Number of Years of Education for Member           1.037922  1.507526  1.250877
Number of Years of Education for Member's Father  0.793857  1.586864  1.122383

Conclusions

The only variables with meaningful p values are the literacy rate of the member's village and the number of years they have been educated. This draws a fascinating conclusion about education having a stronger effect than other factors on a member's likelihood or repaying the loan.

	Member_Id	Id _2	name_2	Id _3	name_3	Id _4	name_4	Pop_2011	Pop_2001	NoHshlds	...	Latitude	Longitude	Years residing at the above address	Age	Count of phone no	Current Residence Status - Own property Or Leased/Rented Property	Household Annual income	Education Details of Self	Default Status
0	1	608	Salem	5743	Attur	634290	Sadasivapuram	3430	3108.0	912	...	11.57598	78.67694	3.0	50	1	2	NaN	2	Arrear
1	2	608	Salem	5743	Attur	634290	Sadasivapuram	3430	3108.0	912	...	11.57598	78.67694	3.0	42	1	2	NaN	3	Arrear
2	3	608	Salem	5743	Attur	634290	Sadasivapuram	3430	3108.0	912	...	11.57598	78.67694	3.0	26	1	2	NaN	2	Arrear
3	4	608	Salem	5743	Attur	634290	Sadasivapuram	3430	3108.0	912	...	11.57598	78.67694	3.0	26	1	2	NaN	4	Arrear
4	5	608	Salem	5743	Attur	634290	Sadasivapuram	3430	3108.0	912	...	11.57598	78.67694	3.0	26	1	2	NaN	2	Arrear

	Member ID	Village Population 2011	literacy rate in village	Percent of Business in Agriculture in Vilalge	Km between village and closest tertiary road	Private Bank Branches Within 10Km of Village	Age of Member	Number of Years of Education for Member
0	1	3430	74	0.81	7.05	23	50	2
1	2	3430	74	0.81	7.05	23	42	3
2	3	3430	74	0.81	7.05	23	26	2
3	4	3430	74	0.81	7.05	23	26	4
4	5	3430	74	0.81	7.05	23	26	2

	Member ID	Village Population 2011	literacy rate in village	Percent of Business in Agriculture in Vilalge	Km between village and closest tertiary road	Private Bank Branches Within 10Km of Village	Age of Member	Number of Years of Education for Member	Number of Years of Education for Member's Father	Default Status of Member Loans at Madura
count	997.000000	9.970000e+02	997.000000	997.000000	997.000000	997.000000	997.000000	997.000000	997.000000	997.000000
mean	499.000000	7.582116e+04	80.064193	0.352728	2.366098	37.278837	36.795386	3.123370	0.543631	0.936810
std	287.953411	1.988713e+05	8.601502	0.283688	2.628567	61.188369	7.878486	1.694494	1.093680	0.243426
min	1.000000	1.240000e+02	58.000000	0.010000	0.010000	2.000000	20.000000	1.000000	0.000000	0.000000
25%	250.000000	3.549000e+03	73.000000	0.080000	0.280000	7.000000	30.000000	2.000000	0.000000	1.000000
50%	499.000000	8.774000e+03	81.000000	0.330000	1.390000	13.000000	36.000000	3.000000	0.000000	1.000000
75%	748.000000	2.546900e+04	88.000000	0.610000	3.740000	31.000000	42.000000	4.000000	1.000000	1.000000
max	997.000000	1.017865e+06	94.000000	0.980000	11.740000	260.000000	64.000000	8.000000	7.000000	1.000000