This project uses data acquired from a contact at Madura microfinance in India. I was particularly interested in exploring this topic because I think that microfinance is one of the most important keys to economic development, particularly in India. Muhammad Yunus conceived the concept of microfinance in Bangladesh. The concept centers on a group of women offering joint collateral for small-scale loans so that they are able to run businesses. This data set includes 997 randomly selected members. I hypothesized that women with greater access to education would be more likely to repay their loans successfully. The data set records each member's repayment status as either "in arrears" or "regular". I recoded these so that a 0 indicates a member in arrears and a 1 indicates a regular member. I created several descriptive graphs with the data and ran a logistic regression to see how different factors predict repayment status.
In [274]:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
%matplotlib inline
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dnt
import statistics as stats
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [275]:
# read the data in
df = pd.read_excel("/Users/Sophie/Dropbox/Senior_Year/Honors/Research/Data/Madura_Raw_Data.xlsx")
df.head()
Out[275]:
In [276]:
df.shape
Out[276]:
In [295]:
# Give the raw columns descriptive names (one mapping per line)
df = df.rename(index=str, columns={
    "Member_Id": "Member ID",
    "Id _2": "Region ID",
    "name_2": "Region",
    "Id _3": "District ID",
    "name_3": "District",
    "Id _4": "Village ID",
    "name_4": "Village",
    "Pop_2011": "Village Population 2011",
    "Pop_2001": "Village Population 2001",
    "NoHshlds": "Number of Households in Village",
    "literacy_rate": "literacy rate in village",
    "Agri_ratio": "Percent of Business in Agriculture in Vilalge",
    "Distance to Tertiary Road (Km)": "Km between village and closest tertiary road",
    "Bank Branches 10Km": "Private Bank Branches Within 10Km of Village",
    "Age": "Age of Member",
    "Education Details of Self": "Number of Years of Education for Member",
    "Education Details of Father": "Number of Years of Education for Member's Father",
    "Default Status": "Default Status of Member Loans at Madura",
})
In [296]:
df.head()
Out[296]:
In [297]:
df_1 = df[['Member ID', 'Village Population 2011',
'literacy rate in village',
'Percent of Business in Agriculture in Vilalge',
'Km between village and closest tertiary road',
'Private Bank Branches Within 10Km of Village', 'Age of Member',
'Number of Years of Education for Member',
"Number of Years of Education for Member's Father",
'Default Status of Member Loans at Madura']]
df_1 = df_1.replace(['Arrear', 'Regular'], [0, 1])
In [298]:
df_1.head()
Out[298]:
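Because the regression depends on this 0/1 recoding, it can help to confirm that every status label was actually converted. The sketch below uses a small made-up data frame (the column name `status` and the misspelled label are hypothetical, not taken from the Madura file) and an explicit dictionary with `map`, so that any unexpected label shows up as a missing value instead of passing through unnoticed.

```python
import pandas as pd

# Hypothetical toy data standing in for the status column (not the Madura data)
toy = pd.DataFrame({"status": ["Regular", "Arrear", "Regular", "Regular", "arrear "]})

# Explicit mapping: labels missing from the dictionary become NaN
status_codes = {"Arrear": 0, "Regular": 1}
toy["status_code"] = toy["status"].map(status_codes)

# List any labels that were not recoded, then count each code (NaN included)
unmapped = toy.loc[toy["status_code"].isna(), "status"].unique()
print("labels that were not recoded:", list(unmapped))
print(toy["status_code"].value_counts(dropna=False))
```

If the list of unrecoded labels is empty, the outcome column contains only zeros and ones and is ready for the logistic regression.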
In [299]:
fig, ax = plt.subplots()
df_1['Number of Years of Education for Member'].plot(ax=ax, kind='hist',
                                                     title='Number of Members at Each Education Level')
Out[299]:
The histogram shows that most members have fewer than five years of formal schooling.
In [303]:
df_1.plot('literacy rate in village', 'Number of Years of Education for Member', kind='scatter',
title = 'Literacy and Education')
Out[303]:
This graph shows no clear pattern or clustering, which suggests that there is little to no relationship between a village's overall literacy rate and an individual member's years of education.
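A correlation coefficient is one way to put a number on what the scatterplot suggests. The sketch below is only an illustration on simulated, unrelated data (the column names `literacy` and `education_years` are hypothetical). It computes the Pearson correlation and a rank-based (Spearman) correlation, obtained here as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Simulated, independent columns for illustration (not the Madura data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "literacy": rng.uniform(40, 90, size=200),
    "education_years": rng.integers(0, 13, size=200),
})

# Pearson correlation measures linear association
pearson = toy["literacy"].corr(toy["education_years"])

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy["literacy"].rank().corr(toy["education_years"].rank())

print(f"Pearson r: {pearson:.3f}")
print(f"Spearman rho: {spearman:.3f}")
```

Values close to zero are consistent with a scatterplot that shows no pattern, although a coefficient near zero does not rule out a nonlinear relationship.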
In [304]:
df_1.plot('Private Bank Branches Within 10Km of Village', 'Km between village and closest tertiary road', kind='scatter',
title = 'Factors of Access')
Out[304]:
The majority of members have few, if any, bank branches nearby, and many data points sit at the lower end of both scales. There is seemingly no relationship, however, between the distance to the closest tertiary road and the number of bank branches near the village.
In [282]:
df_1.describe()
Out[282]:
In [283]:
# relationship between literacy rate and agricultural business
mydata = df_1[["literacy rate in village", "Percent of Business in Agriculture in Vilalge"]].dropna(how="any")
vals = mydata.values
plt.scatter(vals[:, 0], vals[:, 1])
Out[283]:
I wanted to see whether a scatterplot shows any apparent correlation between a community's emphasis on agriculture and the literacy rate of that area. There does not appear to be much of a relationship.
In [286]:
df_1.head()
Out[286]:
In [287]:
# Check whether the logistic regression will run, or whether there are object (non-numeric) columns in the data set
np.asarray(df_1)
Out[287]:
In [288]:
# Predictors are columns 1 through 8 of df_1: everything except
# 'Member ID' (column 0) and the outcome, the default status (column 9)
train_cols = df_1.columns[1:9]
logit = sm.Logit(df_1['Default Status of Member Loans at Madura'], df_1[train_cols])
In [289]:
result = logit.fit()
In [290]:
print(result.summary())
In [291]:
print(result.conf_int())
In [292]:
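# Combine the coefficients with their 95% confidence bounds,
# then exponentiate everything to express the results as odds ratios
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))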
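To make an odds ratio table like this easier to read, here is a small worked example of how a logit coefficient translates into odds and probabilities. The coefficient and the baseline probability are hypothetical numbers chosen for illustration, not results from the Madura data.

```python
import numpy as np

# Hypothetical values for illustration only (not estimates from the Madura data)
beta = 0.10        # assumed logit coefficient for one extra year of education
base_prob = 0.80   # assumed baseline probability of "regular" status

# exp(coefficient) is the odds ratio for a one-unit increase in the predictor
odds_ratio = np.exp(beta)

# Convert the baseline probability to odds, scale by the odds ratio, convert back
base_odds = base_prob / (1 - base_prob)
new_odds = base_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"probability before: {base_prob:.3f}, after: {new_prob:.3f}")
```

An odds ratio above 1 means that higher values of the predictor go with higher odds of regular status, and an odds ratio below 1 means lower odds. When the confidence interval for an odds ratio includes 1, the data are consistent with the predictor having no effect.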