The Small Business Administration (SBA) is a U.S. government agency that has delivered millions of loans, loan guarantees, contracts, counseling sessions and other forms of assistance to small businesses since 1953.
SBA offers a variety of loan programs for very specific purposes; the most common is the general small business loan program, known as the 7(a) program. SBA does NOT provide grants for starting and expanding a business - it provides guarantees on loans that are distributed by lending institutions, thus reducing the risk to those institutions.
7(a) loans have a maximum loan amount of \$5 million. The average 7(a) loan amount in fiscal year 2015 was \$371,628. The actual interest rate on a 7(a) loan guaranteed by the SBA is negotiated between the applicant and the lender, subject to SBA maximums (a base rate plus an allowable spread), and is not available in the data analyzed here.
SBA can guarantee as much as 85 percent on loans of up to \$150,000 and 75 percent on loans of more than \$150,000. SBA’s maximum exposure amount is \$3,750,000.
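As a quick sketch (a hypothetical helper, not part of the SBA data or this analysis), these guarantee rules can be expressed as:

```python
def sba_guaranteed_amount(loan_amount):
    """Approximate SBA 7(a) guarantee: 85% of loans up to $150,000,
    75% of larger loans, capped at the $3,750,000 maximum exposure."""
    rate = 0.85 if loan_amount <= 150_000 else 0.75
    return min(rate * loan_amount, 3_750_000)

print(sba_guaranteed_amount(100_000))    # 85000.0
print(sba_guaranteed_amount(5_000_000))  # 3750000.0 (capped)
```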
Small businesses play a huge role in the U.S. economy - accounting for more than 50% of all active jobs, and contributing more than 60% of all new jobs since 1995.
SBA provides assistance to small businesses to help them start new operations, expand existing ones, or recover after natural disasters. However, as will be shown below, a significant portion of SBA-backed loans default. This prompts the following questions:
As a part of the "Open Government" policy, SBA has released a loan-level list of all 7(a) loans approved since October 1990 (Fiscal Year 1991 - present). These are released as multiple Excel files. We will first load these using Pandas and then save them for easier/quicker access later.
In [1]:
# Import necessary Python packages
# Data analysis tools
import numpy as np
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
# Plotting tools and figure display options
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import graphviz
sns.set(context = 'poster', style = 'white', font_scale = 1.3)
In [2]:
# Load and concatenate the data (3 parts)
#loans_7a_part1 = pd.read_excel('FOIA - 7(a)(FY1991-FY1999).xlsx', skiprows = 1)
#loans_7a_part2 = pd.read_excel('FOIA - 7(a)(FY2000-FY2009).xlsx', skiprows = 1)
#loans_7a_part3 = pd.read_excel('FOIA - 7(a)(FY2010-Present).xlsx', skiprows = 1)
#loans_7a = pd.concat([loans_7a_part1, loans_7a_part2, loans_7a_part3])
In [3]:
# Save for quicker access later
#loans_7a.to_pickle('loans_7a')
In [4]:
# Read in the raw data
loans_7a = pd.read_pickle('loans_7a')
In [5]:
# Examine the data
loans_7a.columns
Out[5]:
In [6]:
loans_7a.describe()
Out[6]:
In [7]:
# See how many values are missing in each column
loans_7a.isnull().sum()
Out[7]:
We can see that the majority of loans are missing the interest rate; NaicsCode is missing for 202,825 loans.
In [8]:
# Examine the first 5 rows
loans_7a.head(5)
Out[8]:
Our target variable could be Loan Status (for binary classification) or Charged Off Amount (for continuous prediction/regression).
We may choose to limit the data to only:
In [9]:
# See how many loans there are in the data set per each status
loans_7a['LoanStatus'].value_counts()
sns.countplot(x = 'LoanStatus', data = loans_7a)
Out[9]:
It is important to keep in mind that the countplot displayed above includes loans that were paid in full before they were due, while some loans awarded at the same time may still eventually default.
Potentially useful features:
In [10]:
# Remove irrelevant columns
loans_7a.drop(['Program', 'BorrName', 'BorrStreet', 'BorrCity', 'BorrZip', 'BankStreet', 'BankCity', 'BankState',
'BankZip', 'NaicsDescription', 'FranchiseName', 'ProjectCounty', 'ProjectState',
'SBADistrictOffice', 'CongressionalDistrict', 'BankName', 'DeliveryMethod', 'subpgmdesc'], axis = 1, inplace = True)
In [11]:
loans_7a.columns
Out[11]:
In [12]:
# Since initial interest rate is missing in 990758 rows, remove it as well
loans_7a.drop('InitialInterestRate', axis = 1, inplace = True)
Let's check the correlation between the total approval amount and the guaranteed amount.
In [13]:
loans_7a['GrossApproval'].corr(loans_7a['SBAGuaranteedApproval'])
Out[13]:
Since these two quantities are highly correlated, we will drop the gross approval amount.
In [14]:
loans_7a.drop('GrossApproval', axis = 1, inplace = True)
In [15]:
loans_7a['BusinessType'].value_counts()
sns.countplot(x = 'BusinessType', data = loans_7a)
Out[15]:
Most businesses receiving SBA loans are corporations, but a large portion are individuals, too.
Let's examine how many jobs these businesses support.
In [16]:
loans_7a['JobsSupported'].value_counts().head()
Out[16]:
In [17]:
loans_7a['JobsSupported'].value_counts()[0]/len(loans_7a)
Out[17]:
About 36% of businesses are non-employers! Jobs Supported = "Total Jobs Created + Jobs Retained as reported by lender on SBA Loan Application. SBA does not review, audit, or validate these numbers - they are simply self-reported, good faith estimates by the lender."
In [18]:
loans_7a['FranchiseCode'].value_counts().head(10)
Out[18]:
Franchise code is 0 or 1 for non-franchises and a 5-digit code otherwise!
In [19]:
# Percent of franchises
(loans_7a['FranchiseCode'] > 1).sum()/len(loans_7a)*100
Out[19]:
Only 6.4% of approved loans were given to franchises!
In [20]:
# Turn franchise code into a binary variable
def binarizeFranCode(code):
    if code > 1:
        return 1
    else:
        return 0
loans_7a['FranchiseCodeBin'] = loans_7a['FranchiseCode'].apply(binarizeFranCode)
print(loans_7a['FranchiseCodeBin'].value_counts())
loans_7a.drop('FranchiseCode', axis = 1, inplace = True)
The North American Industry Classification System (NAICS) classifies business establishments based on the activities in which they are primarily engaged.
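NAICS codes are hierarchical: the first two digits of the six-digit code identify the broad industry sector. A minimal illustration (the codes shown are standard NAICS examples, not drawn from the loan data):

```python
def naics_sector(code):
    """Return the 2-digit sector prefix of a 6-digit NAICS code."""
    return int(code) // 10_000

print(naics_sector(722110))  # 72 (Accommodation & Food Services)
print(naics_sector(541511))  # 54 (Professional, Scientific, & Technical Services)
```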
In [21]:
loans_7a['NaicsCode'].nunique()
Out[21]:
There are more than 1300 different NAICS values in this dataset!
In [22]:
loans_7a['NaicsCode'].isnull().sum()
Out[22]:
In [23]:
# Remove rows with missing NAICS Codes
loans_7a.dropna(subset = ['NaicsCode'], inplace = True)
In [24]:
loans_7a.reset_index(drop = True, inplace = True)
In [25]:
# Replace CHGOFF with DEFAULT for simplicity
def repCHGOFF_DEFAULT(status):
    if status == 'CHGOFF':
        return 'DEFAULT'
    else:
        return status
loans_7a['LoanStatus'] = loans_7a['LoanStatus'].apply(repCHGOFF_DEFAULT)
As an indicator of the state of the economy at the time a loan is awarded, we will use the S&P 1500 composite index. Source: http://www.econ.yale.edu/~shiller/data.htm
In [26]:
# Load monthly consumer price index and composite S&P index data
cpi_sp = pd.read_excel('ie_data.xls', skiprows = 7)
# Use only values starting in 1990
cpi_sp = cpi_sp[cpi_sp['Date']>1989]
In [27]:
# Keep only the relevant columns: date (year+month), S&P 1500 index and Consumer Price Index (CPI)
cpi_sp = cpi_sp[['Date', 'P', 'CPI']]
In [28]:
# Rename the columns
cpi_sp.columns = ['Date', 'SP', 'CPI']
cpi_sp.head()
Out[28]:
In [29]:
# Change the date format from year.month to datetime (and handle .10 getting shortened to .1)
def split_date(x):
    tmp = str(x).split('.')
    if tmp[1] == '1':
        tmp[1] = '10'
    return tmp
cpi_sp['Date_dt'] = cpi_sp['Date'].apply(split_date)
s = '-'
cpi_sp['Date_dt'] = cpi_sp['Date_dt'].apply(lambda x: s.join(x))
cpi_sp['Date_dt'] = cpi_sp['Date_dt'].apply(lambda x: x + '-01')
cpi_sp.reset_index(drop = True, inplace = True)
cpi_sp.head(10)
Out[29]:
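A quick illustration of the float-truncation issue handled in the cell above: the source spreadsheet stores dates as year.month floats, and October (e.g. 1990.10) loses its trailing zero when converted to a string, while January (1990.01) keeps its zero, so a bare '1' after the decimal point can only mean October:

```python
print(str(1990.10))             # '1990.1'
print(str(1990.10).split('.'))  # ['1990', '1']
print(str(1990.01).split('.'))  # ['1990', '01']
```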
In [30]:
cpi_sp.drop('Date', axis = 1, inplace = True)
# Rename the columns
cpi_sp.columns = ['SP', 'CPI', 'Date']
In [31]:
# Adjust S&P based on CPI, relative to January 2016 US Dollar value
cpi2016 = float(cpi_sp[cpi_sp['Date'] == '2016-01-01']['CPI'])
cpi_sp['CPI_to2016'] = cpi2016/cpi_sp['CPI']
cpi_sp['SP_to2016'] = cpi_sp['SP'].multiply(cpi_sp['CPI_to2016'])
cpi_sp.drop(['CPI', 'SP'], axis = 1, inplace = True)
cpi_sp.reset_index(inplace = True, drop = True)
cpi_sp.head()
Out[31]:
In [32]:
# Plot monthly S&P (adjusted to 2016 dollars)
fig = plt.figure(figsize = (10,5))
axes = fig.add_subplot(1, 1, 1, facecolor ='gainsboro')
axes.plot(pd.to_datetime(cpi_sp['Date']), cpi_sp['SP_to2016'], c = 'k', linewidth = 3)
plt.xlabel('Year', fontsize = 20)
plt.ylabel('S&P 1500', fontsize = 20)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.tight_layout()
plt.show()
In [33]:
fig.savefig('1_SP1500.png', dpi = 300)
In [34]:
# Rename columns so the CPI/S&P data can be merged on each loan's approval date
cpi_sp.columns = ['ApprovalDate', 'CPI_to2016', 'SP_to2016']
# Merge loan data with CPI + S&P data
# Keep only the date and set it to first of the month in order to use CPI and SP data
loans_7a['ApprovalDate'] = loans_7a['ApprovalDate'].apply(lambda x:x.date())
loans_7a['ApprovalDate'] = loans_7a['ApprovalDate'].apply(lambda x: x.replace(day = 1))
loans_7a['ApprovalDate'] = pd.to_datetime(loans_7a['ApprovalDate'])
cpi_sp['ApprovalDate'] = pd.to_datetime(cpi_sp['ApprovalDate'])
We may be tempted to incorporate changes in the S&P index during the loan pay-off period into the model; however, those future values would not be known at the time a loan is approved, so using them would leak information that is unavailable at prediction time.
In [35]:
loans_7a = pd.merge(loans_7a, cpi_sp, on = 'ApprovalDate', how = 'left')
In [36]:
loans_7a.head()
Out[36]:
In [37]:
# Check for NaNs
loans_7a['SP_to2016'].isnull().sum()
Out[37]:
In [38]:
# Adjust all dollar amounts to 2016 US Dollar value
loans_7a['SBAGuaranteedApprovalAdj'] = loans_7a['SBAGuaranteedApproval'].multiply(loans_7a['CPI_to2016']).astype('float64')
loans_7a['GrossChargeOffAmountAdj'] = loans_7a['GrossChargeOffAmount'].multiply(loans_7a['CPI_to2016']).astype('float64')
In [39]:
# Get a sense of number of each Loan Status per approval year
clrs = ["#4878CF", "#6ACC65", "#D65F5F", "#B47CC7", "#C4AD66", "#77BEDB"]
newPal = dict(PIF = clrs[0], DEFAULT = clrs[2], CANCLD = clrs[1], COMMIT = clrs[3], EXEMPT = clrs[4])
g = sns.factorplot(x ='ApprovalFiscalYear', y = 'SBAGuaranteedApprovalAdj',
hue = 'LoanStatus', hue_order = ['PIF', 'DEFAULT', 'CANCLD', 'COMMIT', 'EXEMPT'],
kind = 'point', data = loans_7a, estimator = np.count_nonzero,
ci = None, size = 5, aspect = 3, palette = newPal, legend = False)
g.set_xticklabels(rotation = 30)
plt.title('7A Loans')
plt.xlabel('Approval Fiscal Year')
plt.ylabel('Number of Loans')
plt.legend(['Paid in Full', 'Defaulted', 'Cancelled', 'Committed', 'Exempt'], loc = 'best', fontsize = 20)
Out[39]:
In [40]:
g.savefig('2_AllLoans_Counts_FY.png', dpi = 300)
We see a clear spike in defaults for loans approved immediately before and during the U.S. financial crisis. This indicates that macroeconomic factors may be important.
In [41]:
# Group data by approval fiscal year
loans_7a_by_FY = loans_7a.groupby('ApprovalFiscalYear')
In [42]:
# Get a sense of the number of each Loan Status per approval year
g = sns.factorplot(x ='ApprovalFiscalYear', y = 'SBAGuaranteedApprovalAdj',
hue = 'LoanStatus', hue_order = ['PIF', 'DEFAULT', 'CANCLD', 'COMMIT', 'EXEMPT'],
kind = 'bar', data = loans_7a, estimator = np.count_nonzero,
ci = None, size = 5, aspect = 3, palette = newPal, legend = False)
g.set_xticklabels(rotation = 30)
plt.title('7A Loans')
plt.xlabel('Approval Fiscal Year')
plt.ylabel('Number of Loans')
plt.legend(['Paid in Full', 'Defaulted', 'Cancelled', 'Committed', 'Exempt'], loc = 'best', fontsize = 20)
Out[42]:
The plot below shows the actual amounts paid in full and charged off each year.
In [43]:
# Group by year and loan status
loans_7a_by_FY_by_LoanStatus = loans_7a.groupby(by = ['ApprovalFiscalYear', 'LoanStatus'], as_index = False).sum()
loans_7a_by_FY_by_LoanStatus.head()
Out[43]:
In [44]:
# Plot amount paid in full and amount charged off per approval year
fig = plt.figure(figsize = (16, 6))
ax = fig.add_subplot(1, 1, 1, facecolor ='gainsboro')
n_groups = loans_7a['ApprovalFiscalYear'].nunique()
sum_pif = loans_7a_by_FY_by_LoanStatus[loans_7a_by_FY_by_LoanStatus['LoanStatus']=='PIF']['SBAGuaranteedApprovalAdj']
sum_choff = loans_7a_by_FY_by_LoanStatus[loans_7a_by_FY_by_LoanStatus['LoanStatus']== 'DEFAULT']['GrossChargeOffAmountAdj']
x = np.arange(1991, 2017)
y = np.array(sum_pif)[:-1]
z = np.array(sum_choff)
ax.bar(x-0.4, y,width=0.5,color='b',align='center')
ax.bar(x, z,width=0.5,color='r',align='center')
plt.xlabel('Approval Fiscal Year', fontsize = 20)
plt.xticks(loans_7a['ApprovalFiscalYear'].unique(), rotation = 30)
plt.ylabel('2016 US Dollars', fontsize = 20)
plt.title('7A Loans', fontsize = 20)
plt.legend(labels = ['Paid in Full', 'Defaulted'], fontsize = 20)
plt.tight_layout()
plt.show()
In [45]:
fig.savefig('3_Completed_per_FY.png', dpi = 300)
The plot above may be too optimistic since it accounts for loans that were paid in full before they were due, and does not account for loans that have been awarded but may default in the future.
In [46]:
# Compute the maturity date and keep only loans whose maturity date has passed
# (the timedelta64[M] cast treats a month as its average length, so maturity dates are approximate)
loans_7a['MaturityDate'] = loans_7a['ApprovalDate'] + loans_7a['TermInMonths'].values.astype("timedelta64[M]")
loans_7a['MaturityYear'] = loans_7a['MaturityDate'].dt.year
loans_7a_matured = loans_7a[loans_7a['MaturityDate'] < '2017-07-01']
loans_7a_matured.reset_index(inplace = True, drop = True)
In [47]:
loans_7a_matured['MaturityYear'].isnull().sum()
Out[47]:
In [48]:
loans_7a_matured.info()
In [49]:
# Plot amount paid in full and amount charged off per approval year when accounting for matured loans only
n_groups = loans_7a_matured['ApprovalFiscalYear'].nunique()
loans_7a_by_AY_by_LoanStatus = loans_7a_matured.groupby(by = ['ApprovalFiscalYear', 'LoanStatus'], as_index = False).sum()
sum_pif = loans_7a_by_AY_by_LoanStatus[loans_7a_by_AY_by_LoanStatus['LoanStatus']=='PIF']['SBAGuaranteedApprovalAdj']
sum_def = loans_7a_by_AY_by_LoanStatus[loans_7a_by_AY_by_LoanStatus['LoanStatus']=='DEFAULT']['GrossChargeOffAmountAdj']
In [50]:
fig = plt.figure(figsize = (16, 6))
ax = fig.add_subplot(1, 1, 1, facecolor = "gainsboro")
sns.set_context('poster', font_scale = 1.2)
x = np.arange(1991, 2017)
y = np.array(sum_pif)[:-1]
z = np.array(sum_def)
ax.bar(x-0.4, y,width=0.5,color='b',align='center')
ax.bar(x, z,width=0.5,color='r',align='center')
plt.xlabel('Approval Fiscal Year')
plt.xticks(loans_7a_matured['ApprovalFiscalYear'].unique(), rotation = 30)
plt.ylabel('2016 US Dollars', fontsize = 20)
plt.title('SBA 7A Loans (Matured)', fontsize = 20)
plt.legend(labels = ['Paid in Full', 'Defaulted'], fontsize = 20)
plt.tight_layout()
plt.show()
In [51]:
fig.savefig('4_Matured_per_FY.png', dpi = 300)
In [52]:
loans_7a_matured['MaturityYear'].unique()
Out[52]:
In [53]:
loans_7a_matured['NaicsCode'].nunique()
Out[53]:
In [54]:
# Missing NAICS codes?
loans_7a_matured['NaicsCode'].isnull().sum()/len(loans_7a_matured)
Out[54]:
In [55]:
code_groups = {'Agriculture': 11, 'Mining' : 21, 'Utilities' : 22,
'Construction' : 23, 'Manufacturing' : [31, 32, 33], 'Wholesale Trade' : 42,
'Retail Trade' : [44, 45], 'Transportation & Warehousing' : [48, 49],
'Information' : 51, 'Finance & Insurance' : 52,
'Real Estate Rental & Leasing' : 53,
'Professional, Scientific, & Technical Services' : 54,
'Management of Companies & Enterprises': 55,
'Administrative/Support/Waste Management' : 56,
'Educational Services' : 61,
'Health Care & Social Assistance' : 62,
'Arts, Entertainment, & Recreation' : 71,
'Accommodation & Food Services' : 72,
'Other Services' : 81,
'Public Administration' : 92}
In [56]:
# Get the "Sector" variable from NAICS codes
# Only the first two digits of NAICS code are relevant to group into 20 sectors
loans_7a_matured['Sector'] = (loans_7a_matured['NaicsCode'].apply(lambda x:x/10000))
loans_7a_matured['Sector'].fillna(0, inplace = True)
loans_7a_matured['Sector'] = (loans_7a_matured['Sector'].apply(int))
# Build a lookup table mapping each 2-digit code to its sector name
list_codes = []
list_cat = []
for cat, codes in code_groups.items():
    if isinstance(codes, int):
        codes = [codes]
    for code in codes:
        list_codes.append(code)
        list_cat.append(cat)
code_cat = pd.DataFrame(list_codes, columns = ['Code'])
code_cat['Sector'] = list_cat
# Rename the 2-digit prefix column to 'Code' so it can be merged with the lookup table
loans_7a_matured.rename(columns = {'Sector': 'Code'}, inplace = True)
loans_7a_matured = pd.merge(loans_7a_matured, code_cat, on = 'Code', how = 'left')
loans_7a_matured.head()
Out[56]:
In [57]:
loans_7a_matured['Sector'].isnull().sum()
Out[57]:
In [58]:
loans_7a_matured.dropna(subset = ['Sector'], inplace = True)
In [59]:
loans_7a_matured.reset_index(inplace = True, drop=True)
In [60]:
# Keep only paid in full or defaulted loans
loans_7a_matured = loans_7a_matured[(loans_7a_matured['LoanStatus']=='PIF') |
(loans_7a_matured['LoanStatus']=='DEFAULT')]
In [61]:
#Group matured loans by industry sector
by_sector = loans_7a_matured.groupby('Sector').sum()
by_sector.head()
Out[61]:
In [62]:
# Look at total guaranteed and total charged off amount per sector
fig = plt.figure(figsize=(16,6))
axes = fig.add_subplot(1, 1, 1, facecolor = 'gainsboro')
axes.plot(by_sector['SBAGuaranteedApprovalAdj'], by_sector['GrossChargeOffAmountAdj'], 'ro')
plt.xlabel('Total SBA Guaranteed Amount per Sector \n(2016 US Dollars)')
plt.ylabel('Total Charged Off Amount per Sector \n(2016 US Dollars)')
plt.tight_layout()
Not surprisingly, industry sectors that receive more help also default on higher amounts!
In [63]:
by_sector.reset_index(inplace = True)
In [64]:
# Look at percent defaulted (in terms of US $) per sector
by_sector['Percent Defaulted'] = (by_sector['GrossChargeOffAmountAdj']/ by_sector['SBAGuaranteedApprovalAdj'])*100
In [65]:
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(8, 7))
pco = by_sector.sort_values("Percent Defaulted", ascending=False)
# Plot the total % defaulted (in dollars)
sns.set_color_codes("pastel")
sns.barplot(y = "Sector", x = "Percent Defaulted", data=pco, palette="Reds_d")
ax.set_xlabel('Percent Defaulted ($)')
ax.set_ylabel('')
ax.set_title('Industry Sector')
Out[65]:
In [66]:
# Number of loans per sector
by_sector_ct = loans_7a_matured.groupby('Sector').agg(lambda ts: (ts > 0).sum())
by_sector_ct.reset_index(inplace = True)
In [67]:
# Look at percent defaulted per sector (in terms of the number of loans)
by_sector_ct['Percent Defaulted'] = (by_sector_ct['GrossChargeOffAmountAdj']/ by_sector_ct['SBAGuaranteedApprovalAdj'])*100
In [68]:
# Look at percent defaulted per sector (in terms of the total number of loans)
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(8, 7))
pco = by_sector_ct.sort_values("Percent Defaulted", ascending=False)
# plot the total percent defaulted (in number of loans)
sns.set_color_codes("pastel")
sns.barplot(y="Sector", x= "Percent Defaulted", data=pco, palette="Reds_d")
ax.set_xlabel('Percent Defaulted (# of Loans)')
ax.set_ylabel('')
ax.set_title('Industry Sector')
Out[68]:
In [69]:
f.savefig('Sector.png', dpi = 300)
In [70]:
# Examine state information (borrower location)
borr_states = set(loans_7a_matured['BorrState'].unique())
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
# US territories
borr_states.difference(states)
# Count all the "odd" ones (not one of the 50 states or DC)
for s in borr_states.difference(states):
    print((loans_7a_matured['BorrState'] == s).sum())
# PR has 9834 loans, all others < 1000
print((loans_7a_matured['BorrState']).isnull().sum())
In [71]:
# Number of missing values for states
print((loans_7a_matured['BorrState']).isnull().sum())
# Drop the territories except for PR
for s in {'FM', 'MP', 'GU', 'MH', 'PW', 'AE', 'VI', 'AS', 'TQ'}:
    loans_7a_matured = loans_7a_matured[loans_7a_matured['BorrState'] != s]
# Drop missing values
loans_7a_matured = loans_7a_matured[loans_7a_matured['BorrState'].notnull()]
In [72]:
loans_7a_matured.info()
In [73]:
# One-hot encoding: state, year approved, sector, business type, loan status
loans_7a_matured = pd.concat([loans_7a_matured, pd.get_dummies(loans_7a_matured['BorrState'], drop_first=True)], axis = 1)
loans_7a_matured = pd.concat([loans_7a_matured, pd.get_dummies(loans_7a_matured['Sector'], drop_first=True)], axis = 1)
loans_7a_matured = pd.concat([loans_7a_matured, pd.get_dummies(loans_7a_matured['LoanStatus'], drop_first = True)], axis = 1)
loans_7a_matured = pd.concat([loans_7a_matured, pd.get_dummies(loans_7a_matured['BusinessType'], drop_first = True)], axis = 1)
loans_7a_matured = pd.concat([loans_7a_matured, pd.get_dummies(loans_7a_matured['ApprovalFiscalYear'], drop_first = True)], axis = 1)
loans_7a_matured.reset_index(inplace = True, drop = True)
In [74]:
list(loans_7a_matured.columns)
Out[74]:
In [75]:
# Drop those columns that we won't use in the (binary classification) model
loans_7a_matured.drop(['BorrState', 'SBAGuaranteedApproval', 'ApprovalDate', 'FirstDisbursementDate',
'NaicsCode', 'BusinessType', 'LoanStatus', 'ChargeOffDate', 'GrossChargeOffAmount',
'CPI_to2016','GrossChargeOffAmountAdj', 'MaturityDate','MaturityYear', 'Code', 'Sector', ], axis = 1, inplace = True)
In [76]:
#Save for later use
pd.to_pickle(loans_7a_matured, 'loans_7a_matured')