Regression - Frame, Acquire & Refine

Raw Data

You are provided with the following data: loan_data.csv
This is the historical data that the bank has provided. It has the following columns

Application Attributes:

years: Number of years the applicant has been employed
ownership: Whether the applicant owns a house or not
income: Annual income of the applicant
age: Age of the applicant

Behavioural Attributes:

grade: Credit grade of the applicant

Outcome Variable:

amount : Amount of Loan provided to the applicant
default : Whether the applicant has defaulted or not
interest: Interest rate charged for the applicant

Frame the Problem

What are the features?
What is the target?

Discuss?



In [ ]:



In [ ]:



In [ ]:



In [ ]:

Acquire the Data



In [1]:

    
#Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [2]:

    
#Defualt Variables
%matplotlib inline
plt.rcParams['figure.figsize'] = (16,9)
plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.2f' % x)



In [7]:

    
#Load the dataset
df = pd.read_csv('data/loan_data.csv')



In [8]:

    
#View the first few rows of train
df.head()



In [10]:

    
#View the columns of the train dataset
df.columns









    Out[10]:





Index(['default', 'amount', 'interest', 'grade', 'years', 'ownership',
       'income', 'age'],
      dtype='object')



In [12]:

    
#View the data types of the train dataset
df.dtypes









    Out[12]:





default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object



In [13]:

    
#View the number of records in the data
df.shape









    Out[13]:





(29092, 8)



In [14]:

    
#View summary of raw data 
df.describe()

Refine the Data

Lets check the dataset for quality and compeleteness

Missing Values
Outliers

Check for Missing Values



In [16]:

    
# Find if df has missing values. Hint: There is a isnull() function
df.isnull().head()









    Out[16]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      0
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      1
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      2
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      3
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      4
      False
      False
      True
      False
      False
      False
      False
      False

One consideration we check here is the number of observations with missing values for those columns that have missing values. If a column has too many missing values, it might make sense to drop the column.



In [41]:

    
#let's see how many missing values are present
df.isnull().sum()









    Out[41]:





default         0
amount          0
interest     2776
grade           0
years         809
ownership       0
income          0
age             0
dtype: int64



In [42]:

    
df.isnull()









    Out[42]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      0
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      1
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      2
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      3
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      4
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      5
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      6
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      7
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      8
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      9
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      10
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      11
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      12
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      13
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      14
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      15
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      16
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      17
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      18
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      19
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      20
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      21
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      22
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      23
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      24
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      25
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      26
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      27
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      28
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      29062
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29063
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      29064
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29065
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29066
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29067
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29068
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29069
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      29070
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29071
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29072
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29073
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29074
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29075
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29076
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29077
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      29078
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29079
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29080
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29081
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29082
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29083
      False
      False
      True
      False
      False
      False
      False
      False
    
    
      29084
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29085
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29086
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29087
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29088
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29089
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29090
      False
      False
      False
      False
      False
      False
      False
      False
    
    
      29091
      False
      False
      True
      False
      False
      False
      False
      False
    
  

29092 rows × 8 columns



In [43]:

    
df[df.isnull().any(axis=1)].head()

So, we see that two columns have missing values: interest and years. Both the columns are numeric. We have three options for dealing with this missing values

Options to treat Missing Values

REMOVE - NAN rows
IMPUTATION - Replace them with something??
- Mean
- Median
- Fixed Number - Domain Relevant
- High Number (999) - Issue with modelling
BINNING - Categorical variable and "Missing becomes a number*
DOMAIN SPECIFIC - Entry error, pipeline, etc.



In [44]:

    
#Let's replace missing values with the median of the column
df.interest.median()









    Out[44]:





10.99



In [52]:

    
?df.fillna



In [58]:

    
df[df.isnull().any(axis=1)].head(20)



In [64]:

    
#there's a fillna function
df.fillna(df.median(), inplace=True)









    Out[64]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      0
      0
      5000
      10.65
      B
      10.00
      RENT
      24000.00
      33
    
    
      1
      0
      2400
      10.99
      C
      25.00
      RENT
      12252.00
      31
    
    
      2
      0
      10000
      13.49
      C
      13.00
      RENT
      49200.00
      24
    
    
      3
      0
      5000
      10.99
      A
      3.00
      RENT
      36000.00
      39
    
    
      4
      0
      3000
      10.99
      E
      9.00
      RENT
      48000.00
      24
    
    
      5
      0
      12000
      12.69
      B
      11.00
      OWN
      75000.00
      28
    
    
      6
      1
      9000
      13.49
      C
      0.00
      RENT
      30000.00
      22
    
    
      7
      0
      3000
      9.91
      B
      3.00
      RENT
      15000.00
      22
    
    
      8
      1
      10000
      10.65
      B
      3.00
      RENT
      100000.00
      28
    
    
      9
      0
      1000
      16.29
      D
      0.00
      RENT
      28000.00
      22
    
    
      10
      0
      10000
      10.99
      C
      4.00
      RENT
      42000.00
      23
    
    
      11
      0
      3600
      6.03
      A
      13.00
      MORTGAGE
      110000.00
      27
    
    
      12
      0
      6000
      10.99
      B
      1.00
      MORTGAGE
      84000.00
      30
    
    
      13
      0
      9200
      6.03
      A
      6.00
      RENT
      77385.19
      24
    
    
      14
      1
      21000
      12.42
      B
      17.00
      RENT
      105000.00
      29
    
    
      15
      0
      10000
      11.71
      B
      13.00
      OWN
      50000.00
      23
    
    
      16
      0
      10000
      11.71
      B
      5.00
      RENT
      50000.00
      22
    
    
      17
      1
      6000
      11.71
      B
      1.00
      RENT
      76000.00
      31
    
    
      18
      0
      15000
      9.91
      B
      2.00
      MORTGAGE
      92000.00
      23
    
    
      19
      1
      15000
      14.27
      C
      9.00
      RENT
      60000.00
      30
    
    
      20
      0
      4000
      11.71
      B
      19.00
      MORTGAGE
      106000.00
      27
    
    
      21
      0
      8500
      11.71
      B
      0.00
      RENT
      25000.00
      24
    
    
      22
      0
      4375
      7.51
      A
      7.00
      MORTGAGE
      17108.00
      38
    
    
      23
      0
      31825
      7.90
      A
      5.00
      MORTGAGE
      75000.00
      23
    
    
      24
      0
      5000
      8.90
      A
      2.00
      RENT
      24044.00
      22
    
    
      25
      0
      7000
      15.96
      C
      7.00
      RENT
      34000.00
      27
    
    
      26
      0
      12400
      10.65
      B
      18.00
      RENT
      41000.00
      36
    
    
      27
      0
      10800
      9.91
      B
      2.00
      RENT
      55596.00
      35
    
    
      28
      0
      15000
      7.90
      A
      1.00
      RENT
      45000.00
      23
    
    
      29
      0
      6000
      12.42
      B
      1.00
      RENT
      36852.00
      25
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      29062
      0
      12000
      12.49
      D
      9.00
      OWN
      75000.00
      23
    
    
      29063
      1
      20000
      10.99
      E
      6.00
      MORTGAGE
      145000.00
      28
    
    
      29064
      0
      1200
      11.54
      C
      0.00
      RENT
      20000.00
      29
    
    
      29065
      0
      12250
      10.59
      C
      3.00
      RENT
      60000.00
      23
    
    
      29066
      0
      2700
      15.96
      F
      1.00
      RENT
      52200.00
      25
    
    
      29067
      0
      5000
      9.01
      B
      1.00
      MORTGAGE
      250000.00
      30
    
    
      29068
      0
      3500
      9.96
      B
      0.00
      RENT
      13500.00
      21
    
    
      29069
      0
      5000
      10.99
      A
      8.00
      MORTGAGE
      150000.00
      30
    
    
      29070
      0
      5000
      7.43
      A
      7.00
      MORTGAGE
      95000.00
      25
    
    
      29071
      0
      25000
      10.59
      C
      2.00
      RENT
      48000.00
      27
    
    
      29072
      0
      10000
      10.28
      C
      0.00
      RENT
      27376.00
      22
    
    
      29073
      0
      5000
      9.01
      B
      1.00
      RENT
      73000.00
      24
    
    
      29074
      0
      5100
      8.38
      A
      4.00
      RENT
      32000.00
      25
    
    
      29075
      0
      5400
      7.75
      A
      0.00
      RENT
      8000.00
      22
    
    
      29076
      0
      5300
      8.38
      A
      7.00
      MORTGAGE
      74000.00
      42
    
    
      29077
      0
      5000
      10.99
      C
      1.00
      OWN
      30000.00
      25
    
    
      29078
      0
      3000
      10.28
      C
      2.00
      MORTGAGE
      200000.00
      67
    
    
      29079
      0
      3000
      8.70
      B
      0.00
      MORTGAGE
      25000.00
      36
    
    
      29080
      0
      5000
      7.43
      A
      8.00
      MORTGAGE
      120000.00
      29
    
    
      29081
      0
      5000
      8.07
      A
      0.00
      MORTGAGE
      180000.00
      21
    
    
      29082
      0
      5000
      9.33
      B
      2.00
      MORTGAGE
      180000.00
      23
    
    
      29083
      0
      5000
      10.99
      B
      4.00
      MORTGAGE
      48000.00
      33
    
    
      29084
      0
      5000
      11.22
      C
      0.00
      OWN
      80000.00
      39
    
    
      29085
      0
      5000
      7.43
      A
      1.00
      OWN
      85000.00
      46
    
    
      29086
      0
      5000
      8.70
      B
      5.00
      MORTGAGE
      75000.00
      35
    
    
      29087
      0
      2500
      8.07
      A
      4.00
      MORTGAGE
      110000.00
      27
    
    
      29088
      0
      8500
      10.28
      C
      3.00
      RENT
      18000.00
      25
    
    
      29089
      0
      5000
      8.07
      A
      0.00
      MORTGAGE
      100000.00
      27
    
    
      29090
      0
      5000
      7.43
      A
      0.00
      MORTGAGE
      200000.00
      23
    
    
      29091
      0
      7500
      10.99
      E
      0.00
      OWN
      22000.00
      25
    
  

29092 rows × 8 columns



In [65]:

    
#Now, let's check if train has missing values or not
df.isnull().sum()









    Out[65]:





default      0
amount       0
interest     0
grade        0
years        0
ownership    0
income       0
age          0
dtype: int64

Check for Outlier Values

Let us check first the categorical variables



In [66]:

    
# Which variables are Categorical?
df.dtypes









    Out[66]:





default        int64
amount         int64
interest     float64
grade         object
years        float64
ownership     object
income       float64
age            int64
dtype: object



In [69]:

    
df.grade.value_counts()









    Out[69]:





A    9649
B    9329
C    5748
D    3231
E     868
F     211
G      56
Name: grade, dtype: int64



In [68]:

    
# Create a Crosstab of those variables with another variable
pd.crosstab(df.grade, df.default)



In [70]:

    
# Create a Crosstab of those variables with another variable
df.ownership.value_counts()









    Out[70]:





RENT        14692
MORTGAGE    12002
OWN          2301
OTHER          97
Name: ownership, dtype: int64

Let us check outliers in the continuous variable

Plotting
- Histogram
- Box-Plot
Measuring
- Z-score > 3
- Modified Z-score > 3.5 where modified Z-score = 0.6745 * (x - x_median) / MAD



In [71]:

    
# Describe the data set continuous values
df.describe()

Clearly the age variable looks like it has an outlier - Age cannot be greater 100! Also the income variable looks like it may also have an outlier.



In [87]:

    
?plt.yscale



In [88]:

    
# Make a histogram of age

df.age.hist()
#plt.xlim(60,80)
#plt.ylim(0,100)
plt.yscale('log')



In [74]:

    
import seaborn as sns



In [75]:

    
sns.distplot(df.age)









    



/Users/amitkaps/miniconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    Out[75]:





<matplotlib.axes._subplots.AxesSubplot at 0x1136b4828>



In [89]:

    
# Make a histogram of income
df.income.hist()
plt.yscale('log')



In [91]:

    
# Make Histograms for all other 
df.years.hist()
plt.yscale('log')



In [95]:

    
plt.boxplot(df.income)









    Out[95]:





{'boxes': [<matplotlib.lines.Line2D at 0x1147d0cc0>],
 'caps': [<matplotlib.lines.Line2D at 0x1147be780>,
  <matplotlib.lines.Line2D at 0x1147be8d0>],
 'fliers': [<matplotlib.lines.Line2D at 0x114bf1be0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x114793da0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x1147d07f0>,
  <matplotlib.lines.Line2D at 0x11476c240>]}



In [98]:

    
plt.boxplot(df.years)









    Out[98]:





{'boxes': [<matplotlib.lines.Line2D at 0x114da7ba8>],
 'caps': [<matplotlib.lines.Line2D at 0x114da1fd0>,
  <matplotlib.lines.Line2D at 0x114d9aef0>],
 'fliers': [<matplotlib.lines.Line2D at 0x1147d5fd0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x1147d50f0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x114da7e10>,
  <matplotlib.lines.Line2D at 0x114da1e10>]}



In [99]:

    
# Make a scatter of age and income
plt.scatter(df.income, df.age)









    Out[99]:





<matplotlib.collections.PathCollection at 0x114dec6d8>

Find the observation which has age = 144 and remove it from the dataframe



In [102]:

    
# Find the observation 
df[df.age == df.age.max()]









    Out[102]:






  
    
      
      default
      amount
      interest
      grade
      years
      ownership
      income
      age
    
  
  
    
      19485
      0
      5000
      12.73
      C
      12.00
      MORTGAGE
      6000000.00
      144



In [105]:

    
df[df.age == df.age.max()].index









    Out[105]:





Int64Index([19485], dtype='int64')



In [108]:

    
df.drop?



In [110]:

    
# Use drop to remove the observation inplace
df.drop(19485, inplace=True)



In [111]:

    
# Find the shape of the df
df.shape









    Out[111]:





(29091, 8)



In [112]:

    
# Check again for outliers
plt.scatter(df.age, df.income)









    Out[112]:





<matplotlib.collections.PathCollection at 0x113d20710>



In [113]:

    
# Save the new file as cleaned data
df.to_csv("data/loan_data_clean.csv", index=False)



In [ ]:

    
#We are good to go to the next step



In [ ]:

	default	amount	interest	years	income	age
count	29092.00	29092.00	26316.00	28283.00	29092.00	29092.00
mean	0.11	9593.51	11.00	6.15	67168.83	27.70
std	0.31	6323.42	3.24	6.68	63606.52	6.23
min	0.00	500.00	5.42	0.00	4000.00	20.00
25%	0.00	5000.00	7.90	2.00	40000.00	23.00
50%	0.00	8000.00	10.99	4.00	56424.00	26.00
75%	0.00	12250.00	13.47	8.00	80000.00	30.00
max	1.00	35000.00	23.22	62.00	6000000.00	144.00

default	0	1
grade
A	9084	565
B	8344	985
C	4904	844
D	2651	580
E	692	176
F	155	56
G	35	21

	default	amount	interest	years	income	age
count	29092.00	29092.00	29092.00	29092.00	29092.00	29092.00
mean	0.11	9593.51	11.00	6.09	67168.83	27.70
std	0.31	6323.42	3.08	6.59	63606.52	6.23
min	0.00	500.00	5.42	0.00	4000.00	20.00
25%	0.00	5000.00	8.49	2.00	40000.00	23.00
50%	0.00	8000.00	10.99	4.00	56424.00	26.00
75%	0.00	12250.00	13.11	8.00	80000.00	30.00
max	1.00	35000.00	23.22	62.00	6000000.00	144.00

	amount	interest	grade	years	ownership	income	age
0	5000	10.65	B	10.00	RENT	24000.00	33
1	2400	nan	C	25.00	RENT	12252.00	31
2	10000	13.49	C	13.00	RENT	49200.00	24
3	5000	nan	A	3.00	RENT	36000.00	39
4	3000	nan	E	9.00	RENT	48000.00	24

	default	amount	interest	grade	years	ownership	income	age
0	False	False	False	False	False	False	False	False
1	False	False	True	False	False	False	False	False
2	False	False	False	False	False	False	False	False
3	False	False	True	False	False	False	False	False
4	False	False	True	False	False	False	False	False