Linear Regression - Data Cleanup

Here we get into the detailed steps involved in data cleanup of the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup step by step.

  1. Remove '%'
    • strip the last letter
    • map to float
  1. Remove ' months' suffix

    • strip the tailing letters
    • map to int
  2. Remove bad data

    • n/a values
    • outlier income
    • histograms
    • describe
  3. Map FICO.Range to single int

    • split on '-'
    • use bottom value
    • convert to int
    • show how to do mid-point

In [329]:
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData


Out[329]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 81174 to 3116
Data columns (total 14 columns):
Amount.Requested                  2500  non-null values
Amount.Funded.By.Investors        2500  non-null values
Interest.Rate                     2500  non-null values
Loan.Length                       2500  non-null values
Loan.Purpose                      2500  non-null values
Debt.To.Income.Ratio              2500  non-null values
State                             2500  non-null values
Home.Ownership                    2500  non-null values
Monthly.Income                    2499  non-null values
FICO.Range                        2500  non-null values
Open.CREDIT.Lines                 2498  non-null values
Revolving.CREDIT.Balance          2498  non-null values
Inquiries.in.the.Last.6.Months    2498  non-null values
Employment.Length                 2500  non-null values
dtypes: float64(5), int64(1), object(8)

In [330]:
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate


Out[330]:
81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object
  • removing % signs from rates

In [331]:
ir = loansData['Interest.Rate']
ir[0:5]


Out[331]:
81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object

In [332]:
type(ir[0:1])
irbak = ir
type(irbak)
loansDataBak = loansData
loansDataBak.reset_index()
loansDataBak
loansDataBak['Interest.Rate']
loansData


Out[332]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 81174 to 3116
Data columns (total 14 columns):
Amount.Requested                  2500  non-null values
Amount.Funded.By.Investors        2500  non-null values
Interest.Rate                     2500  non-null values
Loan.Length                       2500  non-null values
Loan.Purpose                      2500  non-null values
Debt.To.Income.Ratio              2500  non-null values
State                             2500  non-null values
Home.Ownership                    2500  non-null values
Monthly.Income                    2499  non-null values
FICO.Range                        2500  non-null values
Open.CREDIT.Lines                 2498  non-null values
Revolving.CREDIT.Balance          2498  non-null values
Inquiries.in.the.Last.6.Months    2498  non-null values
Employment.Length                 2500  non-null values
dtypes: float64(5), int64(1), object(8)

In [333]:
ldb = loansDataBak.reset_index() # explain
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)

[irates[k] for k in (0,1,2,3,4)]
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError                                Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
#      5 
#      6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float: 
#"""


Out[333]:
['8.90%', '12.12%', '21.98%', '9.99%', '11.71%']

In [334]:
type(ldb)


Out[334]:
pandas.core.frame.DataFrame

In [335]:
srates = ldb['Interest.Rate']

In [336]:
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'

In [337]:
nas = [ x for x in srates if not(x[0].isdigit()) ] # AttributeError: 'str' object has no attribute 'isnull'

In [338]:
len(nas)


Out[338]:
0

In [339]:
srates[0][:-1]


Out[339]:
'8.90'

In [340]:
float(srates[0][:-1])


Out[340]:
8.9

In [341]:
nopct = [ x[:-1] for x in srates ]

In [342]:
flrates = [float(x) for x in nopct]

In [343]:
flrates[0:5]


Out[343]:
[8.9, 12.12, 21.98, 9.99, 11.71]

In [344]:
flrate = map(float, nopct)

In [345]:
flrate[0:5]


Out[345]:
[8.9, 12.12, 21.98, 9.99, 11.71]

In [346]:
len(flrate)


Out[346]:
2500

In [347]:
ldb


Out[347]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 0 to 2499
Data columns (total 15 columns):
index                             2500  non-null values
Amount.Requested                  2500  non-null values
Amount.Funded.By.Investors        2500  non-null values
Interest.Rate                     2500  non-null values
Loan.Length                       2500  non-null values
Loan.Purpose                      2500  non-null values
Debt.To.Income.Ratio              2500  non-null values
State                             2500  non-null values
Home.Ownership                    2500  non-null values
Monthly.Income                    2499  non-null values
FICO.Range                        2500  non-null values
Open.CREDIT.Lines                 2498  non-null values
Revolving.CREDIT.Balance          2498  non-null values
Inquiries.in.the.Last.6.Months    2498  non-null values
Employment.Length                 2500  non-null values
dtypes: float64(5), int64(2), object(8)

In [348]:
ldb['Interest.Rate'] = flrate

In [349]:
ldb[0:5]


Out[349]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 15 columns):
index                             5  non-null values
Amount.Requested                  5  non-null values
Amount.Funded.By.Investors        5  non-null values
Interest.Rate                     5  non-null values
Loan.Length                       5  non-null values
Loan.Purpose                      5  non-null values
Debt.To.Income.Ratio              5  non-null values
State                             5  non-null values
Home.Ownership                    5  non-null values
Monthly.Income                    5  non-null values
FICO.Range                        5  non-null values
Open.CREDIT.Lines                 5  non-null values
Revolving.CREDIT.Balance          5  non-null values
Inquiries.in.the.Last.6.Months    5  non-null values
Employment.Length                 5  non-null values
dtypes: float64(6), int64(2), object(7)

In [350]:
ldb['Interest.Rate'][0:5]


Out[350]:
0     8.90
1    12.12
2    21.98
3     9.99
4    11.71
Name: Interest.Rate, dtype: float64

In [351]:
srates = loansData['Interest.Rate']

In [352]:
nopct = [ x[:-1] for x in srates ]

In [353]:
flrates = [float(x) for x in nopct]

In [354]:
rates = [float(x[:-1]) for x in srates] # use this

In [355]:
flrates == rates


Out[355]:
True

In [356]:
loansData['Interest.Rate'] = flrates

In [357]:
loansData['Interest.Rate'][0:5]


Out[357]:
81174     8.90
99592    12.12
80059    21.98
15825     9.99
33182    11.71
Name: Interest.Rate, dtype: float64

Conclusion of step 1

  • ok! whew! we're done with the % symbol stuff
  • we learnt quite a few things along the way that will be useful in the next part

Step 2: Remove the months


In [358]:
withmons = ldb['Loan.Length']

In [358]:


In [359]:
wmons = withmons[0:]

In [360]:
wmons[0:5]


Out[360]:
0    36 months
1    36 months
2    60 months
3    36 months
4    36 months
Name: Loan.Length, dtype: object

In [361]:
wmons


Out[361]:
0     36 months
1     36 months
2     60 months
3     36 months
4     36 months
5     36 months
6     36 months
7     60 months
8     36 months
9     36 months
10    36 months
11    36 months
12    60 months
13    36 months
14    36 months
...
2485    36 months
2486    36 months
2487    36 months
2488    36 months
2489    36 months
2490    36 months
2491    36 months
2492    36 months
2493    60 months
2494    36 months
2495    60 months
2496    60 months
2497    36 months
2498    36 months
2499    36 months
Name: Loan.Length, Length: 2500, dtype: object

In [362]:
wmons[0].split()


Out[362]:
['36', 'months']

In [363]:
wmons[0].split()[0]


Out[363]:
'36'

In [364]:
int(wmons[0].split()[0])


Out[364]:
36

In [365]:
x = wmons[0].split()

In [366]:
x[0]


Out[366]:
'36'

In [367]:
int(x[0])


Out[367]:
36

In [368]:
intmons = [ int(x.split()[0]) for x in wmons ]

In [369]:
intmons[0:10]


Out[369]:
[36, 36, 60, 36, 36, 36, 36, 60, 36, 36]

In [370]:
loansData['Loan.Length']


Out[370]:
81174    36 months
99592    36 months
80059    60 months
15825    36 months
33182    36 months
62403    36 months
48808    36 months
22090    60 months
76404    36 months
15867    36 months
94971    36 months
36911    36 months
41200    60 months
83869    36 months
53853    36 months
...
13070    36 months
45836    36 months
52330    36 months
48243    36 months
63256    36 months
42124    36 months
78043    36 months
925      36 months
74047    60 months
49957    36 months
23735    60 months
65882    60 months
55610    36 months
38576    36 months
3116     36 months
Name: Loan.Length, Length: 2500, dtype: object

In [371]:
loansData['Loan.Length'].value_counts()


Out[371]:
36 months    1952
60 months     548
dtype: int64

Conclusion of Step 2

  • Here we used the techniques we learned in Step 1.
    • Pull out a column from a data frame
    • Operate on it, perform some transformations
    • Replace the column in the original dataframe with this new column
  • We applied them to removing the ' months' suffix in the Loan.Length column.

Step 3: Remove bad data


In [372]:
loansData['Monthly.Income'].describe()


Out[372]:
count      2499.000000
mean       5688.931321
std        3963.118185
min         588.500000
25%        3500.000000
50%        5000.000000
75%        6800.000000
max      102750.000000
dtype: float64

First remove implausible values. We see the max value to be 102750.
This is a MONTHLY income of 100K dollars, which is certainly possible, but ....
highly implausible for a person seeking a loan of a few 10's of K dollars, i.e. implausible in this context.


In [373]:
loansData['Monthly.Income'].idxmax() # find the place where the max occurs


Out[373]:
54487

But there's a better way - a row filter i.e. an expression used as a way to restrict the rows in a dataframe.
In our case we want to eliminate rows above 100K dollars. i.e. only keep those less than 100K dollars.


In [374]:
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]


Out[374]:
81174     6541.67
99592     4583.33
80059    11500.00
15825     3833.33
33182     3195.00
62403     4891.67
48808     2916.67
22090    13863.42
76404     3150.00
15867     5000.00
94971     3575.00
36911     4250.00
41200    14166.67
83869     9166.67
53853    11250.00
...
13070    7083.33
45836    2464.37
52330    8000.00
48243    3833.33
63256    5166.67
42124    4500.00
78043    3983.33
925      5166.67
74047    7083.33
49957    6666.67
23735    9250.00
65882    8903.25
55610    2166.67
38576    3500.00
3116     3875.00
Name: Monthly.Income, Length: 2498, dtype: float64

In [374]:


In [375]:
loansData['Monthly.Income'].describe()


Out[375]:
count      2499.000000
mean       5688.931321
std        3963.118185
min         588.500000
25%        3500.000000
50%        5000.000000
75%        6800.000000
max      102750.000000
dtype: float64

In [376]:
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]

In [377]:
ldlt100


Out[377]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 2498 entries, 0 to 2499
Data columns (total 15 columns):
index                             2498  non-null values
Amount.Requested                  2498  non-null values
Amount.Funded.By.Investors        2498  non-null values
Interest.Rate                     2498  non-null values
Loan.Length                       2498  non-null values
Loan.Purpose                      2498  non-null values
Debt.To.Income.Ratio              2498  non-null values
State                             2498  non-null values
Home.Ownership                    2498  non-null values
Monthly.Income                    2498  non-null values
FICO.Range                        2498  non-null values
Open.CREDIT.Lines                 2497  non-null values
Revolving.CREDIT.Balance          2497  non-null values
Inquiries.in.the.Last.6.Months    2497  non-null values
Employment.Length                 2498  non-null values
dtypes: float64(6), int64(2), object(7)

In [378]:
len(ldlt100)


Out[378]:
2498

Now drop any rows that have 'NA' values ie data not available.
In database terminiology these would be 'NULL' values.


In [379]:
ldb2 = ldlt100.dropna()

In [380]:
ldb2


Out[380]:
&ltclass 'pandas.core.frame.DataFrame'>
Int64Index: 2497 entries, 0 to 2499
Data columns (total 15 columns):
index                             2497  non-null values
Amount.Requested                  2497  non-null values
Amount.Funded.By.Investors        2497  non-null values
Interest.Rate                     2497  non-null values
Loan.Length                       2497  non-null values
Loan.Purpose                      2497  non-null values
Debt.To.Income.Ratio              2497  non-null values
State                             2497  non-null values
Home.Ownership                    2497  non-null values
Monthly.Income                    2497  non-null values
FICO.Range                        2497  non-null values
Open.CREDIT.Lines                 2497  non-null values
Revolving.CREDIT.Balance          2497  non-null values
Inquiries.in.the.Last.6.Months    2497  non-null values
Employment.Length                 2497  non-null values
dtypes: float64(6), int64(2), object(7)

In [381]:
len(ldb2)


Out[381]:
2497

So we dropped one row that had an NA value somewhere.

Conclusion of step 3

  • We used techniques we learned in step 1 to pick out a columns and operate on it
  • We also learned how to filter the data based on expressions involving column values
  • Finally we learnt how to drop NA values

Now that we have removed the "bad" data, let's take on the final data-cleaning task for this data set - converting the ranges to single integers.

Step 4: Change FICO range to a single value

Note that the FICO values are given in a range which is in the form of a string that looks like lowerlimit-upperlimit, eg 720-724.
We want to convert these values to a single int value representing the lower limit. e.g. 720 in the above example.


In [382]:
ficostr = ldb2['FICO.Range']

In [383]:
ficostr[0:10]


Out[383]:
0    735-739
1    715-719
2    690-694
3    695-699
4    695-699
5    670-674
6    720-724
7    705-709
8    685-689
9    715-719
Name: FICO.Range, dtype: object

In [384]:
ficostr[0]


Out[384]:
'735-739'

In [385]:
ficoint = [ int(x.split('-')[0]) for x in ficostr ]

In [386]:
ficoint[0:10]


Out[386]:
[735, 715, 690, 695, 695, 670, 720, 705, 685, 715]

In [387]:
ldb2['FICO.Range'] = ficoint

In [388]:
len(ficoint)


Out[388]:
2497

In [389]:
ldb2['FICO.Range']


Out[389]:
0     735
1     715
2     690
3     695
4     695
5     670
6     720
7     705
8     685
9     715
10    670
11    665
12    670
13    735
14    725
...
2485    765
2486    665
2487    660
2488    685
2489    670
2490    710
2491    720
2492    710
2493    675
2494    685
2495    705
2496    740
2497    680
2498    675
2499    670
Name: FICO.Range, Length: 2497, dtype: int64

Conclusion of Step 4

  • We used techniques similar to the ones in Step 1
    • we picked a column - the FICO.Range column
    • we split the values on the separator '-'
    • we picked the fist value, i.e. the lower limit
    • we converted it to an int

Now we have a dataset that we can use for our data exploration and analysis


In [389]: