In [16]:

    
import pandas as pd



In [17]:

    
df = pd.read_csv('datasets/dataset.csv')

Lets see what dataset we have loaded



In [18]:

    
df.education.unique()









    Out[18]:





array(['High School or Below', 'Bechalor', 'college', 'Master or Above'], dtype=object)



In [19]:

    
#school, college, bachelor, master
df['education'] = df['education'].replace(['High School or Below'], 'school')
df['education'] = df['education'].replace(['Bechalor'], 'bachelor')
df['education'] = df['education'].replace(['Master or Above'], 'master')



In [20]:

    
df.to_csv('datasets/dataset.csv')
df[:10]









    Out[20]:






  
    
      
      loan_id
      loan_status
      Principal
      terms
      effective_date
      due_date
      paid_off_time
      past_due_days
      age
      education
      Gender
    
  
  
    
      0
      xqd20166231
      PAIDOFF
      1000
      30
      9/8/2016
      10/7/2016
      9/14/2016 19:31
      NaN
      45
      school
      male
    
    
      1
      xqd20168902
      PAIDOFF
      1000
      30
      9/8/2016
      10/7/2016
      10/7/2016 9:00
      NaN
      50
      bachelor
      female
    
    
      2
      xqd20160003
      PAIDOFF
      1000
      30
      9/8/2016
      10/7/2016
      9/25/2016 16:58
      NaN
      33
      bachelor
      female
    
    
      3
      xqd20160004
      PAIDOFF
      1000
      15
      9/8/2016
      9/22/2016
      9/22/2016 20:00
      NaN
      27
      college
      male
    
    
      4
      xqd20160005
      PAIDOFF
      1000
      30
      9/9/2016
      10/8/2016
      9/23/2016 21:36
      NaN
      28
      college
      female
    
    
      5
      xqd20160706
      PAIDOFF
      300
      7
      9/9/2016
      9/15/2016
      9/9/2016 13:45
      NaN
      35
      master
      male
    
    
      6
      xqd20160007
      PAIDOFF
      1000
      30
      9/9/2016
      10/8/2016
      10/7/2016 23:07
      NaN
      29
      college
      male
    
    
      7
      xqd20160008
      PAIDOFF
      1000
      30
      9/9/2016
      10/8/2016
      10/5/2016 20:33
      NaN
      36
      college
      male
    
    
      8
      xqd20160909
      PAIDOFF
      1000
      30
      9/9/2016
      10/8/2016
      10/8/2016 16:00
      NaN
      28
      college
      male
    
    
      9
      xqd20160010
      PAIDOFF
      800
      15
      9/10/2016
      9/24/2016
      9/24/2016 13:00
      NaN
      26
      college
      male

We don't need all the columns. Right? Let's drop the unneccessary things.



In [21]:

    
df = df.drop(['loan_id', 'effective_date', 'due_date', 'paid_off_time', 'past_due_days'], axis = 1)



In [22]:

    
# The dataframe holds the needed columns now. Cool.
df[:10]









    Out[22]:






  
    
      
      loan_status
      Principal
      terms
      age
      education
      Gender
    
  
  
    
      0
      PAIDOFF
      1000
      30
      45
      school
      male
    
    
      1
      PAIDOFF
      1000
      30
      50
      bachelor
      female
    
    
      2
      PAIDOFF
      1000
      30
      33
      bachelor
      female
    
    
      3
      PAIDOFF
      1000
      15
      27
      college
      male
    
    
      4
      PAIDOFF
      1000
      30
      28
      college
      female
    
    
      5
      PAIDOFF
      300
      7
      35
      master
      male
    
    
      6
      PAIDOFF
      1000
      30
      29
      college
      male
    
    
      7
      PAIDOFF
      1000
      30
      36
      college
      male
    
    
      8
      PAIDOFF
      1000
      30
      28
      college
      male
    
    
      9
      PAIDOFF
      800
      15
      26
      college
      male

In order to get the information of the whole dataframe, use info()



In [23]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
loan_status    500 non-null object
Principal      500 non-null int64
terms          500 non-null int64
age            500 non-null int64
education      500 non-null object
Gender         500 non-null object
dtypes: int64(3), object(3)
memory usage: 23.5+ KB



In [24]:

    
# Lets clean the data and create columns if needed.
df['Gender'].unique()









    Out[24]:





array(['male', 'female'], dtype=object)

We can see that our dateset has two unique string values for GENDER. We can't assign numeric values like female = 1 and male = 2 because of feminism. Just kidding. We shouldn't assign because then they will be a factor that denotes intensity. We want to differentiate our category. So we are going to have sepeate columns for two genders.

Create df with two dummy columns named of genders.



In [25]:

    
df_sex = pd.get_dummies(df['Gender'])



In [26]:

    
df_sex[:10]



In [27]:

    
df = pd.concat([df,df_sex] , axis=1)



In [28]:

    
df[:10]









    Out[28]:






  
    
      
      loan_status
      Principal
      terms
      age
      education
      Gender
      female
      male
    
  
  
    
      0
      PAIDOFF
      1000
      30
      45
      school
      male
      0
      1
    
    
      1
      PAIDOFF
      1000
      30
      50
      bachelor
      female
      1
      0
    
    
      2
      PAIDOFF
      1000
      30
      33
      bachelor
      female
      1
      0
    
    
      3
      PAIDOFF
      1000
      15
      27
      college
      male
      0
      1
    
    
      4
      PAIDOFF
      1000
      30
      28
      college
      female
      1
      0
    
    
      5
      PAIDOFF
      300
      7
      35
      master
      male
      0
      1
    
    
      6
      PAIDOFF
      1000
      30
      29
      college
      male
      0
      1
    
    
      7
      PAIDOFF
      1000
      30
      36
      college
      male
      0
      1
    
    
      8
      PAIDOFF
      1000
      30
      28
      college
      male
      0
      1
    
    
      9
      PAIDOFF
      800
      15
      26
      college
      male
      0
      1



In [29]:

    
# Now drop the gender column from the main df and add df_sex to df

df = df.drop(['Gender'], axis=1)



In [30]:

    
df[:10]









    Out[30]:






  
    
      
      loan_status
      Principal
      terms
      age
      education
      female
      male
    
  
  
    
      0
      PAIDOFF
      1000
      30
      45
      school
      0
      1
    
    
      1
      PAIDOFF
      1000
      30
      50
      bachelor
      1
      0
    
    
      2
      PAIDOFF
      1000
      30
      33
      bachelor
      1
      0
    
    
      3
      PAIDOFF
      1000
      15
      27
      college
      0
      1
    
    
      4
      PAIDOFF
      1000
      30
      28
      college
      1
      0
    
    
      5
      PAIDOFF
      300
      7
      35
      master
      0
      1
    
    
      6
      PAIDOFF
      1000
      30
      29
      college
      0
      1
    
    
      7
      PAIDOFF
      1000
      30
      36
      college
      0
      1
    
    
      8
      PAIDOFF
      1000
      30
      28
      college
      0
      1
    
    
      9
      PAIDOFF
      800
      15
      26
      college
      0
      1



In [31]:

    
# Similary lets do the same process for both load_status and education.
# This process is called Categorical Conversion into Numerics of One-hot-coding

df_loan_status = pd.get_dummies(df['loan_status'])
df_education = pd.get_dummies(df['education'])
df = pd.concat([df, df_loan_status], axis=1)
df = pd.concat([df, df_education], axis=1)
df = df.drop(['loan_status', 'education'], axis=1)



In [32]:

    
df[:10]









    Out[32]:






  
    
      
      Principal
      terms
      age
      female
      male
      COLLECTION
      COLLECTION_PAIDOFF
      PAIDOFF
      bachelor
      college
      master
      school
    
  
  
    
      0
      1000
      30
      45
      0
      1
      0
      0
      1
      0
      0
      0
      1
    
    
      1
      1000
      30
      50
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      2
      1000
      30
      33
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      3
      1000
      15
      27
      0
      1
      0
      0
      1
      0
      1
      0
      0
    
    
      4
      1000
      30
      28
      1
      0
      0
      0
      1
      0
      1
      0
      0
    
    
      5
      300
      7
      35
      0
      1
      0
      0
      1
      0
      0
      1
      0
    
    
      6
      1000
      30
      29
      0
      1
      0
      0
      1
      0
      1
      0
      0
    
    
      7
      1000
      30
      36
      0
      1
      0
      0
      1
      0
      1
      0
      0
    
    
      8
      1000
      30
      28
      0
      1
      0
      0
      1
      0
      1
      0
      0
    
    
      9
      800
      15
      26
      0
      1
      0
      0
      1
      0
      1
      0
      0



In [33]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 12 columns):
Principal             500 non-null int64
terms                 500 non-null int64
age                   500 non-null int64
female                500 non-null uint8
male                  500 non-null uint8
COLLECTION            500 non-null uint8
COLLECTION_PAIDOFF    500 non-null uint8
PAIDOFF               500 non-null uint8
bachelor              500 non-null uint8
college               500 non-null uint8
master                500 non-null uint8
school                500 non-null uint8
dtypes: int64(3), uint8(9)
memory usage: 16.2 KB

In machine learning, its always easier to compute if the values are between 1 to 0 (either positive or negative) The process of converting them into such values is called normalization There are many ways to normalize them. Here I choose MinMaxScalar which converts the highest value to 1 and smallest value to 0. Remaining values exist between 0 to 1



In [34]:

    
df_to_norm = df[['Principal', 'terms', 'age']]
df_to_norm[:10]



In [35]:

    
df_norm = (df_to_norm - df_to_norm.min()) / (df_to_norm.max() - df_to_norm.min())
df_norm[:10]



In [36]:

    
df = df.drop(['Principal', 'terms', 'age'], axis=1)
df = pd.concat([df,df_norm], axis=1)



In [37]:

    
df[:10]









    Out[37]:






  
    
      
      female
      male
      COLLECTION
      COLLECTION_PAIDOFF
      PAIDOFF
      bachelor
      college
      master
      school
      Principal
      terms
      age
    
  
  
    
      0
      0
      1
      0
      0
      1
      0
      0
      0
      1
      1.000000
      1.000000
      0.818182
    
    
      1
      1
      0
      0
      0
      1
      1
      0
      0
      0
      1.000000
      1.000000
      0.969697
    
    
      2
      1
      0
      0
      0
      1
      1
      0
      0
      0
      1.000000
      1.000000
      0.454545
    
    
      3
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1.000000
      0.347826
      0.272727
    
    
      4
      1
      0
      0
      0
      1
      0
      1
      0
      0
      1.000000
      1.000000
      0.303030
    
    
      5
      0
      1
      0
      0
      1
      0
      0
      1
      0
      0.000000
      0.000000
      0.515152
    
    
      6
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1.000000
      1.000000
      0.333333
    
    
      7
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1.000000
      1.000000
      0.545455
    
    
      8
      0
      1
      0
      0
      1
      0
      1
      0
      0
      1.000000
      1.000000
      0.303030
    
    
      9
      0
      1
      0
      0
      1
      0
      1
      0
      0
      0.714286
      0.347826
      0.242424

Thus the dataset is clean to move further to production

	loan_id	loan_status	Principal	terms	effective_date	due_date	paid_off_time	past_due_days	age	education	Gender
0	xqd20166231	PAIDOFF	1000	30	9/8/2016	10/7/2016	9/14/2016 19:31	NaN	45	school	male
1	xqd20168902	PAIDOFF	1000	30	9/8/2016	10/7/2016	10/7/2016 9:00	NaN	50	bachelor	female
2	xqd20160003	PAIDOFF	1000	30	9/8/2016	10/7/2016	9/25/2016 16:58	NaN	33	bachelor	female
3	xqd20160004	PAIDOFF	1000	15	9/8/2016	9/22/2016	9/22/2016 20:00	NaN	27	college	male
4	xqd20160005	PAIDOFF	1000	30	9/9/2016	10/8/2016	9/23/2016 21:36	NaN	28	college	female
5	xqd20160706	PAIDOFF	300	7	9/9/2016	9/15/2016	9/9/2016 13:45	NaN	35	master	male
6	xqd20160007	PAIDOFF	1000	30	9/9/2016	10/8/2016	10/7/2016 23:07	NaN	29	college	male
7	xqd20160008	PAIDOFF	1000	30	9/9/2016	10/8/2016	10/5/2016 20:33	NaN	36	college	male
8	xqd20160909	PAIDOFF	1000	30	9/9/2016	10/8/2016	10/8/2016 16:00	NaN	28	college	male
9	xqd20160010	PAIDOFF	800	15	9/10/2016	9/24/2016	9/24/2016 13:00	NaN	26	college	male

	Principal	terms	age
0	1.000000	1.000000	0.818182
1	1.000000	1.000000	0.969697
2	1.000000	1.000000	0.454545
3	1.000000	0.347826	0.272727
4	1.000000	1.000000	0.303030
5	0.000000	0.000000	0.515152
6	1.000000	1.000000	0.333333
7	1.000000	1.000000	0.545455
8	1.000000	1.000000	0.303030
9	0.714286	0.347826	0.242424