In [16]:
import pandas as pd

In [17]:
df = pd.read_csv('datasets/dataset.csv')

Lets see what dataset we have loaded


In [18]:
df.education.unique()


Out[18]:
array(['High School or Below', 'Bechalor', 'college', 'Master or Above'], dtype=object)

In [19]:
#school, college, bachelor, master
df['education'] = df['education'].replace(['High School or Below'], 'school')
df['education'] = df['education'].replace(['Bechalor'], 'bachelor')
df['education'] = df['education'].replace(['Master or Above'], 'master')

In [20]:
df.to_csv('datasets/dataset.csv')
df[:10]


Out[20]:
loan_id loan_status Principal terms effective_date due_date paid_off_time past_due_days age education Gender
0 xqd20166231 PAIDOFF 1000 30 9/8/2016 10/7/2016 9/14/2016 19:31 NaN 45 school male
1 xqd20168902 PAIDOFF 1000 30 9/8/2016 10/7/2016 10/7/2016 9:00 NaN 50 bachelor female
2 xqd20160003 PAIDOFF 1000 30 9/8/2016 10/7/2016 9/25/2016 16:58 NaN 33 bachelor female
3 xqd20160004 PAIDOFF 1000 15 9/8/2016 9/22/2016 9/22/2016 20:00 NaN 27 college male
4 xqd20160005 PAIDOFF 1000 30 9/9/2016 10/8/2016 9/23/2016 21:36 NaN 28 college female
5 xqd20160706 PAIDOFF 300 7 9/9/2016 9/15/2016 9/9/2016 13:45 NaN 35 master male
6 xqd20160007 PAIDOFF 1000 30 9/9/2016 10/8/2016 10/7/2016 23:07 NaN 29 college male
7 xqd20160008 PAIDOFF 1000 30 9/9/2016 10/8/2016 10/5/2016 20:33 NaN 36 college male
8 xqd20160909 PAIDOFF 1000 30 9/9/2016 10/8/2016 10/8/2016 16:00 NaN 28 college male
9 xqd20160010 PAIDOFF 800 15 9/10/2016 9/24/2016 9/24/2016 13:00 NaN 26 college male

We don't need all the columns. Right? Let's drop the unneccessary things.


In [21]:
df = df.drop(['loan_id', 'effective_date', 'due_date', 'paid_off_time', 'past_due_days'], axis = 1)

In [22]:
# The dataframe holds the needed columns now. Cool.
df[:10]


Out[22]:
loan_status Principal terms age education Gender
0 PAIDOFF 1000 30 45 school male
1 PAIDOFF 1000 30 50 bachelor female
2 PAIDOFF 1000 30 33 bachelor female
3 PAIDOFF 1000 15 27 college male
4 PAIDOFF 1000 30 28 college female
5 PAIDOFF 300 7 35 master male
6 PAIDOFF 1000 30 29 college male
7 PAIDOFF 1000 30 36 college male
8 PAIDOFF 1000 30 28 college male
9 PAIDOFF 800 15 26 college male

In order to get the information of the whole dataframe, use info()


In [23]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
loan_status    500 non-null object
Principal      500 non-null int64
terms          500 non-null int64
age            500 non-null int64
education      500 non-null object
Gender         500 non-null object
dtypes: int64(3), object(3)
memory usage: 23.5+ KB

In [24]:
# Lets clean the data and create columns if needed.
df['Gender'].unique()


Out[24]:
array(['male', 'female'], dtype=object)

We can see that our dateset has two unique string values for GENDER. We can't assign numeric values like female = 1 and male = 2 because of feminism. Just kidding. We shouldn't assign because then they will be a factor that denotes intensity. We want to differentiate our category. So we are going to have sepeate columns for two genders.

Create df with two dummy columns named of genders.


In [25]:
df_sex = pd.get_dummies(df['Gender'])

In [26]:
df_sex[:10]


Out[26]:
female male
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 1
7 0 1
8 0 1
9 0 1

In [27]:
df = pd.concat([df,df_sex] , axis=1)

In [28]:
df[:10]


Out[28]:
loan_status Principal terms age education Gender female male
0 PAIDOFF 1000 30 45 school male 0 1
1 PAIDOFF 1000 30 50 bachelor female 1 0
2 PAIDOFF 1000 30 33 bachelor female 1 0
3 PAIDOFF 1000 15 27 college male 0 1
4 PAIDOFF 1000 30 28 college female 1 0
5 PAIDOFF 300 7 35 master male 0 1
6 PAIDOFF 1000 30 29 college male 0 1
7 PAIDOFF 1000 30 36 college male 0 1
8 PAIDOFF 1000 30 28 college male 0 1
9 PAIDOFF 800 15 26 college male 0 1

In [29]:
# Now drop the gender column from the main df and add df_sex to df

df = df.drop(['Gender'], axis=1)

In [30]:
df[:10]


Out[30]:
loan_status Principal terms age education female male
0 PAIDOFF 1000 30 45 school 0 1
1 PAIDOFF 1000 30 50 bachelor 1 0
2 PAIDOFF 1000 30 33 bachelor 1 0
3 PAIDOFF 1000 15 27 college 0 1
4 PAIDOFF 1000 30 28 college 1 0
5 PAIDOFF 300 7 35 master 0 1
6 PAIDOFF 1000 30 29 college 0 1
7 PAIDOFF 1000 30 36 college 0 1
8 PAIDOFF 1000 30 28 college 0 1
9 PAIDOFF 800 15 26 college 0 1

In [31]:
# Similary lets do the same process for both load_status and education.
# This process is called Categorical Conversion into Numerics of One-hot-coding

df_loan_status = pd.get_dummies(df['loan_status'])
df_education = pd.get_dummies(df['education'])
df = pd.concat([df, df_loan_status], axis=1)
df = pd.concat([df, df_education], axis=1)
df = df.drop(['loan_status', 'education'], axis=1)

In [32]:
df[:10]


Out[32]:
Principal terms age female male COLLECTION COLLECTION_PAIDOFF PAIDOFF bachelor college master school
0 1000 30 45 0 1 0 0 1 0 0 0 1
1 1000 30 50 1 0 0 0 1 1 0 0 0
2 1000 30 33 1 0 0 0 1 1 0 0 0
3 1000 15 27 0 1 0 0 1 0 1 0 0
4 1000 30 28 1 0 0 0 1 0 1 0 0
5 300 7 35 0 1 0 0 1 0 0 1 0
6 1000 30 29 0 1 0 0 1 0 1 0 0
7 1000 30 36 0 1 0 0 1 0 1 0 0
8 1000 30 28 0 1 0 0 1 0 1 0 0
9 800 15 26 0 1 0 0 1 0 1 0 0

In [33]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 12 columns):
Principal             500 non-null int64
terms                 500 non-null int64
age                   500 non-null int64
female                500 non-null uint8
male                  500 non-null uint8
COLLECTION            500 non-null uint8
COLLECTION_PAIDOFF    500 non-null uint8
PAIDOFF               500 non-null uint8
bachelor              500 non-null uint8
college               500 non-null uint8
master                500 non-null uint8
school                500 non-null uint8
dtypes: int64(3), uint8(9)
memory usage: 16.2 KB

In machine learning, its always easier to compute if the values are between 1 to 0 (either positive or negative) The process of converting them into such values is called normalization There are many ways to normalize them. Here I choose MinMaxScalar which converts the highest value to 1 and smallest value to 0. Remaining values exist between 0 to 1


In [34]:
df_to_norm = df[['Principal', 'terms', 'age']]
df_to_norm[:10]


Out[34]:
Principal terms age
0 1000 30 45
1 1000 30 50
2 1000 30 33
3 1000 15 27
4 1000 30 28
5 300 7 35
6 1000 30 29
7 1000 30 36
8 1000 30 28
9 800 15 26

In [35]:
df_norm = (df_to_norm - df_to_norm.min()) / (df_to_norm.max() - df_to_norm.min())
df_norm[:10]


Out[35]:
Principal terms age
0 1.000000 1.000000 0.818182
1 1.000000 1.000000 0.969697
2 1.000000 1.000000 0.454545
3 1.000000 0.347826 0.272727
4 1.000000 1.000000 0.303030
5 0.000000 0.000000 0.515152
6 1.000000 1.000000 0.333333
7 1.000000 1.000000 0.545455
8 1.000000 1.000000 0.303030
9 0.714286 0.347826 0.242424

In [36]:
df = df.drop(['Principal', 'terms', 'age'], axis=1)
df = pd.concat([df,df_norm], axis=1)

In [37]:
df[:10]


Out[37]:
female male COLLECTION COLLECTION_PAIDOFF PAIDOFF bachelor college master school Principal terms age
0 0 1 0 0 1 0 0 0 1 1.000000 1.000000 0.818182
1 1 0 0 0 1 1 0 0 0 1.000000 1.000000 0.969697
2 1 0 0 0 1 1 0 0 0 1.000000 1.000000 0.454545
3 0 1 0 0 1 0 1 0 0 1.000000 0.347826 0.272727
4 1 0 0 0 1 0 1 0 0 1.000000 1.000000 0.303030
5 0 1 0 0 1 0 0 1 0 0.000000 0.000000 0.515152
6 0 1 0 0 1 0 1 0 0 1.000000 1.000000 0.333333
7 0 1 0 0 1 0 1 0 0 1.000000 1.000000 0.545455
8 0 1 0 0 1 0 1 0 0 1.000000 1.000000 0.303030
9 0 1 0 0 1 0 1 0 0 0.714286 0.347826 0.242424

Thus the dataset is clean to move further to production