In [16]:
import pandas as pd
In [17]:
df = pd.read_csv('datasets/dataset.csv')
Let's see what dataset we have loaded.
In [18]:
df.education.unique()
Out[18]:
In [19]:
# Shorten the education labels to: school, college, bachelor, master.
# 'Bechalor' is left as written; it presumably mirrors the spelling in the raw data.
df['education'] = df['education'].replace({
    'High School or Below': 'school',
    'Bechalor': 'bachelor',
    'Master or Above': 'master',
})
In [20]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv('datasets/dataset.csv', index=False)
df[:10]
Out[20]:
We don't need all the columns, right? Let's drop the unnecessary ones.
In [21]:
df = df.drop(columns=['loan_id', 'effective_date', 'due_date', 'paid_off_time', 'past_due_days'])
In [22]:
# The dataframe holds the needed columns now. Cool.
df[:10]
Out[22]:
To get a summary of the whole dataframe (column names, dtypes and non-null counts), use info().
In [23]:
df.info()
In [24]:
# Let's clean the data and create columns if needed.
df['Gender'].unique()
Out[24]:
We can see that our dataset has two unique string values for Gender. We shouldn't assign numeric values like female = 1 and male = 2, because a model would read those numbers as magnitudes, as if one gender were "more" than the other. We only want to tell the categories apart, so we are going to have a separate column for each gender.
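To make the difference concrete, here is a minimal sketch in plain Python (no pandas) that contrasts integer codes with dummy columns. The `genders` list and the `one_hot` helper are made up for illustration; they are not part of the dataset or of pandas.

```python
# Toy data, made up for illustration; not taken from the real dataset.
genders = ["male", "female", "female", "male"]


def one_hot(values):
    """Return {category: [0/1 flags]} with one list per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Integer codes impose an order (2 > 1) that the categories do not have.
codes = {"female": 1, "male": 2}
as_integers = [codes[g] for g in genders]

# Dummy columns only record which category each row belongs to.
as_dummies = one_hot(genders)

print(as_integers)
print(as_dummies)
```

Each row has exactly one 1 across the dummy columns, so no category looks larger than another.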
Create a dataframe with two dummy columns, one named after each gender.
In [25]:
df_sex = pd.get_dummies(df['Gender'])
In [26]:
df_sex[:10]
Out[26]:
In [27]:
df = pd.concat([df,df_sex] , axis=1)
In [28]:
df[:10]
Out[28]:
In [29]:
# The dummy columns from df_sex are already in df, so drop the original Gender column
df = df.drop(['Gender'], axis=1)
In [30]:
df[:10]
Out[30]:
In [31]:
# Similarly, let's do the same for both loan_status and education.
# Converting categorical values into 0/1 columns like this is called one-hot encoding.
df_loan_status = pd.get_dummies(df['loan_status'])
df_education = pd.get_dummies(df['education'])
df = pd.concat([df, df_loan_status], axis=1)
df = pd.concat([df, df_education], axis=1)
df = df.drop(['loan_status', 'education'], axis=1)
In [32]:
df[:10]
Out[32]:
In [33]:
df.info()
In machine learning, computation is usually easier when the feature values lie between 0 and 1. Rescaling them into such a range is called normalization, and there are many ways to do it. Here I use min-max scaling (the same transformation as scikit-learn's MinMaxScaler with its default range), which maps the largest value in each column to 1 and the smallest to 0. The remaining values fall between 0 and 1.
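As a rough sketch of the formula behind min-max scaling, (x - min) / (max - min), here is a plain-Python version applied to a few hypothetical ages. The `min_max_scale` helper and the sample values are assumptions for illustration only; the notebook applies the same formula column-wise with pandas.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages, chosen only for illustration.
ages = [20, 30, 50]
scaled = min_max_scale(ages)
print(scaled)
```

The guard for a constant column is a choice made in this sketch. The pandas expression used in the notebook has no such guard, so a column whose values are all identical would come out as NaN there.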
In [34]:
df_to_norm = df[['Principal', 'terms', 'age']]
df_to_norm[:10]
Out[34]:
In [35]:
df_norm = (df_to_norm - df_to_norm.min()) / (df_to_norm.max() - df_to_norm.min())
df_norm[:10]
Out[35]:
In [36]:
df = df.drop(['Principal', 'terms', 'age'], axis=1)
df = pd.concat([df,df_norm], axis=1)
In [37]:
df[:10]
Out[37]:
The dataset is now clean and ready to move on to production.