If you are reading this article, I'm sure you love statistical modeling (or will start loving it if you're a beginner)! Not sure about you, but in my initial days with machine learning I was so eager to build a model that I didn't even take the pain of opening my data file, and just ran my first model. As expected, this almost never worked out, and the interpreter started throwing errors. On backtracking, I found the most common reasons: missing data, incorrect coding of values, nominal variables being treated as numeric because their values are integers, and so on.
Data exploration and feature engineering are crucial for successfully deriving actionable insights from data. In this article, I will share some tips and tricks for doing this easily using Pandas in Python. Python is gaining popularity, and Pandas is one of its most popular tools for data handling.
Note: We'll be using the dataset of the "Loan Prediction" problem from the Analytics Vidhya datahacks. Click here to access the dataset and problem statement. The data can be loaded into Python using the command:
In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")
# test = pd.read_csv("test.csv", index_col="Loan_ID")
print(data.shape)
data.columns
Out[1]:
In [2]:
data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
Out[2]:
Although rarely needed, at times you may have to iterate through all rows. One common problem we face is that nominal variables with numeric categories are treated as numerical by default by Python. Also, there might be a numeric variable where characters were entered in one of the rows, and it will be considered categorical by default. So it's generally a good idea to manually define the column types.
In [3]:
#Check current type:
data.dtypes
Out[3]:
Here we see that Credit_History is a nominal variable but appears as float. A good way to tackle this issue is to create a csv file with column names and types. This way, we can write a generic function to read the file and assign the column data types. For instance, in this case I've defined a csv file datatypes.csv (download).
In [4]:
#Load the file:
colTypes = pd.read_csv('datatypes.csv')
print(colTypes)
On loading this file, we can iterate through each row and assign the datatype from the 'type' column to the variable named in the 'feature' column.
In [5]:
#Iterate through each row and assign variable type.
# Note: astype is used to assign types
for i, row in colTypes.iterrows():  #i: dataframe index; row: each row in series format
    if row['type']=="categorical":
        data[row['feature']] = data[row['feature']].astype(object)
    elif row['type']=="continuous":
        data[row['feature']] = data[row['feature']].astype(float)
print(data.dtypes)
Now the credit history column is modified to 'object' type which is used for representing nominal variables in Pandas.
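If you prefer to skip the csv file, newer versions of Pandas also let you pass a {column: dtype} dictionary directly to astype. A minimal sketch on a small toy frame (the values here are made up, not from the loan dataset):

```python
import pandas as pd

# Toy frame standing in for the loan data (hypothetical values):
df = pd.DataFrame({"Credit_History": [1.0, 0.0, 1.0],
                   "LoanAmount": [128, 66, 120]})

# Assign several column types in one call:
df = df.astype({"Credit_History": "object", "LoanAmount": "float"})
print(df.dtypes)
```

This avoids the explicit loop when the mapping is small enough to write inline.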
The apply function is one of the handiest tools when it comes to playing with data, especially for creating new variables. As the name suggests, apply returns some value after passing each row/column of a dataframe through some function. The function can be either built-in or user-defined.
For instance, here it can be used to find the number of missing values in each row and column.
In [6]:
#Create a new function:
def num_missing(x):
return sum(x.isnull())
#Applying per column:
print "Missing values per column:"
print data.apply(num_missing, axis=0) #axis=0 defines that function is to be applied on each column
#Applying per row:
print "\nMissing values per row:"
print data.apply(num_missing, axis=1).head() #axis=1 defines that function is to be applied on each column
Thus we get the desired result. Note: the head() function is used for the second output because it contains many rows.
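As an aside, the same counts can be obtained without a custom function, because isnull() returns a boolean frame that sums directly. A quick sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Gender": ["Male", np.nan, "Female"],
                   "Married": [np.nan, "Yes", np.nan]})

print(df.isnull().sum())         # missing values per column
print(df.isnull().sum(axis=1))   # missing values per row
```

Both approaches give identical counts; apply is simply more general when the per-column logic gets fancier.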
In [7]:
#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])
Out[7]:
This returns both the mode and its count. Remember that the mode can be an array, as there can be multiple values with the highest frequency. We will always take the first one by default using:
In [8]:
mode(data['Gender']).mode[0]
Out[8]:
Now we can fill the missing values and check using technique #3.
In [9]:
#Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)
#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))
This confirms that the missing values are imputed. Note: this is one of the most primitive forms of imputation. More sophisticated techniques include modeling the missing values using other variables, or using grouped averages (mean/mode/median). We'll take up the latter next.
Pandas can be used to create MS Excel style pivot tables. For instance, in this case a key column is "LoanAmount", which has missing values. We can impute them using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:
In [10]:
#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)
print(impute_grps)
If you notice the output of step #4, it has a peculiar property: each index is made of a combination of 3 values! This is called MultiIndexing, and it can be used to perform some operations really fast if used wisely. Continuing the example from #4, we have the values for each group, but they are yet to be imputed.
This can be done using the various techniques learned so far.
In [11]:
#iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]
#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))
Note:
1. A multi-index requires a tuple for defining a group of indices in the loc statement; note that a tuple must be used here, not a list.
2. The .values[0] suffix is required because, by default, a series element is returned whose index doesn't match that of the dataframe, so direct assignment gives an error.
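The row-by-row loop works, but the same grouped imputation can be done in one vectorized shot with groupby and transform, which skips the pivot table entirely. A minimal sketch on a toy frame (hypothetical values, only 'Gender' used as the group for brevity):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Female"],
                   "LoanAmount": [100.0, np.nan, 60.0, 80.0]})

# transform("mean") returns a series aligned with df, holding each
# row's group mean; fillna then fills only the missing entries:
df["LoanAmount"] = df["LoanAmount"].fillna(
    df.groupby("Gender")["LoanAmount"].transform("mean"))
print(df)
```

On a large dataset this is much faster than iterrows, though the explicit loop above is easier to follow when learning.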
In [12]:
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)
Out[12]:
These are absolute numbers, but percentages can be more intuitive for drawing quick insights. We can do this using the apply function:
In [13]:
def percConvert(ser):
    return ser/float(ser.iloc[-1])
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)
Out[13]:
Now it is clearly evident that people with a credit history have much higher chances of getting a loan: 80% of people with a credit history got a loan, as compared to only 9% of those without one.
But that's not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for applicants with a credit history and N otherwise? Surprisingly, we'll be right 82+378=460 times out of 614, which is a whopping 75%! I won't blame you if you're wondering why the hell we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this is a challenging task.
Note: the 75% is on the train set. The test set will be slightly different, but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on a Kaggle leaderboard.
More: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html
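In newer versions of Pandas, crosstab can also do the row-percentage conversion for you via the normalize argument, without a custom apply. A sketch on toy data (hypothetical values, not the loan dataset counts):

```python
import pandas as pd

df = pd.DataFrame({"Credit_History": [1, 1, 0, 1, 0],
                   "Loan_Status": ["Y", "Y", "N", "N", "Y"]})

# normalize='index' divides each row of the table by its row total:
tab = pd.crosstab(df["Credit_History"], df["Loan_Status"],
                  normalize="index")
print(tab)
```

This gives the same row-wise proportions as the percConvert approach above, minus the margins row.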
In [14]:
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates
Out[14]:
Now we can merge this information with the original dataframe as:
In [15]:
data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)
Out[15]:
The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the rows.
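Another way to validate a merge is the indicator argument, which tags each row with where it matched. A small sketch with made-up frames (not the loan data):

```python
import pandas as pd

left = pd.DataFrame({"Property_Area": ["Rural", "Urban", "Semiurban"]})
# Deliberately omit 'Semiurban' to show an unmatched row:
rates = pd.DataFrame({"rates": [1000, 12000]}, index=["Rural", "Urban"])

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = left.merge(rates, how="left", left_on="Property_Area",
                    right_index=True, indicator=True)
print(merged["_merge"].value_counts())
```

Any 'left_only' or 'right_only' counts immediately flag keys that failed to match, which is harder to spot from a pivot table alone.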
In [16]:
data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)
Out[16]:
Note: The Pandas "sort" function is now deprecated and we should use "sort_values" instead.
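When you only need the top few rows by a column, nlargest is a convenient shortcut that avoids sorting the whole frame. A quick sketch on toy incomes (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"ApplicantIncome": [5849, 4583, 81000, 3000, 6000]})

# Top 3 rows by ApplicantIncome, highest first:
top3 = df.nlargest(3, "ApplicantIncome")
print(top3)
```

There is a matching nsmallest for the other end of the distribution.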
In [17]:
import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome",by="Loan_Status")
Out[17]:
In [18]:
data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)
Out[18]:
This shows that income is not a big deciding factor on its own as there is no appreciable difference between the people who received and were denied the loan.
More: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html#pandas.DataFrame.hist | http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html#pandas.DataFrame.boxplot
Sometimes numerical values make more sense if clustered together. For example, suppose we're trying to model traffic (#cars on road) against time of day (in minutes). The exact minute of an hour might not be that relevant for predicting traffic compared to the actual period of the day, like "Morning", "Afternoon", "Evening", "Night", "Late Night". Modeling traffic this way will be more intuitive and will avoid overfitting to minute details which practically make little sense.
Here I've defined a simple function which can be re-used for binning any variable fairly easily.
In [20]:
#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()
    #create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]
    #if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points)+1)
    #Binning using the cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin
#Binning LoanAmount:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(pd.value_counts(data["LoanAmount_Bin"], sort=False))
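If you'd rather not hand-pick the cut points, pd.qcut bins on quantiles instead of fixed values, so each bin receives roughly equal counts. A sketch on toy amounts (hypothetical values):

```python
import pandas as pd

amounts = pd.Series([50, 80, 100, 120, 150, 200, 250, 600])

# q=4 splits at the quartiles, giving four equally populated bins:
bins = pd.qcut(amounts, q=4, labels=["low", "medium", "high", "very high"])
print(bins.value_counts(sort=False))
```

Quantile binning is handy for skewed variables like income, where fixed-width bins would leave some categories nearly empty.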
Very often we come across a case where we have to modify the categories of a nominal variable. This can be due to various reasons:
1. Some algorithms (like logistic regression) require all inputs to be numeric, so nominal variables are mostly coded as 0 to (n-1).
2. Sometimes the same category might be represented in 2 ways. For example, temperature might be recorded as "High", "Medium", "Low", "H", "low". Here, both "High" and "H" refer to the same category; similarly, "Low" and "low" differ only in case.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.
Here I've defined a generic function which takes a dictionary as input and codes the values using the 'replace' function in Pandas:
In [21]:
#Define a generic function using the Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded
#Coding LoanStatus as Y=1, N=0:
print('Before Coding:')
print(pd.value_counts(data["Loan_Status"]))
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0,'Y':1})
print('\nAfter Coding:')
print(pd.value_counts(data["Loan_Status_Coded"]))
Matching counts before and after confirm that the coding worked.
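For a simple one-to-one recoding like this, Series.map with a dictionary is a common alternative to looping over replace. A sketch on a toy series:

```python
import pandas as pd

status = pd.Series(["Y", "N", "Y", "Y", "N"])

# map applies the whole dictionary in one vectorized pass; any value
# missing from the dict becomes NaN, which makes stray typos easy to spot:
coded = status.map({"Y": 1, "N": 0})
print(coded.value_counts())
```

The NaN-on-unknown behavior is a feature here: with replace, an unexpected category would silently pass through unchanged.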
In this article, we covered various functions of Pandas which can make our life easy while performing data exploration and feature engineering. We also defined some generic functions which can be reused for achieving similar objectives on different datasets.
Also See: If you have any doubts pertaining to Pandas or Python in general, feel free to discuss with us.
Did you find the article useful? Do you use some better (easier/faster) techniques for performing the tasks discussed above? Do you think there are better alternatives to Pandas in Python? We'll be glad if you share your thoughts as comments below.