Now that we've created some categorical data and other engineered features, we would like to use them as inputs for our machine learning algorithm. However, we need to tell the computer that categorical data isn't the same as ordinary numerical data. For example, I could have the following two types of categorical data:

* Ranked (ordinal) categories, where the order of the values means something (like the Rank column in our sample data).
* Unranked (nominal) categories, where there is no natural order (like a U.S. state).
We want to treat both of these slightly differently. We've got a sample dataset with both types of categorical data in it to work with. Our goal will be to predict the Output value.
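As a quick illustration (the values here are made up, not from the sample dataset), pandas can represent both kinds of categorical data directly:

import pandas as pd

# Ranked (ordinal) categories: the order of the values is meaningful
ranks = pd.Categorical(['Low', 'High', 'Medium', 'Low'],
                       categories=['Low', 'Medium', 'High'], ordered=True)

# Unranked (nominal) categories: no meaningful order between the values
states = pd.Categorical(['TX', 'OH', 'TX', 'CA'])

print(ranks)
print(states)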
In [1]:
import pandas as pd
import numpy as np
sampledata = pd.read_csv('Class03_supplemental_data.csv')
print(sampledata.dtypes)
sampledata.head()
Out[1]:
We can turn the Date column into a real datetime object and compute the number of days since the first date, which gives us a more reasonable set of values to work with.
In [2]:
sampledata["Date2"] = pd.to_datetime(sampledata["Date"])
firstdate = sampledata['Date2'][0]
sampledata['DaysSinceStart'] = sampledata['Date2'].apply(lambda date: (date - firstdate).total_seconds()/86400.0) # total_seconds() gives the full elapsed time, divided by the number of seconds in a day
sampledata.dtypes
Out[2]:
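As an aside, the same DaysSinceStart column can be computed without apply by operating on the whole timedelta Series at once; a sketch of the equivalent vectorized form:

# Vectorized equivalent of the apply call above
sampledata['DaysSinceStart'] = (sampledata['Date2'] - firstdate).dt.total_seconds() / 86400.0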
Next, we convert the Rank and State columns to the pandas category dtype. Pandas keeps track of the distinct categories and assigns each one an integer code, which we can inspect with .cat.categories and .cat.codes.

In [3]:
sampledata['CatRank'] = sampledata['Rank'].astype('category')
print(sampledata["CatRank"].cat.categories)
sampledata["CatRank"][1:10].cat.codes
Out[3]:
In [4]:
sampledata['CatState'] = sampledata['State'].astype('category')
print(sampledata["CatState"].cat.categories)
sampledata["CatState"][1:10].cat.codes
Out[4]:
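If you want to see which integer code corresponds to which state, one way to build that mapping (a quick sketch) is:

# Map each integer code back to its category label
code_to_state = dict(enumerate(sampledata['CatState'].cat.categories))
print(code_to_state)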
To use these categories as model inputs, we store the integer codes as new numeric columns.

In [5]:
sampledata['RankCode'] = sampledata['CatRank'].cat.codes
sampledata['StateCode'] = sampledata['CatState'].cat.codes
sampledata.columns
Out[5]:
In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
train1, test1 = train_test_split(sampledata, test_size=0.2, random_state=23)
# Step 1: Create linear regression object
regr1 = LinearRegression()
# Step 2: Train the model using the training sets
inputcolumns = ['DaysSinceStart','RankCode','StateCode']
features = train1[inputcolumns].values
labels = train1['Output'].values
regr1.fit(features,labels)
Out[6]:
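If you're curious what the model learned, the fitted coefficients and intercept are available after fit; a quick sketch:

# One coefficient per input column, plus an intercept
print(dict(zip(inputcolumns, regr1.coef_)))
print(regr1.intercept_)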
In [7]:
# Step 5: Get the predictions
testinputs = test1[inputcolumns].values
predictions = regr1.predict(testinputs)
actuals = test1['Output'].values
# Step 6: Plot the results
#
# Note the change here in how we plot the test inputs. We can only plot one variable, so we choose the first.
# Also, it no longer makes sense to plot the fit points as lines. They have more than one input, so we only visualize them as points.
#
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(testinputs[:,0], actuals, color='black', label='Actual')
plt.scatter(testinputs[:,0], predictions, color='blue', label='Prediction')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format( np.sqrt(np.mean((predictions - actuals) ** 2))))
So we see that this model didn't do a very good job to start with. However, that's not surprising: it treated the state codes as ranked categorical values when the states obviously aren't ranked.
What we want instead is called a dummy variable. It tells the machine learning algorithm to check whether an entry belongs to a particular state or not. Here's basically how it works. Suppose we have two categories: red and blue. Our categorical column might look like this:
Row | Color |
---|---|
0 | red |
1 | red |
2 | blue |
3 | red |
What we want are two new columns that identify whether the row belongs in one of the categories. We'll use 1 when it belongs and 0 when it doesn't. This is what we get:
Row | IsRed | IsBlue |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 1 | 0 |
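As a quick check, pd.get_dummies reproduces this table from a toy Color column (the column here is made up for illustration; pandas names the dummy columns after the category values, 'blue' and 'red', rather than IsBlue and IsRed):

colors = pd.Series(['red', 'red', 'blue', 'red'], name='Color')
print(pd.get_dummies(colors, dtype=int))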
We now use these new dummy-variable columns as the inputs: they are binary, with a 1 only where the original row belonged to that category. Here's what it looks like in pandas.
In [8]:
dummydf = pd.get_dummies(sampledata['CatState'],prefix='S')
dummydf.head()
Out[8]:
We now want to join this back with the original set of features so that we can use it instead of the ranked column of data. Here's one way to do that.
In [9]:
sampledata2 = sampledata.join(dummydf)
sampledata2.head()
Out[9]:
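As an aside, pd.concat along the columns is another way to build the same combined DataFrame (a sketch):

# Equivalent to the join above: place the two frames side by side on their shared index
sampledata2 = pd.concat([sampledata, dummydf], axis=1)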
We now want to select all 50 of the dummy-variable columns. Since each of them starts with the prefix 'S_', a list comprehension makes this easy.
In [10]:
inputcolumns = ['DaysSinceStart','RankCode'] + [col for col in sampledata2.columns if col.startswith('S_')]
train2, test2 = train_test_split(sampledata2, test_size=0.2, random_state=23)
# Step 1: Create linear regression object
regr2 = LinearRegression()
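# Step 2: Train the model using the training sets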
features = train2[inputcolumns].values
labels = train2['Output'].values
regr2.fit(features,labels)
Out[10]:
In [11]:
# Step 5: Get the predictions
testinputs = test2[inputcolumns].values
predictions = regr2.predict(testinputs)
actuals = test2['Output'].values
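# Step 6: Plot the results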
plt.scatter(testinputs[:,0], actuals, color='black', label='Actual')
plt.scatter(testinputs[:,0], predictions, color='blue', label='Prediction')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format( np.sqrt(np.mean((predictions - actuals) ** 2))))
So you can see we did significantly better by converting the unranked categorical column into dummy variables. Take a look at your own datasets to see whether you should be doing the same thing.