Now that we've created some categorical data and other engineered features, we'd like to use them as inputs for our machine learning algorithm. However, we need to tell the computer that categorical data isn't the same as other numerical data. For example, we could have the following two types of categorical data:

- Ranked (ordinal) categories, where the values have a natural order (such as a rank of 1, 2, or 3).
- Unordered (nominal) categories, where the values have no inherent order (such as US state names).

We want to treat these two types differently. We've got a sample dataset containing both types of categorical data to work with. Our goal will be to predict the Output value.
In [1]:
# Read the supplemental data, keeping text columns as strings rather than factors
sampledata <- read.csv('Class03_supplemental_data.csv', stringsAsFactors=FALSE)
str(sampledata)
head(sampledata)
We can turn the Date column into a real datetime object and compute the number of days since the first date, which gives us a more reasonable set of values to work with.
In [2]:
# Convert the Date strings into datetime objects
sampledata$Date2 <- as.POSIXct(sampledata$Date)
# Measure each date as the number of days since the first date
firstdate <- head(sampledata$Date2, 1)
sampledata$DaysSinceStart <- as.numeric(difftime(sampledata$Date2, firstdate, units = "days"))
str(sampledata)
First we convert the Rank column into an ordered factor. Printing the factor and its numeric codes shows that R preserves the ranking.

In [3]:
# Rank is ordinal, so create an ordered factor
sampledata$CatRank <- factor(sampledata$Rank, ordered=TRUE)
print(sampledata$CatRank[1:10])
print(as.numeric(sampledata$CatRank[1:10]))
The State column, on the other hand, becomes a plain (unordered) factor, since the states have no natural ranking.

In [4]:
# State is nominal, so create a plain (unordered) factor
sampledata$CatState <- factor(sampledata$State)
print(sampledata$CatState[1:10])
print(as.numeric(sampledata$CatState[1:10]))
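One quick way to see the difference between the two factor types: comparison operators are meaningful for ordered factors but not for plain ones. A minimal sketch (the actual values returned depend on the data):

In [ ]:
# Comparing ordered factor values works...
sampledata$CatRank[1] < sampledata$CatRank[2]
# ...but comparing unordered factor values returns NA with a warning
# ('<' not meaningful for factors)
sampledata$CatState[1] < sampledata$CatState[2]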
To feed these factors into a model, we store their integer codes as ordinary numeric columns.

In [5]:
# Store the integer codes for each factor as numeric columns
sampledata$RankCode <- as.numeric(sampledata$CatRank)
sampledata$StateCode <- as.numeric(sampledata$CatState)
names(sampledata)
In [6]:
# Step 1: Split the data into training (80%) and test (20%) sets
set.seed(23)
trainIndex <- sample(seq(nrow(sampledata)), nrow(sampledata)*0.8)
train1 <- sampledata[trainIndex, ]
test1 <- sampledata[-trainIndex, ]
# Step 2: Build the regression formula from the input columns
inputcolumns <- c('DaysSinceStart', 'RankCode', 'StateCode')
lmformula <- as.formula(paste('Output ~', paste(inputcolumns, collapse='+')))
# lmformula is now: Output ~ DaysSinceStart+RankCode+StateCode
# Step 3: Fit the model
regr1 <- lm(lmformula, train1)
In [7]:
# Step 4: Get the predictions
predictions <- predict(regr1, test1)
actuals <- test1$Output
# Step 5: Plot the results
plot(test1$DaysSinceStart, actuals, pch=15, col="blue", xlab="DaysSinceStart", ylab="Output")
points(test1$DaysSinceStart, predictions, pch=15, col="red")
# Add a legend
legend(0, 2.4,                      # places the legend at the given coordinates
       c("Actuals", "Predictions"), # puts text in the legend
       pch=c(15, 15),               # sets the symbol for each series
       col=c("blue", "red"))        # gives each series the correct color
# Step 6: Get the RMS error
print(paste("RMS Error:", sqrt(mean((predictions - actuals)^2))))
So we see that this first attempt didn't do a very good job. That's not surprising, though: it treated the states as ranked categorical values when they obviously aren't.
What we want instead are dummy variables. A dummy variable tells the machine learning algorithm to consider only whether an entry belongs to a given category (here, a given state) or not. Here's basically how it works. Suppose we have two categories, red and blue. Our categorical column might look like this:
Row | Color |
---|---|
0 | red |
1 | red |
2 | blue |
3 | red |
What we want are two new columns that identify whether each row belongs to one of the categories: 1 when it belongs and 0 when it doesn't. This is what we get:
Row | IsRed | IsBlue |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 1 | 0 |
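As a quick illustration, here's one way to build these columns by hand in base R. This is a minimal sketch using the toy color column above (the color vector is made up for illustration); for the real dataset we'll use a library instead:

In [ ]:
# Toy example: build dummy columns by hand from a hypothetical color vector
color <- c("red", "red", "blue", "red")
dummies <- data.frame(IsRed  = as.numeric(color == "red"),
                      IsBlue = as.numeric(color == "blue"))
dummies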
We now use these new dummy variable columns as the inputs: they are binary and only have a 1 value where the original row matched up with the category column. We will use the mlr library, which has this functionality built in.
In [8]:
library(mlr)
# createDummyFeatures() builds one 0/1 column per category level
dummydf <- createDummyFeatures(sampledata$CatState)
# Prefix the new columns with 'S_' so they are easy to select later
colnames(dummydf) <- paste("S", colnames(dummydf), sep = "_")
head(dummydf)
We now want to join this back to the original set of features so that we can use the dummy columns in place of the numerically coded state column. Here's one way to do that.
In [9]:
# Bind the dummy columns onto the original data frame
sampledata2 <- cbind(sampledata, dummydf)
head(sampledata2)
We now want to select all 50 dummy-variable columns. Since we used the 'S_' prefix for each of them, there is an easy way to do this in R using grep().
In [10]:
# Step 1: Split the data, using the 'S_' dummy columns in place of StateCode
inputcolumns2 <- c('DaysSinceStart', 'RankCode', grep("S_", colnames(sampledata2), value=TRUE))
trainIndex2 <- sample(seq(nrow(sampledata2)), nrow(sampledata2)*0.8)
train2 <- sampledata2[trainIndex2, ]
test2 <- sampledata2[-trainIndex2, ]
# Step 2: Build the regression formula
lmformula2 <- as.formula(paste('Output ~', paste(inputcolumns2, collapse='+')))
# Step 3: Fit the model
regr2 <- lm(lmformula2, train2)
In [11]:
# Step 4: Get the predictions
predictions <- predict(regr2, test2)
actuals <- test2$Output
# Step 5: Plot the results
plot(test2$DaysSinceStart, actuals, pch=15, col="blue", xlab="DaysSinceStart", ylab="Output")
points(test2$DaysSinceStart, predictions, pch=15, col="red")
# Add a legend
legend(0, 2.4,                      # places the legend at the given coordinates
       c("Actuals", "Predictions"), # puts text in the legend
       pch=c(15, 15),               # sets the symbol for each series
       col=c("blue", "red"))        # gives each series the correct color
# Step 6: Get the RMS error
print(paste("RMS Error:", sqrt(mean((predictions - actuals)^2))))
So you can see that we did significantly better by converting the categorical column into dummy variables. Take a look at your own datasets to see whether you should be doing the same.
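One closing note: R's lm() can also do this encoding automatically. If we put the factor column itself in the formula, lm() expands it into dummy columns behind the scenes (dropping one reference level, which is the standard full-rank encoding). A minimal sketch of the same idea:

In [ ]:
# lm() automatically dummy-encodes factor columns via model.matrix(),
# so this fits essentially the same model without building dummydf by hand
regr3 <- lm(Output ~ DaysSinceStart + RankCode + CatState, train2)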
In [ ]: