This lab demonstrates how to apply one hot encoding to categorical variables with pandas. At the end of the lab, you should be able to use pandas to one hot encode a categorical column.
Let's start by importing pandas in the usual way.
In [ ]:
import pandas as pd
Next, let's load the data. Write the path to your iris.csv file in the cell below:
In [ ]:
path_to_csv = "data/iris.csv"
Execute the cell below to load the data into a pandas data frame and index that data frame by the sample_number column:
In [ ]:
df = pd.read_csv(path_to_csv, index_col=['sample_number'])
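If you don't have the file to hand, the same loading pattern can be tried on a few lines of inline text. This is only a sketch: the column name sepal_length and the values below are made up for illustration and are not taken from iris.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for iris.csv (made-up rows).
csv_text = """sample_number,sepal_length,species
1,5.1,setosa
2,7.0,versicolor
3,6.3,virginica
"""

# index_col tells read_csv to use sample_number as the row index
# instead of keeping it as an ordinary data column.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col="sample_number")

print(toy_df.index.name)
print(toy_df.columns.tolist())
```

Because sample_number becomes the index, it no longer appears among the data columns.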
Take a quick peek at the data:
In [ ]:
df.head()
In [ ]:
df.dtypes
As you can see, we have four columns of numerical data (float64), corresponding to the physical measurements, and one column of text data (object), corresponding to the species labels. Let's take a closer look at the unique values in the species column:
In [ ]:
df['species'].unique()
If we wanted to use these labels as input to a machine learning algorithm, we would first need to convert them from text into a numerical format that the algorithm can process. One way to do this would be to assign an integer to each species, e.g. setosa = 0, versicolor = 1, virginica = 2. However, this imposes an artificial ordering: setosa is not "less than" versicolor or virginica in any meaningful sense, yet an algorithm could treat the integers as if it were.
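To see why integer codes can mislead, here is a minimal sketch using a small made-up series in place of the species column (the values and the name label_to_code are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Toy stand-in for the species column (hypothetical values, not the lab's CSV).
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Integer encoding: each label is mapped to an arbitrary integer.
label_to_code = {"setosa": 0, "versicolor": 1, "virginica": 2}
codes = species.map(label_to_code)

# The integers carry an ordering that the labels never had.
implied_order = codes.iloc[0] < codes.iloc[1] < codes.iloc[2]

# They also support arithmetic: the average of the setosa and virginica
# codes equals the versicolor code, which has no botanical meaning.
mean_code = codes.iloc[[0, 2]].mean()

print(codes.tolist())
print(implied_order)
print(mean_code)
```

Any model that compares or averages its inputs would pick up these accidental relationships.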
A better alternative is one hot encoding, which creates one new binary feature per label, so that no label is ranked above another and an algorithm treats them all equally. pandas supports one hot encoding via the get_dummies function:
In [ ]:
encoded_features = pd.get_dummies(df['species'])
encoded_features.head() # Take a quick look at the result
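As you can see, pandas has encoded each label as a binary indicator variable, where a "1" represents the presence of the label and a "0" indicates the absence of the label. (In pandas 2.0 and later, the indicators are displayed as True and False by default; pass dtype=int to get_dummies to obtain 1 and 0.)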
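The defining property of a one hot encoding is that each row has exactly one indicator switched on. A minimal sketch that checks this on a small made-up series (the values are hypothetical; dtype=int is passed so the result holds 1 and 0 regardless of pandas version):

```python
import pandas as pd

# Toy stand-in for the species column (hypothetical values).
labels = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# One indicator column per distinct label, stored as integers.
indicators = pd.get_dummies(labels, dtype=int)

# Each row should contain a single 1, so every row sum should be 1.
row_sums = indicators.sum(axis="columns")

print(indicators)
print(row_sums.tolist())
```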
We can use the concat function to glue the new features to our existing data frame:
In [ ]:
df = pd.concat([df, encoded_features], axis='columns')
df.head()
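One detail worth knowing is that concat with axis='columns' lines rows up by their index labels rather than by their position, which is why the encoded features end up next to the right samples. A minimal sketch with two made-up frames (the names left and right and their values are hypothetical):

```python
import pandas as pd

# Two toy frames whose rows are stored in different orders.
left = pd.DataFrame({"a": [1, 2]}, index=[1, 2])
right = pd.DataFrame({"b": [20, 10]}, index=[2, 1])

# Rows are matched on index label, not on position.
combined = pd.concat([left, right], axis="columns")

print(combined.loc[1].tolist())
print(combined.loc[2].tolist())
```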
Finally, we can use the drop method to remove the original species column from the data frame, leaving the numerical measurements alongside the new features:
In [ ]:
df = df.drop('species', axis='columns')
df.head()
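As an aside, get_dummies can also be given a whole data frame together with a columns argument, in which case it encodes the named columns, drops the originals and returns the combined result in a single call. The sketch below uses a made-up frame (the sepal_length column and its values are hypothetical); note that the new columns are prefixed with the original column name:

```python
import pandas as pd

# Toy frame shaped like the lab's data (hypothetical column and values).
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "species": ["setosa", "versicolor", "virginica"],
    },
    index=pd.Index([1, 2, 3], name="sample_number"),
)

# Encode species, drop the original column and keep everything else.
encoded = pd.get_dummies(toy, columns=["species"], dtype=int)

print(encoded.columns.tolist())
```

Whether to use this shortcut or the explicit concat and drop steps is a matter of taste; the step-by-step version makes each transformation easier to inspect.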