Several types of variables are used in regression models. Here we focus on categorical variables and how to encode them, particularly for linear regression.
Categorical variables indicate membership in a category; their distinct values are sometimes referred to as levels, and the 0/1 indicator columns built from them are often called dummy variables in the statistics literature.
The actual numbers chosen for a categorical variable do not matter. For instance, we could encode the four quarters as the integers 1, 2, 3, 4, or as 0, 1, 2, 3, or as four separate 0/1 indicator (dummy) columns.
In a linear regression model, a dummy variable encoded as zero drops out of the fitted equation, while a value of one simply shifts the intercept for that group.
Other types of models, such as decision trees (covered later), can use all four dummy variables.
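As a sketch (the quarter labels and numbers here are invented for illustration), the three equivalent encodings above look like this in pandas:

```python
import pandas as pd

# Hypothetical data: the same four quarters under three different encodings.
quarters = pd.Series(["Q1", "Q2", "Q3", "Q4"], name="quarter")

as_int_1 = [1, 2, 3, 4]   # one arbitrary integer encoding
as_int_0 = [0, 1, 2, 3]   # another, equally valid choice

# Four 0/1 indicator (dummy) columns, one per quarter.
dummies = pd.get_dummies(quarters, prefix="q")
print(dummies)
```

All three carry the same information; which one a model can use depends on the model.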
We have to encode categorical variables numerically for use in scikit-learn, as the package only takes numeric categories. R takes "levels" and encodes them internally, so you can pass strings as categories to models in R, but not in scikit-learn.
Your data cleaning and preparation step may include taking strings in a column and transforming them into a numeric category or level.
In models that are linear in the unknown parameters, such as linear regression, we cannot include all of the dummy variables. If we have four dummy variables for quarter, we must include only three of them: encoding zeros for quarters 1–3 already identifies quarter 4, so the four dummy columns sum to the constant column used for the intercept. That perfect collinearity is why, with an intercept in the model, we cannot use all four dummy variables.
The omitted dummy variable defines the base category, against which all the others are compared.
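The collinearity can be checked numerically. A minimal sketch with made-up observations, using NumPy's matrix rank:

```python
import numpy as np

# Eight observations, four dummy columns for quarter: each row has exactly one 1.
D = np.eye(4)[[0, 1, 2, 3, 0, 1, 2, 3]]
intercept = np.ones((8, 1))

X_all = np.hstack([intercept, D])          # intercept + all 4 dummies
X_drop = np.hstack([intercept, D[:, :3]])  # intercept + only 3 dummies

# The four dummies sum to the intercept column, so X_all is rank deficient.
print(np.linalg.matrix_rank(X_all))   # 4, not 5
print(np.linalg.matrix_rank(X_drop))  # 4: full column rank
```

A rank-deficient design matrix means the least-squares coefficients are not uniquely determined; dropping one dummy restores full rank.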
For instance,

$\ln(\text{wage}) = \alpha + \beta\,\text{college} + \text{error}$
Here we set up a regression problem with college encoded as one or zero. If college is one, then the wage goes up (assuming $\beta > 0$), so we can interpret the college coefficient as measuring a wage premium (or discount) for attending college. The error term contains everything else that affects wages, such as ability and experience. This model is not predictive but explanatory.
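A minimal simulation of this model (all numbers invented) estimated by ordinary least squares via NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
college = rng.integers(0, 2, n)            # 0/1 dummy for college attendance
alpha, beta = 2.5, 0.4                     # invented "true" parameters
log_wage = alpha + beta * college + rng.normal(0, 0.3, n)  # the error term

# Design matrix: a constant column for the intercept plus the dummy.
X = np.column_stack([np.ones(n), college])
coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
print(coef)  # estimates of (alpha, beta), close to the invented values
```

Because the model is in $\ln(\text{wage})$, $e^{\beta} - 1$ is approximately the proportional wage premium for attending college.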
Questions

1. Why do we model $\ln(\text{wage})$ rather than wage itself?
2. We encode college as zero or one. Why not one and two?

Including all dummy variables in a regression model introduces multicollinearity and can cause all kinds of problems for predictions. It may be acceptable for an explanatory model if appropriate corrections are made.
In [1]:
import numpy as np
import pandas as pd
from random import choice
In [2]:
df = pd.DataFrame(np.random.randn(25, 3), columns=['a', 'b', 'c'])
df['e'] = [choice(('Chicago', 'Boston', 'New York')) for i in range(df.shape[0])]
df.head(6)
Out[2]:
scikit-learn has the classes OneHotEncoder and LabelEncoder in sklearn.preprocessing. LabelEncoder turns strings (or other labels) into integer codes, and OneHotEncoder expands a categorical column into 0/1 indicator columns (older versions required integer input; recent versions also accept strings).
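A sketch of both encoders applied to city names like those in the DataFrame above (assuming scikit-learn is installed; the sample values are invented):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = np.array(["Chicago", "Boston", "New York", "Chicago"])

# LabelEncoder: strings -> integer codes, assigned in sorted order.
le = LabelEncoder()
codes = le.fit_transform(cities)
print(codes)        # [1 0 2 1]
print(le.classes_)  # ['Boston' 'Chicago' 'New York']

# OneHotEncoder: one 0/1 column per category; expects a 2-D array
# and returns a sparse matrix by default.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(cities.reshape(-1, 1)).toarray()
print(onehot)
```

Each row of the one-hot output has exactly one 1, in the column for that row's category.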
Here is another way with Pandas. Use it as a preprocessing step before fitting scikit-learn models.
In [3]:
df1 = pd.get_dummies(df, prefix=["e"])
df1.head(6)
Out[3]:
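To avoid the collinearity problem discussed above, get_dummies can drop the base category directly via drop_first (a sketch with invented city values):

```python
import pandas as pd

df = pd.DataFrame({"e": ["Chicago", "Boston", "New York", "Boston"]})

# drop_first=True omits the first (alphabetically sorted) category -- here
# Boston -- which becomes the base category absorbed by the intercept.
df1 = pd.get_dummies(df, prefix=["e"], drop_first=True)
print(df1.columns.tolist())  # ['e_Chicago', 'e_New York']
```

A row of all zeros in the remaining dummy columns then means the observation is in the base category.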