Continuous, Categorical, and Ordinal Variables

Three types of variables are commonly used, particularly in linear regression.

1. Continuous - ordered values that can be subdivided.
2. Categorical - a limited, fixed number of values.
3. Ordinal - a limited, fixed number of values for which order is important.

The distinct values of a categorical variable are sometimes referred to as levels. Categorical variables are often called dummy variables in the statistics literature and indicate membership in a category.

The actual numbers for categorical variables do not matter. For instance, we can encode quarters as

• 1 Quarter (1)
• 2 Quarter (2)
• 3 Quarter (3)
• 4 Quarter (4)

or

• 1 Quarter (0)
• 2 Quarter (1)
• 3 Quarter (2)
• 4 Quarter (3)

or

• 1 Quarter (1 or 0)
• 2 Quarter (1 or 0)
• 3 Quarter (1 or 0)
• 4 Quarter (1 or 0)

In a linear regression model, a dummy variable encoded as zero drops out of the equation and has no effect. A dummy encoded as one only shifts the intercept (by its coefficient).

Other types of models can include all four dummy variables without trouble. Decision trees (covered later) are one example.

Differences in Languages (Packages)

We have to encode variables numerically for use in scikit-learn because the package only accepts numeric categories. R takes "levels" and encodes them internally, so you can pass strings as categories to models in R, but not in scikit-learn.

Your data cleaning and preparation step may include taking strings in a column and transforming them into a numeric category or level.
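As a minimal sketch of that preparation step (the example strings are made up), pandas can map a string column to integer codes via the category dtype:

```python
import pandas as pd

# Hypothetical example: a string column that scikit-learn estimators
# cannot consume directly.
s = pd.Series(['Q1', 'Q2', 'Q3', 'Q4', 'Q2'])

# astype('category') stores the distinct strings as levels (much like
# R factors); .cat.codes replaces each string with its integer level code.
codes = s.astype('category').cat.codes
print(codes.tolist())  # [0, 1, 2, 3, 1]
```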

Categorical Variables and Linear Models

In models such as regression, which are linear in the unknown parameters, we can't include all of the dummy variables. If we have four dummy variables for quarter, we must include only three of them.

• 1 Quarter = 1 or 0
• 2 Quarter = 1 or 0
• 3 Quarter = 1 or 0
• 4 Quarter = 1 or 0

Encoding zero for quarters 1–3 conveys the same information as encoding a one for quarter 4 and zero for the rest. Remember, linear regression has an intercept; the four dummy columns always sum to the constant column, which is why we can't use all four dummy variables.

The dummy variable that is omitted is the base category against which all others are compared.
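A minimal sketch of choosing a base category, using pandas' get_dummies with drop_first=True (hypothetical quarter data):

```python
import pandas as pd

# Hypothetical quarter data; drop_first=True omits the first level,
# which becomes the base category absorbed by the intercept.
quarters = pd.Series(['Q1', 'Q2', 'Q3', 'Q4', 'Q1'], name='quarter')
dummies = pd.get_dummies(quarters, drop_first=True)
print(list(dummies.columns))  # ['Q2', 'Q3', 'Q4'] -- Q1 is the base category
```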

For instance

$\ln(\mathrm{wage}) = \alpha + \beta\,\mathrm{college} + \mathrm{error}$

Here we set up a regression problem with college encoded as one or zero. If college is one, then the wage goes up (assuming $\beta > 0$). So we can interpret the coefficient on the college variable as measuring a wage premium/discount for attending college.

The error term contains

1. Every variable not included in the regression model.
2. Randomness.

This model is not predictive but is explanatory.
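To make the interpretation concrete, here is a simulation sketch (the sample size, the 0.4 premium, and the noise level are all made up) that recovers the college premium by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: log-wage with a true college premium of 0.4.
n = 1000
college = rng.integers(0, 2, size=n)                # dummy: 1 = attended college
log_wage = 2.5 + 0.4 * college + rng.normal(0, 0.1, size=n)

# Fit ln(wage) = alpha + beta * college by ordinary least squares.
X = np.column_stack([np.ones(n), college])
coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)  # estimates should land close to 2.5 and 0.4
```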

Questions

1. Why am I taking the natural log of wage?
2. What happens if $\beta < 0$?
3. We encoded college as zero or one. Why not one and two?

Including all dummy variables in a regression model introduces perfect multicollinearity and can cause all kinds of problems for predictions. It may be acceptable for an explanatory model if appropriate corrections are made.
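The multicollinearity is easy to see numerically: with an intercept, the four dummy columns sum to the constant column, so the design matrix loses a rank (a small sketch with made-up quarterly data):

```python
import numpy as np

# Three years of hypothetical quarterly observations.
quarters = np.tile(np.arange(4), 3)                      # 0,1,2,3 repeated
dummies = np.eye(4)[quarters]                            # all four dummy columns
X_full = np.column_stack([np.ones(12), dummies])         # intercept + 4 dummies
X_drop = np.column_stack([np.ones(12), dummies[:, 1:]])  # drop the base category

# The dummies sum to the intercept column, so X_full is rank-deficient.
print(np.linalg.matrix_rank(X_full))  # 4, although X_full has 5 columns
print(np.linalg.matrix_rank(X_drop))  # 4, full column rank
```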

Transforming Variables in Pandas



In [1]:

import numpy as np
import pandas as pd

from random import choice




In [2]:

df = pd.DataFrame(np.random.randn(25, 3), columns=['a', 'b', 'c'])
df['e'] = [choice(('Chicago', 'Boston', 'New York')) for i in range(df.shape[0])]
df.head(6)




Out[2]:

           a         b         c         e
0  -1.375866 -0.615001 -0.695595   Chicago
1  -0.050819  0.449645  0.623640   Chicago
2  -0.197383 -1.216587 -0.600860   Chicago
3   1.814729  0.776975  1.144400  New York
4   0.583220  0.716785 -0.348309   Chicago
5  -0.741097 -1.443649  0.001934   Chicago


An Example of Encoding Strings

scikit-learn has the classes OneHotEncoder and LabelEncoder in sklearn.preprocessing. LabelEncoder converts strings to integer codes; in older versions of scikit-learn, OneHotEncoder accepted only integer columns, so the two were often chained (newer versions of OneHotEncoder accept strings directly).

Here is another way with Pandas. Use it as a preprocessing step before using scikit-learn models.



In [3]:

df1 = pd.get_dummies(df, prefix=["e"])
df1.head(6)




Out[3]:

           a         b         c  e_Boston  e_Chicago  e_New York
0  -1.375866 -0.615001 -0.695595         0          1           0
1  -0.050819  0.449645  0.623640         0          1           0
2  -0.197383 -1.216587 -0.600860         0          1           0
3   1.814729  0.776975  1.144400         0          0           1
4   0.583220  0.716785 -0.348309         0          1           0
5  -0.741097 -1.443649  0.001934         0          1           0