Three types of variables come up repeatedly, particularly in linear regression.

- Continuous - ordered values that can be subdivided arbitrarily.
- Categorical - a limited, fixed number of values.
- Ordinal - a limited, fixed number of values for which order matters.

The values of a categorical variable are sometimes referred to as *levels*. The indicator variables that encode membership in a category are often called *dummy variables* in the statistics literature.

The actual numbers for categorical variables do not matter. For instance, we can encode the quarters as

- Quarter 1 (1)
- Quarter 2 (2)
- Quarter 3 (3)
- Quarter 4 (4)

or, shifted,

- Quarter 1 (0)
- Quarter 2 (1)
- Quarter 3 (2)
- Quarter 4 (3)

or as four separate dummy variables,

- Quarter 1 (1 or 0)
- Quarter 2 (1 or 0)
- Quarter 3 (1 or 0)
- Quarter 4 (1 or 0)

In a linear regression model, a dummy variable encoded as zero drops out and has no effect. Encoding it as one only shifts the *intercept*.
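A small sketch of why the actual numbers do not matter: fitting the same toy response against quarter codes `1..4` and against the shifted codes `0..3` gives the same slope, and only the intercept moves. The data here are simulated, not from the course.

```python
import numpy as np

# Hypothetical example: the same regression under two codings of quarter.
rng = np.random.default_rng(0)
q = rng.integers(1, 5, size=100)        # quarters coded 1..4
y = 2.0 * q + rng.normal(size=100)      # toy response

slope_a, intercept_a = np.polyfit(q, y, 1)      # coding 1..4
slope_b, intercept_b = np.polyfit(q - 1, y, 1)  # coding 0..3

print(np.isclose(slope_a, slope_b))                    # slopes agree
print(np.isclose(intercept_b, intercept_a + slope_a))  # only the intercept shifts
```

Shifting every code down by one is absorbed entirely by the intercept: if $y = a + bx$ and $x' = x - 1$, then $y = (a + b) + bx'$.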

Other types of models *require* all four dummy variables; decision trees (covered later) are one example.

We have to encode variables for use in `scikit-learn`, as the package only takes numeric categories. `R` takes "levels" and encodes them internally, so you can pass strings as categories to models in `R`, but not in `scikit-learn`.

Your data cleaning and preparation step may include taking strings in a column and transforming them into a numeric category or level.
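One way to do that transformation, as a sketch with pandas' `category` dtype (the city names are the hypothetical values used later in this document):

```python
import pandas as pd

# Map strings in a column to numeric codes via the category dtype.
s = pd.Series(['Chicago', 'Boston', 'New York', 'Boston'])
codes = s.astype('category').cat.codes

# Categories are sorted alphabetically: Boston=0, Chicago=1, New York=2.
print(codes.tolist())  # [1, 0, 2, 0]
```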

In models such as regression, which are linear in the unknown coefficients, we can't include all of the dummy variables. If we have four dummy variables for quarter, we must include only three of them.

- Quarter 1 = 1 or 0
- Quarter 2 = 1 or 0
- Quarter 3 = 1 or 0
- Quarter 4 = 1 or 0

Encoding zero for quarters 1 through 3 carries the same information as encoding a 1 for quarter 4 and zero for the rest. Remember, linear regression has an *intercept*: its column of ones equals the sum of the four dummy columns, which is why we can't use all four dummy variables.

The dummy variable that is omitted is the base category against which all others are compared.

For instance,

$\ln(\texttt{wage}) = \alpha + \beta\,\texttt{college} + \texttt{error}$

Here we set up a regression problem with `college` as one or zero. If `college` is one, then the wage goes up (assuming $\beta > 0$). So we can interpret the significance of the `college` variable as measuring a wage premium or discount for attending college.

The `error` term contains

- Every variable not included in the regression model.
- Randomness.

This model is not *predictive* but is *explanatory*.
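The wage regression can be fit with ordinary least squares in a few lines. This is a sketch on simulated data; the true $\alpha = 2.5$ and $\beta = 0.3$ are assumptions chosen for illustration, not estimates from any real wage data.

```python
import numpy as np

# Hypothetical data with a built-in college wage premium (beta = 0.3).
rng = np.random.default_rng(1)
college = rng.integers(0, 2, size=500)
log_wage = 2.5 + 0.3 * college + rng.normal(scale=0.2, size=500)

# OLS via least squares on [intercept, college].
X = np.column_stack([np.ones_like(college, dtype=float), college])
alpha_hat, beta_hat = np.linalg.lstsq(X, log_wage, rcond=None)[0]

print(alpha_hat, beta_hat)  # estimates near the assumed 2.5 and 0.3
```

Because the response is the log of the wage, $e^{\hat\beta} - 1$ is roughly the proportional wage premium for college.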

Questions

- Why am I taking the natural log of `wage`?
- What happens if $\beta < 0$?
- We encoded `college` as zero or one. Why not one and two?

Including all dummy variables in a regression model introduces *multicollinearity* and can cause all kinds of problems for predictions. It *may* be acceptable for an explanatory model, if appropriate corrections are made.

In [1]:

```python
import numpy as np
import pandas as pd
from random import choice
```

In [2]:

```python
df = pd.DataFrame(np.random.randn(25, 3), columns=['a', 'b', 'c'])
df['e'] = [choice(('Chicago', 'Boston', 'New York')) for i in range(df.shape[0])]
df.head(6)
```

`scikit-learn` has the classes `OneHotEncoder` and `LabelEncoder` in `sklearn.preprocessing`. `OneHotEncoder` encodes columns that have only integers. `LabelEncoder` encodes strings.
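A minimal sketch of `LabelEncoder` on the city strings used in the DataFrame above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Chicago', 'Boston', 'New York', 'Boston'])

# Classes are stored sorted, so Boston=0, Chicago=1, New York=2.
print(list(le.classes_))  # ['Boston', 'Chicago', 'New York']
print(list(codes))        # [1, 0, 2, 0]
```

The integer codes could then be fed to `OneHotEncoder` to produce the dummy columns.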

Here is another way with pandas. Use it as a preprocessing step before using `scikit-learn` models.

In [3]:

```python
df1 = pd.get_dummies(df, prefix=['e'])
df1.head(6)
```
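For a linear regression, `get_dummies` can also drop the base category directly with `drop_first=True`, avoiding the multicollinearity problem discussed earlier. A sketch with the same hypothetical city values:

```python
import pandas as pd

# drop_first=True keeps k-1 dummies; the dropped (alphabetically first)
# category becomes the base against which the others are compared.
cities = pd.DataFrame({'e': ['Chicago', 'Boston', 'New York', 'Boston']})
dummies = pd.get_dummies(cities, prefix=['e'], drop_first=True)

print(dummies.columns.tolist())  # ['e_Chicago', 'e_New York']; Boston is the base
```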