There are two major types of supervised machine learning problems, called *classification* and *regression*.

In classification, the goal is to predict a *class label*, which is a choice from a predefined list of possibilities. In *Intro_to_Decision_Trees.ipynb* we used the example of classifying irises into one of three possible species. Classification is sometimes separated into binary classification, which is the special case of distinguishing between exactly two classes, and multiclass classification, which is classification between more than two classes. You can think of binary classification as trying to answer a yes/no question. Classifying emails as either spam or not spam is an example of a binary classification problem. In this binary classification task, the yes/no question being asked would be “Is this email spam?”

For regression tasks, the goal is to predict a *continuous number*, or a floating-point number in programming terms (or real number in mathematical terms). Predicting a person’s annual income from their education, their age, and where they live is an example of a regression task. When predicting income, the predicted value is an amount, and can be any number in a given range. Another example of a regression task is predicting the yield of a corn farm given attributes such as previous yields, weather, and number of employees working on the farm. The yield again can be an arbitrary number.

**An easy way to distinguish between classification and regression tasks is to ask whether there is some kind of continuity in the output. If there is continuity between possible outcomes, then the problem is a regression problem.** Think about predicting annual income. There is a clear continuity in the output. Whether a person makes $40,000 or $40,001 a year does not make a tangible difference, even though these are different amounts of money; if our algorithm predicts $39,999 or $40,001 when it should have predicted $40,000, we don’t mind that much.

By contrast, for the task of recognizing the language of a website (which is a classification problem), there is no matter of degree. A website is in one language, or it is in another. There is no continuity between languages, and there is no language that is between English and French.

*Disclaimer*: Much of the code in this notebook was lifted from the excellent book Introduction to Machine Learning with Python by Andreas Muller and Sarah Guido.

In supervised learning, we want to build a model on the training data and then be able to make accurate predictions on new, unseen data that has the same characteristics as the training set that we used. If a model is able to make accurate predictions on unseen data, we say it is able to *generalize* from the training set to the test set. We want to build a model that is able to generalize as accurately as possible.

Usually we build a model in such a way that it can make accurate predictions on the training set. If the training and test sets have enough in common, we expect the model to also be accurate on the test set. However, there are some cases where this can go wrong. For example, if we allow ourselves to build very complex models, we can always be as accurate as we like on the training set.

The only measure of whether an algorithm will perform well on new data is the evaluation on the test set. However, intuitively we expect simple models to generalize better to new data. Therefore, we always want to find the simplest model. Building a model that is too complex for the amount of information we have, as our novice data scientist did, is called *overfitting*. Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data. On the other hand, if your model is too simple, then you might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set. Choosing too simple a model is called *underfitting*.

The more complex we allow our model to be, the better we will be able to predict on the training data. However, if our model becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data.

There is a sweet spot in between that will yield the best generalization performance. This is the model we want to find.

It’s important to note that model complexity is intimately tied to the variation of inputs contained in your training dataset: the larger variety of data points your dataset contains, the more complex a model you can use without overfitting. Usually, collecting more data points will yield more variety, so larger datasets allow building more complex models. However, simply duplicating the same data points or collecting very similar data will not help.

Having more data and building appropriately more complex models can often work wonders for supervised learning tasks. In the real world, you often have the ability to decide how much data to collect, which might be more beneficial than tweaking and tuning your model. Never underestimate the power of more data.

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features.

For regression, the general prediction formula for a linear model looks as follows:

```
ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b
```

Here, x[0] to x[p] denotes the features (in this example, the number of features is p) of a single data point, w and b are parameters of the model that are learned, and ŷ is the prediction the model makes. For a dataset with a single feature, this is:

```
ŷ = w[0] * x[0] + b
```

which you might remember from high school mathematics as the equation for a line. Here, w[0] is the slope and b is the y-axis offset. For more features, w contains the slopes along each feature axis. Alternatively, you can think of the predicted response as being a weighted sum of the input features, with weights (which can be negative) given by the entries of w.

Linear models for regression can be characterized as regression models for which the prediction is a line for a single feature, a plane when using two features, or a hyperplane in higher dimensions (that is, when using more features).

For datasets with many features, linear models can be very powerful. In particular, if you have more features than training data points, any target y can be perfectly modeled (on the training set) as a linear function.

There are many different linear models for regression. The difference between these models lies in how the model parameters w and b are learned from the training data, and how model complexity can be controlled.

Linear regression, or *ordinary least squares* (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters w and b that minimize the *mean squared error* between predictions and the true regression targets, y, on the training set. The mean squared error is the sum of the squared differences between the predictions and the true values. Linear regression has no parameters, which is a benefit, but it also has no way to control model complexity.

The scikit-learn documentation on [Linear Regression]http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) has a decent basic example of its use.

- Simple to understand and to interpret, at least for a small number of features/dimensions
- Easy to visualize for 2 or 3 features

- Very fast to train and also fast to predict
- Doesn't suffer from the
*curse of dimensionality*that methods such as KNearsetNeighbors does- Actually linear methods tend to work better with lots of features than with a small number of features

- OLS has no way to control model complexity and can suffer from overfitting, particularly if there are a large number of features
- Modified versions of Linear Regression such as
*Ridge Regression*and*Lasso*can mitigate or fix this issue

- Modified versions of Linear Regression such as

- In lower-dimensional spaces, other models might yield better generalization performance
- Requires more data preparation than some other techniques
- Feature normalization is required for best results (for any algorithm which includes regularization)
- Non-ordinal categorical features need to be one-hot encoded
- Ordinal features need to be numerically encoded

```
In [1]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

One of the most famous datasets for regression in a supervised learning setting is the Boston Housing data set. It is a multivariate dataset introduced in a 1978 paper which records 13 attributes concerning housing values in the suburbs of Boston. NOTE: The data is very, very old and the house prices are ridiculously low by today's standards.

scikit-learn has a number of small toy datasets included with it which makes it quick and easy to experiment with different machine learning algorithms on these datasets.

The sklearn.datasets.load_boston() method can be used to load the this dataset.

The *boston* object that is returned by **load_boston** is a **Bunch** object, which is very similar to a dictionary. It contains keys and values.

Feature Information:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population

Target Information

- MEDV: Median value of owner-occupied homes in $1000's

```
In [2]:
```from sklearn.datasets import load_boston
boston = load_boston()

```
In [3]:
```print("Keys of boston: {}".format(boston.keys()))

```
```

```
In [4]:
```# The value of the key DESCR is a short description of the dataset. Here we show the beinning of the description.
print(boston['DESCR'][:193] + "\n...")

```
```

```
In [5]:
```# The value of feature_names is a list of strings, giving the abbreviated name of each feature
print("Feature names: {}".format(boston['feature_names']))

```
```

```
In [6]:
```# The data itself is contained in the target and data fields.
# data contains the numeric measurements of features in a NumPy array
print("Type of data: {}".format(type(boston['data'])))

```
```

```
In [7]:
```# The rows in the data array correspond to neighborhoods, while the columns represent the features
print("Shape of data: {}".format(boston['data'].shape))

```
```

```
In [8]:
```# We see that the array contains measurements for 506 different neighborhoods. Here are values for the first 5.
print("First five columns of data:\n{}".format(boston['data'][:5]))

```
```

```
In [9]:
```# The target array contains the Median value of owner-occupied homes in $1000's, also as a NumPy array
print("Type of target: {}".format(type(boston['target'])))

```
```

```
In [10]:
```# target is a one-dimensional array, with one entry per sample
print("Shape of target: {}".format(boston['target'].shape))

```
```

```
In [11]:
```# The target values are positive floating point numbers which represent a median house value in thousands of dollars.
print("Target:\n{}".format(boston['target']))

```
```

We want to build a machine learning model from this data that can predict the species of iris for a new set of measurements. But before we can apply our model to new measurements, we need to know whether it actually works -- that is, whether we should trust its predictions.

Unfortunately, we cannot use the data we used to build the model to evaluate it. This is because our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set. This "remembering" does not indicate to us whether the model will *generalize* well (in other words, whether it will also perform well on new data).

To assess the model's performance, we show it new data (data that it hasn't seen before) for which we have labels. This is usually done by splitting the labeled data we have collected (here, our 150 flower measurements) into two parts. One part of the data is used to build our machine learning model, and is called the *training data* or *training set*. The rest of the data will be used to assess how well the model works; this is called the *test data*, *test set*, or *hold-out set*.

scikit-learn contains a function that shuffles the dataset and splits it for you: the train_test_split function. This function extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, is declared as the test set. Deciding how much data you want to put into the training and the test set respectively is somewhat arbitrary, but scikit-learn's default 75/25 split is a reasonable starting point.

In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y. This is inspired by the standard formulation *f(x)=y* in mathematics, where *x* is the input to a function and *y* is the output. Following more conventions from mathematics, we use a capital *X* because the data is a two-dimensional array (a matrix) and a lowercase *y* because the target is a one-dimensional array (a vector).

Before making the split, the **train_test_split** function shuffles the dataset using a pseudorandom number generator. If we just took the last 25% of the data as a test set, all the data points would have the label 2, as the data points are sorted by the label.

To make sure this example code will always get the same output if run multiple times, we provide the pseudorandom number generator with a fixed seed using the **random_state** parameter.

The output of the **train_test_split** function is **X_train**, **X_test**, **y_train**, and **y_test**, which are all NumPy arrays. **X_train** contains 75% of the rows of the dataset, and **X_test** contains the remaining 25%.

```
In [12]:
```from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston['data'], boston['target'], random_state=0)

```
In [13]:
```print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

```
```

```
In [14]:
```print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

```
```

Before building a machine learning model, it is often a good idea to inspect the data, to see if the task is easily solvable without machine learning, or if the desired information might not be contained in the data.

Additionally, inspecting the data is a good way to find abnormalities and peculiarities. Maybe some of your irises were measured using inches and not centimeters, for example. In the real world, inconsistencies in the data and unexpected measurements are very common, as are missing data and not-a-number (NaN) or infinite values.

One of the best ways to inspect data is to visualize it. One way to do this is by using a *scatter plot*. A scatter plot of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point. Unfortunately, computer screens have only two dimensions, which allows us to plot only two (or maybe three) features at a time. It is difficult to plot datasets with more than three features this way. One way around this problem is to do a *pair plot*, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind, however, that a pair plot does not show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed when visualizing it this way.

In Python, the *pandas* library has a convenient function called scatter_matrix for creating pair plots for a DataFrame.

```
In [15]:
```# create dataframe from data in X_train
boston_df = pd.DataFrame(X_train, columns=boston.feature_names)
# Add in the target data
boston_df['MEDV'] = y_train
# Look at the first few rows
boston_df.head()

```
Out[15]:
```

```
In [16]:
```# create a scatter matrix from the dataframe
tmp = pd.scatter_matrix(boston_df, figsize=(15, 15))

```
```

```
In [17]:
```# Get a high-level overview of the data
boston_df.describe()

```
Out[17]:
```

```
In [18]:
```# Find which features are most highly correlated with the housing prices
df = boston_df
df['MEDV'] = y_train
df.corr()['MEDV']

```
Out[18]:
```

Now we can start building the actual machine learning model. There are many regression algorithms in *scikit-learn* that we could use. Here we will use Ordinary Least Squares (OLS) Linear Regression because it is easy to understand and interpret.

All machine learning models in *scikit-learn* are implemented in their own classes, which are called *Estimator* classes. The Linear Regression algorithm is implemented in the LinearRegression class in the **linear_model** module. Before we can use the model, we need to instantiate the class into an object. This is when we will set any parameters of the model. The LinearRegression model doesn't have any particular parameters of importance.

```
In [19]:
```from sklearn.linear_model import LinearRegression
lr = LinearRegression()

The *lr* object encapsulates the algorithm that will be used to build the model from the training data, as well the algorithm to make predictions on new data points. It will also hold the information that the algorithm has extracted from the training data.

To build the model on the training set, we call the **fit** method of the *lr* object, which takes as arguments the NumPy array *X_train* containing the training data and the NumPy array *y_train* of the corresponding training labels.

```
In [20]:
```lr.fit(X_train, y_train)

```
Out[20]:
```

* attribute, while the offset or intercept (b) is stored in the intercept* attribute:

```
In [21]:
```print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))

```
```

The intercept* attribute is always a single float number, while the coef* attribute is a NumPy array with one entry per input feature. As we only have 13 input features in this dataset, lr.coef_ has 13 entries.

Let’s look at the training set and test set performance:

```
In [22]:
```print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

```
```

An R^2 of around 0.64 on the test set is not very good, but we can see that the scores on the training and test sets are are a decent distance apart. This means we are likely overfitting. With higher-dimensional datasets (meaning datasets with a large number of features), linear models become more powerful, and there is a higher chance of overfitting. More complicated linear models such as *Ridge Regression* and *Lasso* have been designed to help control this overfitting problem.

An R^2 of around 0.77 on the training set is OK, but not great. For a really good fit, we would want an R^2 of around 0.95 or so. This tells us we are missing someting. One possibility is we could do some feature engineering and either include polynomial powers of some of the features and/or include products of some of the features.

Also, linear models tend ot work better when all of the features exist on roughly the same scale, we could attempt to scale our data as well.

Some algorithms, like neural networks, SVMs, and k-NearestNeighbors are very sensitive to the scaling of the data; while many others such as linear models with regularization (Ridge, Lasso, etc.) are moderately sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.

Differnt algorithms benefit from different kinds of scaling and thus Scikit-Learn supports a variety of scaling methods, though they all have a similar API.

Neural networks expect all input features to vary in a similar way, and ideally to have a mean of 0, and a variance of 1. When using ANN, we must rescale our data so that it fulfills these requirements. For doing this automatically, *scikit-learn* has the StandardScaler. The **StandardScaler** in *scikit-learn* ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features.

A common rescaling method for kernel SVMs is to scale the data such that all features are between 0 and 1. We can do this in *scikit-learn* by using the MinMaxScaler preprocessing method. The **MinMaxScaler** shifts the data such that all features are exactly between 0 and 1. For a two-dimensional dataset this means all of the data is contained within the rectangle created by the x-axis between 0 and 1 and the y-axis between 0 and 1.

Standard scaling does not ensure any particular minimum and maximum values for the features. The RobustScaler works similarly to the **StandardScaler** in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the **RobustScaler** uses the median and quartiles, instead of mean and variance. This makes the **RobustScaler** ignore data points that are very different from the rest (like measurement errors). These odd data points are also called *outliers*, and can lead to trouble for other scaling techniques.

```
In [23]:
```# Scale the boston dataset
from sklearn.preprocessing import MinMaxScaler
X = MinMaxScaler().fit_transform(boston.data)

```
In [24]:
```X_train, X_test, y_train, y_test = train_test_split(X, boston['target'], random_state=0)
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

```
```

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.

In particular, linear models might benefit greatly from generating new features via techniques such as binning, and adding polynomials and interactions. However, more complex models like random forests and SVMs might be able to learn more complex tasks without explicitly expanding the feature space.

In practice, the features that are used (and the match between features and method) is often the most important piece in making a machine learning approach work well.

One way to enrich a feature representation, particularly for linear models, is adding *interaction features* - products of individual original features. Another way to enrich a feature representation is to use *polynomials* of the original features - for a given feature x, we might want to consider x^2, x^3, x^4, and so on. This kind of feature engineering is often used in statistical modeling, but it’s also common in many practical machine learning applications.

Within *scikit-learn*, the addition of both *interaction features* and *polynomial features* is implemented in PolynomialFeatures in the **preprocessing** module.

In the code below, we modify the boston housing dataset by addig all polynomial features and interactions up to a degree of 2. The data originally had 13 features, which were expanded into 105 interaction features. These new features represent all possible interactions between two different original features, as well as the square of each original feature. degree=2 here means that we look at all features that are the product of up to two original features. The exact correspondence between input and output features can be found using the **get_feature_names** method.

```
In [25]:
```from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler, RobustScaler
def load_extended_boston(scaler='minmax'):
boston = load_boston()
X = boston.data
if 'standard' == scaler:
X = StandardScaler().fit_transform(boston.data)
elif 'robust' == scaler:
X = RobustScaler().fit_transform(boston.data)
else:
X = MinMaxScaler().fit_transform(boston.data)
X = PolynomialFeatures(degree=2).fit_transform(X)
return X, boston.target

```
In [26]:
```X, y = load_extended_boston()
X.shape

```
Out[26]:
```

```
In [27]:
```# What if we fit this new dataset with a vastly expanded set of features using OLS?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

```
```

Now the basic OLS model is doing a dramatically better job fitting the training set (R^2 of 0.95 vs 0.77).

This discrepancy between performance on the training set and the test set is a clear sign of overfitting, and therefore we should try to find a model that allows us to control complexity. One of the most commonly used alternatives to standard linear regression is *ridge regression*, which we will look into next.

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares. In ridge regression, though, the coefficients (w) are chosen not only so that they predict well on the training data, but also to fit an additional constraint. We also want the magnitude of coefficients to be as small as possible; in other words, all entries of w should be close to zero. Intuitively, this means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well. This constraint is an example of what is called *regularization*. Regularization means explicitly restricting a model to avoid overfitting. The particular kind used by ridge regression is known as L2 regularization.

Ridge regression is implemented in linear_model.Ridge. Let’s see how well it does on the extended Boston Housing dataset:

```
In [28]:
```from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

```
```

*lower* than for LinearRegression, while the test set score is *higher*. This is consistent with our expectation. With linear regression, we were overfitting our data. Ridge is a more restricted model, so we are less likely to overfit. A less complex model means worse performance on the training set, but better generalization. As we are only interested in generalization performance, we should choose the Ridge model over the LinearRegression model.

**alpha** parameter. In the previous example, we used the default parameter alpha=1.0. There is no reason why this will give us the best trade-off, though. The optimum setting of alpha depends on the particular dataset we are using. Increasing alpha forces coefficients to move more toward zero, which decreases training set performance but might help generalization. For example:

```
In [29]:
```ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

```
```

```
In [30]:
```ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

```
```

Here, alpha=0.1 seems to be working well. We could try decreasing alpha even more to improve generalization. For now, notice how the parameter alpha corresponds to the model complexity.

Very shortly we need to think about systematic methods for properly select optimal values for parameters such as **alpha**.

We can also get a more qualitative insight into how the alpha parameter changes the model by inspecting the coef* attribute of models with different values of alpha. A higher alpha means a more restricted model, so we expect the entries of coef* to have smaller magnitude for a high value of alpha than for a low value of alpha. This is confirmed in the plot below:

```
In [31]:
```plt.figure(figsize=(15, 10))
plt.plot(ridge.coef_, 's', label="Ridge alpha=1")
plt.plot(ridge10.coef_, '^', label="Ridge alpha=10")
plt.plot(ridge01.coef_, 'v', label="Ridge alpha=0.1")
plt.plot(lr.coef_, 'o', label="LinearRegression")
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)
plt.legend()
plt.show()

```
```

Clearly, the interactions and polynomial features gave us a good boost in performance when using Ridge. When using a more complex model like a random forest, the story can be a bit different, though. Adding features will benefit linear models the most. For very complex models, adding features may actually slightly decrease the performance.

Machine learning is complex. Often you have to try several experiments and just see what works best.

To evaluate our supervised models, so far we have split our dataset into a training set and a test set using the **train_test_split function**, built a model on the training set by calling the fit method, and evaluated it on the test set using the score method, which for classification computes the fraction of correctly classified samples and for regression computes the R^2.

Remember, the reason we split our data into training and test sets is that we are interested in measuring how well our model *generalizes* to new, previously unseen data. We are not interested in how well our model fit the training set, but rather in how well it can make predictions for data that was not observed during training.

As we saw when exploring Ridge regression, we need a more robust way to assess generalization performance which is capable of automatically choosing optimal values for hyper-parameters such as **alpha**.

*Cross-validation* is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set. In cross-validation, the data is instead split repeatedly and multiple models are trained. The most commonly used version of cross-validation is *k-fold cross-validation*, where *k* is a user-specified number, usually 5 or 10. When performing five-fold cross-validation, the data is first partitioned into five parts of (approximately) equal size, called *folds*. Next, a sequence of models is trained. The first model is trained using the first fold as the test set, and the remaining folds (2–5) are used as the training set. The model is built using the data in folds 2–5, and then the accuracy is evaluated on fold 1. Then another model is built, this time using fold 2 as the test set and the data in folds 1, 3, 4, and 5 as the training set. This process is repeated using folds 3, 4, and 5 as test sets. For each of these five splits of the data into training and test sets, we compute the accuracy. In the end, we have collected five accuracy values.

Usually, the first fifth of the data is the first fold, the second fifth of the data is the second fold, and so on.

The whole point of cross-validation is to be more robust than a simple train/test split so that the results are not likely to be influenced by a particularly good or bad split of the data. The main disadvantage is that it requires more computation.

Cross-validation is implemented in scikit-learn using the cross_val_score function from the *model_selection* module. The parameters of the **cross_val_score** function are the model we want to evaluate, the training data, and the ground-truth labels.

```
In [32]:
```# Let's evaluate cross-validation on the iris dataset using logistic regression (which is actually classification)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
logreg = LogisticRegression()
scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))

```
```

```
In [33]:
```scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print("Cross-validation scores: {}".format(scores))

```
```

A common way to summarize the cross-validation accuracy is to compute the mean:

```
In [34]:
```print("Average cross-validation score: {:.2f}".format(scores.mean()))

```
```

There are several benefits to using cross-validation instead of a single split into a training and a test set. First, remember that train_test_split performs a random split of the data. Imagine that we are “lucky” when randomly splitting the data, and all examples that are hard to classify end up in the training set. In that case, the test set will only contain “easy” examples, and our test set accuracy will be unrealistically high. Conversely, if we are “unlucky,” we might have randomly put all the hard-to-classify examples in the test set and consequently obtain an unrealistically low score. However, when using cross-validation, each example will be in the training set exactly once: each example is in one of the folds, and each fold is the test set once. Therefore, the model needs to generalize well to all of the samples in the dataset for all of the cross-validation scores (and their mean) to be high.

Having multiple splits of the data also provides some information about how sensitive our model is to the selection of the training dataset. For the iris dataset, we saw accuracies between 90% and 100%. This is quite a range, and it provides us with an idea about how the model might perform in the worst case and best case scenarios when applied to new data.

Another benefit of cross-validation as compared to using a single split of the data is that we use our data more a single split of the data is that we use our data more effectively. When using train_test_split, we usually use 75% of the data for training and 25% of the data for evaluation. When using five-fold cross-validation, in each iteration we can use four-fifths of the data (80%) to fit the model. When using 10-fold cross-validation, we can use nine-tenths of the data (90%) to fit the model. More data will usually result in more accurate models.

The main disadvantage of cross-validation is increased computational cost. As we are now training k models instead of a single model, cross-validation will be roughly k times slower than doing a single split of the data.

It is important to keep in mind that cross-validation is not a way to build a model that can be applied to new data. Cross-validation does not return a model. When calling cross_val_score, multiple models are built internally, but the purpose of cross-validation is only to evaluate how well a given algorithm will generalize when trained on a specific dataset.

```
In [35]:
```lr = LinearRegression()
scores = cross_val_score(lr, boston.data, boston.target)
print("Cross-validation scores: {}".format(scores))

```
```

As we can see, a default 3-fold cross-validation performed ok for the first two folds, but horribly bad for the third one.

The fundamental problem here is that if that data isn't organized in a random way, then just taking folds in order doesn't represent a random sampling for each fold. There are multiple possible ways to mitigate this issue.

As the simple k-fold strategy would obviously fail for classification problems if the data is organized by target category, *scikit-learn* does not use it for classification, but rather uses *stratified k-fold cross-validation*. In stratified cross-validation, we split the data such that the proportions between classes are the same in each fold as they are in the whole dataset.

*scikit-learn* supports startified k-fold cross-validation via the StratifiedKFold class in the *model_selection* module.

For example, if 90% of your samples belong to class A and 10% of your samples belong to class B, then stratified cross-validation ensures that in each fold, 90% of samples belong to class A and 10% of samples belong to class B.

For regression, *scikit-learn* uses the standard k-fold cross-validation by default.

Another, very flexible strategy for cross-validation is *shuffle-split cross-validation*. In shuffle-split cross-validation, each split samples **train_size** many points for the training set and **test_size** many (disjoint) point for the test set. This splitting is repeated **n_iter** times. You can use integers for **train_size** and **test_size** to use absolute sizes for these sets, or floating-point numbers to use fractions of the whole dataset.

Since the sampling in *shuffle-split cross-validation* is done in a random fashion, this is a safer alternative to default *k-Fold Cross-Validation* when the data isn't truly randomized.

*scikit-learn* supports shuffle-split cross-validation via the ShuffleSplit class in the *model_selection* module.

There is also a stratified variant of ShuffleSplit, aptly named StratifiedShuffleSplit, which can provide more reliable results for classification tasks.

```
In [36]:
```# Let's look at the boston housing dataset again using shuffle-split cross-validation to ensure random sampling
# The following code splits the dataset into 80% training set and 20% test set for 3 iterations:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.8, train_size=.2, n_splits=3)
scores = cross_val_score(lr, boston.data, boston.target, cv=shuffle_split)
print("Cross-validation scores:\n{}".format(scores))

```
```

Now that we know how to evaluate how well a model generalizes, we can take the next step and improve the model’s generalization performance by tuning its parameters. We discussed the parameter settings of the Ridge model for ridge regression earlier. Finding the values of the important parameters of a model (the ones that provide the best generalization performance) is a tricky task, but necessary for almost all models and datasets. Because it is such a common task, there are standard methods in *scikit-learn* to help you with it. The most commonly used method is grid search, which basically means trying all possible combinations of the parameters of interest.

Consider the case of ridge regression, as implemented in the Ridge class. As we discussed earlier, there is one important parameters: the regularization parameter, *alpha*. Say we want to try the values 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, and 100 for *alpha*. Because we have eleven different settings for *alpha* and *alpha* is the only parameter, we have 11 combinations of parameters in total. Looking at all possible combinations creates a table (or grid) of parameter settings for the Ridge regression model.

We can implement a simple grid search just as a for loop over the parameter, training and evaluating a classifier for each value:

```
In [37]:
```X, y = load_extended_boston(scaler='standard')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("Size of training set: {} size of test set: {}".format(X_train.shape[0], X_test.shape[0]))
best_score = 0
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]:
# for each combination of parameters, train an SVC
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
# evaluate the SVC on the test set
score = ridge.score(X_test, y_test)
# if we got a better score, store the score and parameters
if score > best_score:
best_score = score
best_parameters = {'alpha': alpha}
print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

```
```

Given this result, we might be tempted to report that we found a model that performs with 78% accuracy on our dataset. However, this claim could be overly optimistic (or just wrong), for the following reason: we tried many different parameters and selected the one with best accuracy on the test set, but this accuracy won’t necessarily carry over to new data. Because we used the test data to adjust the parameters, we can no longer use it to assess how good the model is. This is the same reason we needed to split the data into training and test sets in the first place; we need an independent dataset to evaluate, one that was not used to create the model.

One way to resolve this problem is to split the data again, so we have three sets: the training set to build the model, the validation (or development) set to select the parameters of the model, and the test set to evaluate the performance of the selected parameters.

After selecting the best parameters using the validation set, we can rebuild a model using the parameter settings we found, but now training on both the training data and the validation data. This way, we can use as much data as possible to build our model. This leads to the following implementation:

```
In [38]:
```X, y = load_extended_boston(scaler='standard')
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
" {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
best_score = 0
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]:
# for each combination of parameters, train an SVC
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
# evaluate the Ridge on the test set
score = ridge.score(X_valid, y_valid)
# if we got a better score, store the score and parameters
if score > best_score:
best_score = score
best_parameters = {'alpha': alpha}
# rebuild a model on the combined training and validation set,
# and evaluate it on the test set
ridge = Ridge(**best_parameters)
ridge.fit(X_trainval, y_trainval)
test_score = ridge.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))

```
```

The best score on the validation set is 92%. However, the score on the test set—the score that actually tells us how well we generalize—is lower, at 78%. So we can claim to classify new data 78% correctly. This happens to be the same as before, now we can make a stronger claim since the final test set wasn't used in any way shape or form during hyper-parameter tuning.

The distinction between the training set, validation set, and test set is fundamentally important to applying machine learning methods in practice. Any choices made based on the test set accuracy “leak” information from the test set into the model. Therefore, it is important to keep a separate test set, which is only used for the final evaluation. It is good practice to do all exploratory analysis and model selection using the combination of a training and a validation set, and reserve the test set for a final evaluation—this is even true for exploratory visualization. Strictly speaking, evaluating more than one model on the test set and choosing the better of the two will result in an overly optimistic estimate of how accurate the model is.

While the method of splitting the data into a training, a validation, and a test set that we just saw is workable, and relatively commonly used, it is quite sensitive to how exactly the data is split. From the output of the previous code snippet we can see that GridSearchCV selects 'alhpa': 50 as the best parameter. But if we were to take a different part of the training data as the validation set, it may optimize for a different value. For a better estimate of the generalization performance, instead of using a single split into a training and a validation set, we can use cross-validation to evaluate the performance of each parameter combination. This method can be coded up as follows:

```
In [39]:
```best_score = 0
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]:
# for each combination of parameters, train an SVC
ridge = Ridge(alpha=alpha)
# perform cross-validation
scores = cross_val_score(ridge, X_trainval, y_trainval, cv=5)
# compute mean cross-validation accuracy
score = np.mean(scores)
# if we got a better score, store the score and parameters
if score > best_score:
best_score = score
best_parameters = {'alpha': alpha}
# rebuild a model on the combined training and validation set,
# and evaluate it on the test set
ridge = Ridge(**best_parameters)
ridge.fit(X_trainval, y_trainval)
test_score = ridge.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))

```
```

Because grid search with cross-validation is such a commonly used method to adjust parameters, *scikit-learn* provides the GridSearchCV class, which implements it in the form of an estimator. To use the **GridSearchCV** class, you first need to specify the parameters you want to search over using a dictionary. GridSearchCV will then perform all the necessary model fits. The keys of the dictionary are the names of parameters we want to adjust (as given when constructing the model—in this case, alpha), and the values are the parameter settings we want to try out. Trying the values 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, and 100 for alpha translates to the following dictionary:

```
In [40]:
```param_grid = {'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}
print("Parameter grid:\n{}".format(param_grid))

```
```

**GridSearchCV** class with the model (*Ridge*), the parameter grid to search (*param_grid*), and the cross-validation strategy we want to use (say, five-fold stratified cross-validation):

```
In [41]:
```from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)

**GridSearchCV** will use cross-validation in place of the split into a training and validation set that we used before. However, we still need to split the data into a training and a test set, to avoid overfitting the parameters:

```
In [42]:
```X, y = load_extended_boston(scaler='standard')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

*grid_search* object that we created behaves just like a classifier; we can call the standard methods **fit**, **predict**, and **score** on it. However, when we call **fit**, it will run cross-validation for each combination of parameters we specified in param_grid:

```
In [43]:
```grid_search.fit(X_train, y_train)

```
Out[43]:
```

**GridSearchCV** object not only searches for the best parameters, but also automatically fits a new model on the whole training dataset with the parameters that yielded the best cross-validation performance. What happens in fit is therefore equivalent to the result of the code we saw at the beginning of this section. The **GridSearchCV** class provides a very convenient interface to access the retrained model using the predict and score methods. To evaluate how well the best found parameters generalize, we can call score on the test set:

```
In [44]:
```print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))

```
```

*did not use the test set* to choose the parameters. The parameters that were found are scored in the ** best_params_** attribute, and the best cross-validation accuracy (the mean accuracy over the different splits for this parameter setting) is stored in

`best_score_`

```
In [45]:
```print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

```
```

** best_estimator_** attribute:

```
In [46]:
```print("Best estimator:\n{}".format(grid_search.best_estimator_))

```
```

*grid_search* itself has **predict** and **score** methods, using ** best_estimator_** is not needed to make predictions or evaluate the model.

```
In [47]:
```from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
param_grid = {'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
X, y = load_extended_boston(scaler='standard')
for i in range(10):
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search.fit(X_train, y_train)
print("Run {} - Test set score: {:.2f} Best parameters: {}".format(i, grid_search.score(X_test, y_test),
grid_search.best_params_))

```
```

```
In [ ]:
```