Machine learning is an iterative process.
You will face choices about predictive variables to use, what types of models to use, what arguments to supply those models, and many other options.
We make these choices in a data-driven way by measuring model quality of various alternatives.
You've already learned to use train_test_split to split the data, so you can measure model quality on the test data.
Cross-validation extends this approach to model scoring (or "model validation").
Compared to train_test_split, cross-validation gives you a more reliable measure of your model's quality, though it takes longer to run.

The shortcomings of train-test split:

Imagine you have a dataset with 5000 rows.
The train_test_split function has an argument for test_size that you can use to decide how many rows go to the training set and how many go to the test set.
The larger the test set, the more reliable your measures of model quality will be.
At an extreme, you could imagine having only 1 row of data in the test set.
If you compare alternative models, which one makes the best predictions on a single data point will be mostly a matter of luck.
You will typically keep about 20% as a test dataset.
But even with 1000 rows in the test set, there's some random chance in determining model scores.
A model might do well on one set of 1000 rows, even if it would be inaccurate on a different 1000 rows.
The larger the test set, the less randomness (aka "noise") there is in our measure of model quality.
But we can only get a large test set by removing data from our training data, and smaller training datasets mean worse models.
In fact, the ideal modeling decisions on small datasets typically aren't the best modeling decisions on large datasets.
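To make that trade-off concrete, here is a minimal sketch (using made-up random data, not the Melbourne data used later in this lesson) of how train_test_split's test_size argument divides a 5000-row dataset:

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 5000 rows of random features and a random target
rng = np.random.RandomState(0)
X = rng.rand(5000, 5)
y = rng.rand(5000)

# test_size=0.2 sets aside roughly 1000 rows for scoring; a larger test set
# gives a less noisy quality measure but leaves fewer rows for training
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
print(train_X.shape, test_X.shape)    # (4000, 5) (1000, 5)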

The Cross-Validation Procedure

In cross-validation, we run our modeling process on different subsets of data to get multiple measures of model quality.
For example, we could have 5 folds or experiments.
We divide the data into 5 parts, each being 20% of the full dataset.
In the first experiment (or "fold"), we use the first fold as a holdout set and train the model on the remaining parts.
This gives us a measure of model quality based on a 20% holdout set, much as we would get from a simple train-test split.
In the second experiment, we hold out the second fold and use everything except the second fold for training.
This gives us a second estimate of model quality.
The process is repeated, using every fold once in turn as the holdout set, so that 100% of the data is used as a holdout at some point (the code sketch below walks through these splits).

Returning to the train-test split example above: if we have 5000 rows of data, cross-validation lets us measure model quality based on all 5000 rows.
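As a rough sketch of the procedure (again with made-up data, since the folds depend only on the number of rows), scikit-learn's KFold shows how each of the 5 folds takes a turn as the holdout set:

import numpy as np
from sklearn.model_selection import KFold

# Stand-in data with 5000 rows
rng = np.random.RandomState(0)
X = rng.rand(5000, 5)
y = rng.rand(5000)

# 5 folds: each experiment trains on 4000 rows and holds out 1000,
# and every row is held out exactly once across the 5 experiments
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for experiment, (train_idx, holdout_idx) in enumerate(kf.split(X), start=1):
    print(experiment, len(train_idx), len(holdout_idx))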

Trade-offs between train-test split and cross-validation:

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions.
However, it takes longer to run, because it fits a model once for each fold, so it is doing more total work.
Given these tradeoffs, when should you use each approach?

On small datasets, the extra computational burden of running cross-validation isn't a big deal.
These are also the problems where model quality scores would be least reliable with train-test split.
So, if your dataset is smaller, you should run cross-validation.
For the same reasons, a simple train-test split is sufficient for larger datasets.
It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
There's no simple threshold for what constitutes a large vs small dataset.
If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation.
If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
Alternatively, you can run cross-validation and see if the scores for each experiment seem close.
If the experiments give similar results, a train-test split is probably sufficient.
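A minimal sketch of that check, using a stand-in linear model and made-up data (any model and dataset would do): compare the spread of the per-fold scores to their mean.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data and model, purely to illustrate the check
rng = np.random.RandomState(0)
X = rng.rand(5000, 5)
y = X.sum(axis=1) + rng.normal(scale=0.1, size=5000)

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_absolute_error')

# If the fold-to-fold spread is small relative to the mean score,
# a single train-test split would likely have been good enough
print(scores)
print('spread: %.4f  mean: %.4f' % (scores.std(), -scores.mean()))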

The Example, Already!


In [2]:
import pandas as pd

# Read the data, keep a handful of numeric predictors, and set the target
data = pd.read_csv('input/melbourne_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

Next, we specify a pipeline for our preprocessing and modeling steps. Pipelines come in handy here, because doing cross-validation without them is much more challenging.


In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

# Impute missing values with the mean, then fit a random forest
my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
my_pipeline


Out[7]:
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN',
           strategy='mean', verbose=0)),
         ('randomforestregressor',
          RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])

For those curious about the pipeline object's attributes:


In [9]:
dir(my_pipeline)


Out[9]:
['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_cache',
 '_abc_negative_cache',
 '_abc_negative_cache_version',
 '_abc_registry',
 '_estimator_type',
 '_final_estimator',
 '_fit',
 '_get_param_names',
 '_get_params',
 '_inverse_transform',
 '_pairwise',
 '_replace_step',
 '_set_params',
 '_transform',
 '_validate_names',
 '_validate_steps',
 'classes_',
 'decision_function',
 'fit',
 'fit_predict',
 'fit_transform',
 'get_params',
 'inverse_transform',
 'named_steps',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'score',
 'set_params',
 'steps',
 'transform']

On to the cross-validation scores.


In [11]:
from sklearn.model_selection import cross_val_score

# Fit and score the pipeline on each fold; one score per fold is returned
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
scores


Out[11]:
array([-327386.90503595, -305910.97468595, -280464.43569695])

What do those numbers above tell you?
You may notice that we specified an argument for scoring.
This specifies what measure of model quality to report.
The docs for scikit-learn show a list of options.
It is a little surprising that we specify negative mean absolute error in this case.
Scikit-learn has a convention where all metrics are defined so a high number is better.
Using negatives here allows them to be consistent with that convention, though negative MAE is almost unheard of elsewhere.
You typically want a single measure of model quality to compare between models.
So we take the average across experiments.


In [14]:
# Flip the sign to report a conventional (positive) mean absolute error
mean_absolute_error = -1 * scores.mean()
mean_absolute_error


Out[14]:
304587.4384729497
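Since that single number exists to compare alternatives, one could score a second candidate pipeline the same way and compare the averages. A sketch, reusing X, y, and scores from the cells above (the n_estimators value here is just an illustrative choice, not a recommendation):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer

# A second candidate: the same pipeline, but with more trees in the forest
candidate_pipeline = make_pipeline(Imputer(),
                                   RandomForestRegressor(n_estimators=50))
candidate_scores = cross_val_score(candidate_pipeline, X, y,
                                   scoring='neg_mean_absolute_error')

# Lower (positive) MAE wins
print('original pipeline: ', -1 * scores.mean())
print('candidate pipeline:', -1 * candidate_scores.mean())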

Using cross-validation gives a more accurate measure of model quality. As an added benefit, it also removes the need to keep track of separate training and test sets.