Scikit-Learn

scikit-learn is a Python library that provides many machine learning algorithms through a consistent API built around the estimator.


In [13]:
import numpy as np

Validation Data

Holding out data that the model never sees during training lets us detect overfitting. In general, we split our data set into two partitions:

  1. training: used to construct the model.
  2. test: stands in for future, unseen data.

The test data is only used to make a final estimate of the generalisation error. It is never used for fine-tuning the model. We can use the train_test_split() function to randomly split our data into training and test sets.


In [14]:
from sklearn.model_selection import train_test_split

# Let X be our input data consisting of
# 5 samples and 2 features
X = np.arange(10).reshape(5, 2)

# Let y be the target feature
y = [0, 1, 2, 3, 4]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
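
Because the split is random, repeated runs produce different partitions. The sketch below assumes we want a reproducible split, so it passes the random_state argument (the seed value 42 is arbitrary) and then checks the sizes of the two partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same toy data as above: 5 samples, 2 features
X = np.arange(10).reshape(5, 2)
y = [0, 1, 2, 3, 4]

# random_state fixes the shuffle, so the split is identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With 5 samples and test_size=0.3, the test set size is rounded up to 2
print(X_train.shape, X_test.shape)  # (3, 2) (2, 2)
```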

Estimators

An Estimator can be seen as a base class for any algorithm that learns from data. It can be a classification, regression or clustering algorithm or a transformer that extracts useful features from raw data.
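
As a quick illustration, the sketch below creates one estimator of each kind and checks that they all derive from sklearn.base.BaseEstimator and expose a fit() method. The four estimators chosen here are arbitrary examples, not a recommendation.

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

estimators = [
    LogisticRegression(),  # classification
    LinearRegression(),    # regression
    KMeans(n_clusters=2),  # clustering
    StandardScaler(),      # transformer
]

for est in estimators:
    # Each one is a BaseEstimator subclass and can learn from data via fit()
    print(type(est).__name__, isinstance(est, BaseEstimator), hasattr(est, "fit"))
```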

Hyperparameters and Parameters

In the documentation, there is a distinction between hyperparameters and parameters. Hyperparameters are settings that tune the learning algorithm itself. They are usually set when an estimator is initialised. Parameters, on the other hand, are the values, such as coefficients, that the learning algorithm finds from the data.


In [15]:
from sklearn.linear_model import LinearRegression
# The normalize argument was removed in scikit-learn 1.2, so we set
# a different hyperparameter here
lr = LinearRegression(fit_intercept=False)
print(lr) # outputs the name of the estimator and its non-default hyperparameters


LinearRegression(fit_intercept=False)
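
To make the distinction concrete, the following sketch reads the hyperparameters with get_params() before fitting and the learned parameters from the coef_ and intercept_ attributes after fitting. The toy data, generated from y = 2x + 1, is an assumption made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative assumption)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

lr = LinearRegression(fit_intercept=True)

# Hyperparameters: chosen by us, available before any data is seen
print(lr.get_params())

lr.fit(X, y)

# Parameters: learned from the data, stored in attributes ending in "_"
print(lr.coef_, lr.intercept_)  # close to [2.] and 1.0
```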

Common Methods

  • estimator.fit() fits the model to the training data. For supervised learning algorithms, the fit(X, y) method accepts two arguments: X is our input data and y is our target data. Unsupervised learning estimators need only the input data, fit(X).
  • estimator.predict(T) predicts target values for a new set of data T. Some classifiers also provide the estimator.predict_proba() method, which returns the probability of each instance belonging to each class.
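  • estimator.score() provides a standard way to evaluate the fitted model. For classifiers, it returns the mean accuracy, a value from 0 to 1. For regressors, it returns the coefficient of determination R², which is at most 1 and can be negative.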
  • estimator.transform(X_new) applies the transformation learned during fit() to data, including unseen data, for example by projecting it onto a new basis.
  • estimator.fit_transform() fits the transformer and transforms the training data in a single call, which may be more convenient and efficient than calling fit() and transform() separately.
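
The methods listed above can be sketched together on a tiny data set. The data, the choice of StandardScaler and LogisticRegression, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, two well-separated classes
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])
T = np.array([[0.5], [9.5]])  # new, unseen samples

# fit_transform() learns the scaling from X and applies it in one step
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# transform() reuses the scaling learned from X on the new data
T_scaled = scaler.transform(T)

clf = LogisticRegression()
clf.fit(X_scaled, y)

labels = clf.predict(T_scaled)       # predicted class labels
proba = clf.predict_proba(T_scaled)  # one column per class, rows sum to 1
accuracy = clf.score(X_scaled, y)    # mean accuracy on the given data

print(labels)
print(proba)
print(accuracy)
```

Note that the scaler is fitted on the training data only. The new samples are passed through transform(), never fit_transform(), so that nothing is learned from them.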

Train, Predict and Evaluate

  1. We begin by training our model on the training data using the estimator.fit(X, y) method.
  2. Once the model is trained, we can predict the target feature for our test data using the estimator.predict(T) method.
  3. Finally, we evaluate our model by comparing the predictions to the correct values. The evaluation metric depends on the type of problem, for example accuracy for classification or mean squared error for regression.
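
The three steps above can be sketched end to end as follows. The iris data set, the k-nearest neighbours classifier and the accuracy metric are choices made here for illustration. Any other estimator and a suitable metric would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in classification data set and hold out 30% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 1. Train the model on the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 2. Predict the target for the held-out test data
y_pred = knn.predict(X_test)

# 3. Evaluate by comparing the predictions to the correct values
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

For a classifier, knn.score(X_test, y_test) returns the same mean accuracy in a single call.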