Scikit-learn basics

Library constructs

Estimator

Every algorithm is exposed via an Estimator, which can be imported as

from sklearn.<family> import <model>

For example, for linear regression:

from sklearn.linear_model import LinearRegression
lm_model = LinearRegression(<estimator parameters>)

Estimator parameters are provided as arguments when you instantiate an Estimator. Scikit-learn provides sensible defaults for all of them.

In Scikit-learn, Estimators are designed around the following principles:

  • consistency: all estimators share a common interface
  • inspection: the hyperparameters you set when you instantiate an estimator are available for inspection as attributes of that object (see the sketch after this list)
  • limited hierarchy: only the algorithms are represented as Python objects; training data, results, and parameter names use standard Python, NumPy, or Pandas types
  • composition: many workflows can be achieved as a series of more fundamental algorithms
  • sensible defaults: you guessed it.
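
For example, the inspection principle means you can read hyperparameters back from any estimator after instantiation. A minimal sketch (the Ridge estimator and alpha value here are arbitrary choices):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.5)
print(ridge.get_params())  # all hyperparameters as a dict, e.g. {'alpha': 0.5, ...}
print(ridge.alpha)         # each hyperparameter is also a plain attribute: 0.5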

General steps when using Scikit-learn

  • choose a class of model
  • instantiate a model from the class by specifying hyperparameters to its constructor
  • arrange data into X and y and split them into training and test sets
  • fit / learn the model on the training data by calling the fit() method
  • predict new values by calling the predict() method
  • evaluate the results (see the end-to-end sketch after this list)
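
Tying these steps together, a minimal end-to-end sketch (the synthetic data, seed, and variable names are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# illustrative synthetic data: y is roughly linear in x
rng = np.random.RandomState(0)
x = rng.rand(100, 1)
y = 3 * x.ravel() + rng.normal(scale=0.1, size=100)

# split, instantiate, fit, predict, evaluate
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
model = LinearRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)
print(model.score(x_test, y_test))  # R^2 on the held-out data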

Train-test split

To split the input data into training and test sets, use

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

to split the data into a 70% training set and a 30% test set. The method splits the independent (x) and dependent (y) attributes together, so each test observation keeps its matching label for validating predictions.
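
For a reproducible split across runs, you can also pass a fixed random_state (the seed value 42 is an arbitrary assumption):

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)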

Training

The general syntax is model.fit(independent_train, dependent_train). Thus, for the linear regression model above:

lm_model.fit(x_train, y_train)

In the case of unsupervised models there is no dependent variable (y) to supply, so fit() takes only the training data. Hence

model.fit(x_train)
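
For example, a clustering model such as KMeans learns from the data alone. A minimal sketch (the number of clusters is an illustrative assumption):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10)  # n_clusters chosen arbitrarily here
kmeans.fit(x_train)                       # no dependent variable is needed
print(kmeans.labels_)                     # learned cluster assignment per row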

In Scikit-learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have lm_model.coef_ and lm_model.intercept_.
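
A quick sketch of inspecting these learned attributes on the fitted model above:

print(lm_model.coef_)       # learned slope(s), one per feature
print(lm_model.intercept_)  # learned intercept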

Training score

The model.score() method returns a value indicating how well the model fits the given data: for regression models this is the R² coefficient of determination (1.0 is a perfect fit; it can be negative for very poor models), and for classifiers it is the mean accuracy (0 to 1). Comparing the score on training data with the score on test data is useful for understanding underfitting and overfitting.
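
A sketch of comparing the two scores; a training score far above the test score suggests overfitting:

print(lm_model.score(x_train, y_train))  # score on data the model has seen
print(lm_model.score(x_test, y_test))    # score on held-out data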

Prediction

Use model.predict(<independent test data>). Thus, for linear regression,

y_predicted = lm_model.predict(x_test)

Prediction probabilities

In classification problems, many estimators also provide a model.predict_proba() method, which returns the probability of each class for every observation; model.predict() returns the class with the highest probability.
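
A minimal sketch with a logistic regression classifier (the synthetic two-class data here is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression

# illustrative data: class 1 whenever the single feature exceeds 0.5
X = np.linspace(0, 1, 20).reshape(-1, 1)
y_cls = (X.ravel() > 0.5).astype(int)

clf = LogisticRegression()
clf.fit(X, y_cls)
print(clf.predict_proba(X[:2]))  # one probability per class, rows sum to 1
print(clf.predict(X[:2]))        # the class with the highest probability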

Transformation

Relevant to unsupervised models, model.transform() transforms input data into a new basis. Some models combine fitting and transformation in one step with the model.fit_transform() method.
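
For example, PCA learns a new basis with fit() and projects data onto it with transform(), or does both at once with fit_transform(). A minimal sketch (the 5-feature random data is an illustrative assumption):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 5)  # illustrative 5-feature data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # fit the new basis and project in one step
print(X_reduced.shape)            # (100, 2)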

Validation

You can obtain the MAE (Mean Absolute Error) and MSE (Mean Squared Error) from the metrics module; the RMSE (Root Mean Squared Error) is the square root of the MSE and is expressed in the same units as the target, which makes it easier to interpret.

from sklearn import metrics
import numpy as np

metrics.mean_absolute_error(y_test, y_predicted)
metrics.mean_squared_error(y_test, y_predicted)
np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
