Scikit-learn basics

Library constructs


Every algorithm is exposed via an Estimator which can be imported as

from sklearn.<family> import <model>

For example, for linear regression:

from sklearn.linear_model import LinearRegression
lm_model = LinearRegression(<estimator parameters>)

Estimator parameters are provided as arguments when you instantiate an Estimator. Sklearn provides good defaults.

In Scikit-learn, Estimators are designed around the following principles:

  • consistency: all estimators share a common interface
  • inspection: the hyperparameters you set when you instantiate an estimator are available for inspection as attributes of that object
  • limited hierarchy: only the algorithms are represented as Python objects. Training data, results, and parameter names use standard Python, NumPy, or Pandas types
  • composition: many workflows can be expressed as a sequence of more fundamental algorithms
  • sensible defaults: you guessed it.
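
As a small illustration of the inspection principle, the sketch below sets one hyperparameter and reads it back. The choice of fit_intercept=False is only an example value, not a recommendation.

```python
from sklearn.linear_model import LinearRegression

# Hyperparameters are passed to the constructor ...
lm_model = LinearRegression(fit_intercept=False)

# ... and stay available on the object, either as attributes
print(lm_model.fit_intercept)

# or all together as a dictionary through get_params()
params = lm_model.get_params()
print(params["fit_intercept"])
```

Any hyperparameter you do not set explicitly shows up in get_params() with its default value.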

General steps when using Scikit-learn

  • choose a class of model
  • instantiate a model from the class by specifying hyperparameters to its constructor
  • arrange data into X and y and split them for training and testing
  • fit / learn the model on training data by calling the fit() method
  • predict new values by calling the predict() method
  • evaluate results
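
The steps above can be sketched end to end as follows. The data is synthetic (y = 2x + 1 plus a little noise) and is purely an assumption for illustration; substitute your own X and y.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: y = 2x + 1 with small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

# Arrange the data and split it for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Choose a model class and instantiate it (defaults used here)
model = LinearRegression()

# Fit on the training data, then predict on the held-out data
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)

# Evaluate the results
mae = metrics.mean_absolute_error(y_test, y_predicted)
print(mae)
```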

Train-test split

To split the input data into train and test sets, use

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

to split it into 70% train and 30% test sets. The function splits the independent (x) and dependent (y) attributes together, so that predictions made on x_test can be validated against y_test.
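
A quick way to see what the split does is to check the resulting shapes. The arrays below are made up for illustration, and random_state=42 is an arbitrary seed used only to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration: 100 samples, 2 features.
# Row i of x is [2i, 2i + 1] and y[i] is i, so alignment is easy to check.
x = np.arange(200).reshape(100, 2)
y = np.arange(100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(x_train.shape, x_test.shape)  # 70 training rows, 30 test rows
print(y_train.shape, y_test.shape)

# Rows of x and entries of y are shuffled together, so they stay paired
print(np.array_equal(x_train[:, 0], 2 * y_train))
```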


The general syntax is model.fit(<independent_train>, <dependent_train>). Thus, for linear regression, lm_model.fit(x_train, y_train)

In the case of unsupervised models you have only input data and no dependent (target) values. Hence fit() takes the input data alone.

In Scikit-learn, by convention, all model parameters that are learned during the fit() process have trailing underscores. For example, in this linear model we have model.coef_ and model.intercept_.
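
A minimal sketch of reading the learned parameters after fitting. The data is assumed for illustration and follows y = 3x + 2 exactly, so the fitted slope and intercept should come out very close to 3 and 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data, assumed for illustration: y = 3x + 2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 2.0

model = LinearRegression()
model.fit(X, y)

# Attributes with a trailing underscore exist only after fit()
print(model.coef_)       # one slope per feature
print(model.intercept_)  # the fitted intercept
```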

Training score

The model.score() method returns a value summarizing how well the model fits the data passed to it, typically between 0 and 1 (R² for regressors, mean accuracy for classifiers). Comparing the score on the training data with the score on the test data helps reveal underfitting and overfitting.
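
To make the overfitting point concrete, the sketch below compares training and test scores for two decision trees. DecisionTreeRegressor is swapped in here only because it overfits easily; the noisy sine data and the max_depth=3 setting are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data, assumed for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can memorize the training data
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
deep_train = deep_tree.score(X_train, y_train)
deep_test = deep_tree.score(X_test, y_test)

# A depth-limited tree is a simpler model
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
shallow_train = shallow_tree.score(X_train, y_train)
shallow_test = shallow_tree.score(X_test, y_test)

print("deep   :", deep_train, deep_test)
print("shallow:", shallow_train, shallow_test)
```

A training score far above the test score, as expected for the unrestricted tree, is the usual sign of overfitting; low scores on both sets would suggest underfitting.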


To predict new values, use model.predict(<independent_test data>). Thus, for linear regression,

y_predicted = lm_model.predict(x_test)

Prediction probabilities

In classification problems, many classifiers also provide a model.predict_proba() method, which returns the probability of each class. model.predict() then returns the class with the highest probability.
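
A minimal sketch of the relationship between predict_proba() and predict(), using LogisticRegression as an example classifier. The one-feature toy data is assumed for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data, assumed for illustration
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# One row per sample, one column per class (ordered as in clf.classes_)
proba = clf.predict_proba(X)
print(proba.shape)

# predict() picks the class whose probability is highest
labels = clf.predict(X)
print(labels)
print(clf.classes_[np.argmax(proba, axis=1)])
```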


Relevant mainly to unsupervised models, model.transform() transforms input data to a new basis. Some models combine fitting and transformation in one step through the model.fit_transform() method.
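
As one example of a change of basis, the sketch below uses PCA to project two correlated features onto a single component. PCA is chosen here only for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2-D data, assumed for illustration
rng = np.random.default_rng(0)
t = rng.normal(size=50)
X = np.column_stack([t, 2.0 * t + rng.normal(scale=0.1, size=50)])

pca = PCA(n_components=1)

# Fit and transform in one step ...
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)

# ... or reuse the already fitted model to transform data
X_again = pca.transform(X)
print(np.allclose(X_reduced, X_again))
```

Note that no y is passed anywhere: the model learns the new basis from X alone.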


You can obtain the MAE (Mean Absolute Error), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) from the metrics module.

import numpy as np
from sklearn import metrics

metrics.mean_absolute_error(y_test, y_predicted)  # MAE
metrics.mean_squared_error(y_test, y_predicted)  # MSE
np.sqrt(metrics.mean_squared_error(y_test, y_predicted))  # RMSE
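
For a sense of what these numbers mean, here is a small worked example with made-up values, computing each metric both with the metrics module and by hand with NumPy.

```python
import numpy as np
from sklearn import metrics

# Made-up values for illustration; the errors are -1, 0 and 2
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mae = metrics.mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3
mse = metrics.mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                                # square root of the MSE

# The same quantities written out with NumPy
errors = y_pred - y_true
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)

print(mae, mse, rmse)
```

RMSE is in the same units as y, which often makes it easier to interpret than MSE.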
