Every algorithm is exposed via an Estimator, which can be imported as
from sklearn.<family> import <model>
For example, for linear regression:
from sklearn.linear_model import LinearRegression
lm_model = LinearRegression(<estimator parameters>)
Estimator parameters are passed as arguments when you instantiate an Estimator. Scikit-learn provides sensible defaults for most of them, so LinearRegression() with no arguments also works.
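As a small illustration of estimator parameters, the sketch below instantiates LinearRegression once with defaults and once with an explicit argument, then reads the values back with get_params(). The variable names default_model and no_intercept_model are chosen here for illustration only.

```python
from sklearn.linear_model import LinearRegression

# One estimator with default parameters, one with an explicit override.
default_model = LinearRegression()
no_intercept_model = LinearRegression(fit_intercept=False)

# get_params() returns the constructor parameters as a dict.
print(default_model.get_params()["fit_intercept"])       # True
print(no_intercept_model.get_params()["fit_intercept"])  # False
```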
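In Scikit-learn, Estimators are designed around a common workflow:
1. Prepare the input data as X and y, and split them for training and testing.
2. Train the model with the fit() method.
3. Make predictions with the predict() method.
To split the input data into training and test sets, use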
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3)
to split it into 70% training and 30% test data. The function splits the independent (x) and dependent (y) attributes together and keeps their rows aligned, so that predictions made on x_test can be validated against y_test.
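A minimal sketch of the split on synthetic data follows. The arrays are made up for illustration, and random_state is an addition here so that the shuffle is reproducible; it is not part of the call shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 10 samples, 2 features.
# Row i of x is [2*i, 2*i + 1] and its target is y[i] = i.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state is fixed only to make the shuffled split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

print(x_train.shape, x_test.shape)  # (7, 2) (3, 2)

# Rows of x and y stay paired after shuffling.
print(np.array_equal(x_train[:, 0] // 2, y_train))  # True
```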
The general syntax is model.fit(independent_train, dependent_train). Thus
lm_model.fit(x_train, y_train)
Unsupervised models have no dependent variable (y) to learn from, so fit() takes only the independent data. Hence
model.fit(x_train)
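To make the unsupervised call concrete, here is a small sketch that assumes KMeans as the example model (the text does not name one) and uses four made-up points forming two obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points for illustration: two near the origin, two near (10, 10).
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [10.1, 9.9]])

# n_init and random_state are set explicitly for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(x_train)  # only the features are passed, no y

# labels_ holds the cluster index assigned to each training sample.
print(kmeans.labels_)
```

The first two points should share one label and the last two the other; which group gets index 0 is arbitrary.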
In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have model.coef_ and model.intercept_.
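The learned attributes can be inspected after fitting. The sketch below uses noise-free synthetic data generated from y = 2x + 1 (an assumption made here so the expected values are known in advance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 2*x + 1.
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * x_train.ravel() + 1.0

lm_model = LinearRegression()

# Before fit(), the learned attributes do not exist yet.
print(hasattr(lm_model, "coef_"))  # False

lm_model.fit(x_train, y_train)

print(lm_model.coef_)       # approximately [2.]
print(lm_model.intercept_)  # approximately 1.0
```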
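The model.score() method returns a single number summarizing how well the model fits the data you pass to it. For regressors such as LinearRegression this is the coefficient of determination R², which is at most 1 and can be negative for a very poor fit; for classifiers it is the mean accuracy, between 0 and 1. Comparing the score on the training data with the score on the test data helps to reveal underfitting and overfitting.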
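A short sketch of scoring on both splits follows. The data is synthetic (a noisy line, invented for this example), and r2_score from sklearn.metrics is used only to show that a regressor's score() is the R² of its predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: y = 3*x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x.ravel() + rng.normal(0, 1.0, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)
lm_model = LinearRegression().fit(x_train, y_train)

# score() evaluates on whatever data is passed in.
train_r2 = lm_model.score(x_train, y_train)
test_r2 = lm_model.score(x_test, y_test)
print(train_r2, test_r2)

# For a regressor, score() matches r2_score on the model's predictions.
print(np.isclose(test_r2, r2_score(y_test, lm_model.predict(x_test))))  # True
```

A training score far above the test score would suggest overfitting; low scores on both would suggest underfitting.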
Use model.predict(<independent_test data>). Thus, for linear regression,
y_predicted = lm_model.predict(x_test)
For classification problems, many estimators also provide a model.predict_proba() method, which returns the probability of each class for every sample. model.predict() returns the class with the highest probability.
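The relationship between the two methods can be sketched as follows, assuming LogisticRegression as the classifier (the text does not name one) and a tiny made-up one-feature dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration: small values are class 0, large are class 1.
x_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(x_train, y_train)

x_new = np.array([[1.5], [9.5]])
proba = clf.predict_proba(x_new)  # shape (n_samples, n_classes)
labels = clf.predict(x_new)

# Columns of proba follow the order of clf.classes_; each row sums to 1.
print(clf.classes_)
print(proba)
print(labels)  # [0 1]
```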
Mostly relevant to unsupervised models and preprocessing steps, model.transform() transforms input data to a new basis. Some models combine fitting and transformation in one step with the model.fit_transform() method.
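As an illustration of a change of basis, the sketch below assumes PCA as the transformer (chosen here as an example; the text does not name a model) and random synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data for illustration: 50 samples, 3 features.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 3))

# Two steps: learn the principal axes, then project the data onto them.
pca = PCA(n_components=2)
pca.fit(x_train)
x_two_step = pca.transform(x_train)

# One step: fit and project in a single call.
x_one_step = PCA(n_components=2).fit_transform(x_train)

print(x_two_step.shape)  # (50, 2)
```

Both routes should give the same projection up to floating-point error. The fitted pca object can also transform new data, such as a test set, with the axes learned from the training set.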
You can obtain the MAE (Mean Absolute Error), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) using the metrics module.
from sklearn import metrics
import numpy as np
metrics.mean_absolute_error(y_test, y_predicted)
metrics.mean_squared_error(y_test, y_predicted)
np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
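A worked example with small hand-checkable numbers shows what each metric computes. The arrays below are made up for illustration and are not model output.

```python
import numpy as np
from sklearn import metrics

# Illustrative values only; the errors are 1, 0, -1, -2.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 11.0])

mae = metrics.mean_absolute_error(y_test, y_predicted)  # (1 + 0 + 1 + 2) / 4 = 1.0
mse = metrics.mean_squared_error(y_test, y_predicted)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                                     # sqrt(1.5), about 1.2247

print(mae, mse, rmse)
```

RMSE is in the same units as y, which often makes it easier to interpret than MSE.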