Every algorithm is exposed via an Estimator which can be imported as
from sklearn.<family> import <model>
For example, for linear regression:
from sklearn.linear_model import LinearRegression
lm_model = LinearRegression(<estimator parameters>)
Estimator parameters are provided as arguments when you instantiate an Estimator; scikit-learn supplies sensible defaults for all of them.
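For example (fit_intercept is a real LinearRegression parameter whose default is True; it is shown explicitly here only as an illustration):
lm_model = LinearRegression(fit_intercept=True)  # pass a parameter explicitly instead of relying on the default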
In Scikit-learn, Estimators are designed such that you take X and y and split them for training and testing, fit the model on the training data with the fit() method, and generate predictions with the predict() method.
To split the input data into train and validation sets, use
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
to split the data into a 70% training set and a 30% test set. The function splits both the independent and dependent attributes along the same rows, so predictions can later be validated against held-out values.
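As a quick sanity check (assuming x and y are NumPy arrays or pandas objects, so they have a .shape attribute), you can confirm the 70/30 proportion:
print(x_train.shape, x_test.shape)  # roughly 70% and 30% of the rows of x
print(y_train.shape, y_test.shape)  # y is split along the same rows as x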
The general syntax is model.fit(independent_train, dependent_train). Thus
lm_model.fit(x_train, y_train)
In the case of unsupervised models there is no target variable, so the model is fitted on the feature data alone. Hence
model.fit(x_train)
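For instance, a clustering model such as KMeans is fitted on the features alone (a sketch; n_clusters=3 is an arbitrary choice):
from sklearn.cluster import KMeans
km_model = KMeans(n_clusters=3)  # no target labels are passed to fit()
km_model.fit(x_train)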
In Scikit-Learn, by convention, all model parameters learned during the fit() process have a trailing underscore; for example, this linear model exposes model.coef_ and model.intercept_.
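Continuing the linear regression example above, the learned parameters can be inspected after fit():
print(lm_model.coef_)       # one coefficient per independent attribute
print(lm_model.intercept_)  # the learned intercept term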
The model.score() method returns a single value, at most 1, indicating how well the model fits the data passed to it (R² for regressors, accuracy for classifiers). Comparing the score on the training data with the score on the test data is useful for understanding underfitting and overfitting.
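A simple sketch, reusing the fitted linear model and the split from above:
print(lm_model.score(x_train, y_train))  # R² on the training data
print(lm_model.score(x_test, y_test))    # a much lower test score suggests overfitting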
Use model.predict(<independent test data>). Thus, for linear regression,
y_predicted = lm_model.predict(x_test)
In the case of classification problems, you also get a model.predict_proba() method, which returns the probability of each class; model.predict() returns the class with the highest probability.
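A sketch with LogisticRegression (assuming y_train holds class labels rather than the continuous target used in the regression example above):
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train, y_train)             # y_train must contain class labels here
print(clf.predict_proba(x_test)[:5])  # one probability per class; each row sums to 1
print(clf.predict(x_test)[:5])        # the class with the highest probability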
Relevant for unsupervised models, model.transform() is used to transform the input data into a new basis. Some models combine the fitting and transformation in one step using the model.fit_transform() method.
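A common example is feature scaling with StandardScaler (a sketch; any transformer follows the same pattern):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # fit() and transform() in one step
x_test_scaled = scaler.transform(x_test)        # reuse the statistics learned from the training data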
You can obtain the MAE (Mean Absolute Error), MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) from the metrics module.
from sklearn import metrics
import numpy as np
metrics.mean_absolute_error(y_test, y_predicted)
metrics.mean_squared_error(y_test, y_predicted)
np.sqrt(metrics.mean_squared_error(y_test, y_predicted))