In [13]:
import numpy as np
Using held-out validation data, we can detect and avoid model overfitting. In general, we split our data set into two partitions:
- The training data is used to fit and tune the model.
- The test data is only used to make a final estimate of the generalisation error. It is never used for fine-tuning the model.
We can use the train_test_split() function to randomly split our data into training and test sets.
In [14]:
from sklearn.model_selection import train_test_split
# Let X be our input data consisting of
# 5 samples and 2 features
X = np.arange(10).reshape(5, 2)
# Let y be the target feature
y = [0, 1, 2, 3, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
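With only 5 samples and test_size=0.3, scikit-learn rounds the test partition up, so we end up with 3 training samples and 2 test samples. A quick check, reusing the variables from the split above:
In [ ]:
print(X_train.shape)  # (3, 2) -- the remaining 70% of the samples
print(X_test.shape)   # (2, 2) -- 30% of the samples, rounded up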
In the documentation, a distinction is made between hyperparameters and parameters. Hyperparameters refer to algorithm settings that are used to tune the algorithm itself; they are usually set when an estimator is initialised. Parameters, on the other hand, refer to the coefficients found by the learning algorithm.
In [15]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True)  # hyperparameter set at initialisation
print(lr) # outputs the name of the estimator and its hyperparameters
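Once the estimator is fitted, the parameters it has learned are exposed as attributes with a trailing underscore. As a small illustration, reusing the X_train and y_train from the split above (our own choice of data here):
In [ ]:
lr.fit(X_train, y_train)  # learn the coefficients from the training data
print(lr.coef_)           # parameters: one coefficient per input feature
print(lr.intercept_)      # parameter: the fitted intercept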
- estimator.fit() fits the training data. For supervised learning algorithms, the fit(X, y) method accepts two arguments: X is our input data and y is our target data. Unsupervised learning estimators accept a single argument in their fit(X) method.
- estimator.predict(T) predicts target features for a new set of data T. Some estimators also provide the estimator.predict_proba() method, which returns the probability of an instance belonging to each of our target values.
- estimator.score() provides a standard way to evaluate the model that is created by the estimator. For classifiers, the returned score is the mean accuracy, a value from 0 to 1.
- estimator.transform(X_new) transforms new data into the new basis, applying the fitted transformation model to unseen data.
- estimator.fit_transform() may be more convenient and efficient, fitting and transforming the training data in a single step.
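As a sketch of these methods working together, here is a minimal example; the iris data, LogisticRegression, and StandardScaler are our own choices for illustration, not prescribed by the discussion above:
In [ ]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.3)

clf = LogisticRegression(max_iter=200)
clf.fit(X_tr, y_tr)                 # supervised fit(X, y)
print(clf.predict(X_te[:3]))        # predicted classes for new data
print(clf.predict_proba(X_te[:3]))  # probability of each target value
print(clf.score(X_te, y_te))        # mean accuracy, between 0 and 1

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # fit and transform in one step
X_te_s = scaler.transform(X_te)      # apply the same transformation to unseen data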