Regression based on the Iris dataset

We'll use the Iris dataset in a regression setup:

  • we will not use the target variable (that's the typical classification case)
  • we will use petal width (cm) as the dependent variable and the remaining features as independent variables

In [3]:
import sklearn.datasets as datasets
import pandas as pd
iris=datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head(2)


Out[3]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5               1.4               0.2
1                4.9               3.0               1.4               0.2

Regression with a Decision Tree Regressor


In [7]:
independent_vars = ['sepal length (cm)','sepal width (cm)', 'petal length (cm)']
dependent_var = 'petal width (cm)'

X = df[independent_vars]
y = df[dependent_var]

from sklearn import tree
model = tree.DecisionTreeRegressor()
model.fit(X,y)


Out[7]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
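
With the fitted tree we can predict the petal width for new measurements. A minimal sketch; the sample values below are made up for illustration:

In [ ]:
# hypothetical measurements: sepal length, sepal width, petal length (cm)
sample = pd.DataFrame([[5.0, 3.4, 1.5]], columns=independent_vars)
model.predict(sample)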

In [8]:
# get feature importances
importances = model.feature_importances_
pd.Series(importances, index=independent_vars)


Out[8]:
sepal length (cm)    0.020756
sepal width (cm)     0.024170
petal length (cm)    0.955074
dtype: float64
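
Petal length clearly dominates. As a quick sketch: the importances are normalized to sum to 1, and a bar plot makes the imbalance obvious:

In [ ]:
imp = pd.Series(importances, index=independent_vars)
print(imp.sum())       # feature importances always sum to 1.0
_ = imp.plot(kind='bar')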

Scoring the Regression

Some evaluation metrics, like mean squared error, are naturally descending scores: the smallest value is best. Scikit-learn scorers, however, follow the convention that higher return values are better. To keep this consistent, metrics which measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric. This is why some scores are reported as negative even though the underlying metric can never be negative.
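
A quick sketch to verify this convention, using metrics.get_scorer to look up the scorer by name; the shallow depth-1 tree is only there so the training error is not trivially zero:

In [ ]:
from sklearn import metrics
# sanity check: the named scorer is just the negated metric (higher is better)
shallow = tree.DecisionTreeRegressor(max_depth=1).fit(X, y)
scorer = metrics.get_scorer('neg_mean_squared_error')
print(scorer(shallow, X, y))
print(-metrics.mean_squared_error(y, shallow.predict(X)))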


In [13]:
from sklearn import model_selection
results = model_selection.cross_val_score(tree.DecisionTreeRegressor(), X, y, cv=10, scoring='neg_mean_squared_error')
print("MSE: %.3f (%.3f)") % (results.mean(), results.std())


MSE: -0.059 (0.046)

Regression Performance over tree depth


In [23]:
import matplotlib.pyplot as plt
from sklearn import model_selection
scores = []
depths = range(1, 25)
for depth in depths:
    # cross_val_score clones the estimator, so one instance can be reused for both scorings
    regressor = tree.DecisionTreeRegressor(max_depth=depth)
    scores.append({
        'neg_mean_squared_error': model_selection.cross_val_score(regressor, X, y, cv=10, scoring='neg_mean_squared_error').mean(),
        'neg_median_absolute_error': model_selection.cross_val_score(regressor, X, y, cv=10, scoring='neg_median_absolute_error').mean(),
    })

_ = pd.DataFrame(data=scores, index=depths).plot()



In [24]:
# a max_depth of around 5 looks like the best choice for this regression
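
To confirm the visual read-off, a minimal sketch that picks the depth with the highest mean cross-validated score from the scores collected above; idxmax works because the neg_* scores follow the "higher is better" convention:

In [ ]:
results_df = pd.DataFrame(data=scores, index=depths)
print(results_df.idxmax())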
