```
506
```

```
13
```

```
Minimum price: 5 Maximum price: 50
```

```
Mean: 22.533 Median: 21.2
```

```
9.188
```

```
"explained_variance_score" is used as the measure of model
performance.
```

```
Using other metrics led GridSearch to choose models that were less complex
and underfit the data. I got really bizarre results when I tried to correlate
the predicted prices with the actual prices. I evaluated mean_absolute_error,
mean_squared_error, r2_score, and finally explained_variance_score. I still
can't intuitively understand why explained_variance_score performs well, though.
```
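
As a sketch of what the chosen metric measures, here is explained variance computed from its definition, 1 - Var(y_true - y_pred) / Var(y_true). The toy arrays are made up for illustration, not taken from the housing data.

```python
# Explained variance from its definition (a sketch, not sklearn's code):
# score = 1 - Var(y_true - y_pred) / Var(y_true)
def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def explained_variance(y_true, y_pred):
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)

# Made-up prices (in $1000s), purely for illustration.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.0, 33.0, 34.5]
print(round(explained_variance(y_true, y_pred), 3))  # 0.944
```

Unlike r2_score, this score ignores a constant bias in the predictions, because only the variance of the residuals (not their mean) enters the formula.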

```
If we don't split the data into test and training sets, and instead
train the model on the complete data set, we may end up with a model
that overfits the training data and fails to generalize. So it
is important to split the data into training and testing sets, and
use the testing set to validate the model.
```
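
A minimal sketch of such a hold-out split (70% train / 30% test), using only the standard library; in the project itself, sklearn's train_test_split does this job.

```python
import random

def split(data, test_fraction=0.3, seed=0):
    """Shuffle the data and carve off a test portion (a sketch of a hold-out split)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 506 points, as in the dataset above.
train, test = split(list(range(506)))
print(len(train), len(test))  # 355 151
```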

```
"grid search" takes a parameter grid as one of the inputs (parameter grid
depends on the estimator used) and runs training for each of the
parameter from the grid. It also takes scorer function as input
and uses it to compare the runs and choose the best parameter
set. It is useful when we don't know what parameter will be ideal
for an estimator and want to try a set of parameters and choose
the best one.
```
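
The search loop can be sketched as follows; the one-parameter "model" and its scorer are toy stand-ins (a real GridSearchCV would cross-validate an estimator at every grid point).

```python
# Sketch of what grid search does: score every candidate, keep the best.
def grid_search(param_values, train_and_score):
    best_param, best_score = None, float("-inf")
    for p in param_values:
        score = train_and_score(p)
        if score > best_score:
            best_param, best_score = p, score
    return best_param, best_score

# Toy scorer that pretends depth 5 is optimal (made-up, for illustration).
score_of = lambda depth: -(depth - 5) ** 2
print(grid_search(range(1, 11), score_of))  # (5, 0)
```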

```
Splitting data into training and test sets leaves less data
available for training. With cross validation in place, we get the
advantage of training and testing on the complete data set. This is
because, in each iteration of cross validation, a portion of the data is used
for testing and the remainder for training. This repeats until the model
has been trained and tested on all the available data.
```
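
The iteration amounts to k-fold index bookkeeping: every point lands in a test fold exactly once across the k iterations. This mirrors what sklearn's KFold produces.

```python
# Sketch of k-fold cross-validation index generation.
def k_fold_indices(n, k):
    # First n % k folds get one extra point so all n points are covered.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        folds.append((train_idx, test_idx))
        start += size
    return folds

for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))  # 6 4, then 7 3, then 7 3
```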

```
The training error increases as the training size increases. This
is because, when the data set is small, almost any model can fit
it reasonably well. But as the training size increases, it is not
so easy to find a model that fits all the data points well. So the
training error increases with the size of the data.
On the other hand, the testing error is high when the data size is small. This
is because when there is little data, even a trivial model
can fit it well, but such a model won't generalize to new data
points. As the training size increases, the model gets
complex enough to fit the data reasonably well, and it is
likely that it fits the testing data well too. So, the testing
error reduces with the increase in the size of the data.
```
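
The trend can be illustrated with a deliberately over-simple model: predict the training mean. On synthetic data (made up here, not the housing set), the training error tends to grow and the test error tends to shrink as the training set grows.

```python
import random

rng = random.Random(0)
# Synthetic "prices": a constant signal of 20 plus noise (illustration only).
pool = [20.0 + rng.gauss(0.0, 3.0) for _ in range(300)]
train_pool, test_set = pool[:200], pool[200:]

def mse(values, prediction):
    return sum((v - prediction) ** 2 for v in values) / len(values)

for n in (2, 20, 200):
    train = train_pool[:n]
    prediction = sum(train) / len(train)  # the "fitted" constant model
    print(n, round(mse(train, prediction), 2), round(mse(test_set, prediction), 2))
```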

```
With max_depth 1, when the model is fully trained, it suffers from high
bias (underfitting), as the model is not complex enough for the
data. Throwing more data at the model doesn't help,
because the model cannot become complex enough to represent the
data accurately.
With max_depth 10, when the model is fully
trained, it suffers from high variance, i.e. it overfits the
data. This showed up in the learning curve when observing the
changes in training error. With a high enough max_depth, the model
got complex and closely followed the training data (we could say it pretty much
memorized the data) but failed to generalize to a new set of data.

```
The training and testing errors decrease as the model
complexity increases. But after a certain stage, the model becomes
unnecessarily complex: there is no further improvement in model
performance, only added complexity.
```
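
One way to operationalize "stop where performance stops improving" is to pick the smallest complexity whose score is within a tolerance of the best. The scores below are made up to mirror the curve described, not measured values.

```python
# Hypothetical complexity -> validation score, flat after complexity 5.
scores = {1: 0.42, 2: 0.61, 3: 0.72, 4: 0.78, 5: 0.81, 6: 0.81, 7: 0.81}

def pick_complexity(scores, tolerance=0.005):
    best = max(scores.values())
    # Smallest complexity whose score is within tolerance of the best.
    return min(c for c, s in scores.items() if s >= best - tolerance)

print(pick_complexity(scores))  # 5
```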

```
Desired model complexity: 5
Predicted price: 20.96776316
```

```
I tried to plot the predictions against the earlier prices to see if the
model is valid. I expected a linear trend in the scatter
plot. What I saw was a mostly linear trend,
with the data points more scattered when the model complexity was 4
than when it was 7. I got predictions in the range of 19.0 to
21.0.
Example scatter plot from a run:
```