This notebook continues the starting_your_ml_project tutorial while working with the Iowa housing data instead of the Melbourne data.



In [1]:

    
import pandas as pd

filepath = 'input/train.csv'
iowa = pd.read_csv(filepath)
print(iowa.head())
iowa.describe()









    



   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities    ...     PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
1         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
2         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
3         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
4         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   

  MoSold YrSold  SaleType  SaleCondition  SalePrice  
0      2   2008        WD         Normal     208500  
1      5   2007        WD         Normal     181500  
2      9   2008        WD         Normal     223500  
3      2   2006        WD        Abnorml     140000  
4     12   2008        WD         Normal     250000  

[5 rows x 81 columns]






    Out[1]:

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000
mean    730.500000    56.897260    70.049958   10516.828082     6.099315
std     421.610009    42.300571    24.284752    9981.264932     1.382997
min       1.000000    20.000000    21.000000    1300.000000     1.000000
25%     365.750000    20.000000    59.000000    7553.500000     5.000000
50%     730.500000    50.000000    69.000000    9478.500000     6.000000
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000
max    1460.000000   190.000000   313.000000  215245.000000    10.000000

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000
mean     94.244521    46.660274      21.954110     3.409589    15.060959
std     125.338794    66.256028      61.119149    29.317331    55.757415
min       0.000000     0.000000       0.000000     0.000000     0.000000
25%       0.000000     0.000000       0.000000     0.000000     0.000000
50%       0.000000    25.000000       0.000000     0.000000     0.000000
75%     168.000000    68.000000       0.000000     0.000000     0.000000
max     857.000000   547.000000     552.000000   508.000000   480.000000

          PoolArea       MiscVal       MoSold       YrSold      SalePrice
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890
std      40.177307    496.123024     2.703626     1.328095   79442.502883
min       0.000000      0.000000     1.000000  2006.000000   34900.000000
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000

8 rows × 38 columns

Now it's time for you to define and fit a model for your data (in your notebook).
Select the target variable you want to predict.
You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable).
Save this to a new variable called y.
Create a list of the names of the predictors we will use in the initial model.
Use just the following columns in the list (you may need to remove or replace NaN values from some of the predictors):

LotArea
YearBuilt
1stFlrSF
2ndFlrSF
FullBath
BedroomAbvGr
TotRmsAbvGrd

Using the list of variable names you just created, select a new DataFrame of the predictors data.
Save this with the variable name X.
Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model).
Ensure you've done the relevant import so you can run this command.
Fit the model you have created using the data in X and the target data you saved above.
Make a few predictions with the model's predict command and print out the predictions.



In [2]:

    
y = iowa.SalePrice
y.describe()









    Out[2]:





count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64



In [3]:

    
iowa.columns









    Out[3]:





Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')



In [4]:

    
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = iowa[iowa_predictors]
X.describe()









    Out[4]:

             LotArea    YearBuilt     1stFlrSF     2ndFlrSF     FullBath  \
count    1460.000000  1460.000000  1460.000000  1460.000000  1460.000000
mean    10516.828082  1971.267808  1162.626712   346.992466     1.565068
std      9981.264932    30.202904   386.587738   436.528436     0.550916
min      1300.000000  1872.000000   334.000000     0.000000     0.000000
25%      7553.500000  1954.000000   882.000000     0.000000     1.000000
50%      9478.500000  1973.000000  1087.000000     0.000000     2.000000
75%     11601.500000  2000.000000  1391.250000   728.000000     2.000000
max    215245.000000  2010.000000  4692.000000  2065.000000     3.000000

       BedroomAbvGr  TotRmsAbvGrd
count   1460.000000   1460.000000
mean       2.866438      6.517808
std        0.815778      1.625393
min        0.000000      2.000000
25%        2.000000      5.000000
50%        3.000000      6.000000
75%        3.000000      7.000000
max        8.000000     14.000000



In [5]:

    
from sklearn.tree import DecisionTreeRegressor

iowa_model = DecisionTreeRegressor()
iowa_model.fit(X,y)









    Out[5]:





DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')



In [6]:

    
print(X.head())
iowa_model.predict(X.head())









    



   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  






    Out[6]:





array([208500., 181500., 223500., 140000., 250000.])

Model Validation

You've built a model.
But how good is it?
You'll need to answer this question for every model you ever build.
In most (though not necessarily all) applications, the relevant measure of model quality is predictive accuracy.
In other words, will the model's predictions be close to what actually happens?
Some people try to answer this question by making predictions with their training data.
They compare those predictions to the actual target values in the training data.
This approach has a critical shortcoming, which you will see in a moment (and which you'll subsequently see how to solve).
Even with this simple approach, you'll need to summarize the model quality into a form that someone can understand.
If you have predicted and actual home values for 10,000 houses, you will inevitably end up with a mix of good and bad predictions.

Looking through such a long list would be pointless.
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).
Let's break down this metric starting with the last word, error.
The prediction error for each house is:
error = actual − predicted
So, if a house cost $150,000 and you predicted it would cost $100,000, then the error is $50,000.
With the MAE metric, we take the absolute value of each error.
This converts each error to a positive number.
We then take the average of those absolute errors.
This is our measure of model quality.
In plain English, it can be read as:
"On average, our predictions are off by about X".
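
As a small worked example of the calculation described above, the sketch below computes MAE by hand for three houses. The prices are made up for illustration and are not rows from the Iowa data; the first house is the $150,000 versus $100,000 case from the text.

```python
# Illustrative prices only; these are not rows from the Iowa data.
actual = [150000, 200000, 120000]
predicted = [100000, 210000, 125000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value of each error, then average them.
absolute_errors = [abs(e) for e in errors]
mae_by_hand = sum(absolute_errors) / len(absolute_errors)

print(errors)                 # [50000, -10000, -5000]
print(absolute_errors)        # [50000, 10000, 5000]
print(round(mae_by_hand, 2))  # 21666.67
```

Note that the signed errors partly cancel (their plain average is about 11,667), which is why the absolute value is taken before averaging.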



In [7]:

    
from sklearn.metrics import mean_absolute_error as mae

predicted_home_prices = iowa_model.predict(X)
mae(y, predicted_home_prices)









    Out[7]:





62.35433789954339

The measure we just computed can be called an "in-sample" score.
We used a single set of houses (called a data sample) for both building the model and for calculating its MAE score.
This is bad.
Imagine that, in the large real estate market, door color is unrelated to home price.
However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive.
The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was originally derived from the training data, the model will appear accurate in the training data.
But this pattern likely won't hold when the model sees new data, and the model would be very inaccurate (and cost us lots of money) when we applied it to our real estate business.
Even a model capturing only happenstance relationships in the data, relationships that will not be repeated in new data, can appear to be very accurate on in-sample accuracy measurements.
A model's practical value comes from making predictions on new data, so we should measure performance on data that wasn't used to build the model.
The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before.
This data is called validation data.
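
To make the gap between in-sample and validation scores concrete, here is a minimal sketch in plain Python on synthetic data. It is a stand-in rather than the notebook's model: instead of a decision tree it uses a hypothetical nearest-neighbour lookup that simply returns the price of the most similar training house, which, like a fully grown tree, can reproduce the training prices exactly. The data, the noise level, and the helper names are all assumptions made for illustration.

```python
import random

random.seed(0)

# Synthetic houses: price depends on area plus noise the model cannot know.
areas = [random.uniform(500, 3000) for _ in range(200)]
houses = [(area, 100 * area + random.gauss(0, 20000)) for area in areas]

# Hold out the last quarter of the (already random) rows for validation.
split = int(len(houses) * 0.75)
train, val = houses[:split], houses[split:]

def predict(area):
    """Return the price of the training house with the closest area."""
    nearest = min(train, key=lambda house: abs(house[0] - area))
    return nearest[1]

def mean_absolute_error(rows):
    """Average absolute difference between actual and predicted prices."""
    return sum(abs(price - predict(area)) for area, price in rows) / len(rows)

in_sample_mae = mean_absolute_error(train)
validation_mae = mean_absolute_error(val)
print("In-sample MAE: ", in_sample_mae)
print("Validation MAE:", validation_mae)
```

The in-sample error is zero because every training house is its own nearest neighbour, while the validation error reflects the noise the lookup memorized. A very low in-sample score, such as the MAE of about 62 computed above, therefore says little about accuracy on new houses.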
The scikit-learn library has a function called train_test_split() to break up the data into two pieces, so the code to get a validation score looks like this:



In [8]:

    
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
iowa_model = DecisionTreeRegressor()
iowa_model.fit(train_X, train_y)

# Get the predicted prices for the validation data:
val_predictions = iowa_model.predict(val_X)
print(mae(val_y, val_predictions))









    



32519.956164383562

Underfitting, Overfitting and Model Optimization

Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions.
But what alternatives do you have for models?
You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time).
The most important options determine the tree's depth.
Recall from earlier that a tree's depth is a measure of how many splits it makes before coming to a prediction.

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf.
As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses.
If a tree has only 1 split, it divides the data into 2 groups.
If each group is split again, we would get 4 groups of houses.
Splitting each of those again would create 8 groups.
If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level.
That's 1024 leaves!
When we divide the houses between many leaves, we also have fewer houses in each leaf.
Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
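
The doubling argument above can be checked with a few lines of arithmetic. This sketch assumes an idealized tree in which every group is split in two at every level (real trees are usually less balanced), and it uses the 1,460 rows reported by describe() as the number of houses.

```python
total_houses = 1460  # row count from the describe() output above

def leaves_at(depth):
    """Number of groups after `depth` levels of splits in a balanced tree."""
    return 2 ** depth

for depth in range(0, 11, 2):
    leaves = leaves_at(depth)
    print(f"depth {depth:2d}: {leaves:4d} leaves, "
          f"about {total_houses / leaves:.1f} houses per leaf")
```

At depth 10 this gives 1024 leaves, or roughly 1.4 houses per leaf, so each prediction would rest on one or two houses.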

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data.
On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.
At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses.
The resulting predictions may be far off for most houses, even in the training data (and they will be bad in validation too, for the same reason).
When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.

Example

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes.
But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting.
The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:



In [9]:

    
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

The data is loaded into train_X, val_X, train_y, and val_y just like before, and a little repetition never hurt anyone:



In [10]:

    
import pandas as pd

file_path = 'input/train.csv'
iowa_data = pd.read_csv(file_path)
# Choose the target and predictors:
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
# Drop rows with missing values only in the columns the model uses.
# Calling dropna(axis=0) on the whole frame removes every row here, because
# columns such as Alley and PoolQC are NaN for nearly every house, and
# fitting then fails with "Found array with 0 sample(s)".
iowa_data = iowa_data.dropna(axis=0, subset=iowa_predictors + ['SalePrice'])
y = iowa_data.SalePrice
X = iowa_data[iowa_predictors]

Now we can split our data into training and validation datasets for both the target and the predictors.



In [11]:

    
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

A for-loop can be used to compare the accuracy of the models built with different values for max_leaf_nodes.



In [12]:

    
# Compare MAE with the different values of max_leaf_nodes:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
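
The same kind of comparison can be tried end to end on synthetic data without scikit-learn. The sketch below is a stand-in, not the notebook's model: it replaces the decision tree with a hypothetical binned-average regressor, where the number of equal-width bins plays the role that max_leaf_nodes plays for a tree (few bins underfit, many bins overfit). The data, function names, and candidate values are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

def make_rows(n):
    """Synthetic data: a smooth curve plus noise (not the Iowa data)."""
    rows = []
    for _ in range(n):
        x = random.random()
        rows.append((x, math.sin(2 * math.pi * x) + random.gauss(0, 0.3)))
    return rows

train_rows, val_rows = make_rows(300), make_rows(100)

def fit_binned_means(rows, n_bins):
    """Average the target inside each of n_bins equal-width bins on [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in rows:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in rows) / len(rows)
    # Empty bins fall back to the overall training mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def binned_mae(n_bins, train, val):
    """Fit on the training rows and return the MAE on the validation rows."""
    means = fit_binned_means(train, n_bins)
    errors = [abs(y - means[min(int(x * n_bins), n_bins - 1)]) for x, y in val]
    return sum(errors) / len(errors)

scores = {n: binned_mae(n, train_rows, val_rows) for n in [1, 10, 100, 1000]}
for n_bins, score in scores.items():
    print(f"bins: {n_bins:5d}   validation MAE: {score:.3f}")

# Keep the setting with the lowest validation error.
best_n_bins = min(scores, key=scores.get)
print("Best setting on the validation data:", best_n_bins)
```

One would expect a middle value to do best here, with a single bin too coarse and a thousand bins too fine, though the exact scores depend on the random data. The selection step, taking the candidate with the lowest validation MAE, is the part that carries over to choosing max_leaf_nodes.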


