Introduction

This will be your workspace for Kaggle's Machine Learning education track.

You will build and continually improve a model to predict housing prices as you work through each tutorial.
Fork this notebook and write your code in it.
You will see examples predicting home prices using data from Melbourne, Australia.
You will then write code to build a model predicting prices in the US state of Iowa.
The data from the tutorial, the Melbourne data, is not available in this workspace.
You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the Learn Discussion forum for any questions or comments.

Write Your Code Below



In [38]:

    
import pandas as pd

# Load the data and greet everyone.
main_file_path = 'input/melbourne_data.csv'
data = pd.read_csv(main_file_path)
# If you want to remove entries with missing values:
# filtered_data = data.dropna(axis=0)
print('hello world')









    



hello world

Use pandas to get familiar with your data

The first thing you'll want to do is familiarize yourself with the data.
You'll use the Pandas library for this.
Pandas is the primary tool that modern data scientists use for exploring and manipulating data.
Most people abbreviate pandas in their code as pd.
We do this with the command
import pandas as pd
The most important part of the Pandas library is the DataFrame.
A DataFrame holds the type of data you might think of as a table.
This is similar to a sheet in Excel, or a table in a SQL database.
The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.
Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa:



In [39]:

    
# Explore some summary statistics:
data.describe()









    Out[39]:







  
    
      
         Unnamed: 0         Rooms         Price      Distance      Postcode      Bedroom2      Bathroom           Car       Landsize  BuildingArea     YearBuilt     Lattitude    Longtitude  Propertycount
count  18396.000000  18396.000000  1.839600e+04  18395.000000  18395.000000  14927.000000  14925.000000  14820.000000   13603.000000   7762.000000   8958.000000  15064.000000  15064.000000   18395.000000
mean   11826.787073      2.935040  1.056697e+06     10.389986   3107.140147      2.913043      1.538492      1.615520     558.116371    151.220219   1965.879996    -37.809849    144.996338    7517.975265
std     6800.710448      0.958202  6.419217e+05      6.009050     95.000995      0.964641      0.689311      0.955916    3987.326586    519.188596     37.013261      0.081152      0.106375    4488.416599
min        1.000000      1.000000  8.500000e+04      0.000000   3000.000000      0.000000      0.000000      0.000000       0.000000      0.000000   1196.000000    -38.182550    144.431810     249.000000
25%     5936.750000      2.000000  6.330000e+05      6.300000   3046.000000      2.000000      1.000000      1.000000     176.500000     93.000000   1950.000000    -37.858100    144.931193    4294.000000
50%    11820.500000      3.000000  8.800000e+05      9.700000   3085.000000      3.000000      1.000000      2.000000     440.000000    126.000000   1970.000000    -37.803625    145.000920    6567.000000
75%    17734.250000      3.000000  1.302000e+06     13.300000   3149.000000      3.000000      2.000000      2.000000     651.000000    174.000000   2000.000000    -37.756270    145.060000   10331.000000
max    23546.000000     12.000000  9.000000e+06     48.100000   3978.000000     20.000000      8.000000     10.000000  433014.000000  44515.000000   2018.000000    -37.408530    145.526350   21650.000000

Interpreting data description

The results show 8 numbers for each numeric column in your original dataset.
The first number, the count, shows how many rows have non-missing values.
Missing values arise for many reasons.
For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house.
We'll come back to the topic of missing data.
The second value is the mean, which is the average.
Under that, std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value.
The first (smallest) value is the min.
If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.
That is the 25% value (pronounced "25th percentile").
The 50th and 75th percentiles are defined analogously, and the max is the largest number.
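To make these eight numbers concrete, here is a minimal sketch on a tiny hand-made Series. The values are hypothetical and chosen only so the arithmetic is easy to check by hand; they are not taken from the housing data.

```python
import pandas as pd

# Hypothetical values; the last entry is deliberately missing.
toy = pd.Series([10.0, 20.0, 30.0, 40.0, None])

summary = toy.describe()
print(summary)

# count skips the missing entry, so it reports 4 rather than 5.
# The mean is (10 + 20 + 30 + 40) / 4 = 25.
# The 25% value is interpolated between the sorted values 10 and 20.
```

Note that pandas interpolates between neighbouring values when a percentile falls between two entries, so the reported percentile is not always a value that appears in the column.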

Selecting and filtering in pandas

This is part of Kaggle's Learn Machine Learning series.
Selecting and Filtering Data

Your dataset had too many variables to wrap your head around, or even to print out nicely.
How can you pare down this overwhelming amount of data to something you can understand?
To show you the techniques, we'll start by picking a few variables using our intuition.
Later tutorials will show you statistical techniques to automatically prioritize variables.
Before we can choose variables/columns, it is helpful to see a list of all columns in the dataset.
That is done with the columns property of the DataFrame.



In [40]:

    
data.columns









    Out[40]:





Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

There are lots of ways to go about selecting different subsets of your data.
Let's start with the basics.

Selecting a single column

You can pull out a variable (or column) with dot-notation, as long as its name is a valid Python identifier.
This single column is stored in a pandas Series, which is kind of like a DataFrame with a single column.



In [41]:

    
# Store the Series labeled Price separately:
price_data = data.Price
# Read the first few entries:
price_data.head()









    Out[41]:





0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
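Dot-notation only works when the column name could also be a Python variable name. Some of the Iowa columns used later (for example 1stFlrSF) start with a digit, so bracket-notation is needed for them. The sketch below uses a small hypothetical frame, not the real data, to show that both notations return the same Series.

```python
import pandas as pd

# Hypothetical two-row frame that reuses two column names from this tutorial.
frame = pd.DataFrame({'Price': [1480000.0, 1035000.0], '1stFlrSF': [856, 1262]})

# Dot-notation and bracket-notation return the same Series.
same = frame.Price.equals(frame['Price'])
print(same)

# frame.1stFlrSF would be a SyntaxError, so brackets are the only option here.
first_floor = frame['1stFlrSF']
print(first_floor.head())
```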

Selecting multiple columns

You can select multiple columns from a DataFrame by providing a list of column names inside brackets.
Each item in that list must be a string ('in quotes').



In [42]:

    
columns_of_interest = ['Landsize', 'BuildingArea']
two_columns = data[columns_of_interest]
two_columns.head()









    Out[42]:







  
    
      
   Landsize  BuildingArea
0     202.0           NaN
1     156.0          79.0
2     134.0         150.0
3      94.0           NaN
4     120.0         142.0



In [43]:

    
two_columns.describe()









    Out[43]:







  
    
      
            Landsize  BuildingArea
count   13603.000000   7762.000000
mean      558.116371    151.220219
std      3987.326586    519.188596
min         0.000000      0.000000
25%       176.500000     93.000000
50%       440.000000    126.000000
75%       651.000000    174.000000
max    433014.000000  44515.000000

Your first scikit-learn model

This tutorial is part of the Learn Machine Learning series.

Choosing the prediction target

You have the code to load your data, and you know how to index it.
You are ready to choose which column you want to predict.
This column is called the prediction target.
There is a convention that the prediction target is referred to as y.



In [44]:

    
y = data.Price
y.describe()









    Out[44]:





count    1.839600e+04
mean     1.056697e+06
std      6.419217e+05
min      8.500000e+04
25%      6.330000e+05
50%      8.800000e+05
75%      1.302000e+06
max      9.000000e+06
Name: Price, dtype: float64

Choosing Predictors

Next we will select the predictors.
There may be times when you use all of the other variables besides the target as predictors.
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric variables.



In [45]:

    
data.columns









    Out[45]:





Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')



In [46]:

    
# You may need to remove or replace NaN values from some of the predictors:
# http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values
# Caution: 'Price' is also the target y, so listing it here leaks the answer to the model.
data_predictors = ['Price', 'Rooms']

By convention, this data is called X:



In [47]:

    
X = data[data_predictors]
X.describe()









    Out[47]:







  
    
      
              Price         Rooms
count  1.839600e+04  18396.000000
mean   1.056697e+06      2.935040
std    6.419217e+05      0.958202
min    8.500000e+04      1.000000
25%    6.330000e+05      2.000000
50%    8.800000e+05      3.000000
75%    1.302000e+06      3.000000
max    9.000000e+06     12.000000

Building your model

You will use the scikit-learn library to create your models.
When coding, this library is written as sklearn, as you will see in the sample code.
Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.
The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like.
Evaluate: Determine how accurate the model's predictions are.



In [48]:

    
from sklearn.tree import DecisionTreeRegressor

# Define the model:
melbourne_model = DecisionTreeRegressor()

# Fit the model:
melbourne_model.fit(X, y)









    Out[48]:





DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for.
Here we'll make predictions for the first rows of the training data to see how the predict function works.



In [49]:

    
print(X.head())
melbourne_model.predict(X.head())









    



       Price  Rooms
0  1480000.0      2
1  1035000.0      2
2  1465000.0      3
3   850000.0      3
4  1600000.0      4






    Out[49]:





array([1480000., 1035000., 1465000.,  850000., 1600000.])
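The predictions above match the known prices exactly. A likely reason is that Price, the target, is also one of the two predictors, so the tree can simply look the answer up. The sketch below, on a hypothetical four-row frame, contrasts a model that sees the target with one that only sees Rooms.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: two houses with 2 rooms and two with 3 rooms.
toy = pd.DataFrame({
    'Rooms': [2, 2, 3, 3],
    'Price': [100.0, 200.0, 300.0, 500.0],
})
y_toy = toy.Price

# The "leaky" model is given the target as a predictor.
leaky = DecisionTreeRegressor(random_state=0).fit(toy[['Price', 'Rooms']], y_toy)
# The "honest" model only sees the number of rooms.
honest = DecisionTreeRegressor(random_state=0).fit(toy[['Rooms']], y_toy)

leaky_preds = leaky.predict(toy[['Price', 'Rooms']])
honest_preds = honest.predict(toy[['Rooms']])
print(leaky_preds)    # reproduces the prices
print(honest_preds)   # average price for each room count
```

The honest model can only return the average price of the houses that share a room count, which is closer to what you should expect from a real predictor.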

Build a model for the Iowa data

Now it's time for you to define and fit a model for your data (in your notebook).
Select the target variable you want to predict.
You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable).
Save this to a new variable called y.
Create a list of the names of the predictors we will use in the initial model.
Use just the following columns in the list (you may need to remove or replace NaN values from some of the predictors):

LotArea
YearBuilt
1stFlrSF
2ndFlrSF
FullBath
BedroomAbvGr
TotRmsAbvGrd

Using the list of variable names you just created, select a new DataFrame of the predictors data.
Save this with the variable name X.
Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model).
Ensure you've done the relevant import so you can run this command.
Fit the model you have created using the data in X and the target data you saved above.
Make a few predictions with the model's predict command and print out the predictions.
This exercise is in the iowa_model.ipynb notebook.
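As a rough sketch of how the steps above fit together, the code below builds a tiny stand-in frame by hand instead of reading the Iowa file. The rows are illustrative only, and the target name SalePrice is an assumption; check the column list in your own notebook. In practice you would replace the hand-made frame with pd.read_csv on your Iowa data path.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Illustrative rows typed in by hand; they stand in for the Iowa file.
iowa_data = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'YearBuilt': [2003, 1976, 2001, 1915],
    '1stFlrSF': [856, 1262, 920, 961],
    '2ndFlrSF': [854, 0, 866, 756],
    'FullBath': [2, 2, 2, 1],
    'BedroomAbvGr': [3, 3, 3, 3],
    'TotRmsAbvGrd': [8, 6, 6, 7],
    'SalePrice': [208500, 181500, 223500, 140000],  # assumed target column name
})

# Target and predictors, following the column list in the exercise.
iowa_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                   'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
y = iowa_data.SalePrice
X = iowa_data[iowa_predictors]

# Define, fit, and predict.
iowa_model = DecisionTreeRegressor(random_state=0)
iowa_model.fit(X, y)
predictions = iowa_model.predict(X.head())
print(predictions)
```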

Model Validation

You've built a model.
But how good is it?
You'll need to answer this question for every model you ever build.
In most (though not necessarily all) applications, the relevant measure of model quality is predictive accuracy.
In other words, will the model's predictions be close to what actually happens?
Some people try to answer this question by making predictions with their training data.
They compare those predictions to the actual target values in the training data.
This approach has a critical shortcoming, which you will see in a moment (and which you'll subsequently see how to solve).
Even with this simple approach, you'll need to summarize the model quality into a form that someone can understand.
If you have predicted and actual home values for 10,000 houses, you will inevitably end up with a mix of good and bad predictions.

Looking through such a long list would be pointless.
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).
Let's break down this metric starting with the last word, error.
The prediction error for each house is:
error = actual − predicted
So, if a house cost $150,000 and you predicted it would cost $100,000, then the error is $50,000.
With the MAE metric, we take the absolute value of each error.
This converts each error to a positive number.
We then take the average of those absolute errors.
This is our measure of model quality.
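Here is a small worked example of that calculation. The three house prices are hypothetical (the first pair is the $150,000 versus $100,000 case from above), and the hand-computed result is compared with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices for three houses.
actual = np.array([150000.0, 200000.0, 300000.0])
predicted = np.array([100000.0, 210000.0, 300000.0])

errors = actual - predicted             # 50000, -10000, 0
manual_mae = np.abs(errors).mean()      # (50000 + 10000 + 0) / 3
library_mae = mean_absolute_error(actual, predicted)
print(manual_mae, library_mae)
```

Taking the absolute value matters: without it, the positive and negative errors would partly cancel and make the model look better than it is.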
In plain English, it can be said as:
"On average, our predictions are off by about X".
We first load the Melbourne data and create X and y.



In [50]:

    
import pandas as pd

melbourne_file_path = 'input/melbourne_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Remove entries with missing values:
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Set X and y by choosing the target and predictors:
y = filtered_melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_predictors]

Now we can create the decision tree model using our target and predictors:



In [51]:

    
from sklearn.tree import DecisionTreeRegressor
# Define the model:
melbourne_model = DecisionTreeRegressor()
# Fit the model:
melbourne_model.fit(X, y)









    Out[51]:





DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

We can use the results to calculate the mean absolute error:



In [52]:

    
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)









    Out[52]:





434.71594577146544

The Problem with "In-Sample" Scores

The measure we just computed can be called an "in-sample" score.
We used a single set of houses (called a data sample) for both building the model and for calculating its MAE score.
This is bad.
Imagine that, in the large real estate market, door color is unrelated to home price.
However, in the sample of data you used to build the model, it may be that all homes with green doors were very expensive.
The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was originally derived from the training data, the model will appear accurate in the training data.
But this pattern likely won't hold when the model sees new data, and the model would be very inaccurate (and cost us lots of money) when we applied it to our real estate business.
Even a model capturing only happenstance relationships in the data, relationships that will not be repeated in new data, can appear to be very accurate on in-sample accuracy measurements.

Example

A model's practical value comes from making predictions on new data, so we should measure performance on data that wasn't used to build the model.
The most straightforward way to do this is to exclude some data from the model-building process, and then use that data to test the model's accuracy on data it hasn't seen before.
This data is called validation data.
The scikit-learn library has a function called train_test_split() to break up the data into two pieces, so the code to get a validation score looks like this:



In [53]:

    
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets for both predictors and target.
# The split is based on a pseudorandom number generator, and the random_state argument will repeat the same split.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define the model:
melbourne_model = DecisionTreeRegressor()
# Fit the model:
melbourne_model.fit(train_X, train_y)
# Get the predicted prices on the validation data:
val_predictions = melbourne_model.predict(val_X)
# Compute and display the mean absolute error:
print(mean_absolute_error(val_y, val_predictions))









    



260945.67527437056
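The gap between an in-sample score and a validation score can be reproduced on data that contains no real pattern at all. The sketch below is an illustration on synthetic random numbers (an assumption made for the demo, not the housing data): an unrestricted tree memorizes the training rows, yet it has nothing useful to say about the held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: three predictors of pure noise and an unrelated target.
rng = np.random.RandomState(0)
X_noise = rng.rand(200, 3)
y_noise = rng.rand(200)

# With default arguments, train_test_split holds out a quarter of the rows.
train_X, val_X, train_y, val_y = train_test_split(X_noise, y_noise, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)

in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))
print(in_sample_mae, validation_mae)
```

The in-sample error is essentially zero even though there is nothing to learn, which is why only the validation number should be trusted.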

Underfitting, Overfitting and Model Optimization

Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models? You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from earlier that a tree's depth is a measure of how many splits it makes before coming to a prediction.
In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves! When we divide the houses between many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups. At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting. Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting.
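The doubling argument can be checked with a few lines of arithmetic and a small fitted tree. This is a sketch on synthetic random data (an assumption for illustration), and it relies on the get_n_leaves method, which is available in reasonably recent scikit-learn releases.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each extra level can at most double the number of groups.
for depth in [1, 2, 3, 10]:
    print(depth, 2 ** depth)

# Synthetic data, used only to count leaves; it is not the housing data.
rng = np.random.RandomState(0)
X_demo = rng.rand(2000, 2)
y_demo = rng.rand(2000)

shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
n_leaves = shallow.get_n_leaves()          # at most 2 ** 3 = 8
rows_per_leaf = len(y_demo) / n_leaves     # average number of rows behind each prediction
print(n_leaves, rows_per_leaf)
```

With the same number of rows, a tree of depth 10 could leave only about two rows behind each prediction, which is the situation the overfitting discussion warns about.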

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes.
But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting.
The more leaves we allow the model to make, the more we move from the underfitting area in the above graph to the overfitting area.
We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:



In [54]:

    
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    # Fit a tree capped at max_leaf_nodes leaves and return its validation MAE.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

Using the data that has already been loaded, cleaned, and split, we can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.



In [55]:

    
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d, \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))









    



Max leaf nodes: 5, 		 Mean Absolute Error: 347380
Max leaf nodes: 50, 		 Mean Absolute Error: 257829
Max leaf nodes: 500, 		 Mean Absolute Error: 243176
Max leaf nodes: 5000, 		 Mean Absolute Error: 254915

Of the options listed above, it appears that 500 is the optimal number of leaves.
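Rather than reading the best value off the printout, you can let Python pick it. The sketch below copies the four scores from the output above into a dictionary; in your own notebook you would more likely fill the dictionary by calling get_mae inside the loop.

```python
# Validation MAE for each candidate, copied from the output above.
scores = {5: 347380, 50: 257829, 500: 243176, 5000: 254915}

# Pick the candidate whose validation MAE is lowest.
best_leaf_nodes = min(scores, key=scores.get)
print(best_leaf_nodes, scores[best_leaf_nodes])
```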
Good luck applying your function to the Iowa dataset:

>>> ValueError: Found array with 0 sample(s) (shape=(0, 7)) while a minimum of 1 is required.
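One plausible reading of this error, assuming the Iowa code mirrors the Melbourne cell and calls dropna(axis=0) on the whole frame: if every row has a missing value in at least one column, no rows survive, and the model is handed an empty array with 7 predictor columns. The sketch below reproduces the effect on a hypothetical frame (the MostlyEmpty column is made up) and shows one way around it, which is to drop rows based only on the columns the model actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column is missing in every row.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'YearBuilt': [2003, 1976, 2001],
    'SalePrice': [208500, 181500, 223500],
    'MostlyEmpty': [np.nan, np.nan, np.nan],  # made-up column with no values
})
predictors = ['LotArea', 'YearBuilt']

# Dropping rows with any missing value across all columns removes every row.
all_dropped = toy.dropna(axis=0)
print(all_dropped[predictors].shape)

# Restricting the check to the modeling columns keeps the usable rows.
kept = toy.dropna(axis=0, subset=predictors + ['SalePrice'])
print(kept[predictors].shape)
```

Replacing missing values (imputation) is another option, as the earlier code comment about NaN values suggests.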

Conclusion

Here's the takeaway: Models can suffer from either:

Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy.
This lets us try many candidate models and keep the best one.
But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.


Starting your ML project