Introduction

Machine learning competitions are a great way to improve your data science skills and measure your progress.

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this course.

The steps in this notebook are:

  1. Build a Random Forest model with all of your data (X and y)
  2. Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
  3. Submit those predictions to the competition and see your score.
  4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

Recap

Here's the code you've written so far. Start by running it again.


In [21]:
def str2cols(df, column, col_vals, prefix):
    '''
    One-hot encode selected values of a categorical column, in place.

    df: pandas DataFrame
    column: string (name of the original column)
    col_vals: list of strings (values of the original column to encode)
    prefix: string (prepended to each new column name)

    returns None (modifies df in place)
    '''
    # add a 0/1 indicator column for each listed value
    for col_val in col_vals:
        df[prefix + col_val] = (df[column] == col_val).astype('int64')
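
For example, applied to a toy frame (illustrative only, not competition data), str2cols adds one 0/1 indicator column per listed value:


In [ ]:
# toy demonstration of str2cols; expected new columns: ST_WD, ST_New, ST_COD
import pandas as pd

demo = pd.DataFrame({'SaleType': ['WD', 'New', 'WD', 'COD']})
str2cols(demo, 'SaleType', ['WD', 'New', 'COD'], 'ST_')
print(demo)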

In [22]:
def add_feature(home_data):
    '''One-hot encode a few categorical columns used as model features.'''
    # Price_per_SF stays disabled: it is computed from the target SalePrice,
    # which leaks the target into a feature and is absent from the test data.
    #home_data['Price_per_SF'] = home_data.SalePrice / \
    #                           (home_data['1stFlrSF'] + home_data['2ndFlrSF'] + home_data['TotalBsmtSF'])

    str2cols(home_data, 'SaleType', ['WD', 'New', 'COD'], 'ST_')

    sale_condition = ['Normal', 'Abnorml', 'Partial', 'AdjLand', 'Alloca', 'Family']
    str2cols(home_data, 'SaleCondition', sale_condition, 'SC_')

    bldg = ['1Fam', '2fmCon', 'Duplex', 'TwnhsE', 'Twnhs']
    str2cols(home_data, 'BldgType', bldg, 'BT_')

    house_style = ['2Story', '1Story', '1.5Fin', 'SFoyer', 'SLvl']
    str2cols(home_data, 'HouseStyle', house_style, 'HS_')

In [23]:
# Code you have previously used to load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR  # needed only if the disabled SVM block below is re-enabled
from xgboost import XGBRegressor



# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = '../input/train.csv'

home_data = pd.read_csv(iowa_file_path)

# Create target object and call it y
y = home_data.SalePrice
# Create X
add_feature(home_data)
# home_data['YearBuilt'] = 2011 - home_data['YearBuilt']  # degrade in RF, no change in LR

features = ['OverallQual', 'OverallCond', 'LotArea',
            'ST_WD', 'ST_New', 'ST_COD',  'SC_Abnorml', 'SC_Partial', # 'SC_Normal',
            'MSSubClass',
            'GarageCars', # 'GarageArea',
            'YearBuilt',  # 'YearRemodAdd', 'YrSold', 
            # 'BT_1Fam', 'BT_2fmCon', 'BT_Duplex', 'BT_TwnhsE', 'BT_Twnhs',
            # 'HS_2Story', 'HS_1Story', 'HS_1.5Fin', 'HS_SFoyer', 'HS_SLvl',
            '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

"""
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))
"""
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))


# standardize features (used only by the disabled SVM block below)
scaler = StandardScaler()
train_Xnorm = scaler.fit_transform(train_X)
val_Xnorm = scaler.transform(val_X)
"""
svm_model = SVR(kernel='linear')
svm_model.fit(train_Xnorm, train_y)
svm_val_predict = svm_model.predict(val_Xnorm)
svm_val_mae = mean_absolute_error(svm_val_predict, val_y)
print('Validation MAE for SVM: {}'.format(svm_val_mae))
"""

lr_model = LinearRegression()
lr_model.fit(train_X, train_y)
lr_val_predict = lr_model.predict(val_X)
lr_val_mae = mean_absolute_error(lr_val_predict, val_y)
print('Validation MAE for Linear Regression: {:,.0f}'.format(lr_val_mae))

# XGBoost with early stopping: stop adding trees once the validation MAE
# has not improved for 10 rounds
xg_model = XGBRegressor(n_estimators=5000)
xg_model.fit(train_X, train_y, early_stopping_rounds=10, eval_set=[(val_X, val_y)], verbose=False)
xg_val_predict = xg_model.predict(val_X)
xg_val_mae = mean_absolute_error(xg_val_predict, val_y)
print('Validation MAE for XGBoost Regression: {:,.0f}'.format(xg_val_mae))


# compare the first few predictions against the actual prices
print(rf_val_predictions[:5])
print(np.round(lr_val_predict[:5]))
print(val_y[:5])
#print(val_X[:5])


/home/kato/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/home/kato/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/home/kato/anaconda3/lib/python3.6/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
/home/kato/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:68: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
Validation MAE for Random Forest Model: 19,180
Validation MAE for Linear Regression: 22,006
Validation MAE for XGBoost Regression: 18,400
[190483.5 142640.  119175.   78650.  131735. ]
[218428. 155531. 107175.  50661. 120946.]
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64

In [24]:
xg_model


Out[24]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=5000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [25]:
home_data.SaleType.value_counts()


Out[25]:
WD       1267
New       122
COD        43
ConLD       9
ConLI       5
ConLw       5
CWD         4
Oth         3
Con         2
Name: SaleType, dtype: int64

In [39]:
# randomized hyperparameter search for XGBRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

params = {'n_estimators': [250, 300, 350, 400, 500, 600, 1000],
          'max_depth': [4, 5, 6],
          'learning_rate': [0.02, 0.03, 0.04]}
randsearch = RandomizedSearchCV(XGBRegressor(), params, cv=5, n_iter=100,
                                scoring='neg_mean_absolute_error',
                                return_train_score=True, n_jobs=-1, verbose=1)
# extra fit kwargs are forwarded to XGBRegressor.fit, enabling early stopping
randsearch.fit(train_X, train_y, early_stopping_rounds=10, eval_set=[(val_X, val_y)], verbose=False)
print("Best Params =", randsearch.best_params_, "score =", randsearch.best_score_)


/home/kato/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py:271: UserWarning: The total space of parameters 63 is smaller than n_iter=100. Running 63 iterations. For exhaustive searches, use GridSearchCV.
  % (grid_size, self.n_iter, grid_size), UserWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 5 folds for each of 63 candidates, totalling 315 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   54.3s
[Parallel(n_jobs=-1)]: Done 315 out of 315 | elapsed:  1.3min finished
Best Params = {'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.02} score = -19228.788377568493
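
Instead of retyping the best parameters by hand (as the cell below does), the best estimator found by the search could also be cloned and refit on all of the data; a sketch:


In [ ]:
# sketch: clone the best estimator from the search and refit it on all of X, y
from sklearn.base import clone
best_xgb = clone(randsearch.best_estimator_)
best_xgb.fit(X, y)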

Creating a Model For the Competition

Build a model and train it on all of X and y. Here the tuned XGBRegressor from the search above is used in place of the suggested Random Forest.


In [40]:
# Fit with XGBRegressor

#xgb = XGBRegressor(n_estimators=250, max_depth=5, learning_rate=0.03)
xgb = XGBRegressor(n_estimators=500, max_depth=5, learning_rate=0.02)
xgb.fit(X, y)


Out[40]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.02, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=500,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [42]:
# sanity check on the validation split (note: xgb was trained on all of X,
# so these rows were seen in training and the fit looks optimistic)
p_cv = xgb.predict(val_X)
print(p_cv[:5], val_y[:5])


[209760.02 175773.   121205.54  80816.19 134892.83] 258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64

Make Predictions

Read the file of "test" data and apply your model to make predictions.


In [41]:
# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)
# GarageCars has missing values in the test set; fill with 0 so the model can predict
test_data['GarageCars'].fillna(0.0, inplace=True)
add_feature(test_data)

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# make predictions which we will submit. 
#test_preds = rf_model_on_full_data.predict(test_X)
test_preds = xgb.predict(test_X)

# The lines below show you how to save your data in the format needed to score it in the competition
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})

output.to_csv('submission.csv', index=False)
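
A quick optional sanity check before submitting: the scorer expects exactly two columns, Id and SalePrice, with one row per test house.


In [ ]:
# optional check on the submission frame we just wrote
print(output.shape)
print(output.head())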

Test Your Work

After filling in the code above:

  1. Click the Commit and Run button.
  2. After your code has finished running, click the small double brackets << in the upper left of your screen. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
  3. Go to the Output tab at the top of your screen. Select the button to submit your file to the competition.
  4. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process.

Congratulations, you've started competing in Machine Learning competitions.

Continuing Your Progress

There are many ways to improve your model, and experimenting is a great way to learn at this point.

The best way to improve your model is to add features. Look at the list of columns and think about what might affect home prices. Some features will cause errors because of issues like missing values or non-numeric data types.

Level 2 of this course will teach you how to handle these types of features. You will also learn to use xgboost, a technique giving even better accuracy than Random Forest.
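
As a small preview (a minimal sketch using pandas only; LotFrontage and Neighborhood are example columns from this dataset), missing numeric values can be filled and categoricals one-hot encoded like this:


In [ ]:
# minimal sketch, not the course's exact recipe: median-fill a numeric
# column that has missing values, then one-hot encode a categorical column
extra = home_data[['LotFrontage', 'Neighborhood']].copy()
extra['LotFrontage'] = extra['LotFrontage'].fillna(extra['LotFrontage'].median())
extra = pd.get_dummies(extra, columns=['Neighborhood'], prefix='NB')
print(extra.head())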

Other Courses

The Pandas course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects.

You are also ready for the Deep Learning course, where you will build models with better-than-human level performance at computer vision tasks.

