In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
from scipy import stats
In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [4]:
train.head()
Out[4]:
In [5]:
test.head()
Out[5]:
In [6]:
train.shape, test.shape
Out[6]:
The target variable SalePrice is present in the train dataset but not in the test dataset. We have to predict SalePrice for the test dataset.
In [7]:
train.columns
Out[7]:
In [142]:
# Description of all the features and their values
with open('data_description.txt') as desc_file:
    print(desc_file.read())
In [8]:
train.dtypes.value_counts()  # number of columns of each dtype
Out[8]:
In [9]:
train.describe()
Out[9]:
In [10]:
train.info()
In [11]:
corr = train.corr()["SalePrice"]
corr.sort_values(ascending=False)
Out[11]:
In [12]:
plt.figure(figsize=(20,20))
corr = corr[1:-1] # dropping the first (Id) and last (SalePrice) entries of the Series
corr.plot(kind='barh') # using pandas plot
plt.title('Correlation coefficients w.r.t. Sale Price')
Out[12]:
In [13]:
# taking highly correlated variables with a positive correlation of 0.45 and above
high_positive_correlated_variables = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', \
'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', \
'YearRemodAdd', 'GarageYrBlt', 'MasVnrArea', 'Fireplaces']
corrMatrix = train[high_positive_correlated_variables].corr()
sns.set(font_scale=1.10)
plt.figure(figsize=(15, 15))
sns.heatmap(corrMatrix, vmax=.8, linewidths=0.01,
square=True, annot=True, cmap='viridis', linecolor="white")
plt.title('Correlation between features');
From the above heatmap, we can see that some features (other than our target variable SalePrice) are highly correlated among themselves. Note the yellow blocks in the above heatmap. The following features are intercorrelated:
TotRmsAbvGrd <> GrLivArea = 0.83
GarageYrBlt <> YearBuilt = 0.83
1stFlrSF <> TotalBsmtSF = 0.82
GarageArea <> GarageCars = 0.88
OverallQual is the other feature which is highly correlated with our target variable SalePrice.
SalePrice <> OverallQual = 0.79
This type of scenario results in multicollinearity. Multicollinearity occurs when there is moderate or high intercorrelation between independent variables, and it can inflate the standard errors of the estimated coefficients.
There are different ways to reduce multicollinearity, such as dropping one feature from each highly correlated pair, combining correlated features into a single feature, or using regularized models (e.g. Ridge or Lasso).
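As an illustrative aside (not part of the original workflow), multicollinearity can also be quantified with the variance inflation factor (VIF); values well above 10 are usually read as problematic. A minimal sketch, assuming statsmodels is installed and using a few of the intercorrelated features noted above:
# Sketch: variance inflation factor (VIF) for some of the intercorrelated features above.
# Assumes statsmodels is installed; a VIF much larger than 10 signals strong multicollinearity.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_features = ['GrLivArea', 'TotRmsAbvGrd', 'TotalBsmtSF', '1stFlrSF', 'GarageCars', 'GarageArea']
X_vif = sm.add_constant(train[vif_features].dropna())  # add an intercept column
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif.drop('const'))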
Let's see how these features relate to SalePrice across the overall data:
In [14]:
feature_variable = 'OverallQual'
target_variable = 'SalePrice'
train[[feature_variable, target_variable]].groupby([feature_variable], as_index=False).mean().sort_values(by=feature_variable, ascending=False)
Out[14]:
In [15]:
feature_variable = 'GarageCars'
target_variable = 'SalePrice'
train[[feature_variable, target_variable]].groupby([feature_variable], as_index=False).mean().sort_values(by=feature_variable, ascending=False)
Out[15]:
The multicollinear pairs of independent variables, as noted above, are:
TotRmsAbvGrd <> GrLivArea = 0.83
GarageYrBlt <> YearBuilt = 0.83
1stFlrSF <> TotalBsmtSF = 0.82
GarageArea <> GarageCars = 0.88
Let's draw scatter plots between SalePrice and some of the variables that are highly positively correlated with it. We take the following independent variables:
In [16]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
sns.pairplot(train[cols], size = 2.5)
Out[16]:
From the above scatter plots, we can see strong positive relationships between SalePrice and these variables.
Let's draw a box plot of OverallQual with respect to SalePrice.
In [17]:
# box plot overallqual/saleprice
plt.figure(figsize=[10,5])
sns.boxplot(x='OverallQual', y="SalePrice", data=train)
Out[17]:
Let's analyze the distribution of SalePrice across our train dataset.
Here, we do UNIVARIATE ANALYSIS, i.e. analysis that looks at only one variable at a time.
We analyze Skewness and Kurtosis of SalePrice.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
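To make these measures concrete: with the central moments $m_k = \frac{1}{n}\sum_i (x_i - \bar{x})^k$, skewness is $g_1 = m_3 / m_2^{3/2}$ and excess kurtosis is $g_2 = m_4 / m_2^{2} - 3$ (so a normal distribution has excess kurtosis 0). The small sketch below (not part of the original analysis) compares these moment-based values from scipy with the bias-corrected estimates that pandas' .skew() and .kurt() report; the numbers differ slightly but tell the same story.
# Sketch: moment-based skewness / excess kurtosis (scipy) vs pandas' bias-corrected estimates.
# Both kurtosis values are *excess* kurtosis, i.e. a normal distribution scores 0.
from scipy.stats import skew, kurtosis

prices = train['SalePrice']
print("scipy  -> skewness: %.4f, excess kurtosis: %.4f" % (skew(prices), kurtosis(prices)))
print("pandas -> skewness: %.4f, excess kurtosis: %.4f" % (prices.skew(), prices.kurt()))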
Graphical representation of data distribution for SalePrice:
In [18]:
train['SalePrice'].describe()
Out[18]:
In [19]:
# histogram to graphically show skewness and kurtosis
plt.figure(figsize=[15,5])
sns.distplot(train['SalePrice'])
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Number of Occurrences')
Out[19]:
In [20]:
# normal probability plot
plt.figure(figsize=[8,6])
stats.probplot(train['SalePrice'], plot=plt)
Out[20]:
In [21]:
# skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
From the above computation and also from the above histogram, we can say that SalePrice:
deviates from the normal distribution
has positive (right) skewness
shows peakedness (high kurtosis)
High kurtosis means that SalePrice has some outliers. We need to handle them so that they don't affect our prediction result.
In [22]:
plt.figure(figsize=[8,6])
plt.scatter(train["SalePrice"].values, range(train.shape[0]))
plt.title("Distribution of Sale Price")
plt.xlabel("Sale Price");
plt.ylabel("Number of Occurences")
Out[22]:
Let's cap the extreme outliers seen in the above figure at the 99.5th percentile.
In [23]:
# capping outliers at the 99.5th percentile
upperlimit = np.percentile(train.SalePrice.values, 99.5)
train.loc[train['SalePrice'] > upperlimit, 'SalePrice'] = upperlimit # cap values above the upper limit
In [24]:
# plotting the graph again after capping outliers
plt.figure(figsize=[8,6])
plt.scatter(train["SalePrice"].values, range(train.shape[0]))
plt.title("Distribution of Sale Price")
plt.xlabel("Sale Price");
plt.ylabel("Number of Occurences")
Out[24]:
Another way of reducing skewness is the log transformation, which makes the data distribution closer to normal. The logarithm function squeezes the larger values in the dataset and stretches out the smaller values.
Original value = $x$
New value after log-transformation = $\log_{10}(x)$ = $x'$
$x = 1$ then $\log_{10}(1) = 0$
$x = 10$ then $\log_{10}(10) = 1$
$x = 100$ then $\log_{10}(100) = 2$
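A tiny illustrative sketch (made-up data, not part of the original notebook) of how a log transform pulls in a right-skewed sample. It uses the natural log via np.log, which is what the cell below applies to SalePrice; the base only changes the scale, not the shape of the distribution.
# Sketch: log-transforming a synthetic right-skewed sample reduces its skewness.
rng = np.random.RandomState(0)
skewed_sample = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))  # fake right-skewed "prices"
print("skewness before log: %.3f" % skewed_sample.skew())
print("skewness after  log: %.3f" % np.log(skewed_sample).skew())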
Let's log transform our target variable SalePrice values:
In [25]:
# applying log transformation (np.log is the natural log; the base only changes the scale, not the shape)
train['SalePrice'] = np.log(train['SalePrice'])
After applying the log transformation, let's look at the histogram and normal probability plot again to see how this has affected the skewness and kurtosis, and how much closer to normal the distribution has become.
In [26]:
# histogram to graphically show skewness and kurtosis
plt.figure(figsize=[15,5])
sns.distplot(train['SalePrice'])
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Number of Occurrences')
# normal probability plot
plt.figure(figsize=[8,6])
stats.probplot(train['SalePrice'], plot=plt)
Out[26]:
Great! We can see that the log transformation has worked well: the distribution of SalePrice has changed from right-skewed to approximately normal.
In [27]:
# skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
In [28]:
sns.factorplot(x="PoolArea",y="SalePrice",data=train,hue="PoolQC",kind='bar')
plt.title("Pool Area , Pool quality and SalePrice ")
plt.ylabel("SalePrice")
plt.xlabel("Pool Area in sq feet");
Let's analyze the relationship between the number of fireplaces, fireplace quality, and SalePrice.
Note: SalePrice is no longer displayed in dollar values because it was log-transformed above.
The figure below shows that having two fireplaces is associated with a higher sale price, and an excellent-quality fireplace raises it significantly.
In [29]:
sns.factorplot("Fireplaces","SalePrice",data=train,hue="FireplaceQu");
In [30]:
pd.crosstab(train.Fireplaces, train.FireplaceQu)
Out[30]:
In [31]:
# scatter plot grlivarea/saleprice
plt.figure(figsize=[8,6])
plt.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
Out[31]:
Note the bottom right of the above plot: two houses with very large GrLivArea have a low SalePrice. These points are outliers for GrLivArea.
Let's remove these outliers.
In [32]:
# Deleting outliers (note: SalePrice is already log-transformed, so the < 300000
# condition is always true and every house with GrLivArea > 4000 gets dropped)
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
In [33]:
# Plot the graph again
# scatter plot grlivarea/saleprice
plt.figure(figsize=[8,6])
plt.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
Out[33]:
We have removed the extreme outliers from the GrLivArea variable. Outliers can be present in other variables as well, but removing outliers from every variable may adversely affect our model because there can be outliers in the test dataset too. The better solution is to make the model more robust to outliers.
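One concrete sense of "more robust": the modelling section below wraps the linear models in RobustScaler and uses Huber loss for gradient boosting. RobustScaler centres each feature on its median and scales by the interquartile range, so a handful of extreme values barely move the scaled features. A minimal sketch:
# Sketch: RobustScaler uses the median and IQR, so extreme values have little influence on the scaling.
from sklearn.preprocessing import RobustScaler

scaled = RobustScaler().fit_transform(train[['GrLivArea', 'TotalBsmtSF']])
print(scaled[:5])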
In [34]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
all_data.shape
Out[34]:
List the variables that have missing data, along with the total number of missing rows and the percentage of missing values.
In [35]:
null_columns = all_data.columns[all_data.isnull().any()]
total_null_columns = all_data[null_columns].isnull().sum()
percent_null_columns = all_data[null_columns].isnull().sum() / len(all_data) * 100  # percentage of missing rows
missing_data = pd.concat([total_null_columns, percent_null_columns], axis=1, keys=['Total', 'Percent']).sort_values(by=['Percent'], ascending=False)
#missing_data.head()
missing_data
Out[35]:
In [36]:
plt.figure(figsize=[20,5])
plt.xticks(rotation=90, fontsize=14)
sns.barplot(x=missing_data.index, y=missing_data.Percent)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Out[36]:
In [37]:
# get unique values of the column data
all_data['PoolQC'].unique()
Out[37]:
In [38]:
# replace null values with 'None'
all_data['PoolQC'].fillna('None', inplace=True)
In [39]:
# get unique values of the column data
all_data['PoolQC'].unique()
Out[39]:
In [40]:
# get unique values of the column data
all_data['MiscFeature'].unique()
Out[40]:
In [41]:
# replace null values with 'None'
all_data['MiscFeature'].fillna('None', inplace=True)
In [42]:
# get unique values of the column data
all_data['Alley'].unique()
Out[42]:
In [43]:
# replace null values with 'None'
all_data['Alley'].fillna('None', inplace=True)
In [44]:
# get unique values of the column data
all_data['Fence'].unique()
Out[44]:
In [45]:
# replace null values with 'None'
all_data['Fence'].fillna('None', inplace=True)
In [46]:
# get unique values of the column data
all_data['FireplaceQu'].unique()
Out[46]:
In [47]:
# replace null values with 'None'
all_data['FireplaceQu'].fillna('None', inplace=True)
LotFrontage: Linear feet of street connected to the property.
16.67% of LotFrontage values are missing. We can assume that a property's LotFrontage is similar to that of the other properties in its Neighborhood.
So, we fill the missing values with the median LotFrontage of the property's Neighborhood.
In [48]:
# barplot of median of LotFrontage with respect to Neighborhood
sns.barplot(data=train,x='Neighborhood',y='LotFrontage', estimator=np.median)
plt.xticks(rotation=90)
Out[48]:
In [49]:
# get unique values of the column data
all_data['LotFrontage'].unique()
Out[49]:
In [50]:
# replace null values with the median LotFrontage of the corresponding Neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
In [51]:
all_data['LotFrontage'].unique()
Out[51]:
In [52]:
# get unique values of the column data
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    print (all_data[col].unique())
In [53]:
# replace null values with 'None'
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col].fillna('None', inplace=True)
In [54]:
# get unique values of the column data
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    print (all_data[col].unique())
In [55]:
# replace null values with 0
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col].fillna(0, inplace=True)
In [56]:
# get unique values of the column data
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    print (all_data[col].unique())
In [57]:
# replace null values with 'None'
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col].fillna('None', inplace=True)
In [58]:
# replace null values with 0
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col].fillna(0, inplace=True)
In [59]:
all_data["MasVnrType"].fillna("None", inplace=True)
all_data["MasVnrArea"].fillna(0, inplace=True)
In [60]:
for col in ('MSZoning', 'Utilities', 'Functional', 'Exterior2nd', 'Exterior1st', 'KitchenQual', 'Electrical', 'SaleType'):
    all_data[col].fillna(all_data[col].mode()[0], inplace=True)
In [61]:
null_columns = all_data.columns[all_data.isnull().any()]
print (null_columns)
Earlier in this notebook, we reduced the skewness of our target variable SalePrice using a log transformation. We will now apply the same to the other numeric features (the independent variables) that have high skewness.
Let's check the skewness of the numeric features:
In [62]:
numeric_features = all_data.dtypes[all_data.dtypes != 'object'].index
#print (numeric_features)
skewness = []
for col in numeric_features:
    skewness.append( (col, all_data[col].skew()) )
pd.DataFrame(skewness, columns=('Feature', 'Skewness')).sort_values(by='Skewness', ascending=False)
Out[62]:
In [63]:
all_data.head()
Out[63]:
In [64]:
highly_skewed_features = all_data[numeric_features].columns[abs(all_data[numeric_features].skew()) > 1]
#print (highly_skewed_features)
# applying log transformation to features with absolute skewness above 1
for col in highly_skewed_features:
    # masked array skips values <= 0, since log(0) is undefined
    all_data[col] = np.log(np.ma.array(all_data[col], mask=(all_data[col]<=0)))
In [65]:
all_data.head()
Out[65]:
In [66]:
%%HTML
<style>
table {margin-left: 0 !important;}
</style>
Dummy variables are used to convert categorical/nominal features into quantitative ones. A new column is created for each unique category of a nominal/categorical column, and the values in the newly created columns are either 1 or 0.
Let's take an example of a column named "Sex" which has two values, "male" and "female". If we create dummy variables for this column, two new columns named "male" and "female" are added. For any row, if the "Sex" value is 'male' then the "male" column gets 1 and the "female" column gets 0. Similarly, if the "Sex" value is 'female' then the "male" column gets 0 and the "female" column gets 1. (A short pd.get_dummies sketch follows the tables below.)
BEFORE
Row | Sex |
---|---|
1 | male |
2 | female |
3 | female |
4 | male |
AFTER CREATING DUMMY VARIABLES
Row | Sex | male | female |
---|---|---|---|
1 | male | 1 | 0 |
2 | female | 0 | 1 |
3 | female | 0 | 1 |
4 | male | 1 | 0 |
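Here is that toy Sex example as a quick sketch with pd.get_dummies (illustrative only; note that pandas prefixes the new columns with the original column name, giving Sex_male and Sex_female rather than plain male and female):
# Sketch: dummy-encoding the toy "Sex" column described above.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})
print(pd.get_dummies(toy))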
We will now create dummy variables for all our categorical/nominal features.
In [67]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)
In [68]:
train = all_data[:ntrain]
test = all_data[ntrain:]
In [69]:
train.head()
Out[69]:
In [70]:
test.head()
Out[70]:
Here, we create different regression models and evaluate the Root Mean Square Error (RMSE) of predictions done by those models. The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed.
Note:
Scikit-learn's cross-validation utilities expect a utility function (greater is better) rather than a cost function (lower is better).
Mean Squared Error (MSE) is a cost function: it is non-negative and lower values mean a better model. To fit scikit-learn's convention, the scorer returns the negated MSE, so a score of -0.2 indicates a better model than a score of -0.9.
To get this behaviour, we pass the "scoring" parameter to the "cross_val_score" function like this:
cv_score = cross_val_score(lasso, train.drop(['Id'], axis=1), y_train, scoring="neg_mean_squared_error", cv=5)
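A small helper (a sketch, not one of the original cells) that bundles the sign flip and square root described above; it assumes the X_train and y_train arrays defined in the cells below. Each model cell below does the same thing inline.
# Sketch: turn scikit-learn's negated MSE scores into RMSE values.
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y, cv=5):
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)  # flip the sign back and take the square root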
We will be testing the following regression models for this house price problem: Lasso, Elastic Net, Kernel Ridge, Gradient Boosting, XGBoost and LightGBM.
Let's first import the model libraries.
In [90]:
# importing model libraries
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
In [78]:
X_train = train.drop(['Id'], axis=1)
# y_train has been defined above where we combined train and test data to create all_data
X_test = test.drop(['Id'], axis=1)
In [91]:
#lasso = Lasso(alpha =0.0005, random_state=1)
#lasso = Lasso()
model_lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005))
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_lasso, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [92]:
model_elastic_net = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005))
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_elastic_net, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [94]:
model_kernel_ridge = KernelRidge(alpha=0.6)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_kernel_ridge, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [95]:
model_gboost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=5)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_gboost, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
In [96]:
model_xgboost = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=True, nthread = -1)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_xgboost, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision trees, used for ranking, classification and many other machine learning tasks.
Because it is based on decision trees, it grows each tree leaf-wise, splitting the leaf with the best fit, whereas most other boosting implementations grow trees depth-wise (level by level). For the same number of leaves, leaf-wise growth can reduce the loss more and often gives better accuracy. It is also very fast, hence the name 'Light'.
In [97]:
model_lgbm = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_lgbm, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
We have already done cross-validation above, but cross-validation fits the model on different subsets of the dataset and then averages the scores. It is common practice to fit the model on the full training dataset once it has shown a sufficient cross-validation score.
Hence, here we train our models with the fit method, i.e. we fit them on the predictors (X_train) and the outcome (y_train) so that they can be used to predict unseen data.
In [104]:
model_lasso.fit(X_train, y_train)
model_elastic_net.fit(X_train, y_train)
model_kernel_ridge.fit(X_train, y_train)
model_gboost.fit(X_train, y_train)
model_xgboost.fit(X_train, y_train)
model_lgbm.fit(X_train, y_train)
Out[104]:
Above, we have trained our models on the training dataset. Here, we use those trained models to generate predictions on the training data itself, and then calculate the Root Mean Square Error (RMSE) of those predictions.
This shows how accurately each model predicts data it has already seen. The result below shows that the Gradient Boosting model has the most accurate predictions on the already-seen training data.
In [122]:
dict_models = {'lasso':model_lasso, 'elastic_net':model_elastic_net, 'kernel_ridge':model_kernel_ridge,
'gboost':model_gboost, 'xgboost':model_xgboost, 'lgbm':model_lgbm}
for key, value in dict_models.items():
    pred_train = value.predict(X_train)
    rmse = np.sqrt(mean_squared_error(y_train, pred_train))
    print ("%s: %f" % (key, rmse))
In [128]:
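# SalePrice was transformed with np.log, so np.exp is its exact inverse;
# np.expm1 (= exp(x) - 1) differs from it by only 1, which is negligible at this price scale.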
prediction_lasso = np.expm1(model_lasso.predict(X_test))
prediction_elastic_net = np.expm1(model_elastic_net.predict(X_test))
prediction_kernel_ridge = np.expm1(model_kernel_ridge.predict(X_test))
prediction_gboost = np.expm1(model_gboost.predict(X_test))
prediction_xgboost = np.expm1(model_xgboost.predict(X_test))
prediction_lgbm = np.expm1(model_lgbm.predict(X_test))
We can try different prediction combinations before generating the Kaggle submission file: a single model's predictions, or the average of two or more models' predictions.
I got the best Kaggle score by averaging the predictions of the Lasso and Elastic Net models.
In [131]:
# kaggle score: 0.12346
#prediction = prediction_gboost
# kaggle score: 0.12053
#prediction = (prediction_lasso + prediction_xgboost) / float(2)
# kaggle score: 0.11960
#prediction = prediction_lasso
# kaggle score: 0.11937
prediction = (prediction_lasso + prediction_elastic_net) / float(2)
#print prediction
In [132]:
submission = pd.DataFrame({
"Id": test["Id"],
"SalePrice": prediction
})
#submission.to_csv('submission.csv', index=False)