In [1]:
library(tidyverse)
library(reshape2)
#options(repr.plot.width = 4, repr.plot.height = 3)  # set plot size


Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths


In [2]:
train = read_csv("../data/trainEng.csv")


Parsed with column specification:
cols(
  .default = col_integer(),
  LotFrontage = col_double(),
  LogLotArea = col_double(),
  Neighborhood = col_character(),
  YearBuilt = col_character(),
  YearRemodAdd = col_character(),
  Exterior = col_character(),
  ExterCond = col_character(),
  Foundation = col_character(),
  LogTotalBsmtSF = col_double(),
  Log1stFlrSF = col_double(),
  LogGrLivArea = col_double(),
  LogTotalArea = col_double(),
  FullBath = col_double(),
  HalfBath = col_double(),
  GarageYrBlt = col_character(),
  GarageFinish = col_character(),
  GarageCars = col_double(),
  MoSold = col_character(),
  SaleType = col_character(),
  LogSalePrice = col_double()
)
See spec(...) for full column specifications.

In [3]:
dim(train)


  1. 1460
  2. 55

In [4]:
vars = c("LotFrontage", "LogLotArea", "LogTotalBsmtSF", "Log1stFlrSF", "LogGrLivArea", "LogTotalArea", "TotalPorchSF")
mat = cor(train[, c(vars, "LogSalePrice")])
mat


               LotFrontage LogLotArea LogTotalBsmtSF Log1stFlrSF LogGrLivArea LogTotalArea TotalPorchSF LogSalePrice
LotFrontage      1.0000000  0.5665034      0.1299744   0.4014096    0.3336801    0.3792882    0.1295392    0.3355547
LogLotArea       0.5665034  1.0000000      0.1255214   0.4674648    0.3854352    0.4268220    0.1662365    0.3999177
LogTotalBsmtSF   0.1299744  0.1255214      1.0000000   0.2882404    0.2066373    0.5650552    0.1072499    0.3728379
Log1stFlrSF      0.4014096  0.4674648      0.2882404   1.0000000    0.5459838    0.7473212    0.1520445    0.6089467
LogGrLivArea     0.3336801  0.3854352      0.2066373   0.5459838    1.0000000    0.8672684    0.2693661    0.7302549
LogTotalArea     0.3792882  0.4268220      0.5650552   0.7473212    0.8672684    1.0000000    0.2539046    0.8035906
TotalPorchSF     0.1295392  0.1662365      0.1072499   0.1520445    0.2693661    0.2539046    1.0000000    0.1951663
LogSalePrice     0.3355547  0.3999177      0.3728379   0.6089467    0.7302549    0.8035906    0.1951663    1.0000000
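
To read the LogSalePrice column at a glance, the correlations can be sorted from strongest to weakest. The sketch below hardcodes that column from the output above so that it runs on its own; the names `salePriceCor` and `ranked` are introduced here for illustration only. In the notebook itself, `sort(mat[vars, "LogSalePrice"], decreasing = TRUE)` should give the same ordering.

```r
# Correlations with LogSalePrice, copied from the matrix printed above
salePriceCor = c(
    LotFrontage    = 0.3355547,
    LogLotArea     = 0.3999177,
    LogTotalBsmtSF = 0.3728379,
    Log1stFlrSF    = 0.6089467,
    LogGrLivArea   = 0.7302549,
    LogTotalArea   = 0.8035906,
    TotalPorchSF   = 0.1951663
)

# Rank the predictors from strongest to weakest correlation
ranked = sort(salePriceCor, decreasing = TRUE)
ranked
```

LogTotalArea comes out first and TotalPorchSF last.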

Correlation Heatmap


In [5]:
meltMat = melt(mat)
ggplot(data = meltMat, mapping = aes(x = Var1, y = Var2)) + 
    geom_tile(mapping = aes(fill = value)) +
    labs(x = "", y = "") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    scale_fill_gradient2(low = "lightblue", high = "darkblue")


Plot Sale Price vs. Total Area, colored by each categorical variable

Recall that Total Area is the sum of the above-ground living area (GrLivArea) and the basement area (TotalBsmtSF). This variable is of special interest because, of the numeric variables in the correlation matrix above, it has the highest correlation with Sale Price (about 0.80 on the log scale).
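
As a reminder of how such a variable could be built, here is a minimal sketch. It assumes LogTotalArea is the natural log of the summed raw areas; the feature engineering step is not shown in this notebook, so that is an assumption. The data frame `toy` and its values are made up for illustration.

```r
# Toy stand-in for the raw data; the values are invented
toy = data.frame(
    GrLivArea   = c(1710, 1262, 1786),
    TotalBsmtSF = c(856, 1262, 920)
)

# Assumed definition: log of (above-ground living area + basement area)
toy$LogTotalArea = log(toy$GrLivArea + toy$TotalBsmtSF)
toy
```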


In [6]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(MSSubClass))) +
    geom_point() +
    labs(color = "MSSubClass")


Houses with 2 to 2.5 stories tend to be among the more expensive ones.


In [7]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(MSZoning))) +
    geom_point() +
    labs(color = "MSZoning")



In [8]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(PavedDrive))) +
    geom_point() +
    labs(color = "PavedDrive")


Cheaper houses tend not to have paved driveways.


In [9]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(Alley))) +
    geom_point() +
    labs(color = "Alley")



In [10]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(LotShape))) +
    geom_point() +
    labs(color = "LotShape")


More expensive houses tend to have more irregular lot shapes.


In [11]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(LandContour))) +
    geom_point() +
    labs(color = "LandContour")



In [12]:
ggplot(train, aes(x = LogTotalArea, y = LogSalePrice, color = factor(LotConfig))) +
    geom_point() +
    labs(color = "LotConfig")
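
The cells above repeat the same plot with a different coloring variable each time. One way to cut the repetition is a small helper, sketched below. `plotByCategory` is a hypothetical name, the `toy` data frame is a made-up stand-in for `train`, and the `.data[[...]]` pronoun assumes ggplot2 3.0.0 or later, which is newer than the version this notebook appears to have been run with.

```r
library(ggplot2)

# Hypothetical helper: LogSalePrice vs. LogTotalArea, colored by one categorical column
plotByCategory = function(data, catVar) {
    ggplot(data, aes(x = LogTotalArea, y = LogSalePrice, color = factor(.data[[catVar]]))) +
        geom_point() +
        labs(color = catVar)
}

# Toy stand-in for `train`; the values are invented
toy = data.frame(
    LogTotalArea = c(7.5, 7.8, 8.1, 8.3),
    LogSalePrice = c(11.6, 11.9, 12.3, 12.6),
    PavedDrive   = c("N", "Y", "Y", "Y"),
    LotShape     = c("Reg", "Reg", "IR1", "IR2")
)

# Build one plot per categorical variable
catVars = c("PavedDrive", "LotShape")
plots = lapply(catVars, function(v) plotByCategory(toy, v))
names(plots) = catVars
```

With the real data, `plotByCategory(train, "LotConfig")` should reproduce the last cell above, and printing `plots[["PavedDrive"]]` displays a single plot from the list.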