This dataset contains information about housing in the Boston area, published by Harrison and Rubinfeld in 1978. It is small, with only 506 observations and 14 variables. The dataset can be found here.
Drawing from the University of Toronto's Computer Science department, the 14 variables are:
The Toronto CS team notes that "MEDV seems to be censored at 50.00 (corresponding to a median price of 50,000); Censoring is suggested by the fact that the highest median price of exactly 50,000 is reported in 16 cases, while 15 cases have prices between 40,000 and 50,000, with prices rounded to the nearest hundred. Harrison and Rubinfeld do not mention any censoring."
In [2]:
import pandas as pd
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
# this url has no header info, so column names must be specified
colnames = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
            'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# load dataset into pandas; the file is whitespace-delimited, so split on runs of whitespace
df = pd.read_csv(data_url, header=None, sep=r'\s+', names=colnames)
df.head()
Out[2]:
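The censoring the Toronto CS team describes can be checked by counting how many MEDV values sit exactly at the cap and how many fall just below it. The sketch below is one way to do that; the helper name `summarize_censoring` and its `cap` and `lower` parameters are my own, and the toy series is made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def summarize_censoring(medv: pd.Series, cap: float = 50.0, lower: float = 40.0) -> dict:
    """Count values exactly at the cap and values in [lower, cap)."""
    at_cap = int((medv == cap).sum())
    in_band = int(((medv >= lower) & (medv < cap)).sum())
    return {
        "at_cap": at_cap,
        "between_lower_and_cap": in_band,
        "max": float(medv.max()),
    }

# Toy values for illustration only (not rows from the real dataset)
toy = pd.Series([21.6, 34.7, 50.0, 50.0, 43.8, 50.0])
print(summarize_censoring(toy))
```

With the DataFrame loaded above, the call would be `summarize_censoring(df['MEDV'])`. A large count at the cap relative to the band just below it is what suggests censoring.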
Let's explore some of these variables. Several of them should be correlated. For example, I bet that as home value (MEDV) increases, so will the number of rooms (RM), while the crime rate (CRIM) will decrease. Other relationships might not be so intuitive.
Let's compare seven of these variables: the three just mentioned, plus nitric oxide concentration (NOX), the proportion of older (pre-1940) houses (AGE), accessibility to radial highways (RAD), and the pupil-to-teacher ratio (PTRATIO).
In [10]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Our seven columns of interest
cols = ['MEDV', 'RM', 'CRIM', 'NOX', 'AGE', 'RAD', 'PTRATIO']
sns.set(font_scale=1.5)
sns.pairplot(df[cols], height=2.5)
Out[10]:
There is a lot of information in this plot, but the trends we supposed earlier appear to hold: MEDV rises with RM and falls as CRIM increases. Another interesting trend is that nitric oxide concentration appears to be higher in areas with a larger share of older homes. The highest crime rates appear to be concentrated where the share of older homes is high and where the pupil-to-teacher ratio is around 20 to 1. Many of these relationships are pretty noisy. Let's check how these variables correlate with a heatmap.
In [18]:
import numpy as np
cm = np.corrcoef(df[cols].values.T)
sns.heatmap(cm, annot=True, square=True, fmt='.2f', annot_kws={'size': 12},
            yticklabels=cols, xticklabels=cols)
Out[18]:
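The heatmap can be hard to scan when the main interest is how each variable relates to MEDV. As a rough companion, the sketch below ranks columns by the absolute value of their Pearson correlation with a target column. The helper `rank_correlations` is a name I made up, and the toy frame holds invented numbers purely to show the call.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`,
    ordered from strongest to weakest by absolute value."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Toy frame for illustration only (not rows from the real dataset)
toy = pd.DataFrame({
    'MEDV': [10.0, 20.0, 30.0, 40.0],
    'RM': [4.0, 5.0, 6.0, 7.0],
    'CRIM': [9.0, 1.0, 2.0, 0.5],
})
print(rank_correlations(toy, 'MEDV'))
```

On the real data this would be `rank_correlations(df[cols], 'MEDV')`. The signs are preserved in the result, so positive and negative relationships stay distinguishable even though the ordering uses absolute values.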