This is part 2 of the posts on Linear Regression. Part 1 gave an introduction and motivation and defined many of the terms that will be used. You might want to give that a read before looking at this.
You are free to use this post as you see fit. It was converted to html from a Jupyter notebook. The notebook is available at https://github.com/dbkinghorn/blog-jupyter-notebooks
The focus in this post is on getting our data and taking a first look at it. We will use some of the standard (nice!) tools available as Python modules to do this.
Since this is supposed to be an introduction I will be fairly verbose in my descriptions of what is being done. Most of the dialog will be in "markdown" cells like the one you are reading now. I will use Python comments in code cells for short descriptions.
I got the data I'm using from Kaggle. Kaggle is an interesting data analysis community site and worth a visit if you haven't seen it. The data set is on the page "House Sales in King County, USA". The data is in a .zip file and will expand to a .csv file (comma separated values). Grab that, unzip it and put it in your working directory.
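If you'd rather unzip the file from Python, the standard library can do it. Here is a minimal sketch, assuming the downloaded file is named kc-house-data.zip (check the actual name of your download):

import zipfile
with zipfile.ZipFile("kc-house-data.zip") as zf:
    zf.extractall(".")  # expands to kc_house_data.csv in the working directory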
Now the real work starts. We have some data, so the first thing we need to do is see if it is usable. Our goal is simple: find some "interesting" data to use for an example of linear regression. We are guessing that house sale prices will correlate well with the size of a house. However,
... the first rule of data analysis is don't assume anything! Look at the data and make sure it is usable. Remember: garbage in, garbage out!
Let's take a look at this data set and see if we can use it.
The first thing we need to do is load some Python tools. The following module loads will become a pretty standard start to this kind of work. We are importing modules and using the common abbreviations for their namespaces. [ pandas, numpy, matplotlib, and seaborn are common tools for this work. I hope to write separate usage posts on these at some point. ]
seaborn -- is a very nice plotting and data-handling package built on top of matplotlib (and easier to use)
%matplotlib inline
-- is what is known as a Jupyter notebook "magic"; it just tells the notebook to put the plots in-line in the notebook.
pd.set_option('...')
-- these lines are just something I added to make the formatting look better in some of the output cells.
In [1]:
import pandas as pd # data handling
import numpy as np # numerical computing
import matplotlib.pyplot as plt # plotting core
import seaborn as sns # higher level plotting tools
%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_columns', 21)
pd.set_option('display.max_rows', 70)
After you have "un-zipped" the data file you will have a file named "kc_house_data.csv". We will load that into a pandas data-frame and take a look at the first few lines of data. [ We will call the data-frame df for now, which is common practice. ]
In [2]:
df = pd.read_csv("kc_house_data.csv") # create a dataframe with pandas "pd"
In [3]:
df.head() # display the first few lines of the dataframe "df"
Out[3]:
The data looks very clean. We won't have to do any type or format conversions. Let's make a quick check for missing values. [ looks OK ]
In [4]:
df.isnull().values.any() # check for missing values
Out[4]:
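If that check had come back True, a per-column tally would show where the gaps are. A quick way to do that (my addition, not a cell from the original notebook) is:

df.isnull().sum() # number of missing values in each column; all zeros for this data set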
Here are some basic stats for some of the data.
In [5]:
df[["price","bedrooms","bathrooms","sqft_living","sqft_lot","sqft_above","yr_built","sqft_living15","sqft_lot15"]].describe()
Out[5]:
You can see the home prices vary from \$75K to \$7.7 million, with living space from 290 sqft to 13,540 sqft. Lots of variety!
The data set contains 21,613 observations (home sales in 2014-15) with 19 features plus the house price. Descriptions and names of the columns (features) are given below.
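If you want to see the column names and types for yourself, pandas will list them directly. This quick check is my addition, not one of the original cells:

df.info() # column names, non-null counts, and dtype for each column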
Since we are doing linear regression we'll want to look at "continuous" features. Intuitively the most promising is sqft_living, but the candidates are sqft_living, sqft_lot, sqft_above, sqft_living15, and sqft_lot15. Let's take a look at these with some plots using seaborn.
In [6]:
sns.pairplot(data=df, x_vars=['sqft_living','sqft_lot','sqft_above','sqft_living15','sqft_lot15'], y_vars=["price"])
Out[6]:
You can see that "lot" size is not well correlated to price but the data for living space is reasonable. Visually the best feature to use looks like sqft_living as we expected.
Let's pull that data out of the data-frame into a new frame.
In [7]:
df2 = df[["price", "sqft_living"]]
df2.head()
Out[7]:
Now take a closer look at the data with a joint distribution plot.
In [8]:
sns.jointplot('sqft_living','price', data=df2, size=10, alpha=.5, marker='+')
Out[8]:
The increase of price with sqft_living space is pretty clear, and the Pearson r value of 0.7 indicates a reasonable correlation. However, the data distributions show a big concentration of values in the lower left of the plot. That makes sense: most houses are between 1200 and 3000 sqft and cost a few hundred thousand dollars. We can eliminate the very expensive and very large houses and take another look at the data.
If we set the size (xlim) from 500 to 3500 sqft and the price (ylim) from \$100,000 to \$1,000,000 the data still shows the trend, but it looks very scattered.
In [9]:
sns.jointplot('sqft_living','price', data=df2, xlim=(500,3500), ylim=(100000,1000000), size=10, alpha=.5, marker='+')
Out[9]:
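Keep in mind that xlim and ylim only zoom the plot; the underlying data-frame is unchanged. If you wanted to actually restrict the data to that mid-range band, a boolean mask would do it. A minimal sketch (df2_mid is my name, not from the notebook):

in_range = (df2['sqft_living'].between(500, 3500) &
            df2['price'].between(100000, 1000000))
df2_mid = df2[in_range] # just the mid-range houses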
Something worth considering is that different neighborhoods can vary greatly in average house price. Some nice neighborhoods are very expensive and some other (also nice!) neighborhoods can be quite affordable. It might be good to look at average house price by zipcode since we have that in our dataset.
In [10]:
df["zipcode"].nunique()
Out[10]:
It looks like there are 70 different zipcodes in King County. Let's see how many house sales there were in each.
In [11]:
df['zipcode'].value_counts()
Out[11]:
How about the average house sale price in each zipcode ...
In [12]:
df.groupby('zipcode')['price'].mean() # group by zipcode and compute the mean of prices in a zipcode
Out[12]:
The two zipcodes that look the most interesting to me are 98103 and 98039. 98103 has the most house sales, 602, with an average sale price of \$584,919. The most expensive zipcode, 98039, has 50 sales with an average sale price of \$2,160,606.
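If you would rather not scan those long listings by eye, combining the count and mean into one grouped table and sorting it makes those zipcodes easy to spot. This is my addition, not a cell from the original notebook:

zip_stats = df.groupby('zipcode')['price'].agg(['count', 'mean'])
zip_stats.sort_values('mean', ascending=False).head() # most expensive zipcodes first; 98039 tops the list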
We can create "selectors" by creating lists of true-false values for data entries that match these two zipcodes and then use those to filter our data-frame.
In [13]:
zip98103 = df['zipcode'] == 98103 # True if zip is 98103
zip98039 = df['zipcode'] == 98039
Using the "selectors" above we can look at plots of price vs sqft_living in those zipcodes.
In [14]:
sns.jointplot('sqft_living','price', data=df2[zip98103], size=10, alpha=.5, marker='+')
Out[14]:
In [15]:
sns.jointplot('sqft_living','price', data=df2[zip98039], size=10, alpha=.5, marker='+')
Out[15]:
The 98103 zipcode has a distribution that looks similar to the complete dataset. It's interesting that in the most expensive zipcode, 98039, the house sale prices seem to be highly correlated with the size of the house (house-size envy :-). Note: I don't live in that expensive zipcode! (My neighborhood is about 10 times less expensive than that, and I like it a lot.)
I did look at all of the zipcodes, and in general localizing to a zipcode does improve the correlation of size to price; it is probably more meaningful and useful to model specific zipcodes. We want our model to have good predictive value, so restricting to smaller areas with less variation in price is a good thing.
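One way to check that across all 70 zipcodes at once is to compute the size-to-price correlation within each group. Here's a sketch of how you might do it (not a cell from the original notebook):

# Pearson r of sqft_living vs price within each zipcode, strongest first
r_by_zip = df.groupby('zipcode').apply(lambda g: g['sqft_living'].corr(g['price']))
r_by_zip.sort_values(ascending=False).head()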
I'll use the expensive zipcode data for the Linear Regression example we are working on. You could grab these notebooks and use any data subset you like!
Here's the data-frame we'll use,
In [16]:
df_98039 = df2[zip98039]
In [17]:
df_98039.describe()
Out[17]:
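If you want to carry this subset into the next post without re-running the selection, it's handy to write it out to its own file (the filename here is just my choice):

df_98039.to_csv("kc_house_98039.csv", index=False) # save the 98039 subset for later use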
That's enough! You can spend hours playing with a dataset. These Python tools are wonderful for exploration. I'm new to using these modules and I am really impressed with what can be done. If I had wanted to, I could have done the linear regression right along with the data plots; however, that would defeat the purpose of these blog posts. I encourage you to try some of this yourself.
In the next post I'll get back to the Linear Regression algorithms and take a deep look at how they work.
Happy computing! --dbk