So far, we've been running all of our code linearly, i.e. we execute a file and let Python do its job. Here we will be running code interactively. Each cell contains some code, which we can execute by clicking on the cell and pressing Shift + Enter. Here is an example:
In [1]:
print("Hello, world!")
As you can see, the code executed and the output is now in the notebook. One thing that's special about Jupyter is that our variables are persistent, i.e. we could create the cell:
In [2]:
hello = "Hello, world!"
And then, in a new cell, execute:
In [3]:
print(hello)
You might have noticed the In [ ] to the side of the cells. The numbers inside indicate the order in which the code in the cells was executed. You should imagine this notebook as maintaining a hidden Python program, with all of our variables, functions, etc. executed in the order of the In [ ] blocks.
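For instance (a hypothetical pair of cells, not part of the analysis below): if we run the first cell once and then the second cell twice, the final output is 3, because the hidden program has executed the increment twice, regardless of where the cells sit on the page.
In [ ]:
counter = 1
In [ ]:
counter = counter + 1
counter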
Now, let's analyze some data. To start off, we will import the libraries we will need.
In [4]:
import pandas as pd
import sklearn as sk
from sklearn import datasets as ds
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
In our first example, we will analyze a dataset of Boston house prices, which comes packaged with the sklearn module.
In [5]:
boston = ds.load_boston()
When you run a cell in Jupyter, it will try to print out the value on the last line of the cell. So far, we've loaded the dataset into a variable boston. Let's investigate how this dataset is actually structured.
In [6]:
boston
Out[6]:
As we can see, this dataset is actually a dict, containing a few fields. We can get a list of those fields:
In [7]:
boston.keys()
Out[7]:
The DESCR key is particularly interesting, as it contains a string which describes the dataset. Let's check it out:
In [8]:
print(boston['DESCR'])
So as we can see, the data field contains the first 13 attributes, the target field contains the median value of a house, and the feature_names field contains the names of the features. We could view these all individually.
In [9]:
boston['data']
Out[9]:
In [10]:
boston['target']
Out[10]:
In [11]:
boston['feature_names']
Out[11]:
However, this is very inconvenient. To address this issue, we will use the DataFrame class from the pandas module we imported.
In [13]:
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
df
Out[13]:
Much nicer. There are pandas display settings which allow us to force rendering more rows, but we will not go into these now. The important thing is that we can inspect our data. We might also wish to query our data to filter some results. Pandas supports a few ways of doing this. One of them is the query function of DataFrames.
In [14]:
df.query("CRIM >= 20") # List all towns with average crime rate >= 20 per capita
Out[14]:
The query function is quite intuitive to use: if your query reads like Python code, it will probably work.
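Conditions can also be combined inside a single query. A small sketch (the thresholds here are arbitrary, chosen just for illustration):
In [ ]:
df.query("CRIM >= 20 and RM < 6")  # high-crime towns with relatively small dwellings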
Now, maybe we want to inspect our data to see how each of the attributes relates to the median house price (the "target"). To do that, we will use the pyplot submodule of the matplotlib module. Let's first plot the crime rate against the house price.
In [20]:
plt.scatter(df['CRIM'], boston['target'])
plt.xlabel('Average crime rate per capita')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, the house price goes down as the crime rate increases. However, at the lower end, crime rate doesn't seem to be such a good predictor. We could do the same thing and plot each attribute against the house price. Or we could do it programmatically.
In [21]:
plt.figure(figsize=(20,15))
for i, column in enumerate(df):  # iterating over a DataFrame yields its column names
    plt.subplot(4, 4, i+1)
    plt.scatter(df[column], boston['target'])
plt.show()
As we can see, some attributes seem more predictive than others. Let's focus on the RM attribute, which is the average number of rooms per dwelling.
In [22]:
plt.scatter(df['RM'], boston['target'])
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, this also seems like a pretty good predictor of house price. But what if we wanted to model the correlation between the number of rooms in a dwelling and its expected price in that town? One simple model is called Linear Regression, and it essentially involves drawing a line through the data. The sklearn module provides us with such a model, which we will use in our program.
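Concretely, fitting a simple linear regression means finding a slope $w$ and an intercept $b$ for the line $\hat{y} = w x + b$ that minimize the sum of squared errors $\sum_i \left(y_i - (w x_i + b)\right)^2$ over the data points $(x_i, y_i)$.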
In [23]:
reg = LinearRegression().fit(df[['RM']], boston['target'])
Now the variable reg contains our model. It is fit on our data, meaning that it is the line which best fits our data (in the least-squares sense of linear regression). The red line in the plot below shows how our model fits the data: given the number of rooms, it will always predict a house price that lies on that line.
In [25]:
plt.scatter(df['RM'], boston['target'])
plt.plot(df['RM'], reg.predict(df[['RM']]), color='r')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
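If you want to see the fitted line's actual parameters, they are available on the model itself (coef_ and intercept_ are standard attributes of a fitted sklearn LinearRegression):
In [ ]:
print(reg.coef_)       # slope: change in median price (in $1000s) per additional room
print(reg.intercept_)  # intercept of the fitted line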
One way to evaluate how well our model fits the data is called the $R^2$ score, or coefficient of determination. Essentially, it assigns a number between $-\infty$ and $1$ which indicates how well our model fits the data. The best possible score is $1$. We can check out the $R^2$ score of our model using the score function.
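For reference, the coefficient of determination is computed as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean of the targets. A model that always predicts the mean scores $0$, which is why sufficiently bad models can score below zero.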
In [27]:
reg.score(df[['RM']], boston['target'])
Out[27]:
Not great, not terrible.
We can use the same technique with all of our attributes at once to predict the house price. This would look like:
In [28]:
reg_full = LinearRegression().fit(df, boston['target'])
Now, to get an intuition of how well this model fits the data, we will again plot the average number of rooms against the median price. This time, the actual data points are in blue, and the predictions of our model are in red.
In [31]:
plt.scatter(df['RM'], boston['target'])
plt.scatter(df['RM'], reg_full.predict(df), color='r')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
This looks much better. In fact, we can try the score again:
In [32]:
reg_full.score(df, boston['target'])
Out[32]:
As we can see, this model is a lot better than the first one. As an exercise, you might want to try removing some of the attributes, and seeing which ones affect the score a lot, and which ones don't.
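One minimal way to start on that exercise (dropping CRIM is just an arbitrary example; df_reduced and reg_reduced are names introduced here for illustration):
In [ ]:
df_reduced = df.drop(columns=['CRIM'])  # remove one attribute as an experiment
reg_reduced = LinearRegression().fit(df_reduced, boston['target'])
reg_reduced.score(df_reduced, boston['target'])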
If we want to import our own dataset, we can do that as well. Pandas has functions like read_csv and read_excel which allow you to import your dataset from a .csv or .xlsx file and analyze it like any other DataFrame. What you would do is:
In [ ]:
my_df = pd.read_csv('my_dataset.csv')
Or
In [ ]:
my_df = pd.read_excel('my_dataset.xlsx')
The output here is omitted, but you can try googling "csv dataset" or "excel dataset" and exploring some of the datasets you find.
As an additional exercise, you might want to check out the other datasets included with sklearn and try exploring them like we did here. Try looking up methods other than Linear Regression as well.
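To get started, here is a minimal sketch along those lines. The choice of the diabetes dataset and a decision tree regressor is just an example, not the only option:
In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
df_diab = pd.DataFrame(data=diabetes['data'], columns=diabetes['feature_names'])
tree = DecisionTreeRegressor(max_depth=3).fit(df_diab, diabetes['target'])
tree.score(df_diab, diabetes['target'])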
In [ ]: