So far, we've been running all of our code linearly, i.e. we execute a file and let Python do its job. Here we will be running code interactively. Each cell contains some code, which we can execute by clicking on the cell and pressing Shift + Enter. Here is an example:
In [1]:
print("Hello, world!")
As you can see, the code executed and the output is now in the notebook. One thing that's special about Jupyter is that our variables are persistent, i.e. I could create the cell:
In [2]:
hello = "Hello, world!"
And then in a new cell execute:
In [3]:
print(hello)
You might have noticed the In [ ] to the side of the cells. The numbers inside indicate the order in which the code in the cells was executed. You should imagine this notebook as maintaining a hidden Python program, with all of our variables, functions, etc. executed in the order of the In [ ] counters.
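In other words, the three cells above behave like this ordinary Python script, with the statements executed in In [ ] order:

```python
# Equivalent linear script: the cells run in In [ ] order,
# and variables persist between them.
print("Hello, world!")   # In [1]
hello = "Hello, world!"  # In [2]
print(hello)             # In [3]
```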
Now, let's do some data science. To start off, we will import the libraries which we will need.
In [4]:
import pandas as pd
import sklearn as sk
from sklearn import datasets as ds
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
In our first example, we will analyze a dataset of Boston house prices, which comes pre-loaded with the sklearn module. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this particular example requires an older version of the library.)
In [5]:
boston = ds.load_boston()
When you run a cell in Jupyter, it will try to print out the value on the last line of the cell. So far, we've loaded the dataset into a variable boston. Let's investigate how this dataset is actually structured.
In [6]:
boston
Out[6]:
As we can see, this dataset is actually a dict (more precisely, a sklearn Bunch, which subclasses dict), containing a few fields. We can get a list of those fields:
In [7]:
boston.keys()
Out[7]:
The DESCR key is particularly interesting, as it contains a string which describes the dataset. Let's check it out:
In [8]:
print(boston['DESCR'])
So as we can see, the data field contains the first 13 attributes, the target field contains the median value of a house, and the feature_names field contains the names of the features. We could view these all individually.
In [9]:
boston['data']
Out[9]:
In [10]:
boston['target']
Out[10]:
In [11]:
boston['feature_names']
Out[11]:
However, this is very inconvenient. To address this issue, we will use the DataFrame class from the pandas module that we imported.
In [13]:
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
df
Out[13]:
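A common follow-up (not in the original cells) is to attach the target as an extra column so everything lives in one DataFrame. Here is a minimal sketch, using a tiny made-up stand-in for the Boston bunch (the real one has 506 rows and 13 feature columns):

```python
import numpy as np
import pandas as pd

# Made-up stand-in with the same dict-like structure as the real dataset
boston_like = {
    'data': np.array([[0.1, 6.5], [0.2, 5.9]]),
    'target': np.array([24.0, 21.6]),
    'feature_names': np.array(['CRIM', 'RM']),
}

df = pd.DataFrame(data=boston_like['data'], columns=boston_like['feature_names'])
df['MEDV'] = boston_like['target']  # attach the target as its own column
```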
Much nicer. There are settings in Jupyter which allow us to force rendering more rows, but we will not go into those now. The important thing is that we can inspect our data. We might wish to query our data to filter some results. Pandas supports a few ways of doing this. One of them is the query method of DataFrame.
In [14]:
df.query("CRIM >= 20") # List all towns with average crime rate >= 20 per capita
Out[14]:
The query method is quite intuitive to use: if your filter expression reads like Python code, it will probably work.
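For reference, the same filter can also be written with a boolean mask instead of query. A minimal sketch, on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'CRIM': [0.1, 25.0, 38.3], 'RM': [6.5, 5.4, 4.9]})

# Boolean-mask equivalent of df.query("CRIM >= 20")
high_crime = df[df['CRIM'] >= 20]
same = df.query("CRIM >= 20")
assert high_crime.equals(same)
```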
Now, maybe we want to inspect our data, to see how each of the attributes is related to the median house price (the "target"). To do that we will use the pyplot submodule of the matplotlib module. Let's first plot the crime rate against the house price.
In [20]:
plt.scatter(df['CRIM'], boston['target'])
plt.xlabel('Average crime rate per capita')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, the house price goes down as the crime rate increases. However, at the lower end, crime rate doesn't seem to be such a good predictor. We could do the same thing and plot each column against the house price. Or we could do it programmatically.
In [21]:
plt.figure(figsize=(20,15))
for i, column in enumerate(df):
    plt.subplot(4, 4, i+1)
    plt.scatter(df[column], boston['target'])
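A slightly more readable variant of this loop labels each subplot with its column name. A self-contained sketch, using random stand-in data in place of the Boston columns:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so this runs anywhere
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the real dataset: 13 random columns and a random target
df = pd.DataFrame(np.random.rand(50, 13), columns=[f'col{i}' for i in range(13)])
target = np.random.rand(50)

fig = plt.figure(figsize=(20, 15))
for i, column in enumerate(df):
    plt.subplot(4, 4, i + 1)
    plt.scatter(df[column], target, s=5)
    plt.title(column)  # label each panel with its column name
```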
As we can see, some attributes seem more predictive than others. Let's focus on the RM attribute, which is the average number of rooms per dwelling.
In [22]:
plt.scatter(df['RM'], boston['target'])
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, this also seems like a pretty good predictor of house price. But what if we wanted to model the correlation between the number of rooms in a dwelling and its expected price in that town? One simple model is called linear regression, and it essentially involves drawing a line through the data points such that it is, on average, as close as possible to all of them. The sklearn module provides us with such a model, which we will use in our program.
In [ ]:
reg = LinearRegression().fit(df[['RM']], boston['target'])
y_pred = reg.predict(df[['RM']])
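To see the whole workflow end to end, here is a self-contained sketch on synthetic data (the rooms/price arrays and the rough slope of 9 are invented for illustration, not taken from the Boston dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=(100, 1))               # stand-in for df[['RM']]
price = 9 * rooms[:, 0] - 30 + rng.normal(0, 2, 100)   # roughly linear, with noise

# Fit the line and predict on the same feature used for fitting
reg = LinearRegression().fit(rooms, price)
y_pred = reg.predict(rooms)

# The learned slope and intercept should be close to the true 9 and -30
print(reg.coef_, reg.intercept_)
```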