Python for Data Science Workshop @VeloCity

1.1 Jupyter Notebook

Jupyter Notebook is widely used by data scientists who work in Python. It is loosely inspired by Mathematica and combines code, text and visual output on a single page.

Some relevant shortcuts:

  • SHIFT + ENTER executes one block of code, called a cell
  • Tab completion is available everywhere once a package has been imported
  • SHIFT + TAB gives you extra information on what parameters a function takes
  • Pressing SHIFT + TAB repeatedly gives you even more information

To get used to these shortcuts, try them out on the cell below.


In [ ]:
print 'Hello world!'
print range(5)

1.2 Numpy arrays

We'll often be working with NumPy arrays, so here's a very short introduction.


In [ ]:
import numpy as np

# This is a two-dimensional numpy array
arr = np.array([[1,2,3,4],[5,6,7,8]])
print arr

# The shape is a tuple describing the size of each dimension
print "shape=" + str(arr.shape)

In [ ]:
# The numpy reshape method allows one to change the shape of an array, while keeping the underlying data.
# One dimension can be left unspecified by passing -1; it will be determined from the size of the data.

print "As 4x2 matrix" 
print np.reshape(arr, (4,2))

print 
print "As 8x1 matrix" 
print np.reshape(arr, (-1,1))

print 
print "As 2x2x2 array" 
print np.reshape(arr, (2,2,-1))

Basic arithmetic operations on arrays of the same shape are applied elementwise:


In [ ]:
x = np.array([1.,2.,3.])
y = np.array([4.,5.,6.])

print x + y
print x - y
print x * y
print x / y

1.3 Parts to be implemented

In cells like the following example you are expected to implement some code. The remainder of the tutorial won't work if you skip these.

Sometimes assertions are added as a check.


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# three = ?
assert three == 3

2. Anomaly Detection

2.1 Load Data

First we will load the data, which is stored in pickle format. (We use import cPickle as pickle because cPickle is faster than the pure-Python pickle module.)

The data contains the page views of one of our own websites, and for convenience there is only one data point per hour.


In [ ]:
import cPickle as pickle

past = pickle.load(open('data/past_data.pickle', 'rb'))
all_data = pickle.load(open('data/all_data.pickle', 'rb'))

2.2 Plot past data

To plot the past data we will use matplotlib.pyplot. For convenience we import it as plt.

%matplotlib inline makes sure you can see the output in the notebook.

(Use %matplotlib notebook if you want to make the plot interactive. Don't forget to click the power button to finish the interaction so that you can plot a new figure.)


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(20,4)) # This creates a new figure with the dimensions of 20 by 4
plt.plot(past)             # This creates the actual plot
plt.show()                 # This shows the plot

2.3 Find the minimum and maximum

Use np.nanmax() and np.nanmin() to find the minimum and maximum while ignoring the NaNs.
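
If you haven't used these functions before, here is a quick illustration on a small made-up array (not the workshop data): the regular np.max() and np.min() return NaN as soon as a single NaN is present, while the nan-aware variants ignore it.


In [ ]:
import numpy as np

# A small example array containing a NaN
example = np.array([3., np.nan, 7., 1.])

print np.max(example), np.min(example)        # both print nan
print np.nanmax(example), np.nanmin(example)  # 7.0 and 1.0, the NaN is ignored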


In [ ]:
import numpy as np

In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# maximum = ?
# minimum = ?
print minimum, maximum

And plot these together with the data using the plt.axhline() function.


In [ ]:
plt.figure(figsize=(20,4))
plt.plot(past)
plt.axhline(maximum, color='r')
plt.axhline(minimum, color='r')
plt.show()

2.4 Testing the model on unseen data

Now plot all the data instead of just the past data.


In [ ]:
plt.figure(figsize=(20,4))
plt.plot(all_data, color='g')
plt.plot(past, color='b')
plt.axhline(maximum, color='r')
plt.axhline(minimum, color='r')
plt.show()

You can now clearly see that this model does not detect any anomalies, even though the last day of data clearly looks different from the other days.

In what follows we will build a better model for anomaly detection that is able to detect these 'shape shifts' as well.

2.5 Building a model with seasonality

To do this we are going to take a step-by-step approach. It may not be clear at first why every step is necessary, but that will become clear as we go.

First we are going to reshape the past data to a two-dimensional array with 24 columns. This will give us one row for each day and one column for each hour. For this we are going to use the np.reshape() function. The newshape parameter is a tuple, which in this case should be (-1, 24). If you pass -1 for a dimension, the reshape function will compute it automatically from the size of the data. Pay attention to the order in which the numbers are repositioned (the default ordering should work fine here).
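
To make the reshape concrete, here is a small made-up example that uses 4 'hours' per 'day' instead of 24 so the output stays readable; passing -1 lets numpy work out the number of rows itself, and the default ordering fills the rows one after the other.


In [ ]:
import numpy as np

# 2 'days' of 4 'hours' each, stored as one long series
toy_series = np.array([10, 12, 9, 11, 20, 22, 19, 21])

# -1 lets numpy compute the number of rows (here 8 / 4 = 2)
print np.reshape(toy_series, (-1, 4))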


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# reshaped_past = ?

assert len(reshaped_past.shape) == 2
assert reshaped_past.shape[1] == 24

Now we are going to compute the average over all days. For this we are going to use np.mean() with the axis parameter set to the first dimension (axis=0). Next we are going to plot this.
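
To see what axis=0 does, here is a tiny made-up example: the mean is taken over the rows (the days), so you get one value per column (per hour).


In [ ]:
import numpy as np

toy_days = np.array([[10., 12.,  9., 11.],
                     [20., 22., 19., 21.]])

# axis=0 averages over the rows, giving one mean per column
print np.mean(toy_days, axis=0)   # [ 15.  17.  14.  16.]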


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# average_past = 

assert average_past.shape == (24,)

plt.plot(average_past)
plt.show()

What you can see in the plot above is the average number of page views for each hour of the day.

Now let's plot this together with the past data in one plot. Use a for loop and the np.concatenate() function to concatenate this average 6 times into the variable model.
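
If you haven't used np.concatenate() before, here is a minimal made-up example of repeating a small pattern with a for loop; the exercise below uses the same idea, only with a different array and number of repetitions.


In [ ]:
import numpy as np

pattern = np.array([1., 2., 3.])

repeated = []
for i in range(3):
    # append the pattern to what has been built up so far
    repeated = np.concatenate((repeated, pattern))

print repeated   # [ 1.  2.  3.  1.  2.  3.  1.  2.  3.]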


In [ ]:
model = []
for i in range(6):
    ##### Implement this part of the code #####
    raise NotImplementedError()
    # model = np.concatenate( ? )

plt.figure(figsize=(20,4))    
plt.plot(model, color='k')
plt.plot(past, color='b')
plt.show()

In the next step we are going to compute the maximum (= positive) and minimum (= negative) deviations from the average to determine what kind of deviations are normal. (Just subtract the average/model from the past and take the min and the max of that.)
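
On made-up numbers the computation looks like this; the real deltas are computed in exactly the same way, only with past and model instead of these toy arrays.


In [ ]:
import numpy as np

toy_past  = np.array([ 9., 13., 20., 18.])
toy_model = np.array([10., 12., 21., 16.])

toy_delta = toy_past - toy_model
print toy_delta                              # [-1.  1. -1.  2.]
print np.min(toy_delta), np.max(toy_delta)   # -1.0 (largest negative) and 2.0 (largest positive deviation)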


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
print delta_min, delta_max

Now let's plot this.


In [ ]:
plt.figure(figsize=(20,4))
plt.plot(model, color='k')
plt.plot(past, color='b')
plt.plot(model + delta_max, color='r')
plt.plot(model + delta_min, color='r')
plt.show()

Now let's test this on all the data.


In [ ]:
model_all = np.concatenate((model, average_past))

plt.figure(figsize=(20,4))
plt.plot(all_data, color='g')
plt.plot(model_all, color='k')
plt.plot(past, color='b')
plt.plot(model_all + delta_max, color='r')
plt.plot(model_all + delta_min, color='r')
plt.show()

Now you can clearly see where the anomaly is detected by this more advanced model. The code below gives you the exact indices where an anomaly is detected. The functions used are np.where() and np.logical_or().


In [ ]:
anomaly_timepoints = np.where(np.logical_or(all_data < model_all + delta_min, all_data > model_all + delta_max))[0]

plt.figure(figsize=(20,4))
plt.scatter(anomaly_timepoints, all_data[anomaly_timepoints], color='r', linewidth=8)
plt.plot(all_data, color='g')
plt.plot(model_all, color='k')
plt.plot(past, color='b')
plt.plot(model_all + delta_max, color='r')
plt.plot(model_all + delta_min, color='r')
plt.xlim(0, len(all_data))
plt.show()

print 'The anomaly occurs at the following timestamps:', anomaly_timepoints

3. Modeling

It is often desirable to understand the relationship between different sources of information. As an example we'll consider the historical request rate of a web server and compare it to its CPU usage. We'll try to predict the CPU usage of the server based on the request rates of the different pages. First some imports:


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab
pylab.rcParams['figure.figsize'] = (13.0, 8.0)
%matplotlib inline

3.1 Data import and inspection

Pandas is a popular library for data wrangling; we'll use it to load and inspect a CSV file that contains the historical web request rates and CPU usage of a web server:


In [ ]:
data = pd.DataFrame.from_csv("data/request_rate_vs_CPU.csv")

The head() method allows one to quickly see the structure of the loaded data:


In [ ]:
data.head()

We can select the CPU column and plot the data:


In [ ]:
data.plot(figsize=(13,8), y="CPU")

Next we plot the request rates, leaving out the CPU column as it has a different unit:


In [ ]:
data.drop('CPU',1).plot(figsize=(13,8))

Now, to continue and start modeling the data, we'll work with plain numpy arrays. By doing this we also drop the time information shown in the plots above.

We extract the column labels as the request_names for later reference:


In [ ]:
request_names = data.drop('CPU',1).columns.values
request_names

We extract the request rates as a 2-dimensional numpy array:


In [ ]:
request_rates = data.drop('CPU',1).values

and the CPU usage as a one-dimensional numpy array:


In [ ]:
cpu = data['CPU'].values

3.2 Simple linear regression

First, we're going to work with the total request rate on the server, and compare it to the CPU usage. The numpy function np.sum can be used to calculate the total request rate by summing along the right axis (axis=1, i.e. across the columns of each row).
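
As a reminder of what axis=1 means, here is a small made-up 2-column array: the sum is taken across the columns of each row, so you get one total per row (per time step).


In [ ]:
import numpy as np

toy_rates = np.array([[1., 2.],
                      [3., 4.],
                      [5., 6.]])

# axis=1 sums across the columns, giving one total per row
print np.sum(toy_rates, axis=1)   # [  3.   7.  11.]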


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# total_request_rate =

assert total_request_rate.shape == (288,)

Let's plot the total request rate to check:


In [ ]:
plt.figure(figsize=(13,8))
plt.plot(total_request_rate)

We can use pyplot's scatter plot to understand the relation between the total request rate and the CPU usage:


In [ ]:
plt.figure(figsize=(13,8))
plt.xlabel("Total request rate")
plt.ylabel("CPU usage")
##### Implement this part of the code #####
raise NotImplementedError()
# plt.scatter( ? , ? )

There clearly is a strong correlation between the request rate and the CPU usage. Because of this correlation we can build a model to predict the CPU usage from the total request rate. If we use a linear model we get a formula like the following:

$$ \text{cpu} = c_0 + c_1 \cdot \text{total\_request\_rate} $$

Since we don't know the exact values for $c_0$ and $c_1$, we will have to compute them. For that we'll make use of the scikit-learn machine learning library for Python and use least-squares linear regression.


In [ ]:
from sklearn import linear_model
simple_lin_model = linear_model.LinearRegression()

Now we need to feed the data to the model to fit it. The model.fit(X,y) method in general takes an array X and vector y as arguments:

      X = [[x_11, x_12, x_13, ...],                  y = [y_1,
           [x_21, x_22, x_23, ...],                       y_2,  
           [x_31, x_32, x_33, ...],                       y_3,
           ...]                                           ...]

and tries to find coefficients that allow it to predict the y_i's from the x_ij's. In our case the matrix X will consist of only one column, containing the total request rates. Our total_request_rate variable, however, is still a one-dimensional vector, so we need to np.reshape it into a two-dimensional array:
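
On a toy vector the reshape we need looks like this; the same (-1, 1) shape argument works for total_request_rate.


In [ ]:
import numpy as np

toy_vector = np.array([4., 8., 15.])
toy_matrix = np.reshape(toy_vector, (-1, 1))

print toy_matrix.shape   # (3, 1): three rows, one column
print toy_matrix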


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# total_request_rate_M = 

# Test to see it's a two dimensional array
assert len(total_request_rate_M.shape) == 2
# Test to see it's got only 1 column
assert total_request_rate_M.shape[1] == 1

Then we fit our model using the total request rate and cpu. The coefficients found are automatically stored in the simple_lin_model object.


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# simple_lin_model.fit( ? , ? )

We can now inspect the coefficient $c_1$ and constant term (intercept) $c_0$ of the model:


In [ ]:
print "Coefficient = %s, constant term = %f" % (str(simple_lin_model.coef_), simple_lin_model.intercept_)

So this means that each additional request adds about 0.11% CPU load to the server and all the other processes running on the server consume on average 0.72% CPU.

Once the model is trained we can use it to predict the outcome for a given input (or array of inputs). Note that the predict function requires a 2-dimensional array similar to the fit function.
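
To illustrate the input shape that predict expects, here is a tiny made-up regression (y is exactly 2x + 1); note the double brackets: one row per sample, one column per feature.


In [ ]:
import numpy as np
from sklearn import linear_model

# Fit a toy model on three points of the line y = 2x + 1
toy_model = linear_model.LinearRegression()
toy_model.fit(np.array([[0.], [1.], [2.]]), np.array([1., 3., 5.]))

# predict() takes a two-dimensional array, just like fit()
print toy_model.predict([[10.]])   # approximately [ 21.]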

What is the expected CPU usage when we have 880 requests per second?


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# simple_lin_model.predict( [[ ? ]] )

Now we plot the linear model together with our data to verify it captures the relationship correctly (the predict method can accept the entire total_request_rate_M array at once).


In [ ]:
plt.figure(figsize=(13,8))

plt.scatter(total_request_rate, cpu,  color='black')
plt.plot(total_request_rate, simple_lin_model.predict(total_request_rate_M), color='blue', linewidth=3)

plt.xlabel("Total request rate")
plt.ylabel("CPU usage")

plt.show()

Our model can calculate a score indicating how well the linear model captures the data. A score of 1 means the linear model fits the data perfectly, while a score of 0 (or lower) means the model captures essentially none of the variation (and it does not make sense to model the data this way). The score method takes the same arguments as the fit method:


In [ ]:
simple_lin_model.score(total_request_rate_M, cpu)

3.3 Multiple linear regression

Now let us consider the separate request rates instead and build a linear model for that. The model we try to fit takes the form:

$$\text{cpu} = c_0 + c_1 \cdot \text{request\_rate}_1 + c_2 \cdot \text{request\_rate}_2 + \ldots + c_n \cdot \text{request\_rate}_n$$

where the $\text{request\_rate}_i$'s correspond to our different requests:


In [ ]:
print request_names

We start again by creating a LinearRegression model.


In [ ]:
multi_lin_model = linear_model.LinearRegression()

Next we fit the model on the data, using multi_lin_model.fit(X,y). In contrast to the case above our request_rates variable already has the correct shape to pass as the X matrix: it has one column per request type.


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# multi_lin_model.fit( ? , ? )

Now, given the coefficients calculated by the model, which capture the contribution of each request type to the total CPU usage, we can start to answer some interesting questions. For example, which request type causes the most CPU usage, on a per-visit basis?

For this we can generate a table of request names with their coefficients in descending order:


In [ ]:
# combine the requests and the output in a pandas data frame for easy printing
result_table = pd.DataFrame(zip(request_names, multi_lin_model.coef_), columns=['Request', 'Coef'])

# sort the results in descending order
result_table = result_table.sort_values(by='Coef',ascending=False)

# executing this as the last command returns a nice table
result_table

From this table we see that 'resources/js/basket.js' consumes the most CPU per request. It generates about 0.30% CPU load for each additional request. 'products/science.html' on the other hand is much leaner and only consumes about 0.04% CPU per request.

Now let us investigate the constant term again.


In [ ]:
print 'The other processes on the server consume %.2f%%' % multi_lin_model.intercept_

As you can see this term is very similar to the result achieved with the simple linear regression, but it is not exactly the same. This means that these models are not perfect. However, they seem to be able to give a reliable estimate.

3.4 Multiple linear regression 'advanced'

In the previous section we modeled how much load each individual request type generates. But in some cases you might want to move one of the request types to another server. Now, suppose we want to minimize the average CPU usage on this server by diverting the traffic of only one webpage to another server; which page should we choose?

For this we simulate diverting the traffic of one page to another server. This means that for the request that is diverted the rate becomes 0, while for the other requests we use the average rate.

We implement this by first calculating the average_request_rates using np.mean. These average_request_rates are then fed to the multi_lin_model.predict() method, each time with one individual request rate set to 0.

(For linear models you can also compute the result based on the coefficients, but this approach also works for non-linear models.)
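
The only numpy detail to watch out for here is making an independent copy before zeroing out one entry; a minimal made-up example:


In [ ]:
import numpy as np

rates = np.array([5., 3., 8.])

tweaked = np.copy(rates)   # an independent copy; modifying it leaves rates untouched
tweaked[1] = 0             # simulate diverting the traffic of request 1

print rates     # [ 5.  3.  8.]
print tweaked   # [ 5.  0.  8.]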


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# average_request_rates = 
assert average_request_rates.shape == (6,)

results = []

# Loop over all requests
for i in range(len(request_names)):
    # make a copy of the array to avoid overwriting
    tweaked_load = np.copy(average_request_rates)
    ##### Implement this part of the code #####
    raise NotImplementedError()
    # tweaked_load[ ? ] = ?
    # resulting_cpu = ?
    
    results.append( (request_names[i], 
                     multi_lin_model.coef_[i], 
                     average_request_rates[i], 
                     resulting_cpu))

# Now we store the results in a pandas dataframe for easier inspection.
mlin_df = pd.DataFrame(results, columns=['Diverted request', 'Coef', 'Rate', 'Predicted CPU'])
mlin_df = mlin_df.sort_values(by='Predicted CPU')
mlin_df

As you can see in the table above, it is best to divert the traffic of 'api/product/get.php'. (Why is the result different from the ordering in the coefficient table above?)

4. Forecasting

For the forecasting we are going to use page view data, very similar to the data used in the anomaly detection section: again there is one sample per hour.


In [ ]:
train_set = pickle.load(open('data/train_set_forecasting.pickle', 'rb'))

plt.figure(figsize=(20,4))
plt.plot(train_set)
plt.show()

In the graph above you can clearly see that there is a rising trend in the data.

4.1 One-step ahead prediction

This forecasting section describes one-step-ahead prediction. This means that we will only predict the next data point, which in this case is the number of page views in the next hour.

Now let's first build a model that tries to predict the next data point from the previous one.


In [ ]:
import sklearn
import sklearn.linear_model
import sklearn.gaussian_process

model = sklearn.linear_model.LinearRegression()

# the input X contains all the data except the last data point
X = train_set[ : -1].reshape((-1, 1)) # the reshape is necessary since sklearn requires a 2 dimensional array

# the output y contains all the data except the first data point
y = train_set[1 : ]

# this code fits the model on the train data
model.fit(X, y)

# this score gives you how well it fits on the train set
# higher is better and 1.0 is perfect
print 'The score of the linear model is', model.score(X, y)

As you can see from the score above, the model is not perfect, but it achieves a relatively high score. Now let's make a prediction into the future and plot it.

To predict the data point after that, we feed the predicted value back in as input for the next prediction. The code below shows how this works for this data set using the linear model you used earlier. Don't forget to fill out the missing code.


In [ ]:
nof_predictions = 100

import copy
# use the last data point as the first input for the predictions
x_test = copy.deepcopy(train_set[-1]) # make a copy to avoid overwriting the training data

prediction = []
for i in range(nof_predictions):
    # predict the next data point
    y_test = model.predict([[x_test]])[0] # sklearn requires a 2 dimensional array and returns a one-dimensional one
    
    ##### Implement this part of the code #####
    raise NotImplementedError()
    # prediction.append( ? )
    # x_test = ?

prediction = np.array(prediction)

plt.figure(figsize=(20,4))
plt.plot(np.concatenate((train_set, prediction)), 'g')
plt.plot(train_set, 'b')
plt.show()

As you can see from the image above the model doesn't quite seem to fit the data well. Let's see how we can improve this.

4.2 Multiple features

If your model is not smart enough, there is a simple trick in machine learning to make it more intelligent (but also more complex): add more features.

To make our model better we will use more than one sample from the past. To make your life easier there is a simple function below that will create the data set for you. The width parameter sets the number of hours in the past that will be used.


In [ ]:
def convert_time_series_to_Xy(ts, width):
    X, y = [], []
    for i in range(len(ts) - width - 1):
        X.append(ts[i : i + width])
        y.append(ts[i + width])
    return np.array(X), np.array(y)

In [ ]:
width = 5
X, y = convert_time_series_to_Xy(train_set, width)

print X.shape, y.shape

As you can see from the print above, both X and y contain 303 data points. X now has 5 features, which contain the page views of the 5 previous hours.

So let's have a look at what increasing from 1 to 5 features results in.


In [ ]:
width = 5
X, y = convert_time_series_to_Xy(train_set, width)
model = sklearn.linear_model.LinearRegression()
model.fit(X,y)
print 'The score of the linear model with width =', width, 'is', model.score(X, y)

Now change the width parameter to see if you can get a better score.

4.3 Over-fitting

Now execute the code below to see the prediction of this model.


In [ ]:
import copy

# this is a helper function to make the predictions
def predict(model, train_set, width, nof_points):
    prediction = []
    # create the input data set for the first predicted output
    # copy the data to make sure the original is not overwritten
    x_test = copy.deepcopy(train_set[-width : ]) 
    for i in range(nof_points):
        # predict only the next data point
        prediction.append(model.predict(x_test.reshape((1, -1))))
        # use the newly predicted data point as input for the next prediction
        x_test[0 : -1] = x_test[1 : ]
        x_test[-1] = prediction[-1]
    return np.array(prediction)

In [ ]:
nof_predictions = 200
prediction = predict(model, train_set, width, nof_predictions)

plt.figure(figsize=(20,4))
plt.plot(np.concatenate((train_set, prediction[:,0])), 'g')
plt.plot(train_set, 'b')
plt.show()

As you can see in the image above, the prediction is not what you would expect from a good model. What happened is that the model learned the training data by heart without 'understanding' what the data is really about. This phenomenon is called over-fitting, and it becomes more likely the more complex you make your model.

Now play with the width variable below to see if you can find a more sensible width.


In [ ]:
width = 1 #find a better width
X, y = convert_time_series_to_Xy(train_set, width)
model = sklearn.linear_model.LinearRegression()
model.fit(X,y)
print 'The score of the linear model with width =', width, 'is', model.score(X, y)

prediction = predict(model, train_set, width, 200)

plt.figure(figsize=(20,4))
plt.plot(np.concatenate((train_set, prediction[:,0])), 'g')
plt.plot(train_set, 'b')
plt.show()

As you will have noticed by now, a less-than-perfect training score can give you a much better outcome. Now try the same thing for the following models:

  • sklearn.linear_model.RidgeCV()
  • sklearn.linear_model.LassoCV()
  • sklearn.gaussian_process.GaussianProcess()

The first two models also estimate the noise that is present in the data in order to avoid over-fitting. RidgeCV keeps the weights it finds small, but it won't set them to zero. LassoCV, on the other hand, will set several weights to 0. Inspect model.coef_ to see the actual coefficients that have been found.
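
As a small self-contained illustration (on synthetic data, not on the page view series) of how LassoCV drives some coefficients to exactly zero while RidgeCV only shrinks them:


In [ ]:
import numpy as np
import sklearn.linear_model

# Synthetic data: y depends only on the first of three features
np.random.seed(0)
X_toy = np.random.randn(100, 3)
y_toy = 2.0 * X_toy[:, 0] + 0.1 * np.random.randn(100)

ridge = sklearn.linear_model.RidgeCV().fit(X_toy, y_toy)
lasso = sklearn.linear_model.LassoCV().fit(X_toy, y_toy)

print 'RidgeCV coefficients:', ridge.coef_   # all shrunk, but non-zero
print 'LassoCV coefficients:', lasso.coef_   # the irrelevant ones are (close to) exactly 0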

GaussianProcess is a non-linear method. This makes the method a lot more complex, so it needs significantly fewer features to be able to learn the data by heart (and thus to over-fit). In many cases, however, this additional complexity allows it to capture the data better. Additionally it has the advantage that it can estimate confidence intervals, similar to the red lines used in the anomaly detection.

4.4 Automation

What we have done up to now is manually selecting the best outcome based on the test result. This can be considered cheating, because you have just created a self-fulfilling prophecy. Besides being cheating, it is also hard to find the exact width that gives the best result just by visual inspection. So we need a more objective approach.

To automate this process you can use a validation set. In this case we will use the last 48 hours of the training set to validate the score and select the best parameter value. This means that we will have to use a subset of the training set to fit the model.


In [ ]:
model_generators = [sklearn.linear_model.LinearRegression, sklearn.linear_model.RidgeCV,
                    sklearn.linear_model.LassoCV, sklearn.gaussian_process.GaussianProcess]
best_score = 0

##### Implement this part of the code #####
raise NotImplementedError()
# for model_gen in ? :
#    for width in range( ? , ? ): 
        X, y = convert_time_series_to_Xy(train_set, width)
        # train the model on everything except the last 48 hours
        X_train, y_train = X[ : -48, :], y[ : -48]
        # use the last 48 hours for validation
        X_val, y_val = X[-48 : ], y[-48 : ]
        
        ##### Implement this part of the code #####
        raise NotImplementedError()
        # model = 
        
        # there is a try except clause here because some models do not converge for some data
        try:
            ##### Implement this part of the code #####
            raise NotImplementedError()
            # model.fit( ? , ? )
            # this_score = ?
            
            if this_score > best_score:
                best_score = this_score
                best_model_gen = model_gen
                best_width = width
        except:
            pass

print best_model_gen().__class__, 'was selected as the best model with a width of', best_width, 'and a validation score of', best_score

If everything is correct, the LassoCV model was selected.

Now we are going to train this best model on all the data. In this way we use all the available data to build a model.


In [ ]:
##### Implement this part of the code #####
raise NotImplementedError()
# width = ?
# model = ?

X, y = convert_time_series_to_Xy(train_set, width)

##### Implement this part of the code #####
raise NotImplementedError()
# model.fit( ? , ? )

nof_predictions = 200
prediction = predict(model, train_set, width, nof_predictions)

plt.figure(figsize=(20,4))
plt.plot(np.concatenate((train_set, prediction[:,0])), 'g')
plt.plot(train_set, 'b')
plt.show()

Although the optimal result found here might not look the best visually, it is a far better result than the one you selected manually, simply because there was no cheating involved ;-).

Some additional info:

  • The noise level of RidgeCV and LassoCV is estimated by automatically performing training and validation within the method itself. This makes them much more robust against over-fitting. The actual method used is cross-validation, which is a better approach than what we do here because it repeats the training and validation multiple times for different training and validation sets. The parameter that is set by these methods is often called the regularization parameter in the literature and is well suited to avoid over-fitting.
  • Although sklearn supports estimating the noise level in Gaussian Processes, it is not implemented within the method itself. Newer versions of sklearn introduce a lot of changes to this method, so it may be integrated in the (near) future. If you want to implement this noise-level estimation yourself, you can use sklearn's cross-validation tools to set the alpha parameter in version 0.18 of sklearn. (The version used here is 0.17.)

5. Challenge

Details will be given near the end of the tutorial.


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: