Jupyter Notebook and NumPy introduction

Jupyter Notebook is widely used by data scientists who work in Python. It is loosely based on Mathematica and combines code, text and visual output on one page.

Basic Jupyter Notebook commands

Some relevant shortcuts:

  • SHIFT + ENTER executes one block of code, called a cell
  • TAB completes names; for a package, this works once the cell that imports it has been executed
  • SHIFT + TAB gives you extra information on what parameters a function takes
  • Repeating SHIFT + TAB multiple times gives you even more information

To get used to these shortcuts, try them out on the cell below.


In [ ]:
print('Hello world!')
# In Python 3, range() is lazy; wrap it in list() to see the values
print(list(range(5)))

Imports

In Python you need to import tools to be able to use them. In this workshop we will mainly use the numpy toolbox and you can import it like this:


In [ ]:
import numpy as np

Parts to be implemented

In cells like the following example you are expected to implement some code. The remainder of the tutorial won't work if you skip these.

Sometimes assertions are added as a check.


In [ ]:
# To proceed, implement the missing code and remove any 'raise NotImplementedError()' line

### BEGIN SOLUTION
three = 3
### END SOLUTION
# three = ?
assert three == 3
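
The assertion checks in these cells rely on Python's `assert` statement: it does nothing when its condition is true and raises an `AssertionError` otherwise. A minimal sketch of that behaviour (the names `value` and `failed` are illustrative and not part of the workshop):

```python
# assert passes silently when the condition holds
value = 3
assert value == 3

# When the condition is false, Python raises an AssertionError.
# It is caught here only to show the message without stopping the notebook.
failed = False
try:
    assert value == 4, "value should be 4"
except AssertionError as err:
    failed = True
    print("Assertion failed:", err)
```

In the exercise cells the error is not caught, so a failing check stops the cell and points at the line to fix.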

Numpy arrays

We'll often be working with numpy arrays, so here's a short introduction. np.array() creates a new array.


In [ ]:
import numpy as np

# This is a two-dimensional numpy array:
arr = np.array([[1,2,3,4],[5,6,7,8]])
print(arr)

# The shape is a tuple describing the size of each dimension
print("shape=" + str(arr.shape))

# Elements are selected by specifying two indices, counting from 0
print("arr[1][3] = %d" % arr[1][3])

print()
# This is a three-dimensional numpy array
arr3 = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12], [13, 14, 15, 16]]])
print(arr3)

print()
print("shape=" + str(arr3.shape))

# Elements in a three dimensional array are selected by specifying three indices, counting from 0
print(arr3[1][0][2])
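
As a small addition to the indexing shown above, numpy also accepts all indices inside a single pair of brackets, separated by commas. The sketch below assumes the same `arr` as in the cell above; the names `middle` and `row` are illustrative only.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# arr[1][3] first selects row 1, then element 3 of that row.
# arr[1, 3] selects the same element in a single indexing step.
print(arr[1][3], arr[1, 3])

# Slices select ranges: all rows, columns 1 and 2
middle = arr[:, 1:3]
print(middle)

# A single index on a two-dimensional array returns a whole row
row = arr[0]
print(row)
```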

In [ ]:
# The numpy reshape method allows one to change the shape of an array, while keeping the underlying data.
# One can leave one dimension unspecified by passing -1, it will be determined from the size of the data.

print("Original array:")
print(arr)

print()
print("As 4x2 matrix")
print(np.reshape(arr, (4,2)))

print()
print("As 8x1 matrix")
print(np.reshape(arr, (-1,1)))

print()
print("As 2x2x2 array")
print(np.reshape(arr, (2,2,-1)))
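
To make the role of -1 more concrete, here is a short sketch that prints the resulting shapes instead of the data, and shows what happens when the requested shape does not fit the number of elements. It assumes the same `arr` as above; `flat` is an illustrative name.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# -1 is replaced by whatever size makes the total element count match (8 here)
print(np.reshape(arr, (-1, 2)).shape)
print(np.reshape(arr, (2, 2, -1)).shape)

# Passing just -1 flattens the array to one dimension
flat = np.reshape(arr, -1)
print(flat)

# A shape that cannot hold exactly 8 elements raises a ValueError
try:
    np.reshape(arr, (3, -1))
except ValueError as err:
    print("Cannot reshape:", err)
```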

In [ ]:
# The numpy sum, mean, min and max functions can be used to calculate aggregates across any axis
table = np.array([[10.9, 12.1, 15.2, 7.3], [3.9, 1.2, 34.6, 8.3], [1.9, 23.3, 1.2, 3.7]])
print(table)

# Calculating the maximum across the first axis (=0).
max0 = np.max(table, axis=0)
# Calculating the maximum across the second axis (=1).
max1 = np.max(table, axis=1)
# Calculating the overall maximum.
max_overall = np.max(table)

print("Maximum of each column (taken over the rows, axis=0) = " + str(max0))
print("Maximum of each row (taken over the columns, axis=1) = " + str(max1))
print("Overall maximum of the table = " + str(max_overall))
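
The other aggregates mentioned above take the same `axis` argument. The following sketch reuses the `table` values from the cell above; the result names (`col_sum`, `row_mean`, `overall_min`) are illustrative.

```python
import numpy as np

table = np.array([[10.9, 12.1, 15.2, 7.3],
                  [3.9, 1.2, 34.6, 8.3],
                  [1.9, 23.3, 1.2, 3.7]])

# axis=0 collapses the rows: one result per column (4 values)
col_sum = np.sum(table, axis=0)
# axis=1 collapses the columns: one result per row (3 values)
row_mean = np.mean(table, axis=1)
# Without an axis, the aggregate is taken over all elements
overall_min = np.min(table)

print(col_sum.shape, row_mean.shape)
print(overall_min)
```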

Basic arithmetical operations on arrays of the same shape are done elementwise:


In [ ]:
x = np.array([1.,2.,3.])
y = np.array([4.,5.,6.])

print(x + y)
print(x - y)
print(x * y)
print(x / y)
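
Because `*` works elementwise, it is not the dot product. The sketch below contrasts the two and shows the error numpy raises when the shapes are incompatible. It assumes the same `x` and `y` as above; `z`, `elementwise` and `dot` are illustrative names.

```python
import numpy as np

x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])

elementwise = x * y      # multiplies matching positions
dot = np.dot(x, y)       # sum of the elementwise products
print(elementwise)
print(dot)

# Arrays whose shapes are incompatible raise a ValueError
z = np.array([1., 2.])
try:
    x + z
except ValueError as err:
    print("Shapes do not match:", err)
```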

Data plotting

We use matplotlib.pyplot to make various types of plots.


In [ ]:
import matplotlib.pyplot as plt

A basic plot of a list of values:


In [ ]:
plt.plot([7.3, 8.2, 1.2, 3.2, 9.1, 1.5])
plt.show()

Plotting a list of X and Y values:


In [ ]:
plt.plot([100.0, 110.0, 120.0, 130.0, 140.0, 150.0],[2.3, 8.1, 9.3, 9.7, 9.8, 20.0])
plt.show()

Using mathematical functions and plotting more than one line on a graph:


In [ ]:
x = np.arange(0.0, 10.0, 0.1)

# Most numpy functions work on arrays as well, by applying the function to each element in turn
y_cos = np.cos(x)
y_sin = np.sin(x)
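
To illustrate what "applying the function to each element" means, the sketch below compares `np.cos` on an array with an explicit Python loop over the same values. The names `xs`, `ys_vec` and `ys_loop` are illustrative and chosen so they do not overwrite the variables used in the plot.

```python
import math
import numpy as np

xs = np.arange(0.0, 1.0, 0.25)   # 0.0, 0.25, 0.5, 0.75

# One call on the whole array ...
ys_vec = np.cos(xs)
# ... gives the same values as calling math.cos on each element in turn
ys_loop = np.array([math.cos(v) for v in xs])

print(ys_vec)
print(np.allclose(ys_vec, ys_loop))
```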

In [ ]:
plt.plot(x, y_cos, x, y_sin)
plt.show()
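
When a graph contains more than one line, labels and a legend help tell them apart. The following is one possible way to do this, using matplotlib's `subplots`, `set_xlabel`, `set_ylabel` and `legend`; the label texts are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 10.0, 0.1)

# Create a figure with a single set of axes
fig, ax = plt.subplots()

# The label of each line is used by the legend
ax.plot(x, np.cos(x), label="cos(x)")
ax.plot(x, np.sin(x), label="sin(x)")

ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
plt.show()
```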

Data import and inspection (optional)

Pandas is a popular library for data wrangling. We'll use it to load and inspect a CSV file that contains the historical web request rates and CPU usage of a web server:


In [ ]:
import pandas as pd

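# DataFrame.from_csv has been removed from pandas; read_csv with these
# arguments uses the first column as a parsed date index, as from_csv did
data = pd.read_csv("data/request_rate_vs_CPU.csv", index_col=0, parse_dates=True)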

The head() method allows one to quickly see the structure of the loaded data:


In [ ]:
data.head()

We can select the CPU column and plot the data:


In [ ]:
data.plot(figsize=(13,8), y="CPU")

To show the plot, we call the show() function of matplotlib.pyplot, which was imported above as plt.


In [ ]:
plt.show()

Next we plot the request rates, leaving out the CPU column because it has a different unit:


In [ ]:
data.drop(columns='CPU').plot(figsize=(13,8))
plt.show()

To continue and start modelling the data, we'll work with basic numpy arrays. Doing this also drops the time information shown in the plots above.

We extract the column labels as the request_names for later reference:


In [ ]:
request_names = data.drop(columns='CPU').columns.values
request_names

We extract the request rates as a 2-dimensional numpy array:


In [ ]:
request_rates = data.drop(columns='CPU').values
request_rates

And the CPU usage as a one-dimensional numpy array:


In [ ]:
cpu = data['CPU'].values
cpu
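
The extraction steps above can be tried without the data file. The sketch below builds a tiny made-up table in memory; the timestamps, the values and the column names `requests_a` and `requests_b` are hypothetical and do not come from the workshop data. Only the `CPU` column name is taken from the text.

```python
import io
import pandas as pd

# Made-up sample data; the request column names are hypothetical
csv_text = """timestamp,requests_a,requests_b,CPU
2017-01-01 00:00:00,10,3,21.5
2017-01-01 00:01:00,12,4,25.0
2017-01-01 00:02:00,9,2,19.0
"""

# Read the text as if it were a CSV file, using the first column as the index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Same three steps as above: labels, request rates, CPU usage
sample_names = sample.drop(columns="CPU").columns.values
sample_rates = sample.drop(columns="CPU").values
sample_cpu = sample["CPU"].values

print(sample_names)
print(sample_rates.shape)   # rows x request types
print(sample_cpu.shape)     # one CPU value per row
```

The request rates form a two-dimensional array with one row per timestamp, and the CPU array has one entry for each of those rows.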
