Jupyter Notebook is widely used by data scientists who work in Python. It is loosely based on Mathematica and combines code, text, and visual output on one page.
Some relevant shortcuts:
SHIFT + ENTER executes one block of code, called a cell
SHIFT + TAB gives you extra information on what parameters a function takes
SHIFT + TAB multiple times gives you even more information
To get used to these shortcuts, try them out on the cell below.
In [ ]:
print('Hello world!')
print(range(5))
In [ ]:
import numpy as np
In [ ]:
# To proceed, implement the missing code and remove the 'raise NotImplementedError()' line
##### Implement this part of the code #####
raise NotImplementedError()
# three = ?
assert three == 3
We'll often be working with numpy arrays, so here's a short introduction.
np.array() creates a new array.
In [ ]:
import numpy as np
# This is a two-dimensional numpy array:
arr = np.array([[1,2,3,4],[5,6,7,8]])
print(arr)
# The shape is a tuple describing the size of each dimension
print("shape=" + str(arr.shape))
# Elements are selected by specifying two indices, counting from 0
print("arr[1][3] = %d" % arr[1][3])
print()
# This is a three-dimensional numpy array
arr3 = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12], [13, 14, 15, 16]]])
print(arr3)
print()
print("shape=" + str(arr3.shape))
# Elements in a three dimensional array are selected by specifying three indices, counting from 0
print(arr3[1][0][2])
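Chained brackets such as arr[1][3] can also be written as a single comma-separated index. The sketch below redefines the arrays from the cell above so it runs on its own; the names elem2d, elem3d, second_row and last_column are only illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr3 = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
                 [[9, 10, 11, 12], [13, 14, 15, 16]]])

# A single comma-separated index selects the same element as chained brackets
elem2d = arr[1, 3]
elem3d = arr3[1, 0, 2]
print(elem2d, elem3d)

# A colon selects a whole axis: here the second row and the last column
second_row = arr[1, :]
last_column = arr[:, 3]
print(second_row)
print(last_column)
```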
In [ ]:
# The numpy reshape method allows one to change the shape of an array, while keeping the underlying data.
# One can leave one dimension unspecified by passing -1; it is then determined from the size of the data.
print("Original array:")
print(arr)
print()
print("As 4x2 matrix")
print(np.reshape(arr, (4,2)))
print()
print("As 8x1 matrix")
print(np.reshape(arr, (-1,1)))
print()
print("As 2x2x2 array")
print(np.reshape(arr, (2,2,-1)))
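As a small sketch of how the -1 placeholder behaves, the cell below checks the inferred shape and shows what happens when the requested shape does not fit the data. The names cube, flat and reshape_failed are illustrative only.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# -1 lets numpy infer the remaining dimension: 8 elements / 2 / 2 = 2
cube = np.reshape(arr, (2, 2, -1))
print(cube.shape)

# Flattening to one dimension keeps the elements in row-by-row order
flat = np.reshape(arr, -1)
print(flat)

# A shape whose total size does not match the 8 elements is rejected
try:
    np.reshape(arr, (3, 3))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```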
In [ ]:
# The numpy sum, mean, min and max functions can be used to calculate aggregates across any axis
table = np.array([[10.9, 12.1, 15.2, 7.3], [3.9, 1.2, 34.6, 8.3], [1.9, 23.3, 1.2, 3.7]])
print(table)
# Calculating the maximum across the first axis (=0).
max0 = np.max(table, axis=0)
# Calculating the maximum across the second axis (=1).
max1 = np.max(table, axis=1)
# Calculating the overall maximum.
max_overall = np.max(table)
print("Maximum of each column (axis=0) = " + str(max0))
print("Maximum of each row (axis=1) = " + str(max1))
print("Overall maximum of the table = " + str(max_overall))
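The other aggregates take the same axis argument. The following sketch reuses the table from above; col_sum, row_mean and row_min are illustrative names. It shows that axis=0 yields one value per column and axis=1 one value per row.

```python
import numpy as np

table = np.array([[10.9, 12.1, 15.2, 7.3],
                  [3.9, 1.2, 34.6, 8.3],
                  [1.9, 23.3, 1.2, 3.7]])

# axis=0 collapses the rows: one result per column (4 values)
col_sum = np.sum(table, axis=0)
# axis=1 collapses the columns: one result per row (3 values)
row_mean = np.mean(table, axis=1)
row_min = np.min(table, axis=1)

print(col_sum.shape, row_mean.shape)
print(row_min)
```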
Basic arithmetical operations on arrays of the same shape are done elementwise:
In [ ]:
x = np.array([1.,2.,3.])
y = np.array([4.,5.,6.])
print(x + y)
print(x - y)
print(x * y)
print(x / y)
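To make "elementwise" concrete, this sketch compares x * y with an explicit Python loop over the paired entries, and contrasts it with np.dot, which sums those pairwise products into a single number. The names product, looped and dot are illustrative.

```python
import numpy as np

x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])

# The elementwise product pairs up entries at the same position
product = x * y
looped = np.array([a * b for a, b in zip(x, y)])
print(product)
print(np.array_equal(product, looped))

# The dot product is a different operation: it sums the pairwise products
dot = np.dot(x, y)
print(dot)
```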
We use matplotlib.pyplot to make various types of plots:
In [ ]:
import matplotlib.pyplot as plt
A basic plot of a list of values:
In [ ]:
plt.plot([7.3, 8.2, 1.2, 3.2, 9.1, 1.5])
plt.show()
Plotting a list of X and Y values:
In [ ]:
plt.plot([100.0, 110.0, 120.0, 130.0, 140.0, 150.0],[2.3, 8.1, 9.3, 9.7, 9.8, 20.0])
plt.show()
Using mathematical functions and plotting more than one line on a graph:
In [ ]:
x = np.arange(0.0, 10.0, 0.1)
# Most numpy functions also work on arrays by applying the function to each element in turn
y_cos = np.cos(x)
y_sin = np.sin(x)
In [ ]:
plt.plot(x, y_cos, x, y_sin)
plt.show()
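The values being plotted can be checked without a figure. This sketch confirms that np.cos and np.sin return arrays of the same shape as their input, and that the result matches calling math.cos on each element in a loop. The names y_cos_loop and identity_holds are illustrative.

```python
import math
import numpy as np

x = np.arange(0.0, 10.0, 0.1)

# np.cos takes the whole array and returns an array of the same shape
y_cos = np.cos(x)
y_sin = np.sin(x)
print(x.shape == y_cos.shape == y_sin.shape)

# Same values as calling math.cos on each element in a Python loop
y_cos_loop = np.array([math.cos(v) for v in x])
print(np.allclose(y_cos, y_cos_loop))

# Sanity check using the identity sin^2 + cos^2 = 1 at every point
identity_holds = np.allclose(y_sin**2 + y_cos**2, 1.0)
print(identity_holds)
```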
Pandas is a popular library for data wrangling. We'll use it to load and inspect a CSV file that contains the historical web request rates and CPU usage of a web server:
In [ ]:
import pandas as pd
# Use the first column as the index and parse it as dates
data = pd.read_csv("data/request_rate_vs_CPU.csv", index_col=0, parse_dates=True)
The head() method allows one to quickly see the structure of the loaded data:
In [ ]:
data.head()
We can select the CPU column and plot the data:
In [ ]:
data.plot(figsize=(13,8), y="CPU")
To show the plot we call the show() function of matplotlib.pyplot, which we imported above as plt.
In [ ]:
plt.show()
Next we plot the request rates, leaving out the CPU column as it has a different unit:
In [ ]:
data.drop(columns='CPU').plot(figsize=(13,8))
plt.show()
To continue and start modelling the data, we'll work with basic numpy arrays. In doing so we drop the time information shown in the plots above.
We extract the column labels as the request_names for later reference:
In [ ]:
request_names = data.drop(columns='CPU').columns.values
request_names
We extract the request rates as a 2-dimensional numpy array:
In [ ]:
request_rates = data.drop(columns='CPU').values
request_rates
And the CPU usage as a one-dimensional numpy array:
In [ ]:
cpu = data['CPU'].values
cpu
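The extraction steps above can be tried on a tiny in-memory table. This is only a sketch: the CSV text, the column names login and search, and the demo_ variable names are hypothetical stand-ins, since the real file's columns are not shown here. It checks that the request rates and the CPU values end up with one entry per time step.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the real column names may differ
csv_text = """timestamp,login,search,CPU
2020-01-01 00:00:00,10,200,35.0
2020-01-01 00:01:00,12,180,33.5
2020-01-01 00:02:00,8,220,37.2
"""

demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

demo_names = demo.drop(columns="CPU").columns.values
demo_rates = demo.drop(columns="CPU").values
demo_cpu = demo["CPU"].values

# One row per time step, one column per request type
print(demo_names)
print(demo_rates.shape)
print(demo_cpu.shape)
```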
In [ ]: