In this lab we will introduce some of the modules that we will use in the rest of the labs of the course.
The usual beginning of any Python script is a list of import statements. In most of our files we will use the following modules:
In [ ]:
%matplotlib inline
# The line above is needed to display the figures inside this notebook; you can remove it if you work with a normal script
import numpy as np
import csv
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split # note: in scikit-learn >= 0.18 this lives in sklearn.model_selection
In [ ]:
my_array = np.array([[1, 2],[3, 4]])
print my_array
print np.shape(my_array)
Define a new 2x3 array named my_array2 with [1, 2, 3] in the first row and [4, 5, 6] in the second. Check the dimensions of the array.
In [ ]:
my_array2 = np.array([[1, 2, 3],[4, 5, 6]])
print my_array2
print np.shape(my_array2)
Until now we have created arrays by listing their elements explicitly, but you can also create them by defining a range:
In [ ]:
my_new_array = np.arange(3,11,2)
print my_new_array
Check the functions np.linspace, np.logspace and np.meshgrid, which let you create more sophisticated ranges.
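For instance, a quick sketch of what these functions produce:
In [ ]:
print np.linspace(0, 1, 5)   # 5 evenly spaced values between 0 and 1
print np.logspace(0, 2, 3)   # 3 logarithmically spaced values: 1, 10, 100
xx, yy = np.meshgrid(np.arange(3), np.arange(2))  # coordinate matrices from coordinate vectors
print xx
print yy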
You can create numpy arrays in several ways. For example, numpy provides a number of functions to create special types of matrices.
Create 3 arrays using ones, zeros and eye. If you have any doubt about the parameters of these functions, have a look at their documentation with help().
In [ ]:
A1 = np.zeros((3,4))
print A1
A2 = np.ones((2,6))
print A2
A3 = np.eye(5)
print A3
In [ ]:
a = np.array([0,1,2,3,4,5])
print a*2
print a**2
Compare this with operations over python lists:
In [ ]:
[1,2,3,4,5]*2 # "multiplying" a list repeats it instead of scaling its elements
There are several operations you can do with numpy arrays, similar to the ones you can do with matrices in Matlab. One of the most important is slicing (we saw it when we talked about lists), which consists of extracting a subarray from an array.
In [ ]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print x[1:7:2] # start:stop:step
print x[-2:10] # confusing, avoid negative values...
print x[8:10] # equivalent
print x[-3:3:-1] # confusing, avoid negative values...
print x[7:3:-1] # equivalent
print x[:7] # when start value is not indicated, it takes the first
print x[5:] # when stop value is not indicated, it takes the last
print x[:] # select "from first to last" == "all"
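Be aware that basic slicing returns a view of the original array, not a copy, so modifying the slice modifies the original. A quick illustration (the names v and w are just for this example):
In [ ]:
v = x[:3]   # v is a view into x, not a copy
v[0] = 100  # this also changes x[0]
print x
w = x[:3].copy() # use copy() when you need an independent subarray
print w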
One important thing to consider when you slice is the shape of the output array. Run the following cell and check the shape of my_array3. Check also its number of dimensions with the ndim attribute:
In [ ]:
my_array = np.array([[1, 2],[3, 4]])
my_array3 = my_array[:,1]
print my_array3
print my_array[1,0:2]
In [ ]:
print my_array3.shape
print my_array3.ndim
If you have computed it correctly you will see that my_array3 is one-dimensional. This can be a problem when you are working with 2D matrices (and vectors can be considered 2D matrices with one of the sizes equal to 1). To solve it, numpy provides the newaxis constant.
In [ ]:
my_array3 = my_array3[:,np.newaxis]
Check again the shape and dimension of my_array3
In [ ]:
print my_array3.shape
print my_array3.ndim
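An equivalent way to obtain a column vector, shown here only as an alternative, is the reshape method (standard numpy):
In [ ]:
my_array4 = my_array[:, 1].reshape(-1, 1)  # -1 lets numpy infer the number of rows
print my_array4.shape
print my_array4.ndim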
When you index different rows and columns of a matrix you have to specify the indices element by element. For example, to select the elements at rows [0, 3] and columns [0, 2], we have to repeat each row index once per selected column...
In [ ]:
x = np.array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]])
# We want to select elements of rows [0, 3] and columns [0, 2]
rows = np.array([[0, 0],[3, 3]], dtype=np.intp)
columns = np.array([[0, 2],[0, 2]], dtype=np.intp)
print x[rows, columns]
To make this easier, we can use the ix_ function, which automatically builds all the needed index arrays.
In [ ]:
# With ix_
rows = np.array([0, 3], dtype=np.intp)
columns = np.array([0, 2], dtype=np.intp)
print np.ix_(rows, columns)
print x[np.ix_(rows, columns)]
Another important array manipulation is concatenation, or stacking. It is good practice to state explicitly along which axis we want to stack the arrays. For example, in the following cell we stack the arrays horizontally (column-wise).
In [ ]:
my_array = np.array([[1, 2],[3, 4]])
my_array2 = np.array([[11, 12],[13, 14]])
print np.concatenate( (my_array, my_array2) , axis=1) # columnwise concatenation
In [ ]:
print <COMPLETAR>
In [ ]:
print <COMPLETAR>
Finally, numpy provides all the basic matrix operations: multiplications, dot products, ... You can find more information about them in the NumPy manual.
In [ ]:
x = np.array([1,2,3])
y = np.array([1,2,3])
print x*y # element-wise product
print np.multiply(x,y) # element-wise product
print sum(x*y) # dot product
print np.dot(x,y) # fast dot (matrix) product
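The same distinction holds for 2D arrays, where * is still element-wise and np.dot is the true matrix product; a small illustration:
In [ ]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
print A*B          # element-wise product
print np.dot(A, B) # matrix product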
In [ ]:
x=[1,2,3]
dot_product_x = <COMPLETAR>
print dot_product_x
Numpy also provides functions to search, sort and summarize arrays. For example, np.where and np.nonzero return the indices of the elements that satisfy a condition:
In [ ]:
x = np.array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]])
print x
print np.where(x>4)   # (row indices, column indices) of the elements > 4
print np.nonzero(x>4) # equivalent
In [ ]:
print a.argmax(axis=0) # index of the maximum
print a.max(axis=0)    # value of the maximum
# a.min(axis=0) and a.argmin(axis=0) work analogously
In [ ]:
a = np.array([[1,4], [3,1]])
print a
a.sort(axis=1) # sort each row in place
print a
a.sort(axis=0) # sort each column in place
b = a          # note: b is just another name for the same array, not a copy
print b
In [ ]:
x = np.array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]])
print x.mean(axis=0) # mean of each column
print x.var(axis=0)  # variance of each column
print x.std(axis=0)  # standard deviation of each column
In [ ]:
np.random.seed(0) # fix the seed so the results are reproducible
perm = np.random.permutation(100) # a random ordering of the integers 0..99
perm[:10]
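A permutation like this is typically combined with fancy indexing to shuffle the rows of a dataset (we will need this idea later when splitting data into train and test); a minimal sketch with made-up data:
In [ ]:
toy_data = np.arange(20).reshape(10, 2) # 10 instances with 2 features each
perm10 = np.random.permutation(10)      # a random ordering of the 10 rows
print toy_data[perm10]                  # the same rows, shuffled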
In addition to numpy we have scipy, a more advanced library for scientific computing that includes modules for linear algebra, signal processing, Fourier transforms, ... For plotting we use matplotlib; the following cell draws two sinusoids in separate subplots.
In [ ]:
t = np.arange(0.0, 1.0, 0.05)
a1 = np.sin(2*np.pi*t)
a2 = np.sin(4*np.pi*t)
plt.figure()
ax1 = plt.subplot(211)
ax1.plot(t,a1)
plt.xlabel('t')
plt.ylabel('a_1(t)')
ax2 = plt.subplot(212)
ax2.plot(t,a2, 'r.')
plt.xlabel('t')
plt.ylabel('a_2(t)')
plt.show()
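As a small taste of scipy, the following sketch uses scipy.fftpack to look at the spectrum of the first sinusoid (this cell is only an illustration; the rest of the lab does not depend on it):
In [ ]:
from scipy import fftpack
A1 = fftpack.fft(a1)                     # discrete Fourier transform of a_1(t)
freqs = fftpack.fftfreq(len(a1), d=0.05) # frequency bins; d is the sampling period used above
plt.figure()
plt.stem(freqs[:len(a1)//2], np.abs(A1)[:len(a1)//2]) # positive frequencies only
plt.xlabel('f')
plt.ylabel('|A_1(f)|')
plt.show()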
One of the main machine learning problems is classification. In the following example we will load and visualize a dataset that can be used in a classification problem.
The iris dataset is probably the most popular pattern recognition dataset. It consists of 150 instances, each described by 4 features of an iris flower: sepal length, sepal width, petal length and petal width (in cm).
The objective is usually to distinguish three different classes of iris plant: Iris setosa, Iris versicolor and Iris virginica.
We give you the data in .csv format. Each line of the csv file contains the 4 real-valued features of one instance followed by a string defining its class: Iris-setosa, Iris-versicolor or Iris-virginica. There are 150 instances of flowers (lines) in the csv file.
Let's see how we can load the data into an array.
In [ ]:
# Open the csv file into a Python object
csv_file_object = csv.reader(open('data/iris_data.csv', 'rb'))
datalist = [] # create a list called 'datalist'
for row in csv_file_object: # run through each row in the csv file,
    datalist.append(row)    # adding each row to the list
data = np.array(datalist)   # then convert from a list to an array
                            # (be aware that each item is currently a string)
print np.shape(data)
X = data[:,0:-1]
label = data[:,-1,np.newaxis]
print X.shape
print label.shape
In the previous code we saved the features in the matrix X and the class labels in the vector label; both are 2D numpy arrays. We also print the shape of each variable (note that we can use array_name.shape as well as the function np.shape()). Checking shapes like this is a cheap way to catch mistakes.
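As the comment in the loading cell warns, every entry of data (and hence of X) is still a string; before plotting or computing with the features we cast them to floats:
In [ ]:
X = X.astype(np.float) # cast the string entries to floats
print X[0]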
In [ ]:
x = X[:,0:2] # keep only the first two features (sepal length and width)
list_label = [l[0] for l in label]
labels = list(set(list_label)) # the three class names
colors = ['bo', 'ro', 'go']
plt.figure()
for i, l in enumerate(labels):
    pos = np.where(np.array(list_label) == l)
    plt.plot(x[pos,0], x[pos,1], colors[i])
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
According to this plot, which classes seem more difficult to distinguish?
Now that we know how to load some data and visualize it, we will try to solve a simple regression task.
Our objective in this example is to predict the crime rates in different areas of the US using some socio-demographic data.
This dataset has 127 socioeconomic variables of different natures: categorical, integer and real; for some of them there are also missing values (check Wikipedia). Missing data is usually a problem when training machine learning models, but here we will ignore it and take only a small number of variables that we think can be useful for regression and that have no missing values.
The target in this regression problem is another real value: the total number of violent crimes per 100K population.
First of all, load the data from file communities.csv in a new array. This array should have 1994 rows (instances) and 128 columns.
In [ ]:
csv_file_object = csv.reader(open('communities.csv', 'rb'))
datalist = []
for row in csv_file_object:
    datalist.append(row)
data = np.array(datalist)
print np.shape(data)
EXERCISE: Take columns (5, 6, 17) of the data and save them in a matrix X_com; this will be our input data. Convert the array into a float array. Its shape should be (1994, 3).
EXERCISE: Get the last column of the data and save it in an array called y_com. Convert it into a float array and check that its shape is (1994, 1).
In [ ]:
X_com = <COMPLETAR>
Nrow = np.shape(data)[0]
Ncol = np.shape(data)[1]
print X_com.shape
y_com = <COMPLETAR>
print y_com.shape
EXERCISE: Plot each variable in X_com versus y_com to have a first (partial) view of the data.
In [ ]:
plt.figure()
plt.plot(<COMPLETAR>, 'bo')
plt.xlabel('X_com[0]')
plt.ylabel('y_com')
plt.figure()
plt.plot(<COMPLETAR>, 'ro')
plt.xlabel('X_com[1]')
plt.ylabel('y_com')
plt.figure()
plt.plot(<COMPLETAR>, 'go')
plt.xlabel('X_com[2]')
plt.ylabel('y_com')
Now we are about to start doing machine learning, but first we have to split our data into train and test sets.
The train data will be used to fit the parameters of our model (training). The test data will be used to evaluate the model.
EXERCISE: Use sklearn.cross_validation.train_test_split to split the data into train (60%) and test (40%) sets. Save the results in variables named X_train, X_test, y_train, y_test.
In [ ]:
from sklearn.cross_validation import train_test_split # sklearn.model_selection in scikit-learn >= 0.18
Random_state = 131
X_train, X_test, y_train, y_test = train_test_split(<COMPLETAR>, <COMPLETAR>, test_size=<COMPLETAR>, random_state=Random_state)
print X_train.shape
print X_test.shape
print y_train.shape
print y_test.shape
Most machine learning algorithms require the data to be standardized (mean = 0, standard deviation = 1). Scikit-learn provides a tool for this in the class sklearn.preprocessing.StandardScaler.
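Standardization simply subtracts the mean and divides by the standard deviation of each feature, with both statistics estimated on the training data:
$$x_{\text{norm}} = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}$$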
EXERCISE: Compute and print the mean and standard deviation of the data. Then normalize the data, such that it has zero mean and unit standard deviation, and check the results.
Values before normalizing:
[ 0.06044314 0.46025084 0.36419732]
[ 0.0533208 0.46810777 0.35651629]
[ 0.13651131 0.16684793 0.21110026]
[ 0.11073518 0.15868603 0.20651214]
Values after normalizing:
[ -6.99180587e-16 -2.18145828e-17 1.69596778e-15]
[-0.052174 0.04709039 -0.03638571]
[ 1. 1. 1.]
[ 0.81117952 0.95108182 0.97826567]
In [ ]:
print "Values before normalizing:\n"
print <COMPLETAR>.mean(axis=0)
print X_test.<COMPLETAR>
print <COMPLETAR>.std(axis=0)
print X_test.<COMPLETAR>
# from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(<COMPLETAR>) # computes mean and std using the train dataset
X_train_normalized = scaler.transform(<COMPLETAR>) # applies the normalization to train
X_test_normalized = scaler.transform(<COMPLETAR>) # applies the normalization to test
print "\nValues after normalizing:\n"
print <COMPLETAR>
print <COMPLETAR>
print <COMPLETAR>
print <COMPLETAR>
We will use two different K-NN regressors for this example: one with K (n_neighbors) = 1 and another with K = 7.
Read the KNeighborsRegressor API documentation and examples to understand how to fit the model.
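As a reminder of what a K-NN regressor computes (the prediction is the average target of the K closest training points), here is a minimal NumPy sketch, not the sklearn implementation:
In [ ]:
def knn_predict(X_tr, y_tr, x_query, K):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_tr - x_query)**2).sum(axis=1))
    nearest = np.argsort(dists)[:K] # indices of the K closest training points
    return y_tr[nearest].mean()     # average of their targets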
EXERCISE: Train the two models described above with default parameters.
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=7, p=2,
weights='uniform')
In [ ]:
from sklearn import neighbors
knn1_model = neighbors.KNeighborsRegressor(<COMPLETAR>)
knn1_model.fit(<COMPLETAR>.astype(np.float), <COMPLETAR>.astype(np.float))
knn7_model = neighbors.KNeighborsRegressor(<COMPLETAR>)
knn7_model.fit(<COMPLETAR>.astype(np.float), <COMPLETAR>.astype(np.float))
print knn1_model
print knn7_model
Now use the two models you have trained to predict the test output y_test. Then evaluate the predictions by measuring the Mean Square Error (MSE).
The formula of the MSE is
$$\text{MSE}=\frac{1}{K}\sum_{k=1}^{K}(\hat{y}_k-y_k)^2$$
where $K$ is the number of test instances and $\hat{y}_k$ is the prediction for the true value $y_k$.
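In numpy the MSE is a one-liner; a toy illustration (the names y_true and y_hat are just for this example):
In [ ]:
y_true = np.array([0.5, 0.2, 0.4])
y_hat = np.array([0.4, 0.3, 0.4])
print np.mean((y_hat - y_true)**2) # average of the squared errors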
The MSE value for model1 is 0.060090
The MSE value for model7 is 0.038202
First 5 prediction values with model 1:
[[ 0.51]
[ 0.17]
[ 0.46]
[ 0.2 ]
[ 0.34]]
First 5 prediction values with model 7:
[[ 0.40857143]
[ 0.21285714]
[ 0.27428571]
[ 0.32 ]
[ 0.36857143]]
In [ ]:
y_predict_1 = knn1_model.predict(<COMPLETAR>.astype(np.float))
mse1 = <COMPLETAR>
print " The MSE value for model1 is %f\n " % mse1
y_predict_7 = knn7_model.predict(<COMPLETAR>.astype(np.float))
mse7 = <COMPLETAR>
print " The MSE value for model7 is %f\n " % mse7
print "First 5 prediction values with model 1:\n"
print <COMPLETAR>
print "\nFirst 5 prediction values with model 7:\n"
print <COMPLETAR>
Finally, we save the predictions to a csv file, one per line together with its index:
In [ ]:
y_pred = y_predict_1.squeeze()
csv_file_object = csv.writer(open('output.csv', 'wb'))
for index, y_aux in enumerate(<COMPLETAR>): # run through each prediction,
    csv_file_object.writerow([index, y_aux]) # writing one row with its index and value