Lab of data analysis with python

In this lab we will introduce some of the modules that we will use in the rest of the labs of the course.

The usual beginning of any python module is a list of import statements. In most of our files we will use the following modules:

  • numpy: The basic scientific computing library.
  • csv: Used for input/output using comma separated values files, one of the standard formats in data management.
  • matplotlib: Used for plotting figures and graphs.
  • sklearn: Scikit-learn is the machine learning library for python.

In [ ]:
%matplotlib inline  
# The line above is needed to include the figures in this notebook, you can remove it if you work with a normal script
    
import numpy as np
import csv
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # named sklearn.cross_validation before scikit-learn 0.18

 1. NUMPY

The numpy module is useful for scientific computing in Python.

 1.a Create numpy arrays

The main data structure in numpy is the n-dimensional array. You can define a numpy array from a list or a list of lists. Numpy will try to build it with the appropriate dimensions. You can check the dimensions of the array with shape().


In [ ]:
my_array = np.array([[1, 2],[3, 4]])
print my_array
print np.shape(my_array)

Define a new 2x3 array named my_array2 with [1, 2, 3] in the first row and [4, 5, 6] in the second. Check the dimensions of the array.


In [ ]:
my_array2 = np.array([[1, 2, 3],[4, 5, 6]])
print my_array2
print np.shape(my_array2)

Until now, we have created arrays by listing their elements. You can also create an array by defining a range with start, stop and step values:


In [ ]:
my_new_array = np.arange(3,11,2)
print my_new_array

Check the functions np.linspace, np.logspace and np.meshgrid, which let you create more sophisticated ranges.
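
As a quick orientation, the three functions can be sketched as follows. The variable names (lin, log, xx, yy) are chosen only for this illustration, and the print() calls take a single argument so that they behave the same under the Python 2 used in this lab and under Python 3.

```python
import numpy as np

# 5 evenly spaced values from 0 to 1, both endpoints included
lin = np.linspace(0.0, 1.0, 5)
print(lin)

# 4 values evenly spaced on a log scale, from 10**0 to 10**3
log = np.logspace(0, 3, 4)
print(log)

# Coordinate matrices of a 2-D grid built from two 1-D ranges
xx, yy = np.meshgrid(np.arange(3), np.arange(2))
print(xx)
print(yy)
```

Note that np.linspace takes the number of points, while np.arange takes the step between points.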

You can create numpy arrays in several ways. For example numpy provides a number of functions to create special types of matrices.

Create 3 arrays using ones, zeros and eye. If you have any doubt about the parameters of the functions, have a look at the documentation with the function help( ).


In [ ]:
A1 = np.zeros((3,4))
print A1
A2 = np.ones((2,6))
print A2
A3 = np.eye(5)
print A3

1.b Elementwise operations

One of the main advantages of numpy arrays is that operations are propagated to the individual elements of the array


In [ ]:
a = np.array([0,1,2,3,4,5])
print a*2
print a**2

Compare this with operations over python lists:


In [ ]:
[1,2,3,4,5]*2
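
The difference can be made explicit with a small side-by-side sketch. The names values, repeated, doubled_list and doubled_array are hypothetical and only serve this illustration.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

repeated = values * 2                    # list repetition: the list is concatenated with itself
doubled_list = [2 * v for v in values]   # element-wise doubling of a list needs an explicit loop
doubled_array = np.array(values) * 2     # numpy applies the product to every element

print(repeated)
print(doubled_list)
print(doubled_array)
```

For a list, the * operator means repetition, so the result has ten elements; for a numpy array it means multiplication, so the result keeps its five elements.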

1.c Indexing numpy arrays

There are several operations you can do with numpy arrays similar to the ones you can do with matrices in Matlab. One of the most important is slicing (we saw it when we talked about lists), which consists of extracting a subarray from the array.


In [ ]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print x[1:7:2] # start:stop:step

print x[-2:10] # confusing, avoid negative values... 
print x[8:10] # equivalent

print x[-3:3:-1] # confusing, avoid negative values...
print x[7:3:-1] # equivalent

print x[:7]  # when start value is not indicated, it takes the first
print x[5:]  # when stop value is not indicated, it takes the last
print x[:]   # select "from first to last" == "all"

One important thing to consider when you do slicing is the shape of the output array. Run the following cell and check the shape of my_array3. Check also its number of dimensions with the ndim attribute:


In [ ]:
my_array = np.array([[1, 2],[3, 4]])
my_array3 = my_array[:,1]
print my_array3
print my_array[1,0:2]

In [ ]:
print my_array3.shape
print my_array3.ndim

If you have computed it correctly you will see that my_array3 is one-dimensional. Sometimes this can be a problem when you are working with 2D matrices (and vectors can be considered as 2D matrices with one of the sizes equal to 1). To solve this, numpy provides the newaxis constant.


In [ ]:
my_array3 = my_array3[:,np.newaxis]

Check again the shape and dimension of my_array3


In [ ]:
print my_array3.shape
print my_array3.ndim

When you index a matrix with arrays of row and column indices, you have to give the position of every selected element. For example, to select the elements in rows [0, 3] and columns [0, 2], the row index has to be repeated for each selected column and the column index has to be repeated for each selected row:


In [ ]:
x = np.array([[ 0,  1,  2], [ 3,  4,  5], [ 6,  7,  8], [ 9, 10, 11]])

# We want to select elements of rows [0, 3] and columns [0, 2]
rows = np.array([[0, 0],[3, 3]], dtype=np.intp)
columns = np.array([[0, 2],[0, 2]], dtype=np.intp)

print x[rows, columns]

To make this easier, we can use the ix_ function which automatically creates all the needed indexes


In [ ]:
# With ix_
rows = np.array([0, 3], dtype=np.intp)
columns = np.array([0, 2], dtype=np.intp)

print np.ix_(rows, columns)
print x[np.ix_(rows, columns)]

1.d Concatenate numpy arrays

Another important array manipulation method is array concatenation or stacking. It is good practice to always state explicitly the direction in which we want to stack the arrays. In the following example we stack the arrays horizontally (column-wise, axis=1).


In [ ]:
my_array = np.array([[1, 2],[3, 4]])
my_array2 = np.array([[11, 12],[13, 14]])

print np.concatenate( (my_array, my_array2) , axis=1) # columnwise concatenation

EXERCISE: Concatenate the first column of my_array and the second column of my_array2

The answer should be:

[[ 1 12]
 [ 3 14]]

In [ ]:
print <COMPLETAR>

Numpy also includes the functions hstack() and vstack() to concatenate by columns or rows, respectively.

EXERCISE: Use these functions to concatenate my_array and my_array2 by rows and columns.

The answer should be:

[[ 1  2]
 [ 3  4]
 [11 12]
 [13 14]]
[[ 1  2 11 12]
 [ 3  4 13 14]]

In [ ]:
print <COMPLETAR>

1.e Matrix multiplications

Finally numpy provides all the basic matrix operations: multiplications, dot products, ... You can find information about them in the Numpy manual


In [ ]:
x=np.array([1,2,3])
y=np.array([1,2,3])
print x*y  #Element-wise
print np.multiply(x,y) #Element-wise
print sum(x*y) # dot product
print np.dot(x,y) #Fast matrix product
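
The cell above works with vectors only. For 2-D arrays the difference between the element-wise product and the matrix product is easier to see, as in this small sketch (A, B and v are hypothetical toy values):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
v = np.array([1, 2])

elementwise = A * B             # product of matching entries
matrix_product = np.dot(A, B)   # rows of A times columns of B
matrix_vector = A.dot(v)        # matrix-vector product, method form of np.dot

print(elementwise)
print(matrix_product)
print(matrix_vector)
```

Here B swaps the two columns of A under the matrix product, while the element-wise product simply zeroes the diagonal.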

EXERCISE: Try to compute the dot product of x with itself using a plain Python list instead of a numpy array:

The answer should be:

14

In [ ]:
x=[1,2,3]
dot_product_x = <COMPLETAR>
print dot_product_x

1.f Other useful functions

Some functions let you:

  • Find elements satisfying a condition

In [ ]:
x = np.array([[ 0,  1,  2], [ 3,  4,  5], [ 6,  7,  8], [ 9, 10, 11]])
print x
print np.where(x>4)
print np.nonzero(x>4)

  • Compute the maximum, the minimum, or even the positions of the maximum or minimum

In [ ]:
print a.argmax(axis=0) 
print a.max(axis=0) 
# a.min(axis=0), a.argmin(axis=0)

  • Sort an array along one of its axes

In [ ]:
a = np.array([[1,4], [3,1]])
print a
a.sort(axis=1)
print a
a.sort(axis=0)
b = a
print b

  • Calculate some statistical parameters

In [ ]:
x = np.array([[ 0,  1,  2], [ 3,  4,  5], [ 6,  7,  8], [ 9, 10, 11]])
print x.mean(axis=0)
print x.var(axis=0)
print x.std(axis=0)

  • Obtain random numbers

In [ ]:
np.random.seed(0)
perm = np.random.permutation(100)
perm[:10]
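
A common use of a random permutation is to shuffle the rows of a dataset while keeping inputs and targets aligned. The following is a minimal sketch with hypothetical toy arrays X_toy and y_toy:

```python
import numpy as np

np.random.seed(0)                      # fix the seed so the shuffle is repeatable

X_toy = np.arange(10).reshape(5, 2)    # 5 toy samples with 2 features each
y_toy = np.arange(5)                   # one toy target per sample

perm = np.random.permutation(5)        # random ordering of the row indices 0..4
X_shuffled = X_toy[perm]               # reorder the rows of X_toy
y_shuffled = y_toy[perm]               # apply the same ordering to the targets

print(perm)
print(X_shuffled)
print(y_shuffled)
```

Indexing both arrays with the same perm is what keeps each sample paired with its target.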

In addition to numpy, there is a more advanced library for scientific computing, scipy. Scipy includes modules for linear algebra, signal processing, Fourier transforms, ...

2. Matplotlib

One important step of data analysis is data visualization. In Python the simplest plotting library is matplotlib, and its syntax is similar to that of Matlab's plotting functions. In the next example we plot two sinusoids with different symbols.


In [ ]:
t = np.arange(0.0, 1.0, 0.05)
a1 = np.sin(2*np.pi*t)
a2 = np.sin(4*np.pi*t)


plt.figure()
ax1 = plt.subplot(211)
ax1.plot(t,a1)
plt.xlabel('t')
plt.ylabel('a_1(t)')
ax2 = plt.subplot(212)
ax2.plot(t,a2, 'r.')
plt.xlabel('t')
plt.ylabel('a_2(t)')
plt.show()

3. Classification example

One of the main machine learning problems is classification. In the following example we will load and visualize a dataset that can be used in a classification problem.

The iris dataset is one of the most popular pattern recognition datasets. It consists of 150 instances of iris flowers, each described by 4 features:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

The objective is usually to distinguish three different classes of iris plant: Iris setosa, Iris versicolor and Iris virginica.

3.1 Loading the data

We give you the data in .csv format. In each line of the csv file we have the 4 real-valued features of each instance and then a string defining the class of that instance: Iris-setosa, Iris-versicolor or Iris-virginica. There are 150 instances of flowers (lines) in the csv file.

Let's see how we can load the data into an array.


In [ ]:
# Open up the csv file into a Python object
csv_file_object = csv.reader(open('data/iris_data.csv', 'rb')) 
datalist = []                    # Create an empty list called 'datalist'.
for row in csv_file_object:      # Run through each row in the csv file,
    datalist.append(row)         # adding each row to the list


data = np.array(datalist)  # Then convert from a list to an array
                           # Be aware that each item is currently
                           # a string in this format
print np.shape(data)
X = data[:,0:-1]
label = data[:,-1,np.newaxis]
print X.shape
print label.shape

In the previous code we have saved the features in the matrix X and the class labels in the vector label. Both are 2D numpy arrays. We also print the shape of each variable (note that we can use array_name.shape to get the shape, as well as the function shape( )). Checking shapes like this is a good way to catch mistakes early.
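
Since csv.reader returns every field as a string, the features have to be converted before doing arithmetic or plotting. The sketch below shows one way to do it on two hand-typed rows in the same layout as the file; the in-memory sample and the names X_toy and label_toy are assumptions made for this illustration, and the lab file itself is not read.

```python
import csv
import io
import numpy as np

# Hypothetical in-memory sample with the same column layout as the lab file
sample = u"5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

rows = [row for row in csv.reader(io.StringIO(sample))]
data_toy = np.array(rows)                  # every entry is still a string

X_toy = data_toy[:, 0:-1].astype(float)    # features converted to floats
label_toy = data_toy[:, -1, np.newaxis]    # labels kept as a column of strings

print(X_toy)
print(X_toy.shape)
print(label_toy.shape)
```

The call to astype(float) fails loudly if a field is not numeric, which is a useful check on the file contents.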

 3.2 Visualizing the data

Extract the first two features of the data (sepal length and width) and plot the first versus the second in a figure, use a different color for the data corresponding to different classes.

First of all, you probably want to split the data according to each class label.


In [ ]:
x = X[:,0:2].astype(float)         # the csv values are strings: convert them to numbers
list_label = [l[0] for l in label]
labels = sorted(set(list_label))   # sorted so that each class always gets the same color
colors = ['bo', 'ro', 'go']
plt.figure()
for i, l in enumerate(labels):
    pos = np.array(list_label) == l    # boolean mask selecting the rows of class l
    plt.plot(x[pos,0], x[pos,1], colors[i], label=l)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()

According to this plot, which classes seem more difficult to distinguish?

4. Regression example

Now that we know how to load some data and visualize it we will try to solve a simple regression task.

Our objective in this example is to predict the crime rates in different areas of the US using some socio-demographic data.

This dataset has 127 socioeconomic variables of different kinds (categorical, integer, real), and some of them have missing data (check Wikipedia). Missing data is usually a problem when training machine learning models, but we will sidestep it by taking only a small number of variables that we think can be useful for regression and that have no missing values:

  • population: population for community
  • householdsize: mean people per household
  • medIncome: median household income

The target in the regression problem is another real value: the total number of violent crimes per 100K population.

4.1 Loading the data

First of all, load the data from file communities.csv in a new array. This array should have 1994 rows (instances) and 128 columns.


In [ ]:
csv_file_object = csv.reader(open('communities.csv', 'rb')) 
datalist = []                    
for row in csv_file_object:      
    datalist.append(row)         
data = np.array(datalist)  
print np.shape(data)

Take the columns (5,6,17) of the data and save them in a matrix X_com. This will be our input data. Convert this array into a float array. The shape should be (1994,3)

EXERCISE: Get the last column of the data and save it in an array called y_com. Convert this matrix into a float array. Check that the shape is (1994,1) .


In [ ]:
X_com = <COMPLETAR>
Nrow = np.shape(data)[0]
Ncol = np.shape(data)[1]
print X_com.shape

y_com = <COMPLETAR>
print y_com.shape

EXERCISE: Plot each variable in X_com versus y_com to have a first (partial) view of the data.


In [ ]:
plt.figure()
plt.plot(<COMPLETAR>, 'bo')
plt.xlabel('X_com[0]')
plt.ylabel('y_com')

plt.figure()
plt.plot(<COMPLETAR>, 'ro')
plt.xlabel('X_com[1]')
plt.ylabel('y_com')

plt.figure()
plt.plot(<COMPLETAR>, 'go')
plt.xlabel('X_com[2]')
plt.ylabel('y_com')

4.3 Train/Test splitting

Now we are about to start doing machine learning. But, first of all, we have to split our data into train and test sets.

The train data will be used to adjust the parameters of our model (train). The test data will be used to evaluate our model.
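
To make the idea concrete, a random 60/40 split can be sketched by hand with numpy. This is only an illustration of what a split does, not scikit-learn's implementation, and all names (X_toy, y_toy, X_tr, X_te, ...) are hypothetical toy variables.

```python
import numpy as np

np.random.seed(0)

X_toy = np.arange(20).reshape(10, 2)     # 10 toy samples, 2 features
y_toy = np.arange(10)[:, np.newaxis]     # 10 toy targets as a column

n_samples = X_toy.shape[0]
n_train = int(round(0.6 * n_samples))    # 60% of the samples go to train

perm = np.random.permutation(n_samples)  # random order of the row indices
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

print(X_tr.shape)
print(X_te.shape)
```

Every sample ends up in exactly one of the two sets, so the test data is never seen while adjusting the model.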

EXERCISE: Use sklearn.model_selection.train_test_split (sklearn.cross_validation.train_test_split in scikit-learn versions before 0.18) to split the data in train (60%) and test (40%). Save the results in variables named X_train, X_test, y_train, y_test.


In [ ]:
from sklearn.model_selection import train_test_split  # named sklearn.cross_validation before scikit-learn 0.18
Random_state = 131

X_train, X_test, y_train, y_test = train_test_split(<COMPLETAR>, <COMPLETAR>, test_size=<COMPLETAR>, random_state=Random_state)

print X_train.shape
print X_test.shape
print y_train.shape
print y_test.shape

4.4 Normalization

Many machine learning algorithms, including distance-based ones such as K-NN, work better when the data is standardized (mean = 0, standard deviation = 1). Scikit-learn provides a tool to do that in the object sklearn.preprocessing.StandardScaler.
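
The arithmetic behind standardization can be sketched directly with numpy. This is an illustration on hypothetical toy arrays (X_tr, X_te), not the library's code; the key point is that the mean and standard deviation are computed on the train data only and then reused for the test data.

```python
import numpy as np

X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # toy train data
X_te = np.array([[2.0, 40.0]])                              # toy test data

mean_tr = X_tr.mean(axis=0)     # per-feature mean, from train only
std_tr = X_tr.std(axis=0)       # per-feature standard deviation, from train only

X_tr_norm = (X_tr - mean_tr) / std_tr
X_te_norm = (X_te - mean_tr) / std_tr   # test reuses the train statistics

print(X_tr_norm.mean(axis=0))
print(X_tr_norm.std(axis=0))
print(X_te_norm)
```

Because the test set is scaled with train statistics, its own mean and standard deviation are generally close to, but not exactly, 0 and 1.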

EXERCISE: Compute and print the mean and standard deviation of the data. Then normalize the data, such that it has zero mean and unit standard deviation, and check the results.

The answer should be:

Values before normalizing:

[ 0.06044314  0.46025084  0.36419732]
[ 0.0533208   0.46810777  0.35651629]
[ 0.13651131  0.16684793  0.21110026]
[ 0.11073518  0.15868603  0.20651214]

Values after normalizing:

[ -6.99180587e-16  -2.18145828e-17   1.69596778e-15]
[-0.052174    0.04709039 -0.03638571]
[ 1.  1.  1.]
[ 0.81117952  0.95108182  0.97826567]

In [ ]:
print "Values before normalizing:\n"
print <COMPLETAR>.mean(axis=0)
print X_test.<COMPLETAR>
print <COMPLETAR>.std(axis=0)
print X_test.<COMPLETAR>

# from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(<COMPLETAR>)                # computes mean and std using the train dataset

X_train_normalized = scaler.transform(<COMPLETAR>)  # applies the normalization to train
X_test_normalized = scaler.transform(<COMPLETAR>)  # applies the normalization to test

print "\nValues after normalizing:\n"

print <COMPLETAR>
print <COMPLETAR>
print <COMPLETAR>
print <COMPLETAR>

 4.5 Training

We will use two different K-NN regressors for this example. One with K (n_neighbors) = 1 and the other with K=7.

Read the API and this example to understand how to fit the model.
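
To see what a K-NN regressor computes, here is a minimal numpy sketch, assuming Euclidean distance and a plain average of the neighbors' targets. The function knn_predict and the toy data are hypothetical and written only for this illustration; scikit-learn's KNeighborsRegressor is far more general and efficient.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                          # indices of the k closest points
    return y_train[nearest].mean()

X_tr = np.array([[0.0], [1.0], [2.0], [10.0]])   # toy 1-D inputs
y_tr = np.array([0.0, 1.0, 4.0, 10.0])           # toy targets

x_new = np.array([1.2])
pred_k1 = knn_predict(X_tr, y_tr, x_new, k=1)    # copies the target of the closest point
pred_k3 = knn_predict(X_tr, y_tr, x_new, k=3)    # averages the 3 closest targets
print(pred_k1)
print(pred_k3)
```

With K=1 the prediction copies a single training target, while larger values of K smooth the prediction by averaging over more neighbors.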

EXERCISE: Train the two models described above with default parameters.

The answer should be:

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=1, p=2,
          weights='uniform')
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=7, p=2,
          weights='uniform')

In [ ]:
from sklearn import neighbors

knn1_model = neighbors.KNeighborsRegressor(<COMPLETAR>)
knn1_model.fit(<COMPLETAR>.astype(float), <COMPLETAR>.astype(float))  # np.float was removed in NumPy 1.24

knn7_model = neighbors.KNeighborsRegressor(<COMPLETAR>)
knn7_model.fit(<COMPLETAR>.astype(float), <COMPLETAR>.astype(float))

print knn1_model
print knn7_model

4.6 Prediction and evaluation

Now use the two models you have trained to predict the test output y_test. Then evaluate them by measuring the Mean Squared Error (MSE).

The formula of MSE is

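$$\text{MSE}=\frac{1}{K}\sum_{k=1}^{K}(\hat{y}_k-y_k)^2$$
where $K$ is the number of test samples, $\hat{y}_k$ is the prediction for sample $k$ and $y_k$ is its true value.
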
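
As a worked example of the formula, the sketch below evaluates it on three hypothetical toy values (y_true and y_hat are not the lab's variables). It also shows a shape pitfall that is easy to hit with column vectors.

```python
import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])   # toy targets, shape (3, 1)
y_hat = np.array([[1.5], [2.0], [2.0]])    # toy predictions, same shape

squared_errors = (y_hat - y_true) ** 2     # one squared error per sample
mse_toy = squared_errors.mean()            # (0.25 + 0 + 1) / 3
print(mse_toy)

# With mismatched shapes, (3,) minus (3, 1) broadcasts to a (3, 3) matrix,
# so the mean would be taken over 9 differences instead of 3
mismatched = y_hat.squeeze() - y_true
print(mismatched.shape)
```

Checking that predictions and targets have the same shape before subtracting avoids this silent broadcasting.
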
The answer should be:

 The MSE value for model1 is 0.060090

 The MSE value for model7 is 0.038202

First 5 prediction values with model 1:

[[ 0.51]
 [ 0.17]
 [ 0.46]
 [ 0.2 ]
 [ 0.34]]

First 5 prediction values with model 7:

[[ 0.40857143]
 [ 0.21285714]
 [ 0.27428571]
 [ 0.32      ]
 [ 0.36857143]]

In [ ]:
y_predict_1 = knn1_model.predict(<COMPLETAR>.astype(float))  # np.float was removed in NumPy 1.24
mse1 = <COMPLETAR>
print " The MSE value for model1 is %f\n " % mse1

y_predict_7 = knn7_model.predict(<COMPLETAR>.astype(float))
mse7 = <COMPLETAR>
print " The MSE value for model7 is %f\n " % mse7

print "First 5 prediction values with model 1:\n"
print <COMPLETAR>

print "\nFirst 5 prediction values with model 7:\n"
print <COMPLETAR>

 4.7 Saving the results

Finally, we will save all our predictions for the model with K=1 in a csv file. To do so you can use the following code snippet, where y_pred contains the predicted output values for the test set.


In [ ]:
y_pred = y_predict_1.squeeze()
with open('output.csv', 'wb') as f:                  # 'wb' is the Python 2 mode for csv files
    csv_file_object = csv.writer(f)
    for index, y_aux in enumerate(<COMPLETAR>):      # Run through each predicted value,
        csv_file_object.writerow([index, y_aux])     # writing its index and its value
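
A quick way to check a saved file is to read it back. The sketch below is written for Python 3, where csv files are opened in text mode with newline='' instead of the 'wb' mode used in this Python 2 lab. The predictions y_pred_toy and the temporary file name are hypothetical.

```python
import csv
import os
import tempfile
import numpy as np

y_pred_toy = np.array([0.51, 0.17, 0.46])    # hypothetical predictions

out_path = os.path.join(tempfile.mkdtemp(), 'output_toy.csv')

# The with-statement closes the file, so every row is flushed to disk
with open(out_path, 'w', newline='') as f:
    writer = csv.writer(f)
    for index, y_aux in enumerate(y_pred_toy):
        writer.writerow([index, float(y_aux)])   # one row per prediction: index, value

# Read the file back: every field comes back as a string
with open(out_path, 'r', newline='') as f:
    rows_back = list(csv.reader(f))
print(rows_back)
```

Each row of the file holds the position of the test sample and its predicted value.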