Machine Learning Using Python by ARCC

What is Machine Learning?

Machine Learning is a subfield of computer science that Arthur Samuel described as "giving computers the ability to learn without being explicitly programmed". More generally, it is the ability of a machine to improve its performance on a task through experience.

Two major types of Machine Learning problems

The two major types of Machine Learning problems are:

Supervised Learning: In this type of Machine Learning problem, the learning algorithm is given a data set in which every example has a known label or result, for example a collection of emails already marked as spam or not spam.
Unsupervised Learning: In this type of Machine Learning problem, the learning algorithm is given a data set with no known labels or results. The algorithm is expected to find some structure in the data by itself, without any supervision. Search engines are one example.
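As a rough illustration of the difference, the sketch below uses scikit-learn on a tiny made-up data set: the supervised model is fitted with labels, while the unsupervised one receives only the points. The data values, the variable names and the choice of LogisticRegression and KMeans are assumptions made for this example, not part of the boot camp material.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up 2-D points forming two well-separated groups
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [8.0, 8.5], [8.2, 7.9], [7.9, 8.1]])

# Supervised: we also supply the known label of every point
known_labels = np.array([0, 0, 0, 1, 1, 1])
supervised = LogisticRegression().fit(points, known_labels)
supervised_prediction = supervised.predict([[7.5, 8.0]])

# Unsupervised: only the points are given; the algorithm groups them itself
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = unsupervised.labels_

print(supervised_prediction)
print(cluster_ids)
```

The cluster numbers produced by KMeans are arbitrary identifiers; only the grouping itself is meaningful.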

In order to limit the scope of this boot camp, as well as save us some time, we will focus on Supervised Learning today.

Machine Learning's "Hello World" programs

Supervised Learning

In this section we focus on three Supervised Learning algorithms, namely Linear Regression, Linear Classifier and Support Vector Machines.

Linear Regression

To explain what Linear Regression is, let's write a program that performs it. Our goal is to find the best-fitting straight line through a data set of 50 random points.

The equation of a line in slope-intercept form is: $y = mx + c$



In [17]:

    
#NumPy is the fundamental package for scientific computing with Python
import numpy as np

# Matplotlib is a Python 2D plotting library 
import matplotlib.pyplot as plt

#Number of data points
n=50
x=np.random.randn(n)
y=np.random.randn(n)

#Create a figure and a set of subplots
fig, ax = plt.subplots()

#Find best fitting straight line
#This will return the coefficients of the best fitting straight line 
#i.e. m and c in the slope intercept form of a line-> y=m*x+c
fit = np.polyfit(x, y, 1)

#Plot the straight line
ax.plot(x, fit[0] * x + fit[1], color='black')

#scatter plot the data set
ax.scatter(x, y)

plt.ylabel('y axis')
plt.xlabel('x axis')
plt.show()

#predict output for an input say x=5
x_input=5
predicted_value= fit[0] * x_input + fit[1]
print(predicted_value)

0.271414517289

Using the best-fitting straight line, we can predict the expected output for a new input. Regression is thus a Machine Learning technique for predicting outputs that take continuous values.
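np.polyfit with degree 1 performs an ordinary least-squares fit. The sketch below computes the slope and intercept from the closed-form least-squares formulas and compares them with np.polyfit. Unlike the cell above, where y is independent noise, it assumes data generated around a made-up line y = 2x + 1 so that there is a trend to recover; the seed, the noise level and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
n = 50
x = rng.randn(n)
# Assumed ground truth for this sketch: y = 2x + 1 plus small noise
y = 2.0 * x + 1.0 + 0.1 * rng.randn(n)

# Closed-form least squares for y = m*x + c
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

# np.polyfit returns the coefficients highest power first: [m, c]
m_fit, c_fit = np.polyfit(x, y, 1)

print(m, c)
print(m_fit, c_fit)

# Predict the output for a new input
x_input = 5
predicted_value = m * x_input + c
print(predicted_value)
```

Both approaches should agree to within floating-point error, and the recovered slope and intercept should land close to the assumed values of 2 and 1.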

Linear Classifier

A classifier, for now, can be thought of as a program that uses an object's characteristics to identify which class it belongs to, for example classifying a fruit as an orange or an apple. The following program is a simple Supervised Learning classifier that uses a decision tree (an example of a decision tree is shown below). Note that a decision tree splits the data with thresholds on individual features, so strictly speaking it is not a linear classifier.



In [18]:

    
#Import the decision tree module from the scikit-learn machine learning library
from sklearn import tree

#List of features
#Say we have 9 inputs, each with two features i.e. [feature one = 1 to 9, feature two = 0 or 1]
features=[[1,1],[8,0],[5,1],[2,1],[6,0],[9,1],[3,1],[4,1],[7,0]]

#The 9 inputs are classified explicitly into three classes (0, 1 and 2) by us
# For example input [1,1] belongs to class 0
#             input [2,1] belongs to class 1
#             input [3,1] belongs to class 2
labels=[0,0,0,1,1,1,2,2,2]

#Features are the inputs to the classifier and labels are the outputs

#Create decision tree classifier
clf=tree.DecisionTreeClassifier()
#Training algorithm, included in the object, is executed
clf=clf.fit(features,labels) #Fit is a synonym for "find patterns in data"

#Predict which class an input belongs to
#for example [2,1]
print(clf.predict([[2,1]]))

[1]
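To see what the classifier learned, scikit-learn can print the fitted tree as text rules. The sketch below assumes a scikit-learn version that provides sklearn.tree.export_text (0.21 or later); the feature names and random_state=0 are labels and settings chosen here for illustration.

```python
from sklearn import tree

features = [[1,1],[8,0],[5,1],[2,1],[6,0],[9,1],[3,1],[4,1],[7,0]]
labels = [0,0,0,1,1,1,2,2,2]

# random_state fixes the tie-breaking between equally good splits
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Print the learned if/else thresholds, one line per node
rules = tree.export_text(clf, feature_names=["feature_one", "feature_two"])
print(rules)

# A fully grown tree reproduces its training labels
# as long as no two identical inputs carry different labels
training_accuracy = clf.score(features, labels)
print(training_accuracy)
```

With only nine training inputs, the thresholds the tree picks are fairly arbitrary, so predictions for inputs far from the training data should be treated with caution.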

Support Vector Machines

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. A simple example is using a linear hyperplane to separate two clusters of data points, which the following code implements.



In [19]:

    
#import basic libraries for plotting and scientific computing
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
#emulates the aesthetics of ggplot in R
style.use("ggplot")

#import class svm from scikit-learn
from sklearn import svm

#input data
X = [1, 5, 1.5, 8, 1, 9]
Y = [2, 8, 1.8, 8, 0.6, 11]

#classes assigned to input data
y = [0,1,0,1,0,1]

#plot input data
#plt.scatter(X,Y)
#plt.show()

#Create the Linear Support Vector Classification
clf = svm.SVC(kernel='linear', C = 1.0)

#input data in 2-D
points=[[1,2],[5,8],[1.5,1.8],[8,8],[1,0.6],[9,11]]#,[2,2],[1,4]]

#Fit the data with Linear Support Vector Classification
clf.fit(points,y) #Fit is a synonym for "find patterns in data"

#Predict the class for the following two points depending on which side of the hyperplane they lie
#predict expects a 2-D array (one row per sample), so each point is wrapped in its own list
print(clf.predict([[0.58,0.76]]))
print(clf.predict([[10.58,10.76]]))

#find coefficients of the linear svm
w = clf.coef_[0]
#print(w)

#find slope of the line we wish to draw between the two classes
#the hyperplane is w[0]*x + w[1]*y + b = 0, so y = -(w[0]/w[1])*x - b/w[1]
a = -w[0] / w[1]

#Draw the line

#x points for a line
#linspace -> returns evenly spaced numbers over a specified interval
xx = np.linspace(0,12)
#equation of our SVM hyperplane
yy = a * xx - clf.intercept_[0] / w[1]

#plot the hyperplane
h0 = plt.plot(xx, yy, 'k-', label="svm")
#plot the data points as a scatter plot
plt.scatter(X,Y)
plt.legend()
plt.show()

[0]
[1]
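For a linear kernel, the fitted model can also report which training points ended up as support vectors, and on which side of the separating line a new point falls together with a signed score. A minimal sketch, reusing the six points from the cell above; support_vectors_ and decision_function are standard members of scikit-learn's SVC, while the variable names are chosen here.

```python
from sklearn import svm

points = [[1,2],[5,8],[1.5,1.8],[8,8],[1,0.6],[9,11]]
y = [0,1,0,1,0,1]

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(points, y)

# Training points that lie on or inside the margin and define the hyperplane
print(clf.support_vectors_)

# predict and decision_function both expect a 2-D array: one row per sample
new_points = [[0.58,0.76],[10.58,10.76]]
predictions = clf.predict(new_points)
# Signed score: negative on the class 0 side, positive on the class 1 side
scores = clf.decision_function(new_points)
print(predictions)
print(scores)
```

The larger the magnitude of the score, the further the point lies from the separating line.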

Basic workflow when using Machine Learning algorithms

Neural Networks

Neural Networks can be thought of as one large composite function made up of other functions. Each layer is a function whose input is the output of the previous layer. For example:
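A minimal sketch of this idea with two made-up layer functions (layer_one and layer_two are hypothetical names, not part of any library): the network is simply the second function applied to the output of the first.

```python
def layer_one(x):
    # First layer: scale and shift the input
    return 2 * x + 1

def layer_two(h):
    # Second layer: square whatever the first layer produced
    return h ** 2

def network(x):
    # The network is the composition layer_two(layer_one(x))
    return layer_two(layer_one(x))

print(network(3))  # (2*3 + 1)**2 = 49
```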

The example above was rather rudimentary. Let us look at a case where we have more than one input, fed to a prediction function that maps them to an output. This can be depicted by the following graph.

Building a Neural Network

The function carried out by our layer is called the sigmoid. It takes the form: $\sigma(x) = \frac{1}{1+e^{-x}}$

Steps to follow to create our neural network:
1) Get a set of inputs
2) Take the dot product with a set of weights i.e. weight1 × input1 + weight2 × input2 + weight3 × input3
3) Send the dot product to our prediction function i.e. the sigmoid
4) Check how much we missed i.e. calculate the error
5) Adjust the weights accordingly
6) Do this for all inputs, about 1000 times



In [20]:

    
#import numpy library
import numpy as np

# Sigmoid function, which maps any value to a number between 0 and 1
# This is the function that our layer computes
# It is used to convert numbers to probabilities

def sigmoid(x):
    return 1/(1+np.exp(-x))

# input dataset
# 4 sets of inputs, each with 3 values
x = np.array([[0.45,0,0.21],[0.5,0.32,0.21],[0.6,0.5,0.19],[0.7,0.9,0.19]])
 
# output dataset corresponding to our set of inputs    
# .T takes the transpose of the 1x4 matrix which will give us a 4x1 matrix
y = np.array([[0.5,0.4,0.6,0.9]]).T

#Makes our model deterministic for us to judge the outputs better
#Numbers will be randomly distributed but randomly distributed in exactly the same way each time the model is trained
np.random.seed(1)
 
# initialize weights randomly with mean 0, weights lie within -1 to 1
# dimensions are 3x1 because we have three inputs and one output
weights = 2*np.random.random((3,1))-1


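#Network training code
#Train our neural network for 1000 iterations
#(the loop variable is unused, so we call it _ instead of shadowing the built-in iter)
for _ in range(1000):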
    #get input
    input_var = x
    
    #This is our prediction step
    #first predict given input, then study how it performs and adjust to get better
    #This line first multiplies the input by the weights and then passes it to the sigmoid function
    output = sigmoid(np.dot(input_var,weights))
    
    #now we have guessed an output based on the provided input
    #subtract from the actual answer to see how much did we miss
    error = y - output

    #based on error update our weights
    weights += np.dot(input_var.T,error)

#The best-fit weights found by our neural net are as follows:
print("The weights that the neural network found was:")
print(weights)

#Predict with new inputs i.e. dot with weights and then send to our prediction function
predicted_output = sigmoid(np.dot(np.array([0.3,0.9,0.1]),weights))
print ("Predicted Output:")
print (predicted_output)

The weights that the neural network found was:
[[ 0.77051871]
 [ 1.85826276]
 [-3.820081  ]]
Predicted Output:
[ 0.82077161]
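The update in the training loop above, adding the dot product of the transposed inputs with the error, is a gradient-descent step on the summed cross-entropy loss with a step size of 1. The sketch below makes that explicit by introducing a learning_rate variable and recording the loss at every iteration; learning_rate, cross_entropy and losses are names introduced here for illustration and do not appear in the original cell.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(y, p):
    # Summed cross-entropy between targets y and predictions p
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

x = np.array([[0.45,0,0.21],[0.5,0.32,0.21],[0.6,0.5,0.19],[0.7,0.9,0.19]])
y = np.array([[0.5,0.4,0.6,0.9]]).T

np.random.seed(1)
weights = 2 * np.random.random((3, 1)) - 1

learning_rate = 1.0  # hypothetical knob; 1.0 matches the update in the cell above
losses = []
for _ in range(1000):
    output = sigmoid(np.dot(x, weights))
    losses.append(cross_entropy(y, output))
    error = y - output
    # Gradient-descent step: x.T @ error is the negative gradient of the summed loss
    weights += learning_rate * np.dot(x.T, error)

print(losses[0], losses[-1])
```

Comparing the first and last recorded loss is a quick check that training is moving in the right direction; a smaller learning_rate would take smaller, more cautious steps.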

Optional :

K-Nearest Neighbour



In [21]:

    
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Feature set containing (x,y) values of 25 known/training data
trainData = np.random.randint(0,100,(25,2)).astype(np.float32)

# Labels each one either Red or Blue with numbers 0 and 1
responses = np.random.randint(0,2,(25,1)).astype(np.float32)

# Take Red families and plot them
red = trainData[responses.ravel()==0]
plt.scatter(red[:,0],red[:,1],80,'r','^')

# Take Blue families and plot them
blue = trainData[responses.ravel()==1]
plt.scatter(blue[:,0],blue[:,1],80,'b','s')

#New unknown data point
newcomer = np.random.randint(0,100,(1,2)).astype(np.float32)
#Make this unknown data point green
plt.scatter(newcomer[:,0],newcomer[:,1],80,'g','o')

#Carry out the K nearest neighbour classification
knn = cv2.ml.KNearest_create()
#Train the algorithm
#cv2.ml.ROW_SAMPLE (value 0) tells OpenCV that each row of trainData is one sample
knn.train(trainData, cv2.ml.ROW_SAMPLE, responses)
#Find the 3 nearest neighbours; the newcomer is assigned the majority class among them
ret, results, neighbours ,dist = knn.findNearest(newcomer, 3)
print ("result: ", results,"\n")
print ("neighbours: ", neighbours,"\n")
print ("distance: ", dist)
plt.show()

result:  [[ 1.]] 

neighbours:  [[ 1.  1.  0.]] 

distance:  [[ 145.  205.  293.]]
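K-Nearest Neighbour classifies a new point by looking at the k closest training points and taking a majority vote over their labels. The sketch below spells this out in plain NumPy on a small made-up data set; knn_predict is a hypothetical helper written for this example, not an OpenCV function. It reports squared Euclidean distances, which is what the whole-number distances in the output above appear to be.

```python
import numpy as np

def knn_predict(train_points, train_labels, query, k=3):
    # Squared Euclidean distance from the query to every training point
    sq_dists = np.sum((train_points - query) ** 2, axis=1)
    # Indices of the k closest training points, nearest first
    nearest = np.argsort(sq_dists)[:k]
    votes = train_labels[nearest]
    # Majority vote over the labels of the k neighbours
    prediction = int(np.bincount(votes).argmax())
    return prediction, votes, sq_dists[nearest]

# Made-up training data: class 0 near the origin, class 1 far away
train_points = np.array([[10, 10], [12, 14], [80, 85], [90, 70], [15, 20]],
                        dtype=np.float32)
train_labels = np.array([0, 0, 1, 1, 0])
query = np.array([20, 18], dtype=np.float32)

prediction, votes, sq_dists = knn_predict(train_points, train_labels, query, k=3)
print("result:", prediction)
print("neighbours:", votes)
print("distance:", sq_dists)
```

An odd k is a common choice for two classes because it avoids tied votes.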

Additional resources to further learn Python

Tutorials by python.org at https://docs.python.org/3/tutorial/

Python for Everybody specialization by the University of Michigan at www.coursera.org

Python intro course on Data Camp at https://www.datacamp.com/courses/intro-to-python-for-data-science

Free exercises at https://learnpythonthehardway.org/book/