Exercises for Chapter 1

Training Machine Learning Algorithms for Classification

Question 1. In the file algos/blank/perceptron.py, implement Rosenblatt's perceptron algorithm by fleshing out the class Perceptron. When you're finished, run the code in the block below to test your implementation.



In [2]:

    
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from algos.blank.perceptron import Perceptron

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 1, -1)
X = df.iloc[0:100, [0, 2]].values

ppn = Perceptron()
ppn.fit(X, y)

if (ppn.errors[-1] == 0):
    print('Looks good!')
else:
    print("Looks like your classifier didn't converge to 0 :(")

Question 2. Raschka claims that without a maximum number of epochs or a threshold of acceptable misclassifications, the perceptron may never stop updating. Explain why this can happen, and give an example.



In [ ]:

    
# Write your answer here
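
As a hint for Question 2, here is a minimal sketch rather than a reference solution. It assumes a bare-bones perceptron written as a standalone function (`perceptron_errors` is a hypothetical helper, not the `Perceptron` class from algos/blank/perceptron.py) and compares two toy datasets: AND, which is linearly separable, and XOR, which is not. The learning rate of 0.5 is chosen only so that every update is a whole number.

```python
import numpy as np

def perceptron_errors(X, y, eta=0.5, n_epochs=50):
    """Hypothetical helper: run the perceptron rule, return updates per epoch."""
    w = np.zeros(X.shape[1] + 1)  # w[0] is the bias unit
    errors = []
    for _ in range(n_epochs):
        n_wrong = 0
        for xi, target in zip(X, y):
            prediction = 1 if (xi @ w[1:] + w[0]) >= 0.0 else -1
            update = eta * (target - prediction)  # zero when the prediction is right
            w[1:] += update * xi
            w[0] += update
            n_wrong += int(update != 0.0)
        errors.append(n_wrong)
    return errors

X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_and = np.array([-1, -1, -1, 1])   # linearly separable
y_xor = np.array([-1, 1, 1, -1])    # not linearly separable

errors_and = perceptron_errors(X_toy, y_and)
errors_xor = perceptron_errors(X_toy, y_xor)

print('AND, last epoch:', errors_and[-1])
print('XOR, fewest errors in any epoch:', min(errors_xor))
```

On AND the error count drops to zero and stays there. On XOR no single line separates the classes, so every epoch contains at least one update, and only the epoch cap stops the loop.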

Question 3. The following diagram comes from Raschka's book. Try to answer the questions about it without looking back at the text.

What is being depicted in the diagram on the left? How about the diagram on the right?



In [3]:

    
# Write your answer here

Describe in words what the following symbols represent in the diagram on the left:

The axes, $w^{T}x$ and $\phi(w^{T}x)$
The thick black line



In [ ]:

    
# Write your answer here

Describe in words what the following symbols represent in the diagram on the right:

The red circles
The blue pluses
The axes, $X_{1}$ and $X_{2}$
The vertical dashed line



In [ ]:

    
# Write your answer here

True or False: In the diagram on the right, $X_{1} = \phi(w^{T}x) = 0$. Explain your reasoning.



In [ ]:

    
# Write your answer here

True or False: In the general relationship depicted by the diagram on the right ($X_{1}$ vs. $X_{2}$), the dashed line must always be vertical. Explain your reasoning.



In [ ]:

    
# Write your answer here

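Question 4. Plot $X$ and its standardized form $X'$, following the feature scaling procedure that Raschka uses in the book. How does standardizing each feature (subtracting its mean and dividing by its standard deviation, i.e. computing $z$-scores) change the sample distribution? Store the standardized array as X_std, since later cells use it.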



In [4]:

    
# Write your code here



In [ ]:

    
# Write your answer here
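
For reference, standardization can be sketched as follows. This is a sketch under assumptions, not the book's exact code: `standardize` is a hypothetical helper, the data is synthetic rather than the Iris features, and it uses NumPy's default population standard deviation (`ddof=0`).

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: subtract each column's mean, divide by its std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic stand-in for two features with different locations and scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 3.0], scale=[0.5, 1.5], size=(100, 2))
X_demo_std = standardize(X_demo)

print('means:', X_demo_std.mean(axis=0))
print('stds: ', X_demo_std.std(axis=0))
```

Each column ends up with mean 0 and standard deviation 1. The transformation is a shift and a rescale, so the shape of each feature's distribution is unchanged.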

Question 5. In the file algos/blank/adaline.py, implement the Adaline rule in the class Adaline. When you're finished, run the code in the block below to test your implementation.



In [5]:

    
from algos.blank.adaline import Adaline

ada = Adaline()
ada.fit(X_std, y)

if (ada.cost[-1] < 5):
    print('Looks good!')
else:
    print("Looks like your classifier didn't find the minimum :(")

Question 6. Implement stochastic gradient descent as an option for the Adaline class. Then, run the test code below.



In [6]:

    
ada_sgd = Adaline(stochastic=True)
ada_sgd.fit(X_std, y)

if (ada_sgd.cost[1] < 1):
    print('Looks good!')
else:
    print("Looks like your stochastic model isn't performing well enough :(")

Question 7. Describe a situation in which you would choose to use batch gradient descent, a situation in which you would choose to use stochastic gradient descent, and a situation in which you would choose to use mini-batch gradient descent.



In [ ]:

    
# Write your answer here
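
One way to see how the three variants in Question 7 relate is that they can share a single update loop and differ only in how many samples feed each weight update. The sketch below assumes an Adaline-style sum-of-squared-errors cost and synthetic data. `run_epoch` and `sse_cost` are hypothetical helpers, not part of the `Adaline` class, and the learning rate is an arbitrary small value.

```python
import numpy as np

def run_epoch(X, y, w, eta, batch_size, rng):
    """One pass over the data, updating w once per chunk of batch_size samples."""
    order = rng.permutation(len(y))  # reshuffle before every pass
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        errors = y[idx] - (X[idx] @ w[1:] + w[0])
        w[1:] += eta * X[idx].T @ errors
        w[0] += eta * errors.sum()
    return w

def sse_cost(X, y, w):
    """Half the sum of squared errors of the linear activation."""
    return 0.5 * np.sum((y - (X @ w[1:] + w[0])) ** 2)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1.0, -1.0)

final_costs = {}
for name, batch_size in [('batch', len(y_demo)), ('mini-batch', 10), ('stochastic', 1)]:
    w = np.zeros(3)
    for _ in range(20):
        w = run_epoch(X_demo, y_demo, w, eta=0.005, batch_size=batch_size, rng=rng)
    final_costs[name] = sse_cost(X_demo, y_demo, w)
    print(name, final_costs[name])
```

With `batch_size` equal to the dataset size there is one update per epoch, with 1 there is one update per sample, and anything in between is mini-batch. The trade-off to discuss is update frequency and noise against the cost of each update.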

Question 8. Implement online learning as an option for the Adaline class. Then, run the test code below.



In [6]:

    
new_X = df.iloc[100, [0, 2]].values.astype(float)  # plain float array, like X
# Standardize with the training statistics: subtract the mean first, then divide
new_X = (new_X - np.mean(X, axis=0)) / np.std(X, axis=0)
new_y = df.iloc[100, 4]
new_y = np.where(new_y == 'Iris-setosa', 1, -1)

ada_sgd.partial_fit(new_X, new_y)

Question 9. Raschka claims that stochastic gradient descent could result in "cycles" if the order in which the samples were read (and corresponding weights updated) wasn't randomized, or "shuffled," before every iteration. Explain the intuition behind this idea, and describe what a "cycle" might look like.



In [ ]:

    
# Write your answer here
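
When experimenting with Question 9, keep in mind that the samples and their labels must be reordered by the same permutation. The sketch below shows one way to do that. `shuffle_together` is a hypothetical helper and the arrays are toy data chosen so that the pairing is easy to check by eye.

```python
import numpy as np

def shuffle_together(X, y, rng):
    """Return X and y reordered by the same random permutation."""
    order = rng.permutation(len(y))
    return X[order], y[order]

rng = np.random.default_rng(42)
X_demo = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
y_demo = X_demo[:, 0] * 10             # label tied to each row's first entry

X_shuf, y_shuf = shuffle_together(X_demo, y_demo, rng)
print(X_shuf)
print(y_shuf)
```

Calling this before every epoch gives the weights a different sequence of updates each time. Skipping it means every epoch replays the identical sequence, which is the setting in which a cycle can appear.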

Question 10. Verify that stochastic gradient descent speeds up convergence for Adaline on the Iris dataset by plotting the cost against the epoch number for both batch and stochastic gradient descent. Then, briefly explain why this is the case.



In [ ]:

    
# Write your code here



In [ ]:

    
# Write your answer here