In this exercise, we illustrate Hoeffding's inequality in the context of virtual coin flips.
For a bin of infinite size, from which we pick a sample of size $N$, Hoeffding's Inequality bounds the probability that the sample frequency, $\nu$, deviates from the unknown bin frequency, $\mu$:
$$\mathbb{P} \left[ \left| \nu - \mu \right| > \epsilon \right] \le 2e^{-2\epsilon^2 N}$$
The bound is valid for all $N$ and $\epsilon$ and does not depend on $\mu$. It illustrates the tradeoff between $N$, $\epsilon$, and the bound.
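To get a feel for the tradeoff: with a tolerance of $\epsilon = 0.1$, a sample of $N = 10$ gives the vacuous bound $2e^{-0.2} \approx 1.64$, while $N = 1000$ tightens it to $2e^{-20} \approx 4.1 \times 10^{-9}$.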
The Python libraries used are:
In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import multiprocessing as mp
import itertools
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(edgeitems=3, infstr='inf', linewidth=75, nanstr='nan',
                    precision=4, suppress=False, threshold=500, formatter=None)
%matplotlib inline
In this virtual experiment, 1000 fair coins are simulated. Each coin is flipped 10 times. The data is stored in a matrix where the 10 columns represent each of the 10 flips and the 1000 rows represent the 1000 individual coins.
It is assumed that 'heads' are represented by '1' and 'tails' are represented by '0', so that we can use np.sum(..., axis=1) to get the number of heads for each coin.
A total of three coins are selected from the 1000 fair coins - $c_1$ is the first coin, $c_{rand}$ is a random coin, and $c_{min}$ is the coin that yielded the minimum number of heads (out of the 10 coin flips). The following code calculates the mean frequency of "obtaining a head" for each of these three coins.
In [2]:
def get_coin_flip_data():
    c = np.random.randint(2,size=(1000,10))  # 1000 coins, 10 flips each
    idx = np.random.choice(1000)  # choose index for a random coin
    n_heads = np.sum(c,axis=1)    # calculate the number of heads
    lo = np.argmin(n_heads)       # find first coin with the smallest number of heads
    c_1 = c[0,:]                  # choose the first coin
    c_rand = c[idx,:]             # choose a random coin
    c_min = c[lo,:]               # choose the coin with the minimum number of heads
    ν_1 = np.mean(c_1)            # get the fraction of heads for first coin
    ν_rand = np.mean(c_rand)      # get the fraction of heads for random coin
    ν_min = np.mean(c_min)        # get the fraction of heads for coin with minimum number of heads
    return ν_1, ν_rand, ν_min
In [3]:
def run_experiment(N=100000,seed=None,verbose=False):
    if seed is not None:
        np.random.seed(seed)
    sum_ν_1 = 0.0
    sum_ν_rand = 0.0
    sum_ν_min = 0.0
    for i in range(N):
        ν_1, ν_rand, ν_min = get_coin_flip_data()
        sum_ν_1 += ν_1
        sum_ν_rand += ν_rand
        sum_ν_min += ν_min
    mean_ν_1 = sum_ν_1/N
    mean_ν_rand = sum_ν_rand/N
    mean_ν_min = sum_ν_min/N
    if verbose is True:
        print("Mean ν_1 = {}".format(mean_ν_1))
        print("Mean ν_rand = {}".format(mean_ν_rand))
        print("Mean ν_min = {}".format(mean_ν_min))
    return mean_ν_min
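As an aside, the per-experiment loop above can also be vectorized, trading memory for speed. A minimal sketch (not part of the original notebook; it assumes the (n_expts, 1000, 10) array fits in memory):
def run_experiment_vectorized(n_expts=1000, seed=None):
    if seed is not None:
        np.random.seed(seed)
    c = np.random.randint(2, size=(n_expts, 1000, 10))  # (experiments, coins, flips)
    ν = c.mean(axis=2)       # fraction of heads for every coin
    ν_min = ν.min(axis=1)    # minimum-heads coin in each experiment
    return np.mean(ν_min)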
In [4]:
def run_experiment_parallel(N=100000,verbose=False):
    n_cpus = mp.cpu_count()//2
    experiments = list(itertools.repeat(N//n_cpus,n_cpus))
    # Note that this makes an assumption that the different seeds
    # will generate independent random streams; this may not be true.
    seeds = np.random.choice(N,n_cpus,replace=False)
    p = mp.Pool(n_cpus)
    ν_arr = p.starmap(run_experiment,zip(experiments,seeds,itertools.repeat(False)))
    ν_arr = np.array(ν_arr)
    if verbose is True:
        print(ν_arr)
    return np.mean(ν_arr)
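The caveat in the comment above can be addressed with NumPy's SeedSequence (available in NumPy >= 1.17), which spawns statistically independent child streams. A minimal sketch of generating worker seeds this way, not used in the original code:
from numpy.random import SeedSequence

def make_worker_seeds(n_workers, root_seed=12345):
    ss = SeedSequence(root_seed)
    # each child is an independent stream; generate_state(1)[0] extracts
    # an integer seed suitable for the legacy np.random.seed call above
    return [child.generate_state(1)[0] for child in ss.spawn(n_workers)]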
In [5]:
# Takes about 10 seconds for N=100k: [0.037, 0.0374, 0.0376, 0.0378]
# %time run_experiment(N=100000)
%time run_experiment_parallel(verbose=True)
Out[5]:
The coins whose distribution of $\nu$ satisfies the (single-bin) Hoeffding Inequality are $c_1$ and $c_{rand}$, because their means, $\nu_1$ and $\nu_{rand}$, are close to 0.5. By contrast, $\nu_{min} \approx 0.037$ is far from 0.5: $c_{min}$ is selected after looking at the data, so it does not correspond to a single, fixed bin.
To get the probability bound that $\nu$ will be close to the bin frequency (within a certain tolerance, $\epsilon$), the following function can be used:
In [6]:
def get_probability_bounds(epsilon=0.02,N=100000):
    return 2.0 * np.exp(-2.0*epsilon*epsilon*N)

get_probability_bounds(epsilon=0.02,N=20000)
Out[6]:
Here, it can be seen that the probability that $\nu$ is not within $\epsilon = 0.02$ of the bin frequency, 0.5, is extremely low: $2e^{-16} \approx 2.3 \times 10^{-7}$.
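A quick sweep over $\epsilon$ (a sketch using the function above) makes the tradeoff explicit:
for eps in [0.005, 0.01, 0.02, 0.05]:
    print("epsilon={}: bound={:.3g}".format(eps, get_probability_bounds(epsilon=eps, N=20000)))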
Let us consider a target function, $y=f\left(\mathbf{x}\right)$. The effect of noise is that for a given input, $\mathbf{x}$, it is possible to get multiple values of $y$. In order to take into account the effects of noise, the target function is replaced by a target distribution, which we shall call $\mathbb{P}(y | \mathbf{x})$.
Hence, the sample points, $(\mathbf{x},y)$, are now generated via the joint distribution:
$$\mathbb{P}(\mathbf{x})\mathbb{P}(y|\mathbf{x}) \equiv \mathbb{P}(\mathbf{x},y)$$where $\mathbb{P}(\mathbf{x})$ is the input distribution, and $\mathbb{P}(y|\mathbf{x})$ is the target distribution.
Suppose $h, f$ are binary functions and we observe that the hypothesis, $h$, has a probability, $\mu$, of making an error when approximating the target function, $f$.
We can write this observation as:
$$\mathbb{P}\left[h(\mathbf{x}) \ne f(\mathbf{x})\right] = \mu, \qquad \mathbb{P}\left[h(\mathbf{x}) = f(\mathbf{x})\right] = 1 - \mu$$Suppose "noise" is added to $f$ in the following manner:
$$\mathbb{P}(y|\mathbf{x}) = \left\{ \begin{array}{rl} \lambda & y = f(\mathbf{x})\\ 1-\lambda & y \ne f(\mathbf{x}) \end{array}\right.$$We now wish to calculate the probability that $h$ makes an error in approximating $y$, i.e. $\mathbb{P}\left[h(\mathbf{x}) \ne y\right]$. There are two mutually exclusive ways for this to happen: either $h \ne f$ and $y = f(\mathbf{x})$ (probability $\mu\lambda$), or $h = f$ and $y \ne f(\mathbf{x})$ (probability $(1-\mu)(1-\lambda)$). The probability of error is therefore:
$$\mu \lambda + \left(1 - \mu\right)\left(1 - \lambda\right)$$In the case of $\lambda = \frac{1}{2}$, the above expression becomes:
$$\tfrac{1}{2} \mu + \tfrac{1}{2} \left(1 - \mu\right) = \tfrac{1}{2}$$which is independent of $\mu$.
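As a quick check of this result, a short simulation (a sketch, not part of the original notebook) estimates $\mathbb{P}[h(\mathbf{x}) \ne y]$ directly:
def noisy_error_rate(mu, lam, n=1000000, seed=0):
    np.random.seed(seed)
    h_wrong = np.random.rand(n) < mu           # h(x) != f(x) with probability mu
    y_wrong = np.random.rand(n) < (1.0 - lam)  # y != f(x) with probability 1 - lambda
    return np.mean(h_wrong != y_wrong)         # h errs iff exactly one occurs (XOR)

# e.g. noisy_error_rate(0.3, 0.8) should be close to 0.3*0.8 + 0.7*0.2 = 0.38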
In [7]:
def generate_data(n,seed=None):
    if seed is not None:
        np.random.seed(seed)
    x0 = np.ones(n)  # bias coordinate x0 = 1
    x1 = np.random.uniform(low=-1,high=1,size=(2,n))
    return np.vstack((x0,x1)).T
The region of interest is $\mathcal{X}=\left[-1,1\right] \times \left[-1,1\right]$, where $\times$ denotes the Cartesian product.
To ensure that the random line falls within the region of interest, two points, $(x_0,y_0)$ and $(x_1,y_1)$ are generated within $\mathcal{X}$.
The hypothesis is given by:
$$h(\mathbf{x}) = \text{sign}(\mathbf{w^T x}) = \text{sign}\left(\sum_{i=0}^d w_i x_i\right)$$We now consider the two-dimensional case ($d=2$), and note that $x_0=1$. The decision boundary is $\mathbf{w^T x}=0$, which can be converted to slope-intercept form if we take $x_1$ to be the $x$ axis and $x_2$ to be the $y$ axis, i.e.:
$$y = - \frac{w_1}{w_2} x - \frac{w_0}{w_2}$$Comparison with the equation $y = mx + c$ yields the following relationships:
$$m = - \frac{w_1}{w_2},\quad c = - \frac{w_0}{w_2}$$If we arbitrarily set $w_2=1$, we arrive at the following set of weights, which is consistent with the decision boundary denoted by $y = mx + c$:
$$\mathbf{w} = \left(-c,-m,1\right)$$
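As a sanity check (a sketch with arbitrary example values), a point above the line $y = mx + c$ should be classified as $+1$ by these weights:
m, c = 0.5, -0.2                     # arbitrary slope and intercept
w = np.array([-c, -m, 1])            # weights consistent with y = m*x + c
x = np.array([1.0, 0.3, 0.3])        # x0=1, point (x1, x2) = (0.3, 0.3)
print(np.sign(np.dot(w, x)))         # 1.0
print(np.sign(x[2] - (m*x[1] + c)))  # 1.0, agrees with the direct check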
In [8]:
def get_random_line(seed=None):
    X = generate_data(2,seed=seed)  # two random points in [-1,1] x [-1,1]
    x = X[:,1]
    y = X[:,2]
    m = (y[1]-y[0])/(x[1]-x[0])     # slope
    c = y[0] - m*x[0]               # intercept
    return np.array([-c,-m,1])

def draw_line(ax,w,marker='g--',label=None):
    m = -w[1]/w[2]
    c = -w[0]/w[2]
    x = np.linspace(-1,1,20)
    y = m*x + c
    if label is None:
        ax.plot(x,y,marker)
    else:
        ax.plot(x,y,marker,label=label)

def get_hypothesis(X,w):
    return np.sign(np.dot(X,w))
In [9]:
def plot_data(fig,plot_id,X,y=None,w_arr=None,my_x=None,title=None):
    ax = fig.add_subplot(plot_id)
    if y is None:
        ax.plot(X[:,1],X[:,2],'gx')
    else:
        ax.plot(X[y > 0,1],X[y > 0,2],'b+',label='Positive (+)')
        ax.plot(X[y < 0,1],X[y < 0,2],'ro',label='Negative (-)')
    ax.set_xlim(-1,1)
    ax.set_ylim(-1,1)
    ax.grid(True)
    if w_arr is not None:
        if isinstance(w_arr,list) is not True:
            w_arr = [w_arr]
        for i,w in enumerate(w_arr):
            if i == 0:
                draw_line(ax,w,'g-',label='Theoretical')
            else:
                draw_line(ax,w,'g--')
    if my_x is not None:
        ax.plot([my_x[0]],[my_x[1]],'kx',markersize=10)
    if title is not None:
        ax.set_title(title)
    ax.legend(loc='best',frameon=True)
The weights for the decision boundary may be calculated using linear regression with an appropriate estimator. We start with a "cost function", where in this case we choose the in-sample squared error, i.e.:
$$E_{in}(h)=\frac{1}{N}\sum_{n=1}^N \left(h(\mathbf{x}_n)-y_n\right)^2$$$$E_{in}(\mathbf{w})=\frac{1}{N}\sum_{n=1}^N \left(\mathbf{w^T x}_n-y_n\right)^2 = \frac{1}{N} \| \mathbf{Xw - y} \|^2$$The in-sample error is minimized as follows:
$$\nabla E_{in}(\mathbf{w}) = \frac{2}{N}\mathbf{X^T}\left(\mathbf{X w - y}\right) = 0$$$$\mathbf{X^T X w} = \mathbf{X^T y}$$$$\mathbf{w} = \left(\mathbf{X^T X}\right)^{-1}\mathbf{X^T y}$$We thus arrive at the following, which is also called the ordinary least squares (OLS) estimator:
$$\mathbf{w} = \left(\mathbf{X^T X}\right)^{-1}\mathbf{X^T y} = \mathbf{X^\dagger y}$$where in the last step, we have introduced the Moore–Penrose pseudoinverse:
$$\mathbf{X^\dagger} = \left(\mathbf{X^T X}\right)^{-1}\mathbf{X^T}$$This is implemented via the following line of code in the subroutine below:
w_linreg = np.dot(np.linalg.pinv(X),y)
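An equivalent route (a sketch; the original code uses the pseudoinverse throughout) is NumPy's least-squares solver, which minimizes $\| \mathbf{Xw - y} \|^2$ via the SVD without forming $\mathbf{X^\dagger}$ explicitly:
# Equivalent to w = pinv(X) @ y
w_linreg, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)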
In [10]:
def create_dataset(N,make_plot=True,seed=None):
    X = generate_data(N,seed=seed)
    w_theoretical = get_random_line()
    y = get_hypothesis(X,w_theoretical)
    w_linreg = np.dot(np.linalg.pinv(X),y)
    if make_plot is True:
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y,[w_theoretical,w_linreg],
                  title="Initial Dataset")
    return X,y,w_theoretical,w_linreg
Since the theoretical decision boundary, w_theoretical, determines the correct classification of every point, we can estimate the fraction of points misclassified by w (the weights calculated using linear regression) via a Monte Carlo approach (random sampling).
In [11]:
def calculate_misclassification(w_theoretical,w,n_samples=1000,verbose=True):
    X = generate_data(n_samples,seed=None)
    y0 = get_hypothesis(X,w_theoretical)  # true labels
    y = get_hypothesis(X,w)               # predicted labels
    n_correct = np.sum(y == y0)
    return (n_samples-n_correct)/n_samples
In [12]:
def misclassification_runs(N=100,n_expts=1000):
    m_in_arr = list()
    m_out_arr = list()
    for i in range(n_expts):
        w_theoretical = get_random_line()
        X_in = generate_data(N,seed=None)
        y_in = get_hypothesis(X_in,w_theoretical)
        w_in = np.dot(np.linalg.pinv(X_in),y_in)
        m_in = calculate_misclassification(w_theoretical,w_in,n_samples=1000)
        X_out = generate_data(N,seed=None)
        y_out = get_hypothesis(X_out,w_theoretical)
        w_out = np.dot(np.linalg.pinv(X_out),y_out)
        m_out = calculate_misclassification(w_theoretical,w_out,n_samples=1000)
        m_in_arr.append(m_in)
        m_out_arr.append(m_out)
    return m_in_arr,m_out_arr
In [13]:
# [0.0482, 0.04727, 0.04815, 0.04858, 0.04900, 0.04890]
# [0.24319, 0.24848, 0.24765]
m_in_arr,m_out_arr = misclassification_runs()
print("Misclassification error: %.5f (in-sample)" % np.mean(m_in_arr))
print("Misclassification error: %.5f (out-of-sample)" % np.mean(m_out_arr))
The misclassification error here is about 5%. Also, we find no difference in misclassification error between the in-sample and out-of-sample weights.
The Perceptron Learning Algorithm (PLA) is implemented in the following steps: at each iteration, pick a random misclassified point $(\mathbf{x}_n, y_n)$ and update the weights via $\mathbf{w} \leftarrow \mathbf{w} + y_n \mathbf{x}_n$; stop as soon as no points are misclassified.
In [14]:
def PLA(w,X,y0,n_iterations=10,verbose=True):
    assert len(y0) == X.shape[0]
    n = len(y0)
    x_arr = list()
    w_arr = list()
    m_arr = list()
    for i in range(n_iterations):
        h = get_hypothesis(X,w)
        bad = h != y0
        bad = np.argwhere(bad).flatten()
        if len(bad) > 0:
            idx = np.random.choice(bad,1)[0]  # pick a random misclassified point
            my_x = X[idx,:]
            m_arr.append(100.0*len(bad)/n)
            x_arr.append(my_x)
            w_arr.append(np.copy(w))
            w += np.dot(y0[idx],my_x)         # PLA update: w <- w + y*x
            if verbose is True:
                print("iter {}: {}% misclassified, w={}" \
                      .format(i,m_arr[-1],w_arr[-1]))
        else:
            m_arr.append(0.0)
            x_arr.append(np.array([1.0, np.nan,np.nan]))
            w_arr.append(np.copy(w))
            if verbose is True:
                print("iter {}: zero misclassified (PLA has converged)".format(i))
            return w,w_arr,m_arr,x_arr
    print("PLA failed to converge after {} iterations".format(n_iterations))
    return None,None,None,None
In [15]:
def get_iteration_distribution(N,n_trials=1000,max_iterations=10000,summary=True):
    n_iterations = np.zeros(n_trials,dtype=int)
    for i in range(n_trials):
        X,y,w_theoretical,w_linreg = create_dataset(N=N,make_plot=False,seed=None)
        w,w_arr,m_arr,x_arr = PLA(w_linreg,X,y,n_iterations=max_iterations,verbose=False)
        n_iterations[i] = len(w_arr)
    if summary is True:
        print("Minimum iterations: {}".format(np.min(n_iterations)))
        print("Maximum iterations: {}".format(np.max(n_iterations)))
        print("Mean iterations: {}".format(np.mean(n_iterations)))
    return n_iterations
In [16]:
# [4.721, 5.88, 4.8, 5.31, 6.237]
n_iterations = get_iteration_distribution(N=10)
In [17]:
def get_nonlinear_hypothesis(X):
    h = X[:,1]**2 + X[:,2]**2 - 0.6  # target: f(x) = sign(x1^2 + x2^2 - 0.6)
    return np.sign(h).astype(int)
In [18]:
def create_nonlinear_dataset(N=1000,make_plot=True,seed=None,verbose=False):
    X = generate_data(N,seed=seed)
    y0 = get_nonlinear_hypothesis(X)
    idx = np.random.choice(len(y0),100,replace=False)  # flip 100 random indices (10% noise for N=1000)
    y_flipped = np.copy(y0)
    y_flipped[idx] = -y0[idx]
    w_linreg = np.dot(np.linalg.pinv(X),y_flipped)
    y_linreg = get_hypothesis(X,w_linreg)
    if verbose is True:
        print("w (from linear regression) = {}".format(w_linreg))
        print("y (from linear regression) = {}".format(y_linreg))
    if make_plot is True:
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y_flipped,[np.array([10,-1,1]), w_linreg],
                  title="Initial Dataset")
    return y_flipped,y_linreg
In [19]:
y0,y_linreg = create_nonlinear_dataset(verbose=True)
In [20]:
def run_trials_nonlinear(N=1000,n_trials=1000,verbose=True):
    m_arr = list()
    for i in range(n_trials):
        y0,y = create_nonlinear_dataset(N=N,make_plot=False,seed=None)
        n_correct = np.sum(y == y0)
        m_arr.append((N-n_correct)/N)
    if verbose is True:
        print("Minimum misclassification= {}".format(np.min(m_arr)))
        print("Maximum misclassification= {}".format(np.max(m_arr)))
        print("Average misclassification= {}".format(np.mean(m_arr)))
    return m_arr
In [21]:
m_arr = run_trials_nonlinear()
By changing to a nonlinear decision boundary, the misclassification error can be considerably reduced. The original dataset, $(x_1, x_2)$, is replaced by a new dataset, $(x_1, x_2, x_3, x_4, x_5) = (x_1, x_2, x_1 x_2, x_1^2, x_2^2)$, where the last 3 columns are obtained via nonlinear transformations of the original data. For example, the point $(x_1, x_2) = (0.5, -0.2)$ becomes $(0.5, -0.2, -0.1, 0.25, 0.04)$, with the bias coordinate $x_0 = 1$ kept in front.
In [22]:
def add_nonlinear_features(X):
    N = X.shape[0]
    X = np.hstack((X,np.zeros((N,3))))
    X[:,3] = X[:,1]*X[:,2]
    X[:,4] = X[:,1]**2
    X[:,5] = X[:,2]**2
    return X
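A quick shape check (a sketch) confirms the transform: a matrix with columns $(x_0, x_1, x_2)$ comes back with columns $(x_0, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$:
X_demo = generate_data(4, seed=0)
print(add_nonlinear_features(X_demo).shape)  # (4, 6)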
In [23]:
def create_nonlinear_dataset_v2(N=1000,make_plot=True,seed=None,verbose=False):
    X = generate_data(N,seed=seed)
    y0 = get_nonlinear_hypothesis(X)
    idx = np.random.choice(len(y0),100,replace=False)  # flip 100 random indices (10% noise for N=1000)
    y_flipped = np.copy(y0)
    y_flipped[idx] = -y0[idx]
    X = add_nonlinear_features(X)
    w_linreg = np.dot(np.linalg.pinv(X),y_flipped)
    y_linreg = get_hypothesis(X,w_linreg)
    if verbose is True:
        print("w (from linear regression) = {}".format(w_linreg))
        print("y (from linear regression) = {}".format(y_linreg))
    if make_plot is True:
        fig = plt.figure(figsize=(7,5))
        plot_data(fig,111,X,y_flipped,title="Initial Dataset")
    return y_flipped,y_linreg,X
In [24]:
def plot_data_nonlinear(fig,plot_id,X,y=None,w_arr=None,w_colors=None,title=None):
    p = 1.5
    ax = fig.add_subplot(plot_id)
    if y is None:
        ax.plot(X[:,1],X[:,2],'gx')
    else:
        ax.plot(X[y > 0,1],X[y > 0,2],'b+',label='Positive (+)')
        ax.plot(X[y < 0,1],X[y < 0,2],'r_',label='Negative (-)')
    ax.set_xlim(-p,p)
    ax.set_ylim(-p,p)
    ax.grid(True)
    if w_arr is not None and w_colors is not None:
        if isinstance(w_arr,list) is not True:
            w_arr = [w_arr]
        x1 = np.linspace(-p,p,100)
        x2 = np.linspace(-p,p,100)
        X1,X2 = np.meshgrid(x1,x2)
        X1X2 = X1*X2
        X1_sq = X1**2
        X2_sq = X2**2
        labels = ['(a)','(b)','(c)','(d)','(e)']
        for i,w in enumerate(w_arr):
            Y = w[0] + w[1]*X1 + w[2]*X2 + w[3]*X1X2 + \
                w[4]*X1_sq + w[5]*X2_sq
            # draw the zero level set of each candidate hypothesis
            cp = ax.contour(X1,X2,Y,levels=[0.0],linewidths=4,
                            linestyles='dashed',colors=w_colors[i])
            ax.clabel(cp, inline=True, fontsize=14, fmt=labels[i])
    if title is not None:
        ax.set_title(title)
    ax.legend(loc='best',frameon=True)
In [25]:
N = 1000
y0,y,X = create_nonlinear_dataset_v2(N=N,make_plot=False,seed=None)
n_correct = np.sum(y == y0)
m = (N-n_correct)/N
print(m)
By using a nonlinear decision boundary, the misclassification error has gone down to about 12%.
In [26]:
def get_nonlinear_misclass(N=1000,n_trials=1000,verbose=True):
    m_arr = list()
    for i in range(n_trials):
        y0,y,X = create_nonlinear_dataset_v2(N=N,make_plot=False,seed=None)
        n_correct = np.sum(y == y0)
        m_arr.append((N-n_correct)/N)
    if verbose is True:
        print("Minimum misclassification= {}".format(np.min(m_arr)))
        print("Maximum misclassification= {}".format(np.max(m_arr)))
        print("Average misclassification= {}".format(np.mean(m_arr)))
    return m_arr
In [27]:
m_arr = get_nonlinear_misclass()
Over repeated trials of 1000 Monte Carlo simulations, the misclassification error remains at roughly 12%. A histogram shows the distribution of the misclassification error:
In [28]:
_ = plt.hist(m_arr,bins=20)
In [29]:
w1 = np.array([-1,-0.05,0.08,0.13,1.50,1.5])
w2 = np.array([-1,-0.05,0.08,0.13,1.50,15.0])
w3 = np.array([-1,-0.05,0.08,0.13,15.0,1.5])
w4 = np.array([-1,-1.50,0.08,0.13,0.05,0.05])
w5 = np.array([-1,-0.05,0.08,0.13,0.15,0.15])
w_arr = [w1,w2,w3,w4,w5]
w_colors = ['red','orange','green','blue','pink']
There is little need to run a Monte Carlo simulation to compare the misclassification errors of these five candidate weight vectors. A contour plot of the five decision boundaries (the zero level sets, plotted as dashed lines) clearly shows the red line ($w_1$) to be the best-performing hypothesis.
In [30]:
fig = plt.figure(figsize=(10,7))
plot_data_nonlinear(fig,111,X,y0,w_arr,w_colors,"Nonlinear Decision Boundaries")
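For a quick numerical confirmation (a sketch, not part of the original notebook), each candidate's agreement with the noisy labels from In [25] can be computed directly:
for w, color in zip(w_arr, w_colors):
    agreement = np.mean(get_hypothesis(X, w) == y0)
    print("{:>6}: fraction in agreement = {:.3f}".format(color, agreement))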