This is a one-page comprehensive publish that aims to help you learn and understand what Machine Learning is and the workings behind "how a machine thinks", all right from scratch. We'll not only be building learning/prediction models together but also understand and realize the nuts-and-bolts that make it work.
This also means that there is going to be exhaustive mathematics (linear algebra, calculus, etc.) throughout the way but do not be intimidated already if you aren't coming from a mathematical background. This publish also aims to teach you the mathematics behind learning models in the most simplest of terms present in the English language. I'll be using the keyword ST (for in simpler terms) to denote something hard to comprehend in the most simplest of terms. For instance,
$$\frac{dy}{dx}$$ST
The notation above denotes an operation $\frac{d}{dx}$ on a variable $y$ to realize "at what rate does $y$ change with respect to $x$". Assume the variable $y$ to be the anger temperament of your boss and $x$ to be your skills to mollify an individual well. If it were to be a perfect linear relation, the above factor would denote no appreciable rate of change whatsoever.
WARNING: Do not consider such "simple english" to solve various mathematical problems. Mathematics per se has a different way of expressing the world and is binded by its rules and grammar. Mathematics provides you reasoning that is universally accepted and is never ambiguous. By all means, you should never merge the two domains for problem solving. We'll try incorporating the basics behind the mathematics that is going to be used throughout this publish.
We then remain true to our title - Machine Learning for dummies
PS: Do not be disheartened by the word dummies. We aim to present this publish even for those who aren't able make sense of all that scientific textbook jargon.
This publish will be regularly updated over the course of time, attempting to learn and build new models and solve problems. You can keep track of these updates in the What's new? section.
I'm currently an undergraduate at the Department of Computer Engineering, University of Mumbai and expected to graduate in the year 2017.
You can contact me in the following ways, I'd be happy to help!
Type | Description |
---|---|
achillesrasquinha@gmail.com | |
www.facebook.com/rasquinhaachilles | |
www.linkedin.com/in/achillesrasquinha | |
GitHub | www.github.com/achillesrasquinha |
I'd like to thank my Professor - Mrs. Anuradha Srinivas for introducing me first into the field of Machine Learning and also helping me engage in solving a wide range of problems within the domain. I'd also like to thank my Professor - Mrs. Vincy Joseph for being a supportive guide and mentor as I work on my Bachelor's thesis on Image Context Analysis and Contextual Caption Generation Using RNNs/LSTMs.
Without further ado, let's get started.
Choice of Programming Language
Image Source: xkcd
Solving complex mathematical problems require some state-of-the-art tools, especially when it comes to the field of Machine Learning. We'll be implementing our models and perform our analysis using one such tool (to be precise, a programming language) - Python.
But, why Python?
Simply because of its "english-like" code structure that attempts to somewhat make linguistic sense to people who may have never written a single line of code. Moreover, Python comes with a wide range of scientific computing libraries (for linear algebra computations, calculus, graph visualizations, etc.) which we'll be using throughout our way.
Finally (and this is my favourite), we've chosen Python because of its high rate of productivity. Python has a high rate of productivity by minimizing the number of lines of code (in Computer Science, we call this unit - LOC) to write some logic, in comparision to other programming languages such as C++, Java, etc. [1] Performance wise, the latter perform way better than Python (in terms of time and space complexity). We're however not keen in optimizing our code (by choice of language) but with the said limitations, we'll try out best to. Also, Python is a general-purpose high level programming language.
So Python does seem to fit the jig-saw.
Recommended Books
class ANN
.In order to build a learning path for ourselves, I've created a mind map that can be viewed over here.
Machine Learning is when "a machine turns a statistician". Machine Learning is when a computer observes you while you clean your house (the way you hold your mop, the way you sway it over the floor) and mimicks (well, almost) the steps you performed while you cleaned. Not just that, it may even outperform in terms of tidiness or maybe perform novel ways of cleaning that you weren't even aware of. Machine Learning is when a computer observes you drive (the way you percieve roads) and later attempts to drive a car all by itself (we call it today - Autonomous Driving, a subject that had its seed planted in the 1980s) [2]
Machine Learning is when a machine learns to humanly walk after observing a human walking; it is also the field when the machine may jaywalk after observing a group of people jaywalking. This means that we're never expecting a global truth or moral code from a machine but rather what truth and beliefs we wish it must follow. At times, the machine may itself find plausible truths that is beyond the capabilities of individual observance and thinking (What does it mean to be human? How was the universe formed? etc.)
A machine may be trained to even be a sexist. [3]
Here's what Microsoft AI chatbot @TayandYou had to say about feminism after recieveing learning feedbacks from Twitter trolls.
I fucking hate feminists and they should all die and burn in hell.
NOTE: I do not endorse any kind of Microsoft products.
So yes, machines are beautiful and maybe even dangerous. But so are humans. The fear that these machines might someday take over the world (Stephen Hawking thinks it might) [4] is no less compared to that of the fear for a next human genocide, planned by humans themselves.
Machines and mathematical models are a reflection of us and the world we live in. So the fear for machines exhibiting immoral behaviour merely highlights the flaws we humans exhibit. Our goal is however, to build machines that truly exhibit some desired behaviour and that is universally accepted.
Definition
A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.
Tom Mitchell, Machine Learning
In simple terms, Machine Learning is mapping a variable $x$ (or a set of them) to a variable $y$ (or a set of them) based on its observations of many $x$'s or many $y$'s. You could either direct the machine to move towards the desired $y$ or may leave it alone to discover.
To help you understand better, consider each instance from a data set to be one of the many opinions of the members of House of Commons whereas the learning model is the House's Speaker for that session. I consider John Bercow (the current Speaker for the House of Commons) as a fine example.
NOTE: I ain't British. One doesn't necessarily need to be British to enjoy the hilarity that arises during Prime Minister's Questions.
Each member has his or her own say. Some members have a collective say and some may have an exclusive opinion. Members of the Parliament aim to rant their best over a subject in order to influence members of the House (our data set) as well as the Speaker. Based on how effective each member's opinions are relates to the Speaker being influenced by their opinions. One member's opinion may pull a Speaker's decision on one given direction while the other towards another. A collective analysis (either inferencing from some general truth or his/her observance so far) helps the Speaker of the House to profess his or her conclusions.
Machine Learning models are no different from the Speaker of the House. Like the Speaker, a learning model does not caste its own vote and is nothing without a given say (a data set). It is however influenced by the very nature of data that pulls it towards different directions. Like the House of Commons filled with effective and mundane opinions, we conclude that it is in one's best interest for a goal decision if each of the opinons are extremely effective and not merely how many opinions were put forward during the convention. (In the House of Commons, this seems to be the extreme opposite).
By this we mean to say that having a large data set does not necessarily yeild a well-learned model but rather a qualitative data set along with the choices of effective $x$'s and $y$'s is what helps a machine make great conclusions.
We categorize learning into two kinds
As the word suggests, supervised learning attempts to comapre the expected $y$ and the $y$ it has generated for a given $x$, thereby modifying its learning unit such that it comes closer towards the expected output the next time. In simpler terms, a supervised learning algorithm is provided what is expected from the model.
A training set is nothing but a portion of a data set we're going to work on and feed into the model. It's never advisable to feed the entire data set into our learning model which is why there exists yet another set called as the testing set that is used to observe whether our learning model predicts well or not.
We known that the cardinality (number of elements) of a set (a collection of non-repititive elements) can be given by $$|S| = n$$ where $S$ is a set and $n$ is its cardinality.
We define a given training set with the notation $Tr$ and its cardinality $|Tr| = u$. Similarly, we define the notation $Te$ and its cardinality $|Te| = v$ for a testing set.
We'll be dealing almost always with a multivariate system, i.e. many variables affect the learning unit of our model. Hence, We denote $x_{j}$ to be our input $j$ in our input vector $X$ of length $m$ and $y_{k}$ to be our output $k$ in our output vector $Y$ of length $n$
To denote a sample from the training or testing set we define the notation as, $$S(\{x_{j}^{(i)}\}, \{y_{k}^{(i)}\})$$ where S could be $Tr$ or $Te$ and the tuple $(\{x_{j}^{(i)}\}, \{y_{k}^{(i)}\})$ is nothing but the $i_{th}$ sample of set $S$ denoting an input vector $\{x_{j}^{(i)}\}$ and its corresponding output vector $\{y_{k}^{(i)}\}$. $j$ and $k$ denote our selected attributes from the input and output vectors respectively.
Our model aims to predict output(s) $y$ based on some input(s) $x$. We define this function mapping as
$$\{y_{k}^{(i)}\} = predict_{\theta}(\{x_{j}^{(i)}\})$$Our prediction model depends not only on the input vector $X$ but also a set of parameters $\theta = \{{\theta_{p}}\}$ where $p$ ranges from 0 to $o$, $o$ being the cardinality of the parameter vector $\theta$, i.e. the number of parameters affecting the function $predict$
In [ ]:
from abc import ABCMeta, abstractmethod
In [ ]:
class Model(metaclass = ABCMeta):
@abstractmethod
def predict(self, features, parameters):
pass
We could further classify supervised learning into two kinds of problems.
Let's consider a data set to work on in order to understand what Regression is all about. scikit-learn
(a machine learning library in Python) provides many toy data sets and we'll consider one of them - the Boston House Pricing data set provided under load_boston()
within the module datasets
. University of California, Irvine provides a wide range of open data set repositories to work on, the Housing Prices being one. [5]
In [ ]:
from sklearn.datasets import load_boston
boston = load_boston()
In order to view our data set in a much better way, we'll be using Pandas' DataFrame
Object everytime. DataFrame
is a very powerful data structure for not just storing data sets, but also for data munging, transforming, etc. Let's go ahead and import pandas
.
In [ ]:
import pandas as pd
Our next task is clearly to convert our sklearn
data Bunch
into a Pandas' DataFrame
. Since our data set has been stored as a numpy
array, we'll go ahead and import numpy
as well.
In [ ]:
import numpy as np
In [ ]:
df = pd.DataFrame(data = np.c_[boston['data'], boston['target']],
columns = list(boston['feature_names']) + ['HP'])
Let's look at a sample of our data set.
In [ ]:
nsamples = 5
df.sample(nsamples)
Hmm, just a bunch of numbers. UCI's repository site has a clear description of the names of the aliases within our data frame. Let's visualize our data set considering the relationship between average number of rooms per dwelling (RM) and its corresponding house prices (HP).
We'll be using matplotlib
for graph visualizations and seaborn
for some pretty plotting.
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
# a magic keyword to display matplotlib outputs on jupyter notebooks
% matplotlib inline
We can visualize our data set using a simple scatter plot.
In [ ]:
sns.regplot('RM', 'HP', df)
Notice how the plot seems to be dispersed. We can somewhat visualize that there is an increase in the housing value with an increase of average number of rooms per dwelling for the limited data that we've observed. If we were to estimate a given value for $x$ would result in a unique value (real or whole) for $y$. This is known as a regression problem, i.e. the function which attempts to estimate for a given value of $x$ is a continous function. We could then somewhat fit a line (as shown) or draw a curve that passes through almost all the points on the scatter plot and aim to estimate approximately well for any given input $x$.
Machine Learning is all about building that model of estimation/prediction. The example states to build a model that can draw that line (a polynomial equation to be precise) across these points in order to maximize the closeness with $y$ for any given $x$. Such a model is called as a Linear Regression model which we'll discuss further. (In the case above, we have a single variable $x$ and hence is a univariate problem)
In its simplest sense, a Linear Regressor attempts to find the best fit line that can be drawn from a given set of features (or inputs) with respect to a given set of labels (or outputs).
As mentioned, a Supervised Learning algorithms maps a feature $x$ to $y$ through a function $predict$. We can then define a $predict$ function for a Multivariate Linear Regressor as $$predict_{\theta}(x_{j}^{(i)}) = \sum_{j = 0}^{m}\theta_{j}x_{j}^{(i)}$$ where $j = [0, m]$ and $x_{0}^{(i)} = 1$
In a vector notation we can then denote the above equation as $$predict_{\theta}(X^{(i)}) = \theta^{T}X^{(i)}$$
where $X$ is a vector of shape $[1, m + 1]$ whereas $\theta^T$ is a vector of shape $[m + 1, 1]$.
In [ ]:
class LinearRegression(Model):
'''
Parameters
----------
features - a testing set of size (nsamples, nfeatures)
each sample within the set consists of a vector of size (1, nfeatures)
parameters - a parameter set of size (1, nfeatures + 1)
'''
def predict(self, features, parameters):
labels = [ ]
for i in features:
i = np.asarray(i) # convert to array if single-valued
i = np.insert(i, 0, 1) # insert x0 = 1 into our feature vector
p = np.atleast_2d(parameters)
y = np.dot(np.transpose(p), i)
labels.append(y)
return labels
ST
For a Univariate Linear Regressor (like the example above) i.e. for $j = 0$ and for $j = 1$, the predict function would be:
$$predict_{\theta}(x_{1}^{(i)}) = \theta_{0} + \theta_{1}x_{1}^{(i)}$$knowing that $x_{0}^{(i)} = 1$
The above equation is similar to that of the equation of a line in a two-dimensional space with $\theta_{0}$ being the y-intercept (how high or low it is from $y = 0$) and $\theta_{1}$ being the slope (steepness) of our line. The generalized equation extends this linear incisiveness to higher dimensions that cannot be visualized. The univariate equation help us to understand the fundamentals behind linear predictive models and how it attempts to "draw fine distinctions".
For the sake of a clear explanation, we'll consider a Univariate Linear Regressor that can be extended towards a higher dimensional vector space.
Justice Delayed is Justice Piled: A predictive model for the Indian Judiciary System
For our explanation, we'll be considering a data set "Crime against Women during 2014 in India" published by the Government of India on their website www.data.gov.in under the Digital India initiative. We'll be considering data sets from various platforms in order to learn and observe (and maybe, find reasons and solutions to problems that can be visualized through the data set). The data set is open to use and can be downloaded from here.
Imagine if the employees of the Indian Judiciary System would like to know the number of cases that may be pending for investigation at the end of the year by knowing the number of cases that were pending investigation the year before.
In simpler terms, By how much will the burden of the Judiciary System rise for cases of crimes against women by the end of the year? In order to serve justice well and as early as possible to victims of the crimes, our goal would be to decrease the number of pending investigations for the year in comparision to what our learning model (our Linear Regressor) predicts.
Let's load our data set into a Pandas' DataFrame Object
In [ ]:
path = './data/crimes-against-women-india-2014.csv'
df = pd.read_csv(path)
We shall consider the overall aggregate number of crimes across all States and Union Territories of India.
In [ ]:
df = df[df['States/UTs'] == 'Total (All India)']
Our model needs to predict the number of investigations pending at the end of the year if given the number of investigations pending from the year before.
In [ ]:
colx = 'Cases pending investigation from previous year'
coly = 'Cases pending investigation at the end of the year'
X = df[colx]
y = df[coly]
Alright. Let's now view our data set using a simple scatter plot.
In [ ]:
sns.regplot(colx, coly, df, fit_reg = False)
By observing our graph above, we can very much infere that there exists a clear rising linearity between the two variables. To prove this, we can calculate the Pearson product-moment correlation coefficient which in simpler terms, checks the degree of linearilty between our two variables on a scale of $[-1, +1]$ where $-1$ denotes a perfect negative linear correlation, $0$ denotes no linear correlation and $+1$ denotes a perfect positive linear correlation.
SciPy has a neat function for Karl Pearson's product-moment correlation coefficient.
In [ ]:
from scipy.stats import pearsonr
In [ ]:
r, p = pearsonr(X, y)
r
Very impressive. We've recieved a correlation coefficient of $\approx 0.9988$ which states the high degree of linearity which exists in our data set. Our goal is to fit a best-fit line through the data observed which depends on our parameter set $\theta$
Let's consider a few cases for our parameter set $\theta$
Case 1: $\theta_{0} = 15$ and $\theta_{1} = 0.0$
Case 2: $\theta_{0} = 20$ and $\theta_{1} = 0.5$
Case 3: $\theta_{0} = 23$ and $\theta_{1} = 1.5$
We can then plot the above parameters with respect to the prediction function as
In [ ]:
plt.scatter(X, y, marker = '1')
parameters = np.asarray([
[[15], [0.0]], # case 1
[[20], [0.5]], # case 2
[[23], [1.5]] # case 3
], dtype = np.float32)
regressor = LinearRegression()
legends = [ ]
predictions = [ ]
for p in parameters:
prediction = regressor.predict(X, p)
plt.plot(X, prediction)
legends.append('$\\theta_{0} = %.2f$ and $\\theta_{1} = %.2f$' % (p[0], p[1]))
predictions.append(prediction)
plt.legend(legends)
plt.xlabel(column1)
plt.ylabel('$predict_{\\theta}(x_{1}^{(i)}) = \\theta_{0} + \\theta_{1}x_{1}^{(i)}$')
plt.grid()
plt.show()
We see that an increase in $\theta_{0}$ raises the line above $y = 0$. Similarly an increase in $\theta_{1}$ increases the radial shift of our line with respect to $x = 0$. Our goal is to choose $\theta_{0}$ and $\theta_{1}$ such that it is close to output $y$ for any given input $x$,
i.e. the difference between the output from the prediction $predict_{\theta}(x_{j}^{(i)})$ and $y_{k}^{(i)}$ must be mimimum. Hence, for a given training set with $m$ training samples, we could then estimate the difference in the errors (a cost function) by,
$$C(\theta) = \frac{1}{2m}\sum_{i = 1}^{m} (predict_{\theta}(x^{(i)}) - y^{(i)})^{2}$$Such a function is also called as the Mean Squared Error.
In [ ]:
def mean_squared_error(output, target, nsamples):
sum = np.sum(np.power(np.subtract(output, target), 2))
result = (0.5 / nsamples) * sum
return result
ST
The quadratic nature of the cost function speaks volumes. Squaring the difference helps us eliminate the positive and negative feedback of errors and at the same time increase the spread on a wide range. The factor $\frac{1}{2}$ is multiplied with the mean cost since minimizing the overall error is same as minimizing half its error. A clear reason as to why we've chosen the MSE as our cost function will be discussed in some time.
Our job is to find a parameter set $\theta$ that could minimize the cost function close or equal to 0.
Let's consider our Univariate Linear Regressor again, this time considering $\theta_{0}$ constant (i.e., having no affect over our cost function $C(\theta)$) and estimate the cost with respect to the parameters we've used.
Knowing $\theta_{0} = 0.0$
In [ ]:
from IPython.core.display import display, Math
parameters = np.array(
[[[0.0], [0.0]], # case 1
[[0.0], [0.5]], # case 2
[[0.0], [1.5]]] # case 3
, dtype = np.float32)
predictions = [ ]
nsamples = len(X)
for p in parameters:
prediction = regressor.predict(X, p)
predictions.append(prediction)
for i in range(len(predictions)):
error = mean_squared_error(predictions[i], y, nsamples)
display(Math("Case %i: C(%.2f, %.2f) = %.2e" % (i + 1, parameters[i][0], parameters[i][1], error)))
If we were to consider for a range of $\theta_{1}$ values (keeping our $\theta_{0}$ as it is), we can then plot a graph like the following
In [ ]:
import sys
theta1 = np.arange(-10, 10, 0.1) # a range of theta1
mintheta1 = sys.maxsize
minerror = sys.maxsize
for t in theta1:
prediction = regressor.predict(X, [[0], [t]])
error = mean_squared_error(prediction, y, nsamples)
if error < minerror:
mintheta1 = t
minerror = error
plt.plot(t, error, marker = '.', c = 'b')
plt.xlabel('$\\theta_{1}$')
plt.ylabel('$C(\\theta_{1})$')
plt.grid()
plt.show()
display(Math("C_{min}(0.00, %.2f) = %.2e" % (mintheta1, minerror)))
Woah! Check out that graph. We've got a smooth decrease from an error at around $3.2 \times 10^{10}$ at $\theta_{1} = -9.99$ reaching $9.93 \times 10^{5}$ at $\theta_{1} = 1.20$ and a gradual increase in error again.
If we were to move one more dimension higher, we then could estimate the cost with respect to both, $\theta_{0}$ and $\theta_{1}$. Let's consider for a range of $\theta_{0}$ values as well.
In [ ]:
theta0 = np.arange(-10, 10, 0.1)
mintheta0 = sys.maxsize
mintheta1 = sys.maxsize
minerror = sys.maxsize
errors = [ ]
for u in theta0:
l = [ ]
for v in theta1:
prediction = regressor.predict(X, [[u], [v]])
error = mean_squared_error(prediction, y, nsamples)
if error < minerror:
mintheta0 = u
mintheta1 = v
minerror = error
l.append(error)
errors.append(l)
display(Math("$C_{min}(%.2f, %.2f) = %.2e$" % (mintheta0, mintheta1, minerror)))
Hmm. Well, the error seems to have reached its minimum at $9.92 \times 10^{5}$ after being influenced by $\theta_{0}$. Raising yet another dimension, we could visualize a 3-dimensional plot with contours reflected at each axis.
In [ ]:
def plot_surface(x, y, z, labels = { 'x': '', 'y': '', 'z': '' }):
axes = plt.gca(projection = '3d')
axes.plot_surface(x, y, z, alpha = 0.1)
axes.set_xlabel(labels['x'])
axes.set_ylabel(labels['y'])
axes.set_zlabel(labels['z'])
plt.tight_layout()
X, Y = np.meshgrid(theta0, theta1)
Z = np.atleast_2d(errors)
labels = { 'x': '$\\theta_{0}$', 'y': '$\\theta_{1}$', 'z': '$C(\\theta_{0}, \\theta_{1})$' }
plot_surface(X, Y, Z, labels)
axes = plt.gca(projection = '3d')
# plotting Cmin
axes.scatter(mintheta0, mintheta1, minerror)
plt.show()
In [ ]:
axes = plt.gca(projection = '3d')
axes.contour(X, Y, Z, zdir = 'x', offset = np.amin(X), cmap = cm.coolwarm)
axes.contour(X, Y, Z, zdir = 'y', offset = np.amax(Y), cmap = cm.coolwarm)
axes.contour(X, Y, Z, zdir = 'z', offset = np.amin(Z), cmap = cm.coolwarm)
axes.set_xlabel('$\\theta_{0}$')
axes.set_ylabel('$\\theta_{1}$')
axes.set_zlabel('$C(\\theta_{0}, \\theta_{1})$')
plt.title('Contours')
plt.tight_layout()
plt.show()
Not really. In our first example, we considered $\theta_{0} = 0$ whereas in our second example we limited our range for $\theta_{0} = [-10, 10)$. Both of these examples have resulted in a different minimized error, i.e. we limited our sack of $\theta$ values and then went on a hunt for our parameters that minimizes our cost function. Many sacks of $\theta$s results many minimized errors but then, what would be our optimal solution? Our approach of trying many permutations of these ranges is not a great one. So let us refer to our graph once again.
In [ ]:
plot_surface(X, Y, Z, labels)
# plotting Cmin
axes = plt.gca(projection = '3d')
axes.scatter(mintheta0, mintheta1, minerror)
plt.show()
Our minimized error (blue dot) is no way near to the steepest point downhill that surface. What could then be another solution to reach to our steepest point?
ST
One way to realize this problem is to imagine a scenario wherein you are placed at a random point on the graph. Consider taking small iterative steps such that you're moving all the way dowhill upto a point wherein whichever step you may take in any direction does not lead you further downhill.
In mathematics, we call this as the Gradient Descent Algorithm which in simpler terms means "move downhill by taking very small steps until you cannot move further downhill." We call this least downhill point (with respect to the observer) as a point of convergence.
We can now formulate our Gradient Descent Algorithm as follows:
Let $\alpha$ be the learning rate (how small our step must be), then
$repeat$ $$\theta_{j} := \theta_{j} - \alpha\frac{\partial C(\theta)}{\partial \theta_{j}}$$ $until$ $convergence$
One must note that this is a parallel operation, i.e. each of these assignments are updated simultaneously. (we look towards all possible directions and only then take a step)
ST
The $:=$ operator is an assignment operator (updating a new value for the said variable) whereas the $\frac{\partial C(\theta)}{\partial \theta_{j}}$ factor denotes the rate of change of our cost function influenced by all variables in the parameter set $\theta$ with respect to an individual parameter $\theta_{j}$.
Consider this factor to be somewhat a GPS compass (covering a limited surface area). Our GPS is completely aware of the shape of our hill and guides us which direction should we move towards. If the factor were to be a positive large value, then our step would be equally large moving towards a lesser (big jump) $\theta_{j}$ value. Similarly, if our factor to be a negative small value, our step would be moving towards a greater (small jump) $\theta_{j}$ value.
One needs to observe for the fact that if $\alpha$ (our learning parameter) were to be very small, the algorithm would take a great amount of time to reach a point of convergence. Whereas in case where the $\alpha$ were to be comparatively a larger value, we may overshoot the point of convergence and may even tend to diverge.
Let's consider $C(\theta)$ and compute its partial derivative with respect to $\theta_{j}$, we then get.
$$\frac{\partial C(\theta)}{\partial \theta_{j}} = \frac{1}{m} \sum_{i = 1}^{m} (predict_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_{j}^{(i)}$$Well, yes. Our graph has a single point of convergence which denotes the global minima simply because of the quadratic nature of our function $C(\theta)$. On our two-dimensional graph, we observed a parabolic plot (a convex function) and raising a parabola to higher dimensions thereby also results a convex function. No matter what, we'll always have a single point of convergence. So yes. We must thank our quadratic cost function for helping us get away with local minimas.
We can now implement our Gradient Descent Algorithm as follows:
In [ ]:
# def gradient_descent(features, labels, parameters, nsamples, predict, learning_rate = 0.001):
# predictions = predict(features, parameters)
# error = predictions - labels
# features = np.insert(features, 0, 1, axis = 1)
# dcost = (1.0 / nsamples) * np.dot(np.transpose(error), features)
# parameters = parameters - learning_rate * np.transpose(dcost)
# return parameters
In [ ]:
# iterations = 10
# parameters = np.random.randn(2, 1)
# for i in range(iterations):
# parameters = gradient_descent(x, y, parameters, nsamples, regressor.predict)
# prediction = regressor.predict(x, parameters)
# error = mean_squared_error(prediction, y, nsamples)
# print(error)
# plt.plot(i, error, marker = '.', color = 'blue')
# plt.grid()
# plt.show()
# plt.scatter(x, y)
# plt.plot(x, regressor.predict(x, parameters))
Let's discuss another kind of Supervised Learning problems.
Let's consider yet another data set to work on in order to understand what Classification is all about. We'll consider a famous data set provided by seaborn
- Ronald Fisher's iris data set provided under load_dataset()
.
In [ ]:
import seaborn as sns
iris = sns.load_dataset('iris')
Let's take a look of our data set.
In [ ]:
iris.sample(nsamples)
The data set maps 4 different attributes namely (sepal length & width, petal length & width) to a given iris species from a set of 3 (setosa, versicolor and virginica). We can visualize our data set with a scatter plot matrix as follows:
In [ ]:
sns.PairGrid(iris, hue = "species") \
.map_diag(plt.hist) \
.map_upper(plt.scatter) \
.map_lower(plt.scatter)
In the case of the above graph, we see that setosa species (blue) has a small sepal length with a comparatively larger petal length, whereas virginica species (red) has a larger sepal length but a smaller petal length in comparision. Such a kind of problem could be termed as a Classification problem, i.e. choosing a categorical output based on some input variables (categorical or numeric). If we were to draw boundaries within the scatter plot, we could approximately classify a given iris species from a list of 3 based on the sepal length and petal length provided.
Machine Learning is all about building those boundaries.
An Artificial Neural Network (hereby, ANN) is a learning model that is based on the way a Biological Neural Network works.
Such models prove effective in cases wherein our data set is not linearly seperable i.e., one cannot seperate input to output mappings by merely drawing a line.
Our biological neural network system consists of billions of interconnected small units called as neurons. Let's look at a simplistic biological neuron and visualize our artifical model from it later.
So, how do ANNs work? We now describe the smallest computational unit of an ANN - neuron.
Image Source: [6]
Each neuron is connected to other neurons by a connection we call as the synapse. Each synapse has a weight $w$ associated to it. Consider this as the strength of the neuron that can be updated to increase the performance of the architecture. At each neuron, a $net$ is calculated by a summation of the products of each input to the neuron through a synapse and the weight associated to it on that synapse. The output of a neuron is given by $\phi(net)$ where $\phi$ denotes the activation function of that neuron. An additional neuron known as the bias having an input $+1$ is attached to each neuron of an ANN.
Let $w_{ij}$ be the weight of a synapse/connection from the $i^{th}$ neuron to the $j^{th}$ and $b_{j}$ is the bias weight to the $j^{th}$ neuron. Let the input from the $i^{th}$ neuron be $x_i$. Then, $net_j$ can be defined as $$net_j = \sum_{i = 1}^{m} w_{ij} x_i + b_{j}$$ where $m$ is the number of neurons in the layer where neuron $i$ exists.
The output from the $j_{th}$ neuron will then be $$x_{j} = \phi(net_{i})$$
where $\phi(net_{j})$ is the activation function present at neuron $i$. Also, $x_{j}$ either acts as the output or an input from neuron $j$ to say, neuron $k$.
ST
Assume a neuron having a single input, say $x_{1}$ coming from neuron $1$ to neuron $2$ connected by a weight $w_{12}$. Also, declaring a bias input having an input 1 and a weight $b_{j}$ associated to its synapse. We can then formulate the output $y$ at neuron $2$ as, $$y = \phi(w_{12} x_1 + b_{2})$$
We notice that the above equation represents a Univariate Logistic Regressor, where $w_{12}$ denotes the slope of a line and $b_{2}$ denotes the y-intercept. Hence, each neuron in itself is a logistic unit.
In [ ]:
# an identity function
def identity(x, derivative = False):
if derivative:
return np.sign(x)
return x
In [ ]:
# testing our activation functions using a range(-10, 10, 0.01)
x = np.arange(-10, 10, 0.01)
In [ ]:
# plotting an identity function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, identity(x) , lw = 2)
ax.plot(x, identity(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Identity Function')
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
In [ ]:
# a heaviside function
def heaviside(x):
return 0.5 * (np.sign(x) + 1)
In [ ]:
# plotting a heaviside function
_, ax = plt.subplots()
ax.plot(x, heaviside(x), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Heaviside Function')
plt.legend(["$\phi(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)$")
# adding margins
margin = 1
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1 - margin, y2 + margin))
plt.show()
In [ ]:
# finding the first order derivative of a sigmoid function
s = sp.Symbol('x') # denoting a sympy symbol
f = 1 / (1 + sp.exp(-s)) # a sigmoid function
sp.diff(f) # differentiating the equation
In [ ]:
# a sigmoid function
def sigmoid(x, derivative = False):
e = np.exp(-x)
if derivative:
return e / np.power(1 + e, 2)
return 1 / (1 + e)
In [ ]:
# plotting a sigmoid function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, sigmoid(x) , lw = 2)
ax.plot(x, sigmoid(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Sigmoid Function')
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
In [ ]:
# finding the first order derivative of a bipolar sigmoid function
f = (1 - sp.exp(-s)) / (1 + sp.exp(-s)) # a bipolar sigmoid function
d = sp.diff(f) # differentiating the equation
sp.simplify(d) # simplifying the equation
In [ ]:
# a bipolar sigmoid function
def bipolar_sigmoid(x, derivative = False):
e = np.exp(-x)
if derivative:
e = np.exp(x)
return 2 * e / np.power(1 + e, 2)
return (1 - e) / (1 + e)
In [ ]:
# plotting a sigmoid function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, bipolar_sigmoid(x) , lw = 2)
ax.plot(x, bipolar_sigmoid(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Bipolar Sigmoid Function')
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
In [ ]:
# finding the first order derivative of a cloglog function
f = 1 - sp.exp(-sp.exp(s)) # a cloglog function
d = sp.diff(f) # differentiating the equation
sp.simplify(d)
In [ ]:
# a complementary log-log function
def cloglog(x, derivative = False):
e = np.exp(x)
if derivative:
return np.exp(x - e)
return 1 - np.exp(-e)
In [ ]:
# plotting a sigmoid function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, cloglog(x) , lw = 2)
ax.plot(x, cloglog(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Complementary Log Log')
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
# adding margins
margin = 0.5
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1 - margin, y2 + margin))
plt.show()
In [ ]:
# finding the first order derivative of a tanh function
f = sp.tanh(s)
sp.diff(f)
In [ ]:
# a tanh function
def tanh(x, derivative = False):
if derivative:
return 1 - np.power(np.tanh(x), 2)
return np.tanh(x)
In [ ]:
# plotting a tanh function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, tanh(x) , lw = 2)
ax.plot(x, tanh(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title('Hyperbolic Tan Function')
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
This activation function is best to use when:
For more details, see [7]
In [ ]:
def tanh_lecun(x, derivative = False):
a = 1.7159
b = (2.0 / 3.0)
if derivative:
return a * (b - np.tanh(b * x))
return a * np.tanh(x)
In [ ]:
# plotting yann lecun's tanh function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, tanh_lecun(x) , lw = 2)
ax.plot(x, tanh_lecun(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title("Yann LeCun's Hyperbolic Tan Function")
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
In [ ]:
# a relu (Rectifier Linear Unit) function
def relu(x, derivative = False):
if derivative:
# the derivative of a ramp function is nothing but a heaviside function
return heaviside(x)
return np.maximum(0, x)
In [ ]:
# plotting a relu function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, relu(x) , lw = 2)
ax.plot(x, relu(x, derivative = True), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title("ReLU Function")
plt.legend(["$\phi(x)$", "$\phi'(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)/\phi'(x)$")
plt.show()
In [ ]:
# an absolute function
def absolute(x):
return np.abs(x)
In [ ]:
# plotting a relu function and its first order derivative
_, ax = plt.subplots()
ax.plot(x, absolute(x), lw = 2)
ax.grid(True, which = 'both')
ax.axhline(y = 0, color = 'k')
ax.axvline(x = 0, color = 'k')
plt.title("Absolute Function")
plt.legend(["$\phi(x)$"])
plt.xlabel('$x$')
plt.ylabel("$\phi(x)$")
plt.show()
Gaussian (Normal) Distribution with a standard deviation $\sigma = 0.01$
Yann LeCun's Weight Initialization Method: Yann LeCun states that if the data is been normalized and tanh_lecun
is been used, then the weights could be initialized from a distribution having a mean $\mu = 0$ and a standard deviation $\sigma = {\frac{1}{\sqrt{m}}}$ where $m$ is the number of inputs to a node.
Xavier / Glorot Weight Initialization Method: If $m$ is the number of fan-in nodes and $n$ is the number of fan-out nodes, then the weights could be initialized choosing a random value from a Uniform Distribution in the range $$U \sim [\,{\sqrt{\frac{6}{m + n}}}, {\sqrt{\frac{6}{m - n}}}]\,$$
In order to randomize weights around 0, we pick out random values from the said probability density function.
In [ ]:
# a probability density function for a Normal Distribution
def pdf_normal(x, mu = 0.0, sigma = 1.0):
c = 2 * np.power(sigma, 2)
return (1.0 / np.sqrt(np.pi * c)) * np.exp(-(np.power(x - mu, 2) / c))
In [ ]:
# plotting a Uniform Distribution
def plot_uniform(low = 0.0, high = 1.0, bins = 100, size = 1000, title = None):
range = np.random.uniform(low, high, size)
_, bins, _ = plt.hist(range, bins, normed = True)
plt.legend(['$U \sim [\,%.2f, %.2f]\,$' % (low, high)])
if title is not None:
plt.title(title)
plt.show()
In [ ]:
# a test case to plot a Uniform Distribution
plot_uniform()
In [ ]:
# plotting a Normal Distribution and its PDF
def plot_normal(mu = 0.0, sigma = 1.0, bins = 100, size = 1000, title = None):
range = np.random.normal(mu, sigma, size)
_, bins, _ = plt.hist(range, bins, normed = True)
plt.plot(bins, pdf_normal(bins, mu, sigma), lw = 2, color = 'r')
if title is not None:
plt.title(title)
plt.legend(['$\mu = %0.2f, \sigma = %0.2f$' % (mu, sigma)])
plt.show()
In [ ]:
# a test case to plot a Normal Distribution and its PDF
plot_normal()
In [ ]:
# Our Artificial Neural Network class
class ANN:
# available weight initialization methods
RANDN = 1
LECUN_UNIFORM = 2
LECUN_NORMAL = 3
XAVIER_UNIFORM = 4
XAVIER_NORMAL = 5
# available activation functions
IDENTITY = identity
CLOGLOG = cloglog
RELU = relu
SIGMOID = sigmoid
TANH = tanh
TANH_LECUN = tanh_lecun
HEAVISIDE = heaviside
ABSOLUTE = absolute
'''
@arguments
sizes - a list containing the number of neurons in each layer inclusive of the input and output layer.
wimethod - the initialization method for initializating weights and biases.
'''
def __init__(self, sizes, wimethod = RANDN, debug = False):
self.log = Log(debug) # a logger for debugging
self.nlayers = len(sizes) # number of layers inclusive of i/p layer
self.ninputs = sizes[ 0] # number of inputs
self.noutputs = sizes[-1] # number of outputs
self.sizes = sizes # the topological structure of the neural network
self.neurons = sum(sizes) # number of neurons within the architecture
self.biases, self.weights = self._init_weights(wimethod)
# logging initial weights
self.log.debug("ANN._init_weights: Initial Weights: ")
if self.log.debuggable:
for i, w in enumerate(self.weights):
self.log.debug('weight(%i)' % (i + 1))
self.log.display(npa_to_latex_bmatrix_string(w))
# logging initial biases
self.log.debug("ANN._init_weights: Initial Biases: ")
if self.log.debuggable:
for i, b in enumerate(self.biases):
self.log.debug('bias(%i)' % (i + 1))
self.log.display(npa_to_latex_bmatrix_string(b))
def _init_weights(self, wimethod = RANDN):
sizes = self.sizes
# initialize weights to 0 and then "break the symmetry" [3]
weights = [np.zeros(shape = (x, y)) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(shape = (1, y)) for y in sizes[1:]]
if wimethod is ANN.RANDN:
self.log.debug("ANN._init_weights: Initialization Weights using a Gaussian Distribution (mu = 0, sigma = 0.01).")
# a random gaussian distribution with standard deviation = 0.01
mu = 0.0
sigma = 0.01
weights = [np.random.normal(size = (x, y), loc = mu, scale = sigma) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.random.normal(size = (1, y)) for y in sizes[1:]]
self.log.debug('ANN._init_weights: Plotting a Probability Density Function for weights.')
if self.log.debuggable:
plot_normal(mu = mu, sigma = sigma, title = 'Initial Weights Distribution for all Layers.');
if wimethod is ANN.LECUN_UNIFORM:
self.log.debug("ANN._init_weights: Initialization Weights using Yann LeCun's method (Uniform Distribution).")
# to be implemented
if wimethod is ANN.LECUN_NORMAL:
self.log.debug("ANN._init_weights: Initialization Weights using Yann LeCun's method (Normal Distribution).")
mu = 0.0
for i in range(self.nlayers - 1):
m = sizes[i] # fan-in
n = sizes[i + 1] # fan-out
sigma = np.sqrt(1.0 / m)
weights[i] = np.random.normal(size = (m, n)) * sigma
biases [i] = np.random.normal(size = (1, n))
u = i + 1 # layer x
v = i + 2 # layer y
self.log.debug('ANN._init_weights: Plotting a Probability Density Function for weights (%i, %i).' % (u, v))
if self.log.debuggable:
plot_normal(mu = mu, sigma = sigma, title = 'Initial Weight Distribution for (%i, %i)' % (u, v))
if wimethod is ANN.XAVIER_UNIFORM:
self.log.debug("ANN._init_weights: Initialization Weights using Xavier's method (Uniform Distribution).")
for i in range(self.nlayers - 1):
m = sizes[i] # fan-in
n = sizes[i + 1] # fan-out
r = np.sqrt(6.0 / (m + n)) # range
weights[i] = np.random.uniform(low = -r, high = r, size = (m, n))
biases [i] = np.random.uniform(low = -r, high = r, size = (1, n))
u = i + 1 # layer x
v = i + 2 # layer y
self.log.debug('ANN._init_weights: Plotting a Uniform Distribution for weights (%i, %i).' % (u, v))
if self.log.debuggable:
plot_uniform(low = -r, high = r, title = 'Initial Weight Distribution for (%i, %i)' % (u, v))
if wimethod is ANN.XAVIER_NORMAL:
self.log.debug("ANN._init_weights: Initialization Weights using Xavier's method (Normal Distribution).")
mu = 0.0
for i in range(self.nlayers - 1):
m = sizes[i] # fan-in
n = sizes[i + 1] # fan-out
sigma = np.sqrt(3.0 / (m + n)) # standard deviation
weights[i] = np.random.normal(size = (m, n)) * sigma
biases [i] = np.random.normal(size = (1, n))
u = i + 1 # layer x
v = i + 2 # layer y
self.log.debug('ANN._init_weights: Plotting a Probability Density Function for weights (%i, %i).' % (u, v))
if self.log.debuggable:
plot_normal(mu = mu, sigma = sigma, title = 'Initial Weight Distribution for (%i, %i)' % (u, v))
return (biases, weights)
def view(self):
graph = nx.DiGraph()
# to be implemented
nx.draw(graph, node_color = 'y')
In [ ]:
class FNN(ANN):
def __init__(self, *args, **kwargs):
ANN.__init__(self, *args, **kwargs)
def _feedforward(self, x, activation = ANN.SIGMOID):
self.log.debug('FNN._feedforward: Feeding an input: ' + str(x))
table = PrettyTable(['LAYER.', 'WEIGHT', 'BIAS', 'NET', 'f(NET)'])
x = np.atleast_2d(x)
print(x)
input = [ ] # stores the inputs to each layer
output = [x] # stores the outputs from each layer
for i in range(self.nlayers - 1):
w = self.weights[i]
b = self.biases [i]
z = np.add(np.dot(x, w), b) # z = x.w + b, the net of a given neuron
x = activation(z) # a = f(z) , triggering the activation function
input .append(z)
output.append(x)
table.add_row([i + 1,
npa_to_latex_bmatrix_string(w),
npa_to_latex_bmatrix_string(b),
npa_to_latex_bmatrix_string(z),
npa_to_latex_bmatrix_string(x)])
self.log.debug('FNN._feedforward: For each iteration.')
self.log.display(table.get_html_string())
self.log.debug('FNN._feedforward: Done. Model generated: ' + str(x))
return { 'input': input, 'output': output }
def fit(self, features, labels, activation = ANN.SIGMOID, learning_rate = 0.01, epochs = 10000):
self.activation = activation
choice = random.randrange(self.ninputs) # choosing a random feature to feed into the network
# feeding the feature once through the network
# model states the inputs and outputs of each layer
model = self._feedforward(features[choice], activation)
y = model['output'][-1] # the output recieved from the output layer
# using mean squared error
error = mse(labels[choice], y)
self.log.debug('FNN.fit: Error recieved: ' + str(error))
# initiating the backpropagation algorithm
# derivative of the cost function
# = - (expected_output - predicted_output) x f'(input to output layer)
delta = (y - labels[choice]) * self.activation(model['input'][-1], derivative = True)
self.log.debug('FNN.fit: BackPropagation Error from output layer: ' + str(delta))
delta_weights = [np.zeros(w.shape) for w in self.weights] # delta weights list
delta_biases = [np.zeros(b.shape) for b in self.biases] # delta biases list
table = PrettyTable(['LAYER.', 'DELTA', 'ΔW', 'ΔB'])
round = 4
delta_weights[-1] = np.dot(np.transpose(model['output'][-2]), delta)
delta_biases [-1] = delta
table.add_row([self.nlayers,
npa_to_latex_bmatrix_string(np.round(delta , round)),
npa_to_latex_bmatrix_string(np.round(delta_weights[-1], round)),
npa_to_latex_bmatrix_string(np.round(delta_biases [-1], round))])
for i in range(2, self.nlayers):
z = model['input' ][-i]
weight = self.weights[-i + 1]
delta = np.dot(delta, np.transpose(weight)) * self.activation(z, derivative = True)
a = model['output'][-i - 1]
delta_weights[-i] = np.dot(np.transpose(a), delta)
delta_biases [-i] = delta
table.add_row([self.nlayers - i + 1,
npa_to_latex_bmatrix_string(delta),
npa_to_latex_bmatrix_string(delta_weights[-i]),
npa_to_latex_bmatrix_string(delta_biases [-i])])
self.log.debug('FNN.fit: For each BackPropagation iteration.')
self.log.display(table.get_html_string())
def predict(self, features):
self.log.debug('FNN.predict: Attempting to predict the outputs.')
outputs = [ ] # a list of outputs recieved for each feature
for i in features:
# feeding the feature through the network
model = self._feedforward(i, self.activation)
output = model['output'][-1]
self.log.debug('FNN.predict: Prediction for ' + str(i) + ' recieved is ' + str(output))
outputs.append(output)
return outputs
In [ ]:
features = np.asarray([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = [[0], [1], [1], [0]]
nfeatures = len(features[0]) # number of features
nlabels = len(labels[0]) # number of labels
sizes_hidden = [3] # 1 hidden layer with 3 neurons
sizes = [nfeatures] + sizes_hidden + [nlabels]
In [ ]:
fnn = FNN(sizes, wimethod = ANN.XAVIER_NORMAL, debug = True)
fnn.fit(features, labels)
Survival of the fittest - Charles Darwin, On the Origin of Species
Our ability to learn is in itself - evolutionary; and the principles behind Genetic Algorithms are no different. Of course, the very nature of evolution is nothing but cell optimization over time.
We select individuals that seem a best fit to our objective function (or fitness function) that can survive for the next generation.
Our individuals of the next generation perform a genetic crossover (sharing genetic information) that aim to perform a better generation of individuals that can survive our fitness function.
Mutation occurs when there is a kind of deformation of genes by say, exposure to radiation. In practice, this can be achieved by randomizing our string to an extent to form new individuals that aim to try preserving some genetic diversity.
Our resultant genetic diversity now creates new offsprings from these individuals recieved.
We can now formulate a generalized algorithm for Genetic Algorithms as follows:
In order to tackle any given Machine Learning problem, a standard and structured approach is required. For this purpose, I've built a mind map that can help you make your way towards building any (almost) model. We'll be using the same mind map to tackle our problem for this case study.
Our analysis involves a combination of various algorithms right from Data Preprocessing to Estimation. We can consider building a Pipeline that can perform these steps in sequence, thereby deriving analysis for all cases we'll be considering.
You can consider a pipeline as a container of steps that can be formulated throughout the process of development and deployment. Our data set undergoes a series of phases - processing (missing value imputation, removing outliers), transformation, cross-validation, etc. We can stack these steps into one model and pass our data set through our pipeline in one go, recieving output at the other end.
"They're like legos!"
Kevin Goetsch, "Deploying Machine Learning using Sklearn Pipelines", PyData 2016 [17]
sklearn
has an amazing module dedicated to pipelining your project.
In [ ]:
from sklearn.pipeline import Pipeline
Probably one of the most neglected step in approaching any problem, is formulating the problem itself. A mere definition of the problem won't make much of a difference unless you deep dive into the problem and understand its true required outcome. When one tends to deviate from our original task, we can always refer to our original problem statement.
Let's define a formal description of our problem.
Given a data set containing 36 germplasm features corresponding to any one of the six groups (Marker, Variety, Okra, New, BBR and JBWR) of cotton, build an efficient prediction model that is able to classify under which group does any given input set of germplasm features of cotton lie within.
Based on our Statement, our problem can be termed as a multi-class classification problem. Our job is to analyse and build an efficient model that is able to classify any given cotton species with respect to any of the six groups.
Although our job sounds simple, nothing can be said until we have a good view of our data set.
Our data set is been provided by the Central Institute of Cotton Research (CICR), Nagpur, India. An initial insight into our acquired data set shows us a high degree of unstructuredness, noise and ambiguities within our data set. We have recieved a data set in its utmost raw form. Suprisingly, a large number of real-world data sets are raw, and unstructured in all ways you can possibly imagine. You can have a look of our raw data set on this link.
A great amount of data preprocessing will be required only after which we will be able to analyse and profile our data set well.
An exclusive subdomain called as Agricultural Data Science exists that is dedicated building prediction models that can identify undiscovered species and species identification without testing for various parameters. A cotton sample or an undiscovered species can be classified into a destined group using a classifier model that we propose to build.
In some cases (such as ours), a decent domain knowledge would be necessary to help us understand and approach our data set better. I solemly declare myself to be naive in the field of seed science. After some research (basically Googling), I came across this Doctoral Thesis [13] by Dr. Harish S. that can help us understand more about Cotton Germplasm. For instance, many discrepancies within the recieved data set were resolved thanks to a little bit of knowledge about what exactly our features denote.
Data Preprocessing is possibly one of the most tedious and time-consuming part of building a learning/prediction model. You can consider yourself lucky if you are working on a preprocessed data set. In real world however, a large amount of data sets are noisy, inconsistent and unstructured in nature.
As we have seen, Machine Learning algorithms although sound smart, but aren't smart enough. A raw data set holds a high dregree of noise (inconsistencies, discrepancies, etc.). The GIGO principle applies very well in our field of Machine Learning which states - "garbage in, garbage out". This means that if you were to feed inconsistent logic into your machine, you recieve equally flawed conclusions from it.
Making our data representable into a tabular format requires some kind of human intervention. I initially recieved the data set in an Excel sheet. After a day of data munging, I've successfully converted our dirty data to a cleaned one.
In [1]:
import pandas as pd
I've created a function load_cotton
which returns our data wrapped into a Pandas's DataFrame Object.
In [2]:
def load_cotton():
path = './data/cotton/cotton.csv'
df = pd.read_csv(path)
return df
In [3]:
df = load_cotton()
At first, we'd like to know what are the features present within our data set, even after having no intuition of anything related to seed science.
We can view a sample of our data set using pandas.DataFrame.sample
method.
In [8]:
nsamples = 3
df.sample(nsamples)
Out[8]:
In [ ]:
import pandas_profiling as pp
profile = pp.ProfileReport(df)
profile.to_file('./data/cotton/profile.html')
NOTE: Link to Data Description.
Our data set consists of a mixture of both - categorical as well as numeric features associated to one target variable - Group. As stated in our problem description, our problem is a multi-class classification problem.
Our first question would be - are the number of tuples for each class equal?
Let's visualize our question and infere from what we achieve.
In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
In [5]:
frame = (df.Group.value_counts() / len(df)) * 100
axes = frame.plot.pie(autopct = "%.2f")
axes.set_title('Class Frequency Distribution')
axes.set_aspect('equal')
plt.axis('off')
plt.show()
Probably the most dominating class out of the six is JBWR followed by Variety whereas the smallest contribution to the data set is from Okra. Our data set consists of imbalanced contributions from each class of our target variable.
Why yes. Our model's performance may rely on the frequency distribution of each class within our data set. [15] One way to overcome this is to collect more data for balancing our data set that can help us to improve the overall performance of our machine learning models. However in our case, this isn't possible. Given the limitations, we can carefully split our training, validation and testing sets on the basis of class frequency distribution. We'll be discussing more on this, later.
In our samples above, we've noticed quite a few "NaN"s (non-availability of data) already. This also implies that our data set consists of holes that ought to be eliminated. More on this later.
We shall follow a strict protocol [12] of preprocessing our data which is as follows:
sklearn
has an entire module called preprocessing
that can help us with this phase of our solution.
We can get away with missing values by either:
Ignoring our missing values is certainly not the option since not all predictive models consider missing values (k-NN in this case). Secondly, manually filling out missing values is probably the slowest method and requires an extensive understanding of our problem's domain. Third, deleting our training tuples with missing values could remove a large amount of relavant data that could help our model to learn better. pandas
has a function called DataFrame.dropna()
if you wish to consider deleting missing values (not recommended).
Our final case would be clearly substituting a value in place of missing values. We call this as Imputation of Missing Values. A wide range of stratergies and algorithms exists to impute missing values.
sklearn
has a class Imputer
that implements a wide range of single imputation strategies."MICE $\dots$ is nice."
Elizabeth Stuart, "Recent Advances in missing Data Methods: Imputation and Weighting" [16]
Let's assume that we have features $X_{1}, X_{2}, \dots, X_{m}$. We then perform a simple imputation (using a mean strategy). We then predict the value of $X_{i}$, by using a prediction model using the remaining features and continue this process for each feature vector combination.
Let's find the number of missing values present throughout the data set.
In [9]:
nmissing = df.isnull().sum().sum()
nmissing
Out[9]:
We can calculate the percentage contribution of missing values to our data set as follows:
In [10]:
(nmissing / df.size) * 100
Out[10]:
About $7.88\%$ of our entire data set contains missing values. What about the contribution of each individual feature to our $7.88\%$? Let's visualize the same.
In [11]:
frame = (df.isnull().sum() / nmissing).sort_values() * 100
axes = frame.plot.barh()
axes.set_title('Percentage of Missing Values')
axes.set_xlabel('%')
axes.tick_params(labelsize = 8)
plt.show()
Plant: Growth Habit and Leaf: Nectaries feature vectors seem to consists a high amount of missing data. Will dropping these columns affect the overall performance of our model? What about post imputation?
Well, nothing can be said for now.
Let's build a data set transfomer that can impute missing data within our data set. We can then pass such a transformer with a desired algorithm through our pipeline.
sklearn
has a powerful TransformerMixin
class that can be utilized to build our Imputation class. Our class shall be just a wrapper around fancyimpute
(a library containing a wide range of imputation stratergies).
NOTE: sklearn
has a class Imputer
dedicated for imputing missing values. We'll be building an extension that implements strategies beyond ones provided by sklearn
In [12]:
import numpy as np
from sklearn.base import TransformerMixin
from fancyimpute import SimpleFill, KNN, MICE
In [13]:
class DataFrameImputer(TransformerMixin):
NUMERIC = 0x000100
CATEGORICAL = 0x000200
'''
Parameters
----------
imputer - a fancyimpute class with desired parameters
copy - whether you'd like to perform an inplace imputation or return a copy
default returns a copy
impute - NUMERIC, CATEGORICAL kinds of imputataion
default is imputation on only numeric columns
'''
def __init__(self, imputer = SimpleFill(), impute = NUMERIC, copy = True):
self.imputer = imputer
self.copy = copy
self.impute = impute
def fit(self, *_):
return self
def transform(self, df):
columns = df.columns
numeric = list(column for column in columns if df[column].dtype != 'object')
categorical = list(column for column in columns if df[column].dtype == 'object')
frame = df.copy() if self.copy else df
frame[numeric] = frame[numeric].fillna(np.nan)
frame[categorical] = frame[categorical].fillna('NA')
try:
if self.impute is DataFrameImputer.NUMERIC:
nparray = frame[numeric].as_matrix()
frame[numeric] = self.imputer.complete(nparray)
except ValueError:
pass
return frame
In [14]:
imputer = DataFrameImputer(imputer = MICE(verbose = False), copy = False)
imputer.fit_transform(df)
df.sample(nsamples)
Out[14]:
In case of our numerical feature variables, we successfully imputed our missing values with the mean of that feature vector. Let's view the reduction of NA/NaN's present within our data set.
In [15]:
nmissing = (df.isnull().sum().sum() / df.size) * 100
nmissing
Out[15]:
Aha! We've successfully reduced our number of missing values. In case of our categorical features, we've considered a string entity "NA" in place of our global constant np.nan
.
Duplicate tuples act redundant and serve no purpose for our classifier. We can count the number of duplicate tuples within our dataframe using the DataFrame.duplicated()
function as follows:
In [16]:
nduplicates = df.duplicated().sum()
nduplicates
Out[16]:
Lucky for us, there isn't any duplicate tuples present within our data set. However if there were so, you can delete these tuples using the DataFrame.drop_duplicates()
function.
As a matter of fact, we can visualize data sets beyond 3 dimensions that provide us a decent intuition of what is affecting our data set. We'll try visualizing our data set using a wide range of options available.
If you have say $n$ features, you would then have a $n(n - 1)$ scatter plots where the diagonal denotes each feature. Our data set has about 37 features, that's clearly around 1332 scatter plots! This sounds computationally expensive. Moreover, a standard scatter plot matrix is restricted to scatter plots only. Since our data set is a composition of both numeric as well as categorical data, visualizing a scatter plot relationship between any two different types of features does not give us a clear understanding of our data set. We'll be building what is known as a Generalized Pair Plot that denotes relationships between any kind of variables, be it categorical (ordinal, normal or dichotomous) or numeric. [18]
pandas
has a great plotting module extension to matplotlib
that can help us visualizing mutli-variate Andrew's Plot.
In [17]:
from pandas.tools.plotting import andrews_curves
Since Andrew's Plot is a plot of our features versus a smooth curve function, we would require to convert our data set into something numeric. In case of categorical data, one way to achieve our objective is to use scikit-learn
's LabelEncoder
class that converts our categorical data corresponding to a numerical value.
In [18]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
In [19]:
le = defaultdict(LabelEncoder)
dfle = df.copy()
dfle = dfle.apply(lambda x: x if x.dtype != 'object' else le[x.name].fit_transform(x))
dfle.sample(nsamples)
Out[19]:
In [20]:
labels = dfle['Group'].unique()
groups = le['Group'].inverse_transform(labels)
We can now plot a mutli-dimensional visualization of our data as follows:
In [21]:
def andrews_plot(df, target, labels):
ax = andrews_curves(df, 'Group')
handles, _ = ax.get_legend_handles_labels()
ax.set_title("Andrew's Plot")
ax.set_xlabel('$t$')
ax.set_ylabel('$f_{x}(t)$')
ax.legend(handles, labels)
plt.show()
In [22]:
andrews_plot(dfle, 'Group', groups)
where $$f_{x}(t) = \frac{x^{(i)}_{0}}{\sqrt{2}} + x^{(i)}_{1}\sin{(t)} + x^{(i)}_{2}\cos{(t)} + x^{(i)}_{3}\sin{(2t)} + x^{(i)}_{3}\cos{(2t)} + \dots$$
Argh! Plotting all our observations seem to overlap curves of some groups. Andrew's Plot is impressively clear for 20 observations or less. Let's sample about 20 random observations from our data set for a better picture.
In [23]:
nobservations = 20
Our data set consists of imbalanced observations for each class. We can calculate the percentage contribution of each group to our data set as follows:
In [24]:
frame = (df['Group'].value_counts() / len(df)) * 100
frame.plot.pie(autopct = "%.2f", figsize = (6, 6))
plt.axis('off')
plt.show()
When it comes to sampling our data set (including splitting our data set into training and testing sets), the ratio of each group's contribution must be consistent.
In [25]:
from sklearn.model_selection import train_test_split
In [26]:
y = dfle['Group'].copy()
_, sample = train_test_split(dfle, test_size = (nobservations/len(dfle)), stratify = y)
andrews_plot(sample, 'Group', groups)
Not all features play an impressive role for our prediction model. We can get done away with features that are redundant and have no appreciable affect over our model's prediction. Data Reduction decreases the possibility of overfitting, improves our training time (lesser the number of features, lesser the computation required for learning), simplicity and relavancy.
Our data set consists of 36 features, 1 label and 191 observations. While this may not sound a lot, a large number of data sets used for building prediction models may have an extensive number of features and observations. In our case, there may exists features that may impact the overall performance of our models.
So, how do we go about it?
Isabelle Guyon and Andre Elisseeff provide an excellent checklist on selecting your features and we shall follow the same. [24]
Our data set consists of a mixture of both categorical as well as numerical features. However, many machine learning models understand inputs only in terms of numbers. In order to convert our categorical data into numerical data, we need to use some kind of encoding system.
Let's assume the three kinds of categorical data we can come across:
In case of ordinal variables, scikit-learn
has the LabelEncoder
class that can be used for this purpose. Unless you aren't using any kind of comparision of variables on the way, a simple LabelEncoder
would suffice (for visualizations, etc.).
On analysing our data set, we can conclude that each categorical feature within our data set is clearly nominal and dichotomous in nature.
Converting our nominal/dichotomous categories into something numeric, we use the One-Hot Encoding system. This kind of encoding system considers each nominal/dichotomous category as a seperate feature that takes a value either $1$ or $0$ if the corresponding feature holds that category or not.
pandas
helps us to convert our categorical data into One-Hot vectors using the get_dummies
function. In statistics, a dummy variable is nothing but a variable that takes a value $1$ or $0$ to indiciate the presence or absense of some literal.
In [27]:
dfoh = pd.get_dummies(df, prefix_sep = ' ')
dfoh.sample(nsamples)
Out[27]:
Such encoding schemes often tend to increase the number of features by a huge amount, thus adding a computation overhead on our learning models. In our case, these columns increased almost three times the initial number of columns (37 to 125).
NOTE: Given that we have a multi-class classifier to-be-built, performing a One-Hot encoding now doesn't seem to be the best time. We'll be performing it over our data set, later. We use our Label encoded DataFrame dfle
.
Assuming our data set, one possibly cannot compare say fibre color to plant height, that would be comparing apples to oranges. Our data set consists of incomparable features and hence must be squashed to some uniform range. We call this stage as Feature Scaling.
We can define our normalized data set within the range $[0, 1]$ as: $$X := \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X$ is our feature vector.
In [28]:
columns = list(df)
features = [column for column in columns if column != 'Group']
dfstd = dfle.copy()
dfstd[features] = (dfle[features] - dfle[features].min()) / (dfle[features].max() - dfle[features].min())
dfstd.sample(nsamples)
Out[28]:
Our learning models require a unit that can quantify the overall performance of our model on our data set. Our training phase relies heavily on our data set and on our defined parameters (for parameteric models only) in order to build itself as a successful predictive model. However, using the entire data set would mean that our goal seems to be fitting our model to our data set alone, which was never the objective in the first place. Exhaustively utilizing the data set and tweaking parameters would possibly result in an ideal model and not a predictiion model. We call this as overfitting, an ML engineer's worst nightmare.
One good way to avoid this would be to split our data set into two parts of some defined ratio - a training set and a testing set. Random tuples from the data set can be added into our testing set that can be used to estimate how well our model performs for our data set. scikit-learn
has a function called train_test_split
under the model_selection
module (in scikit-learn 0.17
and before, the funciton is under the cross_validation
module) that can divide a data set into a desired ratio of training and testing sets.
In [33]:
from sklearn.model_selection import train_test_split
Our data set consists of multi-class target variables that are imbalanced by proportion within the data set. scikit-learn
's train_test_split
function has a parameter stratify
that divides our data set with the same proportion of our target variables as it was in the complete data set. (this is generally used for multi-class variables only)
We use the Pareto Principle (80-20 rule of thumb), i.e, we divide our data set in the ratio of $80:20$ training/testing split.
In [34]:
columns = list(df)
labels = [column for column in columns if column.startswith('Group')]
features = [column for column in columns if column not in labels]
target = df[labels]
train, test = train_test_split(dfstd, stratify = target)
train = dfstd
train.sample(nsamples)
Out[34]:
Nevertheless, the clutches of overfitting still cling on to us. During our training phase, our parameters are tweaked in order to fit on our training set. In someway, a leak of information of the testing set occurs during our training set, thereby leading to an overfit. In order to avoid this, we could go one step further by creating yet another set called as a validation set that can be used to estimate our model's performance after our training phase, then moving towards our testing phase.
Now, a split of our data set into three seperate entities (training, testing and a validation set) might decrease the the number of required training tuples, thereby limiting the overall performance of our learning model.
This brings us to the concept of Cross Validation (CV).
Cross Validation helps us to answer the question, "how well does my learning model work on any independent data set?"
We divide our data set into $K$ blocks. Then for each $i^{th}$ block (where $i = 1,2,...,K$), we assume the remaning $K - 1$ blocks to be our training set and $i$ to be our validation set for that given iteration. Hence, each block shall have $\frac{u}{K}$ observations where $u$ is the total number of observations in our training set.
A cumulative error (called as the Cross Validation Error) is calculated based on the errors recieved at each iteration from our validation set. Hence for any given model, our goal would be to minimize our Cross Validation Error.
An ideal Cross Validation approach would be to divide our data set into $u$ blocks, i.e, $K = u$. We call this, Leave One Out Cross Validation (LOOCV). For a data set with very limited observations, such an approach would be favourable but however in cases where we have millions of observations, an LOOCV would be computationally expensive. Typically, $K = 10$ (also known as 10-fold Cross Validation).
scikit-learn
has an extensive number of classes designated for various Cross Validation methods in its model_selection
module.
Given that our problem is a multi-class classification problem, our data set seems to hold an imbalanced number of samples for each given class. In order to ensure that for each fold, the relative frequencies of each class are preserved for our training and validation sets, scikit-learn
has a decent class called StratifiedKFold
which helps us to preserve the percentage of each sample for each class.
In [35]:
from sklearn.model_selection import StratifiedKFold
Being one of the most simplest of all machine learning algorithms, k-Nearest Neighbors (in short, k-NN) is a no-brainer. k-NN is a non-parametric approach which in simpler terms means that its performance solely depends on our data set. This encourages a more flexible model that can distinguish boundaries even better.
Why so?
Remember in the case of our Linear Regressor, a single tweak in any one of the parameters in our parameter set led our entire boundary (lines or hyperplanes) to change with respect to our $x$'s and $y$'s. Imagine a function $f(x)$ which shows a linear relationship, followed by constant relationship and eventually rises exponentially. Clearly fitting a parametric model for a such a function would be herculean; which is why we have non-parametric models that aim to fit themselves within each of these local relationships that exists within a data set thereby having a high dependency on it. Also, this raises yet another criteria for non-parametric models that they possess a high dependency on a data set of sufficient size.
k-Nearest Neighbors can be defined in a layman's terms as follows:
find k distinct points within the feature space that is closest (similar) to our point in space.
Hence, our parameter in terms of k-NN is $k$ itself, call it the confidence parameter. An increase in k naturally increases its accuracy but also its computation. Also, a very large k clearly may overfit the data, hence there should exist some optimal $k$. In order to know the distance between points, a distance metric is required (generally Eucliedian). k-NN also relies on the distance metric used.
Assume $X_{query}$ to be our query vector and $X_{1}, X_{2}, ..., X_{u}$ to be feature vectors within our training corpus. Hence, our nearest neighbor point $X_{NN_{1}}$ would be the one which has the minimum distance between our query vector and any point within our corpus.
Hence, $$X_{NN_{1}} = distance_{min}(X_{query}, X_{i})$$ where $i$ is within the range $[1, u]$.
sklearn
has a neat classifier class for the k-NN algorithm under the neighbors
module. Let's go ahead and import it.
In [36]:
from sklearn.neighbors import KNeighborsClassifier
In [37]:
knn = KNeighborsClassifier()
Let's divide our training/testing sets into features and label sets.
In [38]:
Xtrain, ytrain, \
Xtest , ytest = train[features], train[labels], \
test[features], test[labels]
In [39]:
Xtrain.head(nsamples)
Out[39]:
In [47]:
ytrain.head(nsamples)
Out[47]:
Let's estimate accuracy of our prediction. For this, we'll be using sklearn
's accuracy_score
function under the metrics
module.
In [48]:
from sklearn.metrics import accuracy_score
We can now evaluate the cross-validation error while training as follows:
In [49]:
def stratified_cross_validate(Xtrain, ytrain, class_, nfolds = 10, shuffle = False, **kwargs):
classifier = class_(**kwargs)
validator = StratifiedKFold(n_splits = nfolds,
shuffle = shuffle)
yhat = ytrain.copy()
# sklearn tweak
# http://stackoverflow.com/questions/35022463/stratifiedkfold-indexerror-too-many-indices-for-array
npXtrain = Xtrain.as_matrix() # numpy array
npytrain = ytrain.as_matrix() # numpy array
rows, cols = ytrain.shape
npytrain = npytrain.reshape((rows,))
errorsum = 0.0
for ii, jj in validator.split(npXtrain, npytrain):
trainX, testX, \
trainy, testy = npXtrain[ii], npXtrain[jj], \
npytrain[ii], npytrain[jj]
classifier.fit(trainX, trainy)
ypredict = classifier.predict(testX)
errorsum += accuracy_score(testy, ypredict)
avg = errorsum / nfolds
return avg
In [50]:
accuracy = stratified_cross_validate(Xtrain, ytrain, KNeighborsClassifier)
In [53]:
print('Accuracy 3-NN: ' + str(accuracy * 100) + '%')
In [94]:
maxneighbors = 30
nfolds = 10
nneighbors = list(range(1, maxneighbors + 1))
accuracies = [stratified_cross_validate(Xtrain, ytrain, KNeighborsClassifier, nfolds = nfolds,
n_neighbors = n) * 100 for n in nneighbors]
plt.plot(nneighbors, accuracies)
plt.xticks(nneighbors)
plt.xlabel('$k$')
plt.ylabel('$accuracy$')
plt.show()
In [95]:
from prettytable import PrettyTable
In [96]:
table = PrettyTable(['$k$', '$Accuracy$'])
for i in range(1, maxneighbors + 1):
table.add_row([nneighbors[i - 1], accuracies[i - 1]])
In [124]:
from IPython.core.display import display, HTML
In [109]:
display(HTML(table.get_html_string()))
In [110]:
np.amax(accuracies)
Out[110]:
Our classifier seems to predict best at $k = 13$ with an accuracy $= 65.25\%$.
In [117]:
maxneighbors = 30
low, high = 1, 12
nfolds = np.multiply(5, np.arange(low, high + 5))
accuracies = [stratified_cross_validate(Xtrain, ytrain, KNeighborsClassifier,
nfolds = n) * 100 for n in nfolds]
plt.plot(nfolds, accuracies)
plt.xticks(nfolds)
plt.xlabel('$k$')
plt.ylabel('$accuracy$')
plt.show()
In [123]:
table = PrettyTable(['$k$', '$Accuracy$'])
for i in range(len(nfolds)):
table.add_row([nfolds[i], accuracies[i]])
display(HTML(table.get_html_string()))
In [125]:
from sklearn import svm
In [126]:
accuracy = stratified_cross_validate(Xtrain, ytrain, svm.SVC)
In [127]:
print('Accuracy SVM (Kernel = Radial Basis Function, Penaly Parameter = 1): ' + str(accuracy))
In this paper [24], we find a recommended way of tweaking our parameters for an optimal SVM Classifier.
In [134]:
'''
trade-off between decision boundary and number of correctly classified points,
larger C, better classification.
'''
low, high = -3, 12
penalties = np.power(2.0, np.arange(low, high))
accuracies = [stratified_cross_validate(Xtrain, ytrain, svm.SVC, C = c) * 100 for c in penalties]
plt.plot(np.log2(penalties), accuracies)
plt.xticks(np.log2(penalties))
plt.xlabel('$\log_{2}(C)$')
plt.ylabel('$accuracy$')
plt.show()
In [136]:
table = PrettyTable(['$log_{2}C$', '$Accuracy$'])
for i in range(len(penalties)):
table.add_row([np.log2(penalties[i]), accuracies[i]])
display(HTML(table.get_html_string()))
In [ ]:
We see that the accuracy reaches its maximum at $\log_{2}{C} = 5$ $\approx 75\%$.
In [137]:
from sklearn.neural_network import MLPClassifier
In [138]:
accuracy = stratified_cross_validate(Xtrain, ytrain, MLPClassifier)
In [139]:
print('Accuracy ANN: ' + str(accuracy))
In [140]:
low, high = 100, 150
hsizes = [i for i in range(low, high + 1)]
accuracies = [stratified_cross_validate(Xtrain, ytrain, MLPClassifier,
hidden_layer_sizes = (size,)) * 100 for size in hsizes]
plt.plot(hsizes, accuracies)
plt.xlabel('hidden neurons')
plt.ylabel('accuracy')
plt.show()
A binary classifier is one which can seperate data into two kinds (yes or no, apple or oranges). However, the problem posed to us is a multi-class classification problem. k-NN and ANN acts as a multi-class classifier but not SVMs easily. So, how do we go about building classifiers that can predict a class within a multi-class set?
We've dived into a lot of how a machine learns, but let's focus on what really does a machine think. After a good trainng, a supervised learning model predicts outcomes we need answers for, and that's the whole point. A huge research and exploration is happening in the age of our new world in the field of Image Captioning. But that simply limits itself to what a machine sees. Could we explore the idea of what a machine believes about the image?
Google Brain team has gifted a boon to the Deep Learning community by Open Sourcing its fast Machine Learning Library - TensorFlow. Similar to an sklearn
's Pipeline
, TensorFlow creates a graph wherein each node is nothing but a compuational unit.
Assume that the only job you do for the world is provide the output for $a = b + c$. An incoming edge provides you the values for $b$ and $c$ and all you do is output the value for $a$, passing it to all your outgoing edges. Such independent nodes increases the overall modularity of our architecture. (Remember, if you were to give a developer a choice between modularity and holism, he'd always choose modularity. He'd adore its simplicity, curse its implementation.)
Data is fed into our architecture as one giant multi-dimensional array (or Tensor, similary to numpy
's ndarray
). You could consider the architecture built using TensorFlow nothing more than a Data Flow Graph, with Data Transformations happening everywhere across the graph both - in serial as well as in parallel. Hence the name - TensorFlow. To understand things more clearly, I've attached a GIF image taken from the TensorFlow website itself.
Let's go ahead and import our TensorFlow library.
In [ ]:
import tensorflow as tf
Natural Language Processing is concerned with problems that involve textual data. This is not to be confused with Speech Processing (a subdomain of NLP), since speech processing clearly deals with converting a speech signal into textual data. Given that there exists models that can learn textual data, how do we ensure that it is capable of learning a language that is independent per se of the language? Also, what makes one word similar or different from the other besides its meaning?
Natural languages have not less than a million words, each exhibiting some degree of uniqueness. Wrapping around a learning model that can compare a million words with each other sounds like a very high-dimensional space problem. One needs to consider say, the position of the alphabets, singular v/s plural words, words that are solely verbs, words denoting a tense (past, present and future), etc.
A nice way to encode all the relavant information for a given word is to represent it in the form of a vector.
It's well known that if you were to flip a coin (a classical example to introducing probability theory), there is a 50%-50% chance for it to fall with a face either being a head or tails. Although not necessary, probability is always concerned with events that have some degree of randomness. Let me put forward this in a better way - Probability is always concerned with events that have some degree of uncertainity. So, what makes an even $E$ as such to make us conclude that given intuition?
We denote the probability of an event $E$ as $\mathbb{P}(E)$, where $\mathbb{P}(E)$ is within the range $[0, 1]$. In the case above, there are a maximum of two events occuring (the face side could be either heads or tails) whereas there is just some condition depending on the event $E$ (in this case, the number of times a tail occures) alone. Hence,
if $a$ is the number of possibilities based on our condition and $b$ be the total number of possibilities,
\begin{align*} \mathbb{P}(E) &= \frac{a}{b} \\ \mathbb{P}(head) = \mathbb{P}(tail) &= \frac{1}{2} \\ &= 0.5 \end{align*}Does the event $E = head$ or $E = tail$ depend on its previous, subsequent or any other event? Of course not. In this case, $E$ is said to be an independent event. But what about cases where the probability of an event is based on the probability of some other events?
For instance, we know that during a stormy night - a sharp lightning is always followed by loud thunder. So, what would be the probability of a thunder to occur if the probability of lightning is given? We call this event (thunder) as a dependent event on an independent event (lightning) and its corresponding probability as conditional probability denoted by $\mathbb{P}(thunder \mid lightning)$.
In his attempt to prove the existence of God (proven as 67% using Bayes' Theorem [20]), Bayes formulated the well-known theorem in a large number of fields which solves events relying on conditional probabilities. Given that the event $A$ is a dependent event on an independent event $B$, the probability of $A$ to occur would be: $$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \times \mathbb{P}(A)}{\mathbb{P}(B)}$$ NOTE: The above equation is not commutative, i.e $\mathbb{P}(A \mid B) \ne \mathbb{P}(B \mid A)$
Let's assume a sentence $S$ is an ordered set consisting of $n$ words in sequence. Let us consider a sentence, say "yes we can". In this case, $S$ would naturally be $\{yes, we, can\}$.
The goal of our language model is to compute the probability of the possibility that $yes$ is followed by $we$, followed by $can$. Hence, for any given sentence $S = \{w_{1}, w_{2}, \dots, w_{n}\}$, our language model outputs $\mathbb{P}(w_{1}, w_{2}, \dots, w_{n})$ where $\mathbb{P}(w_{i})$ is the probability of the occurence of word $w_{i}$ in that sentence.
Naturally, our intuition says that the probability that "can" comes after "we" is dependent on the conditional probability of "we" and "yes". Hence, $\mathbb{P}(can \mid yes, we)$.
Therefore, the overall probability of a sentence can be denoted as: \begin{align*} \mathbb{P}(S) &= \mathbb{P}(w_{1}) \times \mathbb{P}(w_{2} \mid w_{1}) \times \mathbb{P}(w_{3} \mid w_{1}, w_{2}) \times \dots \times \mathbb{P}(w_{n} \mid w_{1}, w_{2}, \dots, w_{n - 1}) \\ &= \prod_{i = 1}^{n} \mathbb{P}(w_{i} \mid w_{1}, \dots, w_{i - 1}) \end{align*}
Hmm. Our first question would be what if $n$ is exponentially large? I mean, given that we're creating a machine that is able to simulate the literature style of Marcel Proust's In Search of Lost Time (1,267,069 words) along with Shakespearen literature (884,647 words) [21] and many more corpora of literature, wouldn't predicting the probabilities of each word be extremely time-consuming as denoted in the above equation? To complicate things more, I'll let you know that each word is fed into our language model as a sparse matrix. Surely the search for a better algorithm lesser than $O(n^3)$ is still open, but such an approach is still not feasible for our model as of now. And then, enters Andrey Markov.
"The future is independent of the past given the present."
In order to define Markov Modelling in its purest form is no better than the statement said above. Even mathematically, this can be defined as, \begin{align*} \mathbb{P}(S) &= \mathbb{P}(w_{1}) \times \mathbb{P}(w_{2} \mid w_{1}) \times \mathbb{P}(w_{3} \mid w_{1}, w_{2}) \times \dots \times \mathbb{P}(w_{N} \mid w_{1}, w_{2}, \dots, w_{n - 1}) \\ &= \prod_{i = 1}^{N} \mathbb{P}(w_{i} \mid w_{1}, \dots, w_{i - 1}) \end{align*}
where $N < n$
What we'd like to say is that the probability of the current word to occur does not necessarily depend on the probabilities of all words occured prior to it, rather it depends only to a limited history or a few words behind. When $n$ is extremely large, make sure $N << n.$
Based on the value for $N$, we have what is known in Natural Language Processing as N-gram modelling
Let's say,
For $N = 2$ or a bigram model, the probability of the word can to occur would be $\mathbb{P}(can \mid we)$
Similary,
For $N = 3$ or a tigram model, the probability of the word can to occur would be $\mathbb{P}(can \mid we, yes)$
Consider $N$ to be a window size limiting our history for previous words. Naturally, it's this $N$ that achieves the context of our sentence. For instance, when I say the word "yes", what do you think should follow? Sure there are a wide range of possibilities. But now, what if I were to say the word can? Aha! You're now limiting your context to something you've heard before and if you're pro-Democratic, you're most likely to guess the third word. Of course, the accuracy of the sentence increases with an increase in $N$ but the world always works with some trade-off (especially in the case of Machine Learning, take the classic bias v/s variance trade-off for instance). Anyway, thanks Markov!
Probability estimation in case of our to-be-built language model highly depends on the genre and style of literature corpora we'll be using. It's highly unlikely that the word thou appears beyond Shakespearen literature (or the Bible). So we must ensure that the probability densities of words are distributed well which will help our model learn better. As the saying goes, "Choose your sample space well!"
Consider the case of Feedforward Neural Networks. Each sample from the training set passed through the network thereby resulting a desired output. In no case did the next input to be fed showed any dependency with its previous sample. But what about training samples that exhibit relationships between their inputs? At the same time, what about a training sample that denotes an output dependent on not only the current sample but also its previous training samples? One of the answers to this are languages; and here is where Recurrent Nerual Networks comes to our rescue.
A Recurrent Neural Network (hereby, RNN) is a kind of an Artificial Neural Network that considers the question at which time-step did you feed your input? This encourages us to use RNNs for modelling sequences.
A vanilla Recurrent Neural Network's (hereby, RNN) architecture can be defined as follows:
\begin{align*} h_{t} &= \theta\hspace{2mm}\phi(h_{t-1}) + \theta_{x}x_{t} \\ y_{t} &= \theta_{y}\phi(h_{t}) \end{align*}where $x_{t}$ is the input vector at time-step $t$, $h_t$ and $h_{t-1}$ denote the state of the hidden layer at time-step $t$ and $t-1$ respectively, $y_{t}$ denotes the output vector at time $t$, $\theta_{x}$ denotes the parameter (weight) used to condition the input vector $x$ at time-step $t$, $\theta_{h_{t-1}}$ denotes the parameter (weight) used to condition the hidden layer $h$ at time-step $t-1$, $\theta_{y}$ denotes the parameter (weight) used to condition the hidden layer $h$ at time-step $t$ and $\phi(x)$ denotes the activation function (typically, a non-linear function). Also, $h_{0}$ would be the initialization vector at time-step $t = 1$.
A nice way to look at the architecture is the "time-step dependency" relationship between $h_{t}$ and $h_{t-1}$ it exhibits. Here, $h_{t-1}$ learns to pass the state information to $h_{t}$. Notice that parameters are shared over each time-step. (In the case of Convolutional Neural Networks, parameters are shared over space)
In order to estimate the error recieved from the RNN, we'd have to consider the sum of error (our function to be minimized) derivatives with respect to $\theta$ generated at each time-step. Hence,
\begin{align*} \frac{\partial E}{\partial \theta} &= \sum_{t = 1}^{S} \frac{\partial E_{t}}{\partial \theta} \\ \frac{\partial E_{t}}{\partial \theta} &= \sum_{k = 1}^{t} \frac{\partial E_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{k}} \frac{\partial h_{k}}{\partial \theta_{t}} \end{align*}where $S$ is the number of time-steps. Notice how the last two partial derivative factors at each time-step also relies on the previous $k$ time-steps. The term $\frac{\partial h_{t}}{\partial h_{k}}$ denotes the partial derivative of the hidden layer at time-step $t$ with respect to the partial derivative of all the time-steps previous to it and can be given as:
\begin{align*} \frac{\partial h_{t}}{\partial h_{k}} &= \prod_{i = k + 1}^{t} \frac{\partial h_{i}}{\partial h_{i - 1}} \\ &= \prod_{i = k + 1}^{t} \frac{\partial (\theta\phi(h_{i-1}) + \theta_{x}x_{i})}{\partial h_{i - 1}} \\ &= \prod_{i = k + 1}^{t} \theta^{T} diag[\phi'(h_{i - 1})] \end{align*}In order to help RNNs memorize the previous states, we need to build a state that can WRITE our input into memory (our state), READ the output from our memory, and ERASE the information associated with respect to any state. In short, our RNN requires a memory unit and we call it - a Long Short-Term Memory (LSTM) cell.
Initally proposed by Sepp Hochreiter and Jürgen Schmidhuber in their paper Long Short-Term Memory [22], an LSTM is nothing but a gate-like memory unit. Say we'd like to write our input into memory, then multiplying our input by a binary factor could help us open or close our gate for writing our input into memory. A similar case can be derived for reading and erasing as well. But how about using a continuous factor within a range (say $[0, 1]$) that determines whether the input needs to to written to memory or not? A continuous factor helps us to differentiate and thereby back-propagate over time (chaining backwards). Our continuous factor could then be a logistic unit that either opens / close gates into / from memory. Such a gating technique helps us to keep only necessary information into the cell.
Let's look at the architecture of our LSTM cell:
LSTMs help us to back propagate through time for very long time-sequences, thereby solving the vanishing and exploding gradient problem. This helps us to increase our context window as much wide as we need.
TensorFlow has a dedicated module for various RNN cells under the python.ops.rnn_cell
module.
In [1]:
from tensorflow.python.ops import rnn_cell
Remember, our data set requires to have a great distribution of words. At the same time, our model must be capable of learning literature styles, something that can relate to our users. What makes one find a sentence fascinating? I'd say quotations.
Given an array of 6 Ultrasonic sensors mounted on a Remote Controlled Car, the objective is to train an Artificial Neural Network (RNN/LSTM) that is able to predict the next possible step of a path destined to parallelly park the car by using the values of the 6 sensors.
Searching an appropriate parking space on roads where the width of the road is comparatively small is in itself a task. Moreover, a large amount of time is consumed in parking the vehicle in a desired orientation within the given constraint space.
Our feature vector consists of 6 ultrasonic sensor (HC-SR04) values. Each of our sensor initially discharges a trigger pulse that awaits to recieve an echo pulse back. Here, the pulse width is directly proportional to the distance between the vehicle and the object. Our target vector consists of two variables - direction and rotation of the wheel.
Our remote controlled car is assumed to remain at a constant speed. Two motors of 100 RPM (rotations per minute) each are dedicated
[1] Titanic: Machine Learning from Disaster: Predict one's survival on the Titanic
Titanic: Machine Learning from Disaster: Predict one's survival on the Titanic
We begin moving towards our first Kaggle problem, somewhat a "Hello, World!" substitute for Data Science. Before even begining to articulate a well-defined problem statement, I suggest you try to flush out everything you know about the Titanic, including the movie. Forget all observations and inferences you made or can make if you heard the word - Titanic. Let's assume we're given no other information except the title stated above and a training and testing data set.
Q. "Why? I don't get it."
Whenever you aim towards attacking any problem in Data Science, remind yourself that you're nothing more than just a Data Scientist or Machine Learning Engineer (in the making). Our expertise is limited within the boundaries of our domain and any kind of preconcieved notions or knowledge about our data set tends to restrict our observations towards that direction. Many a times, data sets exhibit relations that goes unnoticed by the domain it belongs to. And that is what data analysis is for. Below is a great talk on Data Agnosticism: Feature Engineering Without Domain Expertise by Nicolas Kridler.
One of the main takeaway from the talk is that let the machine make sense of what it takes.
Let's load our data set into Pandas' DataFrame Objects. I've created a function called load_titanic()
that returns a dictionary holding 2 data frames, one for our training set while the other for our testing set.
In [ ]:
def load_titanic():
path = {
'train': './data/titanic/train.csv',
'test' : './data/titanic/test.csv'
}
df = {
'train': pd.read_csv(path['train']),
'test' : pd.read_csv(path['test'])
}
return df
Let's view our training set for now.
[1] Prechelt, Lutz. An Empirical Comparison of C, C , Java, Perl, Python, Rexx, and Tcl for a Search, String Processing Program. Karlsruhe: U, Fak. Für Informatik, 2000.
[2] Pomerleau, Dean A. ALVINN, an Autonomous Land Vehicle in a Neural Network. Pittsburgh, PA: Carnegie Mellon U, Computer Science Dept., 1989.
[3] Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day - The Verge
[4] Stephen Hawking: Artificial intelligence could wipe out humanity when it gets too clever as humans will be like ants
[5] UCI Machine Learning Repository - Housing Data Set
[6] Lecture Notes 3 on Deep Learning for NLP - CS Stanford
[7] Lecun, Yann, Leon Bottou, Genevieve B. Orr, and Klaus Robert Müller. Efficient BackProp. 1998
[8] Comprehensive list of activation functions in neural networks with pros/cons
[9] Why should weights of Neural Networks be initialized to random numbers?
[10] A list of cost functions used in neural networks, alongside applications
[11] YouTube Lecture on Deep Learning Lecture 12: Recurrent Neural Nets and LSTMs by Nando de Freitas
[12] Data Preprocessing
[13] Varietal Identification And Seed Vigour Assessment In Cotton - Dr. Harish S.
[15] PyData Talk on Machine learning with imbalanced data sets - Natalie Hockham
[16] Recent Advances in missing Data Methods: Imputation and Weighting - Elizabeth Stuart
[17] PyData Talk on Deploying Machine Learning using sklearn pipelines - Kevin Goetsch
[18] The Generalized Pair Plot
[20] Odds on that God exists, says scientist - The Guardian
[21] Shakespeare FAQ - Folger
[22] Sepp Hochreiter, Jürgen Schmidhuber, "Long short-term memory", Neural Computation, 1997.
[24] Isabelle Guyon, Andre Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3, 2003.
[25] A Practical Guide to Support Vector Classification