Optical Character Recognition, or OCR, is a technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. In this section we will explore some of the basics.
The MNIST database (http://yann.lecun.com/exdb/mnist/) is a great publicly available dataset for training OCR algorithms. However, we will take a shortcut and use the digits dataset that ships with scikit-learn. We are going to compare a couple of different learners.
Each data point is simply the 8x8 grid of pixel values for one handwritten digit:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import math
import tensorflow as tf
In [2]:
from sklearn import datasets
digits = datasets.load_digits()
print(digits.images.shape)
# Sample image
print(digits.images[0])
In [3]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])
In [4]:
# The data for use in our algorithms
print(digits.data.shape)
print(digits.data[0])
In [5]:
# The target label
print(digits.target)
Let's prepare to run some classification on the digits dataset. As a first step we want to split the data into training and test sets.
In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
In [7]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2')
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
What is our prediction accuracy? How best can we see it?
In [8]:
print(accuracy_score(ytest, ypred))
print(confusion_matrix(ytest, ypred))
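The raw confusion matrix can be hard to read as a block of numbers. As a quick optional aside, here is a minimal sketch that renders it as a heatmap, reusing the ytest and ypred arrays defined above:
In [ ]:
cm = confusion_matrix(ytest, ypred)
plt.figure(figsize=(5, 5))
plt.imshow(cm, interpolation='nearest', cmap='Blues')  # rows = true labels, columns = predictions
plt.colorbar()
plt.xticks(range(10))
plt.yticks(range(10))
plt.xlabel('predicted label')
plt.ylabel('true label')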
Let's visualize how we did. We will mark mistakes in red:
In [9]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])
As you can see, several of the mistakes are digits that even we might get wrong on visual inspection.
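If you want to zoom in on just the failures, here is a minimal optional sketch that pulls out the first few misclassified test images, again using the Xtest, ytest and ypred arrays from above:
In [ ]:
wrong = np.where(ytest != ypred)[0]  # assumes there is at least one mistake
fig, axes = plt.subplots(1, min(10, len(wrong)), figsize=(10, 2))
for ax, idx in zip(np.atleast_1d(axes), wrong):
    ax.imshow(Xtest[idx].reshape(8, 8), cmap='binary')
    ax.set_title('%d as %d' % (ytest[idx], ypred[idx]), fontsize=8)
    ax.set_xticks([])
    ax.set_yticks([])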
In [10]:
from sklearn.svm import SVC # "Support Vector Classifier"
svm = SVC(kernel='linear')
svm.fit(Xtrain, ytrain)
ypred = svm.predict(Xtest)
print(accuracy_score(ytest, ypred))
print(confusion_matrix(ytest, ypred))
Random forests are built on decision trees. Let's talk about decision trees first. Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification. The binary splitting makes this extremely efficient. As always, though, the trick is to ask the right questions. This is where the algorithmic process comes in: in training a decision tree classifier, the algorithm looks at the features and decides which questions (or "splits") contain the most information.
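Before building the forest, you might fit a single decision tree on the same split to see the splitting idea in isolation. This is just an illustrative sketch; a lone tree will typically score below the ensemble we train next:
In [ ]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=2)  # a single tree, for comparison with the forest
tree.fit(Xtrain, ytrain)
print(accuracy_score(ytest, tree.predict(Xtest)))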
In [11]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=999)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print(accuracy_score(ytest, ypred))
print(confusion_matrix(ytest, ypred))
What is TensorFlow? At a 100,000-foot level, it is a computational framework for distributed machine learning. What does that mean? It is a framework with efficient built-in computational constructs (e.g. matrix manipulation, softmax computation) and an expressive graph-based descriptor language that makes it easy to express complex machine learning algorithms.
OCR is a problem that is particularly well suited to neural networks, so let's see how we could use TensorFlow to build a quick model and measure its accuracy. This section is intended to give you a quick flavor of TensorFlow, so don't worry yet if you cannot follow all of it.
We will use stochastic gradient descent (SGD) combined with a one-hidden-layer neural network with 1024 hidden nodes and rectified linear units, nn.relu() (https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#relu).
Reformat into a shape that's more adapted to the models we're going to train:
In [12]:
image_size = 8
num_labels = 10
def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(Xtrain, ytrain)
test_dataset, test_labels = reformat(Xtest, ytest)
print('Training set', train_dataset.shape, train_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
TensorFlow works like this:
with graph.as_default():
    ...
with tf.Session(graph=graph) as session:
    ...
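To make this two-phase pattern concrete, here is a minimal sketch, written against the same pre-1.0 TensorFlow API used in the rest of this section, that first describes a tiny graph and then runs it in a session:
In [ ]:
toy_graph = tf.Graph()
with toy_graph.as_default():
    # Phase 1: describe the computation; nothing executes yet.
    a = tf.constant([[1.0, 2.0]])
    b = tf.constant([[3.0], [4.0]])
    c = tf.matmul(a, b)  # 1x2 times 2x1 -> 1x1
with tf.Session(graph=toy_graph) as session:
    # Phase 2: run the graph and fetch the value of c.
    print(session.run(c))  # [[ 11.]]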
Let's load all the data into TensorFlow and build the computation graph corresponding to our training. We create a Placeholder node which will be fed actual data at every call of session.run().
In [14]:
batch_size = train_dataset.shape[0]
hidden_units = 1024
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time. (Here batch_size is the full training set, so each step is
    # really a full-batch update rather than a minibatch.)
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, dtype=tf.float32)

    # Stage 1 - hidden layer with ReLU activations.
    weights1 = tf.Variable(tf.truncated_normal([image_size * image_size, hidden_units]))
    biases1 = tf.Variable(tf.zeros([hidden_units]))
    hidden1 = tf.nn.relu(tf.matmul(tf_train_dataset, weights1) + biases1)

    # Final stage - output layer and loss.
    weights2 = tf.Variable(tf.truncated_normal([hidden_units, num_labels]))
    biases2 = tf.Variable(tf.zeros([num_labels]))
    logits = tf.matmul(hidden1, weights2) + biases2
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(
        tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, weights1) + biases1),
                  weights2) + biases2)
Let's run it:
In [16]:
num_steps = 3001
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Prepare a dictionary telling the session where to feed the data.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : train_dataset, tf_train_labels : train_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Loss at step %d: %f" % (step, l))
            print("Training accuracy: %.1f%%" % accuracy(predictions, train_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))