Classifying Handwritten Digits
Logistic Regression


In this notebook we load an interactive PySpark shell and run our Logistic Regression application using our accelerated ML library. The library is written in Python and supports standard learning algorithms for common settings such as classification and regression. We can choose between an accelerated execution that uses both software and hardware and a non-accelerated one that uses only the CPU cores. When the accelerated option is chosen, the accelerator's library (also written in Python) is invoked: the input data is stored in memory-mapped buffers and then transferred to and processed in the PL (programmable logic). All communication with the PL goes through an AXI4-Stream Accelerator Adapter.

1. Data Sets

The data are taken from the famous MNIST dataset.

The original data file contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

In this example, the data have already been preprocessed (normalized) using feature standardization (Z-score scaling).
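As a rough illustration of Z-score scaling (not the preprocessing script used to produce these files, which is not part of this notebook), each feature column is shifted to zero mean and scaled to unit standard deviation. The helper name `standardize_columns` and the treatment of constant columns are assumptions made for this sketch.

```python
import math

def standardize_columns(rows):
    """Z-score every column of a list of equal-length numeric rows."""
    n = len(rows)
    num_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(num_cols)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(num_cols)
    ]
    # Assumption: a constant column (std == 0), such as an always-blank
    # border pixel, is mapped to 0.0 instead of dividing by zero.
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
         for j in range(num_cols)]
        for r in rows
    ]

# Tiny made-up example: 4 "images" with 3 pixels each
pixels = [[0, 255, 7], [0, 0, 7], [0, 255, 7], [0, 0, 7]]
scaled = standardize_columns(pixels)
print(scaled[0])
```

After scaling, the middle column has zero mean and unit standard deviation, while the two constant columns become all zeros.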

The (train and test) data sets that are used below have 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the (rescaled) pixel-values of the associated image.
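The column layout described above can be sketched as a small parser that splits one text record into a label and 784 feature values. The function name `parse_record` is hypothetical, and the comma delimiter is an assumption: the actual field separator of the .dat files is not stated in this notebook.

```python
NUM_FEATURES = 784  # 28 x 28 pixels

def parse_record(line, sep=","):
    """Split one text record into (label, features).

    Assumption: fields are separated by `sep`; adjust it to match the
    real file format.
    """
    fields = line.strip().split(sep)
    if len(fields) != NUM_FEATURES + 1:
        raise ValueError(
            "expected %d columns, got %d" % (NUM_FEATURES + 1, len(fields))
        )
    label = int(float(fields[0]))              # first column: the digit
    features = [float(v) for v in fields[1:]]  # remaining 784 columns
    return label, features

# Made-up record: label 7 followed by 784 zero-valued pixels
example = ",".join(["7"] + ["0.0"] * NUM_FEATURES)
label, features = parse_record(example)
print(label, len(features))
```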

2. PySpark initialization

In this section we initialize PySpark to predefine the SparkContext variable. $SPARK_HOME and other needed environment variables are set in the /etc/environment file.

Make sure you have correctly set all needed paths and variables and that Py4J matches the version you have installed.

In [1]:
import sys, os

spark_home = os.environ.get("SPARK_HOME", None)
if spark_home is None:
    raise ValueError("SPARK_HOME environment variable is not set")

# Add the Spark python sub-directory to the path
sys.path.insert(0, os.path.join(spark_home, "python"))

# Add Py4J to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.4-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
filename = os.path.join(spark_home, "python/pyspark/shell.py")
exec(open(filename).read())

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1-SNAPSHOT

Using Python version 3.4.3+ (default, Oct 14 2015 16:03:50)
SparkSession available as 'spark'.

3. Logistic Regression Application

This example shows how our accelerated Logistic Regression library is called to train an LR model on the train set and then test its accuracy on the test set. If accel is set (accel = 1), the hardware accelerator is used to compute the gradients in each iteration.
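The notebook does not show how the library computes the gradients, so the following is only a minimal software sketch of one gradient-descent iteration. It assumes one-vs-rest logistic regression with a sigmoid per class and a flat, class-major weight layout (weights[k * numFeatures + j], the same indexing used when the weights are written to a file in the training cell). The function names and the averaging over the number of samples are assumptions, not the library's API.

```python
import math

def sigmoid(z):
    # Split by sign to avoid overflow in exp() for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradients(weights, samples, num_classes, num_features):
    """Sum of per-sample gradients; this is the part that the
    accelerated mode would offload to hardware."""
    grads = [0.0] * (num_classes * num_features)
    for label, x in samples:
        for k in range(num_classes):
            base = k * num_features
            z = sum(weights[base + j] * x[j] for j in range(num_features))
            err = sigmoid(z) - (1.0 if label == k else 0.0)
            for j in range(num_features):
                grads[base + j] += err * x[j]
    return grads

def gradient_descent_step(weights, samples, num_classes, num_features, alpha):
    """One iteration: w <- w - alpha * (mean gradient). Averaging is an assumption."""
    grads = gradients(weights, samples, num_classes, num_features)
    n = len(samples)
    return [w - alpha * g / n for w, g in zip(weights, grads)]

# Toy problem: 2 classes, 2 features, weights initialized to zero
toy_samples = [(0, [1.0, 0.0]), (1, [0.0, 1.0])]
toy_weights = [0.0] * 4
toy_grads = gradients(toy_weights, toy_samples, 2, 2)
new_weights = gradient_descent_step(toy_weights, toy_samples, 2, 2, alpha=0.25)
print(new_weights)
```

In this toy run each class weight moves toward the feature that identifies its class and away from the other one, which is the behavior repeated over all 10 classes and 784 features in the real application.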

Read data & parameters

The size of the train set and the number of iterations are intentionally kept small to avoid long execution times in the SW-only case.

In [2]:
chunkSize = 4000
alpha = 0.25
iterations = 5

train_file = "data/MNIST_train.dat"
test_file = "data/MNIST_test.dat"

sc.appName = "Python Logistic Regression"

print("* LogisticRegression Application *")
print(" # train file:               " + train_file)
print(" # test file:                " + test_file)

* LogisticRegression Application *
 # train file:               data/MNIST_train.dat
 # test file:                data/MNIST_test.dat

HW accelerated vs SW-only

In [3]:
accel = int(input("Select mode (0: SW-only, 1: HW accelerated) : "))

Select mode (0: SW-only, 1: HW accelerated) : 1

Instantiate a Logistic Regression model

In [4]:
from pyspark.mllib_accel.classification import LogisticRegression

trainRDD = sc.textFile(train_file).coalesce(1)

numClasses = 10
numFeatures = 784 
LR = LogisticRegression(numClasses, numFeatures)

Train the LR model

In [5]:
weights = LR.train(trainRDD, chunkSize, alpha, iterations, accel)
with open("data/weights.out", "w") as weights_file:
    for k in range(0, numClasses):
        for j in range(0, numFeatures):
            if j == 0:
                weights_file.write(str(round(weights[k * numFeatures + j], 5)))
            else:
                weights_file.write("," + str(round(weights[k * numFeatures + j], 5)))
        # End the row so that each class gets its own line
        weights_file.write("\n")

    * LogisticRegression Training *
     # numSamples:               4000
     # chunkSize:                4000
     # numClasses:               10
     # numFeatures:              784
     # alpha:                    0.25
     # iterations:               5
100% |▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥▥| 5/5 Time: 96.310 sec

Test the LR model

In [6]:
testRDD = sc.textFile(test_file)


    * LogisticRegression Testing *
     # accuracy:                 0.82(1640/2000)
     # true:                     1640
     # false:                    360
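The reported accuracy is the fraction of correctly classified test samples, here 1640 / 2000 = 0.82. As a hedged sketch of how such counts can be obtained in software, the snippet below predicts the class with the highest linear score and tallies hits and misses. The names `predict` and `evaluate` are hypothetical, and the highest-score decision rule is an assumption, since the library's test routine is not shown in this notebook.

```python
def predict(weights, x, num_classes, num_features):
    """Return the class whose weight row gives the highest score for x."""
    scores = [
        sum(weights[k * num_features + j] * x[j] for j in range(num_features))
        for k in range(num_classes)
    ]
    return max(range(num_classes), key=scores.__getitem__)

def evaluate(weights, samples, num_classes, num_features):
    """Return (true, false, accuracy) over a list of (label, features)."""
    true = sum(
        1 for label, x in samples
        if predict(weights, x, num_classes, num_features) == label
    )
    false = len(samples) - true
    return true, false, true / len(samples)

# Toy check: 2 classes, 2 features, hand-picked weights
toy_weights = [1.0, 0.0,   # class 0 responds to feature 0
               0.0, 1.0]   # class 1 responds to feature 1
toy_samples = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (1, [1.0, 0.0])]
true, false, acc = evaluate(toy_weights, toy_samples, 2, 2)
print(true, false, round(acc, 2))
```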

4. Performance metrics

Execution time for different execution scenarios:

Target                 Time
PYNQ SW-only           1483.859 sec
PYNQ HW accelerated    96.310 sec
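The speedup implied by the two measurements above can be computed directly. The variable names are chosen for this illustration only.

```python
sw_only_sec = 1483.859   # PYNQ SW-only
hw_accel_sec = 96.310    # PYNQ HW accelerated

speedup = sw_only_sec / hw_accel_sec
print("speedup: %.1fx" % speedup)   # prints "speedup: 15.4x"
```

For this train set size and iteration count, the hardware-accelerated run is roughly 15 times faster than the SW-only run.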