In this notebook an interactive PySpark shell is loaded and our Logistic Regression application is executed using our accelerated ML library. The accelerated ML library is written in Python and supports standard learning algorithms for common settings such as classification and regression. We can choose between an accelerated execution that uses both software and hardware and a non-accelerated one that uses only the CPU cores. When the accelerated option is chosen, the accelerator's library (also written in Python) is invoked: the input data is stored in memory-mapped buffers and is then transferred to and processed in the PL. All communication with the PL goes through an AXI4-Stream Accelerator Adapter.
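For intuition only, staging data in a memory-mapped buffer on the software side looks roughly like the sketch below; plain numpy.memmap is used as a stand-in for the accelerator library's buffer allocation, which actually provides physically contiguous buffers that the AXI4-Stream Accelerator Adapter can stream into the PL.

import numpy as np

# Generic illustration of staging one chunk of samples in a memory-mapped
# buffer; the real accelerator library allocates contiguous buffers that the
# AXI4-Stream Accelerator Adapter streams into the PL.
chunk = np.random.rand(4000, 785).astype(np.float32)
buf = np.memmap("/tmp/lr_chunk.bin", dtype=np.float32, mode="w+", shape=chunk.shape)
buf[:] = chunk[:]   # copy the chunk into the mapped buffer
buf.flush()         # make the data visible to the consumer of the mapping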
The data are taken from the famous MNIST dataset.
The original data file contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.
In this example the data we use have already been preprocessed/normalized using the feature standardization method (Z-score scaling).
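For reference, Z-score scaling standardizes each feature x to (x - mean) / std using per-feature statistics; a minimal NumPy sketch of the idea (not the actual preprocessing script that produced the data files):

import numpy as np

def z_score(X, eps=1e-8):
    # Standardize each column (pixel) of X to zero mean and unit variance;
    # eps guards against constant, e.g. all-zero, pixels.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X = np.array([[0., 128., 255.],
              [0.,  64., 128.]])
print(z_score(X))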
The train and test data sets used below have 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the (rescaled) pixel values of the associated image.
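Each record can therefore be split into a label and a 784-element feature vector, along these lines (a sketch that assumes one comma-separated sample per line; adjust the delimiter if your copy of the files differs):

def parse_record(line, numFeatures=784, delimiter=","):
    # One sample per line: label first, then the 784 rescaled pixel values.
    values = [float(v) for v in line.strip().split(delimiter)]
    label, features = int(values[0]), values[1:]
    assert len(features) == numFeatures
    return label, features

# e.g. labeled = sc.textFile("data/MNIST_train.dat").map(parse_record)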
In this section we initialize PySpark to predefine the SparkContext variable. $SPARK_HOME and the other required environment variables are set in the /etc/environment file.
Make sure that all required paths and variables are set correctly and that the Py4J version referenced below matches the one you have installed.
In [1]:
import sys, os
spark_home = os.environ.get("SPARK_HOME", None)
# Add the Spark python sub-directory to the path
sys.path.insert(0, os.path.join(spark_home, "python"))
# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.4-src.zip"))
# Initialize PySpark to predefine the SparkContext variable 'sc'
filename = os.path.join(spark_home, "python/pyspark/shell.py")
exec(open(filename).read())
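If the cell above runs successfully, shell.py prints the usual Spark banner and predefines the SparkContext as sc; a quick sanity check:

print(sc.version)   # Spark version of the context created by shell.py
print(sc.master)    # master URL the shell connected to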
This example shows how our accelerated Logistic Regression library is called to train an LR model on the train set and then test its accuracy. If accel is set (accel = 1), the hardware accelerator is used to compute the gradients in each iteration.
The size of the train set and the number of iterations are intentionally kept small to avoid long execution times in the SW-only case.
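For reference, the per-iteration work that the accelerator offloads is the gradient of the logistic loss over each chunk; one common formulation (softmax cross-entropy) is sketched below in plain NumPy as an illustration of the math, not of the library's internal implementation, which may use a different formulation.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def chunk_gradients(X, y, W):
    # X: (n, numFeatures) features, y: (n,) integer labels,
    # W: (numClasses, numFeatures) current weights.
    probs = softmax(X @ W.T)              # (n, numClasses)
    probs[np.arange(len(y)), y] -= 1.0    # probs - one_hot(y)
    return probs.T @ X / len(y)           # (numClasses, numFeatures)

# Each iteration then takes a step: W -= alpha * chunk_gradients(X, y, W)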
In [2]:
chunkSize = 4000
alpha = 0.25
iterations = 5
train_file = "data/MNIST_train.dat"
test_file = "data/MNIST_test.dat"
sc.appName = "Python Logistic Regression"
print("* LogisticRegression Application *")
print(" # train file: " + train_file)
print(" # test file: " + test_file)
In [3]:
accel = int(input("Select mode (0: SW-only, 1: HW accelerated) : "))
In [4]:
from pyspark.mllib_accel.classification import LogisticRegression
trainRDD = sc.textFile(train_file).coalesce(1)
numClasses = 10
numFeatures = 784
LR = LogisticRegression(numClasses, numFeatures)
In [5]:
weights = LR.train(trainRDD, chunkSize, alpha, iterations, accel)
with open("data/weights.out", "w") as weights_file:
    for k in range(0, numClasses):
        for j in range(0, numFeatures):
            if j == 0:
                weights_file.write(str(round(weights[k * numFeatures + j], 5)))
            else:
                weights_file.write("," + str(round(weights[k * numFeatures + j], 5)))
        weights_file.write("\n")
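The file written above contains one comma-separated row of numFeatures weights per class, so the trained model can be reloaded for inspection, for example:

import numpy as np

W = np.loadtxt("data/weights.out", delimiter=",")   # one row per class
print(W.shape)   # expected: (numClasses, numFeatures), i.e. (10, 784)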
In [6]:
testRDD = sc.textFile(test_file)
LR.test(testRDD)
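LR.test reports the accuracy of the trained model on the test set; conceptually, each test sample is assigned the class whose weight vector yields the highest score, along the lines of this sketch (an illustration, not the library's internal code):

import numpy as np

def predict(x, W):
    # x: (784,) standardized pixel values, W: (10, 784) trained weights.
    return int(np.argmax(W @ x))   # class with the highest score

# accuracy = fraction of test samples whose predicted class matches the label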