Linear Regression Algorithms using Apache SystemML

This notebook shows:

  • Install SystemML Python package and jar file
    • pip
    • SystemML 'Hello World'
  • Example 1: Matrix Multiplication
    • SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output
    • Examine execution plans, and increase data size to observe changed execution plans
  • Load diabetes dataset from scikit-learn
  • Example 2: Implement three different algorithms to train linear regression model
    • Algorithm 1: Linear Regression - Direct Solve (no regularization)
    • Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)
    • Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)
  • Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API
  • Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline like API
  • Uninstall/Clean up SystemML Python package and jar file

This notebook requires SystemML 0.14.0 or above.


In [1]:
!pip show systemml


---
Metadata-Version: 1.1
Name: systemml
Version: 1.0.0
Summary: Apache SystemML is a distributed and declarative machine learning platform.
Home-page: http://systemml.apache.org/
Author: Apache SystemML
Author-email: dev@systemml.apache.org
License: Apache 2.0
Location: /Users/asurve/.pyenv/versions/3.5.0/lib/python3.5/site-packages
Requires: numpy, scipy, pandas, Pillow

Import SystemML API


In [2]:
from systemml import MLContext, dml, dmlFromResource

ml = MLContext(sc)

print("Spark Version: " + sc.version)
print("SystemML Version: " + ml.version())
print("SystemML Build Time: " + ml.buildTime())


Spark Version: 2.0.2
SystemML Version: 1.0.0-SNAPSHOT
SystemML Build Time: 2017-08-09 18:39:30 UTC

In [3]:
ml.execute(dml("""s = 'Hello World!'""").output("s")).get("s")


Out[3]:
u'Hello World!'

Import numpy, sklearn, and define some helper functions


In [4]:
import sys, os, glob, subprocess
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
plt.switch_backend('agg')
    
def printLastLogLines(n):
    # print the last n lines of the most recently modified kernel log file
    fname = max(glob.iglob(os.path.join(os.environ["HOME"], 'logs/notebook/kernel-pyspark-*.log')), key=os.path.getctime)
    print(subprocess.check_output(['tail', '-' + str(n), fname]))

Example 1: Matrix Multiplication

SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output


In [5]:
script = """
    X = rand(rows=$nr, cols=1000, sparsity=0.5)
    A = t(X) %*% X
    s = sum(A)
"""

In [6]:
prog = dml(script).input('$nr', 1e5).output('s')
s = ml.execute(prog).get('s')
print (s)


6259203081.18

Examine execution plans, and increase data size to observe changed execution plans


In [7]:
ml = MLContext(sc)
ml = ml.setStatistics(True)
# re-execute ML program
# printLastLogLines(22)

In [8]:
prog = dml(script).input('$nr', 1e6).output('s')
out = ml.execute(prog).get('s')
print (out)


62606598468.1

In [9]:
ml = MLContext(sc)
ml = ml.setStatistics(False)

Load diabetes dataset from scikit-learn


In [10]:
%matplotlib inline

In [11]:
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20].reshape(-1,1)
diabetes_y_test = diabetes.target[-20:].reshape(-1,1)

plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')


Out[11]:
<matplotlib.collections.PathCollection at 0x1105d0150>

In [12]:
diabetes.data.shape


Out[12]:
(442, 10)

Example 2: Implement three different algorithms to train linear regression model

Algorithm 1: Linear Regression - Direct Solve (no regularization)

Least squares formulation

w* = argmin_w ||Xw - y||^2 = argmin_w (y - Xw)'(y - Xw) = argmin_w (1/2) w'(X'X)w - w'(X'y)

Setting the gradient

dw = (X'X)w - (X'y)

to 0 gives w = (X'X)^(-1) (X'y) = solve(X'X, X'y)
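The closed-form solution can be sanity-checked in plain NumPy before running the DML version below. This is a minimal sketch on small random data (not part of the original notebook), comparing `solve(X'X, X'y)` against NumPy's built-in least-squares solver:

```python
import numpy as np

# Hypothetical small least-squares problem, for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50, 1)

# Direct solve: w = (X'X)^(-1) (X'y)
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_direct, w_lstsq))  # both solvers agree on the minimizer
```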


In [13]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    A = t(X) %*% X
    b = t(X) %*% y
    w = solve(A, b)
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [14]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')
w, bias = ml.execute(prog).get('w','bias')
w = w.toNumPy()

In [15]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='blue', linestyle ='dotted')


Out[15]:
[<matplotlib.lines.Line2D at 0x10e203e50>]

Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)

Algorithm

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute gradient dw.
    Step 3: Compute step size alpha.
    Step 4: Update: w_new = w_old + alpha*dw
}

Gradient formula

dw = r = (X'X)w - (X'y)

Step size formula

Find the step size alpha that minimizes f(w + alpha*r):

alpha = -(r'r) / (r'(X'X)r)


In [16]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    max_iter = 100
    w = matrix(0, rows=ncol(X), cols=1)
    XtX = t(X) %*% X        # loop-invariant, computed once
    Xty = t(X) %*% y
    for(i in 1:max_iter){
        dw = XtX %*% w - Xty
        alpha = -(t(dw) %*% dw) / (t(dw) %*% XtX %*% dw)
        w = w + dw * alpha
    }
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [17]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w').output('bias')
w, bias = ml.execute(prog).get('w', 'bias')
w = w.toNumPy()

In [18]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[18]:
[<matplotlib.lines.Line2D at 0x110711fd0>]

Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)

Problem with gradient descent: Takes very similar directions many times

Solution: Enforce conjugacy

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute gradient dw.
    Step 3: Compute step size alpha.
    Step 4: Compute next direction p by enforcing conjugacy with previous direction.
    Step 5: Update: w_new = w_old + alpha*p
}
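The same loop translates directly to NumPy. A sketch on hypothetical random data (for an X with m columns, CG reaches the exact least-squares solution in at most m iterations in exact arithmetic):

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(80, 5)
y = rng.randn(80, 1)
XtX = X.T @ X
Xty = X.T @ y

w = np.zeros((5, 1))
dw = -Xty                                   # gradient at w = 0
p = -dw                                     # first direction: steepest descent
norm_r2 = float(dw.T @ dw)
for _ in range(20):
    q = XtX @ p
    alpha = norm_r2 / float(p.T @ q)        # minimizes f(w + alpha*p)
    w = w + alpha * p                       # update weights
    dw = dw + alpha * q                     # update gradient
    old_norm_r2, norm_r2 = norm_r2, float(dw.T @ dw)
    p = -dw + (norm_r2 / old_norm_r2) * p   # enforce conjugacy with previous p
    if norm_r2 < 1e-24:                     # early stop to avoid a 0/0 step
        break

w_direct = np.linalg.solve(XtX, Xty)        # closed-form reference
print(float(np.max(np.abs(w - w_direct))))  # near machine precision
```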


In [19]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    m = ncol(X);
    max_iter = 20;
    w = matrix(0, rows=m, cols=1);      # initialize weights to 0
    dw = -t(X) %*% y; p = -dw;          # gradient dw = (X'X)w - (X'y) at w = 0
    norm_r2 = sum(dw ^ 2);
    for(i in 1:max_iter) {
        q = t(X) %*% (X %*% p)
        alpha = norm_r2 / sum(p * q);   # minimizes f(w + alpha*p)
        w = w + alpha * p;              # update weights
        dw = dw + alpha * q;            # update gradient
        old_norm_r2 = norm_r2; norm_r2 = sum(dw ^ 2);
        p = -dw + (norm_r2 / old_norm_r2) * p;  # next direction - conjugate to previous
    }
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [20]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w').output('bias')
w, bias = ml.execute(prog).get('w','bias')
w = w.toNumPy()

In [21]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[21]:
[<matplotlib.lines.Line2D at 0x1107352d0>]

Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API


In [22]:
import os
from subprocess import call

dirName = os.path.join(os.path.expanduser("~"), "scripts")
call(["mkdir", "-p", dirName])
call(["wget", "-N", "-q", "-P", dirName, "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/LinearRegDS.dml"])

scriptName = dirName + "/LinearRegDS.dml"
dml_script = dmlFromResource(scriptName)

prog = dml_script.input(X=diabetes_X_train, y=diabetes_y_train).input('$icpt',1.0).output('beta_out')
w = ml.execute(prog).get('beta_out')
w = w.toNumPy()
bias = w[1]  # with $icpt=1, the last entry of beta_out is the intercept
print(bias)


[ 152.91886229]

In [23]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w[0]*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[23]:
[<matplotlib.lines.Line2D at 0x1109a6090>]

Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline like API

The mllearn API lets a Python programmer invoke SystemML's algorithms through a scikit-learn-like API as well as through Spark's MLPipeline API.


In [24]:
from pyspark.sql import SQLContext
from systemml.mllearn import LinearRegression
sqlCtx = SQLContext(sc)

In [25]:
regr = LinearRegression(sqlCtx)
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)


Out[25]:
lr

In [26]:
predictions = regr.predict(diabetes_X_test)

In [27]:
# Use the trained model to perform prediction
%matplotlib inline
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, predictions, color='black')


Out[27]:
[<matplotlib.lines.Line2D at 0x110d0bc10>]