Linear Regression Algorithms using Apache SystemML

This notebook shows:

  • Install SystemML Python package and jar file
    • pip
    • SystemML 'Hello World'
  • Example 1: Matrix Multiplication
    • SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output
    • Examine execution plans, and increase data size to observe changed execution plans
  • Load diabetes dataset from scikit-learn
  • Example 2: Implement three different algorithms to train linear regression model
    • Algorithm 1: Linear Regression - Direct Solve (no regularization)
    • Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)
    • Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)
  • Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API
  • Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline like API
  • Uninstall/Clean up SystemML Python package and jar file

This notebook requires SystemML 0.14.0 or above.


In [1]:
!pip show systemml


---
Metadata-Version: 1.1
Name: systemml
Version: 1.0.0
Summary: Apache SystemML is a distributed and declarative machine learning platform.
Home-page: http://systemml.apache.org/
Author: Apache SystemML
Author-email: dev@systemml.apache.org
License: Apache 2.0
Location: /Users/asurve/.pyenv/versions/3.5.0/lib/python3.5/site-packages
Requires: numpy, scipy, pandas, Pillow

Import SystemML API


In [2]:
from systemml import MLContext, dml, dmlFromResource

ml = MLContext(sc)

print("Spark Version: " + sc.version)
print("SystemML Version: " + ml.version())
print("SystemML Build Time: " + ml.buildTime())


Spark Version: 2.0.2
SystemML Version: 1.0.0-SNAPSHOT
SystemML Build Time: 2017-08-09 18:39:30 UTC

In [3]:
ml.execute(dml("""s = 'Hello World!'""").output("s")).get("s")


Out[3]:
u'Hello World!'

Import numpy, sklearn, and define some helper functions


In [4]:
import sys, os, glob, subprocess
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
plt.switch_backend('agg')
    
def printLastLogLines(n):
    # print the last n lines of the most recently modified kernel log file
    fname = max(glob.iglob(os.path.join(os.environ["HOME"], 'logs/notebook/kernel-pyspark-*.log')), key=os.path.getctime)
    print(subprocess.check_output(['tail', '-' + str(n), fname]))

Example 1: Matrix Multiplication

SystemML script to generate a random matrix, perform matrix multiplication, and compute the sum of the output


In [5]:
script = """
    X = rand(rows=$nr, cols=1000, sparsity=0.5)
    A = t(X) %*% X
    s = sum(A)
"""

In [6]:
prog = dml(script).input('$nr', 1e5).output('s')
s = ml.execute(prog).get('s')
print (s)


6259203081.18

Examine execution plans, and increase data size to observe changed execution plans


In [7]:
ml = MLContext(sc)
ml = ml.setStatistics(True)
# re-execute ML program
# printLastLogLines(22)

In [8]:
prog = dml(script).input('$nr', 1e6).output('s')
out = ml.execute(prog).get('s')
print (out)


62606598468.1

In [9]:
ml = MLContext(sc)
ml = ml.setStatistics(False)

Load diabetes dataset from scikit-learn


In [10]:
%matplotlib inline

In [11]:
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20].reshape(-1,1)
diabetes_y_test = diabetes.target[-20:].reshape(-1,1)

plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')


Out[11]:
<matplotlib.collections.PathCollection at 0x1105d0150>

In [12]:
diabetes.data.shape


Out[12]:
(442, 10)

Example 2: Implement three different algorithms to train linear regression model

Algorithm 1: Linear Regression - Direct Solve (no regularization)

Least squares formulation

w* = argmin_w ||Xw - y||^2 = argmin_w (y - Xw)'(y - Xw) = argmin_w (1/2) w'(X'X)w - w'(X'y)

Setting the gradient

dw = (X'X)w - (X'y)

to 0 gives w = (X'X)^(-1) (X'y) = solve(X'X, X'y)
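The closed-form solution can be sanity-checked in plain NumPy before running the DML version below. This is a minimal sketch on small random data (not part of the original notebook), comparing `solve(X'X, X'y)` against NumPy's built-in least-squares solver:

```python
import numpy as np

# Hypothetical small least-squares problem, for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50, 1)

# Direct solve: w = (X'X)^(-1) (X'y)
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_direct, w_lstsq))  # both solvers agree on the minimizer
```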


In [13]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    A = t(X) %*% X
    b = t(X) %*% y
    w = solve(A, b)
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [14]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')
w, bias = ml.execute(prog).get('w','bias')
w = w.toNumPy()

In [15]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='blue', linestyle ='dotted')


Out[15]:
[<matplotlib.lines.Line2D at 0x10e203e50>]

Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)

Algorithm

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute gradient dw.
    Step 3: Compute step size alpha.
    Step 4: Update: w_new = w_old + alpha*dw
}

Gradient formula

dw = r = (X'X)w - (X'y)

Step size formula

Find the step size alpha that minimizes f(w + alpha*r):

alpha = -(r'r) / (r'(X'X)r)


In [16]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    max_iter = 100
    w = matrix(0, rows=ncol(X), cols=1)
    XtX = t(X) %*% X        # loop-invariant, computed once
    Xty = t(X) %*% y
    for(i in 1:max_iter){
        dw = XtX %*% w - Xty
        alpha = -(t(dw) %*% dw) / (t(dw) %*% XtX %*% dw)
        w = w + dw * alpha
    }
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [17]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w').output('bias')
w, bias = ml.execute(prog).get('w', 'bias')
w = w.toNumPy()

In [18]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[18]:
[<matplotlib.lines.Line2D at 0x110711fd0>]

Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)

Problem with gradient descent: Takes very similar directions many times

Solution: Enforce conjugacy

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute gradient dw.
    Step 3: Compute step size alpha.
    Step 4: Compute next direction p by enforcing conjugacy with previous direction.
    Step 5: Update: w_new = w_old + alpha*p
}
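The same loop translates directly to NumPy. A sketch on hypothetical random data (for an X with m columns, CG reaches the exact least-squares solution in at most m iterations in exact arithmetic):

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(80, 5)
y = rng.randn(80, 1)
XtX = X.T @ X
Xty = X.T @ y

w = np.zeros((5, 1))
dw = -Xty                                   # gradient at w = 0
p = -dw                                     # first direction: steepest descent
norm_r2 = float(dw.T @ dw)
for _ in range(20):
    q = XtX @ p
    alpha = norm_r2 / float(p.T @ q)        # minimizes f(w + alpha*p)
    w = w + alpha * p                       # update weights
    dw = dw + alpha * q                     # update gradient
    old_norm_r2, norm_r2 = norm_r2, float(dw.T @ dw)
    p = -dw + (norm_r2 / old_norm_r2) * p   # enforce conjugacy with previous p
    if norm_r2 < 1e-24:                     # early stop to avoid a 0/0 step
        break

w_direct = np.linalg.solve(XtX, Xty)        # closed-form reference
print(float(np.max(np.abs(w - w_direct))))  # near machine precision
```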


In [19]:
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    m = ncol(X);
    max_iter = 20;
    w = matrix(0, rows=m, cols=1);      # initialize weights to 0
    dw = -t(X) %*% y; p = -dw;          # gradient dw = (X'X)w - (X'y) at w = 0
    norm_r2 = sum(dw ^ 2);
    for(i in 1:max_iter) {
        q = t(X) %*% (X %*% p)
        alpha = norm_r2 / sum(p * q);   # minimizes f(w + alpha*p)
        w = w + alpha * p;              # update weights
        dw = dw + alpha * q;            # update gradient
        old_norm_r2 = norm_r2; norm_r2 = sum(dw ^ 2);
        p = -dw + (norm_r2 / old_norm_r2) * p;  # next direction - conjugate to previous
    }
    bias = as.scalar(w[nrow(w),1])
    w = w[1:nrow(w)-1,]
"""

In [20]:
prog = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w').output('bias')
w, bias = ml.execute(prog).get('w','bias')
w = w.toNumPy()

In [21]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[21]:
[<matplotlib.lines.Line2D at 0x1107352d0>]

Example 3: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API


In [22]:
import os
from subprocess import call

dirName = os.path.join(os.path.expanduser("~"), "scripts")
call(["mkdir", "-p", dirName])
call(["wget", "-N", "-q", "-P", dirName, "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/LinearRegDS.dml"])

scriptName = dirName + "/LinearRegDS.dml"
dml_script = dmlFromResource(scriptName)

prog = dml_script.input(X=diabetes_X_train, y=diabetes_y_train).input('$icpt',1.0).output('beta_out')
w = ml.execute(prog).get('beta_out')
w = w.toNumPy()
bias = w[1]  # with $icpt=1, the last entry of beta_out is the intercept
print(bias)


[ 152.91886229]

In [23]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w[0]*diabetes_X_test)+bias, color='red', linestyle ='dashed')


Out[23]:
[<matplotlib.lines.Line2D at 0x1109a6090>]

Example 4: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline like API

The mllearn API lets a Python programmer invoke SystemML's algorithms through a scikit-learn-like API as well as through Spark's MLPipeline API.


In [24]:
from pyspark.sql import SQLContext
from systemml.mllearn import LinearRegression
sqlCtx = SQLContext(sc)

In [25]:
regr = LinearRegression(sqlCtx)
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)


Out[25]:
lr

In [26]:
predictions = regr.predict(diabetes_X_test)

In [27]:
# Use the trained model to perform prediction
%matplotlib inline
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, predictions, color='black')


Out[27]:
[<matplotlib.lines.Line2D at 0x110d0bc10>]