Install SystemML using pip

For more details, please see the install guide.


In [ ]:
!pip install --upgrade --user systemml

In [ ]:
!pip show systemml

Example 1: Implement a simple 'Hello World' program in SystemML

First import the classes necessary to implement the 'Hello World' program.

The MLContext API offers a programmatic interface for interacting with SystemML from Spark using languages such as Scala, Java, and Python. As a result, it offers a convenient way to interact with SystemML from the Spark Shell and from Notebooks such as Jupyter and Zeppelin. Please refer to the documentation for more detail on the MLContext API.

As a side note, there are alternative ways to invoke SystemML, such as the command line (via spark-submit) and the JMLC API; these are not covered in this notebook.


In [ ]:
from systemml import MLContext, dml, dmlFromResource

ml = MLContext(sc)

print("Spark Version:", sc.version)
print("SystemML Version:", ml.version())
print("SystemML Built-Time:", ml.buildTime())

In [ ]:
# Step 1: Write the DML script
script = """
print("Hello World!");
"""

# Step 2: Create a Python DML object
script = dml(script)

# Step 3: Execute it using MLContext API
ml.execute(script)

Now let's implement a slightly more complicated 'Hello World' program, where we initialize a string variable to 'Hello World!' in DML and print it from Python. Note that we first register the output variable on the dml object (in step 2) and then fetch it after execution (in step 3).


In [ ]:
# Step 1: Write the DML script
script = """
s = "Hello World!";
"""

# Step 2: Create a Python DML object
script = dml(script).output('s')

# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)

Example 2: Matrix Multiplication

Let's write a script to generate a random matrix, perform matrix multiplication, and compute the sum of the output.


In [ ]:
# Step 1: Write the DML script
script = """
    # The number of rows is passed externally by the user via 'nr'
    X = rand(rows=nr, cols=1000, sparsity=0.5)
    A = t(X) %*% X
    s = sum(A)
"""

# Step 2: Create a Python DML object
script = dml(script).input(nr=1e5).output('s')

# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)

Now, let's generate a random matrix in NumPy and pass it to SystemML.


In [ ]:
import numpy as np
npMatrix = np.random.rand(1000, 1000)

# Step 1: Write the DML script
script = """
    A = t(X) %*% X
    s = sum(A)
"""

# Step 2: Create a Python DML object
script = dml(script).input(X=npMatrix).output('s')

# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)
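
As a quick sanity check, the same computation can be reproduced directly in NumPy (using the npMatrix defined above); the result should match the SystemML output up to floating-point rounding.


In [ ]:
# NumPy equivalent of the DML script above: t(X) %*% X, then sum.
print(np.sum(npMatrix.T.dot(npMatrix)))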

Load the diabetes dataset from scikit-learn for Example 3


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
plt.switch_backend('agg')

In [ ]:
%matplotlib inline

In [ ]:
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # use only one feature
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20].reshape(-1,1)
diabetes_y_test = diabetes.target[-20:].reshape(-1,1)

plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

Example 3: Implement three different algorithms to train a linear regression model

Linear regression models the relationship between one numerical response variable and one or more explanatory (feature) variables by fitting a linear equation to observed data. The feature vectors are provided as a matrix $X$ and the observed response values are provided as a 1-column matrix $y$.

A linear regression line has an equation of the form $y = Xw$; in the scripts below, an intercept is modeled by appending a constant column of ones to $X$.

Algorithm 1: Linear Regression - Direct Solve (no regularization)

Least squares formulation

The least squares method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the difference between the predicted response $Xw$ and the actual response $y$.

$w^* = argmin_w ||Xw-y||^2 \\ \;\;\; = argmin_w (y - Xw)'(y - Xw) \\ \;\;\; = argmin_w \left( \dfrac{w'(X'X)w}{2} - w'(X'y) \right)$

(The last step drops the constant term $y'y$ and scales by $\tfrac{1}{2}$, neither of which changes the minimizer.)

To find the optimal parameter $w$, we set the gradient $dw = (X'X)w - (X'y)$ to 0.

$(X'X)w - (X'y) = 0 \\ w = (X'X)^{-1}(X' y) \\ \;\;= solve(X'X, X'y)$


In [ ]:
# Step 1: Write the DML script
script = """
    # add constant feature to X to model intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    A = t(X) %*% X
    b = t(X) %*% y
    w = solve(A, b)
    bias = as.scalar(w[nrow(w),1])  # last entry is the intercept
    w = w[1:nrow(w)-1,]             # remaining entries are the feature weights
"""

# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')

# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()
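
As an optional cross-check (a minimal sketch, assuming diabetes_X_train and diabetes_y_train as defined above), the same normal-equations solve can be reproduced in NumPy; the resulting weights and bias should closely match the DML output.


In [ ]:
# NumPy equivalent of the DML direct solve above.
Xb = np.hstack([diabetes_X_train, np.ones((diabetes_X_train.shape[0], 1))])  # append intercept column
w_np = np.linalg.solve(Xb.T.dot(Xb), Xb.T.dot(diabetes_y_train))             # solve (X'X)w = X'y
print("NumPy weights:", w_np[:-1].ravel(), "bias:", w_np[-1, 0])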

In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='blue', linestyle='dotted')

Algorithm 2: Linear Regression - Batch Gradient Descent (no regularization)

Algorithm

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute the gradient dw.
    Step 3: Compute the step size alpha.
    Step 4: Update: w_new = w_old + alpha*dw
}

Gradient formula

dw = r = (X'X)w - (X'y)

Step size formula

Find the number alpha that minimizes f(w + alpha*r):

alpha = -(r'r) / (r'(X'X)r)
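
For completeness, here is a short derivation of this step size, using $f(w) = \dfrac{w'(X'X)w}{2} - w'(X'y)$ and $r = dw = (X'X)w - (X'y)$ from above:

$\dfrac{d}{d\alpha} f(w + \alpha r) = \big((X'X)(w + \alpha r) - X'y\big)' r = r'r + \alpha \, r'(X'X)r \\ \textrm{Setting this to } 0 \textrm{ gives } \; \alpha = -\dfrac{r'r}{r'(X'X)r}$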


In [ ]:
# Step 1: Write the DML script
script = """
    # add a constant feature to X to model the intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    max_iter = 100
    w = matrix(0, rows=ncol(X), cols=1)    # initialize weights to 0
    XtX = t(X) %*% X                       # X'X is constant, so compute it once
    Xty = t(X) %*% y
    for(i in 1:max_iter){
        dw = XtX %*% w - Xty                               # gradient
        alpha = -(t(dw) %*% dw) / (t(dw) %*% XtX %*% dw)   # exact line-search step size
        w = w + dw*alpha                                   # update weights
    }
    bias = as.scalar(w[nrow(w),1])  # last entry is the intercept
    w = w[1:nrow(w)-1,]             # remaining entries are the feature weights
"""

# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')

# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()

In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle='dashed')

Algorithm 3: Linear Regression - Conjugate Gradient (no regularization)

Problem with gradient descent: it can take very similar directions many times

Solution: Enforce conjugacy

Step 1: Start with an initial point
while(not converged) {
    Step 2: Compute the gradient dw.
    Step 3: Compute the step size alpha.
    Step 4: Compute the next direction p by enforcing conjugacy with the previous direction.
    Step 5: Update: w_new = w_old + alpha*p
}
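
As a brief aside, two directions $p_i$ and $p_j$ are conjugate with respect to $X'X$ if

$p_i' \, (X'X) \, p_j = 0 \;\;\textrm{for}\;\; i \neq j$

In the script below, the update p = -dw + (norm_r2 / old_norm_r2) * p maintains this property for the quadratic least-squares objective; the coefficient norm_r2 / old_norm_r2 is the standard Fletcher-Reeves ratio of squared gradient norms.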


In [ ]:
# Step 1: Write the DML script
script = """
    # add a constant feature to X to model the intercept
    X = cbind(X, matrix(1, rows=nrow(X), cols=1))
    m = ncol(X);
    max_iter = 20;
    w = matrix(0, rows=m, cols=1);      # initialize weights to 0
    dw = - t(X) %*% y; p = - dw;        # dw = (X'X)w - (X'y) with w = 0
    norm_r2 = sum(dw ^ 2);
    for(i in 1:max_iter) {
        q = t(X) %*% (X %*% p)          # (X'X) %*% p without materializing X'X
        alpha = norm_r2 / sum(p * q);   # step size that minimizes f(w + alpha*p)
        w = w + alpha * p;              # update weights
        dw = dw + alpha * q;            # update gradient
        old_norm_r2 = norm_r2; norm_r2 = sum(dw ^ 2);
        p = -dw + (norm_r2 / old_norm_r2) * p;  # next direction: conjugate to the previous one
    }
    bias = as.scalar(w[nrow(w),1])  # last entry is the intercept
    w = w[1:nrow(w)-1,]             # remaining entries are the feature weights
"""

# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')

# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()

In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle='dashed')

Example 4: Invoke existing SystemML algorithm script LinearRegDS.dml using MLContext API

SystemML ships with several pre-implemented algorithms that can be invoked directly. Please refer to the algorithm reference manual for usage.


In [ ]:
# Step 1: No need to write a DML script here. But, keeping it as a placeholder for consistency :)

# Step 2: Create a Python DML object
script = dmlFromResource('scripts/algorithms/LinearRegDS.dml')
# $icpt=1 asks LinearRegDS.dml to fit an intercept
script = script.input(X=diabetes_X_train, y=diabetes_y_train).input('$icpt', 1.0).output('beta_out')

# Step 3: Execute it using MLContext API
w = ml.execute(script).get('beta_out')
w = w.toNumPy()
bias = w[1]  # with $icpt=1, the intercept is stored as the last entry of beta_out
w = w[0]     # feature weight

In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')

Example 5: Invoke existing SystemML algorithm using scikit-learn/SparkML pipeline like API

The mllearn API allows a Python programmer to invoke SystemML's algorithms using a scikit-learn-like API as well as Spark's MLPipeline API.


In [ ]:
# Step 1: No need to write a DML script here. But, keeping it as a placeholder for consistency :)

# Step 2: No need to create a Python DML object. But, keeping it as a placeholder for consistency :)

# Step 3: Execute Linear Regression using the mllearn API
from systemml.mllearn import LinearRegression
regr = LinearRegression(spark)
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

In [ ]:
predictions = regr.predict(diabetes_X_test)

In [ ]:
# Use the trained model to perform prediction
%matplotlib inline
plt.scatter(diabetes_X_train, diabetes_y_train,  color='black')
plt.scatter(diabetes_X_test, diabetes_y_test,  color='red')

plt.plot(diabetes_X_test, predictions, color='black')
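
As an optional follow-up (a minimal sketch, assuming predictions and diabetes_y_test from the cells above), the quality of the fit can be quantified with the mean squared error on the test set.


In [ ]:
# Mean squared error on the held-out test set.
mse = np.mean((np.ravel(predictions) - np.ravel(diabetes_y_test)) ** 2)
print("Test MSE:", mse)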

Uninstall/Clean up the SystemML Python package and jar file


In [ ]:
!pip uninstall systemml -y