For more details, please see the install guide.
In [ ]:
!pip install --upgrade --user systemml
In [ ]:
!pip show systemml
The MLContext API offers a programmatic interface for interacting with SystemML from Spark using languages such as Scala, Java, and Python. As a result, it offers a convenient way to interact with SystemML from the Spark Shell and from Notebooks such as Jupyter and Zeppelin. Please refer to the documentation for more detail on the MLContext API.
As a side note, there are alternative ways to invoke SystemML that are not covered in this notebook.
In [ ]:
from systemml import MLContext, dml, dmlFromResource
ml = MLContext(sc)
print("Spark Version:", sc.version)
print("SystemML Version:", ml.version())
print("SystemML Built-Time:", ml.buildTime())
In [ ]:
# Step 1: Write the DML script
script = """
print("Hello World!");
"""
# Step 2: Create a Python DML object
script = dml(script)
# Step 3: Execute it using MLContext API
ml.execute(script)
Now let's implement a slightly more complicated 'Hello World' program, where we initialize a string variable to 'Hello World!' in DML and then print it from Python. Note that we first register the output variable on the dml object (in Step 2) and then fetch it after execution (in Step 3).
In [ ]:
# Step 1: Write the DML script
script = """
s = "Hello World!";
"""
# Step 2: Create a Python DML object
script = dml(script).output('s')
# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)
In [ ]:
# Step 1: Write the DML script
script = """
# The number of rows is passed externally by the user via 'nr'
X = rand(rows=nr, cols=1000, sparsity=0.5)
A = t(X) %*% X
s = sum(A)
"""
# Step 2: Create a Python DML object
script = dml(script).input(nr=1e5).output('s')
# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)
Now, let's generate a random matrix in NumPy and pass it to SystemML.
In [ ]:
import numpy as np
npMatrix = np.random.rand(1000, 1000)
# Step 1: Write the DML script
script = """
A = t(X) %*% X
s = sum(A)
"""
# Step 2: Create a Python DML object
script = dml(script).input(X=npMatrix).output('s')
# Step 3: Execute it using MLContext API
s = ml.execute(script).get('s')
print(s)
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
plt.switch_backend('agg')
In [ ]:
%matplotlib inline
In [ ]:
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20].reshape(-1,1)
diabetes_y_test = diabetes.target[-20:].reshape(-1,1)
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
Linear regression models the relationship between one numerical response variable and one or more explanatory (feature) variables by fitting a linear equation to observed data. The feature vectors are provided as a matrix $X$ and the observed response values are provided as a 1-column matrix $y$.
A linear regression line has an equation of the form $y = Xw$.
The least squares method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the difference between the predicted response $Xw$ and the actual response $y$.
$w^* = \arg\min_w \|Xw-y\|^2 \\ \;\;\; = \arg\min_w (y - Xw)'(y - Xw) \\ \;\;\; = \arg\min_w \dfrac{w'(X'X)w}{2} - w'(X'y)$
To find the optimal parameter $w$, we set the gradient $dw = (X'X)w - (X'y)$ to 0.
$(X'X)w - (X'y) = 0 \\ w = (X'X)^{-1}(X' y) \\ \;\;= solve(X'X, X'y)$
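The closed-form solution $w = solve(X'X, X'y)$ can be sanity-checked in plain NumPy on synthetic data (a minimal sketch; `np.linalg.solve` plays the role of DML's `solve`, and the random data here is purely illustrative):

```python
import numpy as np

# Synthetic data: 100 samples, 3 features, known weights plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
w_true = np.array([[2.0], [-1.0], [0.5]])
y = X @ w_true + 0.01 * rng.randn(100, 1)

# Normal equations: w = solve(X'X, X'y)
w = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's dedicated least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```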
In [ ]:
# Step 1: Write the DML script
script = """
# add constant feature to X to model intercept
X = cbind(X, matrix(1, rows=nrow(X), cols=1))
A = t(X) %*% X
b = t(X) %*% y
w = solve(A, b)
bias = as.scalar(w[nrow(w),1])
w = w[1:nrow(w)-1,]
"""
# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')
# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()
In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='blue', linestyle='dotted')
In [ ]:
# Step 1: Write the DML script
script = """
# add constant feature to X to model intercepts
X = cbind(X, matrix(1, rows=nrow(X), cols=1))
max_iter = 100
w = matrix(0, rows=ncol(X), cols=1)
XtX = t(X) %*% X  # X'X and X'y are loop-invariant, so compute them once
Xty = t(X) %*% y
for(i in 1:max_iter){
  dw = XtX %*% w - Xty                              # gradient
  alpha = -(t(dw) %*% dw) / (t(dw) %*% XtX %*% dw)  # exact line search step size
  w = w + dw*alpha
}
bias = as.scalar(w[nrow(w),1])
w = w[1:nrow(w)-1,]
"""
# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')
# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()
In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle='dashed')
Problem with gradient descent: it takes very similar directions many times.
Solution: enforce conjugacy between successive directions.
Step 1: Start with an initial point w.
while(not converged) {
Step 2: Compute the gradient dw.
Step 3: Compute the step size alpha.
Step 4: Compute the next direction p by enforcing conjugacy with the previous direction.
Step 5: Update: w_new = w_old + alpha*p
}
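The steps above can be sketched in plain NumPy as conjugate gradient on the normal equations $(X'X)w = X'y$ (a minimal illustration with synthetic data; variable names are chosen to match the DML cell below):

```python
import numpy as np

def cg_least_squares(X, y, max_iter=20):
    """Conjugate gradient on the normal equations (X'X) w = X'y."""
    w = np.zeros((X.shape[1], 1))
    dw = -(X.T @ y)               # gradient at w = 0
    p = -dw                       # first direction: steepest descent
    norm_r2 = float(dw.T @ dw)
    for _ in range(max_iter):
        if norm_r2 < 1e-16:       # already converged
            break
        q = X.T @ (X @ p)
        alpha = norm_r2 / float(p.T @ q)       # exact line search along p
        w = w + alpha * p
        dw = dw + alpha * q
        old_norm_r2, norm_r2 = norm_r2, float(dw.T @ dw)
        p = -dw + (norm_r2 / old_norm_r2) * p  # enforce conjugacy
    return w

# Consistent synthetic system: y lies exactly in the range of X
rng = np.random.RandomState(0)
X = rng.rand(50, 5)
y = X @ rng.rand(5, 1)
w = cg_least_squares(X, y)
print(np.allclose(X @ w, y))
```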
In [ ]:
# Step 1: Write the DML script
script = """
# add constant feature to X to model intercepts
X = cbind(X, matrix(1, rows=nrow(X), cols=1))
m = ncol(X);
max_iter = 20;
w = matrix(0, rows=m, cols=1);     # initialize weights to 0
dw = -t(X) %*% y; p = -dw;         # gradient at w=0: dw = (X'X)w - (X'y) = -(X'y)
norm_r2 = sum(dw ^ 2);
for(i in 1:max_iter) {
  q = t(X) %*% (X %*% p)
  alpha = norm_r2 / sum(p * q);    # exact line search along direction p
  w = w + alpha * p;               # update weights
  dw = dw + alpha * q;
  old_norm_r2 = norm_r2; norm_r2 = sum(dw ^ 2);
  p = -dw + (norm_r2 / old_norm_r2) * p;  # next direction, conjugate to the previous one
}
bias = as.scalar(w[nrow(w),1])
w = w[1:nrow(w)-1,]
"""
# Step 2: Create a Python DML object
script = dml(script).input(X=diabetes_X_train, y=diabetes_y_train).output('w', 'bias')
# Step 3: Execute it using MLContext API
w, bias = ml.execute(script).get('w','bias')
w = w.toNumPy()
In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle='dashed')
SystemML ships with several pre-implemented algorithms that can be invoked directly. Please refer to the algorithm reference manual for usage.
In [ ]:
# Step 1: No need to write a DML script here. But, keeping it as a placeholder for consistency :)
# Step 2: Create a Python DML object
script = dmlFromResource('scripts/algorithms/LinearRegDS.dml')
script = script.input(X=diabetes_X_train, y=diabetes_y_train).input('$icpt',1.0).output('beta_out')
# Step 3: Execute it using MLContext API
w = ml.execute(script).get('beta_out')
w = w.toNumPy()
bias = w[1]
w = w[0]
In [ ]:
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.plot(diabetes_X_test, (w*diabetes_X_test)+bias, color='red', linestyle ='dashed')
The mllearn API allows a Python programmer to invoke SystemML's algorithms through a scikit-learn-like API, as well as through Spark's MLPipeline API.
In [ ]:
# Step 1: No need to write a DML script here. But, keeping it as a placeholder for consistency :)
# Step 2: No need to create a Python DML object. But, keeping it as a placeholder for consistency :)
# Step 3: Execute Linear Regression using the mllearn API
from systemml.mllearn import LinearRegression
regr = LinearRegression(spark)
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
In [ ]:
predictions = regr.predict(diabetes_X_test)
In [ ]:
# Use the trained model to perform prediction
%matplotlib inline
plt.scatter(diabetes_X_train, diabetes_y_train, color='black')
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.plot(diabetes_X_test, predictions, color='black')