Flight Delay Prediction Demo Using SystemML

This notebook is based on the datascientistworkbench.com tutorial notebook for predicting flight delays.

Loading SystemML

To use a released version, run "%AddDeps org.apache.systemml systemml 0.9.0-incubating". To use the nightly build, run "%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar".

Alternatively, provide SystemML.jar and its dependencies on the command line when starting the notebook (for example: --packages com.databricks:spark-csv_2.10:1.4.0 --jars SystemML.jar).


In [1]:
%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar


Using cached version of SystemML.jar

Use the spark-csv package to load the CSV file


In [2]:
%AddDeps com.databricks spark-csv_2.10 1.4.0


:: loading settings :: url = jar:file:/usr/local/spark-kernel/lib/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
:: resolving dependencies :: com.ibm.spark#spark-kernel;working [not transitive]
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.4.0/spark-csv_2.10-1.4.0.jar ...
	[SUCCESSFUL ] com.databricks#spark-csv_2.10;1.4.0!spark-csv_2.10.jar (68ms)
:: resolution report :: resolve 642ms :: artifacts dl 72ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
	---------------------------------------------------------------------
:: retrieving :: com.ibm.spark#spark-kernel
	confs: [default]
	1 artifacts copied, 0 already retrieved (153kB/9ms)

Import Data

Download the airline dataset from stat-computing.org if it has not already been downloaded


In [3]:
import sys.process._
import java.net.URL
import java.io.File
val url = "http://stat-computing.org/dataexpo/2009/2007.csv.bz2"
val localFilePath = "airline2007.csv.bz2"
// Download only if the file is not already present locally
if (!new File(localFilePath).exists) {
    (new URL(url) #> new File(localFilePath)).!!
}

Load the dataset into a DataFrame using the spark-csv package


In [4]:
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
val sqlContext = new SQLContext(sc)
val fmt = sqlContext.read.format("com.databricks.spark.csv")
val opt = fmt.options(Map("header"->"true", "inferSchema"->"true"))
// Replace the literal "NA" with "0.0" in all columns ("*")
val airline = opt.load(localFilePath).na.replace( "*", Map("NA" -> "0.0") )

In [5]:
airline.printSchema


root
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: string (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: string (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: string (nullable = true)
 |-- CRSElapsedTime: string (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- TaxiIn: integer (nullable = true)
 |-- TaxiOut: integer (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted: integer (nullable = true)
 |-- CarrierDelay: integer (nullable = true)
 |-- WeatherDelay: integer (nullable = true)
 |-- NASDelay: integer (nullable = true)
 |-- SecurityDelay: integer (nullable = true)
 |-- LateAircraftDelay: integer (nullable = true)
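
Several numeric-looking columns, such as DepDelay and ArrDelay, are reported as string. A likely cause, assuming the raw file marks missing values with the literal NA, is that schema inference cannot treat those columns as numeric. The na.replace call swaps NA for 0.0, but the column type stays string, so the values still need an explicit conversion later. A minimal plain-Scala sketch of that two-step cleanup (the helper names are illustrative and not part of Spark):

```scala
// Hypothetical helper mirroring na.replace("*", Map("NA" -> "0.0")) for a single value.
def replaceNA(raw: String): String = if (raw == "NA") "0.0" else raw

// Convert a cleaned string field to a Double; None if it is still not numeric.
def toDelay(raw: String): Option[Double] =
  scala.util.Try(replaceNA(raw).trim.toDouble).toOption

// Made-up sample values for illustration.
val sample = Seq("12", "NA", "-3", "abc")
val parsed = sample.map(toDelay)
```

Note that mapping a missing delay to 0.0 treats it as "on time", which may or may not be the intended meaning for cancelled flights.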

Data Exploration

Which airports have the highest average departure delay?


In [6]:
airline.registerTempTable("airline")
sqlContext.sql("""SELECT Origin, count(*) conFlight, avg(DepDelay) delay
                    FROM airline
                    GROUP BY Origin
                    ORDER BY delay DESC""").show


+------+---------+------------------+
|Origin|conFlight|             delay|
+------+---------+------------------+
|   PIR|        4|              45.5|
|   ACK|      314|45.296178343949045|
|   SOP|      195| 34.02051282051282|
|   HHH|      997| 22.58776328986961|
|   MCN|      992|22.496975806451612|
|   AKN|      235|21.123404255319148|
|   CEC|     1055|20.807582938388627|
|   GNV|     1927| 20.69797612869746|
|   EYW|     1052|20.224334600760457|
|   ACY|      735|20.141496598639456|
|   SPI|     1745|19.545558739255014|
|   GST|       90|19.233333333333334|
|   EWR|   154113|18.800853918877706|
|   BRW|      726| 18.02754820936639|
|   AGS|     2286|17.728346456692915|
|   ORD|   375784|17.695756072637472|
|   TRI|     1207| 17.63628831814416|
|   SBN|     5128|17.505850234009362|
|   FAY|     2185| 17.48970251716247|
|   PHL|   104063|17.067776250924922|
+------+---------+------------------+
only showing top 20 rows
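
The query groups flights by origin, then reports the flight count and mean departure delay per airport, sorted by mean delay. The same aggregation can be sketched on a plain Scala collection. The records below are made up for illustration; only the shape of the computation follows the SQL above:

```scala
// Toy (origin, departure delay in minutes) records; values are invented.
case class Flight(origin: String, depDelay: Double)

val flights = Seq(
  Flight("JFK", 30.0), Flight("JFK", 10.0),
  Flight("ORD", 5.0), Flight("ORD", 15.0), Flight("ORD", 25.0),
  Flight("PIR", 45.0)
)

// Group by origin, compute (origin, flight count, mean delay), sort by mean delay descending.
val byOrigin: Seq[(String, Int, Double)] =
  flights.groupBy(_.origin).toSeq
    .map { case (origin, fs) => (origin, fs.size, fs.map(_.depDelay).sum / fs.size) }
    .sortBy { case (_, _, avgDelay) => -avgDelay }
```

As in the real output, where PIR ranks first with only 4 flights, an airport with very few flights can top a ranking by mean delay, so the flight count is worth reading alongside the average.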

Modeling: Logistic Regression

Predict departure delays of more than 15 minutes for flights departing from JFK


In [8]:
// Label flights: 1.0 if the departure delay exceeds 15 minutes, 2.0 otherwise (unparseable values are labeled 1.0)
sqlContext.udf.register("checkDelay", (depDelay:String) => try { if(depDelay.toDouble > 15) 1.0 else 2.0 } catch { case e:Exception => 1.0 })
val tempSmallAirlineData = sqlContext.sql("SELECT *, checkDelay(DepDelay) label FROM airline WHERE Origin = 'JFK'").persist(StorageLevel.MEMORY_AND_DISK)
// Keep only destinations with more than 1000 flights from JFK
val popularDest = tempSmallAirlineData.select("Dest").map(y => (y.get(0).toString, 1)).reduceByKey(_ + _).filter(_._2 > 1000).collect.toMap
sqlContext.udf.register("onlyUsePopularDest", (x:String) => popularDest.contains(x))
tempSmallAirlineData.registerTempTable("tempAirline")
val smallAirlineData = sqlContext.sql("SELECT * FROM tempAirline WHERE onlyUsePopularDest(Dest)")

// Split into 70% training and 30% test data
val datasets = smallAirlineData.randomSplit(Array(0.7, 0.3))
val trainDataset = datasets(0).cache
val testDataset = datasets(1).cache
trainDataset.count
// Only the value of the last expression (the test set size) is shown as output
testDataset.count


Out[8]:
34773
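
The cell above does two things: it derives a class label from DepDelay and it keeps only frequently served destinations. The labels are 1.0 and 2.0 rather than 0 and 1, presumably because the SystemML multinomial logistic regression script expects category labels numbered from 1 (an assumption worth checking against the SystemML documentation). A plain-Scala sketch of both steps, with made-up destinations and a much smaller threshold than the notebook's 1000:

```scala
// Same rule as the checkDelay UDF: 1.0 = delayed by more than 15 minutes, 2.0 = not delayed.
// Values that cannot be parsed fall into class 1.0, as in the UDF.
def checkDelay(depDelay: String): Double =
  try { if (depDelay.toDouble > 15) 1.0 else 2.0 }
  catch { case _: Exception => 1.0 }

// Toy destination column; values are invented for illustration.
val sampleDests = Seq("LAX", "LAX", "LAX", "BOS", "BOS", "SFO")
val minFlights = 2 // hypothetical threshold; the notebook uses 1000

// Count flights per destination and keep those strictly above the threshold.
val popular: Map[String, Int] =
  sampleDests.groupBy(identity)
    .map { case (dest, xs) => dest -> xs.size }
    .filter { case (_, n) => n > minFlights }
```

Here only LAX survives the filter, because the comparison is strict: a destination with exactly minFlights flights is dropped.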

Feature selection

Encode the destination using one-hot encoding and include the columns Year, Month, DayofMonth, DayOfWeek, and Distance as features


In [9]:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("Dest").setOutputCol("DestIndex") // .setHandleInvalid("skip") // Only works on Spark 1.6 or later
val encoder = new OneHotEncoder().setInputCol("DestIndex").setOutputCol("DestVec")
val assembler = new VectorAssembler().setInputCols(Array("Year","Month","DayofMonth","DayOfWeek","Distance","DestVec")).setOutputCol("features")
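
To make the encoding concrete, here is a plain-Scala sketch of what the indexer and encoder stages compute. It assumes the default behavior of Spark's StringIndexer (the most frequent value gets index 0) and of OneHotEncoder (the last category is dropped, so k categories yield a vector of length k - 1). The helper names and the sample data are illustrative:

```scala
// Index categories by descending frequency (ties broken alphabetically here for determinism).
def fitIndex(values: Seq[String]): Map[String, Int] =
  values.groupBy(identity).toSeq
    .sortBy { case (value, xs) => (-xs.size, value) }
    .map(_._1).zipWithIndex.toMap

// One-hot encode an index into a vector of length numCategories - 1,
// so the last category maps to the all-zero vector (mirrors dropLast = true).
def oneHot(index: Int, numCategories: Int): Vector[Double] =
  Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)

val dests = Seq("LAX", "SFO", "LAX", "BOS", "SFO", "LAX")
val index = fitIndex(dests) // LAX -> 0, SFO -> 1, BOS -> 2
val encoded = dests.map(d => oneHot(index(d), index.size))
```

The VectorAssembler stage then concatenates such a vector with the numeric columns into a single features vector per row.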

Build the model: Use SystemML's MLPipeline wrapper.

This wrapper invokes MultiLogReg.dml (for training) and GLM-predict.dml (for prediction). These DML algorithms are available at https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms


In [10]:
import org.apache.spark.ml.Pipeline
import org.apache.sysml.api.ml.LogisticRegression

val lr = new LogisticRegression("log", sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(0).setMaxOuterIter(100)

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val model = pipeline.fit(trainDataset)


BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 56433.27085246851,  Gradient Norm = 4.469119635504498E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 9262.13484840509,  Predicted = 8912.05664442707  (A/P: 1.0393),  Trust Delta = 4.1513539310828525E-4
   -- New Objective = 47171.13600406342,  Beta Change Norm = 3.9882828705797336E-4,  Gradient Norm = 3491408.311614066
 
-- Outer Iteration 2: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 107.11137476684962,  Predicted = 105.31921188128369  (A/P: 1.017),  Trust Delta = 4.1513539310828525E-4
   -- New Objective = 47064.02462929657,  Beta Change Norm = 1.0302143846288746E-4,  Gradient Norm = 84892.35372269012
Termination / Convergence condition satisfied.
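
The initial objective in the log above offers a quick sanity check on the training set size. Assuming the script starts from all-zero coefficients (so the regularization term vanishes and each of the two classes gets probability 0.5), every row contributes ln 2 to the objective. Under that assumption, dividing the initial objective by ln 2 recovers the number of training rows. The helper name below is illustrative:

```scala
// Initial objective copied from the training log above.
val initialObjective = 56433.27085246851

// Assumption: zero coefficients give uniform class probabilities,
// so each row adds ln(numClasses) to the negative log-likelihood.
def impliedRowCount(objective: Double, numClasses: Int = 2): Long =
  math.round(objective / math.log(numClasses.toDouble))

val trainRows = impliedRowCount(initialObjective) // 81416
```

Together with the 34773 test rows reported in Out[8], 81416 training rows correspond to a split of roughly 70/30, which matches the randomSplit weights.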

Evaluate the model

Compute the root-mean-square (RMS) error on the test data


In [11]:
val predictions = model.transform(testDataset.withColumnRenamed("label", "OriginalLabel"))
predictions.select("prediction", "OriginalLabel").show
// Register a helper UDF used by the RMS error query in the next cell
sqlContext.udf.register("square", (x:Double) => Math.pow(x, 2.0))


+----------+-------------+
|prediction|OriginalLabel|
+----------+-------------+
|       1.0|          2.0|
|       1.0|          1.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          1.0|
|       1.0|          2.0|
|       1.0|          1.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          1.0|
|       1.0|          2.0|
|       1.0|          2.0|
|       1.0|          1.0|
|       1.0|          1.0|
|       1.0|          1.0|
+----------+-------------+
only showing top 20 rows

Out[11]:
UserDefinedFunction(<function1>,DoubleType,List())

In [12]:
predictions.registerTempTable("predictions")
sqlContext.sql("SELECT sqrt(avg(square(OriginalLabel - prediction))) FROM predictions").show


+------------------+
|               _c0|
+------------------+
|0.8557362892866146|
+------------------+
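
Because the labels only take the values 1.0 and 2.0, every wrong prediction has a squared error of exactly 1. Assuming the predictions are also restricted to those two values, the squared RMS error equals the misclassification rate. The sketch below checks this identity on made-up pairs and applies it to the result above; the function names are illustrative:

```scala
// RMS error between predicted and true labels.
def rmse(pairs: Seq[(Double, Double)]): Double =
  math.sqrt(pairs.map { case (p, y) => math.pow(y - p, 2.0) }.sum / pairs.size)

// Fraction of rows whose predicted label differs from the true label.
def errorRate(pairs: Seq[(Double, Double)]): Double =
  pairs.count { case (p, y) => p != y }.toDouble / pairs.size

// Toy (prediction, label) pairs, invented for illustration.
val pairs = Seq((1.0, 2.0), (1.0, 1.0), (1.0, 2.0), (2.0, 2.0))

// Squared RMS error reported above, read as a misclassification rate (about 0.73).
val impliedErrorRate = math.pow(0.8557362892866146, 2.0)
```

Read this way, an RMS error of 0.856 suggests that about 73% of the test rows are misclassified, which is consistent with the sample output, where all 20 displayed predictions are 1.0. A classification metric such as accuracy may be easier to interpret here.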

Perform k-fold cross-validation to tune the hyperparameters

Use cross-validation to tune the regularization parameter of the logistic regression model.


In [13]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 1e-3, 1e-6)).build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(2) // Setting k = 2
val cvmodel = crossval.fit(trainDataset)


BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28202.772482623055,  Gradient Norm = 2.221087060254761E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4576.927438869821,  Predicted = 4405.651264293149  (A/P: 1.0389),  Trust Delta = 4.127578309122139E-4
   -- New Objective = 23625.845043753234,  Beta Change Norm = 3.9671126297839183E-4,  Gradient Norm = 1718538.331150294
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28202.772482623055,  Gradient Norm = 2.221087060254761E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4576.927438878782,  Predicted = 4405.651264300938  (A/P: 1.0389),  Trust Delta = 4.127578309130283E-4
   -- New Objective = 23625.845043744273,  Beta Change Norm = 3.967112629790933E-4,  Gradient Norm = 1718538.3311583179
 
-- Outer Iteration 2: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 52.06267761322306,  Predicted = 51.207226997373795  (A/P: 1.0167),  Trust Delta = 4.127578309130283E-4
   -- New Objective = 23573.78236613105,  Beta Change Norm = 1.0195505438829344E-4,  Gradient Norm = 41072.985998067124
 
-- Outer Iteration 3: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.03776156834283029,  Predicted = 0.037741389955733964  (A/P: 1.0005),  Trust Delta = 4.127578309130283E-4
   -- New Objective = 23573.744604562708,  Beta Change Norm = 3.3257729178954336E-6,  Gradient Norm = 3559.0088415221207
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28202.772482623055,  Gradient Norm = 2.221087060254761E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4576.927438878873,  Predicted = 4405.651264301018  (A/P: 1.0389),  Trust Delta = 4.1275783091303654E-4
   -- New Objective = 23625.845043744182,  Beta Change Norm = 3.9671126297910036E-4,  Gradient Norm = 1718538.331158408
 
-- Outer Iteration 2: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 52.062677613230335,  Predicted = 51.20722699738286  (A/P: 1.0167),  Trust Delta = 4.1275783091303654E-4
   -- New Objective = 23573.782366130952,  Beta Change Norm = 1.0195505438831547E-4,  Gradient Norm = 41072.98599806662
 
-- Outer Iteration 3: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.03776156833919231,  Predicted = 0.037741389955751575  (A/P: 1.0005),  Trust Delta = 4.1275783091303654E-4
   -- New Objective = 23573.744604562613,  Beta Change Norm = 3.3257729178972746E-6,  Gradient Norm = 3559.008841523661
 
-- Outer Iteration 4: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 1.3742707646742929,  Predicted = 1.374282851981874  (A/P: 1.0),  Trust Delta = 0.0016510313236521462
   -- New Objective = 23572.37033379794,  Beta Change Norm = 4.1275783091303654E-4,  Gradient Norm = 23218.782943544382
 
-- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 5.475667862796399,  Predicted = 5.475595423716493  (A/P: 1.0),  Trust Delta = 0.006604125294608585
   -- New Objective = 23566.894665935142,  Beta Change Norm = 0.0016510313236521464,  Gradient Norm = 3400.306136071355
 
-- Outer Iteration 6: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 19.796611347293947,  Predicted = 19.796668922654057  (A/P: 1.0),  Trust Delta = 0.02641650117843434
   -- New Objective = 23547.09805458785,  Beta Change Norm = 0.006604125294608585,  Gradient Norm = 12384.979229404262
 
-- Outer Iteration 7: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 48.9038754012945,  Predicted = 48.86479486479853  (A/P: 1.0008),  Trust Delta = 0.039975464358656405
   -- New Objective = 23498.194179186554,  Beta Change Norm = 0.026416501178434335,  Gradient Norm = 25887.667183269536
 
-- Outer Iteration 8: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.007870123248721939,  Predicted = 0.007868226951946769  (A/P: 1.0002),  Trust Delta = 0.039975464358656405
   -- New Objective = 23498.186309063305,  Beta Change Norm = 6.078745447586554E-7,  Gradient Norm = 1345.8027775103888
 
-- Outer Iteration 9: Had 5 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 25.04238552428069,  Predicted = 25.024767443519863  (A/P: 1.0007),  Trust Delta = 0.0405590959281579
   -- New Objective = 23473.143923539024,  Beta Change Norm = 0.039975464358656405,  Gradient Norm = 63769.52436782582
 
-- Outer Iteration 10: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.04773861860303441,  Predicted = 0.04771039962536379  (A/P: 1.0006),  Trust Delta = 0.0405590959281579
   -- New Objective = 23473.09618492042,  Beta Change Norm = 1.4963385754664812E-6,  Gradient Norm = 720.8018323328566
 
-- Outer Iteration 11: Had 5 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 8.123822556943196,  Predicted = 8.128868676639112  (A/P: 0.9994),  Trust Delta = 0.10966765508915642
   -- New Objective = 23464.972362363478,  Beta Change Norm = 0.040559095928157894,  Gradient Norm = 72691.91595482397
 
-- Outer Iteration 12: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.06196295309564448,  Predicted = 0.061921093377362  (A/P: 1.0007),  Trust Delta = 0.10966765508915642
   -- New Objective = 23464.910399410383,  Beta Change Norm = 1.7036583109418734E-6,  Gradient Norm = 482.30416635512506
 
-- Outer Iteration 13: Had 6 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 17.71440401360087,  Predicted = 17.616303961789683  (A/P: 1.0056),  Trust Delta = 0.16941777360208057
   -- New Objective = 23447.19599539678,  Beta Change Norm = 0.10966765508915642,  Gradient Norm = 448422.2320019876
 
-- Outer Iteration 14: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 2.386916461367946,  Predicted = 2.397254649433668  (A/P: 0.9957),  Trust Delta = 0.16941777360208057
   -- New Objective = 23444.809078935414,  Beta Change Norm = 1.0691952710422448E-5,  Gradient Norm = 2940.4721234861527
 
-- Outer Iteration 15: Had 4 CG iterations
   -- Obj.Reduction:  Actual = 4.294265273932979,  Predicted = 4.301599925371988  (A/P: 0.9983),  Trust Delta = 0.16941777360208057
   -- New Objective = 23440.51481366148,  Beta Change Norm = 0.018008719957742635,  Gradient Norm = 4590.1170762087395
 
-- Outer Iteration 16: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 2.4845889129210263E-4,  Predicted = 2.4844829761319425E-4  (A/P: 1.0),  Trust Delta = 0.16941777360208057
   -- New Objective = 23440.51456520259,  Beta Change Norm = 1.0825357762700158E-7,  Gradient Norm = 280.5707172598387
 
-- Outer Iteration 17: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 22.440803682489786,  Predicted = 22.42170069553472  (A/P: 1.0009),  Trust Delta = 0.2496076412979077
   -- New Objective = 23418.0737615201,  Beta Change Norm = 0.16941777360208057,  Gradient Norm = 37677.05806399844
 
-- Outer Iteration 18: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.15241017882726737,  Predicted = 0.15239595431754965  (A/P: 1.0001),  Trust Delta = 0.2496076412979077
   -- New Objective = 23417.921351341272,  Beta Change Norm = 8.477249180981066E-6,  Gradient Norm = 707.427496995126
 
-- Outer Iteration 19: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 36.817799356838805,  Predicted = 36.84419020002096  (A/P: 0.9993),  Trust Delta = 0.3890684157185231
   -- New Objective = 23381.103551984434,  Beta Change Norm = 0.2496076412979077,  Gradient Norm = 181659.30511599063
 
-- Outer Iteration 20: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 3.9036142495642707,  Predicted = 3.907242243615839  (A/P: 0.9991),  Trust Delta = 0.3890684157185231
   -- New Objective = 23377.19993773487,  Beta Change Norm = 4.3252276508826854E-5,  Gradient Norm = 4562.596683929567
 
-- Outer Iteration 21: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 2.4621394186397083E-4,  Predicted = 2.462032554160668E-4  (A/P: 1.0),  Trust Delta = 0.3890684157185231
   -- New Objective = 23377.199691520927,  Beta Change Norm = 1.0792242771895522E-7,  Gradient Norm = 293.5155793389021
 
-- Outer Iteration 22: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 32.60430984508639,  Predicted = 32.63142558199526  (A/P: 0.9992),  Trust Delta = 0.6911480264449816
   -- New Objective = 23344.59538167584,  Beta Change Norm = 0.38906841571852313,  Gradient Norm = 13358.735388646046
 
-- Outer Iteration 23: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.0021210133490967564,  Predicted = 0.002120754723733256  (A/P: 1.0001),  Trust Delta = 0.6911480264449816
   -- New Objective = 23344.593260662492,  Beta Change Norm = 3.175083062930857E-7,  Gradient Norm = 969.5458081582332
 
-- Outer Iteration 24: Had 6 CG iterations
   -- Obj.Reduction:  Actual = 1.0072309033639613,  Predicted = 1.0078398039430247  (A/P: 0.9994),  Trust Delta = 0.6911480264449816
   -- New Objective = 23343.586029759128,  Beta Change Norm = 0.008749259137025917,  Gradient Norm = 1067.7896535923433
 
-- Outer Iteration 25: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 1.3547600246965885E-5,  Predicted = 1.3547465425594469E-5  (A/P: 1.0),  Trust Delta = 0.6911480264449816
   -- New Objective = 23343.586016211528,  Beta Change Norm = 2.5374783095185467E-8,  Gradient Norm = 83.20291366858535
 
-- Outer Iteration 26: Had 12 CG iterations
   -- Obj.Reduction:  Actual = 15.302215361618437,  Predicted = 15.310868474305936  (A/P: 0.9994),  Trust Delta = 0.6911480264449816
   -- New Objective = 23328.28380084991,  Beta Change Norm = 0.5120342239089952,  Gradient Norm = 15756.152919911565
 
-- Outer Iteration 27: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.0029535907960962504,  Predicted = 0.002953150612459315  (A/P: 1.0001),  Trust Delta = 0.6911480264449816
   -- New Objective = 23328.280847259113,  Beta Change Norm = 3.74856810221399E-7,  Gradient Norm = 933.6635694330404
 
-- Outer Iteration 28: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 1.0478267358848825E-4,  Predicted = 1.0478219919535331E-4  (A/P: 1.0),  Trust Delta = 0.6911480264449816
   -- New Objective = 23328.28074247644,  Beta Change Norm = 2.2480413822676833E-7,  Gradient Norm = 5.538385572102319
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28230.498369845453,  Gradient Norm = 2.248032584752783E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4685.514381090245,  Predicted = 4506.656096079343  (A/P: 1.0397),  Trust Delta = 4.1751229311831877E-4
   -- New Objective = 23544.983988755208,  Beta Change Norm = 4.0094223959613487E-4,  Gradient Norm = 1773112.5532909825
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28230.498369845453,  Gradient Norm = 2.248032584752783E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4685.51438109942,  Predicted = 4506.6560960873  (A/P: 1.0397),  Trust Delta = 4.17512293119143E-4
   -- New Objective = 23544.983988746033,  Beta Change Norm = 4.0094223959684285E-4,  Gradient Norm = 1773112.553299248
 
-- Outer Iteration 2: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 55.08478867724625,  Predicted = 54.14637164341834  (A/P: 1.0173),  Trust Delta = 4.17512293119143E-4
   -- New Objective = 23489.899200068787,  Beta Change Norm = 1.0409436207463608E-4,  Gradient Norm = 43863.264421495034
 
-- Outer Iteration 3: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.0425455416625482,  Predicted = 0.0425210724118125  (A/P: 1.0006),  Trust Delta = 4.17512293119143E-4
   -- New Objective = 23489.856654527124,  Beta Change Norm = 3.4860035525762597E-6,  Gradient Norm = 3473.0626928235138
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 28230.498369845453,  Gradient Norm = 2.248032584752783E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 4685.514381099514,  Predicted = 4506.65609608738  (A/P: 1.0397),  Trust Delta = 4.175122931191516E-4
   -- New Objective = 23544.98398874594,  Beta Change Norm = 4.0094223959685E-4,  Gradient Norm = 1773112.5532993283
 
-- Outer Iteration 2: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 55.08478867725353,  Predicted = 54.14637164342853  (A/P: 1.0173),  Trust Delta = 4.175122931191516E-4
   -- New Objective = 23489.899200068685,  Beta Change Norm = 1.0409436207466114E-4,  Gradient Norm = 43863.264421514185
 
-- Outer Iteration 3: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.0425455416625482,  Predicted = 0.04252107241182405  (A/P: 1.0006),  Trust Delta = 4.175122931191516E-4
   -- New Objective = 23489.856654527022,  Beta Change Norm = 3.486003552576232E-6,  Gradient Norm = 3473.0626928274914
 
-- Outer Iteration 4: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 1.3618165665211563,  Predicted = 1.3618300786307123  (A/P: 1.0),  Trust Delta = 0.0016700491724766064
   -- New Objective = 23488.4948379605,  Beta Change Norm = 4.1751229311915155E-4,  Gradient Norm = 22750.17168667339
 
-- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 5.399070530791505,  Predicted = 5.398983505048864  (A/P: 1.0),  Trust Delta = 0.006680196689906426
   -- New Objective = 23483.09576742971,  Beta Change Norm = 0.0016700491724766064,  Gradient Norm = 3277.243187563727
 
-- Outer Iteration 6: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 19.04347611745834,  Predicted = 19.043530665204045  (A/P: 1.0),  Trust Delta = 0.026720786759625702
   -- New Objective = 23464.05229131225,  Beta Change Norm = 0.006680196689906425,  Gradient Norm = 12014.210859652962
 
-- Outer Iteration 7: Had 3 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 41.1452816738456,  Predicted = 41.09983187966176  (A/P: 1.0011),  Trust Delta = 0.03287099410333282
   -- New Objective = 23422.907009638406,  Beta Change Norm = 0.0267207867596257,  Gradient Norm = 30568.57509207747
 
-- Outer Iteration 8: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.011022901580872713,  Predicted = 0.01101972206380871  (A/P: 1.0003),  Trust Delta = 0.03287099410333282
   -- New Objective = 23422.895986736825,  Beta Change Norm = 7.209836919526366E-7,  Gradient Norm = 1251.6678613601161
 
-- Outer Iteration 9: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 13.978709434930352,  Predicted = 13.974847661855666  (A/P: 1.0003),  Trust Delta = 0.033257209609599145
   -- New Objective = 23408.917277301895,  Beta Change Norm = 0.03287099410333282,  Gradient Norm = 15328.859090870203
 
-- Outer Iteration 10: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.004639191432943335,  Predicted = 0.004638318429644279  (A/P: 1.0002),  Trust Delta = 0.033257209609599145
   -- New Objective = 23408.91263811046,  Beta Change Norm = 1.0519781798129972E-6,  Gradient Norm = 335.02440722968106
 
-- Outer Iteration 11: Had 4 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 6.3662166226313275,  Predicted = 6.366164181244294  (A/P: 1.0),  Trust Delta = 0.06697443441569934
   -- New Objective = 23402.54642148783,  Beta Change Norm = 0.033257209609599145,  Gradient Norm = 2307.51433331859
 
-- Outer Iteration 12: Had 7 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 11.15761233725425,  Predicted = 11.149031539741129  (A/P: 1.0008),  Trust Delta = 0.10211243265236637
   -- New Objective = 23391.388809150576,  Beta Change Norm = 0.06697443441569932,  Gradient Norm = 71503.76594916714
 
-- Outer Iteration 13: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.600488582651451,  Predicted = 0.6001508149708464  (A/P: 1.0006),  Trust Delta = 0.10211243265236637
   -- New Objective = 23390.788320567925,  Beta Change Norm = 1.6834966454979097E-5,  Gradient Norm = 840.347770623361
 
-- Outer Iteration 14: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 19.757560698417365,  Predicted = 19.765740859017424  (A/P: 0.9996),  Trust Delta = 0.24398632984391763
   -- New Objective = 23371.030759869507,  Beta Change Norm = 0.10211243265236637,  Gradient Norm = 48752.608649999434
 
-- Outer Iteration 15: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.2778570437403687,  Predicted = 0.2779044747609064  (A/P: 0.9998),  Trust Delta = 0.24398632984391763
   -- New Objective = 23370.752902825767,  Beta Change Norm = 1.1465782794751552E-5,  Gradient Norm = 490.74546662109907
 
-- Outer Iteration 16: Had 7 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 35.87021488765458,  Predicted = 35.87139479548606  (A/P: 1.0),  Trust Delta = 0.5998608188063514
   -- New Objective = 23334.882687938112,  Beta Change Norm = 0.24398632984391766,  Gradient Norm = 114111.92221839691
 
-- Outer Iteration 17: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 1.5378956803469919,  Predicted = 1.5387644534721423  (A/P: 0.9994),  Trust Delta = 0.5998608188063514
   -- New Objective = 23333.344792257765,  Beta Change Norm = 2.7062912410241883E-5,  Gradient Norm = 1827.5390228667288
 
-- Outer Iteration 18: Had 8 CG iterations, trust bound REACHED
   -- Obj.Reduction:  Actual = 55.357956099222065,  Predicted = 55.4569565918232  (A/P: 0.9982),  Trust Delta = 0.8894009952541146
   -- New Objective = 23277.986836158543,  Beta Change Norm = 0.5998608188063514,  Gradient Norm = 30684.985380679016
 
-- Outer Iteration 19: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 0.017656232350418577,  Predicted = 0.017644837185100737  (A/P: 1.0006),  Trust Delta = 0.8894009952541146
   -- New Objective = 23277.969179926193,  Beta Change Norm = 1.984483688888249E-6,  Gradient Norm = 137.4544897991739
 
-- Outer Iteration 20: Had 10 CG iterations
   -- Obj.Reduction:  Actual = 13.663528841007064,  Predicted = 13.567360160458493  (A/P: 1.0071),  Trust Delta = 0.8894009952541146
   -- New Objective = 23264.305651085186,  Beta Change Norm = 0.4790943358344082,  Gradient Norm = 15753.857353150117
 
-- Outer Iteration 21: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 0.002973383649077732,  Predicted = 0.002972929227391132  (A/P: 1.0002),  Trust Delta = 0.8894009952541146
   -- New Objective = 23264.302677701537,  Beta Change Norm = 3.774223875140864E-7,  Gradient Norm = 1264.8256951027395
 
-- Outer Iteration 22: Had 2 CG iterations
   -- Obj.Reduction:  Actual = 1.9038948812521994E-4,  Predicted = 1.9038853266221582E-4  (A/P: 1.0),  Trust Delta = 0.8894009952541146
   -- New Objective = 23264.30248731205,  Beta Change Norm = 3.019597152404477E-7,  Gradient Norm = 10.843636813611397
Termination / Convergence condition satisfied.
BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...
-- Initially:  Objective = 56433.27085246851,  Gradient Norm = 4.469119635504498E7,  Trust Delta = 0.001024586722033724
-- Outer Iteration 1: Had 1 CG iterations
   -- Obj.Reduction:  Actual = 9262.134848396847,  Predicted = 8912.05664441991  (A/P: 1.0393),  Trust Delta = 4.151353931079128E-4
   -- New Objective = 47171.13600407166,  Beta Change Norm = 3.9882828705765304E-4,  Gradient Norm = 3491408.3116066065
Termination / Convergence condition satisfied.
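
The log above contains seven training runs. That matches how Spark's CrossValidator works: it fits every grid point on each fold's training split, then refits the best setting on the full training set, giving 3 x 2 + 1 = 7 fits. The first six runs start from initial objectives 28202.77 and 28230.50 (one value per fold), and the last run starts from 56433.27, the same value as the earlier fit on the full training set. A plain-Scala sketch of the bookkeeping, with a simple round-robin fold assignment that is only illustrative (Spark assigns folds randomly):

```scala
// Split row indices 0 until n into k folds of near-equal size (round-robin, no shuffling).
def kFolds(n: Int, k: Int): Seq[Seq[Int]] =
  (0 until k).map(fold => (0 until n).filter(_ % k == fold))

val regParams = Seq(0.1, 1e-3, 1e-6) // grid from the cell above
val numFolds = 2

// One fit per (held-out fold, regParam) pair, iterating folds first and the grid second.
val cvFits = for (fold <- 0 until numFolds; reg <- regParams) yield (fold, reg)

// Plus one final refit of the best setting on the full training set.
val totalFits = cvFits.size + 1
```

Within each fold, the runs take 1, 3, and then 28 or 22 outer iterations, so in this log weaker regularization goes along with slower convergence.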

Evaluate the cross-validated model


In [1]:
val cvpredictions = cvmodel.transform(testDataset.withColumnRenamed("label", "OriginalLabel"))
cvpredictions.registerTempTable("cvpredictions")
sqlContext.sql("SELECT sqrt(avg(square(OriginalLabel - prediction))) FROM cvpredictions").show


+------------------+
|               _c0|
+------------------+
|0.8557362892866146|
+------------------+

Homework ;)

Read http://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression and perform cross-validation on other hyperparameters, for example icpt, tol, maxOuterIter, and maxInnerIter.
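
When tuning several hyperparameters at once, the grid is the Cartesian product of the candidate values, so its size grows multiplicatively. The sketch below builds such a grid in plain Scala. The parameter names follow the homework text, but the candidate values are hypothetical and should be checked against the SystemML documentation before use:

```scala
// Hypothetical candidate values for each hyperparameter.
val grid: Seq[(String, Seq[Double])] = Seq(
  "icpt" -> Seq(0.0, 1.0),
  "tol" -> Seq(1e-2, 1e-4),
  "maxOuterIter" -> Seq(50.0, 100.0)
)

// Cartesian product: every combination of one value per parameter.
val combos: Seq[Map[String, Double]] =
  grid.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
    for (m <- acc; v <- values) yield m + (name -> v)
  }

// Number of training runs for k-fold cross-validation plus the final refit.
def fitsNeeded(numCombos: Int, numFolds: Int): Int = numCombos * numFolds + 1
```

With these eight combinations and k = 2, cross-validation would need 17 training runs, so it can pay off to keep each grid small.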