In [ ]:
import numpy as np
import pandas as pd
import mmlspark
# help(mmlspark)
Now let's read the data and split it into train and test sets:
In [ ]:
dataFile = "AdultCensusIncome.csv"
import os, urllib.request
# Download the dataset once if it is not already present locally.
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFile, dataFile)
# Note: column names in this CSV carry a leading space (e.g. " education").
data = spark.createDataFrame(pd.read_csv(dataFile, dtype={" hours-per-week": np.float64}))
data = data.select([" education", " marital-status", " hours-per-week", " income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()
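As an optional sanity check (not part of the original walkthrough), you can inspect the class balance of the label before training; the cell below is a small sketch using only standard PySpark aggregations.
In [ ]:
# Optional: inspect the class balance of the label column in the training split.
train.groupBy(" income").count().toPandas()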
TrainClassifier can be used to initialize and fit a model; it wraps SparkML classifiers.
You can use help(mmlspark.TrainClassifier) to view the different parameters.
Note that it implicitly converts the data into the format expected by the algorithm: it
tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features
into a vector, and so on. The parameter numFeatures controls the number of hashed features.
In [ ]:
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol=" income", numFeatures=256).fit(train)
model.write().overwrite().save("adultCensusIncomeModel.mml")
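If you want a feel for how the hashing dimension affects the model, one approach is to fit several models with different numFeatures values. The loop below is only a sketch: it reuses the TrainClassifier call shown above, and the file names are illustrative.
In [ ]:
# Sketch: fit one model per hashing dimension and save each under an
# illustrative name for later comparison.
for n in [64, 256, 1024]:
    m = TrainClassifier(model=LogisticRegression(), labelCol=" income",
                        numFeatures=n).fit(train)
    m.write().overwrite().save("adultCensusIncomeModel_%d.mml" % n)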
After the model is trained, we score it against the test dataset and view metrics.
In [ ]:
from mmlspark import ComputeModelStatistics, TrainedClassifierModel
predictionModel = TrainedClassifierModel.load("adultCensusIncomeModel.mml")
prediction = predictionModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
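The set of columns ComputeModelStatistics returns depends on the evaluation type, so rather than guess the schema you can list the columns and view the result transposed; the cell below is a minimal sketch using only standard DataFrame calls.
In [ ]:
# List the metric columns, then transpose the result for easier reading.
print(metrics.columns)
metrics.toPandas().T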
Finally, we save the model under a separate name so it can be used in a scoring program.
In [ ]:
model.write().overwrite().save("AdultCensus.mml")
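As a sketch of what that scoring program might look like, the cell below reloads the saved model and applies it to fresh data. It uses only the TrainedClassifierModel.load and transform calls demonstrated above; "newData.csv" is a hypothetical file assumed to have the same schema as the training data.
In [ ]:
# Hypothetical scoring step: reload the saved model and score new records.
# "newData.csv" is a placeholder for data with the training schema.
scorer = TrainedClassifierModel.load("AdultCensus.mml")
newData = spark.createDataFrame(pd.read_csv("newData.csv",
                                            dtype={" hours-per-week": np.float64}))
scored = scorer.transform(newData)
scored.limit(10).toPandas()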