In [ ]:
import numpy as np
import pandas as pd
import mmlspark
# help(mmlspark)
Now let's read the data and split it into train and test sets:
In [ ]:
dataFile = "AdultCensusIncome.csv"
import os, urllib.request
# Download the dataset once if it is not already present locally.
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFile, dataFile)
# Note: column names in this CSV carry a leading space (e.g. " education").
data = spark.createDataFrame(pd.read_csv(dataFile, dtype={" hours-per-week": np.float64}))
data = data.select([" education", " marital-status", " hours-per-week", " income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()
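As an optional sanity check (not part of the original walkthrough), you can inspect the class balance of the label before training; the cell below is a small sketch using only standard PySpark aggregations.
In [ ]:
# Optional: inspect the class balance of the label column in the training split.
train.groupBy(" income").count().toPandas()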
TrainClassifier can be used to initialize and fit a model; it wraps SparkML classifiers.
You can use help(mmlspark.TrainClassifier) to view the different parameters.
Note that it implicitly converts the data into the format expected by the algorithm: it
tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features
into a vector, and so on. The parameter numFeatures controls the number of hashed features.
In [ ]:
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol=" income", numFeatures=256).fit(train)
model.write().overwrite().save("adultCensusIncomeModel.mml")
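If you want a feel for how the hashing dimension affects the model, one approach is to fit several models with different numFeatures values. The loop below is only a sketch: it reuses the TrainClassifier call shown above, and the file names are illustrative.
In [ ]:
# Sketch: fit one model per hashing dimension and save each under an
# illustrative name for later comparison.
for n in [64, 256, 1024]:
    m = TrainClassifier(model=LogisticRegression(), labelCol=" income",
                        numFeatures=n).fit(train)
    m.write().overwrite().save("adultCensusIncomeModel_%d.mml" % n)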
After the model is trained, we score it against the test dataset and view metrics.
In [ ]:
from mmlspark import ComputeModelStatistics, TrainedClassifierModel
predictionModel = TrainedClassifierModel.load("adultCensusIncomeModel.mml")
prediction = predictionModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
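The set of columns ComputeModelStatistics returns depends on the evaluation type, so rather than guess the schema you can list the columns and view the result transposed; the cell below is a minimal sketch using only standard DataFrame calls.
In [ ]:
# List the metric columns, then transpose the result for easier reading.
print(metrics.columns)
metrics.toPandas().T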
Finally, we save the model under a separate name so it can be used in a scoring program.
In [ ]:
model.write().overwrite().save("AdultCensus.mml")
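As a sketch of what that scoring program might look like, the cell below reloads the saved model and applies it to fresh data. It uses only the TrainedClassifierModel.load and transform calls demonstrated above; "newData.csv" is a hypothetical file assumed to have the same schema as the training data.
In [ ]:
# Hypothetical scoring step: reload the saved model and score new records.
# "newData.csv" is a placeholder for data with the training schema.
scorer = TrainedClassifierModel.load("AdultCensus.mml")
newData = spark.createDataFrame(pd.read_csv("newData.csv",
                                            dtype={" hours-per-week": np.float64}))
scored = scorer.transform(newData)
scored.limit(10).toPandas()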