This notebook contains steps and code to create a predictive model for heart failure and then deploy that model to Watson Machine Learning so it can be used in an application.
The learning goals of this notebook are:
Load and explore data as an Apache Spark DataFrame
Create a Spark machine learning pipeline and train a Random Forest model
Persist the model in the Watson Machine Learning repository
Deploy the model and score new data
We'll be using a few libraries for this exercise: ibmos2spark, PixieDust, and the Watson Machine Learning client.
In [1]:
!pip install --upgrade ibmos2spark
!pip install --upgrade pixiedust
!pip install --upgrade watson-machine-learning-client
In this section you will load the data as an Apache Spark DataFrame and perform a basic exploration. Load the data to the Spark DataFrame from your associated Object Storage instance.
IMPORTANT: Follow the lab instructions to insert an Apache Spark DataFrame in the cell below.
IMPORTANT: Ensure the DataFrame is named df_data.
IMPORTANT: Add .option('inferSchema','True')\ to the inserted code.
In [2]:
import ibmos2spark

# @hidden_cell
# Cloud Object Storage credentials (placeholder values shown here; the lab's
# "Insert to code" helper generates your real values).
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-abc',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': '123'
}

configuration_name = 'os_bf77d026bb654c7ab3442e2b206102e9_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# 'csv' is the idiomatic short name for Spark's built-in CSVFileFormat.
df_data = spark.read\
    .format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'True')\
    .load(cos.url('patientdataV6.csv', 'mydatasciencesandbox-donotdelete-pr-qt96ckabo9cjpv'))
df_data.take(5)
Out[2]:
Explore the loaded data by using the following Apache® Spark DataFrame methods:
df_data.printSchema() to print the data schema
df_data.show() to print the top twenty records
df_data.describe().show() to print summary statistics for the numeric columns
df_data.count() to count all records
In [3]:
df_data.printSchema()
As you can see, the data contains ten fields. The HEARTFAILURE field is the one we would like to predict (the label).
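Before modeling, it can also help to see how the two HEARTFAILURE classes are balanced. This optional check uses only DataFrame methods already introduced above.
In [ ]:
# Optional: check how the label classes are distributed.
df_data.groupBy('HEARTFAILURE').count().show()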
In [4]:
df_data.show()
In [5]:
df_data.describe().show()
In [6]:
df_data.count()
Out[6]:
As you can see, the data set contains 10800 records.
In [7]:
import pixiedust
With PixieDust's display() method you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, or maps.
To explore a data set, choose the desired chart type from the drop-down menu, then configure the chart and display options.
In [ ]:
display(df_data)
In [9]:
# Split the data 80/20 into training and test sets; the second argument (24)
# is a random seed, so the split is reproducible.
split_data = df_data.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
As you can see, our data has been successfully split into two data sets:
In this section you will create a Spark machine learning pipeline and then train the model. In the first step you need to import the Spark machine learning packages that will be needed in the subsequent steps. A sequence of data-processing steps is called a data pipeline: each step processes the data and passes the result to the next step, which lets you transform the raw input data and fit your model in a single flow.
In [10]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
In the following step, convert all the string fields to numeric ones by using the StringIndexer transformer.
In [11]:
stringIndexer_label = StringIndexer(inputCol="HEARTFAILURE", outputCol="label").fit(df_data)
stringIndexer_sex = StringIndexer(inputCol="SEX", outputCol="SEX_IX")
stringIndexer_famhist = StringIndexer(inputCol="FAMILYHISTORY", outputCol="FAMILYHISTORY_IX")
stringIndexer_smoker = StringIndexer(inputCol="SMOKERLAST5YRS", outputCol="SMOKERLAST5YRS_IX")
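As an optional sanity check, you can fit one of these indexers on its own to see how the string values map to numeric indices (by default, Spark assigns 0.0 to the most frequent value):
In [ ]:
# Fit and apply a single indexer to inspect the string-to-index mapping.
stringIndexer_sex.fit(df_data).transform(df_data) \
    .select('SEX', 'SEX_IX').distinct().show()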
In the following step, create a feature vector by combining all features together.
In [12]:
vectorAssembler_features = VectorAssembler(inputCols=["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX_IX","FAMILYHISTORY_IX","SMOKERLAST5YRS_IX","EXERCISEMINPERWEEK"], outputCol="features")
Next, define the estimator you want to use for classification. A Random Forest classifier is used in the following example.
In [13]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
Finally, convert the indexed labels back to the original labels.
In [14]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)
In [15]:
transform_df_pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features])
transformed_df = transform_df_pipeline.fit(df_data).transform(df_data)
transformed_df.show()
Let's build the pipeline now. A pipeline consists of transformers and an estimator.
In [16]:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, rf, labelConverter])
Now, you can train your Random Forest model by using the previously defined pipeline and training data.
In [17]:
model_rf = pipeline_rf.fit(train_data)
You can check your model's accuracy now. To evaluate the model, use the test data.
In [18]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
You can now tune your model to achieve better accuracy. For simplicity, the full tuning step is omitted from this example, but an optional sketch follows.
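If you do want to experiment, the sketch below shows what a minimal tuning step could look like using Spark's ParamGridBuilder and CrossValidator. The grid values are hypothetical and chosen only for illustration; widen or shrink them to fit your time budget.
In [ ]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Hypothetical search grid over two Random Forest hyperparameters.
param_grid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20, 50]) \
    .addGrid(rf.maxDepth, [4, 6, 8]) \
    .build()

cv = CrossValidator(estimator=pipeline_rf,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluatorRF,
                    numFolds=3)

# Cross-validate on the training data, then evaluate the best model on test data.
cv_model = cv.fit(train_data)
print('Tuned accuracy = %g' % evaluatorRF.evaluate(cv_model.transform(test_data)))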
In this section you will learn how to store your pipeline and model in Watson Machine Learning repository by using Python client libraries. First, you must import client libraries.
IMPORTANT: Update the wml_credentials variable below. Copy and paste the entire credentials dictionary, which can be found on the Service Credentials tab of the Watson Machine Learning service instance created on the IBM Cloud.
In [ ]:
wml_credentials = {
    "apikey": "xyz",
    "iam_apikey_description": "Auto-generated for key abc",
    "iam_apikey_name": "Service credentials-1",
    "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
    "iam_serviceid_crn": "crn:v1:bluemix:public:iam-identity::a123",
    "instance_id": "xyz",
    "url": "https://us-south.ml.cloud.ibm.com"
}
print(wml_credentials)
In [21]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient
client = WatsonMachineLearningAPIClient(wml_credentials)
print(client.version)
TIP: Update the cell below with your name, email, and the name you wish to give your model.
Create the model artifact (abstraction layer).
In [22]:
model_props = {client.repository.ModelMetaNames.AUTHOR_NAME: "IBM",
client.repository.ModelMetaNames.NAME: "Heart Failure Prediction Model"}
published_model = client.repository.store_model(model=model_rf, pipeline=pipeline_rf, meta_props=model_props, training_data=train_data)
In [23]:
import json
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))
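To confirm that the model was saved, you can also list the models stored in the repository. The method below is from the watson-machine-learning-client library of this era; verify it against your installed client version.
In [ ]:
# List all models currently stored in this Watson Machine Learning repository.
client.repository.list_models()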
In [24]:
loaded_model = client.repository.load(published_model_uid)
print(loaded_model)
Call the model against the test data to verify that it has been loaded correctly, and examine the top three results.
In [25]:
test_predictions = loaded_model.transform(test_data)
test_predictions.select('probability', 'predictedLabel').show(n=3, truncate=False)
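A quick, optional way to see how the predictions line up with the actual labels is a confusion-matrix style count over the test set:
In [ ]:
# Cross-tabulate actual HEARTFAILURE values against the predicted labels.
test_predictions.groupBy('HEARTFAILURE', 'predictedLabel').count().show()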
You can now switch to the Watson Machine Learning console to deploy the model and then test it in an application, or continue within the notebook to deploy the model using the APIs.
In [26]:
created_deployment = client.deployments.create(published_model_uid, name="Heart Failure prediction")
In [27]:
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)
print(scoring_endpoint)
In [28]:
client.deployments.list()
In [29]:
scoring_payload = {
    "fields": ["AVGHEARTBEATSPERMIN", "PALPITATIONSPERDAY", "CHOLESTEROL", "BMI", "AGE", "SEX", "FAMILYHISTORY", "SMOKERLAST5YRS", "EXERCISEMINPERWEEK"],
    "values": [[100, 85, 242, 24, 44, "F", "Y", "Y", 125]]
}
predictions = client.deployments.score(scoring_endpoint, scoring_payload)
print(json.dumps(predictions, indent=2))
# Index 18 is the position of the predictedLabel field in the returned values.
print(predictions['values'][0][18])
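The deployment can also be called as a plain REST endpoint from outside this notebook. The sketch below assumes the standard IBM Cloud IAM token exchange and the ML-Instance-ID header used by this generation of the service; verify both against your service documentation before relying on them.
In [ ]:
import requests

# Exchange the service API key for an IAM access token (endpoint and grant
# type follow the standard IBM Cloud IAM pattern -- an assumption here).
token_response = requests.post(
    'https://iam.ng.bluemix.net/identity/token',
    data={'apikey': wml_credentials['apikey'],
          'grant_type': 'urn:ibm:params:oauth:grant-type:apikey'},
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
iam_token = token_response.json()['access_token']

# Score the same payload over HTTP; ML-Instance-ID is assumed to be required.
headers = {'Authorization': 'Bearer ' + iam_token,
           'ML-Instance-ID': wml_credentials['instance_id']}
response = requests.post(scoring_endpoint, json=scoring_payload, headers=headers)
print(json.dumps(response.json(), indent=2))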
In [30]:
# Score the payload again and pull out the predicted label (index 18).
answer = client.deployments.score(scoring_endpoint, scoring_payload)['values'][0][18]
print('Is a 44-year-old female who smokes, with a low BMI, at risk of heart failure?: {}'.format(answer))