Predict heart failure with Watson Machine Learning

This notebook contains steps and code to create a predictive model to predict heart failure and then deploy that model to Watson Machine Learning so it can be used in an application.

Learning Goals

The learning goals of this notebook are:

  • Load a CSV file into the Object Storage service linked to your Watson Studio project
  • Create an Apache Spark machine learning model
  • Train and evaluate a model
  • Persist a model in a Watson Machine Learning repository

1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

  • Create a Watson Machine Learning service instance (a free plan is offered) and associate it with your project
  • Upload heart failure data to the Object Storage service that is part of Watson Studio

We'll be using a few libraries for this exercise:

  1. Watson Machine Learning Client: A client library for working with the Watson Machine Learning service on IBM Cloud.
  2. PixieDust: A Python helper library for Jupyter notebooks.
  3. ibmos2spark: Facilitates data I/O between Spark and IBM Object Storage services.

In [1]:
!pip install --upgrade ibmos2spark
!pip install --upgrade pixiedust
!pip install --upgrade watson-machine-learning-client


Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190815201445-0000
KERNEL_ID = f22c7b07-99ef-411e-9000-1068fa3b4da8
Collecting ibmos2spark
Installing collected packages: ibmos2spark
Successfully installed ibmos2spark-1.0.1
Collecting pixiedust
Building wheels for collected packages: pixiedust, mpld3
Successfully built pixiedust mpld3
tensorflow 1.13.1 requires tensorboard<1.14.0,>=1.13.0, which is not installed.
spyder 3.3.3 requires pyqt5<=5.12; python_version >= "3", which is not installed.
ibm-cos-sdk-core 2.4.3 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.3 which is incompatible.
botocore 1.12.82 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.3 which is incompatible.
Successfully installed astunparse-1.6.2 certifi-2019.6.16 chardet-3.0.4 colour-0.1.5 geojson-2.5.0 idna-2.8 lxml-4.4.1 markdown-3.1.1 mpld3-0.3 pixiedust-1.1.17 requests-2.22.0 setuptools-41.1.0 six-1.12.0 urllib3-1.25.3 wheel-0.33.4
Collecting watson-machine-learning-client
Building wheels for collected packages: tabulate, ibm-cos-sdk, ibm-cos-sdk-core, ibm-cos-sdk-s3transfer
Successfully built tabulate ibm-cos-sdk ibm-cos-sdk-core ibm-cos-sdk-s3transfer
Successfully installed certifi-2019.6.16 chardet-3.0.4 docutils-0.15.2 ibm-cos-sdk-2.5.2 ibm-cos-sdk-core-2.5.2 ibm-cos-sdk-s3transfer-2.5.2 idna-2.8 jmespath-0.9.4 lomond-0.3.3 numpy-1.17.0 pandas-0.25.0 python-dateutil-2.8.0 pytz-2019.2 requests-2.22.0 six-1.12.0 tabulate-0.8.3 tqdm-4.33.0 urllib3-1.25.3 watson-machine-learning-client-1.0.371

2. Load and explore data

In this section you will load the data into an Apache Spark DataFrame from your associated Object Storage instance and perform a basic exploration.

IMPORTANT: Follow the lab instructions to insert an Apache Spark DataFrame in the cell below.

IMPORTANT: Ensure the DataFrame is named df_data.

IMPORTANT: Add .option('inferSchema','True')\ to the inserted code.


In [2]:
import ibmos2spark
# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-abc',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': '123'
}

configuration_name = 'os_bf77d026bb654c7ab3442e2b206102e9_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema','True')\
  .load(cos.url('patientdataV6.csv', 'mydatasciencesandbox-donotdelete-pr-qt96ckabo9cjpv'))
df_data.take(5)


Out[2]:
[Row(AVGHEARTBEATSPERMIN=93, PALPITATIONSPERDAY=22, CHOLESTEROL=163, BMI=25, HEARTFAILURE='N', AGE=49, SEX='F', FAMILYHISTORY='N', SMOKERLAST5YRS='N', EXERCISEMINPERWEEK=110),
 Row(AVGHEARTBEATSPERMIN=108, PALPITATIONSPERDAY=22, CHOLESTEROL=181, BMI=24, HEARTFAILURE='N', AGE=32, SEX='F', FAMILYHISTORY='N', SMOKERLAST5YRS='N', EXERCISEMINPERWEEK=192),
 Row(AVGHEARTBEATSPERMIN=86, PALPITATIONSPERDAY=0, CHOLESTEROL=239, BMI=20, HEARTFAILURE='N', AGE=60, SEX='F', FAMILYHISTORY='N', SMOKERLAST5YRS='N', EXERCISEMINPERWEEK=121),
 Row(AVGHEARTBEATSPERMIN=80, PALPITATIONSPERDAY=36, CHOLESTEROL=164, BMI=31, HEARTFAILURE='Y', AGE=45, SEX='F', FAMILYHISTORY='Y', SMOKERLAST5YRS='N', EXERCISEMINPERWEEK=141),
 Row(AVGHEARTBEATSPERMIN=66, PALPITATIONSPERDAY=36, CHOLESTEROL=185, BMI=23, HEARTFAILURE='N', AGE=39, SEX='F', FAMILYHISTORY='N', SMOKERLAST5YRS='N', EXERCISEMINPERWEEK=63)]

Explore the loaded data by using the following Apache Spark DataFrame methods:

  • df_data.printSchema() to print the data schema
  • df_data.show() to print the top twenty records
  • df_data.describe().show() to view basic statistics for each column
  • df_data.count() to count all records

In [3]:
df_data.printSchema()


root
 |-- AVGHEARTBEATSPERMIN: integer (nullable = true)
 |-- PALPITATIONSPERDAY: integer (nullable = true)
 |-- CHOLESTEROL: integer (nullable = true)
 |-- BMI: integer (nullable = true)
 |-- HEARTFAILURE: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- SEX: string (nullable = true)
 |-- FAMILYHISTORY: string (nullable = true)
 |-- SMOKERLAST5YRS: string (nullable = true)
 |-- EXERCISEMINPERWEEK: integer (nullable = true)

As you can see, the data contains ten fields. The HEARTFAILURE field is the one we would like to predict (the label).


In [4]:
df_data.show()


+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+
|AVGHEARTBEATSPERMIN|PALPITATIONSPERDAY|CHOLESTEROL|BMI|HEARTFAILURE|AGE|SEX|FAMILYHISTORY|SMOKERLAST5YRS|EXERCISEMINPERWEEK|
+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+
|                 93|                22|        163| 25|           N| 49|  F|            N|             N|               110|
|                108|                22|        181| 24|           N| 32|  F|            N|             N|               192|
|                 86|                 0|        239| 20|           N| 60|  F|            N|             N|               121|
|                 80|                36|        164| 31|           Y| 45|  F|            Y|             N|               141|
|                 66|                36|        185| 23|           N| 39|  F|            N|             N|                63|
|                125|                27|        201| 31|           N| 47|  M|            N|             N|                13|
|                 83|                27|        169| 20|           N| 71|  F|            Y|             N|               124|
|                107|                31|        199| 32|           N| 55|  F|            N|             N|                22|
|                 92|                28|        174| 22|           N| 44|  F|            N|             N|               107|
|                 84|                12|        206| 25|           N| 50|  M|            N|             N|               199|
|                 60|                 1|        194| 28|           N| 71|  M|            N|             N|                27|
|                134|                 7|        228| 34|           Y| 63|  F|            Y|             N|                92|
|                103|                 0|        237| 24|           N| 64|  F|            Y|             N|                34|
|                101|                39|        157| 20|           N| 49|  M|            N|             N|                33|
|                 92|                 2|        169| 26|           N| 36|  M|            N|             N|               217|
|                 80|                27|        234| 27|           N| 50|  M|            N|             N|                28|
|                 82|                14|        155| 30|           N| 70|  F|            N|             N|               207|
|                 63|                 9|        204| 26|           N| 42|  M|            N|             N|                88|
|                 83|                12|        209| 29|           N| 38|  M|            Y|             N|               220|
|                 80|                37|        157| 20|           N| 48|  M|            N|             N|                54|
+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+
only showing top 20 rows


In [5]:
df_data.describe().show()


+-------+-------------------+------------------+------------------+------------------+------------+------------------+-----+-------------+--------------+------------------+
|summary|AVGHEARTBEATSPERMIN|PALPITATIONSPERDAY|       CHOLESTEROL|               BMI|HEARTFAILURE|               AGE|  SEX|FAMILYHISTORY|SMOKERLAST5YRS|EXERCISEMINPERWEEK|
+-------+-------------------+------------------+------------------+------------------+------------+------------------+-----+-------------+--------------+------------------+
|  count|              10800|             10800|             10800|             10800|       10800|             10800|10800|        10800|         10800|             10800|
|   mean|  87.11509259259259|20.423148148148147|195.08027777777778| 26.35972222222222|        null|49.965185185185184| null|         null|          null|119.72953703703703|
| stddev| 19.744375148984474|12.165320351622993|26.136731865042325|3.8201472810942136|        null|13.079280962015586| null|         null|          null| 71.14706006382843|
|    min|                 48|                 0|               150|                20|           N|                28|    F|            N|             N|                 0|
|    max|                161|                45|               245|                34|           Y|                72|    M|            Y|             Y|               250|
+-------+-------------------+------------------+------------------+------------------+------------+------------------+-----+-------------+--------------+------------------+


In [6]:
df_data.count()


Out[6]:
10800

As you can see, the data set contains 10800 records.

3. Interactive visualizations with PixieDust


In [7]:
import pixiedust


Pixiedust database opened successfully
Table VERSION_TRACKER created successfully
Table METRICS_TRACKER created successfully

Share anonymous install statistics? (opt-out instructions)

PixieDust will record metadata on its environment the next time the package is installed or updated. The data is anonymized and aggregated to help plan for future releases, and records only the following values:

{
   "data_sent": currentDate,
   "runtime": "python",
   "application_version": currentPixiedustVersion,
   "space_id": nonIdentifyingUniqueId,
   "config": {
       "repository_id": "https://github.com/ibm-watson-data-lab/pixiedust",
       "target_runtimes": ["Data Science Experience"],
       "event_id": "web",
       "event_organizer": "dev-journeys"
   }
}
You can opt out by calling pixiedust.optOut() in a new cell.
Pixiedust version 1.1.17
Pixiedust runtime updated. Please restart kernel
Table SPARK_PACKAGES created successfully
Table USER_PREFERENCES created successfully
Table service_connections created successfully

Simple visualization using bar charts

With PixieDust's display() method you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, and maps. To explore a data set, choose the desired chart type from the drop-down menu, then configure the chart and display options.


In [ ]:
display(df_data)


The display() output is an interactive widget; run the notebook in a live Jupyter environment (for example, Watson Studio) to explore the chart.
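
If you prefer a static chart, or the interactive widget is unavailable, here is a quick non-interactive sketch of the same idea. It assumes matplotlib is available in the kernel:

import matplotlib.pyplot as plt

# Count records per HEARTFAILURE value; the grouped result is small,
# so it is safe to bring it to pandas for plotting
counts = df_data.groupBy('HEARTFAILURE').count().toPandas()
counts.plot(kind='bar', x='HEARTFAILURE', y='count', legend=False)
plt.ylabel('record count')
plt.show()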

4. Create a Spark machine learning model

In this section you will learn how to prepare the data, then create and train a Spark machine learning model.

4.1 Prepare data

In this subsection you will split your data into train and test data sets.


In [9]:
# Split the data 80/20 into train and test sets; the seed (24) makes the split reproducible
split_data = df_data.randomSplit([0.8, 0.20], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))


Number of training records: 8637
Number of testing records : 2163

As you can see, the data has been successfully split into two data sets:

  • The train data set, the larger of the two, is used to fit the model.
  • The test data set is held out for model evaluation and is used to test how well the model generalizes.

4.2 Create pipeline and train a model

In this section you will create a Spark machine learning pipeline and then train the model. In the first step you need to import the Spark machine learning packages that are used in the subsequent steps. A sequence of data-processing steps is called a pipeline: each step transforms the data and passes the result to the next step, which allows you to go from raw input data to a fitted model in a single pass.


In [10]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, convert all the string fields to numeric ones by using the StringIndexer transformer.


In [11]:
stringIndexer_label = StringIndexer(inputCol="HEARTFAILURE", outputCol="label").fit(df_data)
stringIndexer_sex = StringIndexer(inputCol="SEX", outputCol="SEX_IX")
stringIndexer_famhist = StringIndexer(inputCol="FAMILYHISTORY", outputCol="FAMILYHISTORY_IX")
stringIndexer_smoker = StringIndexer(inputCol="SMOKERLAST5YRS", outputCol="SMOKERLAST5YRS_IX")

In the following step, create a feature vector by combining all features together.


In [12]:
vectorAssembler_features = VectorAssembler(inputCols=["AVGHEARTBEATSPERMIN","PALPITATIONSPERDAY","CHOLESTEROL","BMI","AGE","SEX_IX","FAMILYHISTORY_IX","SMOKERLAST5YRS_IX","EXERCISEMINPERWEEK"], outputCol="features")

Next, define the estimator you want to use for classification. A Random Forest classifier is used in the following example.


In [13]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

Finally, use the IndexToString transformer to convert the indexed prediction labels back to the original labels.


In [14]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

In [15]:
transform_df_pipeline = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features])
transformed_df = transform_df_pipeline.fit(df_data).transform(df_data)
transformed_df.show()


+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+-----+------+----------------+-----------------+--------------------+
|AVGHEARTBEATSPERMIN|PALPITATIONSPERDAY|CHOLESTEROL|BMI|HEARTFAILURE|AGE|SEX|FAMILYHISTORY|SMOKERLAST5YRS|EXERCISEMINPERWEEK|label|SEX_IX|FAMILYHISTORY_IX|SMOKERLAST5YRS_IX|            features|
+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+-----+------+----------------+-----------------+--------------------+
|                 93|                22|        163| 25|           N| 49|  F|            N|             N|               110|  0.0|   1.0|             0.0|              0.0|[93.0,22.0,163.0,...|
|                108|                22|        181| 24|           N| 32|  F|            N|             N|               192|  0.0|   1.0|             0.0|              0.0|[108.0,22.0,181.0...|
|                 86|                 0|        239| 20|           N| 60|  F|            N|             N|               121|  0.0|   1.0|             0.0|              0.0|[86.0,0.0,239.0,2...|
|                 80|                36|        164| 31|           Y| 45|  F|            Y|             N|               141|  1.0|   1.0|             1.0|              0.0|[80.0,36.0,164.0,...|
|                 66|                36|        185| 23|           N| 39|  F|            N|             N|                63|  0.0|   1.0|             0.0|              0.0|[66.0,36.0,185.0,...|
|                125|                27|        201| 31|           N| 47|  M|            N|             N|                13|  0.0|   0.0|             0.0|              0.0|[125.0,27.0,201.0...|
|                 83|                27|        169| 20|           N| 71|  F|            Y|             N|               124|  0.0|   1.0|             1.0|              0.0|[83.0,27.0,169.0,...|
|                107|                31|        199| 32|           N| 55|  F|            N|             N|                22|  0.0|   1.0|             0.0|              0.0|[107.0,31.0,199.0...|
|                 92|                28|        174| 22|           N| 44|  F|            N|             N|               107|  0.0|   1.0|             0.0|              0.0|[92.0,28.0,174.0,...|
|                 84|                12|        206| 25|           N| 50|  M|            N|             N|               199|  0.0|   0.0|             0.0|              0.0|[84.0,12.0,206.0,...|
|                 60|                 1|        194| 28|           N| 71|  M|            N|             N|                27|  0.0|   0.0|             0.0|              0.0|[60.0,1.0,194.0,2...|
|                134|                 7|        228| 34|           Y| 63|  F|            Y|             N|                92|  1.0|   1.0|             1.0|              0.0|[134.0,7.0,228.0,...|
|                103|                 0|        237| 24|           N| 64|  F|            Y|             N|                34|  0.0|   1.0|             1.0|              0.0|[103.0,0.0,237.0,...|
|                101|                39|        157| 20|           N| 49|  M|            N|             N|                33|  0.0|   0.0|             0.0|              0.0|[101.0,39.0,157.0...|
|                 92|                 2|        169| 26|           N| 36|  M|            N|             N|               217|  0.0|   0.0|             0.0|              0.0|[92.0,2.0,169.0,2...|
|                 80|                27|        234| 27|           N| 50|  M|            N|             N|                28|  0.0|   0.0|             0.0|              0.0|[80.0,27.0,234.0,...|
|                 82|                14|        155| 30|           N| 70|  F|            N|             N|               207|  0.0|   1.0|             0.0|              0.0|[82.0,14.0,155.0,...|
|                 63|                 9|        204| 26|           N| 42|  M|            N|             N|                88|  0.0|   0.0|             0.0|              0.0|[63.0,9.0,204.0,2...|
|                 83|                12|        209| 29|           N| 38|  M|            Y|             N|               220|  0.0|   0.0|             1.0|              0.0|[83.0,12.0,209.0,...|
|                 80|                37|        157| 20|           N| 48|  M|            N|             N|                54|  0.0|   0.0|             0.0|              0.0|[80.0,37.0,157.0,...|
+-------------------+------------------+-----------+---+------------+---+---+-------------+--------------+------------------+-----+------+----------------+-----------------+--------------------+
only showing top 20 rows

Let's build the pipeline now. A pipeline consists of transformers and an estimator.


In [16]:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_famhist, stringIndexer_smoker, vectorAssembler_features, rf, labelConverter])

Now, you can train your Random Forest model by using the previously defined pipeline and training data.


In [17]:
model_rf = pipeline_rf.fit(train_data)

You can now check the model's accuracy. To evaluate the model, use the held-out test data.


In [18]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))


Accuracy = 0.871475
Test Error = 0.128525
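
Before tuning, it can help to see which inputs the forest actually relies on. A small sketch follows; the fitted Random Forest is stage 5 of model_rf, since the four string indexers and the vector assembler come first in the pipeline:

# Pair each input feature name with its importance score and print them,
# most important first
rf_model = model_rf.stages[5]
importances = rf_model.featureImportances.toArray()
for name, score in sorted(zip(vectorAssembler_features.getInputCols(), importances),
                          key=lambda pair: -pair[1]):
    print('{:<20s} {:.4f}'.format(name, score))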

You could now tune the model to achieve better accuracy. For simplicity, the tuning step is omitted from this example; a minimal sketch of what it might look like follows.
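
One common approach is a cross-validated grid search over the forest's hyperparameters. This sketch reuses pipeline_rf and evaluatorRF from the earlier cells; the grid values are illustrative assumptions, not tuned recommendations:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Illustrative grid: a few forest sizes and tree depths
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [10, 20, 50])
              .addGrid(rf.maxDepth, [4, 6, 8])
              .build())

cv = CrossValidator(estimator=pipeline_rf,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluatorRF,
                    numFolds=3)

# Fitting trains 9 candidate models x 3 folds, so this can take a while:
# cv_model = cv.fit(train_data)
# print("Best model accuracy:", evaluatorRF.evaluate(cv_model.transform(test_data)))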

5. Persist model

In this section you will learn how to store your pipeline and model in the Watson Machine Learning repository by using the Python client library. First, you must import the client library.

IMPORTANT: Update the wml_credentials variable below. Copy and paste the entire credential dictionary, which can be found on the Service Credentials tab of the Watson Machine Learning service instance created on the IBM Cloud.


In [ ]:
wml_credentials = {
  "apikey": "xyz",
  "iam_apikey_description": "Auto-generated for key abc",
  "iam_apikey_name": "Service credentials-1",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
  "iam_serviceid_crn": "crn:v1:bluemix:public:iam-identity::a123",
  "instance_id": "xyz",
  "url": "https://us-south.ml.cloud.ibm.com"
}

print(wml_credentials)

In [21]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient
client = WatsonMachineLearningAPIClient(wml_credentials)
print(client.version)


1.0.371

TIP: Update the cell below with your name and the name you wish to give your model.

Create model artifact (abstraction layer).


In [22]:
model_props = {client.repository.ModelMetaNames.AUTHOR_NAME: "IBM", 
               client.repository.ModelMetaNames.NAME: "Heart Failure Prediction Model"}
published_model = client.repository.store_model(model=model_rf, pipeline=pipeline_rf, meta_props=model_props, training_data=train_data)


/home/spark/shared/user-libs/python3/urllib3/connectionpool.py:851: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)

5.1 Save pipeline and model

In this subsection you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.


In [23]:
import json
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))


{
  "metadata": {
    "guid": "5dd30f48-f930-4373-acf8-2ab03a7742b2",
    "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2",
    "created_at": "2019-08-15T20:20:02.237Z",
    "modified_at": "2019-08-15T20:20:02.307Z"
  },
  "entity": {
    "runtime_environment": "spark-2.3",
    "learning_configuration_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2/learning_configuration",
    "author": {
      "name": "IBM"
    },
    "name": "Heart Failure Prediction Model",
    "label_col": "HEARTFAILURE",
    "learning_iterations_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2/learning_iterations",
    "training_data_schema": {
      "fields": [
        {
          "metadata": {},
          "name": "AVGHEARTBEATSPERMIN",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "PALPITATIONSPERDAY",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "CHOLESTEROL",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "BMI",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {
            "modeling_role": "target"
          },
          "name": "HEARTFAILURE",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "AGE",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "SEX",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "FAMILYHISTORY",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "SMOKERLAST5YRS",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "EXERCISEMINPERWEEK",
          "nullable": true,
          "type": "integer"
        }
      ],
      "type": "struct"
    },
    "feedback_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2/feedback",
    "latest_version": {
      "url": "https://us-south.ml.cloud.ibm.com/v3/ml_assets/models/5dd30f48-f930-4373-acf8-2ab03a7742b2/versions/f3049502-6871-4ac3-bac5-e0e22851e6fc",
      "guid": "f3049502-6871-4ac3-bac5-e0e22851e6fc",
      "created_at": "2019-08-15T20:20:02.307Z"
    },
    "model_type": "mllib-2.3",
    "deployments": {
      "count": 0,
      "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2/deployments"
    },
    "evaluation_metrics_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/published_models/5dd30f48-f930-4373-acf8-2ab03a7742b2/evaluation_metrics",
    "input_data_schema": {
      "fields": [
        {
          "metadata": {},
          "name": "AVGHEARTBEATSPERMIN",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "PALPITATIONSPERDAY",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "CHOLESTEROL",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "BMI",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "AGE",
          "nullable": true,
          "type": "integer"
        },
        {
          "metadata": {},
          "name": "SEX",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "FAMILYHISTORY",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "SMOKERLAST5YRS",
          "nullable": true,
          "type": "string"
        },
        {
          "metadata": {},
          "name": "EXERCISEMINPERWEEK",
          "nullable": true,
          "type": "integer"
        }
      ],
      "type": "struct"
    }
  }
}

5.2 Load model to verify that it was saved correctly

You can load your model to make sure that it was saved correctly.


In [24]:
loaded_model = client.repository.load(published_model_uid)
print(loaded_model)


PipelineModel_44cdae4f855e6ec2691d

Call the model against the test data to verify that it was loaded correctly, and examine the top three results.


In [25]:
test_predictions = loaded_model.transform(test_data)
test_predictions.select('probability', 'predictedLabel').show(n=3, truncate=False)


+----------------------------------------+--------------+
|probability                             |predictedLabel|
+----------------------------------------+--------------+
|[0.9280472309536159,0.07195276904638422]|N             |
|[0.9241890079638004,0.07581099203619959]|N             |
|[0.9246185696576192,0.07538143034238076]|N             |
+----------------------------------------+--------------+
only showing top 3 rows

Congratulations, you've successfully created a predictive model and saved it in the Watson Machine Learning service.

You can now switch to the Watson Machine Learning console to deploy the model and then test it in an application, or continue within the notebook to deploy the model using the APIs.


6. Access Watson Machine Learning models and deployments through APIs

Instead of jumping from your notebook into a web browser, you can manage your model and deployments through a set of APIs.

Deploy model to WML Service


In [26]:
created_deployment = client.deployments.create(published_model_uid, name="Heart Failure prediction")



#######################################################################################

Synchronous deployment creation for uid: '5dd30f48-f930-4373-acf8-2ab03a7742b2' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='a65f0386-912e-452c-bf87-63c10d50e73e'
------------------------------------------------------------------------------------------------



In [27]:
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)

print(scoring_endpoint)


https://us-south.ml.cloud.ibm.com/v3/wml_instances/ef880149-8252-4e31-af75-8a3b8a9d6b59/deployments/a65f0386-912e-452c-bf87-63c10d50e73e/online

List model deployments


In [28]:
client.deployments.list()


------------------------------------  -----------------------------------------  ------  --------------  ------------------------  ---------  -------------
GUID                                  NAME                                       TYPE    STATE           CREATED                   FRAMEWORK  ARTIFACT TYPE
a65f0386-912e-452c-bf87-63c10d50e73e  Heart Failure prediction                   online  DEPLOY_SUCCESS  2019-08-15T20:20:26.390Z  mllib-2.3  model
6b49bd0e-b8e6-4910-bf4f-410d5270c0b5  Shopping Recommendation Engine Deployment  online  DEPLOY_SUCCESS  2019-08-15T13:25:44.170Z  mllib-2.3  model
------------------------------------  -----------------------------------------  ------  --------------  ------------------------  ---------  -------------

6.1 Invoke prediction model deployment


In [29]:
scoring_payload = {
    "fields": ["AVGHEARTBEATSPERMIN", "PALPITATIONSPERDAY", "CHOLESTEROL", "BMI", "AGE",
               "SEX", "FAMILYHISTORY", "SMOKERLAST5YRS", "EXERCISEMINPERWEEK"],
    "values": [[100, 85, 242, 24, 44, "F", "Y", "Y", 125]]
}

predictions = client.deployments.score(scoring_endpoint, scoring_payload)

print(json.dumps(predictions, indent=2))
print(predictions['values'][0][18])  # index 18 is the predictedLabel field


{
  "fields": [
    "AVGHEARTBEATSPERMIN",
    "PALPITATIONSPERDAY",
    "CHOLESTEROL",
    "BMI",
    "AGE",
    "SEX",
    "FAMILYHISTORY",
    "SMOKERLAST5YRS",
    "EXERCISEMINPERWEEK",
    "HEARTFAILURE",
    "label",
    "SEX_IX",
    "FAMILYHISTORY_IX",
    "SMOKERLAST5YRS_IX",
    "features",
    "rawPrediction",
    "probability",
    "prediction",
    "predictedLabel"
  ],
  "values": [
    [
      100,
      85,
      242,
      24,
      44,
      "F",
      "Y",
      "Y",
      125,
      "N",
      0.0,
      1.0,
      1.0,
      1.0,
      [
        100.0,
        85.0,
        242.0,
        24.0,
        44.0,
        1.0,
        1.0,
        1.0,
        125.0
      ],
      [
        4.7413121526952215,
        15.258687847304778
      ],
      [
        0.2370656076347611,
        0.7629343923652389
      ],
      1.0,
      "Y"
    ]
  ]
}
Y

Narrow the prediction results down to just the predicted label.


In [30]:
result = client.deployments.score(scoring_endpoint, scoring_payload)
print('Is a 44 year old female that smokes with a low BMI at risk of Heart Failure?: {}'.format(result['values'][0][18]))


Is a 44 year old female that smokes with a low BMI at risk of Heart Failure?: Y
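
An application outside this notebook would typically call the same scoring endpoint over plain REST rather than through the Python client. The following is a minimal sketch, assuming IAM token authentication and the ML-Instance-ID header used by the v3 WML REST API; variable names such as iam_token are illustrative.

import requests

# Exchange the service API key for an IAM access token (assumes IAM auth)
token_response = requests.post(
    'https://iam.cloud.ibm.com/identity/token',
    data={'grant_type': 'urn:ibm:params:oauth:grant-type:apikey',
          'apikey': wml_credentials['apikey']})
iam_token = token_response.json()['access_token']

# POST the scoring payload to the deployment's online scoring URL
headers = {'Authorization': 'Bearer ' + iam_token,
           'ML-Instance-ID': wml_credentials['instance_id']}
response = requests.post(scoring_endpoint, json=scoring_payload, headers=headers)
print(response.json()['values'][0][18])  # predictedLabel

When you are finished with the lab, the same client can remove the artifacts you created. A sketch, with the destructive calls commented out so they are not run by accident:

# Clean up: delete the deployment and the stored model when no longer needed
deployment_uid = client.deployments.get_uid(created_deployment)
# client.deployments.delete(deployment_uid)
# client.repository.delete(published_model_uid)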