Train and deploy a heart disease prediction model using XGBoost and IBM Watson Machine Learning APIs
This notebook demonstrates how to train a model with the XGBoost library to classify whether a person has heart disease. In addition to training, the notebook explains how to persist the trained model to the IBM Watson Machine Learning repository, deploy it as a REST service, and make predictions with the deployed model through the REST APIs.
To train and test the heart disease prediction model, you will use an open source data set published on the University of California, Irvine (UCI) Machine Learning Repository.
This notebook uses Python 3.5 runtime, XGBoost 0.6 and Scikit-Learn 0.17.
The learning goals of this notebook are:
- Load and explore the heart disease data set
- Train and evaluate an XGBoost classification model
- Persist the trained model in the Watson Machine Learning repository
- Deploy the model as an online (REST) scoring service
- Score new data by using the REST APIs
Before you execute the sample code in this notebook, you must perform the following setup task:
- Create an instance of the Watson Machine Learning service on Bluemix and copy its service credentials (url, username, password, and instance_id).
The Heart Disease Data Set is a freely available data set on the UCI Machine Learning Repository portal.
Set the link_to_data variable to the URL of the data set. To download the data from the UCI Machine Learning Repository, use the wget library. Please install this library if you have not installed it already, by using the following command: !pip install wget --user
In [ ]:
!pip install wget --user
Now, the code in the cell below downloads the data set and saves it to the local file system. The name of the downloaded file containing the data is displayed in the output of this cell.
In [ ]:
import wget
link_to_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
ClevelandDataSet = wget.download(link_to_data)
print(ClevelandDataSet)
The .csv file, processed.cleveland.data, that contains the heart disease data set is now available on your local GPFS file system.
The downloaded data set contains the following attributes pertaining to heart disease:
- age: age in years
- sex: sex (1 = male, 0 = female)
- cp: chest pain type
- restbp: resting blood pressure (in mm Hg)
- chol: serum cholesterol (in mg/dl)
- fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
- num: diagnosis of heart disease (angiographic disease status)
In this section you will load the data as a Pandas data frame and perform a basic exploration.
Load the data in the .csv file, processed.cleveland.data, into a Pandas data frame by running the following code:
In [ ]:
import pandas as pd
In [ ]:
col_names = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
heart_data_df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names, na_filter= True, na_values= {'ca': '?', 'thal': '?'})
heart_data_df.head()
Let us see how many samples and attributes we have in this data set:
In [ ]:
(samples, attributes) = heart_data_df.shape
print("No. of Sample data =", samples )
print("No. of Attributes =", attributes)
We have 303 rows of sample data with 14 columns of data per sample.
In recent years, ensemble learning models have taken the lead and become popular among machine learning practitioners.
An ensemble learning model employs multiple machine learning algorithms to overcome the potential weaknesses of a single model. For example, if you are going to pick a destination for your next vacation, you probably ask your family and friends and read reviews and blog posts. Based on all the information you have gathered, you make your final decision.
This phenomenon is referred to as the Wisdom of Crowds (WOC) in the social sciences, and it states that averaging the answers (predictions or probabilities) of a group often gives a better result than the answer of any one of its members. The idea is that the collective knowledge of diverse and independent individuals exceeds the knowledge of any single individual, helping to eliminate the noise.
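A tiny hedged sketch (not part of the heart disease workflow) can illustrate this averaging effect: several independent noisy estimates of a quantity are combined by taking their mean, which is usually closer to the true value than most of the individual estimates.
In [ ]:
# Illustrative sketch of the Wisdom of Crowds effect: the mean of several
# independent noisy estimates is usually closer to the truth than most
# individual estimates.
import numpy as np

np.random.seed(0)
true_value = 10.0
individual_estimates = true_value + np.random.normal(0, 2.0, size=10)
crowd_estimate = individual_estimates.mean()

print("Average individual error:", np.abs(individual_estimates - true_value).mean())
print("Crowd (mean) error      :", abs(crowd_estimate - true_value))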
XGBoost is an open source library for ensemble-based algorithms. It can be used for classification, regression, and ranking problems. XGBoost supports multiple languages, such as C++, Python, R, and Java.
The Python library of XGBoost supports the following API interfaces to train and predict a model, also referred to as a Booster:
- APIs in the xgboost package, such as xgboost.train() or xgboost.Booster
- Scikit-learn based wrapper APIs: xgboost.sklearn.XGBClassifier and xgboost.sklearn.XGBRegressor
Details about using the scikit-learn based wrapper APIs to create and predict with an XGBoost model are explained in the Classify tumors with machine learning notebook.
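As a quick, hedged sketch (on a tiny made-up data set, not the heart disease data), the scikit-learn based wrapper API can be used roughly like this:
In [ ]:
# Illustrative sketch only: training and predicting with the scikit-learn
# style wrapper API on a tiny made-up data set.
import numpy as np
from xgboost.sklearn import XGBClassifier

X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_demo = np.array([0, 1, 1, 0])

sk_model = XGBClassifier(max_depth=2, learning_rate=0.8, n_estimators=10)
sk_model.fit(X_demo, y_demo)
print(sk_model.predict(X_demo))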
In this section you will learn how to train and test an XGBoost model using XGBoost's native Python APIs.
First, you must import the required libraries.
In [ ]:
import xgboost as xgb
import pandas as pd
from sklearn import cross_validation
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
import pprint
%matplotlib inline
In this section, you clean the data in the Pandas data frame and transform it into a form that can be used as input for training the model.
In [ ]:
print("List of features with their corresponding count of null values : ")
print("---------------------------------------------------------------- ")
print(heart_data_df.isnull().sum())
The output of the above cell shows that there are 6 null values in the data set. The rows that contain these null values can be removed so that the data set does not have any incomplete data. The cell below contains the command to remove these rows.
In [ ]:
heart_data_df = heart_data_df.dropna(how='any',axis=0)
In this section, you transform the existing data frame to derive the target column that contains the prediction value for each sample.
The goal of the model is to predict whether a patient has a heart problem. Although the data set does not contain this information directly, it can be derived from the num attribute. The num column pertains to the number of major vessels with more than 50% narrowing (values 0, 1, 2, 3, or 4) for the corresponding sample. Therefore, the target column diagnosed can be derived in the following way: diagnosed is 1 if num is greater than 0 (heart disease), and 0 otherwise (no heart disease).
In [ ]:
heart_data_df['diagnosed'] = heart_data_df['num'].map(lambda d: 1 if d > 0 else 0)
The next step is to select the attributes in the current data set that can be used for training the model. Here, all the attributes other than the num attribute are chosen as the features.
In [ ]:
feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
features_df = heart_data_df[feature_cols]
Now that the target and feature columns have been defined, you can split the data set into two sets: one for training the model and one for testing the trained model.
In [ ]:
heart_train, heart_test, target_train, target_test = cross_validation.train_test_split(features_df, heart_data_df.loc[:,'diagnosed'], test_size=0.33, random_state=0)
DMatrix is the data interface provided by the XGBoost library. The training data and test data are converted to DMatrix objects to perform training and to make predictions. DMatrix objects can be created from various data formats, such as NumPy arrays, Pandas data frames, or SciPy sparse matrices. For more information about the DMatrix interface, see Python Package Introduction.
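As a hedged illustration of the other supported input formats (these objects are not used in the rest of this notebook, and the values are made up), a DMatrix can also be built from a NumPy array or a SciPy sparse matrix:
In [ ]:
# Illustrative sketch only: DMatrix objects built from a NumPy array and
# from a SciPy sparse matrix (not used further below).
import numpy as np
import scipy.sparse as sp

np_features = np.array([[63.0, 1.0], [41.0, 0.0]])
np_labels = np.array([1, 0])
dm_from_numpy = xgb.DMatrix(np_features, label=np_labels)
dm_from_sparse = xgb.DMatrix(sp.csr_matrix(np_features), label=np_labels)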
Next, prepare the DMatrix objects for training and testing based on the training and test data that was split above.
In [ ]:
dm_train = xgb.DMatrix(heart_train, label=target_train)
dm_test = xgb.DMatrix(heart_test)
Set the parameters of the Booster that we are about to create and train.
In [ ]:
# Booster parameters: multi:softmax objective with 2 classes (returns predicted class labels), tree depth 2, learning rate (eta) 0.8, AUC evaluation metric, silent mode
param = {'objective':'multi:softmax', 'max_depth':2, 'eta':0.8, 'num_class': 2, 'eval_metric': 'auc', 'silent':1 }
Create a Booster by using the training data set, which is in the form of a DMatrix object.
In [ ]:
xgb_model = xgb.train(param, dm_train)
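Note that xgb.train() also accepts optional arguments, for example num_boost_round, the number of boosting rounds. A minimal hedged sketch using the objects created above (the extra model is only for illustration and is not used further below):
In [ ]:
# Optional: train a Booster with an explicit number of boosting rounds.
xgb_model_10 = xgb.train(param, dm_train, num_boost_round=10)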
Make predictions on test data and evaluate the model.
In [ ]:
y_predict = xgb_model.predict(dm_test)
print(y_predict)
Evaluate the performance of the model using the predicted data.
In [ ]:
accuracy = accuracy_score(target_test, y_predict)
print("Accuracy: " + str(accuracy))
To understand the model better, XGBoost provides APIs that you can use to get insights about the trees used for training the model and the importance of the features in constructing the Booster.
To plot graphs, run the commands in the following cell to set up the notebook for plotting.
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
The following cell contains the command to plot the graph depicting the importance of features.
In [ ]:
xgb.plot_importance(xgb_model)
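The same information is also available programmatically from the Booster object; as a brief sketch, get_fscore() returns the per-feature split counts and get_dump() returns a text representation of each tree (the exact output depends on the trained model):
In [ ]:
# Feature importance as a {feature: split count} dictionary, and a text dump
# of the first tree in the trained Booster.
print(xgb_model.get_fscore())
print(xgb_model.get_dump()[0])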
To visualize the decision trees that are trained by XGBoost, that is, the trees that make up the XGBoost model, you must first install the graphviz package. Install the graphviz Python package in the notebook environment by running the following cell:
In [ ]:
!pip install graphviz
In [ ]:
xgb.plot_tree(xgb_model, num_trees=1)
In this section, you store the XGBoost model in the Watson Machine Learning repository by using the Watson Machine Learning repository service Python client libraries.
In [ ]:
from repository.mlrepository import MetaNames
from repository.mlrepository import MetaProps
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact
Authenticate to Watson Machine Learning service on Bluemix.
Action: Put authentication information from your instance of Watson Machine Learning service in the following cell.
Tip: The url, username, password, and instance_id values can be found on the Service Credentials tab of the service instance created in Bluemix. If you cannot see the instance_id field in Service Credentials, generate new credentials by pressing the New credential (+) button.
In [ ]:
# @hidden_cell
wml_credentials = {
"url": "https://ibm-watson-ml.mybluemix.net",
"access_key": "$ACCESSKEY",
"username": "$USERNAME",
"password": "$PASSWORD",
"instance_id": "$INSTANCEID"
}
In [ ]:
ml_repository_client = MLRepositoryClient(wml_credentials['url'])
ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])
In this subsection you will learn how to save a model artifact to your Watson Machine Learning instance by using the Watson Machine Learning repository Python client package.
Create an artifact and save it to the machine learning repository by running the following cells:
In [ ]:
# Set the meta properties (author name and email) for the model artifact
props1 = MetaProps({MetaNames.AUTHOR_NAME:"YOUR_name", MetaNames.AUTHOR_EMAIL:"Your_email@email.com"})
model_artifact = MLRepositoryArtifact(xgb_model, name='XGB_Heart_Disease_Detection', meta_props=props1)
saved_model = ml_repository_client.models.save(model_artifact)
In [ ]:
saved_model_meta = saved_model.meta.get()
pprint.pprint(saved_model_meta)
In [ ]:
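# The href of the saved model version, used below to load the model back from the repository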
saved_model_meta['modelVersionHref']
In [ ]:
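# Load the saved model back from the Watson Machine Learning repository by using its version href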
loaded_artifact = ml_repository_client.models.version_from_href(saved_model_meta['modelVersionHref'])
loaded_xgb_model = loaded_artifact.model_instance()
print("Type of model: " + str(type(loaded_xgb_model)))
In [ ]:
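# Predict on the test DMatrix with the model that was loaded from the repository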
y_lpredict = loaded_xgb_model.predict(dm_test)
print(y_lpredict)
In [ ]:
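# Load the saved model again, this time as a scikit-learn style wrapper (XGBRegressor)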
loaded_artifact = ml_repository_client.models.version_from_href(saved_model_meta['modelVersionHref'])
loaded_xgb_regressor = loaded_artifact.model_instance(as_type="XGBRegressor")
In [ ]:
print("Type of model: " + str(type(loaded_xgb_regressor)))
In [ ]:
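# The scikit-learn style wrapper predicts directly on the Pandas test data frame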
y_pred_xgb_reg = loaded_xgb_regressor.predict(heart_test)
print(y_pred_xgb_reg)
To access the Watson Machine Learning REST APIs, you require a Watson Machine Learning access token.
To create the WML access token, run the commands in the following cell:
In [ ]:
import urllib3, requests, json
headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(wml_credentials['username'], wml_credentials['password']))
url = '{}/v3/identity/token'.format(wml_credentials['url'])
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
Get the published_models URL from the instance details.
In [ ]:
endpoint_instance = wml_credentials['url'] + "/v3/wml_instances/" + wml_credentials['instance_id']
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
response_get_instance = requests.get(endpoint_instance, headers=header)
print(response_get_instance)
print(response_get_instance.text)
In [ ]:
endpoint_published_models = json.loads(response_get_instance.text).get('entity').get('published_models').get('url')
print(endpoint_published_models)
Execute the following sample code that uses the published_models endpoint to get the deployments URL.
Get the list of published models.
In [ ]:
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
response_get = requests.get(endpoint_published_models, headers=header)
print(response_get)
print(response_get.text)
Get published model deployment URL.
In [ ]:
# Find the published model whose GUID matches the model saved above and extract its deployments URL
[endpoint_deployments] = [x.get('entity').get('deployments').get('url') for x in json.loads(response_get.text).get('resources') if x.get('metadata').get('guid') == saved_model.uid]
print(endpoint_deployments)
We can now create the online deployment for the published model.
In [ ]:
payload_online = {"name": "xgb_heart_disease_v1", "description": "xgb_heart_disease", "type": "online"}
response_online = requests.post(endpoint_deployments, json=payload_online, headers=header)
print(response_online.text)
In [ ]:
scoring_url = json.loads(response_online.text).get('entity').get('scoring_url')
print(scoring_url)
Now, let us perform predictions on a new set of data using the model that is deployed in the scoring service.
In [ ]:
payload_scoring = {
"values": [[64.0, 1.0, 4.0, 328.0, 263.0, 0.0, 0.0, 105.0, 1.0, 0.2, 2.0, 1.0, 7.0]]
}
response_scoring = requests.post(scoring_url, json=payload_scoring, headers=header)
pprint.pprint(response_scoring.text)
The scoring output contains the prediction value and the corresponding margin data.
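As a hedged sketch, the response text can be parsed as JSON to pick out these values; the key names 'fields' and 'values' are an assumption based on typical Watson Machine Learning scoring responses:
In [ ]:
# Illustrative sketch: parse the scoring response; the 'fields'/'values'
# key names are an assumption about the response structure.
scoring_result = json.loads(response_scoring.text)
print(scoring_result.get('fields'))
print(scoring_result.get('values'))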
You successfully completed this notebook! You learned how to use XGBoost machine learning as well as Watson Machine Learning for model creation and deployment. Check out our Online Documentation for more samples, tutorials, documentation, how-tos, and blog posts.
Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.