Purpose

The purpose of this notebook is to improve the relevance of search results in your custom dataset by setting up training data for the collection that was set up as part of the Setup_Discovery notebook. If you have not completed that exercise, please exit this notebook and return once the Setup_Discovery notebook is complete.

For information about training Watson on your dataset, please see the documentation at https://www.ibm.com/watson/developercloud/doc/discovery/train.html

Step 1: Check Training Status

First we need to take a look at the training attributes of our collection. The API reference for getCollection indicates that it returns three attributes, all of which must be satisfied before relevancy training can begin:

  1. sufficient_label_diversity -> "labels" are integers indicating how relevant a document is to a specified query. To make sure that the labels are diverse enough, there must be at least one "relevant" indicator (which we will assign as 10) and at least one "not relevant" indicator (which we will assign as 0). You are free to add more labels, but this attribute is satisfied when at least two distinct labels are defined across your dataset and they are applied consistently across questions (meaning that every question has at least one "relevant" answer and one "not relevant" answer).
  2. minimum_queries_added -> at the very least, 49 different queries must be added in order to start training Watson. In most cases (including this exercise) many more are required; the exact number scales with the overall size of the collection.
  3. minimum_examples_added -> each query added must have at least 2 examples (and likely more, depending on the scope of the query)

In [ ]:
%load ./scripts/get_training_status.py
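
To give a sense of what the loaded script does, here is a minimal sketch of checking the training status, assuming the watson_developer_cloud Python SDK; the version date, credentials, and IDs are placeholders, and the actual script may differ.

# Minimal sketch: read the collection's training_status via getCollection.
# The SDK class, version date, and credentials below are assumptions.
from watson_developer_cloud import DiscoveryV1

discovery = DiscoveryV1(
    version='2017-11-07',          # API version date; adjust to your service
    username='YOUR_USERNAME',
    password='YOUR_PASSWORD'
)

ENVIRONMENT_ID = 'YOUR_ENVIRONMENT_ID'
COLLECTION_ID = 'YOUR_COLLECTION_ID'

collection = discovery.get_collection(ENVIRONMENT_ID, COLLECTION_ID)
status = collection['training_status']

# The three attributes that must all be satisfied before training begins
for key in ('sufficient_label_diversity',
            'minimum_queries_added',
            'minimum_examples_added'):
    print('{}: {}'.format(key, status.get(key)))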

Step 2: Download and Extract Data

We will use the same dataset that was used in the Setup_Discovery notebook (if you changed it there, please make the appropriate updates here as well). For convenience, the download-and-extract script is included below; you may skip it if the data has already been downloaded from Stack Exchange.


In [ ]:
%load ./scripts/download_and_extract_data.py
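
For reference, a rough sketch of this step is shown below. It assumes the Stack Exchange data dumps hosted on archive.org and a local 7z binary for extraction; the URL pattern, the DATA_TYPE value, and the directory names are illustrative only.

# Rough sketch: download a Stack Exchange dump and extract it locally.
# The archive URL pattern and dataset name are assumptions.
import os
import subprocess
import urllib.request

DATA_TYPE = 'travel'                       # hypothetical dataset name
INPUT_DIR = './data/{}'.format(DATA_TYPE)  # extraction target for Step 3
ARCHIVE = '{}.stackexchange.com.7z'.format(DATA_TYPE)
URL = 'https://archive.org/download/stackexchange/' + ARCHIVE

os.makedirs(INPUT_DIR, exist_ok=True)

# Skip the download if the archive is already present
if not os.path.exists(ARCHIVE):
    urllib.request.urlretrieve(URL, ARCHIVE)

# Extract with the 7z command-line tool (assumed to be installed)
subprocess.check_call(['7z', 'x', '-y', '-o' + INPUT_DIR, ARCHIVE])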

Step 3: Create training data

For each question in the dataset, we will use the answer with the highest "score" as the "relevant" example (assigned the RELEVANT_RATING) and the answer with the lowest "score" as the "not relevant" example (assigned the IRRELEVANT_RATING). We also limit training to questions with fewer than MAX_WORDS words to reduce the load on the system (longer questions typically add complexity for the service).

You are free to customize this script to tweak performance and to experiment with how the training data affects relevancy metrics.

Please make sure that INPUT_DIR points to the correct location of the downloaded and extracted data from Step 2.

Variables for modification in this script:

  • DATA_TYPE: the name of the dataset to use, as specified in the previous step
  • INPUT_DIR: the directory where the extracted data lives on your local machine (from the previous step as well)
  • TRAINING_DIR: the directory where you wish to output the training data which will be used in the next step
  • SPLIT_PERCENTAGE: indicates the percentage of questions to use for training
  • MAX_WORDS: the maximum number of words a question may contain for it to be written to the training data
  • RELEVANT_RATING: the label given to "relevant" answers, set to 10 in Watson Discovery Tooling
  • IRRELEVANT_RATING: the label given to "not relevant" answers, set to 0 in Watson Discovery Tooling

In [ ]:
%load ./scripts/create_training_data.py
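
A simplified sketch of this step follows. It assumes the Stack Exchange Posts.xml layout (PostTypeId 1 = question, 2 = answer) and that each answer's post Id is also its document_id in the collection; the real script may map IDs differently, and all paths and thresholds below are illustrative.

# Simplified sketch: build one training query per question, pairing the
# highest-scored answer (relevant) with the lowest-scored one (not relevant).
import json
import os
import xml.etree.ElementTree as ET

INPUT_DIR = './data/travel'       # extracted data from Step 2 (hypothetical)
TRAINING_DIR = './training_data'  # output consumed in Step 5 (hypothetical)
SPLIT_PERCENTAGE = 0.8            # fraction of questions used for training
MAX_WORDS = 20                    # skip questions longer than this
RELEVANT_RATING = 10
IRRELEVANT_RATING = 0

os.makedirs(TRAINING_DIR, exist_ok=True)

questions = {}   # question id -> title
answers = {}     # question id -> list of (answer id, score)

for _, row in ET.iterparse(os.path.join(INPUT_DIR, 'Posts.xml')):
    if row.tag != 'row':
        continue
    if row.get('PostTypeId') == '1':
        questions[row.get('Id')] = row.get('Title', '')
    elif row.get('PostTypeId') == '2':
        answers.setdefault(row.get('ParentId'), []).append(
            (row.get('Id'), int(row.get('Score', '0'))))
    row.clear()

count = 0
limit = int(len(questions) * SPLIT_PERCENTAGE)
for qid, title in questions.items():
    if count >= limit:
        break
    # Require a short-enough question and at least two answers to label
    if len(title.split()) > MAX_WORDS or len(answers.get(qid, [])) < 2:
        continue
    ranked = sorted(answers[qid], key=lambda a: a[1])
    training_query = {
        'natural_language_query': title,
        'examples': [
            {'document_id': ranked[-1][0], 'relevance': RELEVANT_RATING},
            {'document_id': ranked[0][0], 'relevance': IRRELEVANT_RATING},
        ]
    }
    with open(os.path.join(TRAINING_DIR, qid + '.json'), 'w') as f:
        json.dump(training_query, f)
    count += 1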

Step 4: View training document

After writing the training data to file, check the structure of the output. It should match the model for the request body described in the POST trainingQuery method of the API reference.


In [ ]:
%load ./scripts/print_training_document.py
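
Inspecting one file is enough to verify the structure; a minimal sketch is below, reusing the hypothetical TRAINING_DIR from the previous step.

# Minimal sketch: pretty-print one training document from Step 3.
import json
import os

TRAINING_DIR = './training_data'  # hypothetical path from Step 3

sample = sorted(os.listdir(TRAINING_DIR))[0]
with open(os.path.join(TRAINING_DIR, sample)) as f:
    print(json.dumps(json.load(f), indent=2))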

Step 5: Upload training documents

Using the Watson Discovery Service API, we will upload the directory of training documents to the collection we set up in the Setup_Discovery notebook.


In [ ]:
%load ./scripts/upload_training_data.py
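
A minimal sketch of the upload is shown below, posting each training document from Step 3 to the Discovery training_data endpoint with HTTP basic auth; the service URL, version date, and credentials are placeholders, and the loaded script may use the SDK instead of raw HTTP.

# Minimal sketch: POST each training query to the collection's
# training_data endpoint. URL, version, and credentials are assumptions.
import json
import os
import requests

TRAINING_DIR = './training_data'  # output of Step 3 (hypothetical path)
ENVIRONMENT_ID = 'YOUR_ENVIRONMENT_ID'
COLLECTION_ID = 'YOUR_COLLECTION_ID'
AUTH = ('YOUR_USERNAME', 'YOUR_PASSWORD')

URL = ('https://gateway.watsonplatform.net/discovery/api/v1/environments/'
       '{}/collections/{}/training_data'.format(ENVIRONMENT_ID, COLLECTION_ID))

for name in os.listdir(TRAINING_DIR):
    with open(os.path.join(TRAINING_DIR, name)) as f:
        body = json.load(f)
    response = requests.post(URL, auth=AUTH, json=body,
                             params={'version': '2017-11-07'})
    response.raise_for_status()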