Purpose of this notebook

This notebook helps developers and data scientists set up, configure, and use the Watson Discovery service with custom data sets. It relies on the Watson Developer Cloud Python SDK together with a set of helper scripts and Python modules.

Step 0: Set up credentials

Before beginning any of these steps, you must create a Watson Discovery instance on Bluemix (as described in the README.md). From there, retrieve your service credentials and fill in .env.example in this repository accordingly, then rename the file to .env. You can then begin setting up your Watson Discovery instance.
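
For reference, the scripts in this repository read those credentials from the .env file. A minimal sketch of building a Discovery client with python-dotenv and the Watson Developer Cloud Python SDK is shown below; the variable names DISCOVERY_USERNAME and DISCOVERY_PASSWORD are assumptions modeled on a typical .env.example, so check the actual file for the keys it uses.

    # Minimal sketch: load Discovery credentials from .env and build a client.
    # The environment-variable names are assumptions; see .env.example.
    import os
    from dotenv import load_dotenv
    from watson_developer_cloud import DiscoveryV1

    load_dotenv()  # reads the .env file in the current directory

    discovery = DiscoveryV1(
        version='2017-11-07',
        username=os.getenv('DISCOVERY_USERNAME'),
        password=os.getenv('DISCOVERY_PASSWORD'))

The later sketches in this notebook assume this discovery client is in scope and a 1.x release of the watson-developer-cloud SDK, which returns responses as plain dicts; method keyword names differ in other releases.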

Step 1: Create an environment

The environment defines the amount of storage space that you have for content in the Watson Discovery service. A maximum of one environment can be created for each instance of the Watson Discovery service.

Note: Several environment sizes are available; they differ in storage capacity and in the enrichments included. Refer to the Discovery service pricing plans and documentation for details: https://www.ibm.com/watson/developercloud/discovery.html.

To create an environment, run the following code:


In [ ]:
%load ./scripts/create_environment.py
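
For orientation, creating an environment with the SDK amounts to roughly the following sketch; it assumes the discovery client from Step 0, and the name, description, and size values are examples (size should match your pricing plan):

    # Sketch: create a Discovery environment. The size value is an example;
    # pick one that matches your pricing plan.
    environment = discovery.create_environment(
        name='my_environment',
        description='Environment for the StackExchange sample data',
        size=1)
    print(environment)  # initially the 'status' attribute will be 'pending'
    environment_id = environment['environment_id']  # used in later sketches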

If you see an error message about invalid credentials, make sure that you modified the .env file as specified in the README.md for this repository. If the environment is created successfully, the JSON response will show the environment's status attribute as pending (see https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Environments/createEnvironment for more details). It takes a few minutes for the environment to reach the status active. To check on the status of the environment, run the code below.

Step 2: Check the status of your environment


In [ ]:
%load ./scripts/get_environment_status.py
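
The status check amounts to roughly the following sketch, here wrapped in a simple polling loop for illustration; it assumes the discovery client and the environment_id from the previous steps:

    # Sketch: poll the environment status until it becomes 'active'.
    import time

    while True:
        environment = discovery.get_environment(environment_id=environment_id)
        print('Environment status:', environment['status'])
        if environment['status'] == 'active':
            break
        time.sleep(30)  # wait half a minute before checking again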

Note:

Wait for the environment status to become active before proceeding to the next section, which creates a Watson Discovery collection.

Step 3: Create a collection

After the environment is active, run the code below to create a new collection that will store the documents. You will need to choose a name for the collection.


In [ ]:
%load ./scripts/create_collection.py
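
A minimal sketch of collection creation follows. The environment-variable name and its default are assumptions modeled on the trained-collection example in Step 5, and omitting configuration_id selects the default configuration:

    # Sketch: create a collection in the now-active environment.
    import os

    COLLECTION_NAME = os.getenv('DISCOVERY_REGULAR_COLLECTION_NAME',
                                'knowledge_base_regular')
    collection = discovery.create_collection(
        environment_id=environment_id,
        name=COLLECTION_NAME,
        description='Collection for the StackExchange sample data')
    print(collection)
    collection_id = collection['collection_id']  # used by the later sketches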

Step 4: Test the collection

After the collection has been created, it is ready to store and index documents. Before uploading documents in bulk, it is a good idea to test the new collection with a single sample document (the expected format is shown in Step 7). In this step, we will:

  1. upload a single document to the collection
  2. poll the document status until it is no longer 'processing'
  3. test to see if the document can be retrieved via a query
  4. remove the document (to ensure we don't have extra documents in our collection)

In [ ]:
%load ./scripts/test_collection.py
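
That sequence corresponds roughly to the sketch below. It assumes a 1.x release of the watson-developer-cloud SDK (older releases use different keyword names, e.g. file_info instead of file) and an illustrative sample file path:

    # Sketch: upload one document, wait for processing, query, then clean up.
    import json
    import time

    # 1. Upload a single sample document (the path is illustrative).
    with open('data/sample/1.json') as f:
        doc = discovery.add_document(environment_id, collection_id, file=f)
    document_id = doc['document_id']

    # 2. Poll until the document is no longer 'processing'.
    while True:
        status = discovery.get_document_status(
            environment_id, collection_id, document_id)
        if status['status'] != 'processing':
            break
        time.sleep(5)

    # 3. Query to confirm the document can be retrieved.
    results = discovery.query(environment_id, collection_id, count=1)
    print(json.dumps(results, indent=2))

    # 4. Delete the test document so the collection starts out empty.
    discovery.delete_document(environment_id, collection_id, document_id)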

Step 5: Create another "trained" collection

This step creates a second collection to compare against the first (same procedure as Step 3). You MUST update the COLLECTION_NAME variable in the script to the new value:

COLLECTION_NAME = os.getenv('DISCOVERY_TRAINED_COLLECTION_NAME', 'knowledge_base_trained')

In [ ]:
%load ./scripts/create_collection.py

Step 6: Test the "trained" collection

Same as Step 4. You MUST update the COLLECTION_NAME variable in the script to the new value:

COLLECTION_NAME = os.getenv('DISCOVERY_TRAINED_COLLECTION_NAME', 'knowledge_base_trained')

In [ ]:
%load ./scripts/test_collection.py

Step 7: Prepare the content

The next few sections demonstrate the steps necessary to upload data to a collection, which is where indexed data is stored. The data must be JSON and must conform to the following schema:


In [ ]:
%run ./scripts/print_sample_doc.py
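
As a rough illustration, a document in that shape looks something like the following. The authoritative schema is whatever the script above prints; apart from the text field (see the next section), the field names here are illustrative assumptions:

    {
        "id": 12345,
        "title": "An example question title",
        "text": "An example answer body; the default configuration enriches this field."
    }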

Using the Sample Data

The repository for this starter kit includes sample data created from StackExchange data dumps, as explained in the README.md. One hundred example files are already available in the data/sample directory of the repository. In this sample, the target answer field is named text because that is the field to which the default configuration applies enrichments. Enrichments can be applied to other fields as well, but doing so requires creating a custom configuration, which is not part of this exercise.

Using your own data

You may use your own data with this starter kit. The easiest way to do this is to:

  1. Download a different Stack Exchange data set from https://archive.org/download/stackexchange and extract it
  2. Transform the data into JSON

Step 7.1: Download and Extract data


In [ ]:
%load ./scripts/download_and_extract_data.py
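
Conceptually, that script does something like the sketch below. The archive name is an example, and the sketch assumes the 7z command-line tool is installed, since Stack Exchange dumps are distributed as .7z archives:

    # Sketch: download a Stack Exchange dump and extract it with 7z.
    import subprocess
    import urllib.request

    archive = 'travel.stackexchange.com.7z'  # example archive name
    url = 'https://archive.org/download/stackexchange/' + archive
    urllib.request.urlretrieve(url, archive)

    # Extract into data/travel (requires the 7z binary on the PATH).
    subprocess.run(['7z', 'x', archive, '-odata/travel'], check=True)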

Step 7.2: Transform data into JSON


In [ ]:
%load ./scripts/transform_xml_to_json.py
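
A rough sketch of the transformation follows, assuming the standard Stack Exchange Posts.xml layout (a posts root element whose row children carry Id, Title, and Body attributes) and the illustrative document shape shown above; the actual script may differ:

    # Sketch: convert Stack Exchange Posts.xml rows into one JSON file per post.
    import json
    import os
    import xml.etree.ElementTree as ET

    os.makedirs('data/travel/json', exist_ok=True)
    for row in ET.parse('data/travel/Posts.xml').getroot():
        doc = {
            'id': int(row.get('Id')),
            'title': row.get('Title', ''),
            'text': row.get('Body', '')  # 'text' is the field Discovery enriches
        }
        with open('data/travel/json/%d.json' % doc['id'], 'w') as f:
            json.dump(doc, f)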

Step 8: Upload content

You can upload an entire archive of Stack Exchange data using the scripts provided in this repository. Because this can be a very long-running task, the script can be packaged in a Docker container using the supplied Dockerfile at data/Dockerfile and run on the Bluemix Container service.

If you are unfamiliar with Docker, see the documentation at https://www.docker.com/

If you are unfamiliar with the Bluemix Container service, see the documentation at https://console.bluemix.net/docs/containers/cs_cli_install.html#cs_cli_install

For more information on uploading documents via the Discovery API, see the API reference at https://www.ibm.com/watson/developercloud/discovery/api/v1/#update-doc

When running the script below, please make sure to update the following variables:

  1. DATA_TYPE: set it to 'travel' or to whichever data set you downloaded in the previous Transform data into JSON step
  2. DOCS_DIRECTORY: set it to the directory that contains the documents: either 'sample' if you skipped the Using your own data section, or
    DOCS_DIRECTORY = os.path.join(DATA_TYPE, 'json')
    
    if you did not skip it
  3. DOC_UPLOAD_LIMIT: set it to 0 to upload all of the documents (note that this can take a long time), or to an integer between 0 and the number of documents in DOCS_DIRECTORY to upload only a subset

In [ ]:
%load ./scripts/upload_documents.py

Note:

It is necessary to use the "update" method for document uploads so that you have control over the document ID; this will be needed later for relevancy training tasks.
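
Put together, the upload loop looks roughly like the sketch below, which derives each document ID from its file name and uploads via update_document; it assumes the client, environment_id, and collection_id from the earlier sketches, and the real script may differ in detail:

    # Sketch: upload documents with update_document so we control each ID.
    import os

    DATA_TYPE = 'travel'                              # or your data set
    DOCS_DIRECTORY = os.path.join(DATA_TYPE, 'json')  # or 'sample'
    DOC_UPLOAD_LIMIT = 100                            # 0 means upload everything

    filenames = sorted(os.listdir(DOCS_DIRECTORY))
    if DOC_UPLOAD_LIMIT > 0:
        filenames = filenames[:DOC_UPLOAD_LIMIT]

    for filename in filenames:
        document_id = os.path.splitext(filename)[0]  # '12345.json' -> '12345'
        with open(os.path.join(DOCS_DIRECTORY, filename)) as f:
            discovery.update_document(
                environment_id, collection_id, document_id, file=f)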

Step 9: Query the collections

At this point the collections are configured and ready to retrieve documents in response to queries. To test this, submit a query by running the command below. You can change the collection name and collection ID fields to test both the default and the enriched configuration. The query should return a list of matching documents.

NOTE: In this step you should provide the values for the following variables:

  1. QUESTION: a string containing a natural language query for the dataset
  2. MAX_DOCUMENTS: an integer indicating how many documents you want the service to return
  3. DESIRED_COLLECTION_NAME: should be set to either REGULAR_COLLECTION_NAME or ENRICHED_COLLECTION_NAME depending on which collection you want to query

In [ ]:
%load ./scripts/query_collection.py
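
A sketch of such a query follows, assuming a 1.x release of the SDK, where query accepts keyword arguments (older releases take a query_options dict instead); the question text is just an example:

    # Sketch: run a natural-language query against one of the collections.
    import json

    QUESTION = 'How do I apply for a travel visa?'  # example question
    MAX_DOCUMENTS = 5

    results = discovery.query(
        environment_id,
        collection_id,  # the ID of whichever collection you want to query
        natural_language_query=QUESTION,
        count=MAX_DOCUMENTS)
    print(json.dumps(results, indent=2))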

You have reached the end of the Discovery service configuration.