Before beginning any of these steps, you must create a Watson Discovery instance on Bluemix (as specified in the README.md). From there, retrieve your service credentials and fill out the .env.example file in this repo accordingly. Rename the file to .env, and then you may begin setting up your Watson Discovery instance.
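Before running any of the scripts, it can help to confirm that the renamed .env file actually contains values. The following is a minimal sketch, assuming the file uses plain KEY=VALUE lines; the variable names in REQUIRED are hypothetical placeholders and should be replaced with the names listed in .env.example.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and return them as a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Hypothetical variable names; use the ones listed in .env.example.
REQUIRED = ["DISCOVERY_USERNAME", "DISCOVERY_PASSWORD"]

if __name__ == "__main__":
    absent = missing_keys(load_env_file(".env"), REQUIRED)
    if absent:
        print("Missing values in .env:", ", ".join(absent))
    else:
        print("All required credentials are present.")
```

The parser is deliberately small and does not handle multi-line values or variable expansion.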
The environment defines the amount of storage space that you have for content in the Watson Discovery service. A maximum of one environment can be created for each instance of the Watson Discovery service.
Note: You can choose from several environment sizes, which differ in storage and enrichments. Refer to the Discovery Service pricing plans and documentation for more details here: https://www.ibm.com/watson/developercloud/discovery.html.
To create an environment, run the following code:
In [ ]:
%load ./scripts/create_environment.py
If you see an error message about invalid credentials, make sure that you modified the .env file, as specified in the README.md file for this repository. If the environment is created successfully, the JSON response will show the status attribute of the environment as pending (see https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Environments/createEnvironment for more details). It takes a few minutes for the environment to show the status active. To check on the status of the environment, run the code below.
In [ ]:
%load ./scripts/get_environment_status.py
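Because the environment stays in the pending state for a few minutes, you may prefer to poll rather than re-run the status cell by hand. The sketch below is not part of the repository scripts: wait_until_active is a hypothetical helper, and get_status stands for any function you supply that returns the current status string of the environment.

```python
import time


def wait_until_active(get_status, interval_seconds=10, max_attempts=30, sleep=time.sleep):
    """Call get_status() until it returns 'active' or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        print(f"Attempt {attempt}: status is {status!r}")
        if status == "active":
            return True
        # Wait before the next check, but not after the final attempt.
        if attempt < max_attempts:
            sleep(interval_seconds)
    return False


if __name__ == "__main__":
    # Stand-in for a real status lookup: pending twice, then active.
    fake_statuses = iter(["pending", "pending", "active"])
    ready = wait_until_active(lambda: next(fake_statuses), interval_seconds=0)
    print("Environment ready:", ready)
```

Passing the status lookup in as a function keeps the loop independent of whichever client library the lookup uses.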
In [ ]:
%load ./scripts/create_collection.py
After the collection has been created, it is ready to store and index documents. Before uploading a large number of documents, it is a good idea to test the new collection with a sample document. The command below uploads a sample document (its format is shown below). In this step, we will:
In [ ]:
%load ./scripts/test_collection.py
This step creates a second collection to serve as a comparison collection (same as step #3). You MUST update the COLLECTION_NAME variable in the script to the new value:
COLLECTION_NAME = os.getenv('DISCOVERY_TRAINED_COLLECTION_NAME', 'knowledge_base_trained')
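The line above relies on the fallback behavior of os.getenv: the second argument is used only when the environment variable is not set. A short illustration follows; the value my_trained_collection is a hypothetical name chosen for the example.

```python
import os

# With the variable unset, the second argument is used as the fallback.
os.environ.pop("DISCOVERY_TRAINED_COLLECTION_NAME", None)
name = os.getenv("DISCOVERY_TRAINED_COLLECTION_NAME", "knowledge_base_trained")
print(name)  # knowledge_base_trained

# Once the variable is set, its value takes precedence over the fallback.
os.environ["DISCOVERY_TRAINED_COLLECTION_NAME"] = "my_trained_collection"  # hypothetical value
name = os.getenv("DISCOVERY_TRAINED_COLLECTION_NAME", "knowledge_base_trained")
print(name)  # my_trained_collection
```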
In [ ]:
%load ./scripts/create_collection.py
In [ ]:
%load ./scripts/test_collection.py
In [ ]:
%run ./scripts/print_sample_doc.py
The repository for this ASK includes sample data that has been created from dumps from StackExchange, as explained in the README.md. 100 of these example files are already available in the repository in the data/sample directory. In this sample, we make sure that our target answer field is named text, as it is used in the default configuration to apply enrichments. Enrichments can be applied to other fields as well, but they require creation of a custom configuration, which is not part of this exercise.
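Since the default configuration applies enrichments to the text field, it may be worth checking that every document has one before uploading. The following is a rough sketch, assuming each file in the directory is a single JSON object stored with a .json extension; files_missing_field is a hypothetical helper, not one of the repository scripts.

```python
import json
import os


def files_missing_field(docs_directory, field="text"):
    """Return names of JSON files whose top-level object lacks a non-empty field."""
    missing = []
    for name in sorted(os.listdir(docs_directory)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(docs_directory, name), encoding="utf-8") as handle:
            document = json.load(handle)
        # Flag documents that are not objects or have no usable value.
        if not isinstance(document, dict) or not document.get(field):
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample_dir = os.path.join("data", "sample")
    if os.path.isdir(sample_dir):
        print("Files without a text field:", files_missing_field(sample_dir))
```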
You may use your own data with this starter kit. The easiest way to do this is to:
In [ ]:
%load ./scripts/download_and_extract_data.py
In [ ]:
%load ./scripts/transform_xml_to_json.py
You can upload an entire archive of StackExchange data using the scripts provided in this repository. Since this is a very long-running task, this script can be packaged in a Docker container using the supplied Dockerfile at data/Dockerfile and run on the Bluemix Container service.
If unfamiliar with Docker, please see documentation at https://www.docker.com/
If unfamiliar with the Bluemix Container service, please see documentation at https://console.bluemix.net/docs/containers/cs_cli_install.html#cs_cli_install
For more information on uploading documents via the Discovery APIs, see the API reference at https://www.ibm.com/watson/developercloud/discovery/api/v1/#update-doc
When running the script below, please make sure to update the following variables:
DATA_TYPE: set it to 'travel' or whatever data was downloaded in the previous Transform data into JSON section
DOCS_DIRECTORY: set it to the directory that contains the documents (either 'sample' if you skipped the Using your own data section, or DOCS_DIRECTORY = os.path.join(DATA_TYPE, 'json'))
DOC_UPLOAD_LIMIT: set it to 0 if you would like to upload all the documents (note this could take a long time), or to a positive integer up to the number of documents in the DOCS_DIRECTORY
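The meaning of DOC_UPLOAD_LIMIT described above (0 for everything, otherwise a cap) can be sketched as follows. This is an illustration only, assuming the documents are files with a .json extension; select_documents is a hypothetical helper and may differ from what upload_documents.py does internally.

```python
import os


def select_documents(docs_directory, doc_upload_limit=0):
    """Return sorted JSON file paths, capped at doc_upload_limit (0 means no cap)."""
    paths = sorted(
        os.path.join(docs_directory, name)
        for name in os.listdir(docs_directory)
        if name.endswith(".json")
    )
    if doc_upload_limit > 0:
        return paths[:doc_upload_limit]
    return paths


if __name__ == "__main__":
    docs_directory = "sample"  # or os.path.join(DATA_TYPE, 'json')
    if os.path.isdir(docs_directory):
        chosen = select_documents(docs_directory, doc_upload_limit=10)
        print(f"{len(chosen)} documents selected for upload")
```

Trying a small limit first makes it possible to confirm the upload works before committing to the full, long-running run.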
In [ ]:
%load ./scripts/upload_documents.py
At this point the collections are configured and ready to retrieve documents based on a query. To test them, submit a query by running the command below. You can change the collection name and collection id fields below to test using both the default as well as the enriched configuration. A list of document(s) should be returned from the query.
NOTE: In this step you should provide the values for the following variables:
QUESTION: a string containing a natural language query for the dataset
MAX_DOCUMENTS: an integer indicating how many documents you want to return from the service
DESIRED_COLLECTION_NAME: should be set to either REGULAR_COLLECTION_NAME or ENRICHED_COLLECTION_NAME, depending on which collection you want to query
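To show how the first two variables relate to a query, the sketch below maps them onto the natural_language_query and count parameters of the Discovery v1 query API. How query_collection.py actually passes these values is not shown in this notebook, so treat build_query_params as a hypothetical helper rather than a description of the script.

```python
def build_query_params(question, max_documents):
    """Validate the inputs and return query parameters as a dict."""
    if not question or not question.strip():
        raise ValueError("QUESTION must be a non-empty string")
    if max_documents < 1:
        raise ValueError("MAX_DOCUMENTS must be at least 1")
    return {
        "natural_language_query": question.strip(),
        "count": max_documents,
    }


if __name__ == "__main__":
    QUESTION = "Can I bring liquids in my carry-on bag?"  # example question
    MAX_DOCUMENTS = 5
    print(build_query_params(QUESTION, MAX_DOCUMENTS))
```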
In [ ]:
%load ./scripts/query_collection.py
You have reached the end of the Discovery service configuration.