1.0 - Create RnR Cluster (& Train Ranker)

Simple example which will:

  1. Create an RnR cluster
  2. Create a new document collection
  3. Upload documents into that collection
  4. Train a ranker that can be used with the collection
  5. Query the collection with and without a ranker

Note: To better understand the impact of querying with and without a ranker, see the next example on evaluation.

The Data

We make use of the InsuranceLibV2 data: https://github.com/shuzi/insuranceQA. At a high level, InsuranceLibV2 is a question answering data set provided for benchmarking and research; it consists of questions and answers collected from the Insurance Library.

  • There are 27,413 possible answers in this extract and each looks something like:

Coverage follows the car. Example 1: if you were given a car (loaned) and the car has no insurance, you can buy insurance on the car and your insurance will be primary. Another option, someone helped you to buy a car. For example your credit score isn't good enough to finance, so a friend of yours signed under your loan as a primary payor. You can get insurance under your name and even list your friend on the policy as a loss payee. In this case, we always suggest you get a loan gap coverage: the difference between the car's actual cash value and the amount still owned on it. Example 2: the car you are loaned has insurance. You can buy a policy under your name, list the car on that policy and in case of the accident, your policy will become a secondary or excess. Once the limits of the primary car insurance are exhausted, your coverage would kick in and hopefully pay for the rest. I specifically used the word hopefully, because each accident is unique and it's hard to interpret the coverage without the actual claim scenario. And even with a given claim scenario, sometimes there are 2 possible outcomes of a claim.

  • In addition, 16,899 questions have been labelled with the answer ids that are relevant to them. We use a subset of those questions (specifically the 2,000-question dev subset) as input to train a ranker; a sketch of the relevance file layout follows the example below. The questions look like this:

Does auto insurance go down when you turn 21?
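
For training, these labelled questions are supplied as a relevance file (used below as validation_gt_relevance_file.csv). As a rough, hypothetical illustration only, assuming the usual RnR relevance layout of a query followed by alternating answer-id/relevance-label pairs (the ids and labels here are invented, not from the data set):

import csv
from io import StringIO

# One hypothetical relevance row: the query text followed by alternating
# (answer_id, relevance_label) pairs; the ids and labels are made up.
buffer = StringIO()
csv.writer(buffer).writerow(
    ['does auto insurance go down when you turn 21', '9034', '1', '18640', '1'])
print(buffer.getvalue(), end='')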

Note: Ensure credentials have been updated in config/config.ini

Import the necessary scripts and data


In [1]:
import sys
from os import path, getcwd
import json
from tempfile import mkdtemp

# Make the repository root (the notebook's parent directory) importable so
# that the rnr_debug_helpers package below resolves
sys.path.extend([path.abspath(path.join(getcwd(), path.pardir))])

from rnr_debug_helpers.utils.rnr_wrappers import RetrieveAndRankProxy, RankerProxy
from rnr_debug_helpers.utils.io_helpers import load_config, smart_file_open, RankerRelevanceFileQueryStream
from rnr_debug_helpers.generate_rnr_feature_file import generate_rnr_features

config_file_path = path.abspath(path.join(getcwd(), path.pardir, 'config', 'config.ini'))
print('Using config from {}'.format(config_file_path))

config = load_config(config_file_path=config_file_path)

insurance_lib_data_dir = path.abspath(path.join(getcwd(), path.pardir, 'resources', 'insurance_lib_v2'))
print('Using data from {}'.format(insurance_lib_data_dir))


Using config from /stuff/workspace/rnr-debugging-scripts/config/config.ini
Using data from /stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2

Create an RnR Cluster


In [4]:
# Either re-use an existing solr cluster id by overriding the value below, or leave it as None to create a new cluster
cluster_id = None

# If you choose to leave it as None, it'll use these details to request a new cluster
cluster_name = 'Test Cluster'
cluster_size = '2'

bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id=cluster_id, 
                                       cluster_name=cluster_name, 
                                       cluster_size=cluster_size, 
                                       config=config)


2017-05-16 19:29:28,213 INFO BluemixServiceProxy - Creating a new cluster with name Test Cluster and size 2
2017-05-16 19:29:28,216 INFO BluemixServiceProxy - Submitting request to create a cluster
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "NOT_AVAILABLE"
}
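
For reference, RetrieveAndRankProxy is wrapping the RnR REST API here. A minimal sketch of the underlying request, with placeholder endpoint and credentials standing in for the real ones in config/config.ini:

import requests

# Placeholder base URL and credentials; the real values live in config/config.ini
rnr_base_url = 'https://gateway.watsonplatform.net/retrieve-and-rank/api/v1'
auth = ('<username>', '<password>')

# POST /v1/solr_clusters creates a cluster and returns its id and status
resp = requests.post(rnr_base_url + '/solr_clusters',
                     auth=auth,
                     json={'cluster_name': 'Test Cluster', 'cluster_size': '2'})
print(resp.json())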

Create a Solr collection

Here we create a Solr document collection in the previously created cluster, using the provided Solr configuration; the InsuranceLibV2 documents (i.e. answers) are uploaded into it in the next step.


In [5]:
collection_id = 'TestCollection'
config_id = 'TestConfig'
zipped_solr_config = path.join(insurance_lib_data_dir, 'config.zip')

bluemix_wrapper.setup_cluster_and_collection(collection_id=collection_id, config_id=config_id,
                                             config_zip=zipped_solr_config)


2017-05-16 19:29:29,030 INFO BluemixServiceProxy - Waiting for cluster <<sc40bbecbd_362a_4388_b61b_e3a90578d3b3>> to become available
2017-05-16 19:33:30,807 INFO BluemixServiceProxy - Solr cluster sc40bbecbd_362a_4388_b61b_e3a90578d3b3 is available for use
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
2017-05-16 19:33:31,214 INFO BluemixServiceProxy - Uploading solr configurations
{
    "message": "WRRCSR026: Successfully uploaded named config [TestConfig] for Solr cluster [sc40bbecbd_362a_4388_b61b_e3a90578d3b3].",
    "statusCode": 200
}
2017-05-16 19:33:33,191 INFO BluemixServiceProxy - Creating a collection: TestCollection
{
    "responseHeader": {
        "QTime": 12864,
        "status": 0
    },
    "success": {
        "10.176.142.184:5911_solr": {
            "core": "TestCollection_shard1_replica1",
            "responseHeader": {
                "QTime": 4011,
                "status": 0
            }
        },
        "10.176.39.45:6035_solr": {
            "core": "TestCollection_shard1_replica2",
            "responseHeader": {
                "QTime": 4579,
                "status": 0
            }
        }
    }
}
2017-05-16 19:33:47,062 INFO BluemixServiceProxy - Collection: <<TestCollection>> in cluster: <<sc40bbecbd_362a_4388_b61b_e3a90578d3b3>> (with config: <<TestConfig>>) setup with 0 documents
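
Under the hood this amounts to two REST calls, sketched roughly below; the endpoints follow the RnR API, while rnr_base_url and auth are the placeholders from the earlier sketch:

import requests

cluster_url = rnr_base_url + '/solr_clusters/' + 'sc40bbecbd_362a_4388_b61b_e3a90578d3b3'

# 1. Upload the zipped Solr configuration under a named config id
with open(zipped_solr_config, 'rb') as config_zip:
    requests.post(cluster_url + '/config/' + config_id, auth=auth,
                  headers={'Content-Type': 'application/zip'}, data=config_zip)

# 2. Create the collection via the Solr Collections API, pointing it at that config
requests.post(cluster_url + '/solr/admin/collections', auth=auth,
              params={'action': 'CREATE', 'name': collection_id,
                      'collection.configName': config_id, 'wt': 'json'})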

Upload documents

The InsuranceLibV2 data had to be pre-processed and formatted into Solr's XML update format before the documents could be added.

TODO: show the scripts for how to do this conversion to solr format from the raw data provided at https://github.com/shuzi/insuranceQA.
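
Until then, here is a rough, hypothetical sketch of such a conversion, assuming the raw answers have already been decoded into (answer_id, answer_text) pairs; the id and body field names mirror the fields this collection uses:

from xml.sax.saxutils import escape

def to_solr_add_xml(answer_pairs):
    # Render (answer_id, answer_text) pairs as a Solr <add> update document
    docs = ''.join(
        '<doc><field name="id">{}</field><field name="body">{}</field></doc>'
        .format(escape(str(answer_id)), escape(answer_text))
        for answer_id, answer_text in answer_pairs)
    return '<add>{}</add>'.format(docs)

print(to_solr_add_xml([('1', 'Coverage follows the car. ...')]))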


In [7]:
documents = path.join(insurance_lib_data_dir, 'document_corpus.solr.xml')

print('Uploading from: %s' % documents)
bluemix_wrapper.upload_documents_to_collection(collection_id=collection_id, corpus_file=documents,
                                               content_type='application/xml')

print('Uploaded %d documents to the collection' % 
      bluemix_wrapper.get_num_docs_in_collection(collection_id=collection_id))


Uploading from: /stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2/document_corpus.solr.xml
Uploaded 27413 documents to the collection

Train a Ranker

Since we already have queries annotated with the ids of their relevant documents, we can use them to train a ranker.
TODO: show the scripts for how to do this conversion to the relevance file format from the raw data provided at https://github.com/shuzi/insuranceQA.

Generate a feature file

The ranker trains on features derived from the question/answer pairs, so we first need to use the service to generate such a feature file. During feature file generation we need to decide on the num_rows parameter; we will go into this in more detail in a separate example. For now, we set it to 50.


In [7]:
collection_id = 'TestCollection'
cluster_id = 'sc40bbecbd_362a_4388_b61b_e3a90578d3b3'
temporary_output_dir = mkdtemp()

feature_file = path.join(temporary_output_dir, 'ranker_feature_file.csv')
print('Saving file to: %s' % feature_file)
num_rows = 50
with smart_file_open(path.join(insurance_lib_data_dir, 'validation_gt_relevance_file.csv')) as infile:
    query_stream = RankerRelevanceFileQueryStream(infile)
    with smart_file_open(feature_file, mode='w') as outfile:
        stats = generate_rnr_features(collection_id=collection_id, cluster_id=cluster_id, num_rows=num_rows,
                                      in_query_stream=query_stream, outfile=outfile, config=config)
        print(json.dumps(stats, sort_keys=True, indent=4))


Saving file to: /tmp/tmpiw275r2e/ranker_feature_file.csv
2017-05-16 22:40:55,172 INFO BluemixServiceProxy - Using previously created solr cluster id: sc40bbecbd_362a_4388_b61b_e3a90578d3b3
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
2017-05-16 22:41:38,307 INFO generate_rnr_feature_file.py - Processed 100 queries from input file
2017-05-16 22:42:21,257 INFO generate_rnr_feature_file.py - Processed 200 queries from input file
2017-05-16 22:43:04,129 INFO generate_rnr_feature_file.py - Processed 300 queries from input file
2017-05-16 22:43:46,875 INFO generate_rnr_feature_file.py - Processed 400 queries from input file
2017-05-16 22:44:29,704 INFO generate_rnr_feature_file.py - Processed 500 queries from input file
2017-05-16 22:45:11,905 INFO generate_rnr_feature_file.py - Processed 600 queries from input file
2017-05-16 22:45:54,345 INFO generate_rnr_feature_file.py - Processed 700 queries from input file
2017-05-16 22:46:36,680 INFO generate_rnr_feature_file.py - Processed 800 queries from input file
2017-05-16 22:47:19,090 INFO generate_rnr_feature_file.py - Processed 900 queries from input file
2017-05-16 22:48:01,135 INFO generate_rnr_feature_file.py - Processed 1000 queries from input file
2017-05-16 22:48:43,585 INFO generate_rnr_feature_file.py - Processed 1100 queries from input file
2017-05-16 22:49:26,230 INFO generate_rnr_feature_file.py - Processed 1200 queries from input file
2017-05-16 22:50:08,852 INFO generate_rnr_feature_file.py - Processed 1300 queries from input file
2017-05-16 22:50:51,227 INFO generate_rnr_feature_file.py - Processed 1400 queries from input file
2017-05-16 22:51:33,910 INFO generate_rnr_feature_file.py - Processed 1500 queries from input file
2017-05-16 22:52:18,686 INFO generate_rnr_feature_file.py - Processed 1600 queries from input file
2017-05-16 22:53:05,727 INFO generate_rnr_feature_file.py - Processed 1700 queries from input file
2017-05-16 22:53:48,312 INFO generate_rnr_feature_file.py - Processed 1800 queries from input file
2017-05-16 22:54:30,721 INFO generate_rnr_feature_file.py - Processed 1900 queries from input file
2017-05-16 22:55:13,072 INFO generate_rnr_feature_file.py - Processed 2000 queries from input file
2017-05-16 22:55:13,074 ERROR LabelledQueryStream - Unable to parse values from line 2001 of file <_io.TextIOWrapper name='/stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2/validation_gt_relevance_file.csv' mode='r' encoding='utf-8'> due to error: 
2017-05-16 22:55:13,076 INFO generate_rnr_feature_file.py - Finished processing 2000 queries from input file
{
    "avg_num_correct_answers_per_query_in_gt_file": 1.677,
    "avg_num_correct_answers_per_query_in_rnr_results_default": 0.3455,
    "avg_num_search_results_retrieved_per_query": 50.0,
    "num_correct_in_gt_file": 3354,
    "num_correct_in_search_result": 691,
    "num_occurrences_of_label_1": 3354,
    "num_queries": 2000,
    "num_queries_where_at_least_correct_answer_didnt_appear_in_rnr": 1578,
    "num_queries_with_atleast_one_search_result": 2000,
    "num_search_results_retrieved": 100000
}
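
Before training, you can sanity-check what was generated by peeking at the first rows of the CSV. Roughly speaking the columns are a question id, an answer id, the feature values, and the ground-truth label, but check the header row rather than taking that layout on faith:

# Print the header row plus the first data row of the generated feature file
with smart_file_open(feature_file) as feature_csv:
    for _ in range(2):
        print(feature_csv.readline().rstrip())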

Call Train with the Feature File

WARNING: Each RnR account allows up to 8 rankers to be active at any given time. Since I experiment a lot, I use a convenience flag that deletes existing rankers when the quota is full. You will obviously want to switch this flag off if you have rankers you don't want deleted.


In [8]:
ranker_api_wrapper = RankerProxy(config=config)
ranker_name = 'TestRanker'
ranker_id = ranker_api_wrapper.train_ranker(train_file_location=feature_file, train_file_has_answer_id=True,
                                            is_enabled_make_space=True, ranker_name=ranker_name)
ranker_api_wrapper.wait_for_training_to_complete(ranker_id=ranker_id)

# Delete local feature file since ranker training is done
from shutil import rmtree
rmtree(temporary_output_dir)


2017-05-16 22:55:13,109 INFO BluemixServiceProxy - Submitting request to create a new ranker trained with file /tmp/tmpiw275r2e/ranker_feature_file.csv
2017-05-16 22:55:13,112 INFO BluemixServiceProxy - Generating a version of the feature file without answer id (which is what ranker training expects
2017-05-16 22:55:13,513 INFO BluemixServiceProxy - Done generating file: /tmp/tmpiw275r2e/ranker_feature_file.no_aid.csv
2017-05-16 22:55:13,514 INFO BluemixServiceProxy - Checking file size before making train call for /tmp/tmpiw275r2e/ranker_feature_file.no_aid.csv
2017-05-16 22:55:13,514 INFO BluemixServiceProxy - File size looks ok: 10265794 bytes
2017-05-16 22:55:21,790 INFO BluemixServiceProxy - Training request submitted successfully for ranker id:<<81aacex30-rank-5568>>
2017-05-16 22:55:21,793 INFO BluemixServiceProxy - Checking/Waiting for training to complete for ranker 81aacex30-rank-5568
2017-05-16 22:59:24,429 INFO BluemixServiceProxy - Finished waiting for ranker <<81aacex30-rank-5568>> to train: Available
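
As with the cluster calls, RankerProxy wraps the ranker REST API. A minimal sketch of the train request it makes, reusing the placeholder rnr_base_url and auth from earlier and assuming a feature CSV that already has the answer-id column stripped (the path below is a placeholder):

import json
import requests

# POST /v1/rankers with the training CSV and a metadata blob naming the ranker
with open('/path/to/ranker_feature_file.no_aid.csv', 'rb') as train_csv:
    resp = requests.post(rnr_base_url + '/rankers', auth=auth,
                         files={'training_data': train_csv},
                         data={'training_metadata': json.dumps({'name': 'TestRanker'})})
print(resp.json())  # includes the new ranker_id and its training status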

Query the cluster with questions


In [23]:
from urllib.parse import quote

query_string = 'can i add my brother to my health insurance '

def print_results(response, num_to_print=3):
    results = json.loads(response)['response']['docs']
    for i, doc in enumerate(results[0:num_to_print]):
        print('Result {}:\n\tid: {}\n\tbody:{}...'.format(i + 1, doc['id'], " ".join(doc['body'])[0:100]))

bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id="sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
                                       config=config)
pysolr_client = bluemix_wrapper.get_pysolr_client(collection_id=collection_id)

print('Querying with: {}'.format(query_string))

# without the ranker (URL-encode the query so the spaces don't break the GET path)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=3" % quote(query_string))

print("\nWithout Ranker")
print_results(response)

# with the ranker (num_rows and ranker_id come from the earlier cells)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=%d&ranker_id=%s" %
                                                  (quote(query_string), num_rows, ranker_id))
print("\nWith Ranker")
print_results(response)


2017-05-17 00:23:41,885 INFO BluemixServiceProxy - Using previously created solr cluster id: sc40bbecbd_362a_4388_b61b_e3a90578d3b3
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
Querying with: can i add my brother to my health insurance 

Without Ranker
Result 1:
	id: 2374
	body:talk to your insurance professional , but in my experience , yes , you would each need to obtain you...
Result 2:
	id: 6149
	body:is life insurance necessary for a single person ? i will answer based on my reasons for having life ...
Result 3:
	id: 4495
	body:right , right , right , right . in this case ... take the new plan . if this was outside of work , c...

With Ranker
Result 1:
	id: 7105
	body:typically , a parent would add your brother to to a health insurance policy that they purchase and y...
Result 2:
	id: 22273
	body:your brother can only add you to his health insurance if he claims you as a dependent on his taxes ....
Result 3:
	id: 8658
	body:did you know that this type of policy was first created by dr. marius bernard , in 1983 in south afr...
