1.0 - Create RnR Cluster (& Train Ranker)

Simple example which will:

  1. Create an RnR cluster
  2. Create a new document collection
  3. Upload documents into that collection
  4. Train a ranker that can be used with the collection
  5. Query the collection with and without a ranker

Note: To better understand the impact of querying with and without a ranker, see the next example on evaluation.

The Data

We make use of the InsuranceLibV2 data: https://github.com/shuzi/insuranceQA. At a high level, InsuranceLibV2 is a question answering data set provided for benchmarking and research; it consists of questions and answers collected from the Insurance Library.

  • There are 27,413 possible answers in this extract and each looks something like:

Coverage follows the car. Example 1: if you were given a car (loaned) and the car has no insurance, you can buy insurance on the car and your insurance will be primary. Another option, someone helped you to buy a car. For example your credit score isn't good enough to finance, so a friend of yours signed under your loan as a primary payor. You can get insurance under your name and even list your friend on the policy as a loss payee. In this case, we always suggest you get a loan gap coverage: the difference between the car's actual cash value and the amount still owned on it. Example 2: the car you are loaned has insurance. You can buy a policy under your name, list the car on that policy and in case of the accident, your policy will become a secondary or excess. Once the limits of the primary car insurance are exhausted, your coverage would kick in and hopefully pay for the rest. I specifically used the word hopefully, because each accident is unique and it's hard to interpret the coverage without the actual claim scenario. And even with a given claim scenario, sometimes there are 2 possible outcomes of a claim.

  • In addition, 16,899 questions have been labelled with the answer ids that are relevant to them. We use a subset of those questions (specifically the 2,000-question dev subset) as input to train a ranker; a sketch of the relevance file layout follows the example below. The questions look like this:

Does auto insurance go down when you turn 21?
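
For training, these labelled questions are supplied as a relevance file (used below as validation_gt_relevance_file.csv). As a rough, hypothetical illustration only, assuming the usual RnR relevance layout of a query followed by alternating answer-id/relevance-label pairs (the ids and labels here are invented, not from the data set):

import csv
from io import StringIO

# One hypothetical relevance row: the query text followed by alternating
# (answer_id, relevance_label) pairs; the ids and labels are made up.
buffer = StringIO()
csv.writer(buffer).writerow(
    ['does auto insurance go down when you turn 21', '9034', '1', '18640', '1'])
print(buffer.getvalue(), end='')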

Note: Ensure credentials have been updated in config/config.ini

Import the necessary scripts and data


In [1]:
import sys
from os import path, getcwd
import json
from tempfile import mkdtemp

# Make the repository root (the notebook's parent directory) importable so
# that the rnr_debug_helpers package below resolves
sys.path.extend([path.abspath(path.join(getcwd(), path.pardir))])

from rnr_debug_helpers.utils.rnr_wrappers import RetrieveAndRankProxy, RankerProxy
from rnr_debug_helpers.utils.io_helpers import load_config, smart_file_open, RankerRelevanceFileQueryStream
from rnr_debug_helpers.generate_rnr_feature_file import generate_rnr_features

config_file_path = path.abspath(path.join(getcwd(), path.pardir, 'config', 'config.ini'))
print('Using config from {}'.format(config_file_path))

config = load_config(config_file_path=config_file_path)

insurance_lib_data_dir = path.abspath(path.join(getcwd(), path.pardir, 'resources', 'insurance_lib_v2'))
print('Using data from {}'.format(insurance_lib_data_dir))


Using config from /stuff/workspace/rnr-debugging-scripts/config/config.ini
Using data from /stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2

Create an RnR Cluster


In [4]:
# Either re-use an existing solr cluster id by overriding the value below, or leave it as None to create a new cluster
cluster_id = None

# If you choose to leave it as None, it'll use these details to request a new cluster
cluster_name = 'Test Cluster'
cluster_size = '2'

bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id=cluster_id, 
                                       cluster_name=cluster_name, 
                                       cluster_size=cluster_size, 
                                       config=config)


2017-05-16 19:29:28,213 INFO BluemixServiceProxy - Creating a new cluster with name Test Cluster and size 2
2017-05-16 19:29:28,216 INFO BluemixServiceProxy - Submitting request to create a cluster
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "NOT_AVAILABLE"
}
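
For reference, RetrieveAndRankProxy is wrapping the RnR REST API here. A minimal sketch of the underlying request, with placeholder endpoint and credentials standing in for the real ones in config/config.ini:

import requests

# Placeholder base URL and credentials; the real values live in config/config.ini
rnr_base_url = 'https://gateway.watsonplatform.net/retrieve-and-rank/api/v1'
auth = ('<username>', '<password>')

# POST /v1/solr_clusters creates a cluster and returns its id and status
resp = requests.post(rnr_base_url + '/solr_clusters',
                     auth=auth,
                     json={'cluster_name': 'Test Cluster', 'cluster_size': '2'})
print(resp.json())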

Create a Solr collection

Here we create a Solr document collection in the previously created cluster, using the provided Solr configuration; the InsuranceLibV2 documents (i.e. answers) are uploaded into it in the next step.


In [5]:
collection_id = 'TestCollection'
config_id = 'TestConfig'
zipped_solr_config = path.join(insurance_lib_data_dir, 'config.zip')

bluemix_wrapper.setup_cluster_and_collection(collection_id=collection_id, config_id=config_id,
                                             config_zip=zipped_solr_config)


2017-05-16 19:29:29,030 INFO BluemixServiceProxy - Waiting for cluster <<sc40bbecbd_362a_4388_b61b_e3a90578d3b3>> to become available
2017-05-16 19:33:30,807 INFO BluemixServiceProxy - Solr cluster sc40bbecbd_362a_4388_b61b_e3a90578d3b3 is available for use
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
2017-05-16 19:33:31,214 INFO BluemixServiceProxy - Uploading solr configurations
{
    "message": "WRRCSR026: Successfully uploaded named config [TestConfig] for Solr cluster [sc40bbecbd_362a_4388_b61b_e3a90578d3b3].",
    "statusCode": 200
}
2017-05-16 19:33:33,191 INFO BluemixServiceProxy - Creating a collection: TestCollection
{
    "responseHeader": {
        "QTime": 12864,
        "status": 0
    },
    "success": {
        "10.176.142.184:5911_solr": {
            "core": "TestCollection_shard1_replica1",
            "responseHeader": {
                "QTime": 4011,
                "status": 0
            }
        },
        "10.176.39.45:6035_solr": {
            "core": "TestCollection_shard1_replica2",
            "responseHeader": {
                "QTime": 4579,
                "status": 0
            }
        }
    }
}
2017-05-16 19:33:47,062 INFO BluemixServiceProxy - Collection: <<TestCollection>> in cluster: <<sc40bbecbd_362a_4388_b61b_e3a90578d3b3>> (with config: <<TestConfig>>) setup with 0 documents
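
Under the hood this amounts to two REST calls, sketched roughly below; the endpoints follow the RnR API, while rnr_base_url and auth are the placeholders from the earlier sketch:

import requests

cluster_url = rnr_base_url + '/solr_clusters/' + 'sc40bbecbd_362a_4388_b61b_e3a90578d3b3'

# 1. Upload the zipped Solr configuration under a named config id
with open(zipped_solr_config, 'rb') as config_zip:
    requests.post(cluster_url + '/config/' + config_id, auth=auth,
                  headers={'Content-Type': 'application/zip'}, data=config_zip)

# 2. Create the collection via the Solr Collections API, pointing it at that config
requests.post(cluster_url + '/solr/admin/collections', auth=auth,
              params={'action': 'CREATE', 'name': collection_id,
                      'collection.configName': config_id, 'wt': 'json'})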

Upload documents

The InsuranceLibV2 data had to be pre-processed and formatted into Solr's XML update format before the documents could be added.

TODO: show the scripts for how to do this conversion to solr format from the raw data provided at https://github.com/shuzi/insuranceQA.
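
Until then, here is a rough, hypothetical sketch of such a conversion, assuming the raw answers have already been decoded into (answer_id, answer_text) pairs; the id and body field names mirror the fields this collection uses:

from xml.sax.saxutils import escape

def to_solr_add_xml(answer_pairs):
    # Render (answer_id, answer_text) pairs as a Solr <add> update document
    docs = ''.join(
        '<doc><field name="id">{}</field><field name="body">{}</field></doc>'
        .format(escape(str(answer_id)), escape(answer_text))
        for answer_id, answer_text in answer_pairs)
    return '<add>{}</add>'.format(docs)

print(to_solr_add_xml([('1', 'Coverage follows the car. ...')]))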


In [7]:
documents = path.join(insurance_lib_data_dir, 'document_corpus.solr.xml')

print('Uploading from: %s' % documents)
bluemix_wrapper.upload_documents_to_collection(collection_id=collection_id, corpus_file=documents,
                                               content_type='application/xml')

print('Uploaded %d documents to the collection' % 
      bluemix_wrapper.get_num_docs_in_collection(collection_id=collection_id))


Uploading from: /stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2/document_corpus.solr.xml
Uploaded 27413 documents to the collection

Train a Ranker

Since we already have queries annotated with the ids of their relevant documents, we can use them to train a ranker.
TODO: show the scripts for how to do this conversion to the relevance file format from the raw data provided at https://github.com/shuzi/insuranceQA.

Generate a feature file

The ranker trains on features derived from the question/answer pairs, so we first need to use the service to generate such a feature file. During feature file generation we need to decide on the num_rows parameter; we will go into this in more detail in a separate example. For now, we set it to 50.


In [7]:
collection_id = 'TestCollection'
cluster_id = 'sc40bbecbd_362a_4388_b61b_e3a90578d3b3'
temporary_output_dir = mkdtemp()

feature_file = path.join(temporary_output_dir, 'ranker_feature_file.csv')
print('Saving file to: %s' % feature_file)
num_rows = 50
with smart_file_open(path.join(insurance_lib_data_dir, 'validation_gt_relevance_file.csv')) as infile:
    query_stream = RankerRelevanceFileQueryStream(infile)
    with smart_file_open(feature_file, mode='w') as outfile:
        stats = generate_rnr_features(collection_id=collection_id, cluster_id=cluster_id, num_rows=num_rows,
                                      in_query_stream=query_stream, outfile=outfile, config=config)
        print(json.dumps(stats, sort_keys=True, indent=4))


Saving file to: /tmp/tmpiw275r2e/ranker_feature_file.csv
2017-05-16 22:40:55,172 INFO BluemixServiceProxy - Using previously created solr cluster id: sc40bbecbd_362a_4388_b61b_e3a90578d3b3
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
2017-05-16 22:41:38,307 INFO generate_rnr_feature_file.py - Processed 100 queries from input file
2017-05-16 22:42:21,257 INFO generate_rnr_feature_file.py - Processed 200 queries from input file
2017-05-16 22:43:04,129 INFO generate_rnr_feature_file.py - Processed 300 queries from input file
2017-05-16 22:43:46,875 INFO generate_rnr_feature_file.py - Processed 400 queries from input file
2017-05-16 22:44:29,704 INFO generate_rnr_feature_file.py - Processed 500 queries from input file
2017-05-16 22:45:11,905 INFO generate_rnr_feature_file.py - Processed 600 queries from input file
2017-05-16 22:45:54,345 INFO generate_rnr_feature_file.py - Processed 700 queries from input file
2017-05-16 22:46:36,680 INFO generate_rnr_feature_file.py - Processed 800 queries from input file
2017-05-16 22:47:19,090 INFO generate_rnr_feature_file.py - Processed 900 queries from input file
2017-05-16 22:48:01,135 INFO generate_rnr_feature_file.py - Processed 1000 queries from input file
2017-05-16 22:48:43,585 INFO generate_rnr_feature_file.py - Processed 1100 queries from input file
2017-05-16 22:49:26,230 INFO generate_rnr_feature_file.py - Processed 1200 queries from input file
2017-05-16 22:50:08,852 INFO generate_rnr_feature_file.py - Processed 1300 queries from input file
2017-05-16 22:50:51,227 INFO generate_rnr_feature_file.py - Processed 1400 queries from input file
2017-05-16 22:51:33,910 INFO generate_rnr_feature_file.py - Processed 1500 queries from input file
2017-05-16 22:52:18,686 INFO generate_rnr_feature_file.py - Processed 1600 queries from input file
2017-05-16 22:53:05,727 INFO generate_rnr_feature_file.py - Processed 1700 queries from input file
2017-05-16 22:53:48,312 INFO generate_rnr_feature_file.py - Processed 1800 queries from input file
2017-05-16 22:54:30,721 INFO generate_rnr_feature_file.py - Processed 1900 queries from input file
2017-05-16 22:55:13,072 INFO generate_rnr_feature_file.py - Processed 2000 queries from input file
2017-05-16 22:55:13,074 ERROR LabelledQueryStream - Unable to parse values from line 2001 of file <_io.TextIOWrapper name='/stuff/workspace/rnr-debugging-scripts/resources/insurance_lib_v2/validation_gt_relevance_file.csv' mode='r' encoding='utf-8'> due to error: 
2017-05-16 22:55:13,076 INFO generate_rnr_feature_file.py - Finished processing 2000 queries from input file
{
    "avg_num_correct_answers_per_query_in_gt_file": 1.677,
    "avg_num_correct_answers_per_query_in_rnr_results_default": 0.3455,
    "avg_num_search_results_retrieved_per_query": 50.0,
    "num_correct_in_gt_file": 3354,
    "num_correct_in_search_result": 691,
    "num_occurrences_of_label_1": 3354,
    "num_queries": 2000,
    "num_queries_where_at_least_correct_answer_didnt_appear_in_rnr": 1578,
    "num_queries_with_atleast_one_search_result": 2000,
    "num_search_results_retrieved": 100000
}
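
Before training, you can sanity-check what was generated by peeking at the first rows of the CSV. Roughly speaking the columns are a question id, an answer id, the feature values, and the ground-truth label, but check the header row rather than taking that layout on faith:

# Print the header row plus the first data row of the generated feature file
with smart_file_open(feature_file) as feature_csv:
    for _ in range(2):
        print(feature_csv.readline().rstrip())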

Call Train with the Feature File

WARNING: Each RnR account allows up to 8 rankers to be active at any given time. Since I experiment a lot, I use a convenience flag that deletes existing rankers when the quota is full. You will obviously want to switch this flag off if you have rankers you don't want deleted.


In [8]:
ranker_api_wrapper = RankerProxy(config=config)
ranker_name = 'TestRanker'
ranker_id = ranker_api_wrapper.train_ranker(train_file_location=feature_file, train_file_has_answer_id=True,
                                            is_enabled_make_space=True, ranker_name=ranker_name)
ranker_api_wrapper.wait_for_training_to_complete(ranker_id=ranker_id)

# Delete local feature file since ranker training is done
from shutil import rmtree
rmtree(temporary_output_dir)


2017-05-16 22:55:13,109 INFO BluemixServiceProxy - Submitting request to create a new ranker trained with file /tmp/tmpiw275r2e/ranker_feature_file.csv
2017-05-16 22:55:13,112 INFO BluemixServiceProxy - Generating a version of the feature file without answer id (which is what ranker training expects
2017-05-16 22:55:13,513 INFO BluemixServiceProxy - Done generating file: /tmp/tmpiw275r2e/ranker_feature_file.no_aid.csv
2017-05-16 22:55:13,514 INFO BluemixServiceProxy - Checking file size before making train call for /tmp/tmpiw275r2e/ranker_feature_file.no_aid.csv
2017-05-16 22:55:13,514 INFO BluemixServiceProxy - File size looks ok: 10265794 bytes
2017-05-16 22:55:21,790 INFO BluemixServiceProxy - Training request submitted successfully for ranker id:<<81aacex30-rank-5568>>
2017-05-16 22:55:21,793 INFO BluemixServiceProxy - Checking/Waiting for training to complete for ranker 81aacex30-rank-5568
2017-05-16 22:59:24,429 INFO BluemixServiceProxy - Finished waiting for ranker <<81aacex30-rank-5568>> to train: Available
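
As with the cluster calls, RankerProxy wraps the ranker REST API. A minimal sketch of the train request it makes, reusing the placeholder rnr_base_url and auth from earlier and assuming a feature CSV that already has the answer-id column stripped (the path below is a placeholder):

import json
import requests

# POST /v1/rankers with the training CSV and a metadata blob naming the ranker
with open('/path/to/ranker_feature_file.no_aid.csv', 'rb') as train_csv:
    resp = requests.post(rnr_base_url + '/rankers', auth=auth,
                         files={'training_data': train_csv},
                         data={'training_metadata': json.dumps({'name': 'TestRanker'})})
print(resp.json())  # includes the new ranker_id and its training status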

Query the cluster with questions


In [23]:
from urllib.parse import quote

query_string = 'can i add my brother to my health insurance '

def print_results(response, num_to_print=3):
    results = json.loads(response)['response']['docs']
    for i, doc in enumerate(results[0:num_to_print]):
        print('Result {}:\n\tid: {}\n\tbody:{}...'.format(i + 1, doc['id'], " ".join(doc['body'])[0:100]))

bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id="sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
                                       config=config)
pysolr_client = bluemix_wrapper.get_pysolr_client(collection_id=collection_id)

print('Querying with: {}'.format(query_string))

# without the ranker (URL-encode the query so the spaces don't break the GET path)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=3" % quote(query_string))

print("\nWithout Ranker")
print_results(response)

# with the ranker (num_rows and ranker_id come from the earlier cells)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=%d&ranker_id=%s" %
                                                  (quote(query_string), num_rows, ranker_id))
print("\nWith Ranker")
print_results(response)


2017-05-17 00:23:41,885 INFO BluemixServiceProxy - Using previously created solr cluster id: sc40bbecbd_362a_4388_b61b_e3a90578d3b3
{
    "cluster_name": "Test Cluster",
    "cluster_size": "2",
    "solr_cluster_id": "sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
    "solr_cluster_status": "READY"
}
Querying with: can i add my brother to my health insurance 

Without Ranker
Result 1:
	id: 2374
	body:talk to your insurance professional , but in my experience , yes , you would each need to obtain you...
Result 2:
	id: 6149
	body:is life insurance necessary for a single person ? i will answer based on my reasons for having life ...
Result 3:
	id: 4495
	body:right , right , right , right . in this case ... take the new plan . if this was outside of work , c...

With Ranker
Result 1:
	id: 7105
	body:typically , a parent would add your brother to to a health insurance policy that they purchase and y...
Result 2:
	id: 22273
	body:your brother can only add you to his health insurance if he claims you as a dependent on his taxes ....
Result 3:
	id: 8658
	body:did you know that this type of policy was first created by dr. marius bernard , in 1983 in south afr...
