We begin by initializing Elasticsearch with the names of our index and type. This assumes Elasticsearch is installed and already running (started with the "bin/elasticsearch" command). For more information you can visit: https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html.
Here we initialize an object in which corpus is the name of the index, with a single type, articles, and a field named sentence. It automatically indexes the Document objects from our SnorkelSession.
By default, each document contains fields for the sentence ID number, the sentence itself, and an empty vector of 'o's (this will be useful for generating candidate tags below). Once everything has been indexed, we can see the index status, including the number of documents it contains and its size.
Using our Spouse candidates we will generate a new field that tags the corresponding positions in the sentence. This is done using the cands keyword when initializing the session, or with the generate_tags function.
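As a rough illustration of the tagging step, the sketch below builds the kind of document described above: a sentence ID, the sentence text, and a per-token vector that starts as all 'o' and is overwritten at candidate positions. The field names sentence_id and tags, the PERSON label value, the whitespace tokenization, the example sentence, and the build_doc and tag_positions helpers are all assumptions made for this example; they are not taken from the elastics module.

```python
# Hypothetical sketch: names and layout are assumptions, not the elastics API.

def build_doc(sentence_id, text):
    """Return a document with one 'o' tag per whitespace-separated token."""
    tokens = text.split()
    return {
        "sentence_id": sentence_id,   # assumed field name
        "sentence": text,             # field name used in this tutorial
        "tags": ["o"] * len(tokens),  # assumed field name; 'o' = no candidate here
    }


def tag_positions(doc, positions, label="PERSON"):
    """Return a copy of doc with the given token positions tagged as label."""
    tags = list(doc["tags"])
    for pos in positions:
        tags[pos] = label
    tagged = dict(doc)
    tagged["tags"] = tags
    return tagged


# Made-up sentence; tokens 0 and 2 play the role of the two Spouse candidates.
doc = build_doc(7, "Alice married Bob in 1999")
tagged = tag_positions(doc, [0, 2])
print(tagged["tags"])
```

Keeping the tags aligned one-to-one with token positions is what would let a later query reason about where a search term falls relative to the candidates.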
In [1]:
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from elastics import ElasticSession,delete_index
# import os
# # TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# # Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'
# To delete the corpus index (needed before re-running this cell), uncomment:
# delete_index("corpus")
# Repeat the Spouse definition from the tutorial
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
%time eSearch=ElasticSession(cands=Spouse)
The type of query that we perform is defined by the first argument. Each query also accepts optional size and slop keyword parameters. size specifies the number of results that will be returned, and slop is the number of positions by which the terms in the query may be separated from each other. For more information about slop: https://www.elastic.co/guide/en/elasticsearch/guide/current/slop.html
size and slop default to 5 and 0, respectively.
The regular queries return Sentence objects, and Candidate queries will return candidate objects. These objects are not stored in our index.
*Queries are case sensitive
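To make slop concrete, here is a simplified sketch for a two-term query whose terms appear in query order: the slop a sentence needs is the number of tokens sitting between the two terms, so slop=0 means the terms are adjacent. This only approximates Elasticsearch's behavior, which also handles reordered terms and longer phrases and analyzes text before matching. The within_slop helper and the example sentence are hypothetical and not part of the elastics module.

```python
def within_slop(tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` tokens in between."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    for p in first_positions:
        for q in second_positions:
            gap = q - p - 1  # number of tokens between the two terms
            if 0 <= gap <= slop:
                return True
    return False


tokens = "he wore white linen trousers".split()
print(within_slop(tokens, "white", "trousers", slop=0))  # False: "linen" intervenes
print(within_slop(tokens, "white", "trousers", slop=1))  # True
```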
Once we have all our documents indexed we can perform a simple query. Using the match keyword, we query the sentence field in every document for the words married OR children. Matches that contain both words in their entirety will be scored higher. After performing the query we print the results, which are sorted in descending order.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
In [2]:
query="married children"
search_result = eSearch.search_index("match",query)
for i in search_result:
    print(i.text)
Specifying a slop parameter forces the query to return only results that contain every word in the query. Setting slop=0 explicitly returns only results where the query terms appear side by side. This query looks for sentences in which both white AND trousers occur.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
In [3]:
query="white trousers"
search_result = eSearch.search_index("match",query,slop=0,size=3)
for i in search_result:
    print(i.text)
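For reference, the two searches above correspond roughly to Elasticsearch's match and match_phrase queries. The sketch below writes out request bodies in the Elasticsearch query DSL for the sentence field. Whether search_index builds exactly these bodies is an assumption, and match_body and phrase_body are hypothetical helpers written for this example.

```python
import json


def match_body(text, size=5):
    """Body for a match query: terms are OR-ed together."""
    return {"size": size, "query": {"match": {"sentence": text}}}


def phrase_body(text, slop=0, size=5):
    """Body for a match_phrase query: every term is required, within `slop`."""
    return {
        "size": size,
        "query": {"match_phrase": {"sentence": {"query": text, "slop": slop}}},
    }


print(json.dumps(match_body("married children"), sort_keys=True))
print(json.dumps(phrase_body("white trousers", slop=0, size=3), sort_keys=True))
```

In versions of the official Python client that accept a body argument, a dictionary like this could be passed to the client's search method along with the index name (corpus in this tutorial).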
We can also search between two candidates, which were defined as PERSON in the spouse tutorial. Specifically, we are querying for PERSON married PERSON, in that order. To do this we use the between_cand keyword, followed by the word we want to search for.
*All candidate searches allow only a single term
In [4]:
from snorkel.viewer import SentenceNgramViewer
import os
from snorkel import SnorkelSession
session = SnorkelSession()
result = eSearch.search_index("between_cand","married",slop=100)
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None
sv
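Conceptually, a between_cand search asks whether the term sits between two tagged candidate positions. The toy function below checks this over a token list and a tag vector of the kind described at the start of this section. It is a sketch under assumed names (term_between_candidates, the PERSON label value) with a made-up sentence, and it ignores slop; it is not how the elastics module or Elasticsearch evaluates the query.

```python
def term_between_candidates(tokens, tags, term, label="PERSON"):
    """True if `term` occurs strictly between the first two `label` positions."""
    cand_positions = [i for i, tag in enumerate(tags) if tag == label]
    if len(cand_positions) < 2:
        return False
    start, end = cand_positions[0], cand_positions[1]
    return term in tokens[start + 1:end]


tokens = "Alice married Bob in 1999".split()
tags = ["PERSON", "o", "PERSON", "o", "o"]
print(term_between_candidates(tokens, tags, "married"))  # True
print(term_between_candidates(tokens, tags, "1999"))     # False
```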
In [5]:
result = eSearch.search_index("before_cand","married",slop=100,size=3)
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None
sv
In [6]:
result = eSearch.search_index("after_cand","married",slop=100,size=3)
if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None
sv