Initializing Elasticsearch

We begin by initializing Elasticsearch with the names of our index and type. Elasticsearch should already be installed and started with the "bin/elasticsearch" command. For more information, see: https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html

Here we initialize an ElasticSession object. corpus is the name of the index, which has a single type, articles, and a field named sentence. Initialization automatically indexes the Documents from our SnorkelSession.

By default each document contains a field for the sentence ID number, the sentence itself, and an empty vector of 'o's (this will be useful for generating candidate tags below). Once everything has been indexed, we can see the status of the index, including the number of documents it contains and its size.
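
As a rough picture of the document layout described above, here is a minimal sketch. The helper make_document, the field names sentence_id and tags, and the whitespace tokenization are assumptions made for illustration; only the sentence field name comes from this tutorial, and the real module may store things differently.

```python
def make_document(sentence_id, sentence):
    """Build one index document (hypothetical layout, for illustration only)."""
    tokens = sentence.split()  # assumption: one tag slot per whitespace token
    return {
        "sentence_id": sentence_id,   # hypothetical field name
        "sentence": sentence,         # field name used in this tutorial
        "tags": ["o"] * len(tokens),  # empty tag vector, one 'o' per token
    }

doc = make_document(7, "He is married with three children.")
print(doc["tags"])
```

Every position starts as 'o', so a later tagging step only has to overwrite the positions that belong to a candidate.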

Generate Tags

Using our Spouse candidates we will generate a new field that tags the position of each candidate in the sentence. This is done either by passing the cands keyword when initializing the session or by calling the generate_tags function.
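
The tagging idea can be sketched in plain Python. This is not the generate_tags implementation: the helper tag_positions, the labels p1 and p2, the token-index spans, and the example sentence are all invented for illustration.

```python
def tag_positions(tags, span, label):
    """Return a copy of tags with positions start..end (inclusive) set to label."""
    start, end = span
    tagged = list(tags)
    for i in range(start, end + 1):
        tagged[i] = label
    return tagged

# Invented example sentence; spans are (start, end) token indices.
tokens = "Alice Smith married Bob Jones in 1990".split()
tags = ["o"] * len(tokens)
tags = tag_positions(tags, (0, 1), "p1")  # first person of the candidate
tags = tag_positions(tags, (3, 4), "p2")  # second person of the candidate
print(tags)
```

Positions that are not part of a candidate keep their 'o', which is what lets a later search tell a PERSON position apart from an ordinary word.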


In [1]:
from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from elastics import ElasticSession,delete_index

# import os

# # TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# # Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

# Delete the corpus index first if you need to re-run this cell
# delete_index("corpus")

# Repeat the Spouse definition from the tutorial
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])

# Index the corpus and tag the Spouse candidates, timing the whole run
%time eSearch = ElasticSession(cands=Spouse)


{u'acknowledged': True}
Begin indexing
Index Information: 
 
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   corpus MfVBbNWoSqeFw-c1AiJaLA   5   1      67820            0     23.6mb         23.6mb

2591 items indexed

Begin generating tags
27644 candidates of 27688 tagged
CPU times: user 3min 55s, sys: 13.3 s, total: 4min 8s
Wall time: 6min 16s

Querying the Corpus

The type of query that we perform is defined by the first argument. Each query also accepts optional size and slop keyword parameters. size specifies the number of results that will be returned, and slop is the number of positions by which the terms in the query may be separated from each other. For more information about slop: https://www.elastic.co/guide/en/elasticsearch/guide/current/slop.html

Size and slop default to 5 and 0 respectively.

The regular queries return Sentence objects, and Candidate queries will return candidate objects. These objects are not stored in our index.

*Queries are case-sensitive.

Once we have all our documents indexed we can perform a simple query. Using the match keyword, we query the sentence field in every document for the words married OR children. Sentences that contain both words in their entirety are scored higher. After performing the query we print the results, which are sorted in descending order of score.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html


In [2]:
query="married children"
search_result = eSearch.search_index("match",query)

for i in search_result:
    print i.text


Number of hits: 2354
He is married with        three children.     
He was married twice and had five children.      
He was later married and had two children.   
She got married to my father, Joseph Ewherido, with whom she had eight children, all males.
All of the women were married (or divorced!), and some had children.   
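
For reference, the search above corresponds roughly to a standard Elasticsearch match query. The sketch below only builds the request body as a Python dict. The helper match_body is hypothetical, and whether search_index sends exactly this body is an assumption; the match, operator, and size keys themselves are part of the Elasticsearch query DSL.

```python
import json

def match_body(text, field="sentence", size=5):
    """Request body for an Elasticsearch match query (terms are OR-ed)."""
    return {
        "size": size,  # number of hits to return
        "query": {"match": {field: {"query": text, "operator": "or"}}},
    }

body = match_body("married children")
print(json.dumps(body, indent=2))
```

With the OR operator a sentence needs only one of the two words to match, which is consistent with the large hit count reported above.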

Search containing all values in query

Specifying a slop parameter forces the query to return only results that contain every word in the query. Explicitly passing slop=0 returns results where the query terms appear side by side. This query looks for sentences that contain both white AND trousers, next to each other.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html


In [3]:
query="white trousers"
search_result = eSearch.search_index("match",query,slop=0,size=3)

for i in search_result:
    print i.text


Number of hits: 8
Dressed in a smart black shirt and white trousers, the lawyer looked in good spirits despite a difficult fight ahead.     
Amal, who was dressed in a smart black shirt and white trousers, looked in good spirits on Monday despite a difficult fight ahead.   
Clad in a breton top, white trousers and a matching cap, the Boomtown Rats frontman looked like he couldn't wait to get underway with the festivities.   
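
To make the effect of slop concrete, here is a simplified pure-Python sketch. It handles only terms that appear in query order and counts the extra positions between them. Elasticsearch also accepts reordered terms at a higher slop cost, which this sketch ignores. The function phrase_match is an illustration, not part of the elastics module.

```python
def phrase_match(tokens, terms, slop=0):
    """True if terms occur in order with at most `slop` extra positions between them."""
    n = len(terms)

    def search(term_idx, start, first_pos):
        if term_idx == n:
            return True
        for pos in range(start, len(tokens)):
            if tokens[pos] != terms[term_idx]:  # exact, case-sensitive comparison
                continue
            first = pos if first_pos is None else first_pos
            # Extra positions used so far: span width minus the terms already placed.
            gaps = (pos - first) - term_idx
            if gaps > slop:
                break  # later positions only widen the span further
            if search(term_idx + 1, pos + 1, first):
                return True
        return False

    return search(0, 0, None)

tokens = "a smart black shirt and white trousers".split()
adjacent = phrase_match(tokens, ["white", "trousers"], slop=0)
too_far = phrase_match(tokens, ["black", "trousers"], slop=0)
within_slop = phrase_match(tokens, ["black", "trousers"], slop=3)
print(adjacent, too_far, within_slop)
```

Here black and trousers are separated by three other words, so they match only once slop is at least 3.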

Search between Candidates

We can also search between two candidates, which were defined as PERSON in the spousal tutorial. Specifically, we are querying for PERSON married PERSON, in that order. To do this we use the between_cand keyword, followed by the word we want to search for.

*All candidate searches only allow a single term.


In [4]:
from snorkel.viewer import SentenceNgramViewer
import os
from snorkel import SnorkelSession

session = SnorkelSession()

result = eSearch.search_index("between_cand","married",slop=100)

if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None

sv


Number of hits: 225

Search before candidate

Using the before_cand keyword we can search for the occurrence of a term that appears before any PERSON.


In [5]:
result = eSearch.search_index("before_cand","married",slop=100,size=3)

if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None

sv


Number of hits: 291

Search after candidate

Using the after_cand keyword we can search for the occurrence of a term that appears after any PERSON.


In [6]:
result = eSearch.search_index("after_cand","married",slop=100,size=3)

if 'CI' not in os.environ:
    sv = SentenceNgramViewer(result, session)
else:
    sv = None

sv


Number of hits: 318
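
The three candidate searches above can be pictured on a single tagged sentence. The sketch below is an illustration only: it assumes a per-token tag vector in which every non-'o' entry marks a PERSON, it ignores slop, and the helper candidate_relations, the labels p1 and p2, and the example sentence are invented rather than taken from the elastics module.

```python
def candidate_relations(tokens, tags, term):
    """Report where `term` sits relative to tagged (non-'o') positions."""
    tagged = [i for i, tag in enumerate(tags) if tag != "o"]
    result = {"before_cand": False, "after_cand": False, "between_cand": False}
    for pos, token in enumerate(tokens):
        if token != term:  # exact, case-sensitive comparison
            continue
        before = any(pos < i for i in tagged)  # term precedes some PERSON
        after = any(i < pos for i in tagged)   # term follows some PERSON
        result["before_cand"] |= before
        result["after_cand"] |= after
        result["between_cand"] |= before and after
    return result

# Invented example: "Alice" and "Bob" are the tagged PERSON positions.
tokens = ["Alice", "married", "Bob", "in", "1990"]
tags = ["p1", "o", "p2", "o", "o"]
married = candidate_relations(tokens, tags, "married")
trailing = candidate_relations(tokens, tags, "in")
print(married)
print(trailing)
```

In this example married satisfies all three relations, while in only comes after a PERSON.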