PyLucene links and examples

PyLucene is a Python wrapper around Apache Lucene, an open-source, Java-based indexing and search library that also provides spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

PyLucene provides a Python extension (built with JCC) that embeds a Java VM running Lucene inside the Python process, so Python code can call into Lucene directly.

PyLucene exposes the Lucene API under essentially the same namespaces as the original Java API (hence the Java Lucene documentation can be used to find the PyLucene equivalents). Note, however, that the native Lucene API is Java-based and therefore a bit un-Pythonic. PyLucene includes a number of adapters to make some of the interfaces more palatable to Python code, but the API still feels awkward at times.


Importing lucene brings the whole Lucene namespace into the Python context; from then on, all Lucene modules can be imported (including the supporting Java modules).

In [ ]:
import lucene

In [ ]:
# We can check all the Lucene packages included in this distribution of PyLucene
for p in sorted(lucene.CLASSPATH.split(':')):
    print(p)

The first operation is always to initialize the Lucene backend (i.e. the embedded Java VM). This only needs to be done once per running Python process.

In [ ]:
# Initialize the Java VM, but only if it has not been started yet
if not lucene.getVMEnv():
    lucene.initVM(vmargs=['-Djava.awt.headless=true'])


Let's test a few Lucene components

In [ ]:
test_strings = (
    'La lluvia en Sevilla es una pura maravilla',
    'En un lugar de La Mancha, de cuyo nombre no quiero acordarme',
    u'Con diez cañones por banda, viento en popa a toda vela' )

In [ ]:
# An auxiliary function used in the tokenizer and analyzer examples

from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

def fetch_terms(obj):
    '''Fetch all terms from a token stream object, as strings'''
    termAtt = obj.getAttribute(CharTermAttribute.class_)
    # The TokenStream contract requires reset() before consuming tokens,
    # and end()/close() afterwards
    obj.reset()
    while obj.incrementToken():
        yield termAtt.toString()
    obj.end()
    obj.close()


In [ ]:
from lucene import JArray_char, JArray

from org.tartarus.snowball.ext import SpanishStemmer, EnglishStemmer

def stem(stemmer, word):
    # Add the word
    stemmer.setCurrent(JArray_char(word), len(word))
    # Fire stemming
    stemmer.stem()
    # Fetch the output (buffer & size)
    result = stemmer.getCurrentBuffer()
    l = stemmer.getCurrentBufferLength()
    return ''.join(result)[0:l]

st = SpanishStemmer()
for w in (u'haciendo', u'lunes', u'vino', u'lápiz'):
    print( w, '->', stem(st, w))

st = EnglishStemmer()
for w in (u'making', u'Monday', u'came', u'pencil'):
    print( w, '->', stem(st, w))
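For intuition about what a stemmer does, here is a toy, pure-Python suffix stripper. This is *not* the Snowball algorithm used above (the suffix list and minimum-stem rule are made up for illustration); it only shows the general idea of reducing inflected forms to a common stem.

```python
def naive_spanish_stem(word):
    '''Toy suffix stripping: NOT Snowball, just the idea of stemming.
    The suffix list below is an arbitrary illustration.'''
    for suffix in ('iendo', 'ando', 'es', 'as', 'os'):
        # Strip the suffix only if at least 3 characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ('haciendo', 'lunes', 'cantando', 'sol'):
    print(w, '->', naive_spanish_stem(w))
```

Real Snowball stemmers apply an ordered cascade of context-sensitive rules, so their output differs from this sketch.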


  • StandardTokenizer: A grammar-based tokenizer constructed with JFlex. It implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
  • LetterTokenizer: A tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate.
  • NGramTokenizer: Tokenizes the input into n-grams of the given size(s).
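To see what character n-grams look like without invoking Lucene, here is a minimal pure-Python sketch of the idea behind NGramTokenizer (the real tokenizer works over the whole character stream and supports a size range, not just a single word and a single size):

```python
def char_ngrams(text, n):
    '''All contiguous substrings of length n, in order of appearance'''
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams('vela', 2))   # bigrams of a single term
print(char_ngrams('popa', 3))   # trigrams
```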

In [ ]:
from java.io import StringReader

def tokenize( tk, data ):
    '''Send a string to a tokenizer and get back the token list'''
    tk.setReader( StringReader(data) )
    return list(fetch_terms(tk))

In [ ]:
from org.apache.lucene.analysis.standard import StandardTokenizer
from org.apache.lucene.analysis.core import LetterTokenizer
from org.apache.lucene.analysis.ngram import NGramTokenizer

In [ ]:
tokenizers = (StandardTokenizer(), LetterTokenizer(), NGramTokenizer(4, 4))

for n, t in enumerate(tokenizers):
    print( "\n{} -----------".format(n+1), str(t) )
    for s in test_strings:
        print( "\n", tokenize(t,s) )


  • KeywordAnalyzer: "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
  • SimpleAnalyzer: An Analyzer that filters a LetterTokenizer with a LowerCaseFilter.
  • SpanishAnalyzer: Built from a StandardTokenizer filtered with StandardFilter, LowerCaseFilter, StopFilter, SetKeywordMarkerFilter (if a stem exclusion set is provided) and SpanishLightStemFilter.
  • ShingleAnalyzerWrapper: Wraps a ShingleFilter around another Analyzer. A shingle is another name for a token-based n-gram.
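As a rough pure-Python illustration of shingling (note that the real ShingleFilter interleaves shingle sizes by token position and, by default, also emits the unigrams, so its output ordering differs from this sketch):

```python
def shingles(tokens, min_size, max_size):
    '''Token-based n-grams (shingles) with n in [min_size, max_size],
    joined with spaces; grouped here by size for readability'''
    out = []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

print(shingles(['toda', 'vela'], 2, 3))
print(shingles('en un lugar'.split(), 2, 3))
```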

In [ ]:
from java.io import StringReader

def analyze(anal, data):
    '''Send a string to an analyzer and get back the analyzed term list'''
    ts = anal.tokenStream( "dummy", StringReader(data) )
    return list(fetch_terms(ts))

In [ ]:
from org.apache.lucene.analysis.core import KeywordAnalyzer, SimpleAnalyzer
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.analysis.es import SpanishAnalyzer
from org.apache.lucene.analysis.shingle import ShingleAnalyzerWrapper

analyzers = ( KeywordAnalyzer(),
              SimpleAnalyzer(),
              SpanishAnalyzer(),
              ShingleAnalyzerWrapper( SimpleAnalyzer(), 2, 3 ),
              ShingleAnalyzerWrapper( SpanishAnalyzer(), 2, 3 ) )

for n, a in enumerate(analyzers):
    print( "\n {} ----------- {}".format(n+1, a) )
    for s in test_strings:
        print( "\n", analyze(a,s) )
