In [1]:
from partitioner.tools import partitioner
Note that partitioner relies on relatively large training data files, so this module will likely not come with any data (e.g., if installed from pip). If this is the case, the training data may be downloaded through the .download() method. Note that this will initiate a prompt, to which a response is required.
pa = partitioner()
pa.download()
Once the training data has been downloaded, the following will load all of the English data sets. This requires significant memory, but results in a high-performance model (see https://arxiv.org/pdf/1608.02025.pdf for details):
In [2]:
pa = partitioner(language = "en", doPOS = True, doLFD = True, q = {"type": 0.77, "POS": 0.71})
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
Note that this uses the parameterization determined in the article above. To change the threshold partition probabilities for both wordforms (type) and parts of speech (POS), try the following. Note that lower values of q make it more difficult for words to join together, and that values outside of [0,1] result in random partitions, which are discussed below.
In [3]:
print(pa.q)
pa.q['type'] = 0.5
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
In [4]:
pa.q['type'] = 0.76
pa.clear()
pa.language = "en"
for source in ["wordnet", "tweebank", "trustpilot", "ted", "streusle", "ritter", "lowlands"]:
pa.source = source
pa.load()
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
partitioner comes with starter data from Wikipedia for nine languages besides English: Dutch (nl), Finnish (fi), German (de), Greek (el), Italian (it), Polish (pl), Portuguese (pt), Russian (ru), and Spanish (es). Note that this is only starter data for these languages: being drawn from Wikipedia, it mostly covers nouns rather than more conversational language. To learn more about how data are annotated for MWE segmentation, see https://www.cs.cmu.edu/~nschneid/mwecorpus.pdf.
In [5]:
pa.clear()
pa.language = "de"
pa.source = ""
pa.load()
print("\n".join(pa.partition("Die binäre Suche ist ein Algorithmus.")))
In [9]:
pa.clear()
pa.language = "en"
pa.source = "streusle"
pa.load()
pa.partitionText(textfile="README.md")
pa.testFit()
print("R-squared: "+str(round(pa.rsq,2)))
print("")
phrases = sorted(pa.frequencies, key = lambda x: pa.frequencies[x], reverse = True)
for j in range(25):
    phrase = phrases[j]
    print("\""+phrase+"\": "+str(pa.frequencies[phrase]))
The partitioner project and module grew out of a simpler, probabilistic framework. Instead of using the MWE partitions, we can keep the training data and simply partition at random, according to the loaded probabilities. Random partitions ensue when the threshold parameters are set outside of [0,1]. To really see the effects, clear out all partition data and use the uniform random partition probability.
Also, to run random partitions it is best to turn off part-of-speech tagging and the longest-first defined (LFD) algorithm (which ensures that all partitioned MWEs are in fact defined), and to limit the gap size to zero. Note that different runs on the same sentence will produce different partitions.
In [12]:
pa.clear()
print(pa.qunif)
pa.q['type'] = -1; pa.q['POS'] = -1
pa.doLFD = False
pa.doPOS = False
pa.maxgap = 0
print("\n".join(pa.partition("Randomness is hard to manage.")))
print("\n\n")
print("\n".join(pa.partition("Randomness is hard to manage.")))
Rather than computing one-off, non-deterministic partitions, which are the result of a random process, we can also compute the expectation. For a given phrase, the computed quantity is its expected frequency under the random partition process.
Essentially, these may be treated like counts, generalizing the n-gram framework.
In [13]:
print(pa.expectation("On average, randomness is dull."))