In [1]:
from partitioner.tools import partitioner
Note that partitioner relies on relatively large training data files, so this module will likely not come with any data (e.g., if installed from pip). If this is the case, the training data may be downloaded through the .download() method. Note that this will initiate a prompt, to which a response is required.
pa = partitioner()
pa.download()
Once the training data has been downloaded, the following will load all of the English data sets. This requires significant memory, but results in a high-performance model (see https://arxiv.org/pdf/1608.02025.pdf for details):
In [2]:
pa = partitioner(language = "en", doPOS = True, doLFD = True, q = {"type": 0.77, "POS": 0.71})
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
Note that this uses the parameterization determined in the article above. To change the threshold partition probabilities for both wordforms (type) and parts of speech (POS), try the following. Note that lower values of q make it more difficult for words to join together, and that values outside of [0,1] result in random partitions, which are discussed below.
In [3]:
print(pa.q)
pa.q['type'] = 0.5
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
In [4]:
pa.q['type'] = 0.76
pa.clear()
pa.language = "en"
for source in ["wordnet", "tweebank", "trustpilot", "ted", "streusle", "ritter", "lowlands"]:
pa.source = source
pa.load()
print("\n".join(pa.partition("How could something like this simply pop up out of the blue?")))
partitioner comes with starter data from Wikipedia for nine languages besides English: Dutch (nl), Finnish (fi), German (de), Greek (el), Italian (it), Polish (pl), Portuguese (pt), Russian (ru), and Spanish (es). Note that this is only starter data for these languages: being drawn from Wikipedia, it mostly covers nouns rather than more conversational language. To learn more about how data are annotated for MWE segmentation, see https://www.cs.cmu.edu/~nschneid/mwecorpus.pdf.
In [5]:
pa.clear()
pa.language = "de"
pa.source = ""
pa.load()
print("\n".join(pa.partition("Die binäre Suche ist ein Algorithmus.")))
In [9]:
pa.clear()
pa.language = "en"
pa.source = "streusle"
pa.load()
pa.partitionText(textfile="README.md")
pa.testFit()
print("R-squared: "+str(round(pa.rsq,2)))
print("")
phrases = sorted(pa.frequencies, key = lambda x: pa.frequencies[x], reverse = True)
for j in range(25):
    phrase = phrases[j]
    print("\""+phrase+"\": "+str(pa.frequencies[phrase]))
The partitioner project and module grew out of a simpler, probabilistic framework. Instead of using the MWE partitions, we can keep the training data and simply partition at random, according to the loaded probabilities. Random partitions ensue when the threshold parameters are set outside of [0,1]. To really see the effects, clear out all partition data and use the uniform random partition probability.
Also, to run random partitions it is best to turn off part-of-speech tagging and the longest-first defined (LFD) algorithm (which ensures that all partitioned MWEs are in fact defined), and to limit the gap size to zero. Note that different runs on the same sentence will produce different partitions.
In [12]:
pa.clear()
print(pa.qunif)
pa.q['type'] = -1; pa.q['POS'] = -1
pa.doLFD = False
pa.doPOS = False
pa.maxgap = 0
print("\n".join(pa.partition("Randomness is hard to manage.")))
print("\n\n")
print("\n".join(pa.partition("Randomness is hard to manage.")))
Rather than computing one-off, non-deterministic partitions, which are the result of a random process, we can also compute the expectation. For a given phrase, the computed quantity is its expected frequency under the random partition process.
Essentially, these may be treated like counts, generalizing the n-gram framework.
In [13]:
print(pa.expectation("On average, randomness is dull."))