Note: Text cleaning and the determination of clauses occur in the partitionText method. Because of this, it is unwise to pass large, uncleaned pieces of text as 'clauses' directly to the .partition() method (regardless of the type of partition being taken): the text will simply be tokenized by splitting on " ", producing many long, punctuation-filled phrases, and the call will likely run very slowly. Best practice is to use .partition() only for testing and exploring the tool on individual clauses of interest.
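To see why this matters, here is a minimal plain-Python sketch (not the library's internals; partitionText's actual cleaning rules may differ) contrasting a naive split on " " with punctuation-aware clause segmentation:

```python
import re

# A toy illustration: splitting raw, uncleaned text on spaces
# leaves punctuation attached to the resulting "phrases".
raw = "Hello there! How are you doing today? Fine, thanks."

tokens = raw.split(" ")
print(tokens)  # punctuation stays attached, e.g. 'there!' and 'today?'

# Clause segmentation would instead first break the text on
# sentence-ending punctuation, yielding clean clauses to partition:
clauses = [c.strip() for c in re.split(r"[.!?]", raw) if c.strip()]
print(clauses)
```

Partitioning each short clause separately is what keeps the candidate phrases short and the search fast.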
In [7]:
from partitioner import partitioner
from partitioner.methods import *
In [4]:
## Vignette 1: Build informed partition data from a dictionary,
## and store to local collection
def preprocessENwiktionary():
    pa = partitioner(informed = True, dictionary = "./dictionaries/enwiktionary.txt")
    pa.dumpqs(qsname = "enwiktionary")
In [5]:
preprocessENwiktionary()
In [4]:
## Vignette 2: An informed, one-off partition of a single clause
def informedOneOffPartition(clause = "How are you doing today?"):
    pa = oneoff()
    print pa.partition(clause)
In [5]:
informedOneOffPartition()
informedOneOffPartition("Fine, thanks a bunch for asking!")
In [6]:
## Vignette 3: An informed, stochastic expectation partition of a single clause
def informedStochasticPartition(clause = "How are you doing today?"):
    pa = stochastic()
    print pa.partition(clause)
In [7]:
informedStochasticPartition()
In [8]:
## Vignette 4: A uniform, one-off partition of a single clause
def uniformOneOffPartition(informed = False, clause = "How are you doing today?", qunif = 0.25):
    pa = oneoff(informed = informed, qunif = qunif)
    print pa.partition(clause)
In [15]:
uniformOneOffPartition()
uniformOneOffPartition(qunif = 0.75)
In [16]:
## Vignette 5: A uniform, stochastic expectation partition of a single clause
def uniformStochasticPartition(informed = False, clause = "How are you doing today?", qunif = 0.25):
    pa = stochastic(informed = informed, qunif = qunif)
    print pa.partition(clause)
In [17]:
uniformStochasticPartition()
uniformStochasticPartition(clause = "Fine, thanks a bunch for asking!")
In [18]:
## Vignette 6: Use the default partitioning method to partition this package's README.md file and compute rsq
def testPartitionTextAndFit():
    pa = oneoff()
    pa.partitionText(textfile = pa.home+"/../README.md")
    pa.testFit()
    print "R-squared: ", round(pa.rsq, 2)
    print
    phrases = sorted(pa.counts, key = lambda x: pa.counts[x], reverse = True)
    for j in range(25):
        phrase = phrases[j]
        print phrase, pa.counts[phrase]
In [19]:
testPartitionTextAndFit()
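For intuition on the goodness-of-fit number printed above: an R-squared compares observed values against a model's predictions. A generic sketch on toy rank-frequency data (not testFit's exact procedure, which is internal to the library):

```python
# Generic R-squared computation on toy data: compare observed phrase
# frequencies against a model's predicted frequencies.
observed = [100.0, 52.0, 33.0, 26.0, 20.0]
predicted = [100.0, 50.0, 33.3, 25.0, 20.0]  # hypothetical Zipf-like fit

mean_obs = sum(observed) / len(observed)
ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
ss_tot = sum((o - mean_obs) ** 2 for o in observed)
rsq = 1.0 - ss_res / ss_tot  # near 1.0 means the model fits well
print(round(rsq, 2))
```

An rsq close to 1 indicates the partitioned phrase-frequency distribution is well described by the fitted model.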
Note: These dictionaries are not as well curated and may contain phrases from other languages (a consequence of how Wiktionary is constructed). As a result, they hold many more phrases and will take longer to process. However, since the vast majority of entries in these dictionaries are language-correct, the effect on the partitioner and its (coarse) partition probabilities is likely negligible.
In [6]:
## Vignette X1: Build informed partition data from other dictionaries,
## and store to local collection
def preprocessOtherWiktionaries():
    for lang in ["ru", "pt", "pl", "nl", "it", "fr", "fi", "es", "el", "de", "en"]:
        print "working on "+lang+"..."
        pa = partitioner(informed = True, dictionary = "./dictionaries/"+lang+".txt")
        pa.dumpqs(qsname = lang)
In [7]:
preprocessOtherWiktionaries()
In [95]:
from partitioner import partitioner
from partitioner.methods import *
## Vignette X2: Use the default partitioning method to partition test texts in several languages and compute rsq
def testFrPartitionTextAndFit():
    for lang in ["ru", "pt", "pl", "nl", "it", "fr", "fi", "es", "el", "de", "en"]:
        pa = oneoff(qsname = lang)
        pa.partitionText(textfile = "./tests/test_"+lang+".txt")
        pa.testFit()
        print
        print lang+" R-squared: ", round(pa.rsq, 2)
        print
        phrases = sorted(pa.counts, key = lambda x: pa.counts[x], reverse = True)
        for j in range(5):
            phrase = phrases[j]
            print phrase, pa.counts[phrase]
In [96]:
testFrPartitionTextAndFit()