Partitioner examples

This is a Jupyter notebook with a few vignettes presenting some of the Python partitioner package's functionality.

Note: Text cleaning and clause determination occur in the partitionText method. Because of this, it is unwise to pass large, uncleaned pieces of text directly to the .partition() method as 'clauses' (regardless of the type of partition being taken): .partition() simply tokenizes its input by splitting on " ", which produces many long, punctuation-filled phrases and can run very slowly. Best practice is to use .partition() only for testing and exploring the tool on individual clauses of interest.
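As a rough, package-free illustration of the failure mode described above, plain whitespace splitting (which is effectively what .partition() does to uncleaned input) shows the problem directly:

```python
# Rough illustration in plain Python (not the partitioner API itself):
# passing raw, uncleaned text straight to .partition() amounts to
# tokenizing on " ", so punctuation stays glued to words and no
# clause boundaries are found.
raw = "Hello there!  How are you doing today? Fine, thanks."
tokens = raw.split(" ")
print(tokens)
# Pseudo-tokens like 'there!' and 'today?' (and even empty strings from
# doubled spaces) survive intact -- exactly the mess the note warns about.
```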


In [7]:
from partitioner import partitioner
from partitioner.methods import *

Process the English Wiktionary to generate the (default) partition probabilities.

Note: this step can take significant time for large dictionaries (~5 min).


In [4]:
## Vignette 1: Build informed partition data from a dictionary, 
##             and store to local collection
def preprocessENwiktionary():
    pa = partitioner(informed = True, dictionary = "./dictionaries/enwiktionary.txt")
    pa.dumpqs(qsname="enwiktionary")

In [5]:
preprocessENwiktionary()

Perform a few one-off partitions.


In [4]:
## Vignette 2: An informed, one-off partition of a single clause
def informedOneOffPartition(clause = "How are you doing today?"):
    pa = oneoff()
    print(pa.partition(clause))

In [5]:
informedOneOffPartition()
informedOneOffPartition("Fine, thanks a bunch for asking!")


['How are you doing', 'today?']
['Fine,', 'thanks a bunch', 'for', 'asking!']

Solve for the informed stochastic expectation partition (given the informed partition probabilities).


In [6]:
## Vignette 3: An informed, stochastic expectation partition of a single clause
def informedStochasticPartition(clause = "How are you doing today?"):
    pa = stochastic()
    print(pa.partition(clause))

In [7]:
informedStochasticPartition()


{'are you': 1.407092428930965e-09, 'How are you': 0.00025712526951610467, 'How': 5.472370457590498e-06, 'doing': 0.000257136448270894, 'you doing': 3.79920846523168e-05, 'How are': 3.800164835444141e-05, 'are': 2.0796023583003835e-10, 'are you doing': 5.47075540492574e-06, 'today?': 1.0, 'you': 9.771662360456963e-09, 'How are you doing': 0.999699400711672}
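The returned dict maps each candidate phrase to its expected frequency under the informed partition probabilities; sorting by value surfaces the dominant partition. A small sketch using a few entries copied from the output above:

```python
# A few entries from the stochastic output above: phrase -> expected frequency.
expectations = {
    'How are you doing': 0.999699400711672,
    'today?': 1.0,
    'How are you': 0.00025712526951610467,
    'doing': 0.000257136448270894,
}
# Sort phrases by expectation to see which partition dominates.
top = sorted(expectations, key=expectations.get, reverse=True)
print(top[:2])  # → ['today?', 'How are you doing']
```

Here the expected partition is essentially ['How are you doing', 'today?'], matching the one-off result in Vignette 2.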

Perform a pure random (uniform) one-off partition.


In [8]:
## Vignette 4: A uniform, one-off partition of a single clause
def uniformOneOffPartition(informed = False, clause = "How are you doing today?", qunif = 0.25):
    pa = oneoff(informed = informed, qunif = qunif)
    print(pa.partition(clause))

In [15]:
uniformOneOffPartition()
uniformOneOffPartition(qunif = 0.75)


['How are', 'you doing today?']
['How', 'are', 'you doing today?']
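The effect of qunif is visible above: raising it from 0.25 to 0.75 produces more cuts. A minimal sketch of the underlying idea, assuming each inter-word boundary is cut independently with probability qunif (the package's own implementation may differ in detail):

```python
import random

def uniform_oneoff(clause, qunif=0.25):
    """Sketch: cut each inter-word boundary independently with prob. qunif."""
    words = clause.split()
    phrases, current = [], [words[0]]
    for word in words[1:]:
        if random.random() < qunif:
            phrases.append(" ".join(current))  # cut: close the current phrase
            current = [word]
        else:
            current.append(word)               # no cut: extend the phrase
    phrases.append(" ".join(current))
    return phrases

print(uniform_oneoff("How are you doing today?", qunif=0.75))
```

Whatever boundaries get cut, the phrases always concatenate back to the original clause; only the grouping is random.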

Solve for the uniform stochastic expectation partition (given the uniform partition probabilities).


In [16]:
## Vignette 5: A uniform, stochastic expectation partition of a single clause
def uniformStochasticPartition(informed = False, clause = "How are you doing today?", qunif = 0.25):
    pa = stochastic(informed = informed, qunif = qunif)
    print(pa.partition(clause))

In [17]:
uniformStochasticPartition()
uniformStochasticPartition(clause = "Fine, thanks a bunch for asking!")


{'are you doing today?': 0.10546875000000001, 'How are you': 0.14062499999999997, 'How': 0.25, 'doing': 0.0625, 'How are': 0.1875, 'How are you doing today?': 0.31640625, 'doing today?': 0.1875, 'you doing': 0.046875, 'are you doing': 0.03515624999999999, 'are': 0.0625, 'you doing today?': 0.14062499999999997, 'today?': 0.25, 'are you': 0.046875, 'you': 0.0625, 'How are you doing': 0.10546875000000001}
{'a': 0.0625, 'Fine,': 0.25, 'thanks a': 0.046875, 'Fine, thanks a bunch for asking!': 0.23730468749999997, 'bunch for asking!': 0.14062499999999997, 'a bunch for': 0.03515624999999999, 'for': 0.0625, 'thanks a bunch for': 0.026367187499999993, 'Fine, thanks a bunch': 0.10546875000000001, 'Fine, thanks a bunch for': 0.0791015625, 'a bunch': 0.046875, 'Fine, thanks a': 0.14062499999999997, 'Fine, thanks': 0.1875, 'thanks': 0.0625, 'a bunch for asking!': 0.10546875000000001, 'asking!': 0.25, 'bunch for': 0.046875, 'for asking!': 0.1875, 'thanks a bunch for asking!': 0.0791015625, 'thanks a bunch': 0.03515624999999999, 'bunch': 0.0625}
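The uniform expectations above follow a simple closed form: a phrase survives exactly when each of its internal boundaries is left uncut (probability 1 - qunif) and each of its external boundaries is cut (probability qunif; clause edges need no cut). A quick check against the numbers printed above, using a hypothetical helper (a sketch, not the package's own code):

```python
def uniform_phrase_expectation(n_words, start, length, qunif=0.25):
    """Probability that the phrase words[start:start+length] survives
    a uniform partition of an n_words-word clause."""
    p = (1 - qunif) ** (length - 1)   # internal boundaries uncut
    if start > 0:
        p *= qunif                    # left boundary must be cut
    if start + length < n_words:
        p *= qunif                    # right boundary must be cut
    return p

# "How are you doing today?" has 5 words; check a few entries from the output:
print(uniform_phrase_expectation(5, 0, 1))  # 'How'          → 0.25
print(uniform_phrase_expectation(5, 2, 1))  # 'you'          → 0.0625
print(uniform_phrase_expectation(5, 0, 5))  # whole clause   → 0.31640625
```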

Build a rank-frequency distribution for a text and determine its Zipf/Simon (bag-of-phrase) $R^2$.
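For context, the Zipf/Simon $R^2$ can be thought of as the goodness of a log-log linear fit to the rank-frequency distribution of the partitioned phrases. A generic sketch of that computation (the package's testFit may differ in detail):

```python
import math

def zipf_rsq(counts):
    """R^2 of a log-log linear fit to a rank-frequency distribution.

    counts: dict mapping phrase -> frequency.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Ordinary least squares in log-log space.
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```

A perfect power law (frequency proportional to 1/rank) gives $R^2 = 1$; the further a text's bag-of-phrase distribution is from Zipfian, the lower the score.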


In [18]:
## Vignette 6: Use the default partitioning method to partition the package's README.md file and compute rsq
def testPartitionTextAndFit():
    pa = oneoff()
    pa.partitionText(textfile = pa.home+"/../README.md")
    pa.testFit()
    print("R-squared: ", round(pa.rsq, 2))
    print()
    phrases = sorted(pa.counts, key = lambda x: pa.counts[x], reverse = True)
    for j in range(25):
        phrase = phrases[j]
        print(phrase, pa.counts[phrase])

In [19]:
testPartitionTextAndFit()


R-squared:  0.11

project 7.0
 5.0
the 5.0
code 4.0
to 4.0
and 4.0
of the 3.0
API 3.0
should 2.0
docs 2.0
A short 2.0
This 2.0
etc 2.0
your 2.0
size 2.0
reference 2.0
can 2.0
examples 2.0
is 2.0
how 2.0
added 2.0
description 2.0
important 2.0
Make sure 1.0
show 1.0

Process some other Wiktionaries to generate their partition probabilities.

Note: These dictionaries are not as well curated and potentially contain phrases from other languages (a consequence of Wiktionary construction). As a result, they hold many more phrases and will take longer to process. However, since the vast majority of entries in these dictionaries are language-correct, the effect on the partitioner and its (coarse) partition probabilities is likely negligible.


In [6]:
## Vignette X1: Build informed partition data from other dictionaries, 
##             and store to local collection
def preprocessOtherWiktionaries():
    for lang in ["ru", "pt", "pl", "nl", "it", "fr", "fi", "es", "el", "de", "en"]:
        print("working on " + lang + "...")
        pa = partitioner(informed = True, dictionary = "./dictionaries/"+lang+".txt")
        pa.dumpqs(qsname=lang)

In [7]:
preprocessOtherWiktionaries()


working on ru...
working on pt...
working on pl...
working on nl...
working on it...
working on fr...
working on fi...
working on es...
working on el...
working on de...
working on en...

Test partitioner on some other languages.


In [95]:
from partitioner import partitioner
from partitioner.methods import *
## Vignette X2: Use the default partitioning method to partition test texts in several languages and compute rsq
def testFrPartitionTextAndFit():
    for lang in ["ru", "pt", "pl", "nl", "it", "fr", "fi", "es", "el", "de", "en"]:
        pa = oneoff(qsname = lang)
        pa.partitionText(textfile = "./tests/test_"+lang+".txt")
        pa.testFit()
        print()
        print(lang + " R-squared: ", round(pa.rsq, 2))
        print()
        phrases = sorted(pa.counts, key = lambda x: pa.counts[x], reverse = True)
        for j in range(5):
            phrase = phrases[j]
            print(phrase, pa.counts[phrase])

In [96]:
testFrPartitionTextAndFit()


ru R-squared:  0.07

и 328.0
въ 204.0
я 126.0
е 106.0
на 101.0

pt R-squared:  0.75

de 470.0
e 265.0
que 243.0
da 234.0
a 193.0

pl R-squared:  0.04

i 40.0
Illustration 26.0
się 23.0
z 20.0
w 18.0

nl R-squared:  0.74

ik 980.0
een 741.0
dat 705.0
van 644.0
de 634.0

it R-squared:  0.7

e 6646.0
che 5656.0
di 5393.0
a 3873.0
il 3692.0

fr R-squared:  0.87

et 2001.0
a 1486.0
de 1333.0
les 1139.0
des 1060.0

fi R-squared:  0.31

ja 246.0
oli 150.0
hän 147.0
Lopo 109.0
että 88.0

es R-squared:  0.67

de 1981.0
y 1651.0
que 1311.0
el 698.0
en 684.0

el R-squared:  0.6

να 332.0
του 253.0
τον 250.0
και 205.0
ΟΙΔΙΠΟΥΣ 192.0

de R-squared:  0.77

und 2691.0
die 2521.0
der 2282.0
zu 2145.0
sie 1702.0

en R-squared:  0.91

and 3691.0
the 2838.0
that 1556.0
of 1472.0
to 1358.0
