The polyglot package offers a command line interface in addition to library access.
For each task in polyglot, there is a subcommand with specific options for that task.
Common options are gathered under the main command polyglot.
In [1]:
!polyglot --help
Notice that most of the operations are language specific.
For example, tokenization rules and part of speech taggers differ between languages.
Therefore, it is important that the language of the input is detected or given.
The --lang option allows you to tell polyglot which language the input is written in.
In [2]:
!polyglot --lang en tokenize --input testdata/cricket.txt | head -n 3
If the user does not supply a language code, polyglot will peek ahead and read the first 1KB of data to detect the language used in the input.
In [3]:
!polyglot tokenize --input testdata/cricket.txt | head -n 3
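The peek-ahead idea can be sketched in a few lines of Python. This is only an illustration of the technique, not polyglot's own implementation: the function name `detect_then_read` and the `detect` callable are hypothetical, and any real detector could be plugged in.

```python
import io


def detect_then_read(stream, detect, peek_size=1024):
    """Guess the language from the first bytes, then read the whole input.

    `stream` must be an io.BufferedReader; `detect` is a hypothetical
    callable that maps a text sample to a language code.
    """
    # peek() does not advance the stream position. It may return more or
    # fewer bytes than requested, so the sample is trimmed to peek_size.
    sample = stream.peek(peek_size)[:peek_size]
    # A multi-byte character may be cut at the sample boundary, so
    # undecodable bytes are ignored for detection purposes.
    lang = detect(sample.decode("utf-8", errors="ignore"))
    # The full input, including the peeked bytes, is still available.
    text = stream.read().decode("utf-8")
    return lang, text
```

Because the sample is only peeked at, the processing step that follows still sees the input from its first byte.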
Polyglot will process the input contents line by line assuming that the lines are separated by "\n".
If the file is formatted differently, you can use the polyglot main command option delimiter to specify any string other than "\n".
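As a rough illustration of what a custom delimiter means for the input, the sketch below splits text into records on an arbitrary string. The function `split_records` is hypothetical, and dropping empty records is an assumption of this sketch rather than documented polyglot behavior.

```python
def split_records(text, delimiter="\n"):
    """Split `text` into records on `delimiter`, dropping empty records."""
    # Assumption: empty records (for example after a trailing delimiter)
    # carry no content and can be skipped.
    return [record for record in text.split(delimiter) if record]


# Records separated by "||" instead of newlines.
records = split_records("first record||second record||", delimiter="||")
# records == ["first record", "second record"]
```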
You can pass text to the polyglot subcommands in several ways: through standard input, as a single text file, or as a collection of text files. If the polyglot main command option workers is activated, the execution will be parallelized and each file will be processed by a different process.
This example will demonstrate how to use the polyglot main command options and the subcommand count to generate a count of the words appearing in a collection of text files.
First, let us examine the options of the subcommand count.
In [4]:
!polyglot count --help
To avoid long output, we will restrict the count to the words that appear at least twice.
In [5]:
!polyglot count --input testdata/cricket.txt --min-count 2
Let us consider the scenario where we have hundreds of files that contain words we want to count.
Notice that we can parallelize the process by passing a number higher than 1 to the polyglot main command option workers.
In [6]:
!polyglot --log debug --workers 5 count --input testdata/cricket.txt testdata/cricket.txt --min-count 3
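The per-file parallelism described above can be approximated in plain Python. The sketch below is not polyglot's implementation: `count_words` and `count_files` are hypothetical helpers, and the `workers` and `min_count` parameters only mirror the command line options by name.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_words(path):
    """Count whitespace-separated words in one UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return Counter(word for line in handle for word in line.split())


def count_files(paths, workers=1, min_count=1):
    """Sum per-file word counts and keep words seen at least min_count times."""
    if workers > 1:
        # Each file is handed to a worker process as a separate task.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            per_file = list(pool.map(count_words, paths))
    else:
        # With a single worker the files are processed one after another.
        per_file = [count_words(path) for path in paths]
    total = Counter()
    for counts in per_file:
        total.update(counts)
    return {word: n for word, n in total.items() if n >= min_count}
```

On platforms that start worker processes by spawning a fresh interpreter, a call with `workers` greater than 1 should sit under an `if __name__ == "__main__":` guard.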
The previous subcommand count assumed that the words are separated by spaces.
Given that we never tokenized the text file, that may result in suboptimal word counting.
Let us take a closer look at the tail of the word counts.
In [7]:
!polyglot count --input testdata/cricket.txt | tail -n 10
Observe that words like "2007." could have been considered two words, "2007" and ".", and the same holds for "Africa's". To fix this issue, we can use the polyglot subcommand tokenize to handle these cases. We can stage the counting to happen after the tokenization by using stdin to build a simple pipe.
In [8]:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot count --min-count 2
Notice that the word "2007" now appears in the word counts list.
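The effect of tokenizing before counting can be reproduced on a small scale. The sketch below uses a deliberately simple regular expression as a stand-in tokenizer; it is an assumption made for illustration and does not reproduce polyglot's tokenization rules. The sample sentence is also made up.

```python
import re
from collections import Counter

# Stand-in tokenizer (an assumption, not polyglot's rules): an apostrophe
# suffix such as "'s", a run of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def whitespace_counts(text):
    """Count words split on whitespace only."""
    return Counter(text.split())


def tokenized_counts(text):
    """Count tokens produced by the stand-in tokenizer."""
    return Counter(TOKEN_PATTERN.findall(text))


text = "South Africa's team toured in 2007. The 2007 final was in 2007."

print(whitespace_counts(text)["2007"])  # 1, the other two are counted as "2007."
print(tokenized_counts(text)["2007"])   # 3, punctuation is split off first
```

With whitespace splitting, "2007." and "2007" are different keys, which is why the plain count underreports the word.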