The tutorial corpus used here is a version of the LibriSpeech test-clean subset, force-aligned with the Montreal Forced Aligner (tutorial corpus download link). Extract the files somewhere on your local machine.
We begin by importing the necessary classes and functions from polyglotdb and defining some variables. Change the path below to reflect where the tutorial corpus was extracted on your local machine.
In [2]:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio
corpus_root = '/mnt/e/Data/pg_tutorial'
The import statements bring in the classes and functions needed for importing, namely the CorpusContext class and the polyglotdb IO module. CorpusContext objects are how all interactions with the database are handled. The CorpusContext is created as a context manager in Python (the with ... as ... pattern), so that clean-up and the closing of connections are handled automatically, both on successful completion of the code and when errors are encountered.
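As a minimal sketch of the pattern, everything you do with a corpus happens inside the with block:

with CorpusContext('pg_tutorial') as c:
    # The database connection is opened on entry and closed on exit,
    # even if an exception is raised inside the block.
    pass  # imports, queries, and other corpus operations go here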
The IO module handles all import and export functionality in polyglotdb. The principal functions a user will encounter are the inspect_X functions, which generate parsers for corpus formats. In the code below, the MFA parser is used because the tutorial corpus was aligned with the MFA. See Importing corpora for more information on the inspect functions and the parser objects they generate for various formats.
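Other supported formats have analogous inspect functions (see Importing corpora for the full list); as a sketch, the TextGrid and Buckeye parsers would be created the same way (the paths here are hypothetical):

textgrid_parser = pgio.inspect_textgrid('/path/to/textgrid/corpus')
buckeye_parser = pgio.inspect_buckeye('/path/to/buckeye/corpus')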
Once the proper path to the tutorial corpus is set, it can be imported via the following code:
In [17]:
parser = pgio.inspect_mfa(corpus_root)
parser.call_back = print # To show progress output
with CorpusContext('pg_tutorial') as c:
    c.load(parser, corpus_root)
If a neo4j.exceptions.ServiceUnavailable error is raised while running the import code, double-check that the pgdb database is running. Once polyglotdb is installed, simply call pgdb start, assuming pgdb install has already been called. See the relevant documentation for more information.
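For reference, starting the database from a terminal looks like the following (pgdb install sets up the local databases and is only needed once, on first setup):

pgdb install
pgdb start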
If you want to start over with a clean slate (for example, to re-run the import), the corpus can be reset first:
In [16]:
with CorpusContext('pg_tutorial') as c:
    c.reset()
To check that the import completed successfully, we can print the corpus's speakers and discourses and list its phone inventory:
In [11]:
with CorpusContext('pg_tutorial') as c:
    print('Speakers:', c.speakers)
    print('Discourses:', c.discourses)

    q = c.query_lexicon(c.lexicon_phone)
    q = q.order_by(c.lexicon_phone.label)
    q = q.columns(c.lexicon_phone.label.column_name('phone'))
    results = q.all()
    print(results)
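Lexicon queries can also be narrowed with filters; as a sketch, restricting the query to a single phone type (the label 'AA1' is hypothetical and depends on your corpus's phone set):

with CorpusContext('pg_tutorial') as c:
    q = c.query_lexicon(c.lexicon_phone)
    # Keep only the phone type whose label matches exactly.
    q = q.filter(c.lexicon_phone.label == 'AA1')
    print(q.all())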
A more interesting summary query looks at the count and average duration of each phone type across the corpus:
In [15]:
from polyglotdb.query.base.func import Count, Average
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    for r in results:
        print('The phone {} had {} occurrences and an average duration of {}.'.format(r['phone'], r['count'], r['average_duration']))
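To keep this summary for later analysis, the results can be written to disk with Python's standard csv module; a sketch (the output filename is arbitrary):

import csv

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    # Write one row per phone type.
    with open('phone_summary.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['phone', 'count', 'average_duration'])
        for r in results:
            writer.writerow([r['phone'], r['count'], r['average_duration']])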