The tutorial corpus used here is a version of the LibriSpeech test-clean subset, force-aligned with the Montreal Forced Aligner (tutorial corpus download link). Extract the files somewhere on your local machine.
We begin by importing the necessary classes and functions from polyglotdb and defining some variables. Change the path below to reflect where the tutorial corpus was extracted on your local machine.
In [2]:
from polyglotdb import CorpusContext
import polyglotdb.io as pgio
corpus_root = '/mnt/e/Data/pg_tutorial'
The import statements bring in the classes and functions needed for importing, namely the CorpusContext class and the polyglotdb IO module. CorpusContext objects are how all interactions with the database are handled. The CorpusContext is created as a context manager in Python (the with ... as ... pattern), so that clean-up and the closing of connections are handled automatically, both on successful completion of the code and when errors are encountered.
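As a minimal sketch of the pattern, everything you do with a corpus happens inside the with block:

with CorpusContext('pg_tutorial') as c:
    # The database connection is opened on entry and closed on exit,
    # even if an exception is raised inside the block.
    pass  # imports, queries, and other corpus operations go here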
The IO module handles all import and export functionality in polyglotdb. The principal functions a user will encounter are the inspect_X functions, which generate parsers for corpus formats. In the code below, the MFA parser is used because the tutorial corpus was aligned with the MFA. See Importing corpora for more information on the inspect functions and the parser objects they generate for various formats.
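Other supported formats have analogous inspect functions (see Importing corpora for the full list); as a sketch, the TextGrid and Buckeye parsers would be created the same way (the paths here are hypothetical):

textgrid_parser = pgio.inspect_textgrid('/path/to/textgrid/corpus')
buckeye_parser = pgio.inspect_buckeye('/path/to/buckeye/corpus')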
Once the proper path to the tutorial corpus is set, it can be imported via the following code:
In [17]:
parser = pgio.inspect_mfa(corpus_root)
parser.call_back = print # To show progress output
with CorpusContext('pg_tutorial') as c:
    c.load(parser, corpus_root)
If a neo4j.exceptions.ServiceUnavailable error is raised while running the import code, double-check that the pgdb database is running. Once polyglotdb is installed, simply call pgdb start, assuming pgdb install has already been called. See the relevant documentation for more information.
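For reference, starting the database from a terminal looks like the following (pgdb install sets up the local databases and is only needed once, on first setup):

pgdb install
pgdb start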
If you want to start over with a clean slate (for example, to re-run the import), the corpus can be reset first:
In [16]:
with CorpusContext('pg_tutorial') as c:
    c.reset()
To check that the import completed successfully, we can print the corpus's speakers and discourses and list its phone inventory:
In [11]:
with CorpusContext('pg_tutorial') as c:
    print('Speakers:', c.speakers)
    print('Discourses:', c.discourses)

    q = c.query_lexicon(c.lexicon_phone)
    q = q.order_by(c.lexicon_phone.label)
    q = q.columns(c.lexicon_phone.label.column_name('phone'))
    results = q.all()
    print(results)
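Lexicon queries can also be narrowed with filters; as a sketch, restricting the query to a single phone type (the label 'AA1' is hypothetical and depends on your corpus's phone set):

with CorpusContext('pg_tutorial') as c:
    q = c.query_lexicon(c.lexicon_phone)
    # Keep only the phone type whose label matches exactly.
    q = q.filter(c.lexicon_phone.label == 'AA1')
    print(q.all())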
A more interesting summary query looks at the count and average duration of each phone type across the corpus:
In [15]:
from polyglotdb.query.base.func import Count, Average
with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    for r in results:
        print('The phone {} had {} occurrences and an average duration of {}.'.format(r['phone'], r['count'], r['average_duration']))
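To keep this summary for later analysis, the results can be written to disk with Python's standard csv module; a sketch (the output filename is arbitrary):

import csv

with CorpusContext('pg_tutorial') as c:
    q = c.query_graph(c.phone).group_by(c.phone.label.column_name('phone'))
    results = q.aggregate(Count().column_name('count'), Average(c.phone.duration).column_name('average_duration'))
    # Write one row per phone type.
    with open('phone_summary.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['phone', 'count', 'average_duration'])
        for r in results:
            writer.writerow([r['phone'], r['count'], r['average_duration']])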