Filling up a pipeline is often a great way to speed up processing, but we can also gain speed with data-parallel methods. If we have hundreds of similar files to process with one script on a machine with several cores, for example, we can process them in parallel and take advantage of more than one core at a time.
Let's look at a simple example. What if we had five texts from Gutenberg whose words we wanted to count?
First, let's look at the data:
In [1]:
ls *.txt
alice.txt douglass.txt frankenstein.txt pride.txt wuthering.txt
In [2]:
wc *.txt
3735 29461 167518 alice.txt
4104 43789 248369 douglass.txt
7653 77986 448689 frankenstein.txt
13426 124588 717575 pride.txt
12486 118899 681641 wuthering.txt
41404 394723 2263792 total
Okay, we've got five texts totaling nearly 400,000 words. That's a good start.
Remember our previous pipeline for counting words? It was something like this:
In [6]:
grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn | head -25
2435 the
1669 of
1574 to
1445 and
794 in
742 was
530 he
457 my
433 it
426 with
422 that
388 his
365 as
359 for
336 me
295 this
292 at
285 be
255 had
253 by
237 not
208 or
206 but
201 him
200 is
sort: write failed: standard output: Broken pipe
sort: write error
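A note on those last two lines of output: head exits as soon as it has read its 25 lines, which closes the pipe while the final sort is still writing, so sort reports a write error. It is harmless here. The sketch below wraps the same pipeline in a function and silences that message; the function name top_words and the tiny sample text are made up for illustration, and the pattern assumes GNU grep.

```shell
# Tiny sample text (hypothetical; stands in for douglass.txt)
sample=$(mktemp)
printf 'The cat and the dog.\nThe dog and a bird.\n' > "$sample"

# Same pipeline as above. 2>/dev/null on the last sort hides the
# "Broken pipe" complaint that appears when head exits early.
# $1 = file to count, $2 = how many words to show (default 25)
top_words() {
    grep -oE '\w{2,}' "$1" | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c | sort -rn 2>/dev/null | head -n "${2:-25}"
}

top_words "$sample" 3
```

Redirecting stderr this way also hides any genuine errors from that sort, so it is probably best kept to exploratory runs like this one.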
Note that we can time the whole pipeline by putting the time command in front of it:
In [8]:
time grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn | head -25
2435 the
1669 of
1574 to
1445 and
794 in
742 was
530 he
457 my
433 it
426 with
422 that
388 his
365 as
359 for
336 me
295 this
292 at
285 be
255 had
253 by
237 not
208 or
206 but
201 him
200 is
sort: write failed: standard output: Broken pipe
sort: write error
real 0m0.137s
user 0m0.115s
sys 0m0.023s
What would happen if we just used a filename wildcard?
In [10]:
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn | head -25
4816 wuthering.txt:and
4750 wuthering.txt:the
4507 pride.txt:the
4371 frankenstein.txt:the
4242 pride.txt:to
3729 pride.txt:of
3658 pride.txt:and
3616 wuthering.txt:to
3046 frankenstein.txt:and
2760 frankenstein.txt:of
2435 douglass.txt:the
2340 wuthering.txt:of
2203 pride.txt:her
2174 frankenstein.txt:to
2124 wuthering.txt:he
1944 wuthering.txt:you
1937 pride.txt:in
1844 pride.txt:was
1818 alice.txt:the
1776 frankenstein.txt:my
1695 pride.txt:she
1669 douglass.txt:of
1574 douglass.txt:to
1556 pride.txt:that
1550 pride.txt:it
sort: write failed: standard output: Broken pipe
sort: write error
real 0m3.313s
user 0m3.169s
sys 0m0.378s
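The filename prefixes in that output come from grep itself, which labels each match when it is given more than one file. If combined counts are all we need, one option is grep's -h flag, which suppresses the prefix. This is only a sketch on two made-up sample files (one.txt and two.txt are hypothetical), and it does nothing to parallelize the work:

```shell
# Two tiny sample files (hypothetical) that share the word "the"
workdir=$(mktemp -d)
printf 'The cat\n' > "$workdir/one.txt"
printf 'the dog\n' > "$workdir/two.txt"

# -h drops the "filename:" prefix, so identical words from different
# files collapse into a single count
combined=$(grep -ohE '\w{2,}' "$workdir"/*.txt | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn)
echo "$combined"
```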
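That's not quite what we want: when grep is given more than one file, it prefixes every match with the file's name, so we get counts per file rather than overall. Instead, let's use GNU parallel to run the first part of the pipeline on each file separately, one job per CPU core (-j+0), saving each file's words to its own output file:
In [11]: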
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"
real 0m0.442s
user 0m0.297s
sys 0m0.222s
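If GNU parallel isn't installed, xargs with its -P option can stand in for this step. The sketch below is an assumption-laden substitute, not the command used in this notebook: the sample files are made up, and the job count of 4 is arbitrary rather than matched to the core count the way -j+0 is.

```shell
# Two tiny sample files (hypothetical)
workdir=$(mktemp -d)
printf 'Alpha beta\n' > "$workdir/a.txt"
printf 'Gamma DELTA\n' > "$workdir/b.txt"

# The glob is expanded before any job starts, so the new *-words.txt
# files are never picked up as input. Up to 4 jobs run at once; each
# writes FILE-words.txt next to FILE.
printf '%s\0' "$workdir"/*.txt \
    | xargs -0 -P 4 -I{} sh -c \
        'grep -oE "\w{2,}" "$1" | tr "[:upper:]" "[:lower:]" > "$1-words.txt"' _ {}

cat "$workdir"/*-words.txt
```

Passing the filename as $1 to sh -c, rather than splicing {} into the script text, keeps filenames with spaces or quotes from being reinterpreted by the shell.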
In [13]:
wc *-words.txt
28625 28625 149885 alice.txt-words.txt
42277 42277 231365 douglass.txt-words.txt
73902 73902 417234 frankenstein.txt-words.txt
121166 121166 669063 pride.txt-words.txt
114101 114101 619516 wuthering.txt-words.txt
380071 380071 2087063 total
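Looks about right: each output file has one word per line, with whitespace and punctuation removed (and one-character words dropped by the {2,} in our pattern). Did you see how fast that went?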
Now we can combine them and apply the rest of our pipeline:
In [17]:
time cat *-words.txt > combined.txt
real 0m0.004s
user 0m0.002s
sys 0m0.003s
In [16]:
time < combined.txt sort | uniq -c | sort -rn | head -25
17881 the
13905 and
12415 to
11129 of
5889 in
5090 was
4720 he
4611 you
4605 that
4561 it
4518 her
4090 my
3926 she
3693 his
3332 not
3320 with
3316 as
2982 for
2981 had
2836 be
2772 but
2761 me
2452 at
2422 on
2230 is
sort: write failed: standard output: Broken pipe
sort: write error
real 0m0.506s
user 0m0.810s
sys 0m0.070s
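Faster, right? The wildcard run took about 3.3 seconds, while the three steps of the parallel version took under a second combined. If a gain of only a couple of seconds doesn't seem like much, imagine if we had hundreds of texts. Or thousands... (see how to get ebook files for more details).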
In [18]:
mkdir pg-text
cd pg-text
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
(380 text files later...)
In [27]:
ls *.txt
10001.txt 10062.txt 10119-8.txt 10234-8.txt
10002-8.txt 10063.txt 10119.txt 10234.txt
10002.txt 10064-8.txt 10120-8.txt 12370-8.txt
10003.txt 10064.txt 10120.txt 12370.txt
10004-8.txt 10065-8.txt 10121-8.txt 12372-8.txt
10004.txt 10065.txt 10121.txt 12372.txt
10005-8.txt 10066-8.txt 10122-8.txt 12373-8.txt
10005.txt 10066.txt 10122.txt 12373.txt
10006-8.txt 10067-8.txt 10123.txt 12374-8.txt
10006.txt 10067.txt 10124-8.txt 12374.txt
10007-8.txt 10068-8.txt 10124.txt 12375-8.txt
10007.txt 10068.txt 10125-8.txt 12375.txt
10008-8.txt 10069-8.txt 10125.txt 12376-8.txt
10008.txt 10069.txt 10126.txt 12376.txt
10009.txt 10070.txt 10127-8.txt 12377.txt
10010.txt 10071-8.txt 10127.txt 12378-8.txt
10011-8.txt 10071.txt 10128-8.txt 12378.txt
10011.txt 10072.txt 10128.txt 12380-8.txt
10012-8.txt 10073-8.txt 10129-8.txt 12380.txt
10012.txt 10073.txt 10129.txt 12381.txt
10013-8.txt 10074-8.txt 10130-8.txt 12383-8.txt
10013.txt 10074.txt 10130.txt 12383.txt
10014-8.txt 10075.txt 10131.txt 12384-8.txt
10014.txt 10076-8.txt 10132-8.txt 12384.txt
10015.txt 10076.txt 10132.txt 12385-8.txt
10016-8.txt 10077-8.txt 10133-8.txt 12385.txt
10016.txt 10077.txt 10133.txt 12386.txt
10017-8.txt 10078-8.txt 10134.txt 1jcfs10.txt
10017.txt 10078.txt 10135-8.txt 2babb10.txt
10018-8.txt 10079-8.txt 10135.txt 3babb10.txt
10018.txt 10079.txt 10136-8.txt 50bab10.txt
10019-8.txt 10080-8.txt 10136.txt ajtl10.txt
10019.txt 10080.txt 10138-8.txt allyr10.txt
10020-8.txt 10081.txt 10138.txt alpsn10.txt
10020.txt 10082-8.txt 10139-8.txt balen10.txt
10021-8.txt 10082.txt 10139.txt baleng2.txt
10021.txt 10083.txt 10140-0.txt batlf10.txt
10022.txt 10084-8.txt 10141.txt bgopr10.txt
10023.txt 10084.txt 10142-8.txt brnte10.txt
10024-8.txt 10085-8.txt 10142.txt bstjg10.txt
10024.txt 10085.txt 10143-8.txt cambp10.txt
10025-8.txt 10086-8.txt 10143.txt canbe10.txt
10025.txt 10086.txt 10144-8.txt cantp10.txt
10026-8.txt 10087.txt 10144.txt cfrz10.txt
10026.txt 10088-8.txt 10145-8.txt crsnk10.txt
10027.txt 10088.txt 10145.txt esbio10.txt
10028.txt 10089-8.txt 10146.txt G-1_Corinthians.txt
10029-8.txt 10089.txt 10147-8.txt G-1_John.txt
10029.txt 10090-8.txt 10147.txt G-1_Peter.txt
10030-8.txt 10090.txt 10148.txt G-1_Thessalonians.txt
10030.txt 10091-8.txt 10149-8.txt G-1_Timothy.txt
10031-8.txt 10091.txt 10149.txt G-2_Corinthians.txt
10031.txt 10092-8.txt 10150-8.txt G-2_John.txt
10032-8.txt 10092.txt 10150.txt G-2_Peter.txt
10032.txt 10093.txt 10151-8.txt G-2_Thessalonians.txt
10033-8.txt 10094-8.txt 10151.txt G-2_Timothy.txt
10033.txt 10094.txt 10159-8.txt G-3_John.txt
10034-8.txt 10095-8.txt 10159.txt G-Acts.txt
10034.txt 10095.txt 10161-8.txt G-Colossians.txt
10035-8.txt 10096-8.txt 10161.txt G-Ephesians.txt
10035.txt 10096.txt 10162-8.txt G-Galatians.txt
10036-8.txt 10097-8.txt 10162.txt G-Hebrews.txt
10036.txt 10097.txt 10163-8.txt G-James.txt
10037-8.txt 10098-8.txt 10163.txt G-John.txt
10037.txt 10098.txt 10164-8.txt G-Jude.txt
10038-8.txt 10099-8.txt 10164.txt G-Luke.txt
10038.txt 10099.txt 10165-8.txt G-Mark.txt
10039-8.txt 10100-8.txt 10165.txt G-Matthew.txt
10039.txt 10100.txt 10166-8.txt G-Philemon.txt
10040.txt 10101.txt 10166.txt G-Philippians.txt
10041-8.txt 10102-8.txt 10201-8.txt G-Revelation.txt
10041.txt 10102.txt 10201.txt G-Romans.txt
10042-8.txt 10103-8.txt 10202-8.txt grybr10.txt
10042.txt 10103.txt 10202.txt G-Titus.txt
10043-8.txt 10104-8.txt 10210-8.txt Introduction_and_Copyright.txt
10043.txt 10104.txt 10210.txt mklmt10.txt
10044.txt 10105-8.txt 10211.txt morem10.txt
10045.txt 10105.txt 10212.txt mspcd10.txt
10046.txt 10106-8.txt 10213-8.txt penbr10.txt
10047-8.txt 10106.txt 10213.txt pgjr10.txt
10047.txt 10107-8.txt 10214-8.txt pntvw10.txt
10048.txt 10107.txt 10214.txt prcpg10.txt
10049.txt 10108.txt 10216.txt prhg10.txt
10050.txt 10109.txt 10217-8.txt prhsb10.txt
10051.txt 10110-8.txt 10217.txt rlsl110.txt
10052-8.txt 10110.txt 10219-8.txt rlsl210.txt
10052.txt 10111-8.txt 10219.txt rmlav10.txt
10056-8.txt 10111.txt 10220.txt sesli10.txt
10056.txt 10112-8.txt 10221.txt svyrd10.txt
10057.txt 10112.txt 10222-8.txt tecom10.txt
10058.txt 10114-8.txt 10222.txt utrkj10.txt
10059.txt 10114.txt 10224.txt vpasm10.txt
10060-8.txt 10116.txt 10225-8.txt wldsp10.txt
10060.txt 10118-8.txt 10225.txt wtrbs10.txt
10062-8.txt 10118.txt 10226.txt zncli10.txt
In [28]:
wc *.txt
958 8807 52510 10001.txt
5690 54201 306901 10002-8.txt
5690 54201 306892 10002.txt
6327 64594 380817 10003.txt
5361 51299 302753 10004-8.txt
5361 51300 302750 10004.txt
7313 73655 434769 10005-8.txt
7313 73656 434760 10005.txt
1582 16512 95836 10006-8.txt
1582 16512 95831 10006.txt
3695 31295 180138 10007-8.txt
3695 31295 180129 10007.txt
9154 69542 407280 10008-8.txt
9154 69542 407271 10008.txt
8502 90180 504214 10009.txt
1451 13938 86336 10010.txt
3881 26555 155576 10011-8.txt
3881 26555 155567 10011.txt
9175 93985 561045 10012-8.txt
9175 94007 561124 10012.txt
2535 18202 136659 10013-8.txt
2535 18202 136650 10013.txt
2858 19257 142501 10014-8.txt
2858 19257 142492 10014.txt
2681 18382 138460 10015.txt
2824 19781 147286 10016-8.txt
2824 19782 147277 10016.txt
2661 18392 136298 10017-8.txt
2661 18392 136288 10017.txt
2741 18548 139711 10018-8.txt
2741 18548 139702 10018.txt
2740 18474 138446 10019-8.txt
2740 18474 138437 10019.txt
5632 48315 280958 10020-8.txt
5632 48315 280956 10020.txt
7316 55173 318638 10021-8.txt
7316 55174 318632 10021.txt
4811 41693 227841 10022.txt
3583 38245 214800 10023.txt
6441 53315 306207 10024-8.txt
6441 53315 306215 10024.txt
9639 69686 404697 10025-8.txt
9639 69686 404693 10025.txt
1938 16222 99768 10026-8.txt
1937 16222 99760 10026.txt
9072 75404 439080 10027.txt
5369 46349 276424 10028.txt
6896 54856 316292 10029-8.txt
6896 54856 316283 10029.txt
10790 104068 587423 10030-8.txt
10790 104068 587414 10030.txt
10377 64145 409992 10031-8.txt
10377 64146 410012 10031.txt
2622 18401 136850 10032-8.txt
2622 18401 136841 10032.txt
2648 18820 138038 10033-8.txt
2648 18820 138029 10033.txt
2888 18082 137039 10034-8.txt
2888 18082 137047 10034.txt
2750 19139 143006 10035-8.txt
2750 19139 142997 10035.txt
2778 18691 142266 10036-8.txt
2778 18691 142257 10036.txt
8222 69785 396129 10037-8.txt
8222 69786 396120 10037.txt
21255 154845 895435 10038-8.txt
21255 154845 895429 10038.txt
23550 147548 884479 10039-8.txt
23550 147559 884471 10039.txt
1053 4683 42014 10040.txt
9132 75482 449980 10041-8.txt
9132 75482 449973 10041.txt
8147 75355 450534 10042-8.txt
8147 75355 450525 10042.txt
3319 32323 186254 10043-8.txt
3319 32326 186257 10043.txt
5621 48042 293397 10044.txt
6265 40441 242697 10045.txt
9802 95425 514564 10046.txt
2785 20102 148910 10047-8.txt
2785 20103 148901 10047.txt
6225 45314 261984 10048.txt
2981 30351 169713 10049.txt
2838 29895 161556 10050.txt
4081 45376 246157 10051.txt
3508 38255 210952 10052-8.txt
3508 38255 210943 10052.txt
12006 90236 531847 10056-8.txt
12006 90235 531848 10056.txt
6270 54036 314672 10057.txt
8321 76409 461647 10058.txt
6650 51775 305801 10059.txt
9956 98878 600419 10060-8.txt
9956 98940 600646 10060.txt
15639 144792 830489 10062-8.txt
15638 144792 830476 10062.txt
2686 24700 141667 10063.txt
21769 176439 1026737 10064-8.txt
21764 176439 1026718 10064.txt
4032 36993 229427 10065-8.txt
4031 36993 229410 10065.txt
10679 82916 462890 10066-8.txt
10679 82916 462881 10066.txt
9563 71433 407039 10067-8.txt
9563 71433 407030 10067.txt
10585 94471 542138 10068-8.txt
10585 94471 542129 10068.txt
1368 11421 70918 10069-8.txt
1368 11421 70925 10069.txt
4169 31908 176568 10070.txt
3378 33499 198401 10071-8.txt
3378 33499 198392 10071.txt
8253 67318 358916 10072.txt
12786 94416 613382 10073-8.txt
12786 94417 613377 10073.txt
1916 15999 98887 10074-8.txt
1916 15999 98882 10074.txt
1339 13280 76968 10075.txt
9408 80555 458449 10076-8.txt
9408 80555 458443 10076.txt
9286 87938 515450 10077-8.txt
9286 87940 515454 10077.txt
1441 13639 85354 10078-8.txt
1441 13639 85350 10078.txt
9182 88050 517560 10079-8.txt
9182 88052 517556 10079.txt
4694 41469 249141 10080-8.txt
4694 41469 249154 10080.txt
8072 54081 312957 10081.txt
12082 111460 631660 10082-8.txt
12082 111460 631655 10082.txt
13157 114410 644275 10083.txt
6819 69022 381318 10084-8.txt
6819 69022 381309 10084.txt
12525 94947 545008 10085-8.txt
12522 94948 544995 10085.txt
9862 88811 496825 10086-8.txt
9862 88811 496819 10086.txt
4148 35075 203239 10087.txt
9832 95176 577279 10088-8.txt
9832 95178 577281 10088.txt
3139 19082 114356 10089-8.txt
3139 19082 114350 10089.txt
8998 78993 485849 10090-8.txt
8998 79000 485844 10090.txt
2650 17966 137367 10091-8.txt
2650 17967 137363 10091.txt
2694 18224 139593 10092-8.txt
2694 18224 139584 10092.txt
5606 51207 280751 10093.txt
8987 84512 469091 10094-8.txt
8987 84512 469085 10094.txt
9762 83914 506198 10095-8.txt
9764 83914 506219 10095.txt
3635 21987 129983 10096-8.txt
3635 21987 129984 10096.txt
5340 50481 306676 10097-8.txt
5340 50481 306672 10097.txt
9040 90482 534029 10098-8.txt
9040 90482 534020 10098.txt
5363 53244 314437 10099-8.txt
5363 53245 314430 10099.txt
7037 67893 409190 10100-8.txt
7037 67893 409181 10100.txt
4351 42179 232173 10101.txt
11389 94144 541263 10102-8.txt
11389 94144 541257 10102.txt
15736 163993 972256 10103-8.txt
15736 163998 972271 10103.txt
2721 19143 142832 10104-8.txt
2721 19143 142823 10104.txt
2719 18314 136937 10105-8.txt
2719 18314 136928 10105.txt
2508 18178 132089 10106-8.txt
2508 18178 132082 10106.txt
7857 77301 449102 10107-8.txt
7857 77302 449093 10107.txt
1079 10156 60968 10108.txt
6686 53543 302048 10109.txt
9162 68588 406024 10110-8.txt
9168 68576 405932 10110.txt
10566 106762 610364 10111-8.txt
10566 106762 610358 10111.txt
3727 35454 220704 10112-8.txt
3727 35454 220705 10112.txt
16624 164025 976088 10114-8.txt
16624 164025 976429 10114.txt
10534 119482 667001 10116.txt
9424 79966 485781 10118-8.txt
9424 79966 485777 10118.txt
6006 52726 316595 10119-8.txt
6006 52726 316586 10119.txt
9340 96565 549199 10120-8.txt
9340 96566 549190 10120.txt
6071 52577 301494 10121-8.txt
6071 52577 301485 10121.txt
2194 11159 68206 10122-8.txt
2194 11159 68201 10122.txt
7692 54583 315258 10123.txt
5955 46308 268922 10124-8.txt
5955 46308 268916 10124.txt
9762 89812 517064 10125-8.txt
9762 89812 517068 10125.txt
6475 65931 371271 10126.txt
3892 34886 197497 10127-8.txt
3889 34842 197252 10127.txt
15264 155012 936026 10128-8.txt
15264 155021 936056 10128.txt
11776 113594 662009 10129-8.txt
11776 113596 662002 10129.txt
22431 207317 1176514 10130-8.txt
22431 207320 1176607 10130.txt
6418 32832 198795 10131.txt
14249 109348 634318 10132-8.txt
14249 109356 634311 10132.txt
3110 16386 104268 10133-8.txt
3110 16386 104282 10133.txt
7843 75315 426703 10134.txt
9352 89111 513276 10135-8.txt
9352 89111 513270 10135.txt
67002 517818 3116511 10136-8.txt
67002 518041 3116615 10136.txt
9085 88036 518861 10138-8.txt
9085 88037 518881 10138.txt
8363 84810 510840 10139-8.txt
8363 84813 510854 10139.txt
6209 48976 310611 10140-0.txt
1680 17177 102750 10141.txt
7358 67499 378164 10142-8.txt
7358 67499 378461 10142.txt
2025 14976 91434 10143-8.txt
2025 14976 91427 10143.txt
2614 18545 137075 10144-8.txt
2614 18545 137067 10144.txt
2746 24830 156141 10145-8.txt
2746 24830 156146 10145.txt
4862 46266 275076 10146.txt
23552 203634 1241751 10147-8.txt
23552 203634 1241756 10147.txt
11580 114400 622517 10148.txt
18149 159055 927432 10149-8.txt
18149 159070 927431 10149.txt
6085 60311 333656 10150-8.txt
6085 60312 333670 10150.txt
15498 154737 923081 10151-8.txt
15498 154738 923082 10151.txt
3948 42212 242914 10159-8.txt
3948 42212 242906 10159.txt
17233 104218 656902 10161-8.txt
17233 104221 656911 10161.txt
8690 96876 561825 10162-8.txt
8690 96876 561829 10162.txt
4024 36730 228995 10163-8.txt
4024 36731 228990 10163.txt
5357 46047 261924 10164-8.txt
5357 46047 261915 10164.txt
16437 167676 976036 10165-8.txt
16437 167700 976131 10165.txt
9239 79198 497350 10166-8.txt
9239 79203 497479 10166.txt
14273 120725 697947 10201-8.txt
14273 120725 697938 10201.txt
9137 82708 477848 10202-8.txt
9137 82747 477846 10202.txt
11089 97074 550493 10210-8.txt
11089 97074 550484 10210.txt
5227 49159 277181 10211.txt
5651 58856 312124 10212.txt
12424 116190 652545 10213-8.txt
12424 116190 652539 10213.txt
4083 42457 257989 10214-8.txt
4083 42457 257985 10214.txt
5004 30067 173852 10216.txt
3674 38752 221770 10217-8.txt
3674 38752 221762 10217.txt
20676 107819 729564 10219-8.txt
20676 107820 729566 10219.txt
2256 16828 93922 10220.txt
9990 89670 503416 10221.txt
12116 123788 726414 10222-8.txt
12116 123789 726429 10222.txt
3508 26644 155304 10224.txt
2041 17529 107008 10225-8.txt
2041 17531 107001 10225.txt
9648 93151 495026 10226.txt
9136 70999 418638 10234-8.txt
9136 71002 418630 10234.txt
9574 100602 569580 12370-8.txt
9574 100603 569574 12370.txt
9331 85806 503559 12372-8.txt
9331 85819 503561 12372.txt
8776 86266 511480 12373-8.txt
8776 86268 511478 12373.txt
8747 84637 499648 12374-8.txt
8747 84637 499663 12374.txt
6151 54689 335844 12375-8.txt
6151 54689 335840 12375.txt
9468 92699 556880 12376-8.txt
9468 92699 556871 12376.txt
3710 33010 190511 12377.txt
1905 13876 86293 12378-8.txt
1905 13876 86285 12378.txt
9379 82181 478811 12380-8.txt
9379 82181 478797 12380.txt
5456 56619 343561 12381.txt
20247 126857 834168 12383-8.txt
20247 126872 834180 12383.txt
6015 33748 211638 12384-8.txt
6015 33748 211632 12384.txt
14015 119266 706626 12385-8.txt
14015 119266 706620 12385.txt
7972 76744 418886 12386.txt
13569 141089 864515 1jcfs10.txt
4658 22391 133753 2babb10.txt
4814 25104 148572 3babb10.txt
5859 30669 179964 50bab10.txt
5702 48306 273967 ajtl10.txt
2752 24292 145780 allyr10.txt
7758 76519 429807 alpsn10.txt
2893 17536 98212 balen10.txt
10930 67628 368952 baleng2.txt
3897 31279 183018 batlf10.txt
2844 19051 110426 bgopr10.txt
824 6677 41155 brnte10.txt
2187 20352 114878 bstjg10.txt
2334 19883 114379 cambp10.txt
6965 62419 360031 canbe10.txt
1975 16461 97478 cantp10.txt
7790 69450 392387 cfrz10.txt
8070 81365 461521 crsnk10.txt
3529 27169 166625 esbio10.txt
761 9885 51008 G-1_Corinthians.txt
209 2590 12974 G-1_John.txt
231 2574 14036 G-1_Peter.txt
161 1988 10271 G-1_Thessalonians.txt
214 2504 13892 G-1_Timothy.txt
512 6470 33690 G-2_Corinthians.txt
33 334 1731 G-2_John.txt
142 1632 9138 G-2_Peter.txt
96 1149 6024 G-2_Thessalonians.txt
152 1778 9689 G-2_Timothy.txt
36 355 1865 G-3_John.txt
1985 24675 132360 G-Acts.txt
177 2122 11587 G-Colossians.txt
260 3273 17385 G-Ephesians.txt
267 3321 17384 G-Galatians.txt
641 7307 39559 G-Hebrews.txt
197 2476 13019 G-James.txt
1463 19648 99454 G-John.txt
62 653 3771 G-Jude.txt
2017 25734 133733 G-Luke.txt
1177 15060 78535 G-Mark.txt
1893 23967 124912 G-Matthew.txt
45 507 2641 G-Philemon.txt
182 2322 12211 G-Philippians.txt
1050 12072 63154 G-Revelation.txt
807 9918 52386 G-Romans.txt
2389 18359 105427 grybr10.txt
90 1013 5613 G-Titus.txt
90 725 4407 Introduction_and_Copyright.txt
6257 57131 325285 mklmt10.txt
1275 6277 38273 morem10.txt
2682 25522 149315 mspcd10.txt
2719 21442 124367 penbr10.txt
5968 54254 309320 pgjr10.txt
2047 21858 123684 pntvw10.txt
7689 68597 393344 prcpg10.txt
2365 19728 116330 prhg10.txt
4171 34653 202188 prhsb10.txt
12644 103791 576111 rlsl110.txt
13022 113937 629627 rlsl210.txt
7687 48233 288195 rmlav10.txt
4920 49931 285499 sesli10.txt
4176 20935 119559 svyrd10.txt
4796 47165 266759 tecom10.txt
11757 114971 671233 utrkj10.txt
902 5482 32586 vpasm10.txt
1540 14174 82920 wldsp10.txt
7991 70571 385159 wtrbs10.txt
11144 100665 598020 zncli10.txt
2760143 23731100 140186425 total
Okay, so now we have 23.7 million words. Let's count 'em!
In [29]:
# File by file - should be slow
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn | head -25
37463 10136.txt:the
37463 10136-8.txt:the
22092 10136.txt:of
22092 10136-8.txt:of
19746 10136.txt:and
19746 10136-8.txt:and
19113 10147.txt:the
19113 10147-8.txt:the
14295 10128.txt:the
14295 10128-8.txt:the
14264 10151.txt:the
14264 10151-8.txt:the
14051 1jcfs10.txt:the
13631 10103.txt:the
13631 10103-8.txt:the
13411 10114.txt:the
13411 10114-8.txt:the
12642 10136.txt:to
12642 10136-8.txt:to
11808 10038.txt:the
11808 10038-8.txt:the
11126 10165.txt:the
11126 10165-8.txt:the
10856 10062.txt:the
10856 10062-8.txt:the
sort: write failed: standard output: Broken pipe
sort: write error
real 3m52.574s
user 3m45.721s
sys 0m22.869s
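Notice that every NAME-8.txt file reports exactly the same counts as its NAME.txt partner, which suggests the two are alternate versions of the same text, so any combined total will count those books twice. Assuming that is the case, a small filter like the one below could list each text only once. The function name unique_texts and the empty sample files are hypothetical.

```shell
# Empty sample files (hypothetical) mimicking the naming pattern above
workdir=$(mktemp -d)
touch "$workdir/10002.txt" "$workdir/10002-8.txt" \
      "$workdir/10003.txt" "$workdir/10140-0.txt"

# Print each text once: skip NAME-8.txt when NAME.txt is also present
unique_texts() {
    for f in "$1"/*.txt; do
        case "$f" in
            *-8.txt) [ -e "${f%-8.txt}.txt" ] && continue ;;
        esac
        printf '%s\n' "$f"
    done
}

unique_texts "$workdir"
```

The output of unique_texts could be piped to parallel in place of ls *.txt.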
In [30]:
# In parallel - should be about twice as fast
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"
real 0m14.181s
user 0m14.433s
sys 0m7.436s
In [31]:
time cat *-words.txt > combined.txt
real 0m0.363s
user 0m0.001s
sys 0m0.303s
In [32]:
time < combined.txt sort | uniq -c | sort -rn | head -25
1554064 the
824506 of
799253 and
629027 to
429747 in
266622 that
238989 it
237853 was
237255 he
200894 with
195603 his
188414 is
186255 for
175814 as
164763 you
140903 had
137538 on
135345 but
135036 not
131652 be
131261 at
126337 by
118609 this
113214 her
108945 or
sort: write failed: standard output: Broken pipe
sort: write error
real 0m40.816s
user 1m7.956s
sys 0m2.530s
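Adding up the three steps, the parallel route took about 55 seconds against 3m52s for the wildcard run, and the final sort of combined.txt accounts for roughly 41 of those seconds. One possible refinement, sketched here on made-up sample files, is to count words per file first (a step that could itself be handed to parallel) and then merge the much smaller count files with awk. This is an illustration under those assumptions, not something timed in this notebook.

```shell
# Two tiny per-file word lists (hypothetical), one word per line
workdir=$(mktemp -d)
printf 'the\ncat\nthe\n' > "$workdir/a.txt-words.txt"
printf 'the\ndog\n' > "$workdir/b.txt-words.txt"

# Per-file counts; each iteration is independent, so this loop is the
# part that could run as one parallel job per file
for f in "$workdir"/*-words.txt; do
    sort "$f" | uniq -c > "$f.counts"
done

# Merge: sum the counts for each word across files, then rank them
merged=$(cat "$workdir"/*.counts \
    | awk '{ total[$2] += $1 } END { for (w in total) print total[w], w }' \
    | sort -rn)
echo "$merged"
```

The merge step only has to sort one line per distinct word rather than one line per word occurrence, which is where any savings would come from.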
Content source: jdelpino1/warehousing-course