parallel demo

Filling up a pipeline is often a great way to speed up processing, but we can also get a speedup from data-parallel methods. If we have hundreds of similar files to process with the same script on a machine with several cores, for example, we can process them in parallel and use more than one core at a time.
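Before going further, it can help to check how many cores the machine actually has. A minimal sketch, assuming either GNU coreutils' nproc or the getconf fallback is available on your system:

```shell
# nproc ships with GNU coreutils; getconf is the fallback assumed here for other systems
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "cores available: $cores"
```

With only one core, the data-parallel approach described here is unlikely to help much.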

Let's look at a simple example. What if we had five texts from Project Gutenberg and wanted to count the words in them?

First, let's get some data:


In [1]:
ls *.txt


alice.txt  douglass.txt  frankenstein.txt  pride.txt  wuthering.txt

In [2]:
wc *.txt


   3735   29461  167518 alice.txt
   4104   43789  248369 douglass.txt
   7653   77986  448689 frankenstein.txt
  13426  124588  717575 pride.txt
  12486  118899  681641 wuthering.txt
  41404  394723 2263792 total

Okay, we've got five texts, with a total of nearly 400,000 words combined. That's a good start.

Remember our previous pipeline for counting words? It was something like this:


In [6]:
grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25


   2435 the
   1669 of
   1574 to
   1445 and
    794 in
    742 was
    530 he
    457 my
    433 it
    426 with
    422 that
    388 his
    365 as
    359 for
    336 me
    295 this
    292 at
    285 be
    255 had
    253 by
    237 not
    208 or
    206 but
    201 him
    200 is
sort: write failed: standard output: Broken pipe
sort: write error
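To see what each stage contributes, here is the same pipeline run on a tiny made-up sentence (the sample text is an assumption for illustration, not one of the Gutenberg files). grep -oE '\w{2,}' emits one word of two or more characters per line, tr lowercases it, and sort | uniq -c | sort -rn counts and ranks the results:

```shell
# A tiny made-up sample, not taken from the Gutenberg texts
sample='The cat and the dog. The END, a cat!'

counts=$(printf '%s\n' "$sample" \
  | grep -oE '\w{2,}' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn)

# Most frequent word first; the one-letter word "a" never appears
printf '%s\n' "$counts"
```

Here "the" should come out on top with a count of 3, followed by "cat" with 2.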

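Note that we can time the whole pipeline by putting the time command in front. In its output, real is the elapsed wall-clock time, while user and sys are the CPU time spent in user code and in the kernel: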


In [8]:
time grep -oE '\w{2,}' douglass.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25


   2435 the
   1669 of
   1574 to
   1445 and
    794 in
    742 was
    530 he
    457 my
    433 it
    426 with
    422 that
    388 his
    365 as
    359 for
    336 me
    295 this
    292 at
    285 be
    255 had
    253 by
    237 not
    208 or
    206 but
    201 him
    200 is
sort: write failed: standard output: Broken pipe
sort: write error

real	0m0.137s
user	0m0.115s
sys	0m0.023s

What would happen if we just used a filename wildcard?


In [10]:
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25


   4816 wuthering.txt:and
   4750 wuthering.txt:the
   4507 pride.txt:the
   4371 frankenstein.txt:the
   4242 pride.txt:to
   3729 pride.txt:of
   3658 pride.txt:and
   3616 wuthering.txt:to
   3046 frankenstein.txt:and
   2760 frankenstein.txt:of
   2435 douglass.txt:the
   2340 wuthering.txt:of
   2203 pride.txt:her
   2174 frankenstein.txt:to
   2124 wuthering.txt:he
   1944 wuthering.txt:you
   1937 pride.txt:in
   1844 pride.txt:was
   1818 alice.txt:the
   1776 frankenstein.txt:my
   1695 pride.txt:she
   1669 douglass.txt:of
   1574 douglass.txt:to
   1556 pride.txt:that
   1550 pride.txt:it
sort: write failed: standard output: Broken pipe
sort: write error

real	0m3.313s
user	0m3.169s
sys	0m0.378s

What just happened?
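Part of the answer is visible in the output: when grep is given more than one file, it labels every match with its filename, so "the" in one book and "the" in another are counted as different items. A small sketch with two made-up files shows the effect, and how grep's -h option (which suppresses the filename label) merges the counts again:

```shell
tmp=$(mktemp -d)
printf 'the cat\n' > "$tmp/a.txt"
printf 'the dog\n' > "$tmp/b.txt"

# Several files: every match is labelled "file:word", so "the" is counted per file
labelled=$(cd "$tmp" && grep -oE '\w{2,}' *.txt | sort | uniq -c | sort -rn)

# -h drops the filename label, so identical words from different files merge
merged=$(cd "$tmp" && grep -ohE '\w{2,}' *.txt | sort | uniq -c | sort -rn)

printf 'labelled:\n%s\nmerged:\n%s\n' "$labelled" "$merged"
rm -rf "$tmp"
```

The labelled version lists four separate items with a count of 1 each, while the merged version reports "the" twice.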

A different approach - data parallel

We can use the parallel command to spread the pipeline job across multiple cores, as many as we have available. Here -j+0 asks for one job per core, and {} stands for each input filename. We'll need to break the task up first, producing just a raw word list for each file, which we can then combine easily.


In [11]:
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"


real	0m0.442s
user	0m0.297s
sys	0m0.222s

In [13]:
wc *-words.txt


  28625   28625  149885 alice.txt-words.txt
  42277   42277  231365 douglass.txt-words.txt
  73902   73902  417234 frankenstein.txt-words.txt
 121166  121166  669063 pride.txt-words.txt
 114101  114101  619516 wuthering.txt-words.txt
 380071  380071 2087063 total

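Looks about right: the whitespace and punctuation are gone, and the \w{2,} pattern drops one-letter words such as "a" and "I", so the word count is a little lower than before. Did you see how fast that went?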
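If GNU parallel is not installed, xargs -P offers a rough substitute for this per-file step. This is a swapped-in technique rather than the command used in this demo, and the two sample files are made up; it is only a sketch of the same idea, one independent job per file:

```shell
tmp=$(mktemp -d)
printf 'The cat.\n' > "$tmp/a.txt"
printf 'A Dog!\n'   > "$tmp/b.txt"

# -P 4: run up to four jobs at once; -I{}: substitute each input filename.
# The filename is passed as $1 so it is never pasted into the command string.
ls "$tmp"/*.txt | xargs -P 4 -I{} sh -c \
  'grep -oE "\w{2,}" "$1" | tr "[:upper:]" "[:lower:]" > "$1-words.txt"' _ {}

# Show each word list on one line
a_words=$(tr '\n' ' ' < "$tmp/a.txt-words.txt")
b_words=$(tr '\n' ' ' < "$tmp/b.txt-words.txt")
echo "a.txt: $a_words"
echo "b.txt: $b_words"
rm -rf "$tmp"
```

Each file gets its own word list, exactly as in the parallel version, because no job needs to know anything about the other files.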

Now we can combine them and apply the rest of our pipeline:


In [17]:
time cat *-words.txt > combined.txt


real	0m0.004s
user	0m0.002s
sys	0m0.003s

In [16]:
time < combined.txt sort | uniq -c | sort -rn | head -25


  17881 the
  13905 and
  12415 to
  11129 of
   5889 in
   5090 was
   4720 he
   4611 you
   4605 that
   4561 it
   4518 her
   4090 my
   3926 she
   3693 his
   3332 not
   3320 with
   3316 as
   2982 for
   2981 had
   2836 be
   2772 but
   2761 me
   2452 at
   2422 on
   2230 is
sort: write failed: standard output: Broken pipe
sort: write error

real	0m0.506s
user	0m0.810s
sys	0m0.070s

Faster, right? The three steps together took about one second, compared with 3.3 seconds for the wildcard version. If a gain of only a couple of seconds doesn't seem like much, imagine if we had hundreds of texts. Or thousands... (see how to get ebook files for more details).


In [18]:
mkdir pg-text
cd pg-text
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"



(380 text files later...)


In [27]:
ls *.txt


10001.txt    10062.txt    10119-8.txt  10234-8.txt
10002-8.txt  10063.txt    10119.txt    10234.txt
10002.txt    10064-8.txt  10120-8.txt  12370-8.txt
10003.txt    10064.txt    10120.txt    12370.txt
10004-8.txt  10065-8.txt  10121-8.txt  12372-8.txt
10004.txt    10065.txt    10121.txt    12372.txt
10005-8.txt  10066-8.txt  10122-8.txt  12373-8.txt
10005.txt    10066.txt    10122.txt    12373.txt
10006-8.txt  10067-8.txt  10123.txt    12374-8.txt
10006.txt    10067.txt    10124-8.txt  12374.txt
10007-8.txt  10068-8.txt  10124.txt    12375-8.txt
10007.txt    10068.txt    10125-8.txt  12375.txt
10008-8.txt  10069-8.txt  10125.txt    12376-8.txt
10008.txt    10069.txt    10126.txt    12376.txt
10009.txt    10070.txt    10127-8.txt  12377.txt
10010.txt    10071-8.txt  10127.txt    12378-8.txt
10011-8.txt  10071.txt    10128-8.txt  12378.txt
10011.txt    10072.txt    10128.txt    12380-8.txt
10012-8.txt  10073-8.txt  10129-8.txt  12380.txt
10012.txt    10073.txt    10129.txt    12381.txt
10013-8.txt  10074-8.txt  10130-8.txt  12383-8.txt
10013.txt    10074.txt    10130.txt    12383.txt
10014-8.txt  10075.txt    10131.txt    12384-8.txt
10014.txt    10076-8.txt  10132-8.txt  12384.txt
10015.txt    10076.txt    10132.txt    12385-8.txt
10016-8.txt  10077-8.txt  10133-8.txt  12385.txt
10016.txt    10077.txt    10133.txt    12386.txt
10017-8.txt  10078-8.txt  10134.txt    1jcfs10.txt
10017.txt    10078.txt    10135-8.txt  2babb10.txt
10018-8.txt  10079-8.txt  10135.txt    3babb10.txt
10018.txt    10079.txt    10136-8.txt  50bab10.txt
10019-8.txt  10080-8.txt  10136.txt    ajtl10.txt
10019.txt    10080.txt    10138-8.txt  allyr10.txt
10020-8.txt  10081.txt    10138.txt    alpsn10.txt
10020.txt    10082-8.txt  10139-8.txt  balen10.txt
10021-8.txt  10082.txt    10139.txt    baleng2.txt
10021.txt    10083.txt    10140-0.txt  batlf10.txt
10022.txt    10084-8.txt  10141.txt    bgopr10.txt
10023.txt    10084.txt    10142-8.txt  brnte10.txt
10024-8.txt  10085-8.txt  10142.txt    bstjg10.txt
10024.txt    10085.txt    10143-8.txt  cambp10.txt
10025-8.txt  10086-8.txt  10143.txt    canbe10.txt
10025.txt    10086.txt    10144-8.txt  cantp10.txt
10026-8.txt  10087.txt    10144.txt    cfrz10.txt
10026.txt    10088-8.txt  10145-8.txt  crsnk10.txt
10027.txt    10088.txt    10145.txt    esbio10.txt
10028.txt    10089-8.txt  10146.txt    G-1_Corinthians.txt
10029-8.txt  10089.txt    10147-8.txt  G-1_John.txt
10029.txt    10090-8.txt  10147.txt    G-1_Peter.txt
10030-8.txt  10090.txt    10148.txt    G-1_Thessalonians.txt
10030.txt    10091-8.txt  10149-8.txt  G-1_Timothy.txt
10031-8.txt  10091.txt    10149.txt    G-2_Corinthians.txt
10031.txt    10092-8.txt  10150-8.txt  G-2_John.txt
10032-8.txt  10092.txt    10150.txt    G-2_Peter.txt
10032.txt    10093.txt    10151-8.txt  G-2_Thessalonians.txt
10033-8.txt  10094-8.txt  10151.txt    G-2_Timothy.txt
10033.txt    10094.txt    10159-8.txt  G-3_John.txt
10034-8.txt  10095-8.txt  10159.txt    G-Acts.txt
10034.txt    10095.txt    10161-8.txt  G-Colossians.txt
10035-8.txt  10096-8.txt  10161.txt    G-Ephesians.txt
10035.txt    10096.txt    10162-8.txt  G-Galatians.txt
10036-8.txt  10097-8.txt  10162.txt    G-Hebrews.txt
10036.txt    10097.txt    10163-8.txt  G-James.txt
10037-8.txt  10098-8.txt  10163.txt    G-John.txt
10037.txt    10098.txt    10164-8.txt  G-Jude.txt
10038-8.txt  10099-8.txt  10164.txt    G-Luke.txt
10038.txt    10099.txt    10165-8.txt  G-Mark.txt
10039-8.txt  10100-8.txt  10165.txt    G-Matthew.txt
10039.txt    10100.txt    10166-8.txt  G-Philemon.txt
10040.txt    10101.txt    10166.txt    G-Philippians.txt
10041-8.txt  10102-8.txt  10201-8.txt  G-Revelation.txt
10041.txt    10102.txt    10201.txt    G-Romans.txt
10042-8.txt  10103-8.txt  10202-8.txt  grybr10.txt
10042.txt    10103.txt    10202.txt    G-Titus.txt
10043-8.txt  10104-8.txt  10210-8.txt  Introduction_and_Copyright.txt
10043.txt    10104.txt    10210.txt    mklmt10.txt
10044.txt    10105-8.txt  10211.txt    morem10.txt
10045.txt    10105.txt    10212.txt    mspcd10.txt
10046.txt    10106-8.txt  10213-8.txt  penbr10.txt
10047-8.txt  10106.txt    10213.txt    pgjr10.txt
10047.txt    10107-8.txt  10214-8.txt  pntvw10.txt
10048.txt    10107.txt    10214.txt    prcpg10.txt
10049.txt    10108.txt    10216.txt    prhg10.txt
10050.txt    10109.txt    10217-8.txt  prhsb10.txt
10051.txt    10110-8.txt  10217.txt    rlsl110.txt
10052-8.txt  10110.txt    10219-8.txt  rlsl210.txt
10052.txt    10111-8.txt  10219.txt    rmlav10.txt
10056-8.txt  10111.txt    10220.txt    sesli10.txt
10056.txt    10112-8.txt  10221.txt    svyrd10.txt
10057.txt    10112.txt    10222-8.txt  tecom10.txt
10058.txt    10114-8.txt  10222.txt    utrkj10.txt
10059.txt    10114.txt    10224.txt    vpasm10.txt
10060-8.txt  10116.txt    10225-8.txt  wldsp10.txt
10060.txt    10118-8.txt  10225.txt    wtrbs10.txt
10062-8.txt  10118.txt    10226.txt    zncli10.txt

In [28]:
wc *.txt


      958      8807     52510 10001.txt
     5690     54201    306901 10002-8.txt
     5690     54201    306892 10002.txt
     6327     64594    380817 10003.txt
     5361     51299    302753 10004-8.txt
     5361     51300    302750 10004.txt
     7313     73655    434769 10005-8.txt
     7313     73656    434760 10005.txt
     1582     16512     95836 10006-8.txt
     1582     16512     95831 10006.txt
     3695     31295    180138 10007-8.txt
     3695     31295    180129 10007.txt
     9154     69542    407280 10008-8.txt
     9154     69542    407271 10008.txt
     8502     90180    504214 10009.txt
     1451     13938     86336 10010.txt
     3881     26555    155576 10011-8.txt
     3881     26555    155567 10011.txt
     9175     93985    561045 10012-8.txt
     9175     94007    561124 10012.txt
     2535     18202    136659 10013-8.txt
     2535     18202    136650 10013.txt
     2858     19257    142501 10014-8.txt
     2858     19257    142492 10014.txt
     2681     18382    138460 10015.txt
     2824     19781    147286 10016-8.txt
     2824     19782    147277 10016.txt
     2661     18392    136298 10017-8.txt
     2661     18392    136288 10017.txt
     2741     18548    139711 10018-8.txt
     2741     18548    139702 10018.txt
     2740     18474    138446 10019-8.txt
     2740     18474    138437 10019.txt
     5632     48315    280958 10020-8.txt
     5632     48315    280956 10020.txt
     7316     55173    318638 10021-8.txt
     7316     55174    318632 10021.txt
     4811     41693    227841 10022.txt
     3583     38245    214800 10023.txt
     6441     53315    306207 10024-8.txt
     6441     53315    306215 10024.txt
     9639     69686    404697 10025-8.txt
     9639     69686    404693 10025.txt
     1938     16222     99768 10026-8.txt
     1937     16222     99760 10026.txt
     9072     75404    439080 10027.txt
     5369     46349    276424 10028.txt
     6896     54856    316292 10029-8.txt
     6896     54856    316283 10029.txt
    10790    104068    587423 10030-8.txt
    10790    104068    587414 10030.txt
    10377     64145    409992 10031-8.txt
    10377     64146    410012 10031.txt
     2622     18401    136850 10032-8.txt
     2622     18401    136841 10032.txt
     2648     18820    138038 10033-8.txt
     2648     18820    138029 10033.txt
     2888     18082    137039 10034-8.txt
     2888     18082    137047 10034.txt
     2750     19139    143006 10035-8.txt
     2750     19139    142997 10035.txt
     2778     18691    142266 10036-8.txt
     2778     18691    142257 10036.txt
     8222     69785    396129 10037-8.txt
     8222     69786    396120 10037.txt
    21255    154845    895435 10038-8.txt
    21255    154845    895429 10038.txt
    23550    147548    884479 10039-8.txt
    23550    147559    884471 10039.txt
     1053      4683     42014 10040.txt
     9132     75482    449980 10041-8.txt
     9132     75482    449973 10041.txt
     8147     75355    450534 10042-8.txt
     8147     75355    450525 10042.txt
     3319     32323    186254 10043-8.txt
     3319     32326    186257 10043.txt
     5621     48042    293397 10044.txt
     6265     40441    242697 10045.txt
     9802     95425    514564 10046.txt
     2785     20102    148910 10047-8.txt
     2785     20103    148901 10047.txt
     6225     45314    261984 10048.txt
     2981     30351    169713 10049.txt
     2838     29895    161556 10050.txt
     4081     45376    246157 10051.txt
     3508     38255    210952 10052-8.txt
     3508     38255    210943 10052.txt
    12006     90236    531847 10056-8.txt
    12006     90235    531848 10056.txt
     6270     54036    314672 10057.txt
     8321     76409    461647 10058.txt
     6650     51775    305801 10059.txt
     9956     98878    600419 10060-8.txt
     9956     98940    600646 10060.txt
    15639    144792    830489 10062-8.txt
    15638    144792    830476 10062.txt
     2686     24700    141667 10063.txt
    21769    176439   1026737 10064-8.txt
    21764    176439   1026718 10064.txt
     4032     36993    229427 10065-8.txt
     4031     36993    229410 10065.txt
    10679     82916    462890 10066-8.txt
    10679     82916    462881 10066.txt
     9563     71433    407039 10067-8.txt
     9563     71433    407030 10067.txt
    10585     94471    542138 10068-8.txt
    10585     94471    542129 10068.txt
     1368     11421     70918 10069-8.txt
     1368     11421     70925 10069.txt
     4169     31908    176568 10070.txt
     3378     33499    198401 10071-8.txt
     3378     33499    198392 10071.txt
     8253     67318    358916 10072.txt
    12786     94416    613382 10073-8.txt
    12786     94417    613377 10073.txt
     1916     15999     98887 10074-8.txt
     1916     15999     98882 10074.txt
     1339     13280     76968 10075.txt
     9408     80555    458449 10076-8.txt
     9408     80555    458443 10076.txt
     9286     87938    515450 10077-8.txt
     9286     87940    515454 10077.txt
     1441     13639     85354 10078-8.txt
     1441     13639     85350 10078.txt
     9182     88050    517560 10079-8.txt
     9182     88052    517556 10079.txt
     4694     41469    249141 10080-8.txt
     4694     41469    249154 10080.txt
     8072     54081    312957 10081.txt
    12082    111460    631660 10082-8.txt
    12082    111460    631655 10082.txt
    13157    114410    644275 10083.txt
     6819     69022    381318 10084-8.txt
     6819     69022    381309 10084.txt
    12525     94947    545008 10085-8.txt
    12522     94948    544995 10085.txt
     9862     88811    496825 10086-8.txt
     9862     88811    496819 10086.txt
     4148     35075    203239 10087.txt
     9832     95176    577279 10088-8.txt
     9832     95178    577281 10088.txt
     3139     19082    114356 10089-8.txt
     3139     19082    114350 10089.txt
     8998     78993    485849 10090-8.txt
     8998     79000    485844 10090.txt
     2650     17966    137367 10091-8.txt
     2650     17967    137363 10091.txt
     2694     18224    139593 10092-8.txt
     2694     18224    139584 10092.txt
     5606     51207    280751 10093.txt
     8987     84512    469091 10094-8.txt
     8987     84512    469085 10094.txt
     9762     83914    506198 10095-8.txt
     9764     83914    506219 10095.txt
     3635     21987    129983 10096-8.txt
     3635     21987    129984 10096.txt
     5340     50481    306676 10097-8.txt
     5340     50481    306672 10097.txt
     9040     90482    534029 10098-8.txt
     9040     90482    534020 10098.txt
     5363     53244    314437 10099-8.txt
     5363     53245    314430 10099.txt
     7037     67893    409190 10100-8.txt
     7037     67893    409181 10100.txt
     4351     42179    232173 10101.txt
    11389     94144    541263 10102-8.txt
    11389     94144    541257 10102.txt
    15736    163993    972256 10103-8.txt
    15736    163998    972271 10103.txt
     2721     19143    142832 10104-8.txt
     2721     19143    142823 10104.txt
     2719     18314    136937 10105-8.txt
     2719     18314    136928 10105.txt
     2508     18178    132089 10106-8.txt
     2508     18178    132082 10106.txt
     7857     77301    449102 10107-8.txt
     7857     77302    449093 10107.txt
     1079     10156     60968 10108.txt
     6686     53543    302048 10109.txt
     9162     68588    406024 10110-8.txt
     9168     68576    405932 10110.txt
    10566    106762    610364 10111-8.txt
    10566    106762    610358 10111.txt
     3727     35454    220704 10112-8.txt
     3727     35454    220705 10112.txt
    16624    164025    976088 10114-8.txt
    16624    164025    976429 10114.txt
    10534    119482    667001 10116.txt
     9424     79966    485781 10118-8.txt
     9424     79966    485777 10118.txt
     6006     52726    316595 10119-8.txt
     6006     52726    316586 10119.txt
     9340     96565    549199 10120-8.txt
     9340     96566    549190 10120.txt
     6071     52577    301494 10121-8.txt
     6071     52577    301485 10121.txt
     2194     11159     68206 10122-8.txt
     2194     11159     68201 10122.txt
     7692     54583    315258 10123.txt
     5955     46308    268922 10124-8.txt
     5955     46308    268916 10124.txt
     9762     89812    517064 10125-8.txt
     9762     89812    517068 10125.txt
     6475     65931    371271 10126.txt
     3892     34886    197497 10127-8.txt
     3889     34842    197252 10127.txt
    15264    155012    936026 10128-8.txt
    15264    155021    936056 10128.txt
    11776    113594    662009 10129-8.txt
    11776    113596    662002 10129.txt
    22431    207317   1176514 10130-8.txt
    22431    207320   1176607 10130.txt
     6418     32832    198795 10131.txt
    14249    109348    634318 10132-8.txt
    14249    109356    634311 10132.txt
     3110     16386    104268 10133-8.txt
     3110     16386    104282 10133.txt
     7843     75315    426703 10134.txt
     9352     89111    513276 10135-8.txt
     9352     89111    513270 10135.txt
    67002    517818   3116511 10136-8.txt
    67002    518041   3116615 10136.txt
     9085     88036    518861 10138-8.txt
     9085     88037    518881 10138.txt
     8363     84810    510840 10139-8.txt
     8363     84813    510854 10139.txt
     6209     48976    310611 10140-0.txt
     1680     17177    102750 10141.txt
     7358     67499    378164 10142-8.txt
     7358     67499    378461 10142.txt
     2025     14976     91434 10143-8.txt
     2025     14976     91427 10143.txt
     2614     18545    137075 10144-8.txt
     2614     18545    137067 10144.txt
     2746     24830    156141 10145-8.txt
     2746     24830    156146 10145.txt
     4862     46266    275076 10146.txt
    23552    203634   1241751 10147-8.txt
    23552    203634   1241756 10147.txt
    11580    114400    622517 10148.txt
    18149    159055    927432 10149-8.txt
    18149    159070    927431 10149.txt
     6085     60311    333656 10150-8.txt
     6085     60312    333670 10150.txt
    15498    154737    923081 10151-8.txt
    15498    154738    923082 10151.txt
     3948     42212    242914 10159-8.txt
     3948     42212    242906 10159.txt
    17233    104218    656902 10161-8.txt
    17233    104221    656911 10161.txt
     8690     96876    561825 10162-8.txt
     8690     96876    561829 10162.txt
     4024     36730    228995 10163-8.txt
     4024     36731    228990 10163.txt
     5357     46047    261924 10164-8.txt
     5357     46047    261915 10164.txt
    16437    167676    976036 10165-8.txt
    16437    167700    976131 10165.txt
     9239     79198    497350 10166-8.txt
     9239     79203    497479 10166.txt
    14273    120725    697947 10201-8.txt
    14273    120725    697938 10201.txt
     9137     82708    477848 10202-8.txt
     9137     82747    477846 10202.txt
    11089     97074    550493 10210-8.txt
    11089     97074    550484 10210.txt
     5227     49159    277181 10211.txt
     5651     58856    312124 10212.txt
    12424    116190    652545 10213-8.txt
    12424    116190    652539 10213.txt
     4083     42457    257989 10214-8.txt
     4083     42457    257985 10214.txt
     5004     30067    173852 10216.txt
     3674     38752    221770 10217-8.txt
     3674     38752    221762 10217.txt
    20676    107819    729564 10219-8.txt
    20676    107820    729566 10219.txt
     2256     16828     93922 10220.txt
     9990     89670    503416 10221.txt
    12116    123788    726414 10222-8.txt
    12116    123789    726429 10222.txt
     3508     26644    155304 10224.txt
     2041     17529    107008 10225-8.txt
     2041     17531    107001 10225.txt
     9648     93151    495026 10226.txt
     9136     70999    418638 10234-8.txt
     9136     71002    418630 10234.txt
     9574    100602    569580 12370-8.txt
     9574    100603    569574 12370.txt
     9331     85806    503559 12372-8.txt
     9331     85819    503561 12372.txt
     8776     86266    511480 12373-8.txt
     8776     86268    511478 12373.txt
     8747     84637    499648 12374-8.txt
     8747     84637    499663 12374.txt
     6151     54689    335844 12375-8.txt
     6151     54689    335840 12375.txt
     9468     92699    556880 12376-8.txt
     9468     92699    556871 12376.txt
     3710     33010    190511 12377.txt
     1905     13876     86293 12378-8.txt
     1905     13876     86285 12378.txt
     9379     82181    478811 12380-8.txt
     9379     82181    478797 12380.txt
     5456     56619    343561 12381.txt
    20247    126857    834168 12383-8.txt
    20247    126872    834180 12383.txt
     6015     33748    211638 12384-8.txt
     6015     33748    211632 12384.txt
    14015    119266    706626 12385-8.txt
    14015    119266    706620 12385.txt
     7972     76744    418886 12386.txt
    13569    141089    864515 1jcfs10.txt
     4658     22391    133753 2babb10.txt
     4814     25104    148572 3babb10.txt
     5859     30669    179964 50bab10.txt
     5702     48306    273967 ajtl10.txt
     2752     24292    145780 allyr10.txt
     7758     76519    429807 alpsn10.txt
     2893     17536     98212 balen10.txt
    10930     67628    368952 baleng2.txt
     3897     31279    183018 batlf10.txt
     2844     19051    110426 bgopr10.txt
      824      6677     41155 brnte10.txt
     2187     20352    114878 bstjg10.txt
     2334     19883    114379 cambp10.txt
     6965     62419    360031 canbe10.txt
     1975     16461     97478 cantp10.txt
     7790     69450    392387 cfrz10.txt
     8070     81365    461521 crsnk10.txt
     3529     27169    166625 esbio10.txt
      761      9885     51008 G-1_Corinthians.txt
      209      2590     12974 G-1_John.txt
      231      2574     14036 G-1_Peter.txt
      161      1988     10271 G-1_Thessalonians.txt
      214      2504     13892 G-1_Timothy.txt
      512      6470     33690 G-2_Corinthians.txt
       33       334      1731 G-2_John.txt
      142      1632      9138 G-2_Peter.txt
       96      1149      6024 G-2_Thessalonians.txt
      152      1778      9689 G-2_Timothy.txt
       36       355      1865 G-3_John.txt
     1985     24675    132360 G-Acts.txt
      177      2122     11587 G-Colossians.txt
      260      3273     17385 G-Ephesians.txt
      267      3321     17384 G-Galatians.txt
      641      7307     39559 G-Hebrews.txt
      197      2476     13019 G-James.txt
     1463     19648     99454 G-John.txt
       62       653      3771 G-Jude.txt
     2017     25734    133733 G-Luke.txt
     1177     15060     78535 G-Mark.txt
     1893     23967    124912 G-Matthew.txt
       45       507      2641 G-Philemon.txt
      182      2322     12211 G-Philippians.txt
     1050     12072     63154 G-Revelation.txt
      807      9918     52386 G-Romans.txt
     2389     18359    105427 grybr10.txt
       90      1013      5613 G-Titus.txt
       90       725      4407 Introduction_and_Copyright.txt
     6257     57131    325285 mklmt10.txt
     1275      6277     38273 morem10.txt
     2682     25522    149315 mspcd10.txt
     2719     21442    124367 penbr10.txt
     5968     54254    309320 pgjr10.txt
     2047     21858    123684 pntvw10.txt
     7689     68597    393344 prcpg10.txt
     2365     19728    116330 prhg10.txt
     4171     34653    202188 prhsb10.txt
    12644    103791    576111 rlsl110.txt
    13022    113937    629627 rlsl210.txt
     7687     48233    288195 rmlav10.txt
     4920     49931    285499 sesli10.txt
     4176     20935    119559 svyrd10.txt
     4796     47165    266759 tecom10.txt
    11757    114971    671233 utrkj10.txt
      902      5482     32586 vpasm10.txt
     1540     14174     82920 wldsp10.txt
     7991     70571    385159 wtrbs10.txt
    11144    100665    598020 zncli10.txt
  2760143  23731100 140186425 total
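The listing suggests that many books appear twice, once as NNNNN.txt and once as NNNNN-8.txt (apparently the same text in a different encoding), so their words would be counted twice. If that is not wanted, a filter along these lines would skip the second copy. The filename pattern is an assumption based on the names shown above, and the three empty files are made up for the sketch:

```shell
tmp=$(mktemp -d)
touch "$tmp/10002.txt" "$tmp/10002-8.txt" "$tmp/10003.txt"

# Keep every .txt file except those whose name ends in -8.txt
kept=$(cd "$tmp" && ls *.txt | grep -v -- '-8\.txt$' | tr '\n' ' ')
echo "$kept"
rm -rf "$tmp"
```

The same grep -v stage could sit between ls and parallel in the commands used in this demo. The runs that follow keep both copies.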

Okay, so now we have 23.7 million words. Let's count 'em!


In [29]:
# File by file - should be slow
time grep -oE '\w{2,}' *.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25


  37463 10136.txt:the
  37463 10136-8.txt:the
  22092 10136.txt:of
  22092 10136-8.txt:of
  19746 10136.txt:and
  19746 10136-8.txt:and
  19113 10147.txt:the
  19113 10147-8.txt:the
  14295 10128.txt:the
  14295 10128-8.txt:the
  14264 10151.txt:the
  14264 10151-8.txt:the
  14051 1jcfs10.txt:the
  13631 10103.txt:the
  13631 10103-8.txt:the
  13411 10114.txt:the
  13411 10114-8.txt:the
  12642 10136.txt:to
  12642 10136-8.txt:to
  11808 10038.txt:the
  11808 10038-8.txt:the
  11126 10165.txt:the
  11126 10165-8.txt:the
  10856 10062.txt:the
  10856 10062-8.txt:the
sort: write failed: standard output: Broken pipe
sort: write error

real	3m52.574s
user	3m45.721s
sys	0m22.869s

In [30]:
# In parallel - should be about twice as fast
time ls *.txt | parallel -j+0 "grep -oE '\w{2,}' {} | tr '[:upper:]' '[:lower:]' > {}-words.txt"


real	0m14.181s
user	0m14.433s
sys	0m7.436s

In [31]:
time cat *-words.txt > combined.txt


real	0m0.363s
user	0m0.001s
sys	0m0.303s

In [32]:
time < combined.txt sort | uniq -c | sort -rn | head -25


1554064 the
 824506 of
 799253 and
 629027 to
 429747 in
 266622 that
 238989 it
 237853 was
 237255 he
 200894 with
 195603 his
 188414 is
 186255 for
 175814 as
 164763 you
 140903 had
 137538 on
 135345 but
 135036 not
 131652 be
 131261 at
 126337 by
 118609 this
 113214 her
 108945 or
sort: write failed: standard output: Broken pipe
sort: write error

real	0m40.816s
user	1m7.956s
sys	0m2.530s
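To put the two approaches side by side, we can add up the real times reported above. This is only arithmetic on the numbers already shown, done with awk because the shell itself has no floating-point math:

```shell
# Timings copied from the "real" lines above, in seconds
serial=$(awk 'BEGIN { print 3 * 60 + 52.574 }')   # wildcard pipeline, 3m52.574s
split=14.181                                      # parallel grep | tr
combine=0.363                                     # cat into combined.txt
count=40.816                                      # sort | uniq -c | sort -rn

total=$(awk -v a="$split" -v b="$combine" -v c="$count" 'BEGIN { print a + b + c }')
speedup=$(awk -v s="$serial" -v t="$total" 'BEGIN { printf "%.1f", s / t }')
echo "serial: ${serial}s  parallel approach: ${total}s  speedup: ${speedup}x"
```

On these numbers the data-parallel version comes out roughly four times faster overall. Most of the remaining time is spent in the final sort | uniq -c | sort step, which still runs as a single pipeline over the combined file.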