Week 1 - September 1, 2015

Hi, welcome to our first class lecture notebook. Here I'll gather all the commands I showed in class the other night and add a little explanation along the way. During that first talk I wanted to be sure you saw what it looks like to work directly in bash, or in the command line, and what it means to use a REPL. In future weeks I'll just use a notebook directly during lectures and will post those right after class.

This was a somewhat unrehearsed tour of useful stuff. For a more thoughtful lesson on the command line shell, see Software Carpentry's Lesson "The UNIX Shell".

Who, what, where, when

We started with a little tour of basic bash commands, run right here in a notebook. Starting with the who / what / where, we can ask about our account name:


In [17]:
whoami


vagrant

That's the default account name, set up because we're running inside a Vagrant virtual machine -- it's not commentary.

Next we asked where we are in the file system:


In [18]:
pwd


/home/vagrant/warehousing-course/lectures

This is a little different from what you saw last night, because I'm putting this in a new folder. Yes, it's odd that pwd isn't whereami. But you have to admit pwd is easier to type than whereami. Just think to yourself: "Print Working Directory" and pwd will be easy to remember.

How does bash know how to execute these commands? It has to find them on disk first. You can ask where bash finds a command using which:


In [19]:
which whoami


/usr/bin/whoami

Okay, so whoami lives under /usr/bin. How did bash know to look under /usr/bin? Because /usr/bin is on the PATH.


In [20]:
echo $PATH


/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/vagrant/repos/command-line-tools-for-data-science/tools:/home/vagrant/tools:/usr/lib/go/bin:/home/vagrant/.go/bin:/home/vagrant/apps/spark/bin:/home/vagrant/apps/spark/bin

Wait, what's echo? It's just a way to print stuff to the screen. Such as saying hello:


In [21]:
echo "hello world"


hello world

Okay, so all those different directories, separated by colons like you see above, make up the PATH environment variable. There's something like this on Windows, too, along with a lot of other variables.

So when you type a command, bash looks through those directories in order and executes the first match it finds. In this case, it finds whoami under /usr/bin, which is the fourth place it checks.
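If you're curious, you can also check how a name resolves without running it: type is a bash builtin that reports whether a name is an alias, a shell builtin, or a file found on the PATH. We didn't do this in class, but in a bash shell you could try:

type whoami
type cd

whoami resolves to the file under /usr/bin, while cd turns out to be a builtin, part of bash itself rather than something looked up on the PATH.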

And if you type a command it can't find, it tells you so:


In [22]:
turtle


No command 'turtle' found, did you mean:
 Command 'kturtle' from package 'kturtle' (universe)
turtle: command not found

In [23]:
which turtle



See? Nothing there. But at least it's helpful to know we could install kturtle. Let's not for now, though.

Going back to the questions of what and where, it's helpful to look around. ls is the command for listing files:


In [24]:
ls


20150901-week-01.ipynb  siddhartha.txt

In the current directory, there's this notebook plus the text file we'll be working with later in this lecture. What else can we find out about the notebook file? Let's use ls with some options (also called "flags"):


In [25]:
ls -l


total 296
-rw-rw-r-- 1 vagrant vagrant  57754 Sep  4 08:24 20150901-week-01.ipynb
-rw-rw-r-- 1 vagrant vagrant 241176 Sep  3 18:12 siddhartha.txt

Okay, so now we know the permissions (the "-rw-rw-r--" part), who owns this file (vagrant), which group owns the file (also vagrant), how many bytes it is (57754, and probably more by the time I'm done typing), when it was last modified (Sep 4 at 08:24, or later, because I'll keep typing), and the file name itself.

ls -l is the ls command with the -l option which stands for "long list". There are lots of options. Another useful one is ls -a:


In [26]:
ls -a


.   .~20150901-week-01.ipynb  .ipynb_checkpoints
..  20150901-week-01.ipynb    siddhartha.txt

ls -a shows all the "dotfiles": semi-hidden files that you don't normally want to see but that are actually all over your drive. The .ipynb_checkpoints directory is where Jupyter keeps checkpoint copies of this notebook. The . entry is actually a reference to this very directory, and is often called "dot". The .. entry is actually a reference to this directory's parent directory, and is often called -- yep -- "dot dot".

Note that we can combine flags:


In [27]:
ls -al


total 276
drwxrwxr-x 3 vagrant vagrant   4096 Sep  4 08:26 .
drwxrwxr-x 4 vagrant vagrant   4096 Sep  3 16:50 ..
-rw-rw-r-- 1 vagrant vagrant  28380 Sep  4 08:26 20150901-week-01.ipynb
drwxr-xr-x 2 vagrant vagrant   4096 Sep  3 16:57 .ipynb_checkpoints
-rw-rw-r-- 1 vagrant vagrant 241176 Sep  3 18:12 siddhartha.txt

That's "give me a file listing, long form, with hidden files."

You can also specify a directory, using an argument:


In [28]:
ls -al ..


total 288
drwxrwxr-x  4 vagrant vagrant   4096 Sep  3 16:50 .
drwxr-xr-x 35 vagrant vagrant   4096 Sep  3 16:49 ..
-rw-rw-r--  1 vagrant vagrant   1125 Sep  3 16:49 assignment-01.md
drwxrwxr-x  8 vagrant vagrant   4096 Sep  3 16:49 .git
-rw-rw-r--  1 vagrant vagrant    715 Sep  3 16:49 .gitignore
drwxrwxr-x  3 vagrant vagrant   4096 Sep  4 08:26 lectures
-rw-rw-r--  1 vagrant vagrant   6556 Sep  3 16:49 LICENSE
-rw-rw-r--  1 vagrant vagrant   1866 Sep  3 16:49 README.md
-rw-rw-r--  1 vagrant vagrant   5769 Sep  3 16:49 README-vm-installation.txt
-rw-rw-r--  1 vagrant vagrant  64941 Sep  3 16:49 Schedule.pdf
-rw-rw-r--  1 vagrant vagrant 181689 Sep  3 16:49 Syllabus.pdf

ls -al .. means "give me a file listing, long form, with hidden files, for the current directory's parent directory."

Moving around

To move around or Change Directories, use cd:


In [29]:
cd ..
pwd


/home/vagrant/warehousing-course

And to go back:


In [30]:
cd lectures
pwd


/home/vagrant/warehousing-course/lectures

Easy, right?

Most unix commands have a manual page or "man page". You can access them with the command man, which takes the name of a command as an argument (e.g. man ls, which I won't do here, because it generates a lot of output).
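If you do want to read one, open a regular bash window and try, for example:

man ls

You move through a man page the same way as the more pager we'll meet later in these notes: the space bar pages forward, and q quits.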

Absolute and relative paths

. and .. and lectures are examples of "relative paths". This is the same concept as relative links on a web site, which should be familiar to any of you who have worked on web sites before. And just as web sites have absolute links, file systems have absolute paths. On unix, absolute paths start with /.


In [31]:
ls /


bin   home            lib64       opt   sbin  usr      vmlinuz.old
boot  initrd.img      lost+found  proc  srv   vagrant
dev   initrd.img.old  media       root  sys   var
etc   lib             mnt         run   tmp   vmlinuz

In [32]:
ls /home


vagrant

In [33]:
ls /home/vagrant


20150901-lecture-01-typing.txt  Downloads  pride.txt  repos
apps                            foo.txt    Public     tools
Desktop                         notebooks  R          warehousing-course

In [34]:
ls /home/vagrant/warehousing-course


assignment-01.md  LICENSE    README-vm-installation.txt  Syllabus.pdf
lectures          README.md  Schedule.pdf

In [35]:
ls /home/vagrant/warehousing-course/lectures


20150901-week-01.ipynb  siddhartha.txt

Fyi, that directory /home/vagrant is special: it's known as your "home directory". There are a lot of extra configuration files in there:


In [36]:
ls -a /home/vagrant


.                               .go                 repos
..                              .ICEauthority       .Rhistory
20150901-lecture-01-typing.txt  .ipynb_checkpoints  .rstudio-desktop
.ansible                        .ipython            .scala_history
apps                            .julia              .spark_history
.bash_aliases                   .julia_history      .spyder
.bash_history                   .jupyter            .spyder2
.bash_logout                    .local              .spyder2-py3
.bashrc                         .m2                 .sqlite_history
.bashrc-anaconda.bak            .mysql              .ssh
.cache                          .mysql_history      tools
.config                         notebooks           .vboxclient-clipboard.pid
.continuum                      .pip                .vboxclient-display.pid
.dbus                           .pki                .vboxclient-draganddrop.pid
Desktop                         pride.txt           .vboxclient-seamless.pid
Downloads                       .profile            .vbox_version
foo.txt                         .psql_history       .viminfo
.gconf                          Public              warehousing-course
.gnome2                         .python_history     .zinc
.gnupg                          R

Your home directory has a special shortcut, ~. Try:


In [37]:
ls -a ~


.                               .go                 repos
..                              .ICEauthority       .Rhistory
20150901-lecture-01-typing.txt  .ipynb_checkpoints  .rstudio-desktop
.ansible                        .ipython            .scala_history
apps                            .julia              .spark_history
.bash_aliases                   .julia_history      .spyder
.bash_history                   .jupyter            .spyder2
.bash_logout                    .local              .spyder2-py3
.bashrc                         .m2                 .sqlite_history
.bashrc-anaconda.bak            .mysql              .ssh
.cache                          .mysql_history      tools
.config                         notebooks           .vboxclient-clipboard.pid
.continuum                      .pip                .vboxclient-display.pid
.dbus                           .pki                .vboxclient-draganddrop.pid
Desktop                         pride.txt           .vboxclient-seamless.pid
Downloads                       .profile            .vbox_version
foo.txt                         .psql_history       .viminfo
.gconf                          Public              warehousing-course
.gnome2                         .python_history     .zinc
.gnupg                          R

You can even combine that shortcut with further path segments, like this:


In [38]:
ls -a ~/warehousing-course/lectures


.  ..  20150901-week-01.ipynb  .ipynb_checkpoints  siddhartha.txt

Btw, all those .bash files are your account's bash configuration. You can also see that there are history files for mysql, julia, python, psql, R, scala, and spark. Right now those are just left over from when I was setting those tools up, but they'll grow as you start using each one.

Creating and removing files

Let's look at adding and removing files. First, we can create a file that doesn't really have anything in it with touch:


In [39]:
touch foo




In [40]:
ls -l foo


-rw-rw-r-- 1 vagrant vagrant 0 Sep  4 08:26 foo

touch just creates an empty file. See how its byte count is 0?

We can remove files with rm, "ReMove".


In [41]:
rm -f foo



I added the -f flag because on this machine rm is set up to confirm first whether a removal should really happen. The flag that asks for confirmation is -i, as in rm -i. In fact, using rm -i instead of rm is such a good idea that I created an alias for it, so that every time you type rm -- which normally doesn't confirm removal, it just goes ahead -- it will instead run rm -i, which will check with you before proceeding. rm -f means "force it", i.e. "don't confirm."

NOTE: that interactive confirmation only works directly in the bash shell... not here in Jupyter. If you try it here and Jupyter hangs with the line just showing a * next to it, use Kernel -> Restart to drop that connection and restart the kernel.
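If you want to see the confirmation prompt for yourself, here's a small sketch to run in a bash window (not in the notebook); junk.txt is just a throwaway name made up for this example:

touch junk.txt
rm junk.txt

With the rm -i alias in place, the second command should ask something like "remove regular empty file 'junk.txt'?" -- answer y to remove it, or n to leave it alone.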

How do you know what aliases exist? Just ask:


In [42]:
alias


alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'
alias doadance='echo "-=-=-=-=-=-=-=-=-=-"'
alias egrep='egrep --color=auto'
alias fgrep='fgrep --color=auto'
alias grep='grep --color=auto'
alias l='ls -CF'
alias la='ls -A'
alias ll='ls -alF'
alias ls='ls --color=auto'
alias rm='rm -i'

In [43]:
doadance


-=-=-=-=-=-=-=-=-=-

Remember when I added doadance in class? I added it to the .bash_aliases file, where you can put your own custom aliases.

Looking at files, flows, and pipes

To look at what's in a file, use cat, which stands for "concatenate."


In [44]:
cat ~/.bash_aliases


alias rm="rm -i"
alias doadance='echo "-=-=-=-=-=-=-=-=-=-"'

If you want to add your own silly alias, try editing ~/.bash_aliases with nano. After that, type source ~/.bash_aliases. source says "read in and act on the commands in this file." When you first log in, or open any new terminal window, all those config files get "sourced" like that, including your .bash_aliases. So after you make a change with nano, you just need to source the file yourself.
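Here's a minimal sketch of that whole cycle, to run in a bash window; the greet alias is just a made-up example, not something already on the machine:

echo "alias greet='echo hello there'" >> ~/.bash_aliases
source ~/.bash_aliases
greet

The >> appends that line to the end of the file (editing with nano works just as well), source re-reads the file, and from then on greet behaves like any other command in that shell.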

The next thing we talked about was looking at long files. We looked at recursive directory listings using ls -R. Try that in your bash shell.

Go ahead, open a new shell window and type ls -R ~. It's okay, I'll wait.

It's a lot of stuff, right? It flies by too fast to read. What would be better is to be able to read it one page at a time. Fortunately, there's a command for that, a "pager", called more.

Try this -- again, do it in a bash window, not in the notebook --:

ls -R ~ | more

"Give me a recursive listing of all files in my home directory and pipe the list through the more pager."

To page ahead, just press the space bar. If you get bored and want to quit, just press 'q'.

That pipe -- the vertical bar character, |, is very important. It takes the output of the command before it and hooks it up as input to the command after it. We will be doing a lot of stuff with pipes, or "pipelines".

For example, to look at only a part of a file -- sometimes you just want to see the beginning, or the end, or a sampling -- there are commands for that too. head and tail do what you might expect:


In [45]:
ls -laR ~ | head


/home/vagrant:
total 936
drwxr-xr-x 35 vagrant vagrant   4096 Sep  3 16:49 .
drwxr-xr-x  3 root    root      4096 Sep 23  2014 ..
-rw-rw-r--  1 vagrant vagrant   3155 Sep  3 16:34 20150901-lecture-01-typing.txt
drwxrwxr-x  3 vagrant vagrant   4096 Sep 23  2014 .ansible
drwxrwxr-x  4 vagrant vagrant   4096 Aug 26 17:46 apps
-rw-rw-r--  1 vagrant vagrant     61 Sep  1 21:20 .bash_aliases
-rw-------  1 vagrant vagrant   6094 Sep  3 18:20 .bash_history
-rw-r--r--  1 vagrant vagrant    220 Sep 23  2014 .bash_logout
ls: write error: Broken pipe

(Don't worry about the "write error: Broken pipe" bit... head quits after printing its ten lines, which leaves ls with nowhere to write the rest. It's harmless; the bash kernel just happens to show the complaint here.)


In [46]:
ls -laR ~ | tail


total 12
drwxrwxr-x 3 vagrant vagrant 4096 Aug 21 00:37 .
drwxrwxr-x 3 vagrant vagrant 4096 Aug 21 00:37 ..
drwxrwxr-x 2 vagrant vagrant 4096 Aug 21 00:37 compiler-interface-2.10.4-51.0

/home/vagrant/.zinc/0.3.5/compiler-interface-2.10.4-51.0:
total 244
drwxrwxr-x 2 vagrant vagrant   4096 Aug 21 00:37 .
drwxrwxr-x 3 vagrant vagrant   4096 Aug 21 00:37 ..
-rw-rw-r-- 1 vagrant vagrant 240885 Aug 21 00:37 compiler-interface.jar

Both head and tail take a simple flag, a count of lines to show (head -3 is shorthand for head -n 3).


In [47]:
ls ~ | head -3


20150901-lecture-01-typing.txt
apps
Desktop

In [48]:
ls ~ | tail -6


pride.txt
Public
R
repos
tools
warehousing-course

Another command, seq, generates sequences of numbers, like so:


In [49]:
seq 10


1
2
3
4
5
6
7
8
9
10

So it might be better to show off head and tail with seq and a pipe:


In [50]:
seq 10 | head -3


1
2
3

In [51]:
seq 10 | tail -6


5
6
7
8
9
10

What if you want a random sample, picking ten items from 1000 (a 1% sample)?


In [52]:
seq 1000 | shuf -n 10


371
654
706
197
959
70
198
964
148
246

See how that worked? seq generated a population, and shuf sampled 10 items from it.
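By the way, shuf on its own shuffles all of its input; it's the -n 10 that limits the output to a sample of ten. If you want to see the difference, try something like:

seq 5 | shuf

which prints the numbers 1 through 5 in a random order.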

Searching and sorting

Let's add a few more commands, then put it all together. We can visit a great site like Project Gutenberg's top 100 texts and grab the raw text of a book like Siddhartha. wget is a useful command for fetching one or more pages or files from the web and saving them locally, like this:


In [53]:
wget https://www.gutenberg.org/ebooks/2500.txt.utf-8


--2015-09-04 08:27:09--  https://www.gutenberg.org/ebooks/2500.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/2500/pg2500.txt [following]
--2015-09-04 08:27:09--  https://www.gutenberg.org/cache/epub/2500/pg2500.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 241176 (236K) [text/plain]
Saving to: ‘2500.txt.utf-8’

100%[======================================>] 241,176      564KB/s   in 0.4s   

2015-09-04 08:27:10 (564 KB/s) - ‘2500.txt.utf-8’ saved [241176/241176]


In [54]:
mv 2500.txt.utf-8 siddhartha.txt




In [55]:
ls -l siddhartha.txt


-rw-rw-r-- 1 vagrant vagrant 241176 Sep  4 08:27 siddhartha.txt

Yep, 241,176 bytes, looks right. Let's search for the word "river" in the text, using grep:


In [56]:
grep river siddhartha.txt | head











grep: write error: Broken pipe

Hmm, that's not all that useful; it would be more useful to know the line numbers too:


In [57]:
grep -n river siddhartha.txt | head











grep: write error: Broken pipe

And come to think of it, that's only finding "river", but not "River". Does "River" appear at all?


In [58]:
grep -n River siddhartha.txt | head



Guess not!

But take a word like "blue": I bet it appears both as "blue" and as "Blue". There's a flag for that, a case-insensitive grep:


In [59]:
grep -in blue siddhartha.txt | head










grep takes options, then one argument which is a token or pattern to search for, then one or more arguments which are file names to search within. We'll see more examples of these later.
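A couple of small variations, as a sketch you could try in a shell: the -c flag makes grep report just the number of matching lines instead of the lines themselves, and if you name more than one file, grep prefixes each match with the file it came from (pride.txt is the other text file you can see in the home directory listing above):

grep -ic river siddhartha.txt
grep -in river siddhartha.txt ~/pride.txt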

In the meantime, let's play with sorting things a bit. Back to sequences, let's try sorting a list of numbers:


In [60]:
seq 10 20 | sort


10
11
12
13
14
15
16
17
18
19
20

Well, that's silly, they're already sorted. Let's do something more complicated:


In [61]:
seq 100 | shuf -n 10 | sort


13
23
25
4
48
56
57
6
84
87

Wait, what just happened?

  • Get a sequence of numbers, 1 - 100
  • pipe that into shuf and sample 10 items from that
  • pipe that into sort, and get a sorted result

But it doesn't look sorted... 4 comes after 25. That's because sort is doing a character sort, not a numeric one. Good thing there's a flag for that.
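Before fixing it, here's a minimal illustration of the difference, using printf to produce three lines:

printf '2\n10\n1\n' | sort
printf '2\n10\n1\n' | sort -n

The first gives 1, 10, 2 -- comparing character by character, "10" comes before "2" -- while the second gives 1, 2, 10. The flag we want here is -n, for a numeric sort.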


In [62]:
seq 100 | shuf -n 10 | sort -n


2
5
6
17
24
37
38
59
75
87

Much better. Remember man pages? They're kind of long. Many commands have a shorter form of help, usually available through a --help option (sometimes -h as well, though for sort that letter means something else, as you can see below):


In [63]:
sort --help


Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           sort by random hash of keys
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --debug               annotate the part of the line used to sort,
                              and warn about questionable usage to stderr
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input
  -k, --key=KEYDEF          sort via a key; KEYDEF gives location and type
  -m, --merge               merge already sorted files; do not sort
  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
      --parallel=N          change the number of sorts run concurrently to N
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run
  -z, --zero-terminated     end lines with 0 byte, not newline
      --help     display this help and exit
      --version  output version information and exit

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end.  If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace.  OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key.  If no key is given, use
the entire line as the key.

SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

Report sort bugs to bug-coreutils@gnu.org
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'sort invocation'

See, lots of options. :)

A data analysis task

Let's put this all together and demonstrate a typical simple command line pipeline that performs a very useful data preparation task for text processing: counting words in a text. What do we need to do to get a count of the most-used words in a text?

  • get a list of all the words
  • lower-case them all
  • sort them all (well, not necessarily, but it helps here)
  • get a count of each one
  • sort that resulting list numerically

Let's do that; it will require a few new commands and some new options on commands you've already seen.

First, we need to split up lines of text into one word per line. grep can do that: -o prints only the matched text, one match per line, -E turns on extended regular expressions, and the pattern '\w{2,}' matches any run of two or more word characters.

(I'll just start with the first three lines to keep output minimal.)


In [64]:
head -3 siddhartha.txt





In [65]:
head -3 siddhartha.txt | grep -oE '\w{2,}'


The
Project
Gutenberg
EBook
of
Siddhartha
by
Herman
Hesse
This
eBook
is
for
the
use
of
anyone
anywhere
at
no
cost
and
with

If we sort that as is, we get:


In [66]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | sort


and
anyone
anywhere
at
by
cost
eBook
EBook
for
Gutenberg
Herman
Hesse
is
no
of
of
Project
Siddhartha
the
The
This
use
with

Ah, there's the cap/no-cap problem again. We can use tr (think "translate") to address that: it reads its input and replaces every character in the first set, [:upper:], with the corresponding character in the second set, [:lower:]:


In [67]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort


and
anyone
anywhere
at
by
cost
ebook
ebook
for
gutenberg
herman
hesse
is
no
of
of
project
siddhartha
the
the
this
use
with

And then collapse repeated occurrences with uniq (which only collapses adjacent duplicates -- another reason the sort was useful):


In [68]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq


and
anyone
anywhere
at
by
cost
ebook
for
gutenberg
herman
hesse
is
no
of
project
siddhartha
the
this
use
with

...and uniq's flag -c, which gives you a count for each (note the reverse solidus ("backslash") denoting line continuation):


In [69]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c


      1 and
      1 anyone
      1 anywhere
      1 at
      1 by
      1 cost
      2 ebook
      1 for
      1 gutenberg
      1 herman
      1 hesse
      1 is
      1 no
      2 of
      1 project
      1 siddhartha
      2 the
      1 this
      1 use
      1 with

Alright! Now we're getting somewhere. Let's run this against the whole text, and clip off the top 25 words.


In [70]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort | head -25


      1 000
    101 do
    106 more
    108 any
    108 so
     10 access
     10 agree
     10 arms
     10 ask
     10 ate
     10 beginning
     10 better
     10 between
     10 blue
     10 brightly
     10 chest
     10 cycle
     10 disgust
     10 entire
     10 entirely
     10 except
     10 exclaimed
     10 feeling
     10 help
     10 isn
sort: write failed: standard output: Broken pipe
sort: write error

Ah, I forgot: numeric sort, not character sort.


In [71]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -n | head -25


      1 000
      1 1500
      1 1887
      1 20
      1 2001
      1 2008
      1 2011
      1 2013
      1 23
      1 30
      1 4557
      1 50
      1 596
      1 60
      1 6221541
      1 64
      1 801
      1 809
      1 84116
      1 99712
      1 abide
      1 abilities
      1 abode
      1 absorbed
      1 absorbing
sort: write failed: standard output: Broken pipe
sort: write error

Oh! And reverse that, so we get the top counts.


In [72]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
 | sort | uniq -c | sort -rn | head -25


   2221 the
   1434 and
   1225 to
   1106 of
    960 he
    708 his
    686 in
    540 you
    524 had
    512 was
    499 this
    496 it
    459 him
    410 with
    409 siddhartha
    371 for
    343 is
    341 that
    328 not
    261 from
    259 but
    242 as
    235 one
    235 have
    213 be
sort: write failed: standard output: Broken pipe
sort: write error

And there you have it. One quick pipeline, one useful result.