Hi, welcome to our first class lecture notebook. Here I'll gather all the commands I showed in class the other night and add a little explanation along the way. During that first talk I wanted to be sure you saw what it looks like to work directly in bash, or in the command line, and what it means to use a REPL. In future weeks I'll just use a notebook directly during lectures and will post those right after class.
This was a somewhat unrehearsed tour of useful stuff. For a more thoughtful lesson on the command line shell, see Software Carpentry's Lesson "The UNIX Shell".
In [17]:
whoami
That's the default username set up because we're using vagrant -- it's not commentary.
Next we asked where we are in the file system:
In [18]:
pwd
This is a little different from what you saw last night, because I'm putting this in a new folder. Yes, it's odd that `pwd` isn't `whereami`. But you have to admit `pwd` is easier to type than `whereami`. Just think to yourself: "Print Working Directory" and `pwd` will be easy to remember.
How does bash know how to execute these commands? First, it looks up which command you might mean, using `which`:
In [19]:
which whoami
Okay, so `whoami` is under `/usr/bin`. How did it know to look under `/usr/bin`? Well, because that's the `PATH`.
In [20]:
echo $PATH
Wait, what's `echo`? It's just a way to print text to the screen. Such as saying hello:
In [21]:
echo "hello world"
Okay, so all those different directories on the system, separated by colons like you see above, make up the `PATH` environment variable. There's something like this on Windows, too, along with a lot of other variables.
So when you type a command in bash, it looks in each of those directories, in order, and if it finds the command, it executes it. In this case, it finds `whoami` under `/usr/bin`, which is the fourth place it checks.
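That lookup can be sketched in bash itself. This loop mimics the search, assuming `whoami` really lives somewhere on your `PATH` (real bash also checks aliases, functions, and builtins first, and caches hits in a hash table):

```shell
# Walk each PATH directory in order; stop at the first executable match.
cmd=whoami
IFS=':' read -ra dirs <<< "$PATH"
for dir in "${dirs[@]}"; do
  if [ -x "$dir/$cmd" ]; then
    echo "found: $dir/$cmd"
    break
  fi
done
```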
And if you type a command it can't find, it tells you so:
In [22]:
turtle
In [23]:
which turtle
See? Nothing there. But at least it's helpful to know we could install `kturtle`. Let's not for now, though.
Going back to the questions of what and where, it's helpful to look around. `ls` is the command for listing files:
In [24]:
ls
In the current directory, there's only this notebook! What else can we find out about the notebook file? Let's use `ls` with some options (also called "flags"):
In [25]:
ls -l
Okay, so now we know the permissions (the "`-rw-rw-r--`" part), who owns this file (vagrant), which group owns the file (also vagrant), how many bytes it is (5975, or probably more after I type more here), when it was last modified (Sep 3 at 17:06, or probably later because I'll keep typing), and the file name itself.
`ls -l` is the `ls` command with the `-l` option, which stands for "long list". There are lots of options. Another useful one is `ls -a`:
In [26]:
ls -a
`ls -a` shows all the "dotfiles": semi-hidden files that you don't normally want to see but that are actually all over your drive. The `.ipynb_checkpoints` entry is support data for this notebook. The `.` entry is actually a reference to this very directory, and is often called "dot". The `..` entry is a reference to this directory's parent directory, and is often called -- yep -- "dot dot".
Note that we can combine flags:
In [27]:
ls -al
That's "give me a file listing, long form, with hidden files."
You can also specify a directory, using an argument:
In [28]:
ls -al ..
In [29]:
cd ..
pwd
And to go back:
In [30]:
cd lectures
pwd
Easy, right?
Most unix commands have a manual page or "man page". You can access them with the command `man`, which takes the name of a command as an argument (e.g. `man ls`, which I won't do here, because it generates a lot of output).
`.` and `..` and `lectures` are examples of "relative paths". This is the same concept as relative links on a web site, which should be familiar to any of you who have worked on web sites before. And just like web sites have absolute links, there are also absolute paths. On unix, absolute paths start with `/`.
In [31]:
ls /
In [32]:
ls /home
In [33]:
ls /home/vagrant
In [34]:
ls /home/vagrant/warehousing-course
In [35]:
ls /home/vagrant/warehousing-course/lectures
FYI, that directory `/home/vagrant` is special: it's known as your "home directory". There are a lot of extra configuration files in there:
In [36]:
ls -a /home/vagrant
Your home directory has a special shortcut, `~`. Try:
In [37]:
ls -a ~
You can even connect that shortcut with relative path segments, like this:
In [38]:
ls -a ~/warehousing-course/lectures
Btw, all those `.bash` files are your account's bash configuration. You can also see there are history files for mysql, julia, python, psql, R, scala, and spark. These are all from when I was setting those tools up; each history file will grow as you start using that tool.
Let's look at adding and removing files. First, we can create a file that doesn't really have anything in it with `touch`:
In [39]:
touch foo
In [40]:
ls -l foo
`touch` just creates an empty file. See how its byte count is 0?
We can remove files with `rm`, "ReMove".
In [41]:
rm -f foo
I added the `-f` flag because on this machine `rm` is set up to confirm first whether a removal should really happen. The `-i` flag is what asks for that confirmation, as in `rm -i`. In fact, using `rm -i` instead of plain `rm` is such a good idea that I created an alias for it, so that every time you type `rm` -- which normally doesn't confirm removal, it just goes ahead -- it will instead run `rm -i`, which checks with you before proceeding. `rm -f` means "force it", i.e. "don't confirm", and it overrides the `-i` from the alias.
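A sketch of how the alias and the flags interact in an interactive shell (the exact alias line on the course VM may differ; with GNU `rm`, when both `-i` and `-f` are given, the last one wins):

```shell
alias rm='rm -i'    # from here on, typing "rm" really runs "rm -i"
touch scratchfile   # make a throwaway file ("scratchfile" is just an illustration)
rm -f scratchfile   # expands to "rm -i -f"; -f comes last, so no prompt
```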
NOTE: the confirmation prompt only works directly in the bash shell, not here in jupyter. If you try it, and jupyter hangs with the line just showing a `*` next to it, use Kernel -> Restart to drop that connection and restart the kernel.
How do you know what aliases exist? Just ask:
In [42]:
alias
In [43]:
doadance
In [44]:
cat ~/.bash_aliases
If you want to add your own silly alias, try editing `~/.bash_aliases` with `nano`. After that, type `source ~/.bash_aliases`. `source` says "read in and act on the commands in this file." When you first log in, or open any new terminal window, all those config files get "sourced" like that, including your `.bash_aliases`. So when you make a change, as you could with `nano`, you just need to source the file yourself.
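Here's what that round trip looks like, with a made-up alias (`ll` is my example name, not one of the course aliases):

```shell
# Try the alias in the current shell first...
alias ll='ls -alF'
ll ~
# ...then make it permanent by adding it to the file and sourcing it:
echo "alias ll='ls -alF'" >> ~/.bash_aliases
source ~/.bash_aliases
```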
The next thing we talked about was looking at long files. We looked at recursive directory listings using `ls -R`. Try that in your bash shell.
Go ahead, open a new shell window and type `ls -R ~`. It's okay, I'll wait.
It's a lot of stuff, right? It flies by too fast to read. It would be better to read it one page at a time. Fortunately, there's a command for that, a "pager", called `more`.
Try this -- again, do it in a bash window, not in the notebook:
ls -R ~ | more
"Give me a recursive listing of all files in my home directory and pipe the list through the more pager."
To page ahead, just press the space bar. If you get bored and want to quit, just type 'q'.
That pipe -- the vertical bar character, `|` -- is very important. It takes the output of the command before it and hooks it up as input to the command after it. We will be doing a lot of stuff with pipes, or "pipelines".
For example, to look at only part of a file -- sometimes you just want to see the beginning, or the end, or a sampling -- there are commands for that too. `head` and `tail` do what you might expect:
In [45]:
ls -laR ~ | head
(don't worry about the "write error: Broken pipe" bit... that's a little funky thing with the bash kernel, we can ignore it here.)
In [46]:
ls -laR ~ | tail
Both `head` and `tail` take a simple flag: a count of lines to show.
In [47]:
ls ~ | head -3
In [48]:
ls ~ | tail -6
Another command, `seq`, generates sequences of numbers, like so:
In [49]:
seq 10
So it might be better to show off `head` and `tail` with `seq` and a pipe:
In [50]:
seq 10 | head -3
In [51]:
seq 10 | tail -6
What if you want a random sample, picking ten items from 1000 (a 1% sample)?
In [52]:
seq 1000 | shuf -n 10
See how that worked? `seq` generated a population, and `shuf` sampled 10 items from it.
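That 1% idea generalizes to any line-oriented input: count the lines, take a hundredth, and hand that number to `shuf -n`. A sketch (it uses `wc -l`, which counts lines -- a command we haven't met yet):

```shell
total=$(seq 1000 | wc -l)            # 1000 lines in the population
n=$(( total / 100 ))                 # 1% of that: 10
seq 1000 | shuf -n "$n" | sort -n    # the sample, sorted for readability
```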
Let's add a few more commands, then put it all together. We can visit a great site like Project Gutenberg's top 100 texts and grab the raw text of a book like Siddhartha. `wget` is a useful command for fetching one or more web pages and storing them locally, like this:
In [53]:
wget https://www.gutenberg.org/ebooks/2500.txt.utf-8
In [54]:
mv 2500.txt.utf-8 siddhartha.txt
In [55]:
ls -l siddhartha.txt
Yep, 241,176 bytes, looks right. Let's search for the word "river" in the text, using `grep`:
In [56]:
grep river siddhartha.txt | head
Hmm, that's not all that useful; it would be better to know the line numbers:
In [57]:
grep -n river siddhartha.txt | head
And come to think of it, that's only finding "river", but not "River". Does "River" appear at all?
In [58]:
grep -n River siddhartha.txt | head
Guess not!
But a word like "blue" might also appear as "Blue" -- I bet it appears as both. There's a flag for that: a case-insensitive `grep`:
In [59]:
grep -in blue siddhartha.txt | head
`grep` takes options, then one argument which is a token or pattern to search for, then one or more further arguments which are file names to search within. We'll see more examples of these later.
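To make that shape concrete, here's a self-contained sketch with a made-up file (`sample.txt` is an illustration, not a course file):

```shell
# options, then the pattern, then the file(s) to search within
printf 'The river flowed.\nThe River spoke.\nThe sea waited.\n' > sample.txt
grep -c river sample.txt    # -c counts matching lines: 1
grep -ci river sample.txt   # case-insensitive count: 2
grep -n River sample.txt    # line-numbered matches, as before
```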
In the meantime, let's play with sorting things a bit. Back to sequences, let's try sorting a list of numbers:
In [60]:
seq 10 20 | sort
Well, that's silly, they're already sorted. Let's do something more complicated:
In [61]:
seq 100 | shuf -n 10 | sort
Wait, what just happened? We asked `seq` to generate the numbers 1 through 100, piped them to `shuf` to sample 10 items from that, then sorted. But it doesn't look sorted... 100 comes before 24. That's because `sort` is doing a character sort, not a numeric one. Good thing there's a flag for that.
In [62]:
seq 100 | shuf -n 10 | sort -n
Much better. Remember man pages? They're kind of long. Many commands have a shorter form of help, available through an option like `-h` or `--help`:
In [63]:
sort --help
See, lots of options. :)
Let's put this all together and demonstrate a typical simple command line pipeline that performs a very useful data preparation task for text processing: counting words in a text. What do we need to do to get a count of the most-used words in a text?
Let's do that, it will require a few new commands and new options on a few commands you've already seen.
First, we need to split up lines of text into individual words per line.
(I'll just start with the first three lines to keep output minimal.)
In [64]:
head -3 siddhartha.txt
In [65]:
head -3 siddhartha.txt | grep -oE '\w{2,}'
If we sort that as is, we get:
In [66]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | sort
Ah, there's the cap/no-cap problem again. We can use `tr` (think "translate") to address that:
In [67]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort
And then collapse multiple occurrences with `uniq`:
In [68]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' | sort | uniq
...and `uniq`'s flag `-c`, which gives you a count for each (note the reverse solidus ("backslash") denoting line continuation):
In [69]:
head -3 siddhartha.txt | grep -oE '\w{2,}' | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c
Alright! Now we're getting somewhere. Let's run this against the whole set, and clip off the top 25 words.
In [70]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort | head -25
Ah, I forgot: numeric sort, not character sort.
In [71]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -n | head -25
Oh! And reverse that, so we get the top counts.
In [72]:
grep -oE '\w{2,}' siddhartha.txt | tr '[:upper:]' '[:lower:]' \
| sort | uniq -c | sort -rn | head -25
And there you have it. One quick pipeline, one useful result.
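If you find yourself typing that pipeline often, you can wrap it in a shell function. This is my own packaging, not something from class, and `topwords` is a made-up name:

```shell
# topwords FILE [N]: print the N (default 25) most frequent words in FILE.
topwords () {
  grep -oE '\w{2,}' "$1" | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn | head -"${2:-25}"
}
topwords siddhartha.txt 10   # top 10 words in Siddhartha
```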