Duration: 15 minutes (longer if people have trouble getting an editor to work).
Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples in all, and now needs to run each one through an assay machine and then analyze the machine's output with two programs, goostat and goodiff.
It takes about half an hour for the assay machine to process each sample. The good news is, it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will "only" take about two weeks.
The bad news is, if she has to run goostat and goodiff by hand, she'll have to enter filenames and click "OK" roughly 300^2 times (300 runs of goostat, plus 300×299 runs of goodiff). At 30 seconds each, that will take 750 hours, or more than 18 forty-hour weeks. Not only would she miss her paper deadline, the chances of her getting all 90,000 commands right are approximately zero.
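The estimate above can be checked with the shell's built-in integer arithmetic. This is only a sketch: the variable names are made up for illustration, and the 40-hour working week is an assumption used to turn hours into weeks.

```shell
# The count of 300 comes from the text above.
items=300

# 300 runs of goostat, plus 300 x 299 runs of goodiff.
goostat_runs=$items
goodiff_runs=$(( items * (items - 1) ))
total_runs=$(( goostat_runs + goodiff_runs ))

# 30 seconds per run, converted to whole hours.
hours=$(( total_runs * 30 / 3600 ))

# Assumption: a 40-hour working week (integer division rounds down).
weeks=$(( hours / 40 ))

echo "runs: $total_runs, hours: $hours, weeks: $weeks"
```

Because shell arithmetic works on integers only, the week count is rounded down.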
This chapter is about what she should do instead. More specifically, it's about how she can use a command shell to automate the repetitive steps in her processing pipeline, so that her computer can work 24 hours a day while she writes her paper. As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.
The file system is the part of your operating system responsible for managing files and directories. This may sound obscure, but the UNIX shell just uses different names for things you already know:
Imagine you are away from your main computer, and suddenly realize you need an important Excel spreadsheet. So you call back to your lab, and ask someone to email you the Excel document. What do you tell them? It's probably something like:
Double click on the Work_Stuff folder on my desktop, then choose the Project_1 folder, then click the datasheets folder, and my Excel sheet is in there; it's called super_important.xlsx
The folders and documents you've just described are what the UNIX shell calls 'directories' and 'files' respectively, and the collection of all of them is your 'file system'. The series of pointing and clicking you gave your coworker also has a name: the 'path'. Your words led them down the path to your file.
Now imagine that instead of pointing and clicking, and searching for the right folder on your desktop, you could just dictate this path to your computer directly. It turns out you can! To give your computer a path, you replace all the clicking with slashes:
Desktop/Work_Stuff/Project_1/datasheets/super_important.xlsx
This particular path is called a 'relative path' because it begins in an arbitrary place: your co-worker had to start from your desktop to follow the instructions.
Instead, you could also use the 'absolute path':
/Users/Amanda/Desktop/Work_Stuff/Project_1/datasheets/super_important.xlsx
This path is 'absolute' because it always starts from the same place: your 'root directory'. Root is the top directory in your computer; every other file and directory is stored inside of it. The UNIX shell refers to root as simply '/'. It's the very first thing in the absolute path above.
So why would you want two ways of specifying a path?
The same reason you can put in different starting directions on Google Maps! If Google always gave you directions starting from the North Pole, no matter where you were, it would be giving the absolute path to your destination. This would be useful in some situations, but more often you want the relative path, i.e. the directions from where you are now.
Now that you understand conceptually what all these words mean, let's log into a UNIX shell and see how they can improve your workflow.
To start exploring, let's open a terminal and log in to the computer by typing our user ID and password. Most systems will print stars to obscure the password, or nothing at all, in case some evildoer is shoulder surfing behind us.
login: vlad
password: ********
$
Once we have logged in we'll see a prompt, which is how the shell tells us that it's waiting for input. This is usually just a dollar sign, but it may show extra information such as our user ID or the current time. First let's find out where we are right now by running a command called pwd (which stands for "print working directory"). At any moment, our current working directory is our current default directory: it's where we are right now.
vlad$ pwd
/Users/vlad
vlad$
Here, the computer's response is /Users/vlad, which is Vlad's home directory. This is where the operating system stores all the files and directories that belong to Vlad. To see what is in Vlad's home directory, we ask the computer to list all the entries with the command ls ("list"):
vlad$ ls
Applications Documents Movies Tacos
Desktop Downloads Music
vlad$
Notice that only six directories are visible, even though there may be hundreds of directories on Vlad's computer.
Working in the shell behaves in exactly the same way as navigating through your file system by pointing and clicking: you can only see the directories and files that are in your current directory. By extension, you can also only work with the directories and files in your current directory unless you specify their path. Let's cd, or "change directories", to the Movies directory:
vlad$ cd Movies/
When cd works correctly, it doesn't give you any output, but you can list the contents of your new directory with ls:
vlad$ ls
avengers.mp4 I_am_legend kids_movies
gravity.avi matrix_3.mov
vlad$
We can now see Vlad's movies. cd moved us into the Movies directory, and now we only see the files and directories inside of it.
One other thing to note from Vlad's movies is the "file extensions", the letters after the '.' on most files. File extensions are an easy way for you to guess what type of file you're looking at, but aren't actually necessary. It is considered a "best practice" to include file extensions on your files, but your computer will happily play the movie I_am_legend, even if we gave it the file extension .xlsx. We can see why file extensions are a best practice by looking at kids_movies. Is that a movie title that's missing an extension? Or a directory full of more movies? We can check this using a 'flag' for ls. A 'flag' is an extra modifier you type after a command to change how it behaves. You can find an exhaustive list of the flags for ls, or almost any other command, by looking in the manual with man ls, but for now, let's just use -F, which adds a / after any directory.
vlad$ ls -F
avengers.mp4 I_am_legend kids_movies/
gravity.avi matrix_3.mov
Now we can tell that while neither I_am_legend nor kids_movies has a file extension, they aren't the same kinds of files. kids_movies is clearly a directory.
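To try this without touching real files, the comparison can be reproduced in a scratch directory. This is a sketch only: the scratch directory created with mktemp stands in for Vlad's Movies directory, and the files are empty placeholders.

```shell
# Assumption: a throwaway directory stands in for Vlad's Movies directory.
movies=$(mktemp -d)
cd "$movies"

touch avengers.mp4 I_am_legend   # two regular files, one without an extension
mkdir kids_movies                # a directory, also without an extension

# -F appends a / to directory names and leaves regular files unmarked.
listing=$(ls -F)
echo "$listing"
```

When its output is captured rather than shown on a terminal, ls prints one entry per line, so the layout differs from the column listing above.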
We can also try this on other directories, as long as we specify the path. For instance, we can check out Vlad's home directory using an absolute path:
vlad$ ls -F /Users/vlad
Applications/ Documents/ Movies/ Tacos
Desktop/ Downloads/ Music/
Or a relative path:
vlad$ ls -F ..
Applications/ Documents/ Movies/ Tacos
Desktop/ Downloads/ Music/
How is .. a relative path? This is actually a special shortcut. The shell was originally created in the 70's for teletype, and every keystroke counted: the devices of the day were slow, and backspacing on a teletype was so painful that cutting the number of keystrokes in order to cut the number of typing mistakes was actually a win for usability. This is also why commands like ls aren't simply list: shorter was better! In this instance, .. is a special directory name meaning "the directory containing this one", or, more succinctly, the parent of the current directory.
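The difference between absolute paths, relative paths, and .. can be tried safely in a scratch directory. In this sketch, the directory created by mktemp is an assumption that plays the role of /Users/vlad; the variable names are illustrative.

```shell
# Assumption: a throwaway directory plays the role of /Users/vlad.
home_dir=$(mktemp -d)
mkdir "$home_dir/Movies"

cd "$home_dir/Movies"   # absolute path: starts from /
pwd

cd ..                   # relative path: the parent of the current directory
parent=$(pwd)
echo "$parent"

cd Movies               # relative path: a child of the current directory
child=$(pwd)
echo "$child"
```

The last two cd commands only make sense relative to where we are at the time, whereas the first works from anywhere.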
Any given directory probably has several directories inside of it, so it would be difficult to give each one a shortcut and remember them all, but the creators of the shell came up with a clever solution: tab completion. Let's cd back to Vlad's home directory to see how it works:
vlad$ cd ..
vlad$ ls -F
Applications/ Documents/ Movies/ Tacos
Desktop/ Downloads/ Music/
Let's go find Vlad's copy of our super_important.xlsx from earlier. First we type cd, to tell the shell we want to move through the file system, and then start typing the path to our file:
vlad$ cd D
If we type tab twice instead of typing the rest of the word, the shell will try to guess where we want to go. If it can't decide, the shell will present us with all the available options. Since Vlad's home directory has more than one file or directory that starts with "D", the shell lists all the files and directories that start with "D":
vlad$ cd D
Desktop/ Documents/ Downloads/
We want to go to Vlad's desktop, so we type an "e" and then type tab twice more:
vlad$ cd De
vlad$ cd Desktop/
The shell has guessed that "De" means we want "Desktop" and filled in the rest. To see what is available now, we can press tab again, or we can add a "W" because we know the file we're looking for is in the Work_Stuff directory. In this way we can wind our way down the path with a minimum of typing. But this isn't just a way to move our fingers less! Tab completion greatly reduces errors and the most common frustrations of learning the shell. Let's say that in our last example, we just pushed Enter instead of continuing down the path to our Excel file, and now we're just in the Desktop directory:
vlad$ pwd
/Users/vlad/Desktop
vlad$ ls -F
grant_Proposal/ phone_backup Project_2/
home pictures/ Project_1/
What if we want to go into the directory with Vlad's grant proposal?
vlad$ cd grant_proposal/
However, this won't get us anywhere! This is because the shell is case sensitive, so it thinks grant_Proposal/ and grant_proposal/ are completely different things. If we use tab completion, this is easy to avoid, because when we type "g" and press tab, the shell fills in the capital letters appropriately.
What if we want to go into the directory with Vlad's home pictures? Presumably, we would type:
vlad$ cd home pictures/
But this doesn't work either! Instead we get an error:
-bash: cd: home: No such file or directory
Why? Because the shell treats spaces as special characters: they separate a command from its arguments, and the arguments from each other. It thinks you want to cd to a directory called "home", with "pictures/" as a separate, extra argument, and since there is no directory called "home" here, it reports an error. This illustrates two important points: first, use tab completion as much as possible, and second, try to avoid spaces in file and directory names. To correctly change to the home pictures directory, we need to tell the shell that this particular space is not a special character, but is instead part of the file name. We do this by "escaping" the space, which looks like this:
vlad$ cd home\ pictures/
Putting a \ in front of any character tells the shell to interpret that character as literal text. Now the command will cd correctly. But remember, we could have avoided this whole issue with tab completion!
There are many, many other commands that are used to add, move, modify and delete files and directories, and we will cover several of them in later sessions. But one of the most important skills is creating new files. First, let's make a new directory to hold our file and move into it:
vlad$ cd /Users/vlad
vlad$ mkdir my_new_directory
vlad$ ls -F
Applications/ Documents/ Movies/ my_new_directory/
Desktop/ Downloads/ Music/ Tacos
vlad$ cd my_new_directory
The command mkdir is short for "make directory", and it creates a new directory with whatever name you type after the command. We've made "my_new_directory" with no spaces, so it's easier to type later. After we made the directory, we asked for a list (ls) of the items in the current directory to make sure our new one was created, and finally we used cd to move into our new directory so we can make files inside of it. So let's make a file:
vlad$ nano my_first_file.txt
Nano is a text editor. To use it, you type nano and then the name of the file you'd like to open. If the filename you specify doesn't exist in your current working directory, Nano opens an empty file with that name and creates it on disk when you save. We are creating a new text file called my_first_file.txt. We have carefully typed a .txt after it so our future selves remember what kind of file it is, but Nano doesn't care. Let's type something in it:
echo "hello world"
To save your work in Nano, you type ctrl and x, then y (for "yes") and enter. Nano also helpfully keeps all of these instructions at the bottom of the screen for your reference.
It may not look like it, but you've just created your first program! Like ls, echo is a shell command. echo prints any following text to the terminal. Normally you would use echo as part of a program to give the user an update, like how much processing time is left, but here we're just printing a greeting. To run your program:
vlad$ sh my_first_file.txt
hello world
sh is yet another command, which refers to the shell program. Using sh followed by a filename executes the commands in the named file line by line, just as if we'd typed them in ourselves. The shell opened our text file, ran the echo command, and printed its output to our screen, just as if we'd typed echo "hello world" at the prompt.
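Because Nano is interactive, the same steps are sketched here with printf writing the file instead. The scratch directory and the variable names are assumptions for illustration; the file name and its one-line contents are the ones used above.

```shell
# Assumption: a throwaway directory stands in for my_new_directory.
workdir=$(mktemp -d)
cd "$workdir"

# Write the same single line we typed into Nano.
printf '%s\n' 'echo "hello world"' > my_first_file.txt

# Run the file with sh and keep whatever it prints.
output=$(sh my_first_file.txt)
echo "$output"
```

The single quotes around the printf argument keep the inner double quotes intact, so the file contains exactly the line shown earlier.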
Everything we have seen so far works on Unix and its descendants, such as Linux and Mac OS X. Things are a bit different on Windows. A typical directory path on a Windows machine might be C:\Users\vlad. The first part, C:, is a drive letter that identifies which disk we're talking about. This notation dates back to the days of floppy drives; today, different "drives" are usually different filesystems on the network.
Instead of a forward slash, Windows uses a backslash to separate the names in a path. This causes headaches because Unix uses backslash to escape special characters. For example, if we want to put a space in a filename on Unix, we would write the filename as my\ results.txt. Please don't ever do this, though: if you put spaces, question marks, and other special characters in filenames on Unix, you can confuse the shell as we saw earlier.
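To see concretely how an unescaped space confuses the shell, the following sketch creates files in a scratch directory. The scratch directory and the name "my other results.txt" are assumptions for illustration; quoting, shown in the last command, is a standard alternative to the backslash that the text above does not cover.

```shell
# Assumption: a throwaway directory, so the stray files are easy to discard.
scratch=$(mktemp -d)
cd "$scratch"

touch my results.txt           # unescaped space: two files, "my" and "results.txt"
touch my\ results.txt          # escaped space: one file, "my results.txt"
touch "my other results.txt"   # quoting also keeps the spaces in the name

ls
```

The first command silently does something other than what was intended, which is why names without spaces are the safer habit.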
Finally, Windows filenames and directory names are case insensitive: upper and lower case letters mean the same thing. This means that the path name C:\Users\vlad could be spelled c:\users\VLAD, C:\Users\Vlad, and so on. Some people argue that this is more natural: after all, "VLAD" in all upper case and "Vlad" spelled normally refer to the same person.
However, it causes headaches for programmers, and can be difficult for people to understand if their first language doesn't use a cased alphabet like the one in the example above.
Cygwin tries to make Windows paths look more like Unix paths by allowing us to use a forward slash instead of a backslash as a separator. It also allows us to refer to the C drive as /cygdrive/c/ instead of as C:. (The latter usually works too, but not always.) Paths are still case insensitive, though, which means that if you try to put files called backup.txt (in all lower case) and Backup.txt (with a capital 'B') into the same directory, the second will overwrite the first.
Cygwin does something else that frequently causes confusion. By default, it interprets a path like /home/vlad to mean C:\cygwin\home\vlad, i.e., it acts as if C:\cygwin was the root of the filesystem. This is sometimes helpful, but if you are using an editor like Notepad, and want to save a file in what Cygwin thinks of as your home directory, you need to keep this translation in mind.
Knowing just this much about files and directories, Nelle is ready to organize the files that the protein assay machine will create. First, she creates a directory called north-pacific-gyre (to remind herself where the data came from). Inside that, she creates a directory called 2012-07-03, which is the date she started processing the samples. She used to use names like conference-paper and revised-results, but she found them hard to understand after a couple of years. (The final straw was when she found herself creating a directory called revised-revised-results-3.)
Each of her physical samples is labeled according to her lab's convention with a unique ten-character ID, such as "NENE01729A". This is what she used in her collection log to record the location, time, depth, and other characteristics of the sample, so she decides to use it as part of each data file's name. Since the assay machine's output is plain text, she will call her files NENE01729A.txt, NENE01812A.txt, and so on. All 1520 files will go into the same directory.
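Nelle's layout can be sketched with the commands introduced so far. Two things here are assumptions rather than part of the lesson: the scratch directory that stands in for her home directory, and the -p flag, which asks mkdir to create any missing parent directories in one step. The data files are empty placeholders made with touch.

```shell
# Assumption: a throwaway directory stands in for Nelle's home directory.
nelle_home=$(mktemp -d)
cd "$nelle_home"

# -p creates north-pacific-gyre and 2012-07-03 together.
mkdir -p north-pacific-gyre/2012-07-03

# Empty placeholders named after the two sample IDs mentioned above.
touch north-pacific-gyre/2012-07-03/NENE01729A.txt
touch north-pacific-gyre/2012-07-03/NENE01812A.txt

ls north-pacific-gyre/2012-07-03/
```

Without -p, the same result takes two mkdir commands, one for each level.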
If she is in her home directory, Nelle can see what files she has using the command:
ls north-pacific-gyre/2012-07-03/
Since this is a lot to type, she can take advantage of Bash's command completion. If she types:
ls no
and then presses tab, Bash will automatically complete the directory name for her:
ls north-pacific-gyre/
If she presses tab again, Bash will add 2012-07-03/ to the command, since it's the only possible completion. Pressing tab again does nothing, since there are 1520 possibilities; pressing tab twice brings up a list of all the files, and so on. As above, this is called tab completion, and we will see it in many other tools as we go on.
/ on its own is the root directory of the whole filesystem.
Most file names have the form something.extension; the extension isn't required, and doesn't guarantee anything, but is normally used to indicate the type of data in the file.
cd path changes the current working directory.
ls path prints a listing of a specific file or directory; ls on its own lists the current working directory.
pwd prints the user's current working directory (current default location in the filesystem).