The Unix Shell: Files and Directories

Students should be able to:

  • Describe the parts of a file system.
  • Explain the difference between a relative and absolute path.
  • Construct absolute and relative paths that identify specific files and directories.
  • Identify the actual command, flags, and filenames in a command-line call.
  • Demonstrate the use of tab completion, and explain its advantages.

Duration: 15 minutes (longer if people have trouble getting an editor to work).

Motivation

Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples in all, and now needs to:

  1. Run each sample through an assay machine that will measure the relative abundance of 300 different proteins. The machine's output for a single sample is a file with one line for each protein.
  2. Calculate statistics for each of the proteins separately using a program her supervisor wrote called goostat.
  3. Compare the statistics for each protein with corresponding statistics for each other protein using a program one of the other graduate students wrote called goodiff.
  4. Write up. Her supervisor would really like her to do this by the end of the month so that her paper can appear in an upcoming special issue of Aquatic Goo Letters.

It takes about half an hour for the assay machine to process each sample. The good news is, it only takes two minutes to set each one up. Since her lab has eight assay machines that she can use in parallel, this step will "only" take about two weeks.

The bad news is, if she has to run goostat and goodiff by hand, she'll have to enter filenames and click "OK" roughly 300^2 times (300 runs of goostat, plus 300×299 runs of goodiff). At 30 seconds each, that will take 750 hours, or more than 18 weeks of full-time work. Not only would she miss her paper deadline, the chances of her getting all 90,000 commands right are approximately zero.
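For readers who already have a shell open, the arithmetic behind these figures can be checked with shell integer arithmetic. This is only a sketch: the variable names are made up for illustration, and the 40-hour working week used to convert hours into weeks is an assumption, not something stated in the text.

```shell
# 300 goostat runs plus 300 x 299 goodiff runs.
runs=$(( 300 + 300 * 299 ))

# 30 seconds per run, converted to whole hours.
hours=$(( runs * 30 / 3600 ))

# Assumption: a 40-hour working week (integer division rounds down).
weeks=$(( hours / 40 ))

echo "$runs runs, $hours hours, about $weeks working weeks"
```

This should print 90000 runs, 750 hours, and about 18 working weeks.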

This chapter is about what she should do instead. More specifically, it's about how she can use a command shell to automate the repetitive steps in her processing pipeline, so that her computer can work 24 hours a day while she writes her paper. As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.

Lesson

The file system is the part of your operating system responsible for managing files and directories. This may sound obscure, but the UNIX shell just uses different names for things you already know:

Imagine you are away from your main computer, and suddenly realize you need an important Excel spreadsheet. So you call back to your lab and ask someone to email you the document. What do you tell them? It's probably something like:

Double click on the Work_Stuff folder on my desktop, then choose the Project_1 folder, then click the datasheets folder, and my Excel sheet is in there. It's called super_important.xlsx.

The folders and documents you've just described are what the UNIX shell calls 'directories' and 'files' respectively, and the collection of all of them is your 'file system'. The series of pointing and clicking you gave your coworker also has a name, the 'path': your words led them down the path to your file.

Now imagine that instead of pointing and clicking, and searching for the right folder on your desktop, you could just dictate this path to your computer directly. It turns out you can! To give your computer a path, you replace all the clicking with slashes:

Desktop/Work_Stuff/Project_1/datasheets/super_important.xlsx

This particular path is called a 'relative path' because it only makes sense from a particular starting point: your co-worker had to start from the directory that contains your Desktop to follow the instructions.

Instead, you could also use the 'absolute path':

/Users/Amanda/Desktop/Work_Stuff/Project_1/datasheets/super_important.xlsx

This path is 'absolute' because it always starts from the same place: your 'root directory'. Root is the top directory on your computer; every other file and directory is stored somewhere inside of it. The UNIX shell refers to root simply as '/'. It's the very first character in the absolute path above.

So why would you want two ways of specifying a path?

For the same reason you can enter different starting points on Google Maps! If Google always gave you directions starting from the North Pole, no matter where you were, it would be giving the absolute path to your destination. This would be useful in some situations, but more often you want the relative path, i.e. the directions from where you are now.

Now that you understand conceptually what all these words mean, let's log into a UNIX shell and see how they can improve your workflow.

To start exploring, let's open a terminal and log in to the computer by typing our user ID and password. Most systems will print stars to obscure the password, or nothing at all, in case some evildoer is shoulder surfing behind us.

login: vlad
password: ********
$

Once we have logged in we'll see a prompt, which is how the shell tells us that it's waiting for input. This is usually just a dollar sign, but it may show extra information such as our user ID or the current time. First let's find out where we are right now by running a command called pwd (which stands for "print working directory"). At any moment, our current working directory is our current default directory: it's where we are right now.

vlad$ pwd

/Users/vlad

vlad$

Here, the computer's response is /Users/vlad, which is Vlad's home directory. This is where the operating system stores all the files and directories that belong to Vlad. To see what is in Vlad's home directory, we ask the computer to list all the entries with the command ls ("list"):

vlad$ ls

Applications   Documents   Movies   Tacos
Desktop        Downloads   Music

vlad$

Notice that only seven entries are visible, even though there may be hundreds of files and directories on Vlad's computer. Working in the shell behaves in exactly the same way as navigating through your file system by pointing and clicking: you can only see the directories and files that are in your current directory. By extension, you can also only work with the directories and files in your current directory unless you specify their path. Let's cd, or "change directory", into the Movies directory:

vlad$ cd Movies/

When cd works correctly, it doesn't give you any output, but you can list the contents of your new directory with ls:

vlad$ ls

avengers.mp4    I_am_legend    kids_movies
gravity.avi     matrix_3.mov

vlad$

We can now see Vlad's movies. cd moved us into the Movies directory, and now we only see the files and directories inside of it.

One other thing to note from Vlad's movies is the "file extensions", the letters after the '.' on most files. File extensions are an easy way for you to guess what type of file you're looking at, but they aren't actually necessary. It is considered a "best practice" to include file extensions on your files, but your computer will happily play the movie I_am_legend without one, and would do so even if we gave it the extension .xlsx.

We can see why file extensions are a best practice by looking at kids_movies. Is that a movie title that's missing an extension? Or a directory full of more movies? We can check this using a 'flag' for ls. A flag is an extra modifier typed after the name of a command to change what the command does. You can find an exhaustive list of the flags for ls, or for almost any other command, by looking in the manual (man ls), but for now, let's just use -F, which adds a / after the name of any directory.

vlad$ ls -F

avengers.mp4    I_am_legend    kids_movies/
gravity.avi     matrix_3.mov

Now we can tell that while neither I_am_legend nor kids_movies has a file extension, they aren't the same kinds of files. kids_movies is clearly a directory.
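The effect of -F can be tried without touching any real files. The sketch below is a hedged illustration rather than part of Vlad's session: it assumes mktemp -d is available to create a throwaway directory, and it uses mkdir and touch (which creates an empty file) to build stand-ins for two of Vlad's movies. The name scratch is made up for this example.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical stand-ins for the contents of Vlad's Movies directory.
mkdir kids_movies
touch I_am_legend gravity.avi

# -F appends / to the directory; the two plain files are left unmarked.
ls -F
```

In the listing, only kids_movies should end with a slash, which is how we can tell it apart from the extension-less file I_am_legend.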

We can also try this on other directories, as long as we specify the path. For instance, we can check out Vlad's home directory using an absolute path:

vlad$ ls -F /Users/vlad

Applications/   Documents/   Movies/   Tacos
Desktop/        Downloads/   Music/

Or a relative path:

vlad$ ls -F ..

Applications/   Documents/   Movies/   Tacos
Desktop/        Downloads/   Music/

How is .. a relative path? It is actually a special shortcut. The shell was originally created in the 1970s for teletypes, when every keystroke counted: the devices of the day were slow, and backspacing on a teletype was so painful that cutting the number of keystrokes in order to cut the number of typing mistakes was a real win for usability. This is also why the command is ls rather than list: shorter was better! In this instance, .. is a special directory name meaning "the directory containing this one", or, more succinctly, the parent of the current directory.
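The difference between relative and absolute paths, and the meaning of .. and its sibling . (the current directory), can be checked in a scratch directory. This is a minimal sketch under stated assumptions: mktemp -d is available, the directory names only mimic Vlad's layout, and the variables start, parent, and same are invented here to remember what pwd printed at each step.

```shell
# Build a small, hypothetical copy of Vlad's layout in a throwaway directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/vlad/Movies"
cd "$scratch/vlad/Movies"
start=$(pwd)

cd ..            # relative path: up to the parent directory, vlad
parent=$(pwd)

cd Movies        # relative path: back down from vlad into Movies
cd .             # . is the current directory, so nothing changes
same=$(pwd)

cd "$parent"     # absolute path: means the same thing from anywhere
pwd
```

After these commands, same holds the same path as start, and the final pwd prints the path stored in parent, which is the starting path with the trailing /Movies removed.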

Any given directory probably has several directories inside of it, so it would be difficult to give each one a shortcut and remember them all. The creators of the shell came up with a clever solution: tab completion. Let's cd back to Vlad's home directory to see how it works:

vlad$ cd ..
vlad$ ls -F

Applications/   Documents/   Movies/   Tacos
Desktop/        Downloads/   Music/

Let's go find Vlad's copy of our super_important.xlsx from earlier. First we type cd, to tell the shell we want to move through the file system, and then we start typing the path to our file:

vlad$ cd D

If we press tab twice instead of typing the rest of the word, the shell will try to guess where we want to go. If it can't decide, the shell will present us with all the available options. Since Vlad's home directory has more than one file or directory that starts with "D", the shell lists all of them:

vlad$ cd D
Desktop/   Documents/   Downloads/

We want to go to Vlad's desktop, so we type an "e" and then type tab twice more:

vlad$ cd De

vlad$ cd Desktop/

The shell has guessed that "De" means we want "Desktop" and filled in the rest. To see what is available now, we can press tab again, or we can add a "W" because we know the file we're looking for is in the Work_Stuff directory. In this way we can wind our way down the path with a minimum of typing.

But this isn't just a way to move our fingers less! Tab completion greatly reduces errors and the most common frustrations of learning the shell. Let's say that in our last example, we just pressed Enter instead of continuing down the path to our Excel file, and now we're in the Desktop directory:

vlad$ pwd
/Users/vlad/Desktop

vlad$ ls -F

grant_Proposal/   phone_backup   Project_2/
home pictures/    Project_1/

What if we want to go into the directory with Vlad's grant proposal?

vlad$ cd grant_proposal/

However, this won't get us anywhere! This is because the shell is case sensitive, so it thinks grant_Proposal/ and grant_proposal/ are completely different things. Tab completion makes this easy to avoid: when we type "g" and press tab, the shell fills in the rest of the name with the capital letters in the right places.

What if we want to go into the directory with Vlad's home pictures? Presumably, we would type:

vlad$ cd home pictures/

But this doesn't work either! Instead we get an error!

-bash: cd: home: No such file or directory

Why? Because the shell treats spaces as special characters: they separate a command from its arguments, and the arguments from one another. It thinks we want to cd to a directory called "home" and treats "pictures/" as a separate argument. Since there is no directory called home here, we get an error. This illustrates two important points: first, use tab completion as much as possible, and second, try to avoid spaces in file and directory names. To correctly change to the home pictures directory, we need to tell the shell that this particular space is not a special character, but is instead part of the directory name. We do this by "escaping" the space, which looks like this:

vlad$ cd home\ pictures/

Putting a \ in front of any character tells the shell to interpret that character as literal text. Now, the command will cd correctly. But remember, we could have avoided this whole issue with tab completion!

There are many, many other commands that are used to add, move, modify and delete files and directories, and we will cover several of them in later sessions. But one of the most important skills is creating new files. First, let's make a new directory to hold our file and move into it:

vlad$ cd /Users/vlad
vlad$ mkdir my_new_directory

vlad$ ls -F

Applications/   Documents/   Movies/   my_new_directory/
Desktop/        Downloads/   Music/    Tacos

vlad$ cd my_new_directory

The command mkdir is short for "make directory", and it creates a new directory with whatever name you type after the command. We've called ours "my_new_directory", with no spaces, so it's easier to type later. After we made the directory, we asked for a list (ls) of the items in the current directory to make sure our new one was created, and finally we used cd to move into our new directory so we can make files inside of it. So let's make a file:

vlad$ nano my_first_file.txt

Nano is a text editor. To use it, you type nano and then the name of the file you'd like to open. If the filename you specify doesn't exist in your current working directory, Nano creates it for you and opens the empty file. We've just created a new text file called my_first_file.txt. We have carefully typed a .txt after it so our future selves remember what kind of file it is, but Nano doesn't care. Let's type something in it:

echo "hello world"

To save your work in Nano, you type ctrl and x, then y (for "yes") and enter. Nano also helpfully keeps all of these instructions at the bottom of the screen for your reference.

It may not look like it, but you've just created your first program! Like ls, echo is a shell command. echo prints any following text to the terminal. Normally you would use echo as part of a program to give the user an update, such as how much processing time is left, but here we're just printing a greeting. To run your program:

vlad$ sh my_first_file.txt

hello world

sh is yet another command: it runs the shell program itself. Using sh followed by a filename executes the commands in the named file line by line, just as if we'd typed them in ourselves. The shell opened our text file, ran the echo command, and printed its output to our screen, just as if we'd typed echo "hello world" at the prompt.
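The same experiment can be repeated without opening an editor. The sketch below is an illustration under assumptions, not the method used above: it relies on mktemp -d for a scratch directory, on printf to produce the program text, and on output redirection (>), which this lesson has not introduced, to write that text into the file. The variable output is made up to capture what the program prints.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Write the one-line program; > sends printf's output into the file.
printf '%s\n' 'echo "hello world"' > my_first_file.txt

# Run the file with sh, exactly as in the lesson, and keep what it prints.
output=$(sh my_first_file.txt)
echo "$output"
```

This should print hello world, which shows that sh only cares about the commands inside the file, not about the .txt extension.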

Windows

Everything we have seen so far works on Unix and its descendants, such as Linux and Mac OS X. Things are a bit different on Windows. A typical directory path on a Windows machine might be C:\Users\vlad. The first part, C:, is a drive letter that identifies which disk we're talking about. This notation dates back to the days of floppy drives; today, different "drives" are usually different filesystems on the network.

Instead of a forward slash, Windows uses a backslash to separate the names in a path. This causes headaches because Unix uses backslash for input of special characters. For example, if we want to put a space in a filename on Unix, we would write the filename as my\ results.txt. Please don't ever do this, though: if you put spaces, question marks, and other special characters in filenames on Unix, you can confuse the shell as we saw earlier.
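To see how the Unix shell treats a space with and without a backslash, we can count the arguments it produces. This is a hedged sketch: it assumes mktemp -d is available, and it uses set --, which replaces the shell's own argument list, together with $#, which reports how many arguments there are. Neither is covered in this lesson, and the variables unescaped and escaped are invented for the example.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir "home pictures"      # quotes keep the space inside a single name

set -- home pictures/      # the bare space splits this into two arguments
unescaped=$#

set -- home\ pictures/     # the escaped space stays inside one argument
escaped=$#

echo "$unescaped $escaped"

cd home\ pictures/         # equivalent to: cd "home pictures/"
pwd
```

The echo line should print 2 1, and the final pwd should end in home pictures. Quoting the whole name, as in the mkdir line, is an alternative to escaping each space.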

Finally, Windows filenames and directory names are case insensitive: upper and lower case letters mean the same thing. This means that the path name C:\Users\vlad could be spelled c:\users\VLAD, C:\Users\Vlad, and so on. Some people argue that this is more natural: after all, "VLAD" in all upper case and "Vlad" spelled normally refer to the same person. However, it causes headaches for programmers, and can be difficult for people to understand if their first language doesn't use a cased alphabet.

For Cygwin Users

Cygwin tries to make Windows paths look more like Unix paths by allowing us to use a forward slash instead of a backslash as a separator. It also allows us to refer to the C drive as /cygdrive/c/ instead of as C:. (The latter usually works too, but not always.) Paths are still case insensitive, though, which means that if you try to put files called backup.txt (in all lower case) and Backup.txt (with a capital 'B') into the same directory, the second will overwrite the first.

Cygwin does something else that frequently causes confusion. By default, it interprets a path like /home/vlad to mean C:\cygwin\home\vlad, i.e., it acts as if C:\cygwin was the root of the filesystem. This is sometimes helpful, but if you are using an editor like Notepad, and want to save a file in what Cygwin thinks of as your home directory, you need to keep this translation in mind.

Nelle's Pipeline: Organizing Files

Knowing just this much about files and directories, Nelle is ready to organize the files that the protein assay machine will create. First, she creates a directory called north-pacific-gyre (to remind herself where the data came from). Inside that, she creates a directory called 2012-07-03, which is the date she started processing the samples. She used to use names like conference-paper and revised-results, but she found them hard to understand after a couple of years. (The final straw was when she found herself creating a directory called revised-revised-results-3.)

Each of her physical samples is labeled according to her lab's convention with a unique ten-character ID, such as "NENE01729A". This is what she used in her collection log to record the location, time, depth, and other characteristics of the sample, so she decides to use it as part of each data file's name. Since the assay machine's output is plain text, she will call her files NENE01729A.txt, NENE01812A.txt, and so on. All 1520 files will go into the same directory.
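A sketch of this layout can be built in a scratch directory. It makes a few assumptions that go beyond the text: mktemp -d is available, mkdir's -p flag is used to create the parent and dated directories in one step, and touch creates empty placeholder files standing in for real assay output.

```shell
# Work in a throwaway directory rather than Nelle's real home directory.
scratch=$(mktemp -d)
cd "$scratch"

# -p creates north-pacific-gyre as well as the dated directory inside it.
mkdir -p north-pacific-gyre/2012-07-03

# Empty placeholders named after the two sample IDs mentioned above.
touch north-pacific-gyre/2012-07-03/NENE01729A.txt
touch north-pacific-gyre/2012-07-03/NENE01812A.txt

# List the data directory using a relative path.
ls north-pacific-gyre/2012-07-03/
```

The listing should show the two placeholder files. With real data, the same directory would hold all 1520 files.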

If she is in her home directory, Nelle can see what files she has using the command:

ls north-pacific-gyre/2012-07-03/

Since this is a lot to type, she can take advantage of Bash's command completion. If she types:

ls no

and then presses tab, Bash will automatically complete the directory name for her:

ls north-pacific-gyre/

If she presses tab again, Bash will add 2012-07-03/ to the command, since it's the only possible completion. Pressing tab again does nothing, since there are 1520 possibilities; pressing tab twice brings up a list of all the files, and so on. As above, this is called tab completion, and we will see it in many other tools as we go on.

Key Points

  • The file system is responsible for managing information on disk.
  • Information is stored in files, which are stored in directories (folders).
  • Directories can also store other directories, which forms a directory tree.
  • / on its own is the root directory of the whole filesystem.
  • A relative path specifies a location starting from the current location.
  • An absolute path specifies a location from the root of the filesystem.
  • Directory names in a path are separated with '/' on Unix, but '\' on Windows.
  • '..' means "the directory above the current one"; '.' on its own means "the current directory".
  • Most files' names are something.extension; the extension isn't required, and doesn't guarantee anything, but is normally used to indicate the type of data in the file.
  • cd path changes the current working directory.
  • ls path prints a listing of a specific file or directory; ls on its own lists the current working directory.
  • pwd prints the user's current working directory (current default location in the filesystem).
  • Most commands take modifiers (flags) which begin with a '-'.