Introduction to Unix I - Prelab

Table of Contents

  1. Why bother?
  2. Getting started and moving around
  3. Getting information, useful tips
  4. Reading files
  5. Creating and removing files and directories
  6. Editing files
  7. Moving things around
  8. More advanced processing
  9. Other useful commands
  10. Questions

1. Why bother?

If you haven't heard of Unix before, you may be wondering what it is and why we are starting off by teaching it. Unix is a computing environment, similar to the Windows or Apple operating systems, but instead of using a mouse to open and interact with specific programs (such as your internet browser or Powerpoint), you interact with a text-based terminal. If you're used to always using a mouse to interact with your computer, this might sound daunting, and you may now be wondering why we're making you do this. Windows and Mac are just fine, you say! They are indeed fine for most computing tasks, but the vast majority of open source software (software whose underlying code is made readily available, which describes almost all bioinformatics tools) is written on Unix systems, and therefore runs best, or often only runs, on Unix systems. Not only that, but Unix systems Excel (pun intended) at handling large datasets, which we are all presumably interested in analyzing. Instead of having to look at data in your spreadsheet program of choice, learning Unix will give you the power not only to look at but also to manipulate large data sets. And, contrary to what you may think, using a command line for computing is not just for sweaty nerds who dream in green text on black backgrounds; hopefully we will show you that it is actually quite easy to get the hang of!

2. Getting started and moving around

But that's enough babbling. By far the best way to get comfortable in computing on a Unix system is by getting your hands "dirty", messing around on the command line, and trying things, so let's try running some commands!

First, open up a terminal: in CoCalc, click the arrow to the right of the "New" label and select Terminal (.term) from the drop-down menu, or if you are already in this notebook, just hit the new button and then select Terminal for the type. Once you have this terminal open, you can switch between that terminal and this notebook by clicking on the tabs directly above the notebook (next to where it says Files, New, Log, etc.).

Once you have the terminal open, you can start entering commands. For this class, we will have you copy and paste each command after the \$ symbol, which I use to denote the beginning of the command prompt. In the Jupyter terminal, the beginning of the command prompt shows where you are in the system, e.g. "~/01_UNIX-I/prelab\$", so the \$ symbol in this notebook corresponds to the \$ symbol in the Jupyter terminal, and that is where you should paste in the commands we give you. After pasting a command, you run it by hitting the "enter" key.

First, let's see how to tell what files are in your directory:

$ ls

The 'ls' command lists the files in a directory. This command will be, more or less, your best friend; without it, you have no idea what you're looking at! An easy way to remember it is the word LiSt. See how there is a directory named move_here? We should probably move there, but how do we do this? Run this command:

$ cd move_here

The 'cd' command lets you move around, and will also be part of your posse of buddies. You can remember this as Change Directory. An important fact about Unix systems is that their files are organized in a tree-like structure:

In this example, I'm just showing a few of the directories that are in a Unix file system. Everything comes from the root directory, denoted as "/". The directory you will typically start in is your home folder; in my case, this is under /home/alexaml/ (not shown on this diagram). A very important shortcut is the tilde, "~", which denotes your home folder. In other words, for me, typing the command 'cd ~' is the same as typing 'cd /home/alexaml/'. To get to your home directory, you can run either of these commands (for now, don't run these):

\$ cd ~

\$ cd

The "~" shortcut is useful for writing file paths: in Unix, you describe the path to a file or directory by writing the directories that make up that path, each split by a "/". For example, if you want to list the files in a specific directory, such as the "move_here" directory that we just entered, you can use the following command:

$ ls ~/02_UNIX-I_prelab/move_here/

Another important trick for file paths in Unix is the concept of "parent directories". If you consider the tree structure shown above, you can think of the "parent directory" of a given directory as the directory that is directly above it in the tree structure. For example, the parent directory of /home/alexaml/ is /home/. In any directory, the parent directory can be accessed as "../", so for example, if we are in the move_here directory and want to see what was in the place we just moved from, we can run:

$ ls ../

Similarly, the directory we are currently in can be accessed as "./". This may not seem very useful at first, but actually comes up quite often; for example, we will see later in this prelab how to move files into the directory you are in.
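For a quick demonstration, these two commands list exactly the same thing:

\$ ls

\$ ls ./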

The "cd" command, in conjunction with knowledge of paths, is very powerful; you can move anywhere you want to from anywhere else. For example, if you want to move three directories up from where you are, you could string together references to parent directories by running (don't run this now):

$ cd ../../../

After you've moved somewhere, or if you just can't remember where you are after moving around too much, you can use this command:

$ pwd

This one corresponds to print working directory, and is very useful if you forget where you are!

Finally, if you accidentally moved somewhere you didn't want to, you can always jump back to the most recent directory you were in by using (don't run this now unless you accidentally ran one of the other cd commands from before):

$ cd -

3. Getting information, useful tips

We've only started learning Unix commands and we have already seen several with pretty arcane names. You can't be expected to remember exactly what each one does, and luckily for you (and for us, and for pretty much anybody who has ever used Unix), there is a very convenient way to get information about a command. Try running this:

$ man ls

The man command gives you access to the "man pages", or "manual pages", giving you information about what the command does. To move through these documents, you can press the spacebar ([space]) to jump down a page, the up- and down-arrows to move line-by-line, and if you type the letter 'q', you will get back to the terminal for more input.

You will probably also notice that there is a ton of other information on the man page for ls, and this is a good time to learn about flags. First, just try running this and see what you get:

$ ls -l

See how the output is now different? Instead of just a simple list of the files, we now get more information about each one, such as when it was last modified, who owns it, and what kind of permissions it has (don't worry about this for now). If you go back to the man page for ls, you can look through it to find the flag we just used. A useful hint when you're doing this is to hit the "/" key, type in the flag you're looking for, and then cycle through the matches using the "n" key. You can then hit "q" to get out of the man page entirely.

Anyway, so when you look for the "-l" flag, you will see that the description says it lists the results in long format, which makes sense with what we've seen! Now try running this command:

$ ls -lh

Notice the difference between this and "ls -l"? When we look up the "-h" flag, we see that it uses unit suffixes to reduce the number of digits used to describe the size of each file; in other words, it makes the sizes human readable. This also illustrates the fact that you can (usually) string together different flags after a single "-" character; see for yourself how the command "ls -lh" is exactly the same as running this:

$ ls -l -h

There are a few other basic things that are very useful to know. One is that sooner or later, you will inevitably have something not work the way you expect. For example, consider the following commands and outputs:

\$ ls moev_here
ls: moev_here: No such file or directory

\$ ls -y
ls: illegal option -- y
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...] 

\$ ls made_up_folder/
ls: made_up_folder/: No such file or directory

When you get an error like this, it is useful to cycle through the following possibilities:

  1. Did I make a typo? (e.g. the first example)
  2. Are the arguments I provided correct? Did I use the right flags? (e.g. the second example)
  3. Am I using the right paths? Do the files and folders actually exist? (e.g. the third example)
  4. Did I spell the command correctly? Unix is case sensitive, so you have to pay attention to exactly how you're writing commands.

Another useful trick is to use the [Tab] key as much as possible. In Unix, when you are typing a command and hit this key, it will try to guess what comes next. This applies both to command names (some programs have very long names) and to paths. To see this, type:

$ ls very_

and then hit the tab key. It should complete the rest of the filename, very_long_welcome_message.txt, for you!

Another useful trick for being lazy (or efficient, depending on how you look at it) is to use the up- and down-arrow keys in a terminal. See how these cycle through the commands that we have run? This is very useful for avoiding having to retype commands over and over again.

Finally, if you run a command that isn't what you wanted to do, or something is taking too long, the combination of the [Control] key (often denoted as a '^') and the [c] key sends a message to the terminal to terminate the currently running process.

4. Reading files

So using the 'ls' command, we can see that there are some welcome messages in the move_here folder! But how do we look at them? There are many ways to read the information in a file. The simplest command just prints out the entire file. To see how it behaves, try running it with no file name argument: enter text and press enter when you're done with a line, and when you want to stop entering text, type '^d' (control+d) to signal that you are done:

$ cat

Now that you've seen that this command will just spit back whatever you type into it, you might guess that if you give it a file, it will just print out that whole file. Try this out:

$ cat welcome_message.txt

The 'cat' command, while easy to remember because it's the name of fuzzy adorable animals, can also be remembered by thinking of concatenate. It is called this because it spits out entire files and can be used to concatenate more than one file together, like so:

$ cat welcome_part1.txt welcome_part2.txt

However, what happens if the file is too long to easily read? Try looking at the other welcome message:

$ cat very_long_welcome_message.txt

Not very easy to read, is it? You may be able to scroll up in the terminal to see the whole thing, but this may depend on which browser you're using, so it is not a very good solution. How else can we look at this file? Run this:

$ head very_long_welcome_message.txt

By default, head shows you the first 10 lines of the file. We can see that this doesn't show us everything, so we can use a flag to decide how many lines we want to look at. Try running this:

$ head -n 23 very_long_welcome_message.txt

What if we want to read the end of a file? There's a command for that too!

$ tail very_long_welcome_message.txt

You can also give a similar argument to change the number of lines:

$ tail -n 16 very_long_welcome_message.txt

There are also other useful arguments to tail, including being able to offset from the beginning of the file using a "+" in the argument for the "-n" flag. We encourage you to read the man page for tail and see how to use this. It will come in handy!
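As a hint of what to look for (a minimal example; check the man page for the exact details), a command like this prints everything from line 16 of the file onwards, rather than the last 16 lines:

$ tail -n +16 very_long_welcome_message.txt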

So far, all of these commands spit out the files on the command line, but what if we want to be able to scroll through a file without spitting it out? There are two very useful and similar tools for scrolling through large files:

$ less very_long_welcome_message.txt
$ more very_long_welcome_message.txt

These may feel similar to the way the man pages are displayed; that's because one of these two is indeed typically used to display man pages. So we can use the same keys that we learned for man pages to move through files using these tools: [space] to page down, the up- and down-arrows to scroll line-by-line, and [q] to quit.

5. Creating and removing files and directories

So now we have a good idea of how to look at files and move around in the directory structure. How about actually making files and directories? Let's start by making a new directory. Run this:

$ mkdir temp_dir
$ ls

Now you should see the directory we just created listed. How about getting rid of a directory? Run this:

$ rmdir temp_dir

Now if you do another 'ls', you should see that it's gone. Note that rmdir can only remove directories that don't have any files in them, so if a directory contains any files, they must be removed first. Check this for yourself:

$ rmdir delete_me

This leads to the obvious question: how do we delete files? Let's see what kind of files are in the directory:

$ ls delete_me/

So we see that there's only one file in there, so let's get rid of it!

$ rm delete_me/cant_touch_this.txt

Confirm that the directory is now empty with another 'ls', and see that we can now get rid of this directory using rmdir.
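Spelled out, those two steps look like this:

$ ls delete_me/
$ rmdir delete_me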

How about the other directory, "delete_me_too"? You can try deleting it with rmdir, but again it doesn't work. Let's see what's in there:

$ ls delete_me_too/

This time we can see that there are 200 files in this directory. We could delete them by hand, but that would take a very long time. Instead, we will introduce a very powerful concept: the asterisk, "*", acts as a wildcard character on the command line, meaning that it can represent "0 or more of any character." In other words, you can use it to refer to all files that match a certain pattern. For example, you can list all the files in the delete_me_too directory that end with ".txt" like this:

$ ls delete_me_too/*.txt

Notice that this doesn't show us all the files, because some of them end in ".text". Correspondingly, we can list these using:

$ ls delete_me_too/*.text

However, we'd like to use this wildcard to delete all the files in the directory using a single command. This is quite simple:

$ rm delete_me_too/*

The wildcard construct is a very powerful way to deal with lots of files, but should be used with caution, especially the command we just gave, "rm *". This command will delete all the files in a directory, and unlike Windows or Mac file explorers, the "rm" command will straight up delete your files with no way to get them back and no recycling bin, so use caution! If you feel particularly uncomfortable with this, you can give the "-i" flag to rm to have it prompt you for every file you are deleting, so that you can double check to be sure. Now that we have deleted the files in this folder, we can use rmdir to delete the "delete_me_too" folder.
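Go ahead and do that now:

$ rmdir delete_me_too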

Now we know how to get rid of files, make directories, and remove directories, leaving us only the question of how to actually create files. There are many different ways to do this, but we'll start with the simplest. Try running this:

$ touch new_file.txt

Now we have created a file! The touch command operates on the file access and modification time: if the file given as an argument does not exist, then it is created, and if it does exist, then the modification time is updated to the current time/date (remember how we saw the modification times when we ran "$ ls -l"?).
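To see the second behavior for yourself, you can run touch again on the file we just made and then check its timestamp with a long listing:

$ touch new_file.txt
$ ls -l new_file.txt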

However, this doesn't actually put anything in the file, so we need some other tools to actually generate files. A commonly used approach is to use the ">" character, which sends the output from any command to a specified file. For example, try running this:

$ ls > file_listing.txt

Now if you "cat" the file_listing.txt file, you will see that the output from "ls", which you've been seeing on the command line, is now stored in that file. This can be done for any command with any flags, for example:

$ head -n 23 very_long_welcome_message.txt > beginning_of_welcome_message.txt

Another use of the "cat" command is to create your own files. If you don't give it any arguments, then it will just read in what you type and spit it back out. To signal that the file is ending, you type "^d" (remember that "^" denotes the [Control] key). First just try running this command, typing some stuff, and then type "^d" when you're done, and see that every time you press the enter key, the cat command spits back the line you just typed:

$ cat

Now, if we want to create a file to save our output, it is as simple as using ">":

$ cat > cat_test.txt

Notice that now when you press the enter key, you don't see the line you just wrote spit back on the command line; this is because the output is now redirected to the file "cat_test.txt". To check this, run:

$ cat cat_test.txt

6. Editing files

Using the cat command with the ">" redirection is useful for creating files, but you may have noticed that if you make a typo or accidentally hit return, you can't go back to change what you have done, so it's not the best choice for text editing.

There are several more powerful tools for text editing in a bash shell, some of which are quite simple and some of which are rather complex. A very simple, easy to use one, which we will be using for the class, is called "nano". Try running this command and explore the interface:

$ nano new_file_test.txt

Remember that the "^" character stands for the control key, so for example when you are done editing, you can type [Control]+x to get out. Nano is a basic text editor that is easy to use and straightforward.

7. Moving things around

One thing we haven't yet addressed is how to move files around. Let's see how we can do that by learning how to move all the files from "dir1" into "dir2". First, let's move into dir1:

$ cd dir1/

Let's first try moving a single file:

$ mv file_1.txt ../dir2/

The 'mv' (move) command is the way to do this. Now if we check dir2, we should see file_1.txt in there, and it should no longer be in dir1:

\$ ls ../dir2/

\$ ls

What if we just want to make a copy? There is also a command for that:

$ cp ../dir2/file_1.txt .

Now if we check the two directories, we should see that they both have the file called "file_1.txt" in them:

\$ ls ../dir2/

\$ ls

Now let's try moving all the files from dir1 into dir2 using the wildcard construct:

$ mv * ../dir2/

Notice that dir2 now only has a single copy of file_1.txt. When you use the "mv" command, if a file of the same name exists in the target directory, the file being moved overwrites the file that was there, so you have to be careful not to accidentally overwrite something with the same name! Now let's try copying the files back into dir1 so that we have the same files in both directories:

$ cp ../dir2/* ./

Notice the use of the "./", which denotes the directory we are currently in! The "mv" command also has another use, which is simply to rename files or directories. Let's rename one of the files that we just copied into this directory:

$ mv file_6.txt my_favorite_file.txt

Now we no longer have file_6.txt, as it has been renamed. Note that the file that we copied it from remains unchanged in dir2/. Let's also see how to rename directories:

\$ cd ..

\$ mv dir1/ orig_dir/

Now we have changed the name of dir1 to reflect that it was the one that originally contained the files. Note that if the destination you give to 'mv' is a directory that already exists, the source directory will be moved into that directory rather than renamed. Try running this:

\$ mv dir2/ orig_dir/

\$ ls orig_dir/

Now we see that dir2 has been moved into the directory named orig_dir/. Also note that you can use any absolute or relative path for these commands; in other words, you can use these commands to move anything from any place to any other place!
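For example, a command along these lines (don't run it; the file and directory names here are made up purely for illustration) would move a file from a directory under your home folder into /tmp:

$ mv ~/some_directory/some_file.txt /tmp/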

8. More advanced processing

Now we know the basics of moving around, determining what files exist in a given directory, and creating and removing files and directories. Let's learn some other tools that will start to give you a sense of how powerful the Unix shell is for manipulating files. First, let's move into a folder with some data to manipulate:

\$ cd data_to_sort/

\$ ls

Let's look at this unsorted data:

$ cat unsorted_fruits.txt

So we see that it appears to be a list of fruits (possibly with an exception) that is not in alphabetical order. First let's learn how to check how many lines, words, and bytes are in the file. Run this:

$ wc unsorted_fruits.txt

The default output is kind of hard to understand, but if you check the man page for the word count command, you will see that the first column is the number of lines (42), the second is the number of words (44), and the third is the number of bytes in the file. Sometimes you might want to just know the number of lines, in which case you can run:

$ wc -l unsorted_fruits.txt

Check the man pages to see what else you can do! Now let's see how we can actually sort this file:

$ sort unsorted_fruits.txt

As you would probably expect, the sort command sorts the input you give it. This lets us see that there are several repeated fruits in this list, and you might also notice that there are some non-fruits hiding out in this list. First, let's write the sorted file to a new one, so we can then see how to get only the unique entries, using the following commands:

\$ sort unsorted_fruits.txt > sorted_fruits.txt

\$ uniq sorted_fruits.txt

\$ uniq sorted_fruits.txt > unique_fruits.txt

So we see that the uniq command gets you the unique entries only. Try running uniq on the unsorted_fruits.txt list to see that it only collapses repeated lines that are adjacent to one another; if the file isn't sorted, duplicates that aren't next to each other will still show up more than once. Uniq also has some useful options, such as counting how many times each entry was found:

\$ uniq -c sorted_fruits.txt

This raises the question of whether we can find these unique entries without having to create a sorted intermediary file in between. And as you might expect, we're not just raising this question without an answer: there is a very powerful concept in Unix that allows you to do exactly this, called the pipe, denoted by the "|" character. This allows you to "pipe" the output of one command as input into another command. For example, if we want to sort our file first and then find the unique entries, we can run:

\$ sort unsorted_fruits.txt | uniq

See that this gives the same output as running uniq on the sorted_fruits.txt file. And of course, we can save the output of the piped commands in a similar way:

\$ sort unsorted_fruits.txt | uniq > unique_fruits_2.txt

What if we want to check that they gave the same output? There is another useful command called "diff" which allows you to do this. Try running it:

\$ diff unique_fruits.txt unique_fruits_2.txt

You should see that it doesn't give any output; this signifies that the files are identical. Finally, we can use piping to check how many unique fruits there are in total by stringing together the three commands we just learned using pipes:

\$ sort unsorted_fruits.txt | uniq | wc -l

This shows us that there are 35 unique fruits (and a vegetable or two) in this list.
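As a small bonus (not needed for this prelab, but a taste of how these pieces combine), you can also count how many times each entry appears and show the most common ones first; here "-c" makes uniq prefix each line with its count, and sort's "-n" (numeric) and "-r" (reverse) flags put the largest counts on top:

\$ sort unsorted_fruits.txt | uniq -c | sort -rn | head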

9. Other useful commands

Now you know the basics of using a Unix system, and are hopefully well on your way to getting comfortable with it! In this section, we introduce some extra commands that are also useful to know. We won't be using these tricks much for this class, but they often come in handy on a real server. If you recall that the up- and down-arrow keys cycle through the commands that you have run, you might wonder how the shell knows those commands. It does this by saving each command that is run (up to a user-defined limit) in a file that can be accessed using the following command:

\$ history

The actual file is stored at ~/.bash_history, and the above command is roughly equivalent to running:

\$ cat ~/.bash_history

This illustrates another aspect of Unix systems, which is that files that start with a "." are hidden. Notice that if you just run:

\$ ls ~

You don't see this file. Of course, there is a flag to see all these hidden files:

\$ ls -a ~

Another useful command lets you check how much space is being used by a directory. Since we should still be in the data_to_sort directory, let's check how big it is:

\$ du

There are some useful flags for disk usage that often get used, especially to report just the total for a directory rather than listing every subdirectory (-s) and to report the results in human readable format (-h), i.e.:

\$ du -sh

We can also go back to the main folder for this module and check how much space is used by all the subdirectories:

\$ cd ../..

\$ du -sh

Finally, you might also want to know how much space is available on the whole system. You can display the disk free space using the command:

\$ df

And of course, it is often useful to have the output in a human readable format:

\$ df -h

When setting up an analysis pipeline, it is often useful to put all of the relevant data sets, programs, scripts, readme documents, etc. into the same working directory.

However, if one has a particularly large data set that is used in multiple analysis pipelines (and directories, if you keep your analyses somewhat organized instead of keeping them in a single place), it is highly undesirable to make duplicate copies of those files.

Instead, one would much rather have a shortcut to where those files are actually stored than a duplicate of the files themselves, somewhat (but not exactly) like an "alias" in Windows/OSX. In Unix, you can create something called a symbolic link. A symbolic link contains a text string that is automatically interpreted and followed by the operating system as a path to another file or directory, the "target"; i.e. it "points" to the target.

The symbolic link is itself a new file independent of its target. If a symbolic link is deleted, its target remains unaffected.

If a symbolic link points to a target that is later moved, renamed, or deleted, the symbolic link is not automatically updated or deleted. It will continue to exist and still point to the old target, which is now a non-existent location or file, so this is something to be wary of.

So, imagine you had a file called "myRNASeqdata.fasta" located in the /mylab/allthedata directory. To make a symbolic link in your current directory, you could:

\$ ln -s /mylab/allthedata/myRNASeqdata.fasta
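If you then run a long listing in that directory, the link shows up with an arrow indicating its target, ending with something like "myRNASeqdata.fasta -> /mylab/allthedata/myRNASeqdata.fasta" (still using the made-up path from the example above):

\$ ls -l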

10. Questions

In the cells below, write the Unix commands you would use to do the following:

  1. Create a file called "myName.txt"
  2. Write your name in the text file
  3. Check the file size of "myName.txt"
  4. Create a new directory called "data"
  5. Move "myName.txt" to the new directory
  6. Create a symbolic link to the file "myName.txt", which resides in the current directory (i.e., not /data)

Protip: You can always open a UNIX terminal to try it all out!