Introduction to Unix II - Prelab

Table of Contents

  1. More on data manipulation commands (cut)
  2. How to search through files (grep)
  3. A gentle introduction to regular expressions (grep)
  4. Tools for compressing files and directories
  5. Accessing the internet through the command line
  6. Extra footnotes
  7. Questions

1. More on data manipulation commands

As we saw in the assignment from the first Unix module, data files can be more complicated than a single column of items, so we often want to manipulate these more complex files. There are many useful Unix tools for doing so, and an important one is called cut. This command extracts specific columns from a data file; in the Unix I assignment, it was how we manipulated the full Pokémon data file into the separate data files containing the Pokémon names, main types, secondary types, and combinations of types. Now we will learn how to use the cut command. First, move to the directory called 'move_here', as usual:

\$ cd move_here/

Now list the files in this directory:

\$ ls

There are a few directories in here. First, we are going to look at the poke_data directory, which contains similar data to what we used for the Unix module I assignment. Now move into that directory and see what's in there:

\$ cd poke_data/

\$ ls

There are three files: orig_151_pokemon.csv, orig_151_pokemon.txt, and orig_151_pokemon.txt.2. Let's first look at the orig_151_pokemon.txt file, which is the same as the one we used in the first Unix assignment, and learn how to generate the data files that we used. Recall that we had four data files for the assignment: pokemon_names.txt, pokemon_main_types.txt, pokemon_secondary_types.txt, and pokemon_both_types.txt, corresponding to the first column, the second column, the third column, and the second and third columns together. You may also have noticed that these files did not contain the entries in the first line of the input file (the header describing what each column is). Let's start with a quick aside, by learning how we accomplished that.

The 'tail' command has a flag, '-n', that specifies how many lines from the end of the file you want to see. With a leading '+', however, it instead prints from that line number through the end, which lets you essentially skip the first line. Try running this command:

\$ tail -n +2 orig_151_pokemon.txt | head
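As a self-contained aside, the '+' form works on any input; here is a minimal sketch using 'seq' to generate numbered lines (no Pokémon file needed):

```shell
# tail -n +K starts printing at line K, so +2 skips the first line
seq 5 | tail -n +2    # prints 2, 3, 4, 5, one per line
```

The same '+2' trick is what strips the header line from the Pokémon file.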

See how the header is now gone! Now, let's learn how we generated the assignment data files. As mentioned, we used the cut command to accomplish this, which is pretty simple: you tell it which columns you want to include in your output using the "-f" flag. Try running the following commands to generate the four data files:

\$ tail -n +2 orig_151_pokemon.txt | cut -f1 > pokemon_names.txt

\$ tail -n +2 orig_151_pokemon.txt | cut -f2 > pokemon_main_types.txt

\$ tail -n +2 orig_151_pokemon.txt | cut -f3 > pokemon_secondary_types.txt

\$ tail -n +2 orig_151_pokemon.txt | cut -f2,3 > pokemon_both_types.txt

This is a fairly straightforward command. One thing to notice is that if you want to include more than one column, you can list the columns you want separated by commas. You can also include a range of columns by giving '-f' a range, such as "-f2-4", which includes the second, third, and fourth columns.
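As a quick self-contained illustration (using a made-up one-line table rather than the real data file), comma lists and ranges work like this:

```shell
# A made-up one-row, tab-separated table: name, main type, secondary type, generation
printf 'Bulbasaur\tGrass\tPoison\t1\n' | cut -f1,3   # comma list: columns 1 and 3
printf 'Bulbasaur\tGrass\tPoison\t1\n' | cut -f2-4   # range: columns 2 through 4
```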

Now that we've learned the simplest use of cut, let's look at the other data files in this directory. Use the 'head' command to look at these files, remembering that head can take as many files as you'd like to look at:

\$ head orig_151_pokemon.csv orig_151_pokemon.txt orig_151_pokemon.txt.2

So we can see that these look very similar, with one difference: where there appear to be blank spaces between the columns in the 'orig_151_pokemon.txt' and 'orig_151_pokemon.txt.2' files, there are commas in the orig_151_pokemon.csv file. This raises an issue you will often run into in bioinformatics: data files come in many different formats. Some people prefer commas to separate columns, some use tabs, and some just use spaces. So it is important to know both how to tell what kind of file you're looking at and how to deal with it.

Let's learn how to figure out exactly how your data files are organized. Typically, the first thing you want to do when you have a new data file is to use the 'head' command to look at it, as we have done here. If there are commas, you can see them right away, and so you don't need any further processing. However, if there aren't commas, you can run into the issue we have here, which is that we have two text files with spaces between the columns, but they seem to be slightly different. This may not seem super important at the moment, but many programs that read in data need to know exactly what the delimiting character is, or they will not work correctly. So, how do we figure out how the two .txt files are different?

To do this, we will use a very powerful flag for the 'cat' program, which is '-A'. Try running the following commands:

\$ cat -A orig_151_pokemon.txt | head

\$ cat -A orig_151_pokemon.txt.2 | head

First, notice that every line now has a '$' character at its end. This doesn't refer to the command prompt, as we've been using it here, but rather represents what is called a new line character, meaning that the line ends at that point. Next, notice that the orig_151_pokemon.txt file has '^I' between each column. When you call the cat command with this flag, '^I' represents a tab character (which is not the same as four spaces). This tells us that this file is tab separated. On the other hand, when we look at the orig_151_pokemon.txt.2 file, we just see single spaces between the columns, so this one is delimited by spaces.
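You can check this behavior on any input; here is a minimal sketch piping a hand-made line through GNU 'cat -A' (this works on Linux, but not with the stock Mac cat, as noted in the footnotes):

```shell
# The tab shows up as ^I and the end of the line as $
printf 'Name\tType\n' | cat -A    # prints: Name^IType$
```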

Now, what do we do with this information? Let's try to get the first column from each file, as we just did to get the Pokémon names. Try these commands:

\$ cut -f1 orig_151_pokemon.txt | head

\$ cut -f1 orig_151_pokemon.txt.2 | head

\$ cut -f1 orig_151_pokemon.csv | head

Notice that only the first one actually did what we wanted. The issue here is that cut does not know what we now know, which is how the columns are separated. It works for the tab-separated file because cut assumes tab-separated columns by default. To fix this issue, we can use the '-d' flag, which lets you tell cut what the delimiter is. Try running these commands:

\$ cut -f1 -d',' orig_151_pokemon.csv | head

\$ cut -f1 -d' ' orig_151_pokemon.txt.2 | head

Now this works! We have provided the space-separated file to illustrate an important point, which is that it can be very dangerous to use spaces to separate a data file. Any Pokémon fans out there may remember Mr. Mime, which of course has a space in its name. What happens to Mr. Mime when we try to pull out the name column from the space separated data file? See for yourself (it should be right between Starmie and Scyther)!

\$ cut -f1 -d' ' orig_151_pokemon.txt.2 | less

See how there is just a 'Mr.' where the name should be? The 'cut' command doesn't know anything about Pokémon, so it just split that row of the file right between 'Mr.' and 'Mime', just like we told it to. This is an easy trap to fall into when doing your own data analysis, since if you just look at the top few lines using head, you might not notice such an issue. That's why it's always better to use tabs or commas to separate columns in a data file!
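You can reproduce this trap in miniature without the data file; this sketch uses a single hand-written space-separated line:

```shell
# cut splits on every space, so the name "Mr. Mime" is broken in half
printf 'Mr. Mime Psychic\n' | cut -f1 -d' '    # prints: Mr.
```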

2. How to search through files

So we have now equipped you with several useful tools for manipulating your data files. However, we have not yet shown you something really useful: how to search through files to find the specific lines you're interested in. This can be done using an indispensable tool that can make your life much easier, called grep. There isn't a really intuitive reason why it's called this: it's short for "globally search for a regular expression and print", which isn't very catchy. However, the tool itself is incredibly powerful and useful. Let's start by saying we'd like to find a specific Pokémon in our file, such as Dugtrio. We could look through the file ourselves and find where it is, or we could use grep. Try running this:

\$ grep "Dugtrio" orig_151_pokemon.txt

Instead of having to look ourselves, this returns the line we're interested in. There are a ton of different flags for grep, and as usual, the best way to learn them is to read the man page and try to use them to accomplish a specific task you're interested in. Let's learn a few of these flags.

Notice how we capitalized the 'D' in Dugtrio in the above command. If we don't do this, it won't return anything, as you can see if you run this:

\$ grep "dugtrio" orig_151_pokemon.txt

Let's say we're lazy and don't want to hit the shift key, or we just don't care about capitalization. The '-i' flag tells grep to ignore capitalization, matching regardless of case. All of these will find the line we want. See for yourself:

\$ grep -i "dugtrio" orig_151_pokemon.txt

\$ grep -i "DuGtRiO" orig_151_pokemon.txt

\$ grep -i "DUGTRIO" orig_151_pokemon.txt

Now let's say we want to know where in the list of Pokémon Dugtrio is. The '-n', or '--line-number', flag allows us to do this:

\$ grep -n "Dugtrio" orig_151_pokemon.txt

\$ grep --line-number "Dugtrio" orig_151_pokemon.txt

This is the first time we have seen two different flags doing the same thing, but this is actually a common construction: a flag given as a single letter prefixed by a single dash often has an equivalent longer, more descriptive form prefixed by two dashes.
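For instance, '-i' and its long form '--ignore-case' behave identically; here is a self-contained check on a hand-made line:

```shell
printf 'DUGTRIO Ground\n' | grep -i 'dugtrio'             # prints: DUGTRIO Ground
printf 'DUGTRIO Ground\n' | grep --ignore-case 'dugtrio'  # same output
```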

Now let's give grep a bit more of a workout. Let's say, instead of looking for a specific Pokémon, we just want to find all the other ground-type Pokémon in the file. The grep command can handle this with ease, as you can see if you run this:

\$ grep "Ground" orig_151_pokemon.txt

And of course, the flags we already tried will still work here:

\$ grep -i "ground" orig_151_pokemon.txt

\$ grep -n "Ground" orig_151_pokemon.txt

What if we'd like to count all the ground-type Pokémon? We saw previously how to use pipes and the 'wc' command to count, so let's see how this can work with grep. Run this:

\$ grep "Ground" orig_151_pokemon.txt | wc -l

However, there is an even easier method we can use to do this, that doesn't involve piping to another command. Instead, we can use the '-c' or '--count' flag to get the same output. Run this:

\$ grep -c "Ground" orig_151_pokemon.txt

Another useful flag picks out all the lines that do not match the pattern. For example, if we want all the non-ground Pokémon, we can use the '-v' (or '--invert-match') flag. Run this if you want (but it will spit out many Pokémon!):

\$ grep -v "Ground" orig_151_pokemon.txt
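Both counting and inverting can be seen on a tiny hand-made list (these three lines are made up for illustration):

```shell
# Count matching lines, then show the non-matching ones
printf 'Diglett Ground\nDugtrio Ground\nPikachu Electric\n' | grep -c 'Ground'   # prints: 2
printf 'Diglett Ground\nDugtrio Ground\nPikachu Electric\n' | grep -v 'Ground'   # prints: Pikachu Electric
```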

3. A gentle introduction to regular expressions

So far, we have just been searching through files by looking for exact matches to the pattern we're interested in. However, this has barely scratched the surface of what grep is capable of. Instead of looking for exact matches, we can also use grep to find lines that fit certain patterns. Recall from the prelab of the first Unix module that we used the * (wild card) character to match all the files ending in a certain pattern (such as *.txt vs *.text). This character can also be used in grep, though its meaning there is slightly different, and it is part of a very expressive system of pattern matching called regular expressions. These can get quite complicated, and we could easily spend several classes showing you all the different things they are capable of. We don't have time for that, so let's start by seeing some simple examples. Let's say we want to pull out all the Nidoran-related Pokémon, which are split by gender, with each gender evolving into two different forms, for a total of 6 Pokémon we want to pull out. These all start with 'Nido', so let's see how we can use the * character to find these:

\$ grep "Nido*" orig_151_pokemon.txt

See how we now get all 6 of these Pokémon from this simple command. Be careful, though: in a regular expression, * does not work like the shell wild card. In technical terms, it tells grep to match the previous character 0 or more times, so "Nido*" actually matches "Nid", "Nido", "Nidoo", and so on. Because every one of these names contains "Nido", and grep prints any line containing a match, we still get all six lines we wanted.
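To convince yourself that * applies to the previous character (here the 'o'), try it on a few made-up strings:

```shell
# "Nido*" means: 'N', 'i', 'd', then zero or more 'o' characters
printf 'Nid\nNido\nNidoo\nNada\n' | grep 'Nido*'   # prints: Nid, Nido, Nidoo
```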

This is just one of the many special characters that can be used as part of regular expressions. We will dive a little more deeply into grep for the in-class assignment, but a full exploration of regular expressions is out of the scope of this course. If you would like to learn more, here are some useful websites:

Intro: http://zytrax.com/tech/web/regex.htm

Examples: http://www.regular-expressions.info/examples.html

4. Tools for compressing files and directories

So far, we have only been working with relatively small text files that are easy to look at and process. However, in the realm of bioinformatics (and any computational field), most data will not be so small and easy to handle. So, it is very useful to know how to use tools for compression, the process of encoding data or information using fewer bits (which means that the files will take up less space on the computer). We will now go through a few of the different options for performing compression on Unix systems. First, move to the folder called "compression_data" and look at the files:

\$ cd ../compression_data/

\$ ls

You may recognize some of these file extensions, including ".zip", which you have probably come across in your day-to-day computer usage; these are supported by programs like WinZip and the macOS Finder. These files all contain the same data compressed using different programs. First let's compare their sizes:

\$ ls -l

As an aside, let's learn how to sort this list by size. The sort command accepts a flag, "-k", which lets you pick a specific column to sort by; appending 'n' to the column number sorts it numerically. Here we want to sort by the 5th column (the size), so we can use this command:

\$ ls -l | sort -k5n
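The same '-k' construction works on any columned text; here is a minimal sketch with made-up values:

```shell
# Sort by the 2nd column, numerically ('n'); a plain text sort would put 100 first
printf 'b 20\na 3\nc 100\n' | sort -k2n
```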

We can also sort by size using the '-S' flag for 'ls' (note that this lists the largest file first):

\$ ls -lS

Using either approach, we can see that the file ending in ".gz" is the smallest, followed closely by ".zip" and ".tar.gz", with ".tar" trailing far behind. This is because .gz, corresponding to the gzip program, and .zip, corresponding to the zip program, are actual compression tools, while .tar, corresponding to the tar program, is an archiving tool: it doesn't compress the data at all, but it provides a way to collect many files (and their metadata) into a single file. It is most useful for bundling multiple files, as we will see in the in-class assignment.

Now let's learn how to compress and decompress these files, starting with .zip files. Although you can use interactive programs like WinZip to compress and decompress these files, there are also useful command line tools that do the trick. For .zip files, they are very easy to remember: the 'zip' command compresses files, and the 'unzip' command decompresses files. Let's try decompressing:

\$ unzip meow.zip

\$ ls

You should now see the 'meow.txt' file in this directory, and you can peruse it at your leisure. Let's make a new subdirectory so that we can practice compressing files without overwriting what we already have:

\$ mkdir compression_practice

\$ mv meow.txt compression_practice/

\$ cd compression_practice

Now we have the original text file in this folder, so let's try zipping it back up. The syntax of the zip command is that you first give it the name of your desired zip file, followed by a list of the files you want to compress. In our case, we only have one file to compress, so the command is:

\$ zip meow.zip meow.txt

Let's compare this to the file I provided:

\$ ls -l meow.zip ../meow.zip

They should be the same size! Now let's move on to learning about gzip, which gives the file extension ".gz". Similarly to zip, gzip has a matching command for decompression called gunzip. Let's try it out:

\$ cd ..

\$ gunzip meow.gz

\$ ls -l

Notice how this time, instead of creating a separate file called 'meow.txt', gunzip replaced the 'meow.gz' file with a file simply called 'meow'. This file is identical to the meow.txt file, but it illustrates the default behavior of gzip/gunzip: gunzip simply strips the '.gz' suffix (here leaving 'meow', with no '.txt'), and it replaces the input file rather than keeping it. Keeping track of original file names is where 'tar' comes in, as we'll see in a second. First, let's recompress the 'meow' file and learn how to send the gunzip output to a file of our choice without overwriting the compressed file:

\$ gzip meow

\$ ls

Now we have meow.gz back. Let's try extracting it into the practice folder and compare it to the file we extracted from zip. The way to do this is to use the "-c" flag to gunzip, which tells it to decompress to standard output (i.e. to print it in the terminal), and then redirect that output into our file of interest (recall that redirection is done using ">"):

\$ gunzip -c meow.gz > compression_practice/meow_gzip.txt

Now let's check that this is the same as the file we got from zip:

\$ diff compression_practice/meow.txt compression_practice/meow_gzip.txt

It worked! Finally, let's learn how to use the tar program. As I mentioned above, the tar command on its own just bundles several files (and their metadata) into a single archive, creating a .tar file such as 'meow.tar', which we saw was actually bigger than the original input file. First, let's learn how to recover the original file from that one. The most important flags to remember for the tar command are "-c", which creates a new archive, and "-x", which extracts files from an archive. We also need "-f", which tells tar that the next argument is the name of the archive file to read or write. Finally, the "-v" flag, which tells it to be verbose, is useful for seeing exactly what is being archived or extracted. Like the flags we've seen before, these may seem very hard to remember, but as you continually use these commands, they will start to become second nature.
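Before touching meow.tar, here is the whole create/extract cycle on a throwaway file in a scratch directory (all names here are made up), which you can safely delete afterwards:

```shell
mkdir tar_demo && cd tar_demo
printf 'hello\n' > note.txt
tar -cvf notes.tar note.txt    # -c create, -v verbose, -f archive name
rm note.txt                    # delete the original...
tar -xvf notes.tar             # -x extract: note.txt is written back out
cat note.txt                   # prints: hello
cd .. && rm -r tar_demo        # clean up
```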

Let's start by getting some practice and extracting the meow.tar file (notice how we can combine the letters for all three flags into a single "-" argument and have it be correctly interpreted):

\$ tar -xvf meow.tar

\$ ls

Notice how the output tells you that meow.txt was the file that was extracted, and that instead of removing the meow.tar file, this program writes out the original files again. Let's move that file to the compression_practice folder and rename it so that we can practice re-compressing it:

\$ mv meow.txt compression_practice/meow_for_tar.txt

\$ cd compression_practice/

Creating an archive with tar follows a similar syntax to gzip. We give it the three flags we used before, except with "-c" replacing "-x", followed by the name of the archive file you want to create and then the list of files to put into it (note that we are only using one file, for now):

\$ tar -cvf meow.tar meow_for_tar.txt

\$ ls

See how we now have the meow.tar file back, as well as the meow_for_tar.txt file. Now, we have one final file type to learn about, which is the meow.tar.gz file. You can probably figure out from the file extension that this is a combination of the tar and the gzip commands; this is a powerful approach, since tar is an archiving program and gzip is a compression program, and combining the two usually gives the best compression. It is also very convenient to use, as the tar command has a single flag that allows it to call gzip to either compress or decompress these files: "-z" (for gzip), so these commands look very similar to the ones we just learned.

Let's start by decompressing the meow.tar.gz file:

\$ cd ..

\$ tar -xzvf meow.tar.gz

See how the behavior is basically the same as what we used for meow.tar, except now we operate on the .tar.gz file. Let's move and rename this as well so we can see how to compress:

\$ mv meow.txt compression_practice/meow_for_targz.txt

\$ cd compression_practice/

\$ tar -czvf meow.tar.gz meow_for_targz.txt

Now we've recreated all the compressed files, and we can compare to the ones I gave you to see that they are the same size:

\$ ls -l

\$ ls -l ..

5. Accessing the internet through the command line

Now you should have a grasp on many of the very useful tools in the Unix tool kit, so to speak, and are hopefully becoming more and more comfortable with the command line. There is another useful class of tools that we haven't yet discussed, and those are the commands for accessing the internet through the command line.

One common task you will want to accomplish is to download software from the web directly into your Unix system. There are two tools that can be used to accomplish this: wget and curl. By default, Mac systems do not come with wget, so we will show you how to use both tools. For this, we will try downloading an example configuration file for a track hub on the UCSC Genome Browser. First, let's move out of the compression practice folder and make a new folder for downloading. Run these commands:

\$ cd ../../

\$ mkdir ucsc_downloads

\$ cd ucsc_downloads

The file we want to download is located at http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt. First, let's try downloading it using wget by running this command:

\$ wget http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt

\$ ls

Notice how the hub.txt file is now present. In wget, we can also use the '-O' flag to specify a different filename that we want to download a file to (this will prove useful in the Unix module II assignment). Try running this:

\$ wget -O wget_hub.txt http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt

\$ ls

See how the same file is now downloaded to wget_hub.txt, instead of hub.txt! Now, let's see how to download files with curl. Unlike wget, curl will, by default, print the transferred file to the standard output, rather than put it in a file, so we can use output redirection to put it in the right place. First just try running curl alone to see the file output:

\$ curl http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt

Now, we can redirect this to a file:

\$ curl http://genome.ucsc.edu/goldenPath/help/examples/hubDirectory/hub.txt > curl_hub.txt

Finally, there are also ways to interact more deeply with a remote server. These tools are used when you have a remote server that you can work on. For example, if you use PMACS, the Penn Medicine computing cluster service, you will have to use these tools to access those servers. However, because of the way we've set this course up, we don't have remote servers for you, so here we will just show you what the commands look like. The ssh command, which stands for secure shell, is a way to log into a remote server and access a Unix terminal on that server. It is used as follows:

\$ ssh username@servername

Here servername might be something like consign.pmacs.upenn.edu, for example, and username is your username on that server (i.e. for PMACS, this is your PennKey). The scp command, which stands for secure copy, is analogous to the 'cp' command that we've been using, but is used to copy files to or from a remote server. The syntax is:

\$ scp myfile username@servername:/path/to/directory

This sends myfile to the server and puts it in the directory specified after the colon. To copy files from a remote server, the syntax is:

\$ scp username@servername:/path/to/directory/myfile /path/to/local/dir/

This copies /path/to/directory/myfile on the remote server into /path/to/local/dir/ on the local machine.

6. Extra footnotes

Note that if you try to use the 'cat -A' flag on a Mac terminal, it will not work, because Macs ship the BSD version of 'cat', which doesn't support that flag. To reproduce the same behavior there, you can give '-et' as a flag instead.

7. Questions

Write a command to find all lines in orig_151_pokemon.csv with type "Poison".

Which flags would you use with the tar command to extract a file?