Looking inside files

A common task is to look at the contents of a file. This can be done with several Unix commands, including less, head and tail. Let us consider some examples.

But first, change directory into the Unix/files/ directory (hint: you might need to go up two directories first using cd ../..). Check that the following commands give you a similar output:


In [ ]:
pwd
ls

less

The less command displays the contents of a specified file one screen at a time. To test this command type the following command followed by the enter key:

less Styphi.gff

The contents of the file Styphi.gff are displayed one screen at a time. To view the next screen, press the space bar. As Styphi.gff is a large file, paging through all of it will take a while, so you may want to exit early. To do this, press the q key: this quits less and returns you to the Unix prompt. less can also scroll backwards if you press the b key. Another useful feature is the slash key, /, which searches for an expression in the file. Try it: search for the gene with locus tag t0038. What are the start and end positions of this gene?

head and tail

Sometimes you may just want to view the text at the beginning or the end of a file, without having to display all of the file. The head and tail commands can be used to do this.

The head command displays the first ten lines of a file.

To look at the beginning of the Styphi.gff file type:


In [ ]:
head Styphi.gff

The tail command displays the last ten lines of a file.

To look at the end of Styphi.gff type:


In [ ]:
tail Styphi.gff

The amount of the file that is displayed can be increased by adding extra arguments. To increase the number of lines viewed from 10 to 25 add -n 25 to the command:


In [ ]:
tail -n 25 Styphi.gff

Here you've given tail an argument in two parts: the -n says that you want to specify the number of lines to show, and the 25 tells it how many. Unlike earlier, when we merged arguments like ls -lha together, it's not a good idea to merge several two-part arguments, because it then becomes ambiguous which value goes with which argument.

-n is such a common argument for tail and head that it even has a shorthand: -n 25 and -25 mean the same thing.
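
If you would like to try head and tail on something small enough to check by eye, the following is a minimal sketch. It does not use Styphi.gff: the file numbers.txt and the temporary directory are made up for this illustration, and seq is assumed to be available on your system to generate the numbers 1 to 30.

```shell
# Build a small throwaway file so the example does not depend on Styphi.gff.
demo_dir=$(mktemp -d)
seq 1 30 > "$demo_dir/numbers.txt"   # hypothetical file: the numbers 1 to 30, one per line

# With no options, head shows the first ten lines.
default_head=$(head "$demo_dir/numbers.txt")

# -n takes a value: here the first 3 lines and the last 2 lines.
first_three=$(head -n 3 "$demo_dir/numbers.txt")
last_two=$(tail -n 2 "$demo_dir/numbers.txt")

echo "$first_three"   # prints 1, 2 and 3 on separate lines
echo "$last_two"      # prints 29 and 30 on separate lines

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```

Because the file contents are predictable, it is easy to confirm that the value after -n controls exactly how many lines are shown.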

Saving time

Saving time while typing may not seem important, but the longer that you spend in front of a computer, the happier you will be if you can reduce the time you spend at the keyboard.

  • Pressing the up/down arrows will let you scroll through previous commands entered.

  • If you highlight some text, middle clicking on the mouse will paste it on the command line.

  • Tab completion doesn't just work on filenames, it also works on commands. Try it by typing fin and pressing tab...

fin  <press the TAB key>

Although tab completion works on commands and filenames, unfortunately it rarely works on options or other arguments.

man - getting help

To obtain further information on any of the Unix commands introduced in this course you can use the man command. For example, to get a full description and examples of how to use the tail command type the following command in a terminal window.

man tail

There are several other useful commands that can be used to manipulate and summarise information inside files. We will introduce some of these next: cat, sort, wc and uniq.

Writing to files

So far we've been running commands and outputting the results into the terminal. That's obviously useful but what if you want to save the results to another file?

Type this:


In [ ]:
head -1 Styphi.gff > first_Styphi_line.txt

It may look like nothing has happened. This is because the > character has redirected the output of the head command. Instead of writing to the standard output (your terminal) it sent the output into the file first_Styphi_line.txt. Note that tab completion works for Styphi.gff because it exists but doesn't work for first_Styphi_line.txt because it doesn't exist yet.

cat

cat is another way of reading files, but unlike less it writes the entire contents of the file to your standard output in one go. Try it on first_Styphi_line.txt:


In [ ]:
cat first_Styphi_line.txt

We don't need first_Styphi_line.txt any more so delete it by typing


In [ ]:
rm first_Styphi_line.txt
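
One detail of > is worth knowing before you redirect more output: it creates the target file if it does not exist, and replaces the contents if it does. The sketch below illustrates this on throwaway files. The names input.txt and saved.txt and the temporary directory are invented for the example and are not part of the course data.

```shell
# Work in a throwaway directory so no course files are touched.
demo_dir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$demo_dir/input.txt"   # hypothetical input file

# Nothing is printed here: the first line goes into the new file instead.
head -n 1 "$demo_dir/input.txt" > "$demo_dir/saved.txt"
after_first=$(cat "$demo_dir/saved.txt")

# Redirecting to the same file again replaces its previous contents.
tail -n 1 "$demo_dir/input.txt" > "$demo_dir/saved.txt"
after_second=$(cat "$demo_dir/saved.txt")

echo "after first redirect:  $after_first"    # line one
echo "after second redirect: $after_second"   # line three

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```

So it pays to check the filename after > before pressing enter, because an existing file with that name is overwritten without a warning.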

The cat command can also be given the names of multiple files, one after the other and it will just output the contents of all files. The order in which the files are displayed is determined by the order in which they appear in the command line. You can use this concept and the > symbol to join files together into a single file.

Having looked at the beginning and end of the Styphi.gff file you should notice that in the GFF file the annotation comes first, then the DNA sequence at the end. If you had two separate files containing the annotation and the DNA sequence, it is possible to concatenate or join the two together to make a single file like the Styphi.gff file you have just looked at.

For example, we have two separate files, Styphi.noseq.gff and Styphi.fa, that contain the annotation and DNA sequence, respectively, for the Salmonella typhi CT18 genome. To join these files together type:


In [ ]:
cat Styphi.noseq.gff Styphi.fa > Styphi.concatenated.gff

The files Styphi.noseq.gff and Styphi.fa will be joined together and written to a file called Styphi.concatenated.gff.

The > symbol in the command line directs the output of the cat program to the designated file Styphi.concatenated.gff. Use the command ls to check for the presence of this file.


In [ ]:
ls
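
To convince yourself that cat really does write its files in the order given, you could repeat the idea on two tiny files. This is only a sketch: part1.txt, part2.txt and joined.txt are hypothetical stand-ins for the annotation, sequence and concatenated files.

```shell
# Work in a throwaway directory so no course files are touched.
demo_dir=$(mktemp -d)
printf 'annotation 1\nannotation 2\n' > "$demo_dir/part1.txt"   # stand-in for the annotation
printf 'sequence A\n' > "$demo_dir/part2.txt"                   # stand-in for the sequence

# Join the two files, first part1.txt and then part2.txt, into one new file.
cat "$demo_dir/part1.txt" "$demo_dir/part2.txt" > "$demo_dir/joined.txt"
joined=$(cat "$demo_dir/joined.txt")

# Swapping the order on the command line swaps the order of the output.
swapped=$(cat "$demo_dir/part2.txt" "$demo_dir/part1.txt")

echo "$joined"    # annotation lines first, then the sequence line
echo "$swapped"   # sequence line first, then the annotation lines

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```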

wc - counting

The command wc counts lines, words or characters.

There are two ways you could use it:


In [ ]:
wc -l Styphi.gff

or


In [ ]:
cat Styphi.gff | wc -l

Both report the same line count; the first also prints the filename after the count. In the first example you tell wc the file that you want it to review (Styphi.gff) and pass the -l option to say that you're only interested in the number of lines.

In the second example you use the | symbol which is also known as the pipe symbol. This pipes the output of cat Styphi.gff into the input of wc -l. This means that you can also use the same wc tool to count other things. For example to count the number of files that are listed by ls type:


In [ ]:
ls | wc -l

You can connect as many commands as you want. For example, type:


In [ ]:
ls | grep ".gff" | wc -l

What does this command do? You will learn more about the grep command later in this course.
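
If you want to see what each form of wc prints and how the stages of a pipeline combine, here is a small sketch you could run. The directory and the files a.gff, b.gff and notes.txt are made up for the illustration, and touch is used only to create empty files.

```shell
# Work in a throwaway directory containing three hypothetical files.
demo_dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$demo_dir/a.gff"
touch "$demo_dir/b.gff" "$demo_dir/notes.txt"   # two empty files

# Given a filename, wc -l prints the count followed by that name.
with_name=$(wc -l "$demo_dir/a.gff")

# Reading from a pipe, wc -l has no name to print, so only the count appears.
count_only=$(cat "$demo_dir/a.gff" | wc -l)

# Three stages: list the names, keep those matching ".gff", count what is left.
gff_count=$(ls "$demo_dir" | grep ".gff" | wc -l)

echo "$with_name"
echo "$count_only"
echo "$gff_count"

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```

Running the stages one at a time (first ls alone, then ls piped into grep) is a useful way to check what each step passes on to the next.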

sort - sorting values

The sort command lets you sort the contents of its input. When you sort the input, lines with identical content end up next to each other in the output. This is useful as the output can then be fed to the uniq command (see below) to count the number of unique lines in the input.

To sort the contents of a BED file type:

sort Pfalciparum.bed

Now type:


In [ ]:
sort Pfalciparum.bed | head

In [ ]:
sort Pfalciparum.bed | tail

To sort the contents of a BED file on position, type the following command.

sort -k 2 -n Pfalciparum.bed

The sort command can also sort by multiple columns, e.g. 1st column and then 2nd column, by specifying successive -k parameters in the command. For now, look at the two ends of the position-sorted output by typing the following commands:


In [ ]:
sort -k 2 -n Pfalciparum.bed | head

In [ ]:
sort -k 2 -n Pfalciparum.bed | tail

Why not have a look at the manual for sort to see what these options do? Remember that you can type / followed by a search phrase, n to find the next search hit, N to find the previous search hit and q to exit.

man sort
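
After reading the manual, you may find it helpful to check the effect of -n on a file small enough to sort in your head. The sketch below assumes a hypothetical tab-separated file, positions.txt, with a name in the first column and a number in the second. It is not one of the course files.

```shell
# Work in a throwaway directory so no course files are touched.
demo_dir=$(mktemp -d)
printf 'chr1\t900\nchr1\t1000\nchr1\t85\n' > "$demo_dir/positions.txt"   # hypothetical two-column file

# Without -n the second column is compared as text, character by character,
# so 1000 comes before 85 because "1" sorts before "8".
text_order=$(sort -k 2 "$demo_dir/positions.txt")

# With -n the second column is compared as a number.
numeric_order=$(sort -k 2 -n "$demo_dir/positions.txt")

echo "$text_order"      # second column reads 1000, 85, 900
echo "$numeric_order"   # second column reads 85, 900, 1000

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```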

uniq - finding unique values

The uniq command extracts unique lines from the input. It is usually used in combination with sort to count unique values in the input.

To get the list of chromosomes in the Pfalciparum.bed file type:


In [ ]:
awk '{ print $1 }' Pfalciparum.bed | sort | uniq

How many chromosomes are there? You will learn more about the awk command later in this course.

Warning: uniq is really stupid; it can only spot that two lines are the same if they are right next to one another. You therefore almost always want to sort your input data before using uniq.

Do you understand how this command is working? Why not try building it up piece by piece to see what it does?

awk '{ print $1 }' Pfalciparum.bed | less
awk '{ print $1 }' Pfalciparum.bed | sort | less
awk '{ print $1 }' Pfalciparum.bed | sort | uniq | less
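
The warning about uniq only comparing neighbouring lines is easy to see on a tiny example. This is a sketch on a made-up file, names.txt, in which the same two values are deliberately interleaved.

```shell
# Work in a throwaway directory so no course files are touched.
demo_dir=$(mktemp -d)
printf 'chr2\nchr1\nchr2\nchr1\nchr1\n' > "$demo_dir/names.txt"   # hypothetical input

# Without sorting, only the two adjacent chr1 lines at the end are collapsed.
unsorted_result=$(uniq "$demo_dir/names.txt")

# After sorting, identical lines sit next to each other, so each value appears once.
sorted_result=$(sort "$demo_dir/names.txt" | uniq)

echo "$unsorted_result"   # chr2, chr1, chr2, chr1 on separate lines
echo "$sorted_result"     # chr1, chr2 on separate lines

# Tidy up the throwaway directory.
rm -r "$demo_dir"
```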

Exercises

Open up a new terminal window, navigate to the files directory in the Unix directory and complete the following exercises:

  1. Use the head command to extract the first 500 lines of the file Styphi.gff and store the output in a new file called Styphi.500.gff.
  2. Use the wc command to count the number of lines in the Pfalciparum.bed file.
  3. Use the sort command to sort the file Pfalciparum.bed on chromosome and then gene position.
  4. Use the uniq command to count the number of features per chromosome in the Pfalciparum.bed file. Hint: use the man command to look at the options for the uniq command. Or peruse the wc or grep manuals. There’s more than one way to do it!