In [ ]:
cd files
less Styphi.gff
In [ ]:
head Styphi.gff
The contents of the file Styphi.gff are displayed one screen at a time; to view the next screen, press the space bar. As Styphi.gff is a large file, paging through all of it will take a while, so you may want to exit early. To do this, press the q key, which quits less and returns you to the Unix prompt. less can also scroll backwards if you press the b key. Another useful feature is the slash key, /, which searches for an expression in the file. Try it: search for the gene with locus tag t0038. What are the start and end positions of this gene?
The head
command displays the first ten lines of a file.
To look at the beginning of the Styphi.gff file, type:
In [ ]:
head Styphi.gff
The tail
command displays the last ten lines of a file.
To look at the end of Styphi.gff
type:
In [ ]:
tail Styphi.gff
The amount of the file that is displayed can be increased by adding extra arguments. To increase the number of lines viewed from 10 to 25 add -n 25
to the command:
In [ ]:
tail -n 25 Styphi.gff
Here you've given tail an argument in two parts: the -n says that you want to specify the number of lines to show, and the 25 tells it how many. Unlike earlier, when we merged single-letter options together as in ls -lha, it's not a good idea to merge several two-part arguments, because it then becomes ambiguous which value goes with which argument.
-n
is such a common argument for tail
and head
that it even has a shorthand: -n 25
and -25
mean the same thing.
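As a minimal sketch of the -n option described above, the following can be run in any directory. It uses a hypothetical 20-line file, demo_lines.txt, created in a temporary directory; the file name and its contents are assumptions made for illustration and are not part of the course data.

```shell
# Create a scratch directory and a hypothetical 20-line file
# (demo_lines.txt is an illustration, not part of the course data).
workdir=$(mktemp -d)
seq 1 20 > "$workdir/demo_lines.txt"

# -n takes a value: show the first 3 lines and the last 3 lines.
first_three=$(head -n 3 "$workdir/demo_lines.txt")
last_three=$(tail -n 3 "$workdir/demo_lines.txt")

echo "$first_three"
echo "$last_three"
```

Changing the 3 to another number changes how many lines each command shows.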
Saving time while typing may not seem important, but the longer that you spend in front of a computer, the happier you will be if you can reduce the time you spend at the keyboard.
Pressing the up/down arrows will let you scroll through previous commands entered.
If you highlight some text, middle clicking on the mouse will paste it on the command line.
Tab completion doesn't just work on filenames, it also works on commands. Try it by typing tai
and pressing tab...
tai <press the TAB key>
Although tab completion works on commands and filenames, unfortunately it does not work on options or other arguments.
Several other useful commands can manipulate and summarise the information inside files. We will introduce some of these next: cat, sort, wc and uniq.
In [ ]:
head -1 Styphi.gff > first_Styphi_line.txt
It may look like nothing has happened. This is because the >
character has redirected the output of the head
command. Instead of writing to the standard output (your terminal) it sent the output into the file first_Styphi_line.txt
. Note that tab completion works for Styphi.gff
because it exists but doesn't work for first_Styphi_line.txt
because it doesn't exist yet.
In [ ]:
cat first_Styphi_line.txt
We don't need first_Styphi_line.txt
any more so delete it by typing
In [ ]:
rm first_Styphi_line.txt
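The redirect, display and delete steps above can be tried safely on a toy file. This is only a sketch: demo.gff and first_line.txt are hypothetical names created in a temporary directory, not files from the course.

```shell
# Work in a scratch directory with a hypothetical input file
# (demo.gff and first_line.txt are made-up names for illustration).
workdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$workdir/demo.gff"

# Redirect the first line into a new file; nothing is printed to the terminal.
head -n 1 "$workdir/demo.gff" > "$workdir/first_line.txt"

# Display the new file, then delete it once it is no longer needed.
saved_line=$(cat "$workdir/first_line.txt")
echo "$saved_line"
rm "$workdir/first_line.txt"
```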
The cat
command can also be given the names of multiple files, one after the other and it will just output the contents of all files. The order in which the files are displayed is determined by the order in which they appear in the command line. You can use this concept and the >
symbol to join files together into a single file.
Having looked at the beginning and end of the Styphi.gff
file you should notice that in the GFF file the annotation comes first, then the DNA sequence at the end. If you had two separate files containing the annotation and the DNA sequence, it is possible to concatenate or join the two together to make a single file like the Styphi.gff
file you have just looked at.
For example, we have two separate files, Styphi.noseq.gff and Styphi.fa, that contain the annotation and DNA sequence, respectively, for the Salmonella typhi CT18 genome. To join these files together, type:
In [ ]:
cat Styphi.noseq.gff Styphi.fa > Styphi.concatenated.gff
The files Styphi.noseq.gff
and Styphi.fa
will be joined together and written to a file called Styphi.concatenated.gff
.
The >
symbol in the command line directs the output of the cat program to the designated file Styphi.concatenated.gff
. Use the command ls
to check for the presence of this file.
In [ ]:
ls
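To see that cat writes files out in the order they are named, here is a small sketch using two hypothetical files, demo.noseq.gff and demo.fa. Their names and contents are invented for illustration and stand in for the real annotation and sequence files.

```shell
# Two hypothetical input files in a scratch directory.
workdir=$(mktemp -d)
printf 'annotation 1\nannotation 2\n' > "$workdir/demo.noseq.gff"
printf '>seq1\nACGT\n' > "$workdir/demo.fa"

# Files are written out in the order they appear on the command line,
# and > sends the joined output into a new file.
cat "$workdir/demo.noseq.gff" "$workdir/demo.fa" > "$workdir/demo.concatenated.gff"

# Check the result: the annotation lines come first, then the sequence.
combined=$(cat "$workdir/demo.concatenated.gff")
echo "$combined"
ls "$workdir"
```

Swapping the two file names in the cat command would put the sequence before the annotation.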
In [ ]:
wc -l Styphi.gff
or
In [ ]:
cat Styphi.gff | wc -l
Both report the same number of lines, although the first also prints the file name after the count. In the first example you give wc the name of the file to read (Styphi.gff) and pass the -l option to say that you're only interested in the number of lines.
In the second example you use the | symbol, which is also known as the pipe symbol. This pipes the output of cat Styphi.gff into the input of wc -l. Because wc counts whatever is piped into it, you can use the same tool to count other things. For example, to count the number of files that are listed by ls, type:
In [ ]:
ls | wc -l
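The two ways of calling wc can be compared side by side on a toy file. This is a sketch only; demo.gff is a hypothetical five-line file created in a temporary directory.

```shell
# A hypothetical five-line file in a scratch directory.
workdir=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$workdir/demo.gff"

# Given a file name, wc prints the count followed by that name.
named_output=$(wc -l "$workdir/demo.gff")

# Reading from a pipe, wc has no file name to print, only the count.
piped_output=$(cat "$workdir/demo.gff" | wc -l)

echo "$named_output"
echo "$piped_output"

# Count the entries that ls lists in the scratch directory.
entry_count=$(ls "$workdir" | wc -l)
echo "$entry_count"
```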
You can connect as many commands as you want. For example, type:
In [ ]:
ls | grep ".gff" | wc -l
What does this command do? You will learn more about the grep
command later in this course.
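If you want to experiment with a three-stage pipeline without touching the course files, the sketch below builds a scratch directory of empty, hypothetical files (the names are invented for illustration). It also shows a stricter pattern, since an unescaped dot in a grep pattern matches any single character.

```shell
# A scratch directory holding a few empty, hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/a.gff" "$workdir/b.gff" "$workdir/c.fa" "$workdir/mygff_notes.txt"

# Stage 1 lists the directory, stage 2 keeps the lines matching the pattern,
# stage 3 counts the lines that are left.
gff_count=$(ls "$workdir" | grep ".gff" | wc -l)
echo "$gff_count"

# An unescaped dot matches any character, so mygff_notes.txt is counted above.
# Escaping the dot and anchoring with $ keeps only names ending in .gff.
strict_count=$(ls "$workdir" | grep '\.gff$' | wc -l)
echo "$strict_count"
```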
The sort command lets you sort the lines of its input. When you sort the input, lines with identical content end up next to each other in the output. This is useful because the output can then be fed to the uniq command (see below) to count the number of unique lines in the input.
To sort the contents of a BED file type:
sort Pfalciparum.bed
Now type:
In [ ]:
sort Pfalciparum.bed | head
In [ ]:
sort Pfalciparum.bed | tail
To sort the contents of a BED file on position, type the following command.
sort -k 2 -n Pfalciparum.bed
The sort command can also sort by multiple columns, e.g. the 1st column and then the 2nd column, by specifying successive -k parameters in the command. To view the start and end of the position-sorted output, type the following commands:
In [ ]:
sort -k 2 -n Pfalciparum.bed | head
In [ ]:
sort -k 2 -n Pfalciparum.bed | tail
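The difference between sorting on one column and sorting on successive columns can be sketched on a toy BED file. The file demo.bed and its four records are hypothetical. The form -k 1,1 -k 2,2n is one common way of writing successive keys and is an assumption about how you might choose to do it here: 1,1 limits the first key to column 1, and 2,2n limits the second key to column 2 and compares it numerically.

```shell
# A hypothetical tab-separated BED file: chromosome, start, end, name.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    chr2 500 900 geneC \
    chr1 2000 2500 geneB \
    chr1 100 400 geneA \
    chr2 50 300 geneD > "$workdir/demo.bed"

# Numeric sort on the start position only (column 2).
by_start=$(sort -k 2 -n "$workdir/demo.bed")
echo "$by_start"

# Successive keys: chromosome first (column 1 only), then start position
# compared numerically (column 2 only).
by_chrom_then_start=$(sort -k 1,1 -k 2,2n "$workdir/demo.bed")
echo "$by_chrom_then_start"
```

In the first result the chromosomes are interleaved; in the second, all chr1 records come before the chr2 records and each group is in position order.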
Why not have a look at the manual for sort
to see what these options do? Remember that you can type /
followed by a search phrase, n
to find the next search hit, N
to find the previous search hit and q
to exit.
man sort
To get the list of chromosomes in the Pfalciparum.bed file, type:
In [ ]:
awk '{ print $1 }' Pfalciparum.bed | sort | uniq
How many chromosomes are there? You will learn more about the awk
command later in this course.
Warning: uniq
is really stupid; it can only spot that two lines are the same if they are right next to one another. You therefore almost always want to sort
your input data before using uniq
.
Do you understand how this command is working? Why not try building it up piece by piece to see what it does?
awk '{ print $1 }' Pfalciparum.bed | less
awk '{ print $1 }' Pfalciparum.bed | sort | less
awk '{ print $1 }' Pfalciparum.bed | sort | uniq | less
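The warning about uniq only comparing neighbouring lines can be checked on a toy input. This sketch uses a hypothetical file, demo.bed, in which the chromosome names alternate so that no two identical names are adjacent.

```shell
# A hypothetical BED file whose chromosome column alternates.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    chr1 100 400 \
    chr2 50 300 \
    chr1 2000 2500 \
    chr2 500 900 \
    chr1 3000 3300 > "$workdir/demo.bed"

# Without sorting, no identical names are next to each other,
# so uniq removes nothing and all five lines remain.
unsorted_count=$(awk '{ print $1 }' "$workdir/demo.bed" | uniq | wc -l)
echo "$unsorted_count"

# After sorting, duplicates sit side by side and uniq collapses them.
chromosomes=$(awk '{ print $1 }' "$workdir/demo.bed" | sort | uniq)
sorted_count=$(echo "$chromosomes" | wc -l)
echo "$chromosomes"
echo "$sorted_count"
```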
Open up a new terminal window, navigate to the files directory in the Unix directory and complete the following exercise:
1. Use the head command to extract the first 500 lines of the file Styphi.gff and store the output in a new file called Styphi.500.gff.
2. Use the wc command to count the number of lines in the Pfalciparum.bed file.
3. Use the sort command to sort the file Pfalciparum.bed on chromosome and then gene position.
4. Use the uniq command to count the number of features per chromosome in the Pfalciparum.bed file. Hint: use the man command to look at the options for the uniq command. Or peruse the wc or grep manuals. There’s more than one way to do it!
Now go to the next part of the tutorial, searching inside files with grep.
You can also return to the index or revisit the previous section.