grep
A common task is to extract information from large files. This can be achieved using the Unix command grep
, which stands for "Globally search for a Regular Expression and Print". The meaning of this acronym will become clear later, when we discuss Regular Expressions. First, we will consider simpler examples.
Before we start, change into the Unix/grep
directory:
In [ ]:
cd grep
We will use a small example file (in "BED" format), which contains the expression levels of some genes. This is a column-based file, with a tab character between each column. There can be more than 10 columns, but only the first three are required to be a valid file. The file format is described in full here: http://genome.ucsc.edu/FAQ/FAQformat#format1. We will use the first 5 columns:
Here is the contents of the first example BED file used in this course:
In [ ]:
cat gene_expression.bed
In reality, such a file could contain 100,000s of lines, so that it is not practical to read manually. Suppose we are interested in all the genes from chromosome 2. We can find all these lines using grep:
In [ ]:
grep chr2 gene_expression.bed
This has shown us all the lines that contain the string "chr2".
We can use a pipe to then just extract the genes that are on the positive strand, using grep a second time:
In [ ]:
grep chr2 gene_expression.bed | grep +
However, since grep
is reporting a match to a string anywhere on a line, such simple searches can have undesired consequences. For example, consider the result of doing a similar search for all the genes in chromosome 1:
In [ ]:
grep chr1 gene_expression.bed
Oops! We found genes in chromosome 10, because "chr1" is a substring of "chr10".
Or consider the following file, where the genes have unpredictable names (which is not unusual for bioinformatics data).
In [ ]:
cat gene_expression_sneaky.bed
Now we try to find genes on chromosome 1 that are on the negative strand. We put the minus sign in quotes, to stop Unix interpreting this as an option to grep
, as opposed to the string we are searching for:
In [ ]:
grep chr1 gene_expression_sneaky.bed | grep '-'
The extra lines are found by grep
because of matches in columns we were not expecting to match. Remember, grep
is reporting these lines because they each contain the strings "chr1" and "-" somewhere.
We need a way to make searching with grep
more specific.
Regular expressions provide the solution to the above problems. They are a way of defining more specific patterns to search for.
First, we can specify that a match must be at the start of a line using the symbol "^
", which means "start of line". Without the ^
, we find any match to "chr1":
In [ ]:
grep chr1 gene_expression_sneaky.bed
However, notice the effect of searching for ^chr1
instead. Note that we put the regular expression in quotes, to avoid Unix errors. Not using quotes may or may not give an error, but it is safest to use quotes for anything but the simplest of searches.
In [ ]:
grep '^chr1' gene_expression_sneaky.bed
Good! We have removed the match to the badly-named gene "chr11.gene1", which is on chromosome 8. Now we want to avoid matching chromosomes 10 and 11. This can be done by also looking for a "tab" character, which is represented by writing \t
. For technical reasons, which are beyond the scope of this course, we must also put a dollar sign before the quotes to make any search involving a tab character work.
In [ ]:
grep $'^chr1\t' gene_expression_sneaky.bed
To find the genes on the negative strand, all that remains is to match a minus sign at the end of the line (so that we do not find "sneaky-gene3"). We can do this using the dollar "$
", which means "end of line".
In [ ]:
grep $'^chr1\t' gene_expression_sneaky.bed | grep '\-$'
In [ ]:
grep $'^chr.\t' gene_expression.bed
In fact, the earlier command that found all genes on chromosome 1 that are on the negative strand, could be found with a single call to grep
instead of two calls piped together. To do this, we need a regular expression that finds lines that:
The asterisk "*" has a special meaning: it says to match any number (including zero) of whatever character is before the *. For example, the regular expression 'AC*G' will match AG, ACG, ACCG, etc. The simpler, improved command is:
In [ ]:
grep $'^chr1\t.*-$' gene_expression_sneaky.bed
As well as matching any character using a dot, we can define any list of characters to match, using square brackets. For example, [12X] means match a 1, 2, or an X. This can be used to find all genes from chromosomes 1, 2 and X:
In [ ]:
grep $'^chr[12X]\t' gene_expression.bed
Or just the autosomes may be of interest. To do this we introduce two new features:
Warning: Adding a backslash is often called escaping (e.g. escape the plus symbol). Depending on the software you're using (and the options you give it), you may need to escape the symbol to indicate that you want its special regex meaning (e.g. multiple copies of the last character please) or its literal meaning (e.g. give me a '+' symbol please). If your command isn't working as you expect, try playing with these options and always test your regular expression before assuming it gave you the right answer.
The command to find the autosomes is:
In [ ]:
grep $'^chr[0-9]\+\t' gene_expression.bed
grep
optionsThe Unix command grep
and regular expressions are extremely powerful and we have only scratched the surface of what they can do. Take a look at the manual (by typing man grep
) to get an idea. A few particularly useful options are discussed below.
A common use-case is counting matches within files. Instead of output each matching line, the option "-c
" tells grep
to report the number of lines that matched. For example, the number of genes in the autosomes in the above example can be found by simply adding -c
to the command.
In [ ]:
grep -c $'^chr[0-9]\+\t' gene_expression.bed
By default, grep
is case-sensitive. It can be useful to ignore the distinction between upper and lower case using the option "-i
". Suppose we have a file of sequences, and want to find the sequences that contain the string ACGT. It is not unusual to come across files that have a mix of upper and lower case nucleotides. Consider this FASTA file:
In [ ]:
cat sequences.fasta
A simple search for ACGT will not return all the results:
In [ ]:
grep ACGT sequences.fasta
However, making the search case-insensitive solves the problem.
In [ ]:
grep -i ACGT sequences.fasta
So far, we have restricted to searches in one file, but grep
can be given a list of files in which to search. As an example, we are given three files called list_example.1
, list_example.2
, and list_example.3
. They are simple lists of genes, for illustrative purposes. For example, the first file looks like this:
In [ ]:
cat list_example.1
Which files contain "gene1"?
In [ ]:
grep '^gene1$' list_example.1 list_example.2
gene1 only appears in the file list_example.1
. The output format of grep
has now changed, because it was given a list of files. The format is:
ie, the name of the file has been added to the start of each matching line.
For convenience, there's also a way of specifying all of the list examples:
In [ ]:
echo list_example.*
In [ ]:
grep '^gene1$' list_example.*
How about gene42?
In [ ]:
grep '^gene42$' list_example.*
gene42 appears once in list_example.2
and twice in list_example.3
.
By default, grep
reports all lines that do match the regular expression. Sometimes it is useful to filter a file, by reporting lines that do not match the regular expression. Using the option "-v
" makes grep
"invert" the output. For example, we could exclude genes from autosomes in the BED file from earlier.
In [ ]:
grep -v $'^chr[0-9]\+\t' gene_expression.bed
Finally, we show how to replace every match to a regular expression with something else, using the command "sed
". The general form of this is:
sed 's/regular expression/new string/' input_file
This will output a new version of the input file, with each match to the regular expression replaced with "new string
". For example:
In [ ]:
sed 's/^chr/chromosome/' gene_expression.bed
The following exercises all use the FASTA file exercises.fasta
. Before starting the exercises, open a new terminal and navigate to the grep/
directory, which contains exercises.fasta
.
Use grep
to find the answers. Hint: some questions require you to use grep
twice, and possibly some other Unix commands.
grep
command that outputs just the lines with the sequence names.grep
command that outputs just the lines with the sequences, not the names.
In [ ]:
Now go to the next part of the tutorial, file processing with AWK.
You can also return to the index or revisit the previous section.