Searching inside files with `grep`

A common task is to extract information from large files. This can be achieved using the Unix command grep, which stands for "Globally search for a Regular Expression and Print". The meaning of this acronym will become clear later, when we discuss Regular Expressions. First, we will consider simpler examples.

Before we start, change into the Unix/grep directory:



In [ ]:

    
cd grep

Simple pattern matching

We will use a small example file (in "BED" format), which contains the expression levels of some genes. This is a column-based file, with a tab character between each column. There can be more than 10 columns, but only the first three are required to be a valid file. The file format is described in full here: http://genome.ucsc.edu/FAQ/FAQformat#format1. We will use the first 5 columns:

Sequence name
start position (starting from 0, not 1)
end position (starting from 0, not 1)
feature name
score (which is used to store the gene expression level in our examples).

Here is the contents of the first example BED file used in this course:



In [ ]:

    
cat gene_expression.bed

In reality, such a file could contain 100,000s of lines, so that it is not practical to read manually. Suppose we are interested in all the genes from chromosome 2. We can find all these lines using grep:



In [ ]:

    
grep chr2 gene_expression.bed

This has shown us all the lines that contain the string "chr2".

We can use a pipe to then just extract the genes that are on the positive strand, using grep a second time:



In [ ]:

    
grep chr2 gene_expression.bed | grep +

However, since grep is reporting a match to a string anywhere on a line, such simple searches can have undesired consequences. For example, consider the result of doing a similar search for all the genes in chromosome 1:



In [ ]:

    
grep chr1 gene_expression.bed

Oops! We found genes in chromosome 10, because "chr1" is a substring of "chr10".

Or consider the following file, where the genes have unpredictable names (which is not unusual for bioinformatics data).



In [ ]:

    
cat gene_expression_sneaky.bed

Now we try to find genes on chromosome 1 that are on the negative strand. We put the minus sign in quotes, to stop Unix interpreting this as an option to grep, as opposed to the string we are searching for:



In [ ]:

    
grep chr1 gene_expression_sneaky.bed | grep '-'

The extra lines are found by grep because of matches in columns we were not expecting to match. Remember, grep is reporting these lines because they each contain the strings "chr1" and "-" somewhere.

We need a way to make searching with grep more specific.

Regular expressions

Regular expressions provide the solution to the above problems. They are a way of defining more specific patterns to search for.

Matching the start and end of lines

First, we can specify that a match must be at the start of a line using the symbol "^", which means "start of line". Without the ^, we find any match to "chr1":



In [ ]:

    
grep chr1 gene_expression_sneaky.bed

However, notice the effect of searching for ^chr1 instead. Note that we put the regular expression in quotes, to avoid Unix errors. Not using quotes may or may not give an error, but it is safest to use quotes for anything but the simplest of searches.



In [ ]:

    
grep '^chr1' gene_expression_sneaky.bed

Good! We have removed the match to the badly-named gene "chr11.gene1", which is on chromosome 8. Now we want to avoid matching chromosomes 10 and 11. This can be done by also looking for a "tab" character, which is represented by writing \t. For technical reasons, which are beyond the scope of this course, we must also put a dollar sign before the quotes to make any search involving a tab character work.



In [ ]:

    
grep $'^chr1\t' gene_expression_sneaky.bed

To find the genes on the negative strand, all that remains is to match a minus sign at the end of the line (so that we do not find "sneaky-gene3"). We can do this using the dollar "$", which means "end of line".



In [ ]:

    
grep $'^chr1\t' gene_expression_sneaky.bed | grep '\-$'

Wildcards and alphabets

Another special character in regular expressions is the dot: ".". This stands for any single character. For example, this finds all matches to chromosomes 1-9, and chromosomes X and Y:



In [ ]:

    
grep $'^chr.\t' gene_expression.bed

In fact, the earlier command that found all genes on chromosome 1 that are on the negative strand, could be found with a single call to grep instead of two calls piped together. To do this, we need a regular expression that finds lines that:

start with chr1, then a tab character
end with a minus
have arbitrary characters between.

The asterisk "*" has a special meaning: it says to match any number (including zero) of whatever character is before the *. For example, the regular expression 'AC*G' will match AG, ACG, ACCG, etc. The simpler, improved command is:



In [ ]:

    
grep $'^chr1\t.*-$' gene_expression_sneaky.bed

As well as matching any character using a dot, we can define any list of characters to match, using square brackets. For example, [12X] means match a 1, 2, or an X. This can be used to find all genes from chromosomes 1, 2 and X:



In [ ]:

    
grep $'^chr[12X]\t' gene_expression.bed

Or just the autosomes may be of interest. To do this we introduce two new features:

Ranges can be given in square brackets, for example [1-5] will match 1, 2, 3, 4 or 5.
The plus sign "+" has a special meaning that is similar to "*". Instead of any number of matches (including zero), it looks for at least one match. To avoid simply matching a plus sign, it must be preceded by a backslash: "\+". For example, the regular expression 'AC\+G' will match ACG, ACCG, ACCCG etc (but will not match AG).

Warning: Adding a backslash is often called escaping (e.g. escape the plus symbol). Depending on the software you're using (and the options you give it), you may need to escape the symbol to indicate that you want its special regex meaning (e.g. multiple copies of the last character please) or its literal meaning (e.g. give me a '+' symbol please). If your command isn't working as you expect, try playing with these options and always test your regular expression before assuming it gave you the right answer.

The command to find the autosomes is:



In [ ]:

    
grep $'^chr[0-9]\+\t' gene_expression.bed

Other `grep` options

The Unix command grep and regular expressions are extremely powerful and we have only scratched the surface of what they can do. Take a look at the manual (by typing man grep) to get an idea. A few particularly useful options are discussed below.

Counting matches

A common use-case is counting matches within files. Instead of output each matching line, the option "-c" tells grep to report the number of lines that matched. For example, the number of genes in the autosomes in the above example can be found by simply adding -c to the command.



In [ ]:

    
grep -c $'^chr[0-9]\+\t' gene_expression.bed

Case sensitivity

By default, grep is case-sensitive. It can be useful to ignore the distinction between upper and lower case using the option "-i". Suppose we have a file of sequences, and want to find the sequences that contain the string ACGT. It is not unusual to come across files that have a mix of upper and lower case nucleotides. Consider this FASTA file:



In [ ]:

    
cat sequences.fasta

A simple search for ACGT will not return all the results:



In [ ]:

    
grep ACGT sequences.fasta

However, making the search case-insensitive solves the problem.



In [ ]:

    
grep -i ACGT sequences.fasta

Searching in more than one file

So far, we have restricted to searches in one file, but grep can be given a list of files in which to search. As an example, we are given three files called list_example.1, list_example.2, and list_example.3. They are simple lists of genes, for illustrative purposes. For example, the first file looks like this:



In [ ]:

    
cat list_example.1

Which files contain "gene1"?



In [ ]:

    
grep '^gene1$' list_example.1 list_example.2

gene1 only appears in the file list_example.1. The output format of grep has now changed, because it was given a list of files. The format is:

filename:line_that_matches

ie, the name of the file has been added to the start of each matching line.

For convenience, there's also a way of specifying all of the list examples:



In [ ]:

    
echo list_example.*



In [ ]:

    
grep '^gene1$' list_example.*

How about gene42?



In [ ]:

    
grep '^gene42$' list_example.*

gene42 appears once in list_example.2 and twice in list_example.3.

Inverting matches

By default, grep reports all lines that do match the regular expression. Sometimes it is useful to filter a file, by reporting lines that do not match the regular expression. Using the option "-v" makes grep "invert" the output. For example, we could exclude genes from autosomes in the BED file from earlier.



In [ ]:

    
grep -v $'^chr[0-9]\+\t' gene_expression.bed

Replacing matches to regular expressions

Finally, we show how to replace every match to a regular expression with something else, using the command "sed". The general form of this is:

sed 's/regular expression/new string/' input_file

This will output a new version of the input file, with each match to the regular expression replaced with "new string". For example:



In [ ]:

    
sed 's/^chr/chromosome/' gene_expression.bed

Exercises

The following exercises all use the FASTA file exercises.fasta. Before starting the exercises, open a new terminal and navigate to the grep/ directory, which contains exercises.fasta.

Use grep to find the answers. Hint: some questions require you to use grep twice, and possibly some other Unix commands.

Make a grep command that outputs just the lines with the sequence names.
How many sequences are in the file?
Do any sequence names have spaces in them? What are their names?
Make a grep command that outputs just the lines with the sequences, not the names.
How many sequences contain unknown bases (an "n" or "N")?
Are there any sequences that contain non-nucleotides (something other than A, C, G, T or N)?
How many sequences contain the 5' cut site GCWGC (where W can be an A or T) for the restriction enzyme AceI?
Are there any sequences that have the same name? You do not need to find the actual repeated names, just whether any names are repeated. (Hint: it may be easier to first discover how many unique names there are).



In [ ]:

Now go to the next part of the tutorial, file processing with AWK.
You can also return to the index or revisit the previous section.

Searching inside files with grep