Preparing the data

In this tutorial we have included three assemblies of Streptococcus pneumoniae. The assemblies are available for download from the ENA, and their accession numbers are listed below. If you have access to the cluster at the Wellcome Sanger Institute, the lane IDs are listed as well.

Name     Accession        Sanger Lane ID
sample1  GCA_900194945.1  13681_1#18
sample2  GCA_900194155.1  13682_2#34
sample3  GCA_900194195.1  13682_2#39

If you are using the cluster at the Wellcome Sanger Institute and want to create a symlink to one of the samples in your working directory, you can use the command below. Note, however, that this is not necessary for this tutorial.

pf assembly -t lane -i 13681_1#18 -l .

Roary input files

Roary takes annotated assemblies in GFF3 format as input. Each file must include the nucleotide sequence at the end of the file. To make it easier to identify which sample a gene came from, each input file should also use a unique locus tag for its gene IDs.
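
In a GFF3 file, an embedded nucleotide sequence is introduced by a ##FASTA directive line, so one quick sanity check is to look for that line. The sketch below is illustrative only: has_fasta_section is a hypothetical helper name, and the path in the usage comment is an assumption about where your GFF file lives.

```shell
# Hypothetical helper: succeeds (exit status 0) if the GFF3 file given as
# the first argument contains a ##FASTA directive, which marks the start
# of the embedded nucleotide sequence.
has_fasta_section() {
    grep -q '^##FASTA' "$1"
}

# Example usage (the path is an assumption; adjust it to your own files):
# has_fasta_section annotated_sample1/sample1.gff && echo "sequence embedded"
```

This only confirms that the directive is present. It does not validate the sequence itself or the annotation that precedes it.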

All GFF3 files created by Prokka are valid input for Roary, so Prokka is the recommended way of generating the input files. We will now look more closely at how you can use Prokka to annotate your genomes.

Annotation

Prokka is a tool that performs whole genome annotation. It is easy to install and use and, as mentioned, the GFF files it outputs are compatible with Roary.

Our three assembled S. pneumoniae genomes are located in a directory called "assemblies".


In [ ]:
ls assemblies

To run Prokka on a single file using the default settings, you can use the following command:

prokka sample1.fasta

If you have a lot of assemblies that you want to analyse, running this for each sample will soon become tedious. Instead, we will use a for-loop to run Prokka on all the fasta files in the assemblies directory. We will also use the following options for Prokka:

Option      Description
--locustag  Locus tag prefix
--outdir    Directory to put the output in
--prefix    Prefix for the output file names

By specifying a unique locus tag for each sample, we make it easier to identify which sample different genes came from when we look at the results from Roary. The --outdir and --prefix options make it easier to keep track of our files.
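
To see which names the for-loop will generate before committing to a run of several minutes, the path handling can be previewed on its own. The sketch below is illustrative rather than part of the original tutorial: preview_prokka_names is a hypothetical helper, and it uses POSIX suffix removal (${file%.fasta}) to drop the extension. It only prints names and never calls Prokka.

```shell
# Hypothetical helper: for each assembly path given as an argument, print
# the path, the derived sample prefix and the output directory name,
# separated by tabs. Prokka itself is not run.
preview_prokka_names() {
    for path in "$@"; do
        file=${path##*/}       # drop the directory: assemblies/sample1.fasta -> sample1.fasta
        prefix=${file%.fasta}  # drop the extension: sample1.fasta -> sample1
        printf '%s\t%s\t%s\n' "$path" "$prefix" "annotated_$prefix"
    done
}

# Example call with two of the tutorial's file names
preview_prokka_names assemblies/sample1.fasta assemblies/sample2.fasta
```

The second column is the value that would be passed to --locustag and --prefix, and the third is the value that would be passed to --outdir.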


In [ ]:
for F in assemblies/*.fasta; do
    FILE=${F##*/}          # drop the directory, e.g. sample1.fasta
    PREFIX=${FILE%.fasta}  # drop the extension, e.g. sample1
    prokka --locustag "$PREFIX" --outdir "annotated_$PREFIX" \
        --prefix "$PREFIX" "$F"
done

This is going to take around 5 minutes to run, so be patient.

Once this has finished, you should have three new directories called annotated_sample1, annotated_sample2 and annotated_sample3. Have a look to see that it worked:


In [ ]:
ls -l

In [ ]:
ls -l annotated_sample1

As you can see, for sample1 we now have a number of annotation files. More information about the different output files, and about other usage options, is available on the Prokka GitHub page. For now, we are only interested in the GFF files that were generated, as these are what we are going to use as input for Roary.
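
Since only the GFF files matter from here on, it can help to list them in one go and confirm there is one per sample. The sketch below is a minimal example, assuming the annotated_<sample>/<sample>.gff layout that the loop above produces. list_prokka_gffs is a hypothetical helper name, not a Prokka or Roary command.

```shell
# Hypothetical helper: print every GFF file found in annotated_* directories
# under the given base directory (default: the current directory).
list_prokka_gffs() {
    base=${1:-.}
    for gff in "$base"/annotated_*/*.gff; do
        # An unmatched glob is left as literal text, so skip paths that do not exist
        if [ -e "$gff" ]; then
            printf '%s\n' "$gff"
        fi
    done
}

# List the GFF files under the current working directory
list_prokka_gffs .
```

If Prokka completed for all three samples, this should print three paths, one for each annotated directory.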

Note: If you are working on the Sanger Institute cluster, Prokka is run automatically as part of the annotation pipeline. To create a symlink to the GFF file, you can use the command below (though this is not necessary for this tutorial):

pf annotation -t lane -i 13681_1#18 -l .

Sanger users can also run Prokka independently of the automated pipeline with the script called annotate_bacteria. Run the command below for more information:

annotate_bacteria -h

Check your understanding

Q3: Why do we need to run Prokka?
a) It will perform QC on our data
b) It will annotate our data
c) We don't, Roary can handle fasta files as input

Q4: Why do we use the --locustag option when we run Prokka?
a) To make it easier to keep track of the output files
b) Because Roary won't work without it
c) To make the Roary results easier to interpret

The answers to these questions can be found here.

Now continue to the next section of the tutorial: Performing QC on your data.
You can also revisit the previous section or return to the index page.