In this tutorial we have included three assemblies of Streptococcus pneumoniae. The assemblies are available for download from the ENA, and their accession numbers are listed below. If you have access to the cluster at the Wellcome Sanger Institute, the lane IDs are also listed below.
Name | Accession | Sanger Lane ID |
---|---|---|
sample1 | GCA_900194945.1 | 13681_1#18 |
sample2 | GCA_900194155.1 | 13682_2#34 |
sample3 | GCA_900194195.1 | 13682_2#39 |
If you are using the cluster at the Wellcome Sanger Institute and want to create a symlink to one of the samples in your working directory, you can use the command below. Note, however, that this is not necessary for this tutorial.
pf assembly -t lane -i 13681_1#18 -l .
Roary takes annotated assemblies in GFF3 format as input. The files must include the nucleotide sequence at the end of the file, and to make it easier for you to identify where genes came from, each input file should have a unique locus tag for the gene IDs.
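As a rough illustration of the sequence requirement, the sketch below checks whether a GFF3 file contains a `##FASTA` line, which is the GFF3 directive that introduces embedded sequence data. The helper name `has_fasta_section` and the tiny demo file are made up for this example and are not part of Roary or Prokka.

```shell
# Hypothetical helper: succeeds only if the GFF3 file contains a ##FASTA directive
has_fasta_section() {
    grep -q '^##FASTA' "$1"
}

# Build a tiny illustrative GFF3 file (not real annotation data) in a temporary directory
DEMO_DIR=$(mktemp -d)
printf '##gff-version 3\ncontig1\tdemo\tgene\t1\t9\t.\t+\t.\tID=demo_00001\n##FASTA\n>contig1\nATGAAATAA\n' > "$DEMO_DIR/demo.gff"

# Report whether the embedded nucleotide sequence is present
if has_fasta_section "$DEMO_DIR/demo.gff"; then
    RESULT="sequence included"
else
    RESULT="sequence missing"
fi
echo "demo.gff: $RESULT"

# Clean up the temporary directory
rm -r "$DEMO_DIR"
```

The same function could be pointed at any GFF3 file you intend to give to Roary. It only tests that the directive is present, not that the sequences after it are complete.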
All GFF3 files created by Prokka are valid with Roary, and this is therefore the recommended way of generating the input files. We will now look more closely at how you can use Prokka to annotate your genomes.
Prokka is a tool that performs whole genome annotation. It is easy to install and use and, as mentioned, the GFF files that it outputs are compatible with Roary.
Our three assembled S. pneumoniae genomes are located in a directory called "assemblies".
In [ ]:
ls assemblies
To run Prokka on a single file using the default settings, you can use the following command:
prokka sample1.fasta
If you have a lot of assemblies that you want to analyse, running this for each sample will soon become tedious. Instead, we will use a for-loop to run Prokka on all the fasta files in the assemblies directory. We will also use the following options for Prokka:
Option | Description |
---|---|
--locustag | Specifying a locus tag prefix |
--outdir | Specifying a directory to put the output in |
--prefix | Specifying a prefix for the output files |
By specifying a unique locus tag we make it easier to identify which sample different genes came from when we look at the results from Roary. The outdir and prefix options will make it easier for us to keep track of our files.
In [ ]:
# Annotate every FASTA file, using the file name (minus .fasta) as the sample prefix
for F in assemblies/*.fasta; do FILE=${F##*/}; PREFIX=${FILE/.fasta/}; \
prokka --locustag "$PREFIX" --outdir "annotated_$PREFIX" \
--prefix "$PREFIX" "$F"; done
This is going to take around 5 minutes to run, so be patient.
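While you wait, it may help to see how the loop derives the prefix from each file path. The sketch below walks through the two parameter expansions on a single example path without calling Prokka. It uses `${FILE%.fasta}`, a portable suffix-removal form, in place of the `${FILE/.fasta/}` substitution used in the loop. For file names like these, the two forms give the same result.

```shell
# Example path, as it would be produced by the glob assemblies/*.fasta
F="assemblies/sample1.fasta"

# Remove the longest leading match of "*/", i.e. the directory part
FILE=${F##*/}

# Remove the trailing ".fasta" (portable alternative to ${FILE/.fasta/})
PREFIX=${FILE%.fasta}

# Show the values that the loop would pass to Prokka
echo "file:     $FILE"
echo "locustag: $PREFIX"
echo "outdir:   annotated_$PREFIX"
echo "prefix:   $PREFIX"
```

For `assemblies/sample1.fasta` this gives the prefix `sample1` and the output directory `annotated_sample1`, which matches the directory names described below.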
Once this has finished, you should have three new directories called annotated_sample1, annotated_sample2 and annotated_sample3. Have a look to see that it worked:
In [ ]:
ls -l
In [ ]:
ls -l annotated_sample1
As you can see, for sample1 we now have a number of annotation files. There is more information about the different output files, along with information about other usage options, on the Prokka GitHub page. For now, we are only interested in the GFF files that were generated, as these are what we will use as input for Roary.
Note: If you are working on the Sanger Institute cluster, Prokka is automatically run as part of the annotation pipeline. To create a symlink to the GFF file, you can use the command below (though this is not necessary for this tutorial):
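If you annotate many samples, checking each directory by eye becomes tedious. The sketch below is one possible way to confirm that every `annotated_<sample>` directory contains a non-empty `<sample>.gff` file. It assumes the directory and prefix naming used in the loop above. The function name `check_gffs` is made up for this example.

```shell
# Hypothetical helper: for each annotated_<sample> directory under the given
# working directory, report whether <sample>.gff exists, then print a count.
check_gffs() {
    found=0
    for d in "$1"/annotated_*; do
        [ -d "$d" ] || continue          # skip if the pattern matched nothing
        name=${d##*/annotated_}          # annotated_sample1 -> sample1
        if [ -s "$d/$name.gff" ]; then   # file exists and is not empty
            echo "OK: $d/$name.gff"
            found=$((found + 1))
        else
            echo "Missing: $d/$name.gff"
        fi
    done
    echo "GFF files found: $found"
}

# Check the current working directory
check_gffs .
```

A "Missing" line for a sample suggests that Prokka did not finish for that assembly, in which case its output directory is worth inspecting before moving on.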
pf annotation -t lane -i 13681_1#18 -l .
Sanger users can also run Prokka independently of the automated pipeline with the script annotate_bacteria. Run the command below for more information:
annotate_bacteria -h
Q3: Why do we need to run Prokka?
a) It will perform QC on our data
b) It will annotate our data
c) We don't, Roary can handle fasta files as input
Q4: Why do we use the --locustag option when we run Prokka?
a) To make it easier to keep track of the output files
b) Because Roary won't work without it
c) To make the Roary results easier to interpret
The answers to these questions can be found here.
Now continue to the next section of the tutorial: Performing QC on your data.
You can also revisit the previous section or return to the index page.