For the N. gonorrhoeae data of the manuscript, we were interested in particular sequences and SNPs. This means using custom reference data as opposed to one of the public reference sets (the use of which is described in the ARIBA wiki page). If you just want to use a public database, then use the command "ariba getref" instead.
We are interested in several reference sequences, some of which are coding seqeunces and others are not, and particular variants in those sequences. The idea is to generate input files to ARIBA that contain all the sequences and variants of interest, using the ARIBA function aln2meta.
First, let's change to the directory with the reference data.
In [ ]:cd data/Ref/
We start by describing folP, but the method is (nearly) the same for all other sequences.
There are several alleles of folP and we want to include them all the ARIBA database. We have a SNP of interest R228S in the sequence in the allele called "folP.WHO_F_01537c", which confers resistance to sulphonamides. When we run ARIBA, a particular sample may have a different allele from folP.WHO_F_01537c, but we would still like to know whether it has the SNP R228S. However, the position may not be 228 because of insertion or deletions. So we use a multiple alignment of all the reference alleles, and just supply the SNP to ARIBA in one of the alleles, in this case folP.WHO_F_01537c.
The "aln2meta" function of ARIBA needs two input files: a multiple alignment file of the alleles, and a tab-delimited file of the SNPs of interest. In this case, the SNPs file simply contains one line:
In [ ]:cat aln2meta_input/folP_in.tsv
There are four columns:
This file is used together with the mulitple alignment file to generate input files to ARIBA when making the database. This is the command to run:
In [ ]:ariba aln2meta --variant_only aln2meta_input/folP.aln \ aln2meta_input/folP_in.tsv coding aln2meta_output/folP
A few things to note about the above command:
--variant_only was used, which affects how ARIBA reports later on when summarizing across all samples. We are only interested in this gene being present if it has a variant that causes resistance.
aln2meta_input/folP.aln is the name of the multiple alignment file.
The sequence is "coding", which makes ARIBA treat it as such, in particular it will interpret the variant R228S as an amino acid change at position 228 in the translated amino acid sequence. The input sequence is still in nucleotides, not amino acids.
The command output three files, which can be used as input to the command
ariba prepareref (see later).
In [ ]:ls aln2meta_output/folP*
Although we have many more reference sequences of interest to deal with for the complete analysis, for illustrative purposes here we can use the three files
aln2meta_output/folP* to make an ARIBA reference database:
In [ ]:ariba prepareref -f aln2meta_output/folP.fa \ -m aln2meta_output/folP.tsv --cdhit_clusters \ aln2meta_output/folP.cluster test.aribadb
This made an ARIBA database of just those sequences and the SNP R228S in a new directory called
test.aribadb. Let's check that it was made:
In [ ]:ls test.aribadb
You do not need to worry about the contents of the new directory
test.aribadb, just know that it can be used as input to run ARIBA. However, this was for just one of the many reference sequences of interest, so we will delete it.
In [ ]:rm -r test.aribadb
We need to deal with each of the reference sequences in turn by running
ariba aln2meta on each, like in the above example with folP. The only difference is that some of them are non-coding sequences, which means that the command must have
noncoding instead of
coding. For example:
In [ ]:ariba aln2meta --variant_only aln2meta_input/16S.aln \ aln2meta_input/16S_in.tsv noncoding aln2meta_output/16S
There are 10 coding sequences and two non-coding sequences. Instead of writing 12 commands, we will use two 'for loops'. First, the coding sequences:
In [ ]:for x in folP gyrA mtrR parC parE penA ponA porB1b rpoB rpsJ do ariba aln2meta --variant_only aln2meta_input/$x.aln \ aln2meta_input/$x\_in.tsv coding aln2meta_output/$x done
And now the two non-coding sequences:
In [ ]:for x in 16S 23S do ariba aln2meta --variant_only aln2meta_input/$x.aln \ aln2meta_input/$x\_in.tsv noncoding aln2meta_output/$x done
This has generated three files for each sequence. We will combine these to make input files for running
In [ ]:cat aln2meta_output/*.fa presence_absence/*.fa > Ngo_ARIBA.fa
In [ ]:cat aln2meta_output/*.tsv presence_absence/presence_absence.tsv \ > Ngo_ARIBA.tsv
In [ ]:cat aln2meta_output/*.cluster \ presence_absence/presence_absence.clusters \ > Ngo_ARIBA.clusters
Finally, we have the three input files needed to make a single ARIBA database that has information on all the sequences and SNPs of interest. In case the directory is already there, we delete it first, then generate the database:
In [ ]:rm -rf Ngo_ARIBAdb
In [ ]:ariba prepareref -f Ngo_ARIBA.fa -m Ngo_ARIBA.tsv \ --cdhit_clusters Ngo_ARIBA.clusters Ngo_ARIBAdb
We now have a directory
Ngo_ARIBAdb that can be used as the reference database when running
ariba run on each sample.
Now move on to the next part of the tutorial where we run ARIBA using the custom reference data, or return to the index.