Make custom ARIBA database

For the N. gonorrhoeae data of the manuscript, we were interested in particular sequences and SNPs. This means using custom reference data as opposed to one of the public reference sets (the use of which is described in the ARIBA wiki page). If you just want to use a public database, then use the command "ariba getref" instead.

We are interested in several reference sequences, some of which are coding seqeunces and others are not, and particular variants in those sequences. The idea is to generate input files to ARIBA that contain all the sequences and variants of interest, using the ARIBA function aln2meta.

First, let's change to the directory with the reference data.

In [ ]:
cd data/Ref/

Using one reference sequence

We start by describing folP, but the method is (nearly) the same for all other sequences.

There are several alleles of folP and we want to include them all the ARIBA database. We have a SNP of interest R228S in the sequence in the allele called "folP.WHO_F_01537c", which confers resistance to sulphonamides. When we run ARIBA, a particular sample may have a different allele from folP.WHO_F_01537c, but we would still like to know whether it has the SNP R228S. However, the position may not be 228 because of insertion or deletions. So we use a multiple alignment of all the reference alleles, and just supply the SNP to ARIBA in one of the alleles, in this case folP.WHO_F_01537c.

The "aln2meta" function of ARIBA needs two input files: a multiple alignment file of the alleles, and a tab-delimited file of the SNPs of interest. In this case, the SNPs file simply contains one line:

In [ ]:
cat aln2meta_input/folP_in.tsv

There are four columns:

  1. Sequence name
  2. SNP (an amino acid change at position 228, where R is the wild type and S is the variant)
  3. A "group name" for this SNP. This is optional and a dot "." means no group name. Putting SNPs into the same group allows ARIBA to report them together later on.
  4. A description of the SNP. This will appear in ARIBA's output files to save looking up the reason the SNP is of interest.

This file is used together with the mulitple alignment file to generate input files to ARIBA when making the database. This is the command to run:

In [ ]:
ariba aln2meta --variant_only aln2meta_input/folP.aln \
    aln2meta_input/folP_in.tsv coding aln2meta_output/folP

A few things to note about the above command:

  1. The option --variant_only was used, which affects how ARIBA reports later on when summarizing across all samples. We are only interested in this gene being present if it has a variant that causes resistance.

  2. aln2meta_input/folP.aln is the name of the multiple alignment file.

  3. The sequence is "coding", which makes ARIBA treat it as such, in particular it will interpret the variant R228S as an amino acid change at position 228 in the translated amino acid sequence. The input sequence is still in nucleotides, not amino acids.

  4. The command output three files, which can be used as input to the command ariba prepareref (see later).

In [ ]:
ls aln2meta_output/folP*

Although we have many more reference sequences of interest to deal with for the complete analysis, for illustrative purposes here we can use the three files aln2meta_output/folP* to make an ARIBA reference database:

In [ ]:
ariba prepareref -f aln2meta_output/folP.fa \
    -m aln2meta_output/folP.tsv --cdhit_clusters \
    aln2meta_output/folP.cluster test.aribadb

This made an ARIBA database of just those sequences and the SNP R228S in a new directory called test.aribadb. Let's check that it was made:

In [ ]:
ls test.aribadb

You do not need to worry about the contents of the new directory test.aribadb, just know that it can be used as input to run ARIBA. However, this was for just one of the many reference sequences of interest, so we will delete it.

In [ ]:
rm -r test.aribadb

Using all reference sequences

We need to deal with each of the reference sequences in turn by running ariba aln2meta on each, like in the above example with folP. The only difference is that some of them are non-coding sequences, which means that the command must have noncoding instead of coding. For example:

In [ ]:
ariba aln2meta --variant_only aln2meta_input/16S.aln \
    aln2meta_input/16S_in.tsv noncoding aln2meta_output/16S

There are 10 coding sequences and two non-coding sequences. Instead of writing 12 commands, we will use two 'for loops'. First, the coding sequences:

In [ ]:
for x in folP gyrA mtrR parC parE penA ponA porB1b rpoB rpsJ
    ariba aln2meta --variant_only aln2meta_input/$x.aln \
    aln2meta_input/$x\_in.tsv coding aln2meta_output/$x

And now the two non-coding sequences:

In [ ]:
for x in 16S 23S
    ariba aln2meta --variant_only aln2meta_input/$x.aln \
    aln2meta_input/$x\_in.tsv noncoding aln2meta_output/$x

This has generated three files for each sequence. We will combine these to make input files for running ariba prepareref.

In [ ]:
cat aln2meta_output/*.fa presence_absence/*.fa > Ngo_ARIBA.fa

In [ ]:
cat aln2meta_output/*.tsv presence_absence/presence_absence.tsv \
    > Ngo_ARIBA.tsv

In [ ]:
cat aln2meta_output/*.cluster \
    presence_absence/presence_absence.clusters \
    > Ngo_ARIBA.clusters

Finally, we have the three input files needed to make a single ARIBA database that has information on all the sequences and SNPs of interest. In case the directory is already there, we delete it first, then generate the database:

In [ ]:
rm -rf Ngo_ARIBAdb

In [ ]:
ariba prepareref -f Ngo_ARIBA.fa -m Ngo_ARIBA.tsv \
    --cdhit_clusters Ngo_ARIBA.clusters Ngo_ARIBAdb

We now have a directory Ngo_ARIBAdb that can be used as the reference database when running ariba run on each sample.

Now move on to the next part of the tutorial where we run ARIBA using the custom reference data, or return to the index.