In [1]:
# depending on how you installed mentalist, you might have to add it and julia to the PATH:
PATH=$PATH:/rhome/pfeijao/sfu/MentaLiST/src:/rhome/pfeijao/bin
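A quick way to confirm that the PATH change above took effect is to ask the shell where it finds the executable. This is only a sketch; the variable name `mentalist_status` is made up for illustration.

```shell
# Check whether the mentalist executable can be found on the current PATH.
if command -v mentalist >/dev/null 2>&1; then
    mentalist_status="found at $(command -v mentalist)"
else
    mentalist_status="not found; extend PATH as shown above"
fi
echo "mentalist: $mentalist_status"
```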
It might also be a good idea to create a new folder to store the results of the examples below:
In [2]:
mkdir -p /tmp/mentalist_quick_start
cd /tmp/mentalist_quick_start
In [3]:
# Help: shows all available commands:
mentalist -h
To see the help of a particular command, run MentaLiST with the command name and the -h flag:
In [4]:
mentalist call -h
MentaLiST can search and install MLST schemes from pubMLST.org, as shown below.
The command 'list_pubmlst' lists the schemes available on pubMLST. Since there are many, you can also give a prefix so that only schemes matching it are listed.
In [5]:
mentalist list_pubmlst -h
In [6]:
# List campylobacter schema:
mentalist list_pubmlst -p Campylobacter
In [7]:
mentalist download_pubmlst -k 31 -o campy_mlst_fasta_files -s 28 --db campy_mlst.db
In [8]:
# The output folder (-o) has all the FASTA files and profile for the scheme.
ls campy_mlst_fasta_files
In [9]:
# The --db flag indicates the database file that will be used by MentaLiST in the calling phase.
ls -lh campy_mlst.db
In [10]:
mentalist list_cgmlst
In [11]:
mentalist download_cgmlst -h
In [12]:
mentalist download_cgmlst -o mtb_cgmlst_fasta -s 741110 -k 31 --db mtb_cgmlst.db --threads 16
It is also possible to install a custom MLST scheme from FASTA files. Each file should be called LOCUS.fa (the extension is not important; it can be .fasta, .tfa, etc.), and each allele in the file should have the identifier LOCUS_N (or alternatively LOCUS.N), where N is a number unique to that allele. Alleles are usually numbered consecutively, from 1 up to the number of alleles in the locus.
For instance, let's test this functionality with the Campylobacter scheme FASTA files that were downloaded in the previous example:
In [13]:
# Each file is a different locus:
ls campy_mlst_fasta_files/*.tfa
In [14]:
# For each locus file, a different ID and sequence for each allele:
head -n 18 campy_mlst_fasta_files/glnA.tfa
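Before building a database from custom FASTA files, it can help to check that every header follows the LOCUS_N (or LOCUS.N) naming described above. The sketch below does this on a toy file written to a temporary folder; the file contents and the variable names are hypothetical, and the check is only an illustration, not part of MentaLiST.

```shell
# Toy locus file in a temporary folder (hypothetical data, not a real scheme).
workdir=$(mktemp -d)
printf '>glnA_1\nACGT\n>glnA_2\nACGA\n>glnX_3\nACGG\n' > "$workdir/glnA.tfa"

# The locus name is the file name without its extension.
locus=$(basename "$workdir/glnA.tfa")
locus=${locus%.*}

# Collect header lines that are not LOCUS_N or LOCUS.N, with N an integer.
bad_headers=$(grep '^>' "$workdir/glnA.tfa" | grep -Ev "^>${locus}[_.][0-9]+\$" || true)
echo "Headers not matching ${locus}_N: ${bad_headers:-none}"
rm -r "$workdir"
```

Here the third header uses a different locus prefix, so it is the only one reported. Locus names containing regular-expression metacharacters would need escaping before being used in the pattern.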
In [15]:
# Install the Campylobacter jejuni scheme directly from the FASTA files; let's use a different k-mer length:
mentalist build_db -k 25 --db campy_mlst_25.db -p campy_mlst_fasta_files/campylobacter.txt -f campy_mlst_fasta_files/*.tfa
In [16]:
# Help:
mentalist call -h
For this example we are using a Campylobacter jejuni sample from EMBL ENA. You can download the FASTQ file with the following command:
In [17]:
# the --no-clobber option checks if the file already exists, so it does not download it again.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR582/007/SRR5824107/SRR5824107_1.fastq.gz --no-clobber
Now, run MentaLiST caller on this sample, passing the MentaLiST database that we created previously, using the --db flag.
In [18]:
mentalist call -o campy_call.txt --db campy_mlst.db -1 SRR5824107_1.fastq.gz
The output consists of two files: one has the calls, and the other some details about the coverage of calls and special cases.
In [19]:
# results:
ls campy_call.*
In [20]:
# Allele calls and ST are in the campy_call.txt file:
column -ts $'\t' campy_call.txt
In [21]:
# Detailed vote count for each allele:
cat campy_call.txt.coverage.txt
In [22]:
# the --no-clobber option checks if the file already exists, so it does not download it again.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_{1,2}.fastq.gz --no-clobber
For this example, we will use the flags --output_votes and --output_special, which tell MentaLiST to create additional output files. To use paired-end samples, include both files at the end of the command, with the -1 and -2 flags:
In [23]:
## Call alleles for the sample:
mentalist call -o SRR6152708.txt --output_votes --output_special --db mtb_cgmlst.db -1 SRR6152708_1.fastq.gz -2 SRR6152708_2.fastq.gz
In addition to the regular output files seen in the previous example (here SRR6152708.txt and SRR6152708.txt.coverage.txt), there are some new files, due to the use of the flags --output_votes and --output_special.
In [24]:
ls -l SRR6152708.txt*
Let's do a quick check of the first 12 calls:
In [25]:
cut -f1-12 SRR6152708.txt | column -ts $'\t'
The coverage file has more details of each call. Looking at the first lines, we can see that MentaLiST found some novel alleles:
In [26]:
head -n 15 SRR6152708.txt.coverage.txt
In [40]:
head SRR6152708.txt.novel.txt | column -ts $'\t'
The novel allele DNA sequences are in the .novel.fa FASTA file.
In [28]:
head -n20 SRR6152708.txt.novel.fa
The --output_votes flag makes MentaLiST output three additional files: SRR6152708.txt.byvote, SRR6152708.txt.votes.txt and SRR6152708.txt.ties.txt. These are the results of the old calling algorithm from MentaLiST 0.1, where only the top-voted allele is called. The current version, MentaLiST 1.0, checks the allele sequences to ensure that the called allele has full coverage, and also tries to find novel alleles.
In [29]:
# Calls by the old voting algorithm:
cut -f1-12 SRR6152708.txt.byvote | column -ts $'\t'
The SRR6152708.txt.votes.txt file has the top-voted alleles for each locus:
In [30]:
head -n12 SRR6152708.txt.votes.txt
As we can see, there is a tie on locus Rv0024. This can happen, especially on loci with novel alleles. The SRR6152708.txt.ties.txt file lists the loci where there is a tie for the most voted allele, together with all tied alleles:
In [31]:
cat SRR6152708.txt.ties.txt
If we check those alleles in the MentaLiST coverage report, we can see that we had some different cases:
In [32]:
for p in $(cut -f2 SRR6152708.txt.ties.txt); do grep $p SRR6152708.txt.coverage.txt; done | column -ts $'\t'
MentaLiST could find that only one of the top-voted alleles had full coverage, and made the call.
Missing allele: For locus Rv1417, MentaLiST called it as not present, since it only has 188/435 < 50% coverage; this might be due to a poorly covered region in the sample, or because the gene is really absent from the sample while other regions of the genome are similar enough to it to cause the partial $k$-mer match.
Novel allele: For all the other loci where there was a tie, MentaLiST found a putative novel allele.
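The "188/435 < 50%" figure for Rv1417 can be checked with a line of arithmetic. The sketch below assumes, based only on the text above, that the cutoff is 50% of the allele; the variable names are illustrative.

```shell
covered=188      # first number reported for Rv1417 in the text above
allele_len=435   # second number reported for Rv1417 in the text above

# Coverage as a percentage, rounded to one decimal place.
pct=$(awk -v c="$covered" -v l="$allele_len" 'BEGIN { printf "%.1f", 100 * c / l }')

# awk exits with status 0 when the coverage fraction is below 0.5.
if awk -v c="$covered" -v l="$allele_len" 'BEGIN { exit !(c / l < 0.5) }'; then
    verdict="below the 50% threshold"
else
    verdict="at or above the 50% threshold"
fi
echo "coverage: ${pct}% (${verdict})"
```

This prints a coverage of 43.2%, which is why the locus is reported as not present.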
In [33]:
grep Multiple SRR6152708.txt.coverage.txt
In these cases, MentaLiST chooses the most voted allele, but adds a "+" flag to the call in the output:
In [41]:
# A nice trick to find the column number, given a column name:
awk -v RS='\t' '/Rv1319c/{print NR; exit}' SRR6152708.txt
In [42]:
cut -f 1023 SRR6152708.txt
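The awk one-liner above matches the locus name as a pattern, so it would also stop at an earlier column whose name merely contains that string. A possible variant that matches the whole field is sketched below on a toy call file; the locus names and values in it are made up for illustration.

```shell
# Toy call file: a header row and one sample row, tab-separated (hypothetical values).
workdir=$(mktemp -d)
printf 'Sample\tRv0581\tRv0581c\tRv1319c\nS1\t2\t7\t4+\n' > "$workdir/calls.txt"

# One header field per line; grep -x matches the whole line, -n prints its number.
col=$(head -n 1 "$workdir/calls.txt" | tr '\t' '\n' | grep -nx 'Rv0581c' | cut -d: -f1)

# Extract that column and keep the value from the last (sample) row.
value=$(cut -f "$col" "$workdir/calls.txt" | tail -n 1)
echo "Rv0581c is column $col with call $value"
rm -r "$workdir"
```

A plain pattern match for Rv0581 would hit column 2 and column 3 in this toy header; the exact match returns only the intended one.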
In [35]:
grep "Not present" SRR6152708.txt.coverage.txt
In this case, MentaLiST outputs a zero (0) in the call file if it did not find any $k$-mers, and (0?) if it did find some $k$-mers from this locus but below the threshold.
In [49]:
awk -v RS='\t' '/Rv1417/{print NR}' SRR6152708.txt
In [36]:
cut -f 1094 SRR6152708.txt
In [37]:
grep Partial SRR6152708.txt.coverage.txt
In this case, the output is in the format x-, where x is the most covered allele found, since MentaLiST is not sure whether this is the correct call of a partially covered allele or a novel allele that was not detected. Here we show three of the above loci as an example:
In [54]:
awk -v RS='\t' '/Rv0275c/||/Rv0581/||/Rv0860/ {print NR}' SRR6152708.txt
In [55]:
cut -f 217,467,674 SRR6152708.txt
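The call notations described so far (0, 0?, a trailing "+" and a trailing "-") can be summarised in a small helper. This is a sketch based only on the descriptions in this section: the function name `classify_call` is hypothetical, and any notation not mentioned above falls through to the last branch.

```shell
# Map a call string to the meaning given in the text above.
classify_call() {
    case "$1" in
        0)    echo "not present (no k-mers found)" ;;
        "0?") echo "not present (k-mers found, below threshold)" ;;
        *+)   echo "most voted allele, flagged with +" ;;
        *-)   echo "most covered allele, partial coverage" ;;
        *)    echo "no flag described in this section" ;;
    esac
}

# Example call strings (made up for illustration).
for call in 0 "0?" 4+ 12- 7; do
    printf '%s\t%s\n' "$call" "$(classify_call "$call")"
done
```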
MentaLiST on multiple samples
You can also run MentaLiST for all samples of a dataset. There are two ways of doing this: either specifying all samples on the command line (the suggested way), or creating a file describing your data (which gives you more control).
Let's download some more samples to try it:
In [57]:
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/002/SRR6397472/SRR6397472_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/006/SRR6398036/SRR6398036_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_{1,2}.fastq.gz --no-clobber
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR639/003/SRR6398023/SRR6398023_{1,2}.fastq.gz --no-clobber
In [59]:
mentalist call -o my_dataset_calls1.txt --db mtb_cgmlst.db -1 SRR6*_1.fastq.gz -2 SRR6*_2.fastq.gz
The result files will be the same as with a single sample, but with all samples combined in each file. All output files have a 'Sample' column to identify the sample.
In [62]:
cut -f1-12 my_dataset_calls1.txt | column -ts $'\t'
For the second method, the input file (passed with the -i flag) should be a tabular file with two columns: the first has the sample name, and the second has a FASTQ file for the sample. In the case of multiple files per sample (paired-end reads or other cases), simply add one row per file, with the same sample identifier.
For instance, for this four-sample example dataset, the input file is:
In [64]:
cat my_dataset_samples.txt
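A file like this can be generated from the FASTQ file names instead of being written by hand. The sketch below assumes files named SAMPLE_1.fastq.gz and SAMPLE_2.fastq.gz and a tab as the column separator; both are assumptions, and the empty placeholder files and sample names are made up so that the example runs on its own.

```shell
# Empty placeholder files standing in for real paired-end FASTQ files.
workdir=$(mktemp -d)
for s in SAMPLE_A SAMPLE_B; do
    : > "$workdir/${s}_1.fastq.gz"
    : > "$workdir/${s}_2.fastq.gz"
done

# One row per FASTQ file: sample name, a tab, then the path to the file.
samples_file="$workdir/samples.txt"
for fq in "$workdir"/*_[12].fastq.gz; do
    name=$(basename "$fq")
    sample=${name%_[12].fastq.gz}   # strip the read suffix to get the sample name
    printf '%s\t%s\n' "$sample" "$fq"
done > "$samples_file"

cat "$samples_file"
n_rows=$(wc -l < "$samples_file" | tr -d ' ')
n_samples=$(cut -f1 "$samples_file" | sort -u | wc -l | tr -d ' ')
echo "$n_rows rows for $n_samples samples"
rm -r "$workdir"
```

Each sample ends up with two rows sharing the same identifier, one per read file.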
In [63]:
mentalist call -o my_dataset_calls2.txt --db mtb_cgmlst.db -i my_dataset_samples.txt
The results should be exactly the same with both input methods.
In [65]:
cut -f1-12 my_dataset_calls2.txt | column -ts $'\t'
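Rather than comparing the two tables by eye, the files can be compared directly. The sketch below uses two toy call files with made-up content and sorts the rows first, on the assumption that the two input methods might list samples in a different order.

```shell
# Two toy call files with the same rows in a different order (hypothetical values).
workdir=$(mktemp -d)
printf 'Sample\tlocusA\nS1\t3\nS2\t5\n' > "$workdir/calls1.txt"
printf 'Sample\tlocusA\nS2\t5\nS1\t3\n' > "$workdir/calls2.txt"

# Sort both files the same way, then compare them byte by byte.
LC_ALL=C sort "$workdir/calls1.txt" > "$workdir/calls1.sorted"
LC_ALL=C sort "$workdir/calls2.txt" > "$workdir/calls2.sorted"
if cmp -s "$workdir/calls1.sorted" "$workdir/calls2.sorted"; then
    comparison="identical"
else
    comparison="different"
fi
echo "The two call files are $comparison after sorting"
rm -r "$workdir"
```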