In this section we are going to look at how to convert from one file format to another. There are many tools available for converting between file formats, and we will use some of the most common ones: samtools, bcftools and Picard.
To convert from BAM to SAM format we are going to use the samtools view command. In this instance, we would like to include the SAM header, so we use the -h option:
In [ ]:
samtools view -h data/NA20538.bam > data/NA20538.sam
Now, have a look at the first ten lines of the SAM file. They should look like they did in the previous section when you viewed the BAM file header.
In [ ]:
head data/NA20538.sam
Well, that was easy! Converting SAM to BAM is just as straightforward. This time there is no need for the -h option; however, we have to tell samtools that we want the output in BAM format. We do so by adding the -b option:
In [ ]:
samtools view -b data/NA20538.sam > data/NA20538_2.bam
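As a quick sanity check, we could confirm that the round trip from BAM to SAM and back preserved the alignment records. The following is only a sketch and not part of the original exercise: the function name same_records is hypothetical, and it assumes that comparing the records without their headers is sufficient, since newer samtools versions may add a @PG line to the header on each run.

```shell
# Hypothetical helper: succeed if two alignment files hold identical records.
# Headers are skipped (no -h) because they may differ between runs.
same_records() {
    a=$(mktemp) && b=$(mktemp) || return 2
    samtools view "$1" > "$a" && samtools view "$2" > "$b" && cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Only run the comparison if samtools and both files are available.
if command -v samtools >/dev/null 2>&1 &&
   [ -f data/NA20538.bam ] && [ -f data/NA20538_2.bam ]; then
    if same_records data/NA20538.bam data/NA20538_2.bam; then
        echo "records are identical"
    else
        echo "records differ"
    fi
fi
```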
Samtools is very well documented, so for more usage options and functions, have a look at the samtools manual.
From samtools version 1.3, support for CRAM format was introduced. This means that the samtools view command can also be used to convert a BAM file to CRAM format. In the data directory there is a BAM file called yeast.bam that was created from S. cerevisiae Illumina sequencing data. There is also a reference genome in the directory, called Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa. The conversion requires an index file (.fai) for the reference, which can be created using samtools faidx. However, as we will see, samtools will generate this file on the fly when we specify a reference file using the -T option.
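If you prefer to build the reference index explicitly rather than rely on samtools creating it on the fly, a minimal sketch could look like the one below. The variable name ref is only a convenience introduced here, and the guard simply skips the step when samtools or the reference file is not available.

```shell
# Build the FASTA index explicitly; this writes <reference>.fai next to the FASTA.
ref=data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa
if command -v samtools >/dev/null 2>&1 && [ -f "$ref" ]; then
    samtools faidx "$ref"
    # Each .fai line describes one sequence; the first two columns
    # are the sequence name and its length in bases.
    cut -f 1,2 "$ref.fai"
fi
```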
To convert to CRAM, we use the -C option to tell samtools we want the output as CRAM, and the -T option to specify what reference file to use for the conversion. We also use the -o option to specify the name of the output file. Give this a go:
In [ ]:
samtools view -C \
-T data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
-o data/yeast.cram data/yeast.bam
Have a look at what files were created:
In [ ]:
ls -l data
As you can see, this has created an index file for the reference genome called Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa.fai and the CRAM file yeast.cram.
Q1: Since CRAM files use reference-based compression, we expect the CRAM file to be smaller than the BAM file. What is the size of the CRAM file?
In [ ]:
Q2: Is your CRAM file smaller than the original BAM file?
In [ ]:
ls -l data
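Rather than reading the sizes off the ls listing by eye, we could let the shell work out the ratio. This is a small sketch only: size_percent is a hypothetical helper name, and it uses integer arithmetic, so the result is rounded down to a whole percentage.

```shell
# Hypothetical helper: print the size of file $1 as a percentage of file $2.
size_percent() {
    # The arithmetic expansion strips any padding that wc may add.
    small=$(( $(wc -c < "$1") ))
    large=$(( $(wc -c < "$2") ))
    # Refuse to divide by zero for an empty file.
    [ "$large" -gt 0 ] || return 1
    echo $(( small * 100 / large ))
}

if [ -f data/yeast.cram ] && [ -f data/yeast.bam ]; then
    echo "CRAM is $(size_percent data/yeast.cram data/yeast.bam)% of the BAM size"
fi
```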
To convert CRAM back to BAM, simply change -C to -b and swap the input and output files. Note that this command writes to data/yeast.bam, so it overwrites the original BAM file:
samtools view -b -T data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
-o data/yeast.bam data/yeast.cram
As mentioned in the previous section of this tutorial, SAM format is mainly used to store alignment data. However, in some cases we may want to store unaligned data in SAM format, and for this we can use the Picard tools FastqToSam application. Picard tools is a Java application that comes with a number of useful options for manipulating high-throughput sequencing data. Apart from FASTQ to SAM, we won't go into any detail about Picard tools in this tutorial, but feel free to explore it on the Picard tools website. To convert the FASTQ files of lane 13681_1#18 to unaligned SAM format, run:
In [ ]:
java -jar $PICARD FastqToSam F1=data/13681_1#18_1.fastq.gz \
F2=data/13681_1#18_2.fastq.gz \
O=data/13681_1#18.sam SM=13681_1#18
Here, $PICARD should contain the path to the picard.jar file, as described on the index page.
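You may have noticed that the lane name contains a # character, which normally starts a comment in the shell. A # is only treated as a comment marker at the beginning of a word, so in the command above it is passed through literally. The short sketch below illustrates this; the variable name lane is introduced here purely for the example.

```shell
# A '#' only starts a comment at the beginning of a word, so it is literal here.
lane=13681_1#18
echo "F1=data/${lane}_1.fastq.gz"

# Quoting makes the intent explicit and is safe in every position.
lane='13681_1#18'
echo "F2=data/${lane}_2.fastq.gz"
```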
From here you can go on and convert the SAM file to BAM and CRAM, as described previously. There are also multiple options for specifying what metadata to include in the SAM header. To see all available options, run:
In [ ]:
java -jar $PICARD FastqToSam -h
Although it is possible to convert CRAM to FASTQ directly using the samtools fastq command, for many applications the two reads of each pair must appear in the same order in both FASTQ files. For this reason, we will first use samtools collate, which will produce a collated BAM file. This step also requires the reference file and the index file that was created when we converted BAM to CRAM.
In [ ]:
samtools collate data/yeast.cram data/yeast.collated
The newly produced BAM file will be called yeast.collated.bam. Let's use this to create two FASTQ files, one for the forward reads and one for the reverse reads:
In [ ]:
samtools fastq -1 data/yeast.collated_1.fastq \
-2 data/yeast.collated_2.fastq data/yeast.collated.bam
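One simple way to check the result is to confirm that both FASTQ files contain the same number of reads. The sketch below assumes uncompressed FASTQ files with the usual four lines per record; count_reads and the variables fq1 and fq2 are hypothetical names introduced for this example.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file,
# assuming four lines per record (no wrapped sequence lines).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

fq1=data/yeast.collated_1.fastq
fq2=data/yeast.collated_2.fastq
if [ -f "$fq1" ] && [ -f "$fq2" ]; then
    n1=$(count_reads "$fq1")
    n2=$(count_reads "$fq2")
    echo "forward: $n1 reads, reverse: $n2 reads"
    [ "$n1" = "$n2" ] || echo "warning: read counts differ"
fi
```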
For further information and usage options, have a look at the samtools manual page.
As we saw in the previous section, bcftools comprises a set of programs for interacting with VCF/BCF files. In the same way that samtools view can be used to convert between SAM, BAM and CRAM, bcftools view can be used to convert between VCF and BCF. To convert the file called 1kg.bcf to a compressed VCF file called 1kg.vcf.gz, run:
In [ ]:
bcftools view -O z -o data/1kg.vcf.gz data/1kg.bcf
The -O option allows us to specify in what format we want the output: compressed BCF (b), uncompressed BCF (u), compressed VCF (z) or uncompressed VCF (v). With the -o option we can select the name of the output file.
Have a look at what files were generated (the options -lrt will list the files sorted by modification time, with the most recently modified files last):
In [ ]:
ls -lrt data
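As you can see, the new file 1kg.vcf.gz is listed last. The directory also contains an index file, 1kg.bcf.csi; this belongs to the original BCF file, as bcftools view does not create an index for its output by default.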
To convert a VCF file to BCF, we can run a similar command. If we want to keep the original BCF, we need to give the new one a different name so that the old one is not overwritten:
In [ ]:
bcftools view -O b -o data/1kg_2.bcf data/1kg.vcf.gz
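To reassure ourselves that nothing was lost along the way, we could compare the number of variant records in the original and the newly created BCF files. This is a sketch under the assumption that a matching record count is a good enough check; count_variants is a hypothetical helper name, and the -H option of bcftools view suppresses the header so that only records are counted.

```shell
# Hypothetical helper: count the variant records in a VCF or BCF file.
count_variants() {
    # -H prints the records without the header lines.
    bcftools view -H "$1" | wc -l
}

# Only run the comparison if bcftools and both files are available.
if command -v bcftools >/dev/null 2>&1 &&
   [ -f data/1kg.bcf ] && [ -f data/1kg_2.bcf ]; then
    echo "original:   $(count_variants data/1kg.bcf) records"
    echo "round trip: $(count_variants data/1kg_2.bcf) records"
fi
```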
The answers to the questions on this page can be found here.
Now continue to the next section of the tutorial: QC assessment of NGS data.
You can also return to the index page.