File conversion

In this section we are going to look at how to convert from one file format to another. There are many tools available for converting between file formats, and we will use some of the most common ones: samtools, bcftools and Picard.

SAM to BAM

To convert from BAM to SAM format we are going to use the samtools view command. In this instance, we would like to include the header in the SAM output, so we use the -h option:


In [ ]:
samtools view -h data/NA20538.bam > data/NA20538.sam

Now, have a look at the first ten lines of the SAM file. They should look like they did in the previous section when you viewed the BAM file header.


In [ ]:
head data/NA20538.sam

Well, that was easy! Converting SAM to BAM is just as straightforward. This time there is no need for the -h option; however, we have to tell samtools that we want the output in BAM format. We do so by adding the -b option:


In [ ]:
samtools view -b data/NA20538.sam > data/NA20538_2.bam

Samtools is very well documented, so for more usage options and functions, have a look at the samtools manual.
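
As a quick sanity check on the round trip, the number of alignment records in the original BAM file and in the re-created one can be compared. The sketch below assumes that samtools is on the PATH and that both files from the steps above exist; same_record_count is a hypothetical helper name, not a samtools command. It relies on the -c option of samtools view, which prints the number of records instead of the records themselves.

```shell
# same_record_count: hypothetical helper, succeeds when two alignment files
# hold the same number of records (counted with samtools view -c).
same_record_count() {
    count_a=$(samtools view -c "$1") || return 2
    count_b=$(samtools view -c "$2") || return 2
    [ "$count_a" -eq "$count_b" ]
}

# Only run the check when samtools and both files are available.
if command -v samtools >/dev/null 2>&1 &&
   [ -f data/NA20538.bam ] && [ -f data/NA20538_2.bam ]; then
    if same_record_count data/NA20538.bam data/NA20538_2.bam; then
        echo "record counts match"
    else
        echo "record counts differ"
    fi
fi
```

Equal counts do not prove that the files are identical, but a mismatch is a clear sign that something went wrong in the conversion.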

BAM to CRAM

From samtools version 1.3, support for CRAM format was introduced. This means that the samtools view command can also be used to convert a BAM file to CRAM format. In the data directory there is a BAM file called yeast.bam that was created from S. cerevisiae Illumina sequencing data. There is also a reference genome in the directory, called Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa. The conversion needs an index file (.fai) for this reference. It can be created beforehand using samtools faidx. However, as we will see, samtools will generate this file on the fly when we specify a reference file using the -T option.

To convert to CRAM, we use the -C option to tell samtools we want the output as CRAM, and the -T option to specify what reference file to use for the conversion. We also use the -o option to specify the name of the output file. Give this a go:


In [ ]:
samtools view -C \
    -T data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
    -o data/yeast.cram data/yeast.bam

Have a look at what files were created:


In [ ]:
ls -l data

As you can see, this has created an index file for the reference genome called Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa.fai and the CRAM file yeast.cram.
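
The .fai index is a plain tab-separated text file with one line per reference sequence; its first two columns hold the sequence name and its length in bases. A minimal sketch for inspecting it follows, where list_contigs is a hypothetical helper name and the path assumes the index was written next to the reference file:

```shell
# list_contigs: hypothetical helper, prints "name<TAB>length" for every
# sequence recorded in a .fai index (columns 1 and 2 of the file).
list_contigs() {
    cut -f1,2 "$1"
}

fai=data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa.fai
# Guarded so the snippet does nothing if the index has not been created yet.
if [ -f "$fai" ]; then
    list_contigs "$fai"
fi
```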

Q1: Since CRAM files use reference-based compression, we expect the CRAM file to be smaller than the BAM file. What is the size of the CRAM file?


In [ ]:

Q2: Is your CRAM file smaller than the original BAM file?


In [ ]:
ls -l data
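
Reading the sizes off the ls -l listing is enough to answer the question, but the comparison can also be computed. The following sketch uses only standard shell tools; size_percent is a hypothetical helper name, and the guarded call assumes that yeast.bam and yeast.cram are both in the data directory.

```shell
# size_percent: hypothetical helper, prints the size of the second file as a
# whole-number percentage of the size of the first file.
size_percent() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    size_a=$(( $(wc -c < "$1") ))
    size_b=$(( $(wc -c < "$2") ))
    [ "$size_a" -gt 0 ] || return 1
    echo "$(( size_b * 100 / size_a ))%"
}

# Guarded so the snippet does nothing if either file is missing.
if [ -f data/yeast.bam ] && [ -f data/yeast.cram ]; then
    echo "CRAM is $(size_percent data/yeast.bam data/yeast.cram) of the BAM size"
fi
```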

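To convert CRAM back to BAM, simply change -C to -b and swap the input and output file names. Note that the command below writes over the original yeast.bam; choose a different output name if you want to keep it: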

samtools view -b -T data/Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa \
    -o data/yeast.bam data/yeast.cram

FASTQ to SAM

As mentioned in the previous section of this tutorial, SAM format is mainly used to store alignment data. However, in some cases we may want to store unaligned data in SAM format, and for this we can use the Picard tools FastqToSam application. Picard tools is a Java application that comes with a number of useful options for manipulating high-throughput sequencing data. Apart from FASTQ to SAM, we won't go into any detail about Picard tools in this tutorial, but feel free to explore it on the Picard tools website. To convert the FASTQ files of lane 13681_1#18 to unaligned SAM format, run:


In [ ]:
java -jar $PICARD FastqToSam F1=data/13681_1#18_1.fastq.gz \
    F2=data/13681_1#18_2.fastq.gz \
    O=data/13681_1#18.sam SM=13681_1#18

Here, $PICARD should contain the path to the picard.jar file, as described on the index page.
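
Because the reads in this SAM file are unaligned, every record should carry the "unmapped" flag. For paired reads whose mate is also unmapped, this gives a FLAG value of 77 for first-in-pair reads (1 + 4 + 8 + 64) and 141 for second-in-pair reads (1 + 4 + 8 + 128). The sketch below tallies the FLAG values in a SAM file; flag_counts is a hypothetical helper name, and the path assumes the output file from the command above.

```shell
# flag_counts: hypothetical helper, reads SAM text on standard input, skips
# header lines (which start with "@") and prints "FLAG count" per FLAG value.
flag_counts() {
    awk -F'\t' '!/^@/ { n[$2]++ } END { for (f in n) print f, n[f] }' | sort -n
}

# Quoted because the file name contains a "#" character.
sam='data/13681_1#18.sam'
if [ -f "$sam" ]; then
    flag_counts < "$sam"
fi
```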

From here you can go on and convert the SAM file to BAM and CRAM, as described previously. There are also multiple options for specifying what metadata to include in the SAM header. To see all available options, run:


In [ ]:
java -jar $PICARD FastqToSam -h

CRAM to FASTQ

Although it is possible to convert CRAM to FASTQ directly using the samtools fastq command, for many applications we need the FASTQ files to be ordered, with the two reads of each pair at the same position in the forward and reverse files. For this reason, we will first use samtools collate, which groups reads with the same name together and produces a collated BAM file. The reference file and its index file, which were created when we converted BAM to CRAM, are required for this as well.


In [ ]:
samtools collate data/yeast.cram data/yeast.collated

The newly produced BAM file will be called yeast.collated.bam. Let's use this to create two FASTQ files, one for the forward reads and one for the reverse reads:


In [ ]:
samtools fastq -1 data/yeast.collated_1.fastq \
    -2 data/yeast.collated_2.fastq data/yeast.collated.bam
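
A simple consistency check on the result is to count the reads in each FASTQ file; if every read in the input has a mate, the two counts should be equal. The sketch below assumes each record spans exactly four lines, which is how samtools fastq writes them; fastq_reads is a hypothetical helper name.

```shell
# fastq_reads: hypothetical helper, prints the number of reads in an
# uncompressed FASTQ file, assuming exactly four lines per record.
fastq_reads() {
    echo "$(( $(wc -l < "$1") / 4 ))"
}

fq1=data/yeast.collated_1.fastq
fq2=data/yeast.collated_2.fastq
# Guarded so the snippet does nothing if the FASTQ files are missing.
if [ -f "$fq1" ] && [ -f "$fq2" ]; then
    echo "forward reads: $(fastq_reads "$fq1")"
    echo "reverse reads: $(fastq_reads "$fq2")"
fi
```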

For further information and usage options, have a look at the samtools manual page.

VCF to BCF

As we saw in the previous section, bcftools comprises a set of programs for interacting with VCF/BCF files. Just as samtools view can be used to convert between SAM, BAM and CRAM, bcftools view can be used to convert between VCF and BCF. To convert the file called 1kg.bcf to a compressed VCF file called 1kg.vcf.gz, run:


In [ ]:
bcftools view -O z -o data/1kg.vcf.gz data/1kg.bcf

The -O option specifies the output format: compressed BCF (b), uncompressed BCF (u), compressed VCF (z) or uncompressed VCF (v). With the -o option we set the name of the output file.

Have a look at what files were generated (the options -lrt will list the files sorted by modification time, with the most recently modified file last):


In [ ]:
ls -lrt data

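As you can see, the directory also contains an index file, 1kg.bcf.csi. This is the index of the BCF file 1kg.bcf; bcftools view does not index its output, so use bcftools index if you need an index for the new VCF file.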

To convert a VCF file to BCF, we can run a similar command. If we want to keep the original BCF, we need to give the new one a different name so that the old one is not overwritten:


In [ ]:
bcftools view -O b -o data/1kg_2.bcf data/1kg.vcf.gz
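
The two conversions differ only in the letter passed to -O and in the file names. As an illustration, that letter can be derived from the name of the output file. This is a sketch only: output_type_for is a hypothetical helper that is not part of bcftools, it covers just the three extensions used in this section, and the snippet prints the resulting command instead of running it.

```shell
# output_type_for: hypothetical helper, prints the bcftools -O letter that
# matches an output file name (z, v or b); fails for other extensions.
output_type_for() {
    case "$1" in
        *.vcf.gz) echo z ;;    # compressed VCF
        *.vcf)    echo v ;;    # uncompressed VCF
        *.bcf)    echo b ;;    # compressed BCF
        *)        return 1 ;;  # unknown extension
    esac
}

# Print (not run) the conversion command for a chosen output name.
out=data/1kg_2.bcf
echo "bcftools view -O $(output_type_for "$out") -o $out data/1kg.vcf.gz"
```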

The answers to the questions on this page can be found here.

Now continue to the next section of the tutorial: QC assessment of NGS data.
You can also return to the index page.