Usually NGS reads are mapped against a reference genome containing only the assembled chromosomes, and not the remaining contigs. And this methodology is perfectly valid. However in order to decrease the probability of having mapping errors, adding all unassembled contigs may help:
For variant discovery, RNA-seq and ChIP-seq, it is recommended to use the entire primary assembly, including assembled chromosomes AND unlocalized/unplaced contigs, for the purpose of read mapping. Not including unlocalized and unplaced contigs potentially leads to more mapping errors.
We are thus going to download full chromosomes and unassembled contigs. From these sequences we are then going to create two reference genomes:
We search for the most recent reference genome corresponding to Mouse (https://www.ncbi.nlm.nih.gov/genome?term=mus%20musculus).
From there we obtain these identifiers:
In [2]:
species = 'Mus_musculus'
taxid = '10090'
assembly = 'GRCm38.p6'
genbank = 'GCF_000001635.26'
The variables defined above can be modified for any other species, resulting in new results for the following commands.
In [4]:
sumurl = ('ftp://ftp.ncbi.nlm.nih.gov/genomes/all/{0}/{1}/{2}/{3}/{4}_{5}/'
'{4}_{5}_assembly_report.txt').format(genbank[:3], genbank[4:7], genbank[7:10],
genbank[10:13], genbank, assembly)
crmurl = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
'?db=nuccore&id=%s&rettype=fasta&retmode=text')
In [4]:
print sumurl
In [5]:
! wget -q $sumurl -O chromosome_list.txt
In [6]:
! head chromosome_list.txt
In [1]:
import os
In [4]:
dirname = 'genome'
! mkdir -p {dirname}
For each contig/chromosome download the corresponding FASTA file from NCBI
In [16]:
contig = []
for line in open('chromosome_list.txt'):
if line.startswith('#'):
continue
seq_name, seq_role, assigned_molecule, _, genbank, _, refseq, _ = line.split(None, 7)
if seq_role == 'assembled-molecule':
name = 'chr%s.fasta' % assigned_molecule
else:
name = 'chr%s_%s.fasta' % (assigned_molecule, seq_name.replace('/', '-'))
contig.append(name)
outfile = os.path.join(dirname, name)
if os.path.exists(outfile) and os.path.getsize(outfile) > 10:
continue
error_code = os.system('wget "%s" --no-check-certificate -O %s' % (crmurl % (genbank), outfile))
if error_code:
error_code = os.system('wget "%s" --no-check-certificate -O %s' % (crmurl % (refseq), outfile))
if error_code:
print genbank
In [17]:
def write_to_fasta(line):
contig_file.write(line)
def write_to_fastas(line):
contig_file.write(line)
simple_file.write(line)
In [18]:
os.system('mkdir -p {}/{}-{}'.format(dirname, species, assembly))
Out[18]:
In [29]:
contig_file = open('{0}/{1}-{2}/{1}-{2}_contigs.fa'.format(dirname, species, assembly),'w')
simple_file = open('{0}/{1}-{2}/{1}-{2}.fa'.format(dirname, species, assembly),'w')
for molecule in contig:
fh = open('{0}/{1}'.format(dirname, molecule))
oline = '>%s\n' % (molecule.replace('.fasta', ''))
_ = fh.next()
# if molecule is an assembled chromosome we write to both files, otherwise only to the *_contigs one
write = write_to_fasta if '_' in molecule else write_to_fastas
for line in fh:
write(oline)
oline = line
# last line usually empty...
if line.strip():
write(line)
contig_file.close()
simple_file.close()
Remove all the other files (with single chromosome/contig)
In [12]:
! rm -f {dirname}/*.fasta
In [8]:
! gem-indexer -T 8 -i {dirname}/{species}-{assembly}/{species}-{assembly}_contigs.fa -o {dirname}/{species}-{assembly}/{species}-{assembly}_contigs
The path to the index file will be: {dirname}/{species}-{assembly}/{species}_contigs.gem
In this case we can use the FASTA of the genome whithout contigs and follow these step:
In [16]:
! gem-indexer -i {dirname}/{species}-{assembly}/{species}-{assembly}.fa \
-o {dirname}/{species}-{assembly}/{species}-{assembly} -T 8
! gem-mappability -I {dirname}/{species}-{assembly}/{species}-{assembly}.gem -l 50 \
-o {dirname}/{species}-{assembly}/{species}-{assembly}.50mer -T 8
! gem-2-wig -I {dirname}/{species}-{assembly}/{species}-{assembly}.gem \
-i {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.mappability \
-o {dirname}/{species}-{assembly}/{species}-{assembly}.50mer
! wigToBigWig {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.wig \
{dirname}/{species}-{assembly}/{species}-{assembly}.50mer.sizes \
{dirname}/{species}-{assembly}/{species}-{assembly}.50mer.bw
! bigWigToBedGraph {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.bw \
{dirname}/{species}-{assembly}/{species}-{assembly}.50mer.bedGraph
In [ ]:
! rm -f {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.mappability
! rm -f {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.wig
! rm -f {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.bw
! rm -f {dirname}/{species}-{assembly}/{species}-{assembly}.50mer.sizes
! rm -f {dirname}/{species}-{assembly}/*.log