Many (probably most) CRISPR spacers found in bacterial genomes came from phages, viruses that infect bacteria. This notebook attempts to figure out which phages infect which bacteria in a metagenomic sample by matching CRISPR spacers to phages from a database. It uses Crass to detect CRISPR spacers in metagenomic data, and BLAST to compare spacer sequences and metagenomic reads to databases of phage and bacterial genomes.
like this
are written in Bash. They can be run in a terminal (Linux / Mac) or in a Bash shell for Windows (Git Bash, for example). Alternatively, you can run those commands directly in this notebook by adding an exclamation point at the beginning of each line and changing the cell to a code block.We'll use Crass to find CRISPR spacers in metagenomic data. The Crass manual has instructions for installation, but here's what worked on Ubuntu:
sudo apt-get install libxerces-c3-dev
sudo add-apt-repository ppa:dns/gnu -y
sudo apt-get update -q
sudo apt-get install --only-upgrade autoconf
(or sudo apt-get install autoconf
)
sudo apt install libtool-bin
wget http://www.zlib.net/zlib-1.2.11.tar.gz
tar -xvzf zlib-1.2.11.tar.gz
cd zlib-1.2.11/
./configure --prefix=/usr/local/zlib
make
sudo make install
tar -xf crass-0.3.12.tar.gz
cd crass-0.3.12/
./autogen.sh
./configure
make
sudo make install
Install BLAST locally
See the documentation here to install BLAST to run on a local machine. On Ubuntu, I followed these instructions:
Download the appropriate installer from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. For Ubuntu, download the file ending in linux.tar.gz
.
Move the download to the desired directory (i.e. home
). Extract the files with tar zxvpf ncbi-blast+2.8.1-x64-linux.tar.gz
(changing the filename to match your download).
It's now technically usable as-is, but to be able to run from any directory, add it to the PATH: export PATH=$PATH:$HOME/ncbi-blast-2.8.1+/bin
. If the directory isn't in the home folder, replace everything after $PATH:$
with the correct folder. If the version number is different, change the ncbi folder name as well.
In order to run makeblastdb
in step 3, I also had to run sudo apt install ncbi-blast+
.
In [ ]:
# Download and extract a dataset - the data will be saved in the folder SRS014107 in the working directory
!wget ftp://public-ftp.ihmpdcc.org/Illumina/subgingival_plaque/SRS014107.tar.bz2
!tar -xf SRS014107.tar.bz2
We're using an NCBI phage database and the NCBI bacteria and archaea databases, accessible by downloading the accessions from this FTP site using collect_accessions.py
and then downloading genome sequences using acc2gb.py
:
This only needs to be done once, unless you want to periodically check for new sequences in NCBI and then recreate the databases.
In [ ]:
# Download accession numbers for phages, bacteria, and archaea:
!python ../parserscripts/collect_accessions.py Phages.ids > phage_accessions.txt
!python ../parserscripts/collect_accessions.py Bacteria.ids > bacteria_accessions.txt
!python ../parserscripts/collect_accessions.py Archaea.ids > archaea_accessions.txt
In [ ]:
# Download genome sequences
# this takes a long time (there are ~6700 bacteria, ~280 archaea, and ~2300 phages)
!cat phage_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > phagegenomes.dat
!cat bacteria_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > bacteriagenomes.dat
!cat archaea_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > archaeagenomes.dat
In [ ]:
# Create BLAST databases: one for bacteria and archaea, one for phages
import datetime
today = datetime.date.today()
!makeblastdb -in "phagegenomes.dat" -dbtype nucl -title phagedatabase_$today -out phagedb_$today
!cat bacteriagenomes.dat archaeagenomes.dat | makeblastdb -in - -dbtype nucl -title bacdatabase_$today -out bacdb_$today
The output from creating a BLAST database should look something like this:
Building a new DB, current time: 05/11/2018 12:37:11
New DB name: /Documents/GitHub/phageParser/bacdb_May2018
New DB title: bacdatabase_May2018
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6990 sequences in 103.433 seconds.
In [ ]:
# change data_ID to the folder name of the dataset downloaded in step 2.
data_ID = "SRS014107"
crass_output_directory = "%s-crass-output" %data_ID
In [ ]:
# Run Crass on the downloaded file from earlier
!crass "$data_ID/$data_ID".denovo_duplicates_marked.trimmed.1.fastq -o $crass_output_directory
In [ ]:
# create a folder called 'spacers' and a folder called 'source_reads in the Crass output folder'
!mkdir "$crass_output_directory"/spacers
!mkdir "$crass_output_directory"/source_reads
In [ ]:
import xml.etree.ElementTree as ET
from tqdm import tqdm
In [ ]:
tree = ET.parse('%s/crass.crispr' %crass_output_directory)
root = tree.getroot()
In [ ]:
# create dictionary to store repeats and read headers to get them out of the original file later
read_dict = {}
# create a list of the accessions that come up
read_dict_accessions = {}
for child in root: # each top-level child is a CRISPR array
repeat = child.attrib['drseq'] # the consensus repeat sequence identified by Crass
spacers = child[0][2] # the spacers associated with that repeat
source_reads = child[0][0] # the source reads that the spacers and repeats come from
read_dict[repeat] = {}
# create a file with the spacers
with open(crass_output_directory + "/spacers/" + repeat + ".fasta", 'w') as spacer_file:
for spacer in spacers:
spacer_file.write('>' + spacer.attrib['spid'] + '\n')
spacer_file.write(spacer.attrib['seq'] + '\n')
for read in source_reads:
header = read.attrib['accession']
read_dict[repeat][header] = []
read_dict_accessions[header] = repeat
In [ ]:
# Extract the sequences associated with each repeat for BLASTing
from Bio import SeqIO
In [ ]:
for seq_record in SeqIO.parse(data_ID + "/SRS014107.denovo_duplicates_marked.trimmed.1.fastq", "fastq"):
header = seq_record.id
if header in read_dict_accessions.keys():
read_dict[read_dict_accessions[header]][header] = seq_record.seq
#print(repr(seq_record.seq))
#print(len(seq_record))
In [ ]:
# save sequences to a file, one for each repeat
for repeat in read_dict.keys():
with open(crass_output_directory + "/source_reads/" + repeat +'_reads' + ".fasta", 'w') as repeat_file:
for header, sequence in read_dict[repeat].items():
repeat_file.write('>' + header + '\n')
repeat_file.write(str(sequence) + '\n')
In [ ]:
import subprocess
import sys
import os
from Bio.Blast.Applications import NcbitblastnCommandline
In [ ]:
# make output directory
outdir = "%s/spacers_BLAST" %(crass_output_directory)
if not os.path.exists(outdir):
os.makedirs(outdir)
# loop through each file in the 'spacers' folder
for filename in os.listdir(crass_output_directory + "/spacers"):
# get repeat sequence from filename
repeat = filename.split('.')[0]
# builds command line string
tblastn_obj = NcbitblastnCommandline(
cmd='blastn',
query="%s/spacers/%s" %(crass_output_directory,filename),
db="phagedb_%s" %today,
evalue=10,
outfmt=5,
out="%s/%s.xml" %(outdir,repeat)
)
tblastn_cline = str(tblastn_obj)
# executes command line string in bash
subprocess.call(tblastn_cline, shell=True)
In [ ]:
# make output directory
outdir = "%s/source_reads_BLAST" %(crass_output_directory)
if not os.path.exists(outdir):
os.makedirs(outdir)
# loop through each file in the source_reads folder
for filename in os.listdir(crass_output_directory + "/source_reads"):
# get repeat sequence from filename
repeat = filename.split('_')[0].split('.')[0]
# builds command line string
tblastn_obj = NcbitblastnCommandline(
cmd='blastn',
query="%s/source_reads/%s" %(crass_output_directory,filename),
db="bacdb_%s" %today,
evalue=10,
outfmt=5,
out="%s/%s.xml" %(outdir,repeat)
)
tblastn_cline = str(tblastn_obj)
# executes command line string in bash
subprocess.call(tblastn_cline, shell=True)
In [ ]:
import csv
In [ ]:
def parse_blast(resultfile): #takes in the BLAST result, outputs list that can be made into csv
from Bio.Blast import NCBIXML
result_handle = open(resultfile)
blast_records = NCBIXML.parse(result_handle)
csv_list = []
header = [ 'Query',
'Name', 'Length', 'Score', 'Expect',
'QueryStart', 'QueryEnd',
'SubjectStart', 'SubjectEnd'
]
csv_list.append(header)
count = 0
for blast_record in blast_records:
'''help(blast_record.alignments[0].hsps[0])''' # these give help info for the parts
'''help(blast_record.alignments[0]) '''
count +=1
query = blast_record.query
for alignment in blast_record.alignments:
name = alignment.title
length = alignment.length
hsp = alignment.hsps[0] # I don't know if we will ever have more than one, so might as well take the first one.
score = hsp.score
expect = hsp.expect
if expect >= cutoff:
continue
querystart = hsp.query_start
queryend = hsp.query_end
subjectstart = hsp.sbjct_start
subjectend = hsp.sbjct_end
row = [query,name,length,score,expect,querystart,queryend,subjectstart,subjectend]
csv_list.append(row)
result_handle.close()
return csv_list
def write_csv(dest, csv_cont): #takes a list of lists object with each csv row as a list
with open(dest, 'w') as out_file:
writer= csv.writer(out_file, delimiter=',')
for row in csv_cont:
writer.writerow(row)
In [ ]:
#source reads
indir = crass_output_directory + "/source_reads_BLAST"
outdir = crass_output_directory + "/source_reads_BLAST_results"
if not os.path.exists(outdir):
os.makedirs(outdir)
cutoff = 1
for fn in os.listdir("%s/" %indir):
ID = fn[:fn.index('.')]
if fn == 'sorted':
continue
csv_list = parse_blast("%s/%s" %(indir,fn))
write_csv("%s/%s.csv" %(outdir,ID), csv_list)
In [ ]:
# spacers
indir = crass_output_directory + "/spacers_BLAST"
outdir = crass_output_directory + "/spacers_BLAST_results"
if not os.path.exists(outdir):
os.makedirs(outdir)
cutoff = 10
for fn in os.listdir("%s/" %indir):
ID = fn[:fn.index('.')]
if fn == 'sorted':
continue
csv_list = parse_blast("%s/%s" %(indir,fn))
write_csv("%s/%s.csv" %(outdir,ID), csv_list)