Pipeline to match bacteria and phages via CRISPR spacers

Many (probably most) CRISPR spacers found in bacterial genomes came from phages, viruses that infect bacteria. This notebook attempts to figure out which phages infect which bacteria in a metagenomic sample by matching CRISPR spacers to phages from a database. It uses Crass to detect CRISPR spacers in metagenomic data, and BLAST to compare spacer sequences and metagenomic reads to databases of phage and bacterial genomes.

Notes on running this notebook

Commands in markdown (non-code) blocks that look like this are written in Bash. They can be run in a terminal (Linux / Mac) or in a Bash shell for Windows (Git Bash, for example). Alternatively, you can run those commands directly in this notebook by adding an exclamation point at the beginning of each line and changing the cell to a code block.
Many of the steps only need to be run once, even if you're analyzing multiple metagenomic datasets.

Step 1: Install the necessary software

We'll use Crass to find CRISPR spacers in metagenomic data. The Crass manual has instructions for installation, but here's what worked on Ubuntu:

Install Crass dependencies

sudo apt-get install libxerces-c3-dev

sudo add-apt-repository ppa:dns/gnu -y
sudo apt-get update -q
sudo apt-get install --only-upgrade autoconf

(or sudo apt-get install autoconf)

sudo apt install libtool-bin

wget http://www.zlib.net/zlib-1.2.11.tar.gz
tar -xvzf zlib-1.2.11.tar.gz 
cd zlib-1.2.11/
./configure --prefix=/usr/local/zlib
make
sudo make install

Install Crass: download tar from here

tar -xf crass-0.3.12.tar.gz 
cd crass-0.3.12/
./autogen.sh 
./configure 
make 
sudo make install

Install BLAST locally

See the documentation here to install BLAST to run on a local machine. On Ubuntu, I followed these instructions:
- Download the appropriate installer from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. For Ubuntu, download the file ending in linux.tar.gz.
- Move the download to the desired directory (i.e. home). Extract the files with tar zxvpf ncbi-blast+2.8.1-x64-linux.tar.gz (changing the filename to match your download).
- It's now technically usable as-is, but to be able to run from any directory, add it to the PATH: export PATH=$PATH:$HOME/ncbi-blast-2.8.1+/bin. If the directory isn't in the home folder, replace everything after $PATH:$ with the correct folder. If the version number is different, change the ncbi folder name as well.
- In order to run makeblastdb in step 3, I also had to run sudo apt install ncbi-blast+.

Step 2: Download some microbiome data in fasta or fastq format

We'll use a dataset from the Human Microbiome Project as an example. The file below is a metagenomic sample from subginvival plaque.



In [ ]:

    
# Download and extract a dataset - the data will be saved in the folder SRS014107 in the working directory
!wget ftp://public-ftp.ihmpdcc.org/Illumina/subgingival_plaque/SRS014107.tar.bz2
!tar -xf SRS014107.tar.bz2

Step 3: Download reference genomes and make BLAST databases

We're using an NCBI phage database and the NCBI bacteria and archaea databases, accessible by downloading the accessions from this FTP site using collect_accessions.py and then downloading genome sequences using acc2gb.py:

This only needs to be done once, unless you want to periodically check for new sequences in NCBI and then recreate the databases.



In [ ]:

    
# Download accession numbers for phages, bacteria, and archaea:

!python ../parserscripts/collect_accessions.py Phages.ids > phage_accessions.txt
!python ../parserscripts/collect_accessions.py Bacteria.ids > bacteria_accessions.txt
!python ../parserscripts/collect_accessions.py Archaea.ids > archaea_accessions.txt



In [ ]:

    
# Download genome sequences 
# this takes a long time (there are ~6700 bacteria, ~280 archaea, and ~2300 phages)

!cat phage_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > phagegenomes.dat
!cat bacteria_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > bacteriagenomes.dat
!cat archaea_accessions.txt | python ../parserscripts/acc2gb.py your@email.com nuccore fasta > archaeagenomes.dat



In [ ]:

    
# Create BLAST databases: one for bacteria and archaea, one for phages
import datetime
today = datetime.date.today()

!makeblastdb -in "phagegenomes.dat" -dbtype nucl -title phagedatabase_$today -out phagedb_$today
!cat bacteriagenomes.dat archaeagenomes.dat | makeblastdb -in - -dbtype nucl -title bacdatabase_$today -out bacdb_$today

The output from creating a BLAST database should look something like this:

Building a new DB, current time: 05/11/2018 12:37:11
New DB name:   /Documents/GitHub/phageParser/bacdb_May2018
New DB title:  bacdatabase_May2018
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6990 sequences in 103.433 seconds.

Step 4: Run Crass

The simplest syntax to run Crass on a fasta file is crass input_fasta_file -o output_directory.



In [ ]:

    
# change data_ID to the folder name of the dataset downloaded in step 2.
data_ID = "SRS014107"
crass_output_directory = "%s-crass-output" %data_ID



In [ ]:

    
# Run Crass on the downloaded file from earlier
!crass "$data_ID/$data_ID".denovo_duplicates_marked.trimmed.1.fastq -o $crass_output_directory

Step 5: Parse Crass output to get spacer sequences

The Crass output we're interested in is the file crass.crispr that appears in the output directory. It's an XML file, so we can parse it.



In [ ]:

    
# create a folder called 'spacers' and a folder called 'source_reads in the Crass output folder'

!mkdir "$crass_output_directory"/spacers
!mkdir "$crass_output_directory"/source_reads



In [ ]:

    
import xml.etree.ElementTree as ET
from tqdm import tqdm



In [ ]:

    
tree = ET.parse('%s/crass.crispr' %crass_output_directory)
root = tree.getroot()



In [ ]:

    
# create dictionary to store repeats and read headers to get them out of the original file later
read_dict = {}

# create a list of the accessions that come up
read_dict_accessions = {}
    
for child in root: # each top-level child is a CRISPR array
    repeat = child.attrib['drseq'] # the consensus repeat sequence identified by Crass
    spacers = child[0][2] # the spacers associated with that repeat
    source_reads = child[0][0] # the source reads that the spacers and repeats come from
    
    read_dict[repeat] = {}
    
    # create a file with the spacers
    
    with open(crass_output_directory + "/spacers/" + repeat + ".fasta", 'w') as spacer_file:
        for spacer in spacers:
            spacer_file.write('>' + spacer.attrib['spid'] + '\n')
            spacer_file.write(spacer.attrib['seq'] + '\n')
    
    for read in source_reads:
        header = read.attrib['accession']
        read_dict[repeat][header] = []
        read_dict_accessions[header] = repeat



In [ ]:

    
# Extract the sequences associated with each repeat for BLASTing
from Bio import SeqIO



In [ ]:

    
for seq_record in SeqIO.parse(data_ID + "/SRS014107.denovo_duplicates_marked.trimmed.1.fastq", "fastq"):
    header = seq_record.id 
    if header in read_dict_accessions.keys():
        read_dict[read_dict_accessions[header]][header] = seq_record.seq
    #print(repr(seq_record.seq))
    #print(len(seq_record))



In [ ]:

    
# save sequences to a file, one for each repeat

for repeat in read_dict.keys():
    with open(crass_output_directory + "/source_reads/" + repeat +'_reads' + ".fasta", 'w') as repeat_file:
        for header, sequence in read_dict[repeat].items():
            repeat_file.write('>' + header + '\n')
            repeat_file.write(str(sequence) + '\n')

Step 6: BLAST spacers against phage database, BLAST source reads against bacterial database



In [ ]:

    
import subprocess
import sys
import os

from Bio.Blast.Applications import NcbitblastnCommandline



In [ ]:

    
# make output directory

outdir = "%s/spacers_BLAST" %(crass_output_directory)
if not os.path.exists(outdir):
    os.makedirs(outdir)

# loop through each file in the 'spacers' folder

for filename in os.listdir(crass_output_directory + "/spacers"):
    
    # get repeat sequence from filename
    repeat = filename.split('.')[0]
    
    # builds command line string
    tblastn_obj = NcbitblastnCommandline(
        cmd='blastn',
        query="%s/spacers/%s" %(crass_output_directory,filename),
        db="phagedb_%s" %today,
        evalue=10,
        outfmt=5,
        out="%s/%s.xml" %(outdir,repeat)
    )

    tblastn_cline = str(tblastn_obj)

    # executes command line string in bash
    subprocess.call(tblastn_cline, shell=True)



In [ ]:

    
# make output directory

outdir = "%s/source_reads_BLAST" %(crass_output_directory)
if not os.path.exists(outdir):
    os.makedirs(outdir)

# loop through each file in the source_reads folder

for filename in os.listdir(crass_output_directory + "/source_reads"):
    # get repeat sequence from filename
    repeat = filename.split('_')[0].split('.')[0]
    # builds command line string
    tblastn_obj = NcbitblastnCommandline(
        cmd='blastn',
        query="%s/source_reads/%s" %(crass_output_directory,filename),
        db="bacdb_%s" %today,
        evalue=10,
        outfmt=5,
        out="%s/%s.xml" %(outdir,repeat)
    )

    tblastn_cline = str(tblastn_obj)

    # executes command line string in bash
    subprocess.call(tblastn_cline, shell=True)

Step 7: Parse BLAST output



In [ ]:

    
import csv



In [ ]:

    
def parse_blast(resultfile): #takes in the BLAST result, outputs list that can be made into csv
    from Bio.Blast import NCBIXML
    result_handle = open(resultfile)
    blast_records = NCBIXML.parse(result_handle)
    csv_list = []
    
    header = [  'Query',
                'Name', 'Length', 'Score', 'Expect',
                'QueryStart', 'QueryEnd',
                'SubjectStart', 'SubjectEnd'
            ]
    
    csv_list.append(header)
    count = 0
    for blast_record in blast_records:
        '''help(blast_record.alignments[0].hsps[0])''' # these give help info for the parts 
        '''help(blast_record.alignments[0])        '''
        count +=1
        
        query = blast_record.query
        for alignment in blast_record.alignments:

            name = alignment.title
            length = alignment.length
    
            hsp = alignment.hsps[0] # I don't know if we will ever have more than one, so might as well take the first one.
            score = hsp.score
            expect = hsp.expect
            if expect >= cutoff:
                continue
            querystart = hsp.query_start
            queryend = hsp.query_end
            subjectstart = hsp.sbjct_start
            subjectend = hsp.sbjct_end
            row = [query,name,length,score,expect,querystart,queryend,subjectstart,subjectend]
            csv_list.append(row)
            
    result_handle.close()
    return csv_list

def write_csv(dest, csv_cont): #takes a list of lists object with each csv row as a list
    with open(dest, 'w') as out_file:
        writer= csv.writer(out_file, delimiter=',')
        for row in csv_cont:
            writer.writerow(row)



In [ ]:

    
#source reads

indir = crass_output_directory + "/source_reads_BLAST" 
outdir = crass_output_directory + "/source_reads_BLAST_results"

if not os.path.exists(outdir):
    os.makedirs(outdir)

cutoff = 1

for fn in os.listdir("%s/" %indir):
    ID = fn[:fn.index('.')]
    if fn == 'sorted':
        continue
    csv_list = parse_blast("%s/%s" %(indir,fn))
    write_csv("%s/%s.csv" %(outdir,ID), csv_list)



In [ ]:

    
# spacers

indir = crass_output_directory + "/spacers_BLAST" 
outdir = crass_output_directory + "/spacers_BLAST_results"

if not os.path.exists(outdir):
    os.makedirs(outdir)

cutoff = 10

for fn in os.listdir("%s/" %indir):
    ID = fn[:fn.index('.')]
    if fn == 'sorted':
        continue
    csv_list = parse_blast("%s/%s" %(indir,fn))
    write_csv("%s/%s.csv" %(outdir,ID), csv_list)