hmmerclust

A python package for detecting a gene cluster of interest in a set of bacterial genomes followed by interactive analysis of the results using ipython/jupyter notebook. For example, it could be used for identifying genetic loci that encode biosynthetic pathways or multicomponent protein assemblies, comparing how these loci differ across taxonomical groups, and automatically extracting sequences for subsequent phylogenetic analysis or protein expression.

Please see the demo folder, which contains an executable notebook demonstrating the functionality.

Requirements

  • python2.7/pip
  • virtualenv
  • hmmer installed on system
  • ipython/notebook
  • pandas
  • matplotlib
  • biopython

User knowledge

Input files

  • Genbank genome sequence files
  • A folder with a set of multiple sequence alignments representing proteins of interest
  • a settings file for defining colors, tables, and abbreviations
  • Please see the demo folder, which contains an executable notebook demonstrating the whole process.

How it works

  • Protein coding sequences are extracted from all the genome genbank files and combined into a single fasta
  • Hmmbuild generates a hidden markov model from the input alignments
  • Hmmsearch queries the combined proteome fasta using the hmms.
  • All the data describing the organism info and taxonomy, the identified proteins, and hit statitics are stored in an object-oriented database.
  • Hits are annotated based on the best-scoring hit. But other hits are preserved in the database if they need to be compared later
  • The hits are clustered together based on the location of their coding sequence on the chromosome by user-specified criteria (e.g. how many CDS represent a cluster, and how far apart should they be). The hits are stored together as 'loci objects'.
  • A pandas dataframe is generated with all the hit data. This allows for interactive analysis of the results, hit statistics, etc.

Scripts for visualizaing and analyzing results

  • A heatmap for visualizing presence or absence of cluster components across a large number of species. The species can be optionally sorted based on their position in a phylogenetic tree.
  • Locus maps w/ tables that describe the hits.
  • Functions for extracting entire loci or sets of annotated proteins for doing multiple sequence alignments or phylogenetic analysis in a separate program.

Installation

Make a virtual envelope

First, install virtualenv and virtualenvwrapper.

Then make a virtualenv for hmmerclust:

$ mkvirtualenv -p /usr/bin/python hmmerclust
$ workon hmmerclust

Install requirements and start the notebok

$ pip install ipython[all] matplotlib pandas biopython hmmerclust
$ ipython notebook

Setting up the files

3 key things

  • Alignment folder
  • Genome folder
  • settings.py file
  • 16S_aligned.csv (for sorting results by phylogentic tree. Just make an empty file to start)

In [8]:
from hmmerclust import hmmerclust
import settings

Set up the genome directory and define genome variables

You can use the fetch_gbwithparts() function to download genomes from NCBI. Here, we just download a few. In an actual search I might use all the genomes from RefSeq. See a list here


In [9]:
accessions_of_interest = ['NC_002516',
 'NC_002695',
 'NC_003143',
 'NC_003198',
 'NC_007645',
 'NC_007650',
 'NC_007651',
 'NC_007760',
 'NC_010287',
 'NC_013722',
 'NC_013971',
 'NC_015677',
 'NC_018518',
 'NC_022904']

In [10]:
#download these from NCBI... may take 5 minutes

hmmerclust.fetch_gbwithparts(accessions_of_interest, 'your@email.com', './genomes/')


downloading genomes... please wait
done downloading NC_002516
done downloading NC_002695
done downloading NC_003143
done downloading NC_003198
done downloading NC_007645
done downloading NC_007650
done downloading NC_007651
done downloading NC_007760
done downloading NC_010287
done downloading NC_013722
done downloading NC_013971
done downloading NC_015677
done downloading NC_018518
done downloading NC_022904

In [11]:
ls ./genomes/


NC_002516.gb  NC_003143.gb  NC_007645.gb  NC_007651.gb  NC_010287.gb  NC_013971.gb  NC_018518.gb
NC_002695.gb  NC_003198.gb  NC_007650.gb  NC_007760.gb  NC_013722.gb  NC_015677.gb  NC_022904.gb

In [12]:
genome_list = !ls ./genomes/
genome_dir = './genomes/'

Make a database from these genomes

Now it is time to populate our database with organism data. After you do this once, a fasta will be generated. If you already have the fasta from a previous run, set freshfasta=False because the file can be very large and don't want to always generate this and have multiple copies on your system.


In [13]:
db = hmmerclust.OrganismDB('t3ss_database',
                           genome_list,
                           genome_dir,
                           freshfasta=True)


making combined fasta for NC_002516.gb
5572 proteins were added
5572 unique proteins were added -- dropping redundant ones
making combined fasta for NC_002695.gb
10776 proteins were added
10776 unique proteins were added -- dropping redundant ones
making combined fasta for NC_003143.gb
14574 proteins were added
14574 unique proteins were added -- dropping redundant ones
making combined fasta for NC_003198.gb
18685 proteins were added
18685 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007645.gb
24962 proteins were added
24945 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007650.gb
27318 proteins were added
27301 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007651.gb
30594 proteins were added
30577 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007760.gb
35020 proteins were added
35002 unique proteins were added -- dropping redundant ones
making combined fasta for NC_010287.gb
35905 proteins were added
35887 unique proteins were added -- dropping redundant ones
making combined fasta for NC_013722.gb
39019 proteins were added
39001 unique proteins were added -- dropping redundant ones
making combined fasta for NC_013971.gb
42326 proteins were added
42306 unique proteins were added -- dropping redundant ones
making combined fasta for NC_015677.gb
46171 proteins were added
46150 unique proteins were added -- dropping redundant ones
making combined fasta for NC_018518.gb
49788 proteins were added
49563 unique proteins were added -- dropping redundant ones
making combined fasta for NC_022904.gb
54559 proteins were added
54319 unique proteins were added -- dropping redundant ones
Adding organism attributes for NC_002516.gb
Adding organism attributes for NC_002695.gb
Adding organism attributes for NC_003143.gb
Adding organism attributes for NC_003198.gb
Adding organism attributes for NC_007645.gb
Adding organism attributes for NC_007650.gb
Adding organism attributes for NC_007651.gb
Adding organism attributes for NC_007760.gb
Adding organism attributes for NC_010287.gb
Adding organism attributes for NC_013722.gb
Adding organism attributes for NC_013971.gb
Adding organism attributes for NC_015677.gb
Adding organism attributes for NC_018518.gb
Adding organism attributes for NC_022904.gb

In [14]:
db.organisms


Out[14]:
[<hmmerclust.hmmerclust.Organism at 0x10731a950>,
 <hmmerclust.hmmerclust.Organism at 0x10a3f06d0>,
 <hmmerclust.hmmerclust.Organism at 0x10a00c210>,
 <hmmerclust.hmmerclust.Organism at 0x10a4789d0>,
 <hmmerclust.hmmerclust.Organism at 0x10a4788d0>,
 <hmmerclust.hmmerclust.Organism at 0x10c284090>,
 <hmmerclust.hmmerclust.Organism at 0x10f74d5d0>,
 <hmmerclust.hmmerclust.Organism at 0x10f74d590>,
 <hmmerclust.hmmerclust.Organism at 0x10c9ed090>,
 <hmmerclust.hmmerclust.Organism at 0x10c340450>,
 <hmmerclust.hmmerclust.Organism at 0x10c340410>,
 <hmmerclust.hmmerclust.Organism at 0x10c340050>,
 <hmmerclust.hmmerclust.Organism at 0x10a4a4c90>,
 <hmmerclust.hmmerclust.Organism at 0x10c4de050>]

In [15]:
for org in db.organisms:
    print org.name


Pseudomonas aeruginosa PAO1
Escherichia coli O157:H7 str. Sakai
Yersinia pestis CO92
Salmonella enterica subsp. enterica serovar Typhi str. CT18
Hahella chejuensis KCTC 2396
Burkholderia thailandensis E264
Burkholderia thailandensis E264
Anaeromyxobacter dehalogenans 2CP-C
Chlamydia trachomatis 434/Bu
Xanthomonas albilineans GPE PC73
Erwinia amylovora ATCC 49946
Ramlibacter tataouinensis TTB310
Bordetella pertussis 18323
Pandoraea pnomenusa 3kgm

the organism object encapsulates this data:


In [16]:
db.organisms[1].__dict__


Out[16]:
{'accesion_version': 'NC_002695.1',
 'accession': 'NC_002695',
 'clazz': 'Gammaproteobacteria',
 'description': 'Escherichia coli O157:H7 str. Sakai chromosome, complete genome.',
 'family': 'Enterobacteriaceae',
 'genome_length': 5498450,
 'genome_path': './genomes/NC_002695.gb',
 'genus': 'Escherichia',
 'kingdom': 'Bacteria',
 'loci': [],
 'name': 'Escherichia coli O157:H7 str. Sakai',
 'order': 'Enterobacteriales',
 'parent_db': <hmmerclust.hmmerclust.OrganismDB instance at 0x10a486ab8>,
 'phylum': 'Proteobacteria',
 'proteins': [],
 'rRNA16S_sequence': None,
 'species': 'Escherichia coli',
 'taxonomy': ['Bacteria',
  'Proteobacteria',
  'Gammaproteobacteria',
  'Enterobacteriales',
  'Enterobacteriaceae',
  'Escherichia'],
 'tree_order': 0}

In [17]:
#notice a combined_fasta file was created when the db was made
!ls


16S_aligned.csv       combined_fasta        hhsearch_results      locus_fastas          settings.pyc
__init__.py           genomes               hmm                   out.svg
alignments            group_fastas          hmmerclust_demo.ipynb settings.py

Carry out the search against the combined_fasta

Set up the /alignments folder. Alignments FASTA and must have txt extension. In this case, I queried PFAM for the proteins located in my gene cluster of interest (here, the type III secretion system) and downloaded the hmm seed alignments. It is important to name the files starting with how you want to the search results to be annotated later, followed by an underscore. It doesn't matter what goes after the underscore.


In [18]:
!ls alignments


InvA_PF00771_seed.txt InvJ_PF02510_seed.txt PrgJ_PB000379.txt     SipD_PF06511_seed.txt SpaS_PF01312_seed.txt
InvC_PF00006_seed.txt OrgA_PF09482_seed.txt PrgK_PF01514_seed.txt SpaO_PF01052_seed.txt
InvE_PF07201_seed.txt OrgB_PB004806.txt     PscP_PF02120_seed.txt SpaP_PF00813_seed.txt
InvG_PF00263_seed.txt PrgH_PF09480_seed.txt SipB_PF04888_seed.txt SpaQ_PF01313_seed.txt
InvH_PF04741_seed.txt PrgI_PF09392_seed.txt SipC_PF09599_seed.txt SpaR_PF01311_seed.txt

In [19]:
combined_fasta = './combined_fasta'
s = hmmerclust.HmmSearch(db, combined_fasta, 
                         freshbuild=True,
                         freshsearch=True)


building Hmm for InvA_PF00771_seed.txt
building Hmm for InvC_PF00006_seed.txt
building Hmm for InvE_PF07201_seed.txt
building Hmm for InvG_PF00263_seed.txt
building Hmm for InvH_PF04741_seed.txt
building Hmm for InvJ_PF02510_seed.txt
building Hmm for OrgA_PF09482_seed.txt
building Hmm for OrgB_PB004806.txt
building Hmm for PrgH_PF09480_seed.txt
building Hmm for PrgI_PF09392_seed.txt
building Hmm for PrgJ_PB000379.txt
building Hmm for PrgK_PF01514_seed.txt
building Hmm for PscP_PF02120_seed.txt
building Hmm for SipB_PF04888_seed.txt
building Hmm for SipC_PF09599_seed.txt
building Hmm for SipD_PF06511_seed.txt
building Hmm for SpaO_PF01052_seed.txt
building Hmm for SpaP_PF00813_seed.txt
building Hmm for SpaQ_PF01313_seed.txt
building Hmm for SpaR_PF01311_seed.txt
building Hmm for SpaS_PF01312_seed.txt
hhbuild complete for ['InvA', 'InvC', 'InvE', 'InvG', 'InvH', 'InvJ', 'OrgA', 'OrgB', 'PrgH', 'PrgI', 'PrgJ', 'PrgK', 'PscP', 'SipB', 'SipC', 'SipD', 'SpaO', 'SpaP', 'SpaQ', 'SpaR', 'SpaS']
running HHsearch on InvA
running HHsearch on InvC
running HHsearch on InvE
running HHsearch on InvG
running HHsearch on InvH
running HHsearch on InvJ
running HHsearch on OrgA
running HHsearch on OrgB
running HHsearch on PrgH
running HHsearch on PrgI
running HHsearch on PrgJ
running HHsearch on PrgK
running HHsearch on PscP
running HHsearch on SipB
running HHsearch on SipC
running HHsearch on SipD
running HHsearch on SpaO
running HHsearch on SpaP
running HHsearch on SpaQ
running HHsearch on SpaR
running HHsearch on SpaS
extracted 39 hits for InvA.out
extracted 296 hits for InvC.out
extracted 19 hits for InvE.out
extracted 63 hits for InvG.out
extracted 1 hits for InvH.out
extracted 3 hits for InvJ.out
extracted 8 hits for OrgA.out
extracted 2 hits for OrgB.out
extracted 12 hits for PrgH.out
extracted 41 hits for PrgI.out
extracted 13 hits for PrgJ.out
extracted 41 hits for PrgK.out
extracted 47 hits for PscP.out
extracted 29 hits for SipB.out
extracted 4 hits for SipC.out
extracted 8 hits for SipD.out
extracted 59 hits for SpaO.out
extracted 39 hits for SpaP.out
extracted 38 hits for SpaQ.out
extracted 40 hits for SpaR.out
extracted 42 hits for SpaS.out
adding proteins to organism NC_002516
adding proteins to organism NC_002695
adding proteins to organism NC_003143
adding proteins to organism NC_003198
adding proteins to organism NC_007645
adding proteins to organism NC_007650
adding proteins to organism NC_007651
adding proteins to organism NC_007760
adding proteins to organism NC_010287
adding proteins to organism NC_013722
adding proteins to organism NC_013971
adding proteins to organism NC_015677
adding proteins to organism NC_018518
adding proteins to organism NC_022904
adding SearchIO hit objects for NC_002516
adding SearchIO hit objects for NC_002695
adding SearchIO hit objects for NC_003143
adding SearchIO hit objects for NC_003198
adding SearchIO hit objects for NC_007645
adding SearchIO hit objects for NC_007650
adding SearchIO hit objects for NC_007651
adding SearchIO hit objects for NC_007760
adding SearchIO hit objects for NC_010287
adding SearchIO hit objects for NC_013722
adding SearchIO hit objects for NC_013971
adding SearchIO hit objects for NC_015677
adding SearchIO hit objects for NC_018518
adding SearchIO hit objects for NC_022904
setting best hit values for Pseudomonas aeruginosa PAO1
setting best hit values for Escherichia coli O157:H7 str. Sakai
setting best hit values for Yersinia pestis CO92
setting best hit values for Salmonella enterica subsp. enterica serovar Typhi str. CT18
setting best hit values for Hahella chejuensis KCTC 2396
setting best hit values for Burkholderia thailandensis E264
setting best hit values for Burkholderia thailandensis E264
setting best hit values for Anaeromyxobacter dehalogenans 2CP-C
setting best hit values for Chlamydia trachomatis 434/Bu
setting best hit values for Xanthomonas albilineans GPE PC73
setting best hit values for Erwinia amylovora ATCC 49946
setting best hit values for Ramlibacter tataouinensis TTB310
setting best hit values for Bordetella pertussis 18323
setting best hit values for Pandoraea pnomenusa 3kgm

In [45]:
#hits now found in the db object
db.organisms[1].proteins


Out[45]:
[NP_311099.1 - ABC transporter ATP-binding protein,
 NP_310807.1 - hypothetical protein,
 NP_312703.1 - ATP synthase F0F1 subunit alpha,
 NP_312230.1 - ABC transporter ATP-binding protein,
 NP_311368.2 - glycine cleavage system transcriptional repressor,
 NP_313267.1 - hypothetical protein,
 NP_312611.1 - hypothetical protein,
 NP_309773.1 - oligopeptide ABC transporter ATP-binding protein,
 NP_309779.1 - transporter,
 NP_312226.1 - hypothetical protein,
 NP_311745.1 - EprI,
 NP_311753.1 - surface presentation of antigens protein SpaO,
 NP_312260.1 - outer membrane porin HofQ,
 NP_311743.1 - EprK,
 NP_311757.1 - ATP synthase SpaL,
 NP_311751.1 - EpaQ,
 NP_311759.1 - EivE,
 NP_312472.2 - hypothetical protein,
 NP_312579.1 - protein EscF,
 NP_310061.1 - potassium-tellurite ethidium and proflavin transporter,
 NP_308962.1 - putrescine transporter ATP-binding protein,
 NP_312609.1 - EscS,
 NP_310704.1 - flagellar MS-ring protein,
 NP_312084.1 - ATP-dependent metalloprotease,
 NP_310682.1 - amino-acid ABC transporter ATP-binding protein,
 NP_308654.1 - iron-enterobactin transporter ATP-binding protein,
 NP_309004.1 - recombination factor protein RarA,
 NP_312595.1 - EscN,
 NP_310714.1 - flagellar biosynthesis protein FliP,
 NP_310716.1 - flagellar biosynthesis protein FliR,
 NP_308868.1 - tail assembly protein,
 NP_310712.1 - flagellar motor switch protein FliN,
 NP_310616.1 - flagellar biosynthesis protein FlhA,
 NP_308283.1 - FhiA protein,
 NP_309894.1 - peptide ABC transporter ATP-binding protein,
 NP_312608.1 - EscT,
 NP_308440.2 - ferric transporter ATP-binding protein,
 NP_312599.1 - hypothetical protein,
 NP_310114.1 - ABC transporter ATP-binding protein,
 NP_312885.1 - ATP-dependent protease ATP-binding protein HslU,
 NP_312610.1 - type III secretion system protein,
 NP_310715.1 - flagellar biosynthesis protein FliQ,
 NP_309863.1 - anthranilate synthase component I,
 NP_311754.1 - EivJ,
 NP_311758.1 - EivA,
 NP_312602.1 - EscC,
 NP_309060.1 - ABC transporter ATPase,
 NP_312701.1 - ATP synthase F0F1 subunit beta,
 NP_308935.1 - glutathione transporter ATP-binding protein,
 NP_311746.1 - EprH,
 NP_312743.1 - transcription termination factor Rho,
 NP_311640.1 - hypothetical protein,
 NP_310711.1 - flagellar motor switch protein FliM,
 NP_311744.1 - EprJ,
 NP_311742.1 - hypothetical protein,
 NP_312592.1 - SepQ,
 NP_312607.1 - secretion system apparatus protein SsaU,
 NP_310617.1 - flagellar biosynthesis protein FlhB,
 NP_308520.1 - DNA-binding ATP-dependent protease La,
 NP_308681.1 - citrate lyase subunit alpha,
 NP_309724.1 - ferric enterobactin transport ATP-binding protein,
 NP_312600.1 - EscJ,
 NP_309774.1 - hypothetical protein,
 NP_311760.1 - EivG,
 NP_312584.1 - SepL,
 NP_310707.1 - flagellum-specific ATP synthase,
 NP_310391.1 - lipoprotein,
 NP_313107.1 - phosphonate C-P lyase system protein PhnK,
 NP_311440.1 - ABC transporter ATP-binding protein,
 NP_312596.1 - hypothetical protein,
 NP_312582.1 - protein EspD,
 NP_311752.1 - surface presentation of antigens protein SpaP,
 NP_312581.1 - protein EspB,
 NP_310709.1 - flagellar hook-length control protein,
 NP_311748.1 - surface presentation of antigens protein SpaS,
 NP_313376.1 - ABC transporter ATP-binding protein,
 NP_310705.1 - flagellar motor switch protein G,
 NP_310706.2 - flagellar assembly protein H,
 NP_310708.1 - flagellar biosynthesis chaperone,
 NP_310710.1 - flagellar basal body-associated protein FliL,
 NP_310713.2 - flagellar biosynthesis protein FliO,
 NP_311747.1 - transcriptional regulator,
 NP_311755.1 - hypothetical protein,
 NP_311756.1 - EivI,
 NP_312580.1 - hypothetical protein,
 NP_312583.1 - protein EspA,
 NP_312585.1 - EscD,
 NP_312586.1 - gamma intimin,
 NP_312587.1 - protein CesT,
 NP_312588.1 - hypothetical protein,
 NP_312589.1 - hypothetical protein,
 NP_312590.1 - hypothetical protein,
 NP_312591.1 - hypothetical protein,
 NP_944569.1 - hypothetical protein,
 NP_312593.1 - hypothetical protein,
 NP_312594.1 - hypothetical protein,
 NP_312597.1 - hypothetical protein,
 NP_312598.1 - SepZ,
 NP_312601.1 - SepD,
 NP_312603.1 - CesD,
 NP_312604.1 - hypothetical protein,
 NP_312605.1 - negative regulator GrlR,
 NP_312606.1 - hypothetical protein]

cluster the hits as loci

Now that the hits have been associated with each organism, they can be clustered into loci based on their proximity in the genome. Here, we search for loci with minimum of 5 of the search proteins within a 15k bp of each other.

Also set the COLOUR_DICT setting here which will determine the locus map color.

In the settings file there are definitions for how results will be displayed later in some of the scripts. The COLOUR_DICT variable indicates how ORFs will be coloured in the locus maps. The HEATMAP_COLUMNS variable are how proteins are ordered in the heatmap scripts. the HEATMAP_ABBREVIATIONS variable are 1 letter codes for proteins that should reflect the same order as the columns. If these are not defined, the will be generated automatically.


In [21]:
!cat settings.py


COLOUR_DICT = {'InvC': "#74bc78",
        'SpaO': "#ec9b6a",   
        'SpaP': "#ec9b6a",
        'SpaQ': "#ec9b6a",
        'SpaR': "#ec9b6a",
        'SpaS': "#ec9b6a",
        'InvA': "#00adef",  
        'PscP': "#f8ed31",
        'OrgB': "#ec9b6a",
        'OrgA': "#ec9b6a",
        'PrgJ': "#8cc63e",
        'PrgK': "#f05a28",
        'PrgH': "#2a388f",
        'InvG': "#9aafcc",
        'InvH': '#9e1e62',
        'InvE': "#00adef",
        'PrgI': "#d91b5b",
        'InvJ': "#00adef",
        'SipC': '#652c90',
        'SipB': '#652c90',
        'SipD': '#652c90',}

HEATMAP_COLUMNS = ['InvC',
        'SpaO',
        'SpaP',
        'SpaQ',
        'SpaR',
        'SpaS',
        'InvA',
        'PscP',
        'OrgB',
        'OrgA',
        'PrgJ',
        'PrgK',
        'PrgH',
        'InvG',
        'InvH',
        'InvE',
        'PrgI',
        'InvJ',
        'SipC',
        'SipB',
        'SipD']

HEATMAP_ABBREVIATIONS = ['C', 'O', 'P', 'Q', 'R', 'S', 'A', 'P', 'B', 'A',
'J', 'K', 'H', 'G', 'H', 'E', 'I', 'J', 'C', 'B', 'D']

In [22]:
db.find_loci(5, 15000, colordict=settings.COLOUR_DICT)


finding loci for Pseudomonas aeruginosa PAO1
total of 2 found for Pseudomonas aeruginosa PAO1
finding loci for Escherichia coli O157:H7 str. Sakai
total of 3 found for Escherichia coli O157:H7 str. Sakai
finding loci for Yersinia pestis CO92
total of 3 found for Yersinia pestis CO92
finding loci for Salmonella enterica subsp. enterica serovar Typhi str. CT18
total of 3 found for Salmonella enterica subsp. enterica serovar Typhi str. CT18
finding loci for Hahella chejuensis KCTC 2396
total of 4 found for Hahella chejuensis KCTC 2396
finding loci for Burkholderia thailandensis E264
total of 3 found for Burkholderia thailandensis E264
finding loci for Burkholderia thailandensis E264
total of 1 found for Burkholderia thailandensis E264
finding loci for Anaeromyxobacter dehalogenans 2CP-C
total of 2 found for Anaeromyxobacter dehalogenans 2CP-C
finding loci for Chlamydia trachomatis 434/Bu
total of 1 found for Chlamydia trachomatis 434/Bu
finding loci for Xanthomonas albilineans GPE PC73
total of 2 found for Xanthomonas albilineans GPE PC73
finding loci for Erwinia amylovora ATCC 49946
total of 5 found for Erwinia amylovora ATCC 49946
finding loci for Ramlibacter tataouinensis TTB310
total of 1 found for Ramlibacter tataouinensis TTB310
finding loci for Bordetella pertussis 18323
total of 2 found for Bordetella pertussis 18323
finding loci for Pandoraea pnomenusa 3kgm
total of 4 found for Pandoraea pnomenusa 3kgm

In [23]:
# loci attribute now populated
db.organisms[1].loci


Out[23]:
[<hmmerclust.hmmerclust.Locus instance at 0x10d9573b0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e60d128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c796050>]

Make a pandas dataframe using the db

Use the power of pandas to make sense of the results from a phylogenetic perspective, or filter the results based on the hmmer statistics.


In [24]:
df = hmmerclust.FinalDataFrame(db)


Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9, 
            ...
            826, 827, 828, 829, 830, 831, 832, 833, 834, 835],
           dtype='int64', length=836)

In [25]:
# each row represents a hit from our search
# For e.g., index the first row

df.df.head()


Out[25]:
org_name org_acc org_phylum org_class org_order org_family org_genus org_species org_tree_order org_genome_length ... prot_acc prot_gi prot_product prot_translation prot_numb_of_res hit_query hit_evalue hit_bitscore hit_bias locus_id
0 Pseudomonas aeruginosa PAO1 NC_002516 Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa 0 6264404 ... NP_250132.1 15596638 hypothetical protein MAVAPGVLLPPTPDVKPKAAAPKSQQKTPEPSNDKTSSFSDMYAKE... 427 PscP 1.100000e-21 78.1 1.1 <hmmerclust.hmmerclust.Locus instance at 0x10a...
1 Pseudomonas aeruginosa PAO1 NC_002516 Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa 0 6264404 ... NP_250414.1 15596920 type III export protein PscJ MRRTVKGLSRMALLALVLALGGCKVELYTGISQKEGNEMLALLRSE... 248 PrgK 2.300000e-72 244.3 0.2 <hmmerclust.hmmerclust.Locus instance at 0x10c...
2 Pseudomonas aeruginosa PAO1 NC_002516 Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa 0 6264404 ... NP_250134.1 15596640 flagellar motor switch protein FliM MAVQDLLSQDEIDALLHGVDDGLVETEVEATPGSVKSYDLTSQDRI... 323 SpaO 1.300000e-17 64.9 0.2 <hmmerclust.hmmerclust.Locus instance at 0x10a...
3 Pseudomonas aeruginosa PAO1 NC_002516 Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa 0 6264404 ... NP_249795.1 15596301 flagellum-specific ATP synthase MRLERTSFARRLEGYTEAVSLPAQPVVEGRLLRMVGLTLEAEGLQA... 451 InvC 1.700000e-72 245.2 0.0 None
4 Pseudomonas aeruginosa PAO1 NC_002516 Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa 0 6264404 ... NP_250138.1 15596644 flagellar biosynthesis protein FliQ MTPEVALDLFREALWLTAMIVGVLVVPSLLVGLVVAMFQAATQINE... 89 SpaQ 1.800000e-29 103.0 8.8 <hmmerclust.hmmerclust.Locus instance at 0x10a...

5 rows × 22 columns


In [26]:
df.df.ix[0]


Out[26]:
org_name                                   Pseudomonas aeruginosa PAO1
org_acc                                                      NC_002516
org_phylum                                              Proteobacteria
org_class                                          Gammaproteobacteria
org_order                                              Pseudomonadales
org_family                                            Pseudomonadaceae
org_genus                                                  Pseudomonas
org_species                                     Pseudomonas aeruginosa
org_tree_order                                                       0
org_genome_length                                              6264404
org_prot_count                                                      82
org_numb_loci                                                        2
prot_acc                                                   NP_250132.1
prot_gi                                                       15596638
prot_product                                      hypothetical protein
prot_translation     MAVAPGVLLPPTPDVKPKAAAPKSQQKTPEPSNDKTSSFSDMYAKE...
prot_numb_of_res                                                   427
hit_query                                                         PscP
hit_evalue                                                     1.1e-21
hit_bitscore                                                      78.1
hit_bias                                                           1.1
locus_id             <hmmerclust.hmmerclust.Locus instance at 0x10a...
Name: 0, dtype: object

In [27]:
%pylab
%matplotlib inline
figsize(20,10)


Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

Can use pandas to analyze the results by taxonomy, etc


In [28]:
# number of loci identified by family

figsize(5,5)
df.df.groupby(['org_family'])['locus_id'].nunique().plot(kind='bar')


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10db0f290>

...or check stats like evalue for an individual protein


In [29]:
figsize(15,5)
df.df[df.df.hit_query=='InvG'].hit_evalue.order().plot(logy=True, kind='bar')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b77af50>

hmmerclust has some functions for visualizing the results with a heatmap

  • quickly compare presence or absence of detected hits across loci/species
  • Set the cols & singleletters variables to order the columns by a specific order, with single letters on the heat map. The single letters are useful when looking at a very large number of organisms/loci
  • The heatmap is colored by # of identical components in the locus, for example below, PrgI was identified twice in some of the loci (todo... add the color bar legend..)
  • use the subset arg to specify that the locus must contain this component to be included in the heatmap

In [30]:
InvG_only = hmmerclust.HeatMap(df.df,
                  by_locus=True,
                  cols=settings.HEATMAP_COLUMNS,
                  singleletters=settings.HEATMAP_ABBREVIATIONS,
                  subset=['InvG'])


Notice if we set by_locus=False, the rows are no longer by locus, rather a hit can be anywhere in the genome. In this case we see a lot of InvC hits not associated with loci


In [31]:
hmmerclust.HeatMap(df.df,
                  by_locus=False,
                  cols=settings.HEATMAP_COLUMNS,
                  singleletters=settings.HEATMAP_ABBREVIATIONS)


Out[31]:
<hmmerclust.hmmerclust.HeatMap instance at 0x10b80bf80>

In [32]:
#the heatmap is built from this unstacked pandas dataframe

InvG_only.unstacked_df.head()


Out[32]:
hit_query InvC SpaO SpaP SpaQ SpaR SpaS InvA PscP OrgB OrgA ... PrgK PrgH InvG InvH InvE PrgI InvJ SipC SipB SipD
org_species locus_id org_tree_order
Yersinia pestis <hmmerclust.hmmerclust.Locus instance at 0x10c682128> 0 1 1 1 1 1 1 1 0 0 0 ... 1 1 1 0 0 2 0 0 0 0
Xanthomonas albilineans <hmmerclust.hmmerclust.Locus instance at 0x10cf043f8> 0 1 1 1 1 1 1 1 0 0 1 ... 1 1 1 0 1 2 0 1 1 1
Salmonella enterica <hmmerclust.hmmerclust.Locus instance at 0x10d86a320> 0 1 1 1 1 1 1 1 0 0 1 ... 1 1 1 1 1 1 1 1 1 1
<hmmerclust.hmmerclust.Locus instance at 0x10ca8b050> 0 1 1 1 1 1 1 1 0 0 0 ... 1 0 1 0 1 2 0 0 2 0
Ramlibacter tataouinensis <hmmerclust.hmmerclust.Locus instance at 0x10f7291b8> 0 2 1 1 1 1 1 1 1 0 0 ... 1 0 1 0 0 0 0 0 0 0

5 rows × 21 columns

we could grab all the loci from this df:


In [33]:
InvG_only.unstacked_df.reset_index().locus_id


Out[33]:
0     <hmmerclust.hmmerclust.Locus instance at 0x10c...
1     <hmmerclust.hmmerclust.Locus instance at 0x10c...
2     <hmmerclust.hmmerclust.Locus instance at 0x10d...
3     <hmmerclust.hmmerclust.Locus instance at 0x10c...
4     <hmmerclust.hmmerclust.Locus instance at 0x10f...
5     <hmmerclust.hmmerclust.Locus instance at 0x10c...
6     <hmmerclust.hmmerclust.Locus instance at 0x10d...
7     <hmmerclust.hmmerclust.Locus instance at 0x10c...
8     <hmmerclust.hmmerclust.Locus instance at 0x10c...
9     <hmmerclust.hmmerclust.Locus instance at 0x10b...
10    <hmmerclust.hmmerclust.Locus instance at 0x10b...
11    <hmmerclust.hmmerclust.Locus instance at 0x10e...
12    <hmmerclust.hmmerclust.Locus instance at 0x10c...
13    <hmmerclust.hmmerclust.Locus instance at 0x10f...
14    <hmmerclust.hmmerclust.Locus instance at 0x10d...
15    <hmmerclust.hmmerclust.Locus instance at 0x10d...
16    <hmmerclust.hmmerclust.Locus instance at 0x10c...
17    <hmmerclust.hmmerclust.Locus instance at 0x10c...
18    <hmmerclust.hmmerclust.Locus instance at 0x10c...
19    <hmmerclust.hmmerclust.Locus instance at 0x10e...
20    <hmmerclust.hmmerclust.Locus instance at 0x10c...
Name: locus_id, dtype: object

In [34]:
#store these in a new list
invg_loci = list(InvG_only.unstacked_df.reset_index().locus_id)

then visualize them:


In [44]:
figsize(15,5)
for locus in invg_loci[:3]:
    hmmerclust.LocusView(locus)



---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Yersinia pestis CO92
Locus id: 4503118120
         accession query    evalue bitscore  bias                    name
0   YP_002345337.1  InvG     3e-32    113.1   4.7  type III secretion pro
1   YP_002345338.1  PrgH      0.61     10.7   0.1  type-III secretion pro
2   YP_002345339.1     -         -        -     -  type III secretion app
3   YP_002345341.1  PrgI   6.9e-06     28.3   0.5  type III secretion app
4   YP_002345342.1     -         -        -     -  type III secretion app
5   YP_002345343.1  PrgI     1e-10     43.8   2.7  type III secretion app
6   YP_002345344.1  PrgK   9.8e-58    196.5   1.2  type III secretion sys
7   YP_002345345.1     -         -        -     -    hypothetical protein
8   YP_002345346.1     -         -        -     -  type III secretion sys
9   YP_002345347.1  InvA    1e-226    756.2   4.3  secretion system appar
10  YP_002345348.1  InvC   6.4e-71    240.1     0  type III secretion sys
11  YP_002345349.1     -         -        -     -  type III secretion sys
12  YP_002345351.1  SpaO     1e-19     71.6     0  type III secretion sys
13  YP_002345352.1  SpaP   8.1e-71    239.5  10.6  type III secretion sys
14  YP_002345353.1  SpaQ   2.2e-27     96.3  14.1  type III secretion app
15  YP_002345354.1  SpaR   6.4e-62    210.8  23.1  type III secretion app
16  YP_002345355.1  SpaS  2.7e-110    370.3   0.7  secretion system appar


---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Xanthomonas albilineans GPE PC73
Locus id: 4512039928
         accession query    evalue bitscore  bias                    name
0   YP_003375972.1  SipD   4.9e-21     76.9   0.1    xsa-invasion protein
1   YP_003375973.1     -         -        -     -  xsa-associated protein
2   YP_003375974.1     -         -        -     -  xsa-associated protein
3   YP_003375975.1     -         -        -     -  xsa-associated protein
4   YP_003375976.1     -         -        -     -  xsa-associated protein
5   YP_003375977.1     -         -        -     -    hypothetical protein
6   YP_003375978.1     -         -        -     -    hypothetical protein
7   YP_003375979.1     -         -        -     -  oxygen-regulated invas
8   YP_003375980.1  OrgA   9.5e-20     72.8     0  oxygen-regulated invas
9   YP_003375981.1  PrgK   3.9e-39    135.8   0.1  xanthomonas secretion 
10  YP_003375982.1  PrgI   2.7e-07     32.8   0.1  xanthomonas secretion 
11  YP_003375983.1  PrgI   6.3e-08     34.8   0.5  xanthomonas secretion 
12  YP_003375984.1  PrgH   2.6e-60    206.1     0  xanthomonas secretion 
13  YP_003375985.1     -         -        -     -  xanthomonas secretion 
14  YP_003375986.1  InvG   1.8e-25       91   0.9  xanthomonas secretion 
15  YP_003375987.1  InvE   1.4e-20     75.8     6  xanthomonas secretion 
16  YP_003375988.1  InvA  7.3e-203    677.3   5.4  xanthomonas secretion 
17  YP_003375989.1     -         -        -     -  xanthomonas secretion 
18  YP_003375990.1  InvC   2.3e-63    215.4   0.1  xanthomonas secretion 
19  YP_003375991.1     -         -        -     -  xsa-associated protein
20  YP_003375992.1     -         -        -     -  xsa-associated protein
21  YP_003375993.1  SpaO   3.6e-19     69.9   0.1  xanthomonas secretion 
22  YP_003375994.1  SpaP   4.3e-62    211.1  11.4  xanthomonas secretion 
23  YP_003375995.1  SpaQ   9.8e-26       91   8.2  xanthomonas secretion 
24  YP_003375996.1  SpaR   1.4e-40    140.9  20.1  xanthomonas secretion 
25  YP_003375997.1  SpaS   1.6e-83    282.3   0.4  xanthomonas secretion 
26  YP_003375998.1     -         -        -     -  xsa-invasion chaperone
27  YP_003375999.1  SipB   5.8e-26     93.4   4.2    xsa-invasion protein
28  YP_003376000.1  SipC   5.8e-17     63.5    26    xsa-invasion protein


---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Salmonella enterica subsp. enterica serovar Typhi str. CT18
Locus id: 4521894688
      accession query    evalue bitscore  bias                    name
0   NP_457264.1  OrgA   5.9e-70    236.4   3.9   cell invasion protein
1   NP_457265.1  PrgK   8.7e-59    199.9     0  pathogenicity 1 island
2   NP_457266.1  PrgJ   1.1e-07     33.6     0  pathogenicity 1 island
3   NP_457267.1  PrgI   7.1e-10     41.1   0.1  pathogenicity 1 island
4   NP_457268.1  PrgH  1.1e-152    510.1     0  pathogenicity 1 island
5   NP_457269.1     -         -        -     -  AraC family transcript
6   NP_457270.1     -         -        -     -  invasion protein regul
7   NP_457271.1     -         -        -     -   cell invasion protein
8   NP_457272.1     -         -        -     -    tyrosine phosphatase
9   NP_457273.1     -         -        -     -       chaperone protein
10  NP_457274.1     -         -        -     -    hypothetical protein
11  NP_457275.1     -         -        -     -    acyl carrier protein
12  NP_457276.1     -         -        -     -  pathogenicity island 1
13  NP_457277.1  SipD  2.7e-175      584    11  pathogenicity island 1
14  NP_457278.1  SipC  2.3e-135    452.8  34.3  pathogenicity island 1
15  NP_457279.1  SipB   1.4e-70    239.9  24.3  pathogenicity island 1
16  NP_457280.1     -         -        -     -  chaperone protein SicA
17  NP_457281.1  SpaS  3.4e-117      393   4.4       secretory protein
18  NP_457282.1  SpaR   1.7e-78    265.2  16.6       secretory protein
19  NP_457283.1  SpaQ   2.2e-25     89.9   0.8       secretory protein
20  NP_457284.1  SpaP   3.3e-66    224.5   3.7       secretory protein
21  NP_457285.1  SpaO   1.8e-16     61.2     0  surface presentation o
22  NP_457286.1  InvJ  4.8e-213    708.1   7.2  surface presentation o
23  NP_457287.1     -         -        -     -       secretory protein
24  NP_457288.1  InvC   9.1e-70    236.3     0  secretory apparatus AT
25  NP_457289.1     -         -        -     -       secretory protein
26  NP_457290.1  InvA  2.1e-224    748.5   9.3       secretory protein
27  NP_457291.1  InvE   5.3e-42    145.5   0.4   cell invasion protein
28  NP_457292.1  InvG   1.6e-30    107.4   0.6       secretory protein
29  NP_457293.1     -         -        -     -  AraC family transcript
30  NP_457294.1  InvH   6.8e-96    319.8     6  cell adherance/invasio

Can get more detail about locus hits by setting `hit_detail_table=True. In any case, the hsps from hmmer are mapped onto the ORFs. Sometimes a single ORF will match to multiple queries ... these will be revealed in the hit detail table.


In [36]:
hmmerclust.LocusView(invg_loci[0], hit_detail_table=True)



---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Yersinia pestis CO92
Locus id: 4503118120
         accession query    evalue bitscore  bias                    name
0   YP_002345337.1  InvG     3e-32    113.1   4.7  type III secretion pro
1   YP_002345338.1  PrgH      0.61     10.7   0.1  type-III secretion pro
2   YP_002345339.1     -         -        -     -  type III secretion app
3   YP_002345341.1  PrgI   6.9e-06     28.3   0.5  type III secretion app
4   YP_002345342.1     -         -        -     -  type III secretion app
5   YP_002345343.1  PrgI     1e-10     43.8   2.7  type III secretion app
6   YP_002345344.1  PrgK   9.8e-58    196.5   1.2  type III secretion sys
7   YP_002345345.1     -         -        -     -    hypothetical protein
8   YP_002345346.1     -         -        -     -  type III secretion sys
9   YP_002345347.1  InvA    1e-226    756.2   4.3  secretion system appar
10  YP_002345348.1  InvC   6.4e-71    240.1     0  type III secretion sys
11  YP_002345349.1     -         -        -     -  type III secretion sys
12  YP_002345351.1  SpaO     1e-19     71.6     0  type III secretion sys
13  YP_002345352.1  SpaP   8.1e-71    239.5  10.6  type III secretion sys
14  YP_002345353.1  SpaQ   2.2e-27     96.3  14.1  type III secretion app
15  YP_002345354.1  SpaR   6.4e-62    210.8  23.1  type III secretion app
16  YP_002345355.1  SpaS  2.7e-110    370.3   0.7  secretion system appar
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
InvG     113.1  3.000000e-32   4.7        359      520
      bitscore  evalue  bias  hsp_start  hsp_end
name                                            
PrgH      10.7    0.61   0.1          4       34
PrgH      10.7    0.61   0.0        165      342
      bitscore    evalue  bias  hsp_start  hsp_end
name                                              
PrgI      28.3  0.000007   0.5          1       70
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
PrgI      43.8  1.000000e-10   2.7          1       82
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
PrgK     196.5  9.800000e-58   1.2          0      202
      bitscore         evalue  bias  hsp_start  hsp_end
name                                                   
InvA     756.2  1.000000e-226   4.3         26      665
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
InvC     240.1  6.400000e-71     0        148      359
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaO      71.6  1.000000e-19   0.0        146      179
SpaO      71.6  1.000000e-19   0.1        224      300
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaP     239.5  8.100000e-71  10.6         12      212
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaQ      96.3  2.200000e-27  14.1         11       87
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaR     210.8  6.400000e-62  23.1          2      251
      bitscore         evalue  bias  hsp_start  hsp_end
name                                                   
SpaS     370.3  2.700000e-110   0.7          0      342
Out[36]:
<hmmerclust.hmmerclust.LocusView instance at 0x10c423320>

Filtering and exporting sequences

Finally, sequences can be exported for multiple sequence alignments, etc. Use pandas define the criteria for the exported sequences ... e.g. here we export sequences that belong to a locus with e-value less than 0.01


In [37]:
#836 protein in the total search
len(df.df)


Out[37]:
836

In [38]:
#remember the list of invg loci we made in the previous step
invg_loci


Out[38]:
[<hmmerclust.hmmerclust.Locus instance at 0x10c682128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cf043f8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d86a320>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ca8b050>,
 <hmmerclust.hmmerclust.Locus instance at 0x10f7291b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cab72d8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10da0c3f8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ce7f7e8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ccdf200>,
 <hmmerclust.hmmerclust.Locus instance at 0x10bee9098>,
 <hmmerclust.hmmerclust.Locus instance at 0x10bb55878>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e60d128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c796050>,
 <hmmerclust.hmmerclust.Locus instance at 0x10f742098>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d2a3368>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d1792d8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cf474d0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cb390e0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c2981b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e3491b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c7143f8>]

In [39]:
#make a new df that only has proteins in the invg_loci
invg_only_df = df.df[df.df['locus_id'].isin(invg_loci)]
len(invg_only_df)


Out[39]:
288

In [40]:
#now filter this by evalue < 0.01
evalue_cut = invg_only_df[invg_only_df.hit_evalue < 0.01]

In [41]:
#down to 278 proteins
len(evalue_cut)


Out[41]:
278

In [42]:
#export all these proteins as fasta files
hmmerclust.RelatedProteinGroup(evalue_cut)


Out[42]:
<hmmerclust.hmmerclust.RelatedProteinGroup instance at 0x10c27ecf8>

In [43]:
#the hits are grouped by name, for MSA,  in this new folder
!ls group_fastas/


InvA.fasta InvE.fasta InvH.fasta OrgA.fasta PrgH.fasta PrgJ.fasta PscP.fasta SipC.fasta SpaO.fasta SpaQ.fasta SpaS.fasta
InvC.fasta InvG.fasta InvJ.fasta OrgB.fasta PrgI.fasta PrgK.fasta SipB.fasta SipD.fasta SpaP.fasta SpaR.fasta

In [ ]: