hmmerclust

A python package for detecting a gene cluster of interest in a set of bacterial genomes followed by interactive analysis of the results using ipython/jupyter notebook. For example, it could be used for identifying genetic loci that encode biosynthetic pathways or multicomponent protein assemblies, comparing how these loci differ across taxonomical groups, and automatically extracting sequences for subsequent phylogenetic analysis or protein expression.

Please see the demo folder, which contains an executable notebook demonstrating the functionality.

Requirements

python2.7/pip
virtualenv
hmmer installed on system
ipython/notebook
pandas
matplotlib
biopython

User knowledge

Python
How to use the pandas data analysis library

Input files

Genbank genome sequence files
A folder with a set of multiple sequence alignments representing proteins of interest
a settings file for defining colors, tables, and abbreviations
Please see the demo folder, which contains an executable notebook demonstrating the whole process.

How it works

Protein coding sequences are extracted from all the genome genbank files and combined into a single fasta
Hmmbuild generates a hidden markov model from the input alignments
Hmmsearch queries the combined proteome fasta using the hmms.
All the data describing the organism info and taxonomy, the identified proteins, and hit statitics are stored in an object-oriented database.
Hits are annotated based on the best-scoring hit. But other hits are preserved in the database if they need to be compared later
The hits are clustered together based on the location of their coding sequence on the chromosome by user-specified criteria (e.g. how many CDS represent a cluster, and how far apart should they be). The hits are stored together as 'loci objects'.
A pandas dataframe is generated with all the hit data. This allows for interactive analysis of the results, hit statistics, etc.

Scripts for visualizaing and analyzing results

A heatmap for visualizing presence or absence of cluster components across a large number of species. The species can be optionally sorted based on their position in a phylogenetic tree.
Locus maps w/ tables that describe the hits.
Functions for extracting entire loci or sets of annotated proteins for doing multiple sequence alignments or phylogenetic analysis in a separate program.

Installation

Make a virtual envelope

First, install virtualenv and virtualenvwrapper.

Then make a virtualenv for hmmerclust:

$ mkvirtualenv -p /usr/bin/python hmmerclust
$ workon hmmerclust

Install requirements and start the notebok

$ pip install ipython[all] matplotlib pandas biopython hmmerclust
$ ipython notebook

Setting up the files

3 key things

Alignment folder
Genome folder
settings.py file
16S_aligned.csv (for sorting results by phylogentic tree. Just make an empty file to start)



In [8]:

    
from hmmerclust import hmmerclust
import settings

Set up the genome directory and define genome variables

You can use the fetch_gbwithparts() function to download genomes from NCBI. Here, we just download a few. In an actual search I might use all the genomes from RefSeq. See a list here



In [9]:

    
accessions_of_interest = ['NC_002516',
 'NC_002695',
 'NC_003143',
 'NC_003198',
 'NC_007645',
 'NC_007650',
 'NC_007651',
 'NC_007760',
 'NC_010287',
 'NC_013722',
 'NC_013971',
 'NC_015677',
 'NC_018518',
 'NC_022904']



In [10]:

    
#download these from NCBI... may take 5 minutes

hmmerclust.fetch_gbwithparts(accessions_of_interest, 'your@email.com', './genomes/')









    



downloading genomes... please wait
done downloading NC_002516
done downloading NC_002695
done downloading NC_003143
done downloading NC_003198
done downloading NC_007645
done downloading NC_007650
done downloading NC_007651
done downloading NC_007760
done downloading NC_010287
done downloading NC_013722
done downloading NC_013971
done downloading NC_015677
done downloading NC_018518
done downloading NC_022904



In [11]:

    
ls ./genomes/









    



NC_002516.gb  NC_003143.gb  NC_007645.gb  NC_007651.gb  NC_010287.gb  NC_013971.gb  NC_018518.gb
NC_002695.gb  NC_003198.gb  NC_007650.gb  NC_007760.gb  NC_013722.gb  NC_015677.gb  NC_022904.gb



In [12]:

    
genome_list = !ls ./genomes/
genome_dir = './genomes/'

Make a database from these genomes

Now it is time to populate our database with organism data. After you do this once, a fasta will be generated. If you already have the fasta from a previous run, set freshfasta=False because the file can be very large and don't want to always generate this and have multiple copies on your system.



In [13]:

    
db = hmmerclust.OrganismDB('t3ss_database',
                           genome_list,
                           genome_dir,
                           freshfasta=True)









    



making combined fasta for NC_002516.gb
5572 proteins were added
5572 unique proteins were added -- dropping redundant ones
making combined fasta for NC_002695.gb
10776 proteins were added
10776 unique proteins were added -- dropping redundant ones
making combined fasta for NC_003143.gb
14574 proteins were added
14574 unique proteins were added -- dropping redundant ones
making combined fasta for NC_003198.gb
18685 proteins were added
18685 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007645.gb
24962 proteins were added
24945 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007650.gb
27318 proteins were added
27301 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007651.gb
30594 proteins were added
30577 unique proteins were added -- dropping redundant ones
making combined fasta for NC_007760.gb
35020 proteins were added
35002 unique proteins were added -- dropping redundant ones
making combined fasta for NC_010287.gb
35905 proteins were added
35887 unique proteins were added -- dropping redundant ones
making combined fasta for NC_013722.gb
39019 proteins were added
39001 unique proteins were added -- dropping redundant ones
making combined fasta for NC_013971.gb
42326 proteins were added
42306 unique proteins were added -- dropping redundant ones
making combined fasta for NC_015677.gb
46171 proteins were added
46150 unique proteins were added -- dropping redundant ones
making combined fasta for NC_018518.gb
49788 proteins were added
49563 unique proteins were added -- dropping redundant ones
making combined fasta for NC_022904.gb
54559 proteins were added
54319 unique proteins were added -- dropping redundant ones
Adding organism attributes for NC_002516.gb
Adding organism attributes for NC_002695.gb
Adding organism attributes for NC_003143.gb
Adding organism attributes for NC_003198.gb
Adding organism attributes for NC_007645.gb
Adding organism attributes for NC_007650.gb
Adding organism attributes for NC_007651.gb
Adding organism attributes for NC_007760.gb
Adding organism attributes for NC_010287.gb
Adding organism attributes for NC_013722.gb
Adding organism attributes for NC_013971.gb
Adding organism attributes for NC_015677.gb
Adding organism attributes for NC_018518.gb
Adding organism attributes for NC_022904.gb



In [14]:

    
db.organisms









    Out[14]:





[<hmmerclust.hmmerclust.Organism at 0x10731a950>,
 <hmmerclust.hmmerclust.Organism at 0x10a3f06d0>,
 <hmmerclust.hmmerclust.Organism at 0x10a00c210>,
 <hmmerclust.hmmerclust.Organism at 0x10a4789d0>,
 <hmmerclust.hmmerclust.Organism at 0x10a4788d0>,
 <hmmerclust.hmmerclust.Organism at 0x10c284090>,
 <hmmerclust.hmmerclust.Organism at 0x10f74d5d0>,
 <hmmerclust.hmmerclust.Organism at 0x10f74d590>,
 <hmmerclust.hmmerclust.Organism at 0x10c9ed090>,
 <hmmerclust.hmmerclust.Organism at 0x10c340450>,
 <hmmerclust.hmmerclust.Organism at 0x10c340410>,
 <hmmerclust.hmmerclust.Organism at 0x10c340050>,
 <hmmerclust.hmmerclust.Organism at 0x10a4a4c90>,
 <hmmerclust.hmmerclust.Organism at 0x10c4de050>]



In [15]:

    
for org in db.organisms:
    print org.name









    



Pseudomonas aeruginosa PAO1
Escherichia coli O157:H7 str. Sakai
Yersinia pestis CO92
Salmonella enterica subsp. enterica serovar Typhi str. CT18
Hahella chejuensis KCTC 2396
Burkholderia thailandensis E264
Burkholderia thailandensis E264
Anaeromyxobacter dehalogenans 2CP-C
Chlamydia trachomatis 434/Bu
Xanthomonas albilineans GPE PC73
Erwinia amylovora ATCC 49946
Ramlibacter tataouinensis TTB310
Bordetella pertussis 18323
Pandoraea pnomenusa 3kgm

the organism object encapsulates this data:



In [16]:

    
db.organisms[1].__dict__









    Out[16]:





{'accesion_version': 'NC_002695.1',
 'accession': 'NC_002695',
 'clazz': 'Gammaproteobacteria',
 'description': 'Escherichia coli O157:H7 str. Sakai chromosome, complete genome.',
 'family': 'Enterobacteriaceae',
 'genome_length': 5498450,
 'genome_path': './genomes/NC_002695.gb',
 'genus': 'Escherichia',
 'kingdom': 'Bacteria',
 'loci': [],
 'name': 'Escherichia coli O157:H7 str. Sakai',
 'order': 'Enterobacteriales',
 'parent_db': <hmmerclust.hmmerclust.OrganismDB instance at 0x10a486ab8>,
 'phylum': 'Proteobacteria',
 'proteins': [],
 'rRNA16S_sequence': None,
 'species': 'Escherichia coli',
 'taxonomy': ['Bacteria',
  'Proteobacteria',
  'Gammaproteobacteria',
  'Enterobacteriales',
  'Enterobacteriaceae',
  'Escherichia'],
 'tree_order': 0}



In [17]:

    
#notice a combined_fasta file was created when the db was made
!ls









    



16S_aligned.csv       combined_fasta        hhsearch_results      locus_fastas          settings.pyc
__init__.py           genomes               hmm                   out.svg
alignments            group_fastas          hmmerclust_demo.ipynb settings.py

Carry out the search against the combined_fasta

Set up the /alignments folder. Alignments FASTA and must have txt extension. In this case, I queried PFAM for the proteins located in my gene cluster of interest (here, the type III secretion system) and downloaded the hmm seed alignments. It is important to name the files starting with how you want to the search results to be annotated later, followed by an underscore. It doesn't matter what goes after the underscore.



In [18]:

    
!ls alignments









    



InvA_PF00771_seed.txt InvJ_PF02510_seed.txt PrgJ_PB000379.txt     SipD_PF06511_seed.txt SpaS_PF01312_seed.txt
InvC_PF00006_seed.txt OrgA_PF09482_seed.txt PrgK_PF01514_seed.txt SpaO_PF01052_seed.txt
InvE_PF07201_seed.txt OrgB_PB004806.txt     PscP_PF02120_seed.txt SpaP_PF00813_seed.txt
InvG_PF00263_seed.txt PrgH_PF09480_seed.txt SipB_PF04888_seed.txt SpaQ_PF01313_seed.txt
InvH_PF04741_seed.txt PrgI_PF09392_seed.txt SipC_PF09599_seed.txt SpaR_PF01311_seed.txt



In [19]:

    
combined_fasta = './combined_fasta'
s = hmmerclust.HmmSearch(db, combined_fasta, 
                         freshbuild=True,
                         freshsearch=True)









    



building Hmm for InvA_PF00771_seed.txt
building Hmm for InvC_PF00006_seed.txt
building Hmm for InvE_PF07201_seed.txt
building Hmm for InvG_PF00263_seed.txt
building Hmm for InvH_PF04741_seed.txt
building Hmm for InvJ_PF02510_seed.txt
building Hmm for OrgA_PF09482_seed.txt
building Hmm for OrgB_PB004806.txt
building Hmm for PrgH_PF09480_seed.txt
building Hmm for PrgI_PF09392_seed.txt
building Hmm for PrgJ_PB000379.txt
building Hmm for PrgK_PF01514_seed.txt
building Hmm for PscP_PF02120_seed.txt
building Hmm for SipB_PF04888_seed.txt
building Hmm for SipC_PF09599_seed.txt
building Hmm for SipD_PF06511_seed.txt
building Hmm for SpaO_PF01052_seed.txt
building Hmm for SpaP_PF00813_seed.txt
building Hmm for SpaQ_PF01313_seed.txt
building Hmm for SpaR_PF01311_seed.txt
building Hmm for SpaS_PF01312_seed.txt
hhbuild complete for ['InvA', 'InvC', 'InvE', 'InvG', 'InvH', 'InvJ', 'OrgA', 'OrgB', 'PrgH', 'PrgI', 'PrgJ', 'PrgK', 'PscP', 'SipB', 'SipC', 'SipD', 'SpaO', 'SpaP', 'SpaQ', 'SpaR', 'SpaS']
running HHsearch on InvA
running HHsearch on InvC
running HHsearch on InvE
running HHsearch on InvG
running HHsearch on InvH
running HHsearch on InvJ
running HHsearch on OrgA
running HHsearch on OrgB
running HHsearch on PrgH
running HHsearch on PrgI
running HHsearch on PrgJ
running HHsearch on PrgK
running HHsearch on PscP
running HHsearch on SipB
running HHsearch on SipC
running HHsearch on SipD
running HHsearch on SpaO
running HHsearch on SpaP
running HHsearch on SpaQ
running HHsearch on SpaR
running HHsearch on SpaS
extracted 39 hits for InvA.out
extracted 296 hits for InvC.out
extracted 19 hits for InvE.out
extracted 63 hits for InvG.out
extracted 1 hits for InvH.out
extracted 3 hits for InvJ.out
extracted 8 hits for OrgA.out
extracted 2 hits for OrgB.out
extracted 12 hits for PrgH.out
extracted 41 hits for PrgI.out
extracted 13 hits for PrgJ.out
extracted 41 hits for PrgK.out
extracted 47 hits for PscP.out
extracted 29 hits for SipB.out
extracted 4 hits for SipC.out
extracted 8 hits for SipD.out
extracted 59 hits for SpaO.out
extracted 39 hits for SpaP.out
extracted 38 hits for SpaQ.out
extracted 40 hits for SpaR.out
extracted 42 hits for SpaS.out
adding proteins to organism NC_002516
adding proteins to organism NC_002695
adding proteins to organism NC_003143
adding proteins to organism NC_003198
adding proteins to organism NC_007645
adding proteins to organism NC_007650
adding proteins to organism NC_007651
adding proteins to organism NC_007760
adding proteins to organism NC_010287
adding proteins to organism NC_013722
adding proteins to organism NC_013971
adding proteins to organism NC_015677
adding proteins to organism NC_018518
adding proteins to organism NC_022904
adding SearchIO hit objects for NC_002516
adding SearchIO hit objects for NC_002695
adding SearchIO hit objects for NC_003143
adding SearchIO hit objects for NC_003198
adding SearchIO hit objects for NC_007645
adding SearchIO hit objects for NC_007650
adding SearchIO hit objects for NC_007651
adding SearchIO hit objects for NC_007760
adding SearchIO hit objects for NC_010287
adding SearchIO hit objects for NC_013722
adding SearchIO hit objects for NC_013971
adding SearchIO hit objects for NC_015677
adding SearchIO hit objects for NC_018518
adding SearchIO hit objects for NC_022904
setting best hit values for Pseudomonas aeruginosa PAO1
setting best hit values for Escherichia coli O157:H7 str. Sakai
setting best hit values for Yersinia pestis CO92
setting best hit values for Salmonella enterica subsp. enterica serovar Typhi str. CT18
setting best hit values for Hahella chejuensis KCTC 2396
setting best hit values for Burkholderia thailandensis E264
setting best hit values for Burkholderia thailandensis E264
setting best hit values for Anaeromyxobacter dehalogenans 2CP-C
setting best hit values for Chlamydia trachomatis 434/Bu
setting best hit values for Xanthomonas albilineans GPE PC73
setting best hit values for Erwinia amylovora ATCC 49946
setting best hit values for Ramlibacter tataouinensis TTB310
setting best hit values for Bordetella pertussis 18323
setting best hit values for Pandoraea pnomenusa 3kgm



In [45]:

    
#hits now found in the db object
db.organisms[1].proteins









    Out[45]:





[NP_311099.1 - ABC transporter ATP-binding protein,
 NP_310807.1 - hypothetical protein,
 NP_312703.1 - ATP synthase F0F1 subunit alpha,
 NP_312230.1 - ABC transporter ATP-binding protein,
 NP_311368.2 - glycine cleavage system transcriptional repressor,
 NP_313267.1 - hypothetical protein,
 NP_312611.1 - hypothetical protein,
 NP_309773.1 - oligopeptide ABC transporter ATP-binding protein,
 NP_309779.1 - transporter,
 NP_312226.1 - hypothetical protein,
 NP_311745.1 - EprI,
 NP_311753.1 - surface presentation of antigens protein SpaO,
 NP_312260.1 - outer membrane porin HofQ,
 NP_311743.1 - EprK,
 NP_311757.1 - ATP synthase SpaL,
 NP_311751.1 - EpaQ,
 NP_311759.1 - EivE,
 NP_312472.2 - hypothetical protein,
 NP_312579.1 - protein EscF,
 NP_310061.1 - potassium-tellurite ethidium and proflavin transporter,
 NP_308962.1 - putrescine transporter ATP-binding protein,
 NP_312609.1 - EscS,
 NP_310704.1 - flagellar MS-ring protein,
 NP_312084.1 - ATP-dependent metalloprotease,
 NP_310682.1 - amino-acid ABC transporter ATP-binding protein,
 NP_308654.1 - iron-enterobactin transporter ATP-binding protein,
 NP_309004.1 - recombination factor protein RarA,
 NP_312595.1 - EscN,
 NP_310714.1 - flagellar biosynthesis protein FliP,
 NP_310716.1 - flagellar biosynthesis protein FliR,
 NP_308868.1 - tail assembly protein,
 NP_310712.1 - flagellar motor switch protein FliN,
 NP_310616.1 - flagellar biosynthesis protein FlhA,
 NP_308283.1 - FhiA protein,
 NP_309894.1 - peptide ABC transporter ATP-binding protein,
 NP_312608.1 - EscT,
 NP_308440.2 - ferric transporter ATP-binding protein,
 NP_312599.1 - hypothetical protein,
 NP_310114.1 - ABC transporter ATP-binding protein,
 NP_312885.1 - ATP-dependent protease ATP-binding protein HslU,
 NP_312610.1 - type III secretion system protein,
 NP_310715.1 - flagellar biosynthesis protein FliQ,
 NP_309863.1 - anthranilate synthase component I,
 NP_311754.1 - EivJ,
 NP_311758.1 - EivA,
 NP_312602.1 - EscC,
 NP_309060.1 - ABC transporter ATPase,
 NP_312701.1 - ATP synthase F0F1 subunit beta,
 NP_308935.1 - glutathione transporter ATP-binding protein,
 NP_311746.1 - EprH,
 NP_312743.1 - transcription termination factor Rho,
 NP_311640.1 - hypothetical protein,
 NP_310711.1 - flagellar motor switch protein FliM,
 NP_311744.1 - EprJ,
 NP_311742.1 - hypothetical protein,
 NP_312592.1 - SepQ,
 NP_312607.1 - secretion system apparatus protein SsaU,
 NP_310617.1 - flagellar biosynthesis protein FlhB,
 NP_308520.1 - DNA-binding ATP-dependent protease La,
 NP_308681.1 - citrate lyase subunit alpha,
 NP_309724.1 - ferric enterobactin transport ATP-binding protein,
 NP_312600.1 - EscJ,
 NP_309774.1 - hypothetical protein,
 NP_311760.1 - EivG,
 NP_312584.1 - SepL,
 NP_310707.1 - flagellum-specific ATP synthase,
 NP_310391.1 - lipoprotein,
 NP_313107.1 - phosphonate C-P lyase system protein PhnK,
 NP_311440.1 - ABC transporter ATP-binding protein,
 NP_312596.1 - hypothetical protein,
 NP_312582.1 - protein EspD,
 NP_311752.1 - surface presentation of antigens protein SpaP,
 NP_312581.1 - protein EspB,
 NP_310709.1 - flagellar hook-length control protein,
 NP_311748.1 - surface presentation of antigens protein SpaS,
 NP_313376.1 - ABC transporter ATP-binding protein,
 NP_310705.1 - flagellar motor switch protein G,
 NP_310706.2 - flagellar assembly protein H,
 NP_310708.1 - flagellar biosynthesis chaperone,
 NP_310710.1 - flagellar basal body-associated protein FliL,
 NP_310713.2 - flagellar biosynthesis protein FliO,
 NP_311747.1 - transcriptional regulator,
 NP_311755.1 - hypothetical protein,
 NP_311756.1 - EivI,
 NP_312580.1 - hypothetical protein,
 NP_312583.1 - protein EspA,
 NP_312585.1 - EscD,
 NP_312586.1 - gamma intimin,
 NP_312587.1 - protein CesT,
 NP_312588.1 - hypothetical protein,
 NP_312589.1 - hypothetical protein,
 NP_312590.1 - hypothetical protein,
 NP_312591.1 - hypothetical protein,
 NP_944569.1 - hypothetical protein,
 NP_312593.1 - hypothetical protein,
 NP_312594.1 - hypothetical protein,
 NP_312597.1 - hypothetical protein,
 NP_312598.1 - SepZ,
 NP_312601.1 - SepD,
 NP_312603.1 - CesD,
 NP_312604.1 - hypothetical protein,
 NP_312605.1 - negative regulator GrlR,
 NP_312606.1 - hypothetical protein]

cluster the hits as loci

Now that the hits have been associated with each organism, they can be clustered into loci based on their proximity in the genome. Here, we search for loci with minimum of 5 of the search proteins within a 15k bp of each other.

Also set the COLOUR_DICT setting here which will determine the locus map color.

In the settings file there are definitions for how results will be displayed later in some of the scripts. The COLOUR_DICT variable indicates how ORFs will be coloured in the locus maps. The HEATMAP_COLUMNS variable are how proteins are ordered in the heatmap scripts. the HEATMAP_ABBREVIATIONS variable are 1 letter codes for proteins that should reflect the same order as the columns. If these are not defined, the will be generated automatically.



In [21]:

    
!cat settings.py









    



COLOUR_DICT = {'InvC': "#74bc78",
        'SpaO': "#ec9b6a",   
        'SpaP': "#ec9b6a",
        'SpaQ': "#ec9b6a",
        'SpaR': "#ec9b6a",
        'SpaS': "#ec9b6a",
        'InvA': "#00adef",  
        'PscP': "#f8ed31",
        'OrgB': "#ec9b6a",
        'OrgA': "#ec9b6a",
        'PrgJ': "#8cc63e",
        'PrgK': "#f05a28",
        'PrgH': "#2a388f",
        'InvG': "#9aafcc",
        'InvH': '#9e1e62',
        'InvE': "#00adef",
        'PrgI': "#d91b5b",
        'InvJ': "#00adef",
        'SipC': '#652c90',
        'SipB': '#652c90',
        'SipD': '#652c90',}

HEATMAP_COLUMNS = ['InvC',
        'SpaO',
        'SpaP',
        'SpaQ',
        'SpaR',
        'SpaS',
        'InvA',
        'PscP',
        'OrgB',
        'OrgA',
        'PrgJ',
        'PrgK',
        'PrgH',
        'InvG',
        'InvH',
        'InvE',
        'PrgI',
        'InvJ',
        'SipC',
        'SipB',
        'SipD']

HEATMAP_ABBREVIATIONS = ['C', 'O', 'P', 'Q', 'R', 'S', 'A', 'P', 'B', 'A',
'J', 'K', 'H', 'G', 'H', 'E', 'I', 'J', 'C', 'B', 'D']



In [22]:

    
db.find_loci(5, 15000, colordict=settings.COLOUR_DICT)









    



finding loci for Pseudomonas aeruginosa PAO1
total of 2 found for Pseudomonas aeruginosa PAO1
finding loci for Escherichia coli O157:H7 str. Sakai
total of 3 found for Escherichia coli O157:H7 str. Sakai
finding loci for Yersinia pestis CO92
total of 3 found for Yersinia pestis CO92
finding loci for Salmonella enterica subsp. enterica serovar Typhi str. CT18
total of 3 found for Salmonella enterica subsp. enterica serovar Typhi str. CT18
finding loci for Hahella chejuensis KCTC 2396
total of 4 found for Hahella chejuensis KCTC 2396
finding loci for Burkholderia thailandensis E264
total of 3 found for Burkholderia thailandensis E264
finding loci for Burkholderia thailandensis E264
total of 1 found for Burkholderia thailandensis E264
finding loci for Anaeromyxobacter dehalogenans 2CP-C
total of 2 found for Anaeromyxobacter dehalogenans 2CP-C
finding loci for Chlamydia trachomatis 434/Bu
total of 1 found for Chlamydia trachomatis 434/Bu
finding loci for Xanthomonas albilineans GPE PC73
total of 2 found for Xanthomonas albilineans GPE PC73
finding loci for Erwinia amylovora ATCC 49946
total of 5 found for Erwinia amylovora ATCC 49946
finding loci for Ramlibacter tataouinensis TTB310
total of 1 found for Ramlibacter tataouinensis TTB310
finding loci for Bordetella pertussis 18323
total of 2 found for Bordetella pertussis 18323
finding loci for Pandoraea pnomenusa 3kgm
total of 4 found for Pandoraea pnomenusa 3kgm



In [23]:

    
# loci attribute now populated
db.organisms[1].loci









    Out[23]:





[<hmmerclust.hmmerclust.Locus instance at 0x10d9573b0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e60d128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c796050>]

Make a pandas dataframe using the db

Use the power of pandas to make sense of the results from a phylogenetic perspective, or filter the results based on the hmmer statistics.



In [24]:

    
df = hmmerclust.FinalDataFrame(db)









    



Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9, 
            ...
            826, 827, 828, 829, 830, 831, 832, 833, 834, 835],
           dtype='int64', length=836)



In [25]:

    
# each row represents a hit from our search
# For e.g., index the first row

df.df.head()









    Out[25]:






  
    
      
      org_name
      org_acc
      org_phylum
      org_class
      org_order
      org_family
      org_genus
      org_species
      org_tree_order
      org_genome_length
      ...
      prot_acc
      prot_gi
      prot_product
      prot_translation
      prot_numb_of_res
      hit_query
      hit_evalue
      hit_bitscore
      hit_bias
      locus_id
    
  
  
    
      0
      Pseudomonas aeruginosa PAO1
      NC_002516
      Proteobacteria
      Gammaproteobacteria
      Pseudomonadales
      Pseudomonadaceae
      Pseudomonas
      Pseudomonas aeruginosa
      0
      6264404
      ...
      NP_250132.1
      15596638
      hypothetical protein
      MAVAPGVLLPPTPDVKPKAAAPKSQQKTPEPSNDKTSSFSDMYAKE...
      427
      PscP
      1.100000e-21
      78.1
      1.1
      <hmmerclust.hmmerclust.Locus instance at 0x10a...
    
    
      1
      Pseudomonas aeruginosa PAO1
      NC_002516
      Proteobacteria
      Gammaproteobacteria
      Pseudomonadales
      Pseudomonadaceae
      Pseudomonas
      Pseudomonas aeruginosa
      0
      6264404
      ...
      NP_250414.1
      15596920
      type III export protein PscJ
      MRRTVKGLSRMALLALVLALGGCKVELYTGISQKEGNEMLALLRSE...
      248
      PrgK
      2.300000e-72
      244.3
      0.2
      <hmmerclust.hmmerclust.Locus instance at 0x10c...
    
    
      2
      Pseudomonas aeruginosa PAO1
      NC_002516
      Proteobacteria
      Gammaproteobacteria
      Pseudomonadales
      Pseudomonadaceae
      Pseudomonas
      Pseudomonas aeruginosa
      0
      6264404
      ...
      NP_250134.1
      15596640
      flagellar motor switch protein FliM
      MAVQDLLSQDEIDALLHGVDDGLVETEVEATPGSVKSYDLTSQDRI...
      323
      SpaO
      1.300000e-17
      64.9
      0.2
      <hmmerclust.hmmerclust.Locus instance at 0x10a...
    
    
      3
      Pseudomonas aeruginosa PAO1
      NC_002516
      Proteobacteria
      Gammaproteobacteria
      Pseudomonadales
      Pseudomonadaceae
      Pseudomonas
      Pseudomonas aeruginosa
      0
      6264404
      ...
      NP_249795.1
      15596301
      flagellum-specific ATP synthase
      MRLERTSFARRLEGYTEAVSLPAQPVVEGRLLRMVGLTLEAEGLQA...
      451
      InvC
      1.700000e-72
      245.2
      0.0
      None
    
    
      4
      Pseudomonas aeruginosa PAO1
      NC_002516
      Proteobacteria
      Gammaproteobacteria
      Pseudomonadales
      Pseudomonadaceae
      Pseudomonas
      Pseudomonas aeruginosa
      0
      6264404
      ...
      NP_250138.1
      15596644
      flagellar biosynthesis protein FliQ
      MTPEVALDLFREALWLTAMIVGVLVVPSLLVGLVVAMFQAATQINE...
      89
      SpaQ
      1.800000e-29
      103.0
      8.8
      <hmmerclust.hmmerclust.Locus instance at 0x10a...
    
  

5 rows × 22 columns



In [26]:

    
df.df.ix[0]









    Out[26]:





org_name                                   Pseudomonas aeruginosa PAO1
org_acc                                                      NC_002516
org_phylum                                              Proteobacteria
org_class                                          Gammaproteobacteria
org_order                                              Pseudomonadales
org_family                                            Pseudomonadaceae
org_genus                                                  Pseudomonas
org_species                                     Pseudomonas aeruginosa
org_tree_order                                                       0
org_genome_length                                              6264404
org_prot_count                                                      82
org_numb_loci                                                        2
prot_acc                                                   NP_250132.1
prot_gi                                                       15596638
prot_product                                      hypothetical protein
prot_translation     MAVAPGVLLPPTPDVKPKAAAPKSQQKTPEPSNDKTSSFSDMYAKE...
prot_numb_of_res                                                   427
hit_query                                                         PscP
hit_evalue                                                     1.1e-21
hit_bitscore                                                      78.1
hit_bias                                                           1.1
locus_id             <hmmerclust.hmmerclust.Locus instance at 0x10a...
Name: 0, dtype: object



In [27]:

    
%pylab
%matplotlib inline
figsize(20,10)









    



Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib

Can use pandas to analyze the results by taxonomy, etc



In [28]:

    
# number of loci identified by family

figsize(5,5)
df.df.groupby(['org_family'])['locus_id'].nunique().plot(kind='bar')









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x10db0f290>

...or check stats like evalue for an individual protein



In [29]:

    
figsize(15,5)
df.df[df.df.hit_query=='InvG'].hit_evalue.order().plot(logy=True, kind='bar')









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b77af50>

hmmerclust has some functions for visualizing the results with a heatmap

quickly compare presence or absence of detected hits across loci/species
Set the cols & singleletters variables to order the columns by a specific order, with single letters on the heat map. The single letters are useful when looking at a very large number of organisms/loci
The heatmap is colored by # of identical components in the locus, for example below, PrgI was identified twice in some of the loci (todo... add the color bar legend..)
use the subset arg to specify that the locus must contain this component to be included in the heatmap



In [30]:

    
InvG_only = hmmerclust.HeatMap(df.df,
                  by_locus=True,
                  cols=settings.HEATMAP_COLUMNS,
                  singleletters=settings.HEATMAP_ABBREVIATIONS,
                  subset=['InvG'])

Notice if we set by_locus=False, the rows are no longer by locus, rather a hit can be anywhere in the genome. In this case we see a lot of InvC hits not associated with loci



In [31]:

    
hmmerclust.HeatMap(df.df,
                  by_locus=False,
                  cols=settings.HEATMAP_COLUMNS,
                  singleletters=settings.HEATMAP_ABBREVIATIONS)









    












    Out[31]:





<hmmerclust.hmmerclust.HeatMap instance at 0x10b80bf80>



In [32]:

    
#the heatmap is built from this unstacked pandas dataframe

InvG_only.unstacked_df.head()









    Out[32]:






  
    
      
      
      hit_query
      InvC
      SpaO
      SpaP
      SpaQ
      SpaR
      SpaS
      InvA
      PscP
      OrgB
      OrgA
      ...
      PrgK
      PrgH
      InvG
      InvH
      InvE
      PrgI
      InvJ
      SipC
      SipB
      SipD
    
    
      org_species
      locus_id
      org_tree_order
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Yersinia pestis
      <hmmerclust.hmmerclust.Locus instance at 0x10c682128>
      0
      1
      1
      1
      1
      1
      1
      1
      0
      0
      0
      ...
      1
      1
      1
      0
      0
      2
      0
      0
      0
      0
    
    
      Xanthomonas albilineans
      <hmmerclust.hmmerclust.Locus instance at 0x10cf043f8>
      0
      1
      1
      1
      1
      1
      1
      1
      0
      0
      1
      ...
      1
      1
      1
      0
      1
      2
      0
      1
      1
      1
    
    
      Salmonella enterica
      <hmmerclust.hmmerclust.Locus instance at 0x10d86a320>
      0
      1
      1
      1
      1
      1
      1
      1
      0
      0
      1
      ...
      1
      1
      1
      1
      1
      1
      1
      1
      1
      1
    
    
      <hmmerclust.hmmerclust.Locus instance at 0x10ca8b050>
      0
      1
      1
      1
      1
      1
      1
      1
      0
      0
      0
      ...
      1
      0
      1
      0
      1
      2
      0
      0
      2
      0
    
    
      Ramlibacter tataouinensis
      <hmmerclust.hmmerclust.Locus instance at 0x10f7291b8>
      0
      2
      1
      1
      1
      1
      1
      1
      1
      0
      0
      ...
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 21 columns

we could grab all the loci from this df:



In [33]:

    
InvG_only.unstacked_df.reset_index().locus_id









    Out[33]:





0     <hmmerclust.hmmerclust.Locus instance at 0x10c...
1     <hmmerclust.hmmerclust.Locus instance at 0x10c...
2     <hmmerclust.hmmerclust.Locus instance at 0x10d...
3     <hmmerclust.hmmerclust.Locus instance at 0x10c...
4     <hmmerclust.hmmerclust.Locus instance at 0x10f...
5     <hmmerclust.hmmerclust.Locus instance at 0x10c...
6     <hmmerclust.hmmerclust.Locus instance at 0x10d...
7     <hmmerclust.hmmerclust.Locus instance at 0x10c...
8     <hmmerclust.hmmerclust.Locus instance at 0x10c...
9     <hmmerclust.hmmerclust.Locus instance at 0x10b...
10    <hmmerclust.hmmerclust.Locus instance at 0x10b...
11    <hmmerclust.hmmerclust.Locus instance at 0x10e...
12    <hmmerclust.hmmerclust.Locus instance at 0x10c...
13    <hmmerclust.hmmerclust.Locus instance at 0x10f...
14    <hmmerclust.hmmerclust.Locus instance at 0x10d...
15    <hmmerclust.hmmerclust.Locus instance at 0x10d...
16    <hmmerclust.hmmerclust.Locus instance at 0x10c...
17    <hmmerclust.hmmerclust.Locus instance at 0x10c...
18    <hmmerclust.hmmerclust.Locus instance at 0x10c...
19    <hmmerclust.hmmerclust.Locus instance at 0x10e...
20    <hmmerclust.hmmerclust.Locus instance at 0x10c...
Name: locus_id, dtype: object



In [34]:

    
#store these in a new list
invg_loci = list(InvG_only.unstacked_df.reset_index().locus_id)

then visualize them:



In [44]:

    
figsize(15,5)
for locus in invg_loci[:3]:
    hmmerclust.LocusView(locus)









    




---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Yersinia pestis CO92
Locus id: 4503118120






    












    



         accession query    evalue bitscore  bias                    name
0   YP_002345337.1  InvG     3e-32    113.1   4.7  type III secretion pro
1   YP_002345338.1  PrgH      0.61     10.7   0.1  type-III secretion pro
2   YP_002345339.1     -         -        -     -  type III secretion app
3   YP_002345341.1  PrgI   6.9e-06     28.3   0.5  type III secretion app
4   YP_002345342.1     -         -        -     -  type III secretion app
5   YP_002345343.1  PrgI     1e-10     43.8   2.7  type III secretion app
6   YP_002345344.1  PrgK   9.8e-58    196.5   1.2  type III secretion sys
7   YP_002345345.1     -         -        -     -    hypothetical protein
8   YP_002345346.1     -         -        -     -  type III secretion sys
9   YP_002345347.1  InvA    1e-226    756.2   4.3  secretion system appar
10  YP_002345348.1  InvC   6.4e-71    240.1     0  type III secretion sys
11  YP_002345349.1     -         -        -     -  type III secretion sys
12  YP_002345351.1  SpaO     1e-19     71.6     0  type III secretion sys
13  YP_002345352.1  SpaP   8.1e-71    239.5  10.6  type III secretion sys
14  YP_002345353.1  SpaQ   2.2e-27     96.3  14.1  type III secretion app
15  YP_002345354.1  SpaR   6.4e-62    210.8  23.1  type III secretion app
16  YP_002345355.1  SpaS  2.7e-110    370.3   0.7  secretion system appar


---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Xanthomonas albilineans GPE PC73
Locus id: 4512039928






    












    



         accession query    evalue bitscore  bias                    name
0   YP_003375972.1  SipD   4.9e-21     76.9   0.1    xsa-invasion protein
1   YP_003375973.1     -         -        -     -  xsa-associated protein
2   YP_003375974.1     -         -        -     -  xsa-associated protein
3   YP_003375975.1     -         -        -     -  xsa-associated protein
4   YP_003375976.1     -         -        -     -  xsa-associated protein
5   YP_003375977.1     -         -        -     -    hypothetical protein
6   YP_003375978.1     -         -        -     -    hypothetical protein
7   YP_003375979.1     -         -        -     -  oxygen-regulated invas
8   YP_003375980.1  OrgA   9.5e-20     72.8     0  oxygen-regulated invas
9   YP_003375981.1  PrgK   3.9e-39    135.8   0.1  xanthomonas secretion 
10  YP_003375982.1  PrgI   2.7e-07     32.8   0.1  xanthomonas secretion 
11  YP_003375983.1  PrgI   6.3e-08     34.8   0.5  xanthomonas secretion 
12  YP_003375984.1  PrgH   2.6e-60    206.1     0  xanthomonas secretion 
13  YP_003375985.1     -         -        -     -  xanthomonas secretion 
14  YP_003375986.1  InvG   1.8e-25       91   0.9  xanthomonas secretion 
15  YP_003375987.1  InvE   1.4e-20     75.8     6  xanthomonas secretion 
16  YP_003375988.1  InvA  7.3e-203    677.3   5.4  xanthomonas secretion 
17  YP_003375989.1     -         -        -     -  xanthomonas secretion 
18  YP_003375990.1  InvC   2.3e-63    215.4   0.1  xanthomonas secretion 
19  YP_003375991.1     -         -        -     -  xsa-associated protein
20  YP_003375992.1     -         -        -     -  xsa-associated protein
21  YP_003375993.1  SpaO   3.6e-19     69.9   0.1  xanthomonas secretion 
22  YP_003375994.1  SpaP   4.3e-62    211.1  11.4  xanthomonas secretion 
23  YP_003375995.1  SpaQ   9.8e-26       91   8.2  xanthomonas secretion 
24  YP_003375996.1  SpaR   1.4e-40    140.9  20.1  xanthomonas secretion 
25  YP_003375997.1  SpaS   1.6e-83    282.3   0.4  xanthomonas secretion 
26  YP_003375998.1     -         -        -     -  xsa-invasion chaperone
27  YP_003375999.1  SipB   5.8e-26     93.4   4.2    xsa-invasion protein
28  YP_003376000.1  SipC   5.8e-17     63.5    26    xsa-invasion protein


---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Salmonella enterica subsp. enterica serovar Typhi str. CT18
Locus id: 4521894688






    












    



      accession query    evalue bitscore  bias                    name
0   NP_457264.1  OrgA   5.9e-70    236.4   3.9   cell invasion protein
1   NP_457265.1  PrgK   8.7e-59    199.9     0  pathogenicity 1 island
2   NP_457266.1  PrgJ   1.1e-07     33.6     0  pathogenicity 1 island
3   NP_457267.1  PrgI   7.1e-10     41.1   0.1  pathogenicity 1 island
4   NP_457268.1  PrgH  1.1e-152    510.1     0  pathogenicity 1 island
5   NP_457269.1     -         -        -     -  AraC family transcript
6   NP_457270.1     -         -        -     -  invasion protein regul
7   NP_457271.1     -         -        -     -   cell invasion protein
8   NP_457272.1     -         -        -     -    tyrosine phosphatase
9   NP_457273.1     -         -        -     -       chaperone protein
10  NP_457274.1     -         -        -     -    hypothetical protein
11  NP_457275.1     -         -        -     -    acyl carrier protein
12  NP_457276.1     -         -        -     -  pathogenicity island 1
13  NP_457277.1  SipD  2.7e-175      584    11  pathogenicity island 1
14  NP_457278.1  SipC  2.3e-135    452.8  34.3  pathogenicity island 1
15  NP_457279.1  SipB   1.4e-70    239.9  24.3  pathogenicity island 1
16  NP_457280.1     -         -        -     -  chaperone protein SicA
17  NP_457281.1  SpaS  3.4e-117      393   4.4       secretory protein
18  NP_457282.1  SpaR   1.7e-78    265.2  16.6       secretory protein
19  NP_457283.1  SpaQ   2.2e-25     89.9   0.8       secretory protein
20  NP_457284.1  SpaP   3.3e-66    224.5   3.7       secretory protein
21  NP_457285.1  SpaO   1.8e-16     61.2     0  surface presentation o
22  NP_457286.1  InvJ  4.8e-213    708.1   7.2  surface presentation o
23  NP_457287.1     -         -        -     -       secretory protein
24  NP_457288.1  InvC   9.1e-70    236.3     0  secretory apparatus AT
25  NP_457289.1     -         -        -     -       secretory protein
26  NP_457290.1  InvA  2.1e-224    748.5   9.3       secretory protein
27  NP_457291.1  InvE   5.3e-42    145.5   0.4   cell invasion protein
28  NP_457292.1  InvG   1.6e-30    107.4   0.6       secretory protein
29  NP_457293.1     -         -        -     -  AraC family transcript
30  NP_457294.1  InvH   6.8e-96    319.8     6  cell adherance/invasio

Can get more detail about locus hits by setting `hit_detail_table=True. In any case, the hsps from hmmer are mapped onto the ORFs. Sometimes a single ORF will match to multiple queries ... these will be revealed in the hit detail table.



In [36]:

    
hmmerclust.LocusView(invg_loci[0], hit_detail_table=True)









    




---------------------------------------------------------------------- 
----------------------------------------------------------------------
Organism:  Yersinia pestis CO92
Locus id: 4503118120






    












    



         accession query    evalue bitscore  bias                    name
0   YP_002345337.1  InvG     3e-32    113.1   4.7  type III secretion pro
1   YP_002345338.1  PrgH      0.61     10.7   0.1  type-III secretion pro
2   YP_002345339.1     -         -        -     -  type III secretion app
3   YP_002345341.1  PrgI   6.9e-06     28.3   0.5  type III secretion app
4   YP_002345342.1     -         -        -     -  type III secretion app
5   YP_002345343.1  PrgI     1e-10     43.8   2.7  type III secretion app
6   YP_002345344.1  PrgK   9.8e-58    196.5   1.2  type III secretion sys
7   YP_002345345.1     -         -        -     -    hypothetical protein
8   YP_002345346.1     -         -        -     -  type III secretion sys
9   YP_002345347.1  InvA    1e-226    756.2   4.3  secretion system appar
10  YP_002345348.1  InvC   6.4e-71    240.1     0  type III secretion sys
11  YP_002345349.1     -         -        -     -  type III secretion sys
12  YP_002345351.1  SpaO     1e-19     71.6     0  type III secretion sys
13  YP_002345352.1  SpaP   8.1e-71    239.5  10.6  type III secretion sys
14  YP_002345353.1  SpaQ   2.2e-27     96.3  14.1  type III secretion app
15  YP_002345354.1  SpaR   6.4e-62    210.8  23.1  type III secretion app
16  YP_002345355.1  SpaS  2.7e-110    370.3   0.7  secretion system appar
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
InvG     113.1  3.000000e-32   4.7        359      520
      bitscore  evalue  bias  hsp_start  hsp_end
name                                            
PrgH      10.7    0.61   0.1          4       34
PrgH      10.7    0.61   0.0        165      342
      bitscore    evalue  bias  hsp_start  hsp_end
name                                              
PrgI      28.3  0.000007   0.5          1       70
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
PrgI      43.8  1.000000e-10   2.7          1       82
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
PrgK     196.5  9.800000e-58   1.2          0      202
      bitscore         evalue  bias  hsp_start  hsp_end
name                                                   
InvA     756.2  1.000000e-226   4.3         26      665
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
InvC     240.1  6.400000e-71     0        148      359
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaO      71.6  1.000000e-19   0.0        146      179
SpaO      71.6  1.000000e-19   0.1        224      300
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaP     239.5  8.100000e-71  10.6         12      212
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaQ      96.3  2.200000e-27  14.1         11       87
      bitscore        evalue  bias  hsp_start  hsp_end
name                                                  
SpaR     210.8  6.400000e-62  23.1          2      251
      bitscore         evalue  bias  hsp_start  hsp_end
name                                                   
SpaS     370.3  2.700000e-110   0.7          0      342






    Out[36]:





<hmmerclust.hmmerclust.LocusView instance at 0x10c423320>

Filtering and exporting sequences

Finally, sequences can be exported for multiple sequence alignments, etc. Use pandas define the criteria for the exported sequences ... e.g. here we export sequences that belong to a locus with e-value less than 0.01



In [37]:

    
#836 protein in the total search
len(df.df)









    Out[37]:





836



In [38]:

    
#remember the list of invg loci we made in the previous step
invg_loci









    Out[38]:





[<hmmerclust.hmmerclust.Locus instance at 0x10c682128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cf043f8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d86a320>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ca8b050>,
 <hmmerclust.hmmerclust.Locus instance at 0x10f7291b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cab72d8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10da0c3f8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ce7f7e8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10ccdf200>,
 <hmmerclust.hmmerclust.Locus instance at 0x10bee9098>,
 <hmmerclust.hmmerclust.Locus instance at 0x10bb55878>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e60d128>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c796050>,
 <hmmerclust.hmmerclust.Locus instance at 0x10f742098>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d2a3368>,
 <hmmerclust.hmmerclust.Locus instance at 0x10d1792d8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cf474d0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10cb390e0>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c2981b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10e3491b8>,
 <hmmerclust.hmmerclust.Locus instance at 0x10c7143f8>]



In [39]:

    
#make a new df that only has proteins in the invg_loci
invg_only_df = df.df[df.df['locus_id'].isin(invg_loci)]
len(invg_only_df)









    Out[39]:





288



In [40]:

    
#now filter this by evalue < 0.01
evalue_cut = invg_only_df[invg_only_df.hit_evalue < 0.01]



In [41]:

    
#down to 278 proteins
len(evalue_cut)









    Out[41]:





278



In [42]:

    
#export all these proteins as fasta files
hmmerclust.RelatedProteinGroup(evalue_cut)









    Out[42]:





<hmmerclust.hmmerclust.RelatedProteinGroup instance at 0x10c27ecf8>



In [43]:

    
#the hits are grouped by name, for MSA,  in this new folder
!ls group_fastas/









    



InvA.fasta InvE.fasta InvH.fasta OrgA.fasta PrgH.fasta PrgJ.fasta PscP.fasta SipC.fasta SpaO.fasta SpaQ.fasta SpaS.fasta
InvC.fasta InvG.fasta InvJ.fasta OrgB.fasta PrgI.fasta PrgK.fasta SipB.fasta SipD.fasta SpaP.fasta SpaR.fasta



In [ ]:

	org_name	org_acc	org_phylum	org_class	org_order	org_family	org_genus	org_species	org_genome_length	...	prot_acc	prot_gi	prot_product	prot_translation	prot_numb_of_res	hit_query	hit_evalue	hit_bitscore	hit_bias	locus_id
0	Pseudomonas aeruginosa PAO1	NC_002516	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	Pseudomonas aeruginosa	6264404	...	NP_250132.1	15596638	hypothetical protein	MAVAPGVLLPPTPDVKPKAAAPKSQQKTPEPSNDKTSSFSDMYAKE...	427	PscP	1.100000e-21	78.1	1.1	<hmmerclust.hmmerclust.Locus instance at 0x10a...
1	Pseudomonas aeruginosa PAO1	NC_002516	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	Pseudomonas aeruginosa	6264404	...	NP_250414.1	15596920	type III export protein PscJ	MRRTVKGLSRMALLALVLALGGCKVELYTGISQKEGNEMLALLRSE...	248	PrgK	2.300000e-72	244.3	0.2	<hmmerclust.hmmerclust.Locus instance at 0x10c...
2	Pseudomonas aeruginosa PAO1	NC_002516	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	Pseudomonas aeruginosa	6264404	...	NP_250134.1	15596640	flagellar motor switch protein FliM	MAVQDLLSQDEIDALLHGVDDGLVETEVEATPGSVKSYDLTSQDRI...	323	SpaO	1.300000e-17	64.9	0.2	<hmmerclust.hmmerclust.Locus instance at 0x10a...
3	Pseudomonas aeruginosa PAO1	NC_002516	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	Pseudomonas aeruginosa	6264404	...	NP_249795.1	15596301	flagellum-specific ATP synthase	MRLERTSFARRLEGYTEAVSLPAQPVVEGRLLRMVGLTLEAEGLQA...	451	InvC	1.700000e-72	245.2	0.0	None
4	Pseudomonas aeruginosa PAO1	NC_002516	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Pseudomonadaceae	Pseudomonas	Pseudomonas aeruginosa	6264404	...	NP_250138.1	15596644	flagellar biosynthesis protein FliQ	MTPEVALDLFREALWLTAMIVGVLVVPSLLVGLVVAMFQAATQINE...	89	SpaQ	1.800000e-29	103.0	8.8	<hmmerclust.hmmerclust.Locus instance at 0x10a...

		hit_query	InvC	SpaO	SpaP	SpaQ	SpaR	SpaS	InvA	PscP	OrgB	OrgA	...	PrgK	PrgH	InvG	InvH	InvE	PrgI	InvJ	SipC	SipB	SipD
org_species	locus_id	org_tree_order
Yersinia pestis	<hmmerclust.hmmerclust.Locus instance at 0x10c682128>	0	1	1	1	1	1	1	1	0	0	0	...	1	1	1	0	0	2	0	0	0	0
Xanthomonas albilineans	<hmmerclust.hmmerclust.Locus instance at 0x10cf043f8>	0	1	1	1	1	1	1	1	0	0	1	...	1	1	1	0	1	2	0	1	1	1
Salmonella enterica	<hmmerclust.hmmerclust.Locus instance at 0x10d86a320>	0	1	1	1	1	1	1	1	0	0	1	...	1	1	1	1	1	1	1	1	1	1
Salmonella enterica	<hmmerclust.hmmerclust.Locus instance at 0x10ca8b050>	0	1	1	1	1	1	1	1	0	0	0	...	1	0	1	0	1	2	0	0	2	0
Ramlibacter tataouinensis	<hmmerclust.hmmerclust.Locus instance at 0x10f7291b8>	0	2	1	1	1	1	1	1	1	0	0	...	1	0	1	0	0	0	0	0	0	0