A python package for detecting a gene cluster of interest in a set of bacterial genomes followed by interactive analysis of the results using ipython/jupyter notebook. For example, it could be used for identifying genetic loci that encode biosynthetic pathways or multicomponent protein assemblies, comparing how these loci differ across taxonomical groups, and automatically extracting sequences for subsequent phylogenetic analysis or protein expression.
Please see the demo folder, which contains an executable notebook demonstrating the functionality.
First, install virtualenv and virtualenvwrapper.
Then make a virtualenv for hmmerclust:
$ mkvirtualenv -p /usr/bin/python hmmerclust
$ workon hmmerclust
$ pip install ipython[all] matplotlib pandas biopython hmmerclust
$ ipython notebook
In [8]:
from hmmerclust import hmmerclust
import settings
You can use the fetch_gbwithparts() function to download genomes from NCBI. Here, we just download a few. In an actual search I might use all the genomes from RefSeq. See a list here
In [9]:
accessions_of_interest = ['NC_002516',
'NC_002695',
'NC_003143',
'NC_003198',
'NC_007645',
'NC_007650',
'NC_007651',
'NC_007760',
'NC_010287',
'NC_013722',
'NC_013971',
'NC_015677',
'NC_018518',
'NC_022904']
In [10]:
#download these from NCBI... may take 5 minutes
hmmerclust.fetch_gbwithparts(accessions_of_interest, 'your@email.com', './genomes/')
In [11]:
ls ./genomes/
In [12]:
genome_list = !ls ./genomes/
genome_dir = './genomes/'
Now it is time to populate our database with organism data. After you do this once, a fasta will be generated. If you already have the fasta from a previous run, set freshfasta=False because the file can be very large and don't want to always generate this and have multiple copies on your system.
In [13]:
db = hmmerclust.OrganismDB('t3ss_database',
genome_list,
genome_dir,
freshfasta=True)
In [14]:
db.organisms
Out[14]:
In [15]:
for org in db.organisms:
print org.name
the organism object encapsulates this data:
In [16]:
db.organisms[1].__dict__
Out[16]:
In [17]:
#notice a combined_fasta file was created when the db was made
!ls
Set up the /alignments folder. Alignments FASTA and must have txt extension. In this case, I queried PFAM for the proteins located in my gene cluster of interest (here, the type III secretion system) and downloaded the hmm seed alignments. It is important to name the files starting with how you want to the search results to be annotated later, followed by an underscore. It doesn't matter what goes after the underscore.
In [18]:
!ls alignments
In [19]:
combined_fasta = './combined_fasta'
s = hmmerclust.HmmSearch(db, combined_fasta,
freshbuild=True,
freshsearch=True)
In [45]:
#hits now found in the db object
db.organisms[1].proteins
Out[45]:
Now that the hits have been associated with each organism, they can be clustered into loci based on their proximity in the genome. Here, we search for loci with minimum of 5 of the search proteins within a 15k bp of each other.
Also set the COLOUR_DICT setting here which will determine the locus map color.
In the settings file there are definitions for how results will be displayed later in some of the scripts.
The COLOUR_DICT variable indicates how ORFs will be coloured in the locus maps.
The HEATMAP_COLUMNS variable are how proteins are ordered in the heatmap scripts.
the HEATMAP_ABBREVIATIONS variable are 1 letter codes for proteins that should reflect the same order as the columns.
If these are not defined, the will be generated automatically.
In [21]:
!cat settings.py
In [22]:
db.find_loci(5, 15000, colordict=settings.COLOUR_DICT)
In [23]:
# loci attribute now populated
db.organisms[1].loci
Out[23]:
In [24]:
df = hmmerclust.FinalDataFrame(db)
In [25]:
# each row represents a hit from our search
# For e.g., index the first row
df.df.head()
Out[25]:
In [26]:
df.df.ix[0]
Out[26]:
In [27]:
%pylab
%matplotlib inline
figsize(20,10)
In [28]:
# number of loci identified by family
figsize(5,5)
df.df.groupby(['org_family'])['locus_id'].nunique().plot(kind='bar')
Out[28]:
In [29]:
figsize(15,5)
df.df[df.df.hit_query=='InvG'].hit_evalue.order().plot(logy=True, kind='bar')
Out[29]:
In [30]:
InvG_only = hmmerclust.HeatMap(df.df,
by_locus=True,
cols=settings.HEATMAP_COLUMNS,
singleletters=settings.HEATMAP_ABBREVIATIONS,
subset=['InvG'])
Notice if we set by_locus=False, the rows are no longer by locus, rather a hit can be anywhere in the genome. In this case we see a lot of InvC hits not associated with loci
In [31]:
hmmerclust.HeatMap(df.df,
by_locus=False,
cols=settings.HEATMAP_COLUMNS,
singleletters=settings.HEATMAP_ABBREVIATIONS)
Out[31]:
In [32]:
#the heatmap is built from this unstacked pandas dataframe
InvG_only.unstacked_df.head()
Out[32]:
we could grab all the loci from this df:
In [33]:
InvG_only.unstacked_df.reset_index().locus_id
Out[33]:
In [34]:
#store these in a new list
invg_loci = list(InvG_only.unstacked_df.reset_index().locus_id)
then visualize them:
In [44]:
figsize(15,5)
for locus in invg_loci[:3]:
hmmerclust.LocusView(locus)
Can get more detail about locus hits by setting `hit_detail_table=True. In any case, the hsps from hmmer are mapped onto the ORFs. Sometimes a single ORF will match to multiple queries ... these will be revealed in the hit detail table.
In [36]:
hmmerclust.LocusView(invg_loci[0], hit_detail_table=True)
Out[36]:
In [37]:
#836 protein in the total search
len(df.df)
Out[37]:
In [38]:
#remember the list of invg loci we made in the previous step
invg_loci
Out[38]:
In [39]:
#make a new df that only has proteins in the invg_loci
invg_only_df = df.df[df.df['locus_id'].isin(invg_loci)]
len(invg_only_df)
Out[39]:
In [40]:
#now filter this by evalue < 0.01
evalue_cut = invg_only_df[invg_only_df.hit_evalue < 0.01]
In [41]:
#down to 278 proteins
len(evalue_cut)
Out[41]:
In [42]:
#export all these proteins as fasta files
hmmerclust.RelatedProteinGroup(evalue_cut)
Out[42]:
In [43]:
#the hits are grouped by name, for MSA, in this new folder
!ls group_fastas/
In [ ]: