GEM-PRO - List of Gene IDs

This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.

**Input:** List of gene IDs

**Output:** GEM-PRO model

Imports



In [1]:

    
import sys
import logging



In [2]:

    
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO



In [3]:

    
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. `DEBUG` is the most verbose level.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don't affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
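If you need `DEBUG` detail from only one part of the pipeline, one option is to keep the root logger at `INFO` and raise the level on a single named logger. The sketch below uses only the standard `logging` module. The logger names are taken from the log prefixes shown in this notebook, and the helper name `run_with_selective_debug` is hypothetical; no ssbio code is called.

```python
import io
import logging


def run_with_selective_debug():
    """Log at INFO globally but at DEBUG for one named logger; return the text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('[%(name)s] %(levelname)s: %(message)s'))

    root = logging.getLogger()
    focus = logging.getLogger('ssbio.pipeline.gempro')   # logger we want details from
    chatty = logging.getLogger('ssbio.databases.kegg')   # logger we want to keep quiet

    # Remember the previous configuration so this sketch has no lasting side effects
    saved_handlers, saved_root_level, saved_focus_level = root.handlers[:], root.level, focus.level
    root.handlers = [handler]
    root.setLevel(logging.INFO)      # global threshold
    focus.setLevel(logging.DEBUG)    # per-logger override
    try:
        chatty.debug('KEGG FASTA file already exists')       # below INFO: dropped
        focus.debug('b1299: loaded KEGG information')         # DEBUG allowed for this logger
        focus.info('10/10: number of genes mapped to KEGG')   # INFO passes the global threshold
    finally:
        root.handlers = saved_handlers
        root.setLevel(saved_root_level)
        focus.setLevel(saved_focus_level)
    return stream.getvalue()


print(run_with_selective_debug())
```

Child loggers without an explicit level inherit the threshold of their nearest configured ancestor, which is why only the overridden logger emits `DEBUG` records here.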



In [4]:

    
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)  # SET YOUR LOGGING LEVEL HERE #



In [5]:

    
# Send log messages to stderr with a timestamped format (setup for Jupyter notebooks)
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these four things:

ROOT_DIR
- The directory where a folder named after your PROJECT will be created
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs
PDB_FILE_TYPE
- The file format used for downloaded PDB structures (`mmtf` in this example)

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── <gene_id1>_protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── <gene_id2>_protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files

**Note:** Methods for protein complexes and metabolites are still in development.
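To get a feel for the layout before running the pipeline, the top-level and per-gene folders can be recreated with the standard library. This is only an illustrative sketch: `make_project_dirs` is a hypothetical helper, not part of ssbio, and the per-gene folder name `<gene_id>_protein` follows the paths that appear in the log output of this notebook.

```python
import tempfile
from pathlib import Path


def make_project_dirs(root_dir, project, gene_ids):
    """Create a GEM-PRO-like folder skeleton and return the project path."""
    project_dir = Path(root_dir) / project
    # Top-level folders shown in the tree above
    for name in ('data', 'model', 'genes', 'reactions', 'metabolites'):
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    # One folder per gene, each with sequence and structure subfolders
    for gene_id in gene_ids:
        protein_dir = project_dir / 'genes' / gene_id / '{}_protein'.format(gene_id)
        (protein_dir / 'sequences').mkdir(parents=True, exist_ok=True)
        (protein_dir / 'structures').mkdir(parents=True, exist_ok=True)
    return project_dir


# Build the skeleton in a throwaway directory and list what was created
with tempfile.TemporaryDirectory() as tmp:
    project_dir = make_project_dirs(tmp, 'genes_GP', ['b0761', 'b0889'])
    for path in sorted(project_dir.rglob('*')):
        print(path.relative_to(project_dir))
```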



In [6]:

    
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
PDB_FILE_TYPE = 'mmtf'



In [7]:

    
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)









    



[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: Added 10 genes to GEM-PRO project
[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: 10: number of genes

Mapping gene ID --> sequence

First, we need to map these IDs to their protein sequences. Two ID mapping services are provided for this: KEGG and UniProt. The end goal is to map a UniProt ID to each gene ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

**Note:** You only need to map gene IDs using one service. However, you can run both if some genes fail to map in one service but map in the other.
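The idea of running both services and filling gaps can be sketched with plain dictionaries. Everything here is hypothetical illustration: `merge_id_mappings` is not an ssbio function, and the example assumes that the second service failed for one gene.

```python
def merge_id_mappings(primary, secondary):
    """Combine two gene -> UniProt ID mappings, preferring the primary result.

    Returns the merged mapping and the list of genes that neither service mapped.
    """
    merged = {}
    for gene in sorted(set(primary) | set(secondary)):
        # Use the primary hit when present, otherwise fall back to the secondary
        merged[gene] = primary.get(gene) or secondary.get(gene)
    missing = [gene for gene, uniprot_id in merged.items() if uniprot_id is None]
    return merged, missing


# Toy inputs: IDs for b0761 and b0889 are taken from the KEGG table in this notebook
kegg_result = {'b0761': 'P0A9G8', 'b0889': 'P0ACJ0', 'b9999': None}
uniprot_result = {'b0761': 'P0A9G8', 'b0889': None, 'b9999': None}

merged, missing = merge_id_mappings(uniprot_result, kegg_result)
print(merged)
print('Still missing:', missing)
```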

Methods



In [8]:

    
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()









    





 
 










    



[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1299: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1299/b1299_protein/sequences/eco-b1299.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1299: reading sequence from sequence file /tmp/genes_GP/genes/b1299/b1299_protein/sequences/eco-b1299.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1299/b1299_protein/sequences/eco-b1299.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1299: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0761: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0761/b0761_protein/sequences/eco-b0761.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0761: reading sequence from sequence file /tmp/genes_GP/genes/b0761/b0761_protein/sequences/eco-b0761.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0761/b0761_protein/sequences/eco-b0761.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b0761: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0889: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0889/b0889_protein/sequences/eco-b0889.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0889: reading sequence from sequence file /tmp/genes_GP/genes/b0889/b0889_protein/sequences/eco-b0889.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0889/b0889_protein/sequences/eco-b0889.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b0889: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1187: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1187/b1187_protein/sequences/eco-b1187.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1187: reading sequence from sequence file /tmp/genes_GP/genes/b1187/b1187_protein/sequences/eco-b1187.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1187/b1187_protein/sequences/eco-b1187.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1187: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0995: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0995/b0995_protein/sequences/eco-b0995.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b0995: reading sequence from sequence file /tmp/genes_GP/genes/b0995/b0995_protein/sequences/eco-b0995.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b0995/b0995_protein/sequences/eco-b0995.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b0995: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1130: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1130/b1130_protein/sequences/eco-b1130.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1130: reading sequence from sequence file /tmp/genes_GP/genes/b1130/b1130_protein/sequences/eco-b1130.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1130/b1130_protein/sequences/eco-b1130.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1130: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1013: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1013/b1013_protein/sequences/eco-b1013.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1013: reading sequence from sequence file /tmp/genes_GP/genes/b1013/b1013_protein/sequences/eco-b1013.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1013/b1013_protein/sequences/eco-b1013.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1013: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1040: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1040/b1040_protein/sequences/eco-b1040.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1040: reading sequence from sequence file /tmp/genes_GP/genes/b1040/b1040_protein/sequences/eco-b1040.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1040/b1040_protein/sequences/eco-b1040.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1040: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1221: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1221/b1221_protein/sequences/eco-b1221.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1221: reading sequence from sequence file /tmp/genes_GP/genes/b1221/b1221_protein/sequences/eco-b1221.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1221/b1221_protein/sequences/eco-b1221.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1221: loaded KEGG information for gene
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1014: no sequence stored in memory
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1014/b1014_protein/sequences/eco-b1014.faa: KEGG FASTA file already exists
[2018-12-11 23:12] [ssbio.protein.sequence.seqprop] DEBUG: eco:b1014: reading sequence from sequence file /tmp/genes_GP/genes/b1014/b1014_protein/sequences/eco-b1014.faa
[2018-12-11 23:12] [ssbio.databases.kegg] DEBUG: /tmp/genes_GP/genes/b1014/b1014_protein/sequences/eco-b1014.kegg: KEGG metadata file already exists
[2018-12-11 23:12] [ssbio.pipeline.gempro] DEBUG: b1014: loaded KEGG information for gene






    









    



[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2018-12-11 23:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.






    



Missing KEGG mapping:  []






    Out[8]:

    gene   kegg       refseq     uniprot  pdbs                                               sequence_file  metadata_file
    b0761  eco:b0761  NP_415282  P0A9G8   1B9M;1H9S;1B9N;1O7L;1H9R                           eco-b0761.faa  eco-b0761.kegg
    b0889  eco:b0889  NP_415409  P0ACJ0   2GQQ;2L4A                                          eco-b0889.faa  eco-b0889.kegg
    b0995  eco:b0995  NP_415515  P38684   1ZGZ                                               eco-b0995.faa  eco-b0995.kegg
    b1013  eco:b1013  NP_415533  P0ACU2   4JYK;4XK4;4X1E;3LOC                                eco-b1013.faa  eco-b1013.kegg
    b1014  eco:b1014  NP_415534  P09546   3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...  eco-b1014.faa  eco-b1014.kegg



In [9]:

    
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()









    



['Entry', 'Status', 'names', 'names', 'Length', 'P0A9U6', 'reviewed', 'transcriptional', 'PuuR', 'ycjC', 'JW1292', 'coli', 'K12)', 'b0761', 'MODE_ECOLI', 'DNA-binding', 'dual', 'ModE', 'modR', 'JW0744', 'coli', 'K12)', 'b0889', 'LRP_ECOLI', 'Leucine-responsive', 'protein', 'alsB', 'livR', 'b0889', 'Escherichia', '(strain', '164', 'P0A8V6', 'reviewed', 'acid', 'regulator', 'fadR', 'thdB', 'JW1176', 'coli', 'K12)', 'b0995', 'TORR_ECOLI', 'TorCAD', 'transcriptional', 'protein', 'torR', 'JW0980', 'coli', 'K12)', 'b1130', 'PHOP_ECOLI', 'Transcriptional', 'protein', 'phoP', 'JW1116', 'coli', 'K12)', 'b1013', 'RUTR_ECOLI', 'HTH-type', 'regulator', '(Rut', 'repressor)', 'ycdC', 'JW0998', 'coli', 'K12)', 'b1040', 'CSGD_ECOLI', 'CsgBAC', 'transcriptional', 'protein', 'b1040', 'Escherichia', '(strain', '216', 'P0AF28', 'reviewed', 'response', 'protein', 'narL', 'b1221', 'Escherichia', '(strain', '216', 'P09546', 'reviewed', 'protein', '[Includes:', 'dehydrogenase', '1.5.5.2)', 'oxidase);', 'dehydrogenase', 'dehydrogenase)', '1.2.1.88)', 'gamma-semialdehyde', 'putA', 'b1014', 'Escherichia', '(strain', '1320'] ['name', 'Protein', 'Gene', 'Organism', 'b1299', 'PUUR_ECOLI', 'HTH-type', 'regulator', 'puuR', 'b1299', 'Escherichia', '(strain', '185', 'P0A9G8', 'reviewed', 'transcriptional', 'regulator', 'modE', 'b0761', 'Escherichia', '(strain', '262', 'P0ACJ0', 'reviewed', 'regulatory', 'lrp', 'ihb', 'oppI', 'JW0872', 'coli', 'K12)', 'b1187', 'FADR_ECOLI', 'Fatty', 'metabolism', 'protein', 'oleR', 'b1187', 'Escherichia', '(strain', '239', 'P38684', 'reviewed', 'operon', 'regulatory', 'TorR', 'b0995', 'Escherichia', '(strain', '230', 'P23836', 'reviewed', 'regulatory', 'PhoP', 'b1130', 'Escherichia', '(strain', '223', 'P0ACU2', 'reviewed', 'transcriptional', 'RutR', 'operon', 'rutR', 'b1013', 'Escherichia', '(strain', '212', 'P52106', 'reviewed', 'operon', 'regulatory', 'csgD', 'JW1023', 'coli', 'K12)', 'b1221', 'NARL_ECOLI', 'Nitrate/nitrite', 'regulator', 'NarL', 'frdR', 'JW1212', 
'coli', 'K12)', 'b1014', 'PUTA_ECOLI', 'Bifunctional', 'PutA', 'Proline', '(EC', '(Proline', 'Delta-1-pyrroline-5-carboxylate', '(P5C', '(EC', '(L-glutamate', 'dehydrogenase)]', 'poaA', 'JW0999', 'coli', 'K12)']






    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-20bf31fcd8d4> in <module>
      1 # UniProt mapping
----> 2 my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
      3 print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
      4 my_gempro.df_uniprot_metadata.head()

/vagrant/ssbio/ssbio/pipeline/gempro.py in uniprot_mapping_and_metadata(self, model_gene_source, custom_gene_mapping, outdir, set_as_representative, force_rerun)
    534 
    535         # Map all IDs first to available UniProts
--> 536         genes_to_uniprots = bs_unip.mapping(fr=model_gene_source, to='ACC', query=genes_to_map)
    537 
    538         successfully_mapped_counter = 0

/vagrant/software_modified/bioservices/src/bioservices/uniprot.py in mapping(self, fr, to, query)
    355             print(keys, values)
    356             for i, key in enumerate(keys):
--> 357                 result_dict[key].append(values[i])
    358         return result_dict
    359 

IndexError: list index out of range
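The traceback shows `values[i]` being indexed past the end of `values`, which happens whenever `keys` is longer than `values`. The sketch below is a stand-alone illustration, not the bioservices code: `pair_by_index` mirrors the loop visible in the traceback, and `pair_tokens` is a hypothetical guarded variant. It assumes the two lists come from splitting a response into alternating key/value tokens, which is consistent with the printed lists but is not shown in the source.

```python
from collections import defaultdict


def pair_by_index(keys, values):
    """Mirror of the loop in the traceback: fails if values is shorter than keys."""
    result = defaultdict(list)
    for i, key in enumerate(keys):
        result[key].append(values[i])
    return dict(result)


def pair_tokens(tokens):
    """Pair alternating key/value tokens and fail with a clear message if unbalanced."""
    keys = tokens[0::2]
    values = tokens[1::2]
    if len(keys) != len(values):
        raise ValueError(
            'Expected alternating key/value tokens, got {} keys and {} values'.format(
                len(keys), len(values)))
    result = defaultdict(list)
    for key, value in zip(keys, values):
        result[key].append(value)
    return dict(result)


# Balanced input pairs up as expected
print(pair_tokens(['b0761', 'P0A9G8', 'b0889', 'P0ACJ0']))

# Unbalanced input: the index-based loop raises IndexError, the guarded one ValueError
try:
    pair_by_index(['b0761', 'b0889'], ['P0A9G8'])
except IndexError as err:
    print('IndexError:', err)
try:
    pair_tokens(['b0761', 'P0A9G8', 'b0889'])
except ValueError as err:
    print('ValueError:', err)
```

If this step keeps failing, the KEGG mapping above already provides a UniProt ID for each of the genes shown in its summary table.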



In [10]:

    
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()









    





 
 










    









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.






    



Missing a representative sequence:  []






    Out[10]:

    gene   uniprot  kegg       pdbs                                               sequence_file  metadata_file
    b0761  P0A9G8   eco:b0761  1B9M;1H9S;1B9N;1O7L;1H9R                           eco-b0761.faa  eco-b0761.kegg
    b0889  P0ACJ0   eco:b0889  2GQQ;2L4A                                          eco-b0889.faa  eco-b0889.kegg
    b0995  P38684   eco:b0995  1ZGZ                                               eco-b0995.faa  eco-b0995.kegg
    b1013  P0ACU2   eco:b1013  4JYK;4XK4;4X1E;3LOC                                eco-b1013.faa  eco-b1013.kegg
    b1014  P09546   eco:b1014  3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...  eco-b1014.faa  eco-b1014.kegg

Mapping representative sequence --> structure

These are the ways to map sequence to structure:

1. Use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence against the PDB
3. Make homology models
4. Map to existing homology models

Option 1 is only available if a mapped UniProt ID is set in the representative sequence. If not, you will have to BLAST your sequence against the PDB or make a homology model. You can also run both the UniProt mapping and BLAST for maximum coverage.

Methods



In [11]:

    
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.






    





 
 










    









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes with at least one experimental structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe






    Out[11]:



In [12]:

    
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)









    





 
 










    









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5: number of genes with additional structures added from BLAST






    Out[12]:

    gene   pdb_id  pdb_chain_id  hit_score  hit_evalue     hit_percent_similar  hit_percent_ident  hit_num_ident  hit_num_similar
    b0761  1b9n    B             1091.0     5.530720e-119  0.931298             0.931298           244            244
    b0761  1o7l    D             1089.0     1.096280e-118  0.931298             0.931298           244            244
Downloading and ranking structures

Methods

**Warning:** Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.



In [13]:

    
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)









    





 
 










    









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Saved 15 structures total






    Out[13]:

    gene   chemicals  description                                        experimental_method  mapped_chains  pdb_id  pdb_title                        resolution  structure_file  taxonomy_name
    b0761  NI         ModE (MOLYBDATE-DEPENDENT TRANSCRIPTIONAL REGU...  X-RAY DIFFRACTION    A;B            1b9m    REGULATOR FROM ESCHERICHIA COLI  1.75        1b9m.mmtf       Escherichia coli
    b0761  NI         MODE (MOLYBDATE DEPENDENT TRANSCRIPTIONAL REGU...  X-RAY DIFFRACTION    A;B            1b9n    REGULATOR FROM ESCHERICHIA COLI  2.09        1b9n.mmtf       Escherichia coli



In [14]:

    
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()









    





 
 










    









    



[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.






    Out[14]:

    gene   id        is_experimental  file_type  structure_file
    b0761  REP-1b9n  True             pdb        1b9n-A_clean.pdb
    b0889  REP-2gqq  True             pdb        2gqq-A_clean.pdb
    b1013  REP-4xk4  True             pdb        4xk4-A_clean.pdb
    b1187  REP-1h9t  True             pdb        1h9t-A_clean.pdb
    b1221  REP-1rnl  True             pdb        1rnl-A_clean.pdb



In [15]:

    
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()









    Out[15]:





<StructProp REP-1h9t at 0x7f3880a275c0>






    Out[15]:





{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
 'chains': [<ChainProp A at 0x7f38830e1c18>],
 'date': None,
 'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
 'file_type': 'pdb',
 'id': 'REP-1h9t',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '1h9t',
 'resolution': 3.25,
 'structure_file': '1h9t-A_clean.pdb',
 'taxonomy_name': 'ESCHERICHIA COLI'}

Saving your GEM-PRO

**Warning:** Saving is still experimental. For a full GEM-PRO with sequences and structures, saving can take more than 5 minutes, depending on the number of genes.



In [16]:

    
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)









    



[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json
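When only a lightweight record is needed, a small summary can be written next to the full model. The sketch below is not ssbio's `save_json`: `dump_summary` and `load_summary` are hypothetical helpers built on the standard `json` and `gzip` modules, and the `compression` flag here only mimics the idea of the option used above.

```python
import gzip
import json
import os
import tempfile


def dump_summary(summary, path, compression=False):
    """Write a JSON-serializable summary, optionally gzip-compressed."""
    opener = gzip.open if compression else open
    with opener(path, 'wt', encoding='utf-8') as handle:
        json.dump(summary, handle, indent=2, sort_keys=True)


def load_summary(path, compression=False):
    """Read back a summary written by dump_summary."""
    opener = gzip.open if compression else open
    with opener(path, 'rt', encoding='utf-8') as handle:
        return json.load(handle)


# Representative structure IDs copied from the df_representative_structures output
representatives = {
    'b0761': 'REP-1b9n',
    'b0889': 'REP-2gqq',
    'b1013': 'REP-4xk4',
    'b1187': 'REP-1h9t',
    'b1221': 'REP-1rnl',
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'genes_GP_summary.json')
    dump_summary(representatives, out_path)
    print(load_summary(out_path) == representatives)
```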
