Writing PB in files

The API can write every file that the command-line tools can, including the outputs of PBassign. The functions that handle the various file formats are available in the pbxplore.io module.


In [1]:
from pprint import pprint
import urllib.request
import os

# print date & versions
import datetime
print("Date & time:",datetime.datetime.now())
import sys
print("Python version:", sys.version)


Date & time: 2017-03-13 10:24:57.828963
Python version: 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]

In [2]:
import pbxplore as pbx
print("PBxplore version:", pbx.__version__)


PBxplore version: 1.3.5

Fasta files

The most common way to save PB sequences is to write them in a fasta file.

PBxplore offers two ways to write fasta files: the sequences can be written either all at once or one at a time. To write a batch of sequences at once, we need a list of sequences and a list of the corresponding sequence names. The writing function here is pbxplore.io.write_fasta().


In [3]:
names = []
pb_sequences = []
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/2LFU.pdb', '2LFU.pdb')
for chain_name, chain in pbx.chains_from_files([pdb_name]):
    dihedrals = chain.get_phi_psi_angles()
    pb_seq = pbx.assign(dihedrals)
    names.append(chain_name)
    pb_sequences.append(pb_seq)

pprint(names)

pprint(pb_sequences)

with open('output.fasta', 'w') as outfile:
    pbx.io.write_fasta(outfile, pb_sequences, names)


['2LFU.pdb | model 1 | chain A',
 '2LFU.pdb | model 2 | chain A',
 '2LFU.pdb | model 3 | chain A',
 '2LFU.pdb | model 4 | chain A',
 '2LFU.pdb | model 5 | chain A',
 '2LFU.pdb | model 6 | chain A',
 '2LFU.pdb | model 7 | chain A',
 '2LFU.pdb | model 8 | chain A',
 '2LFU.pdb | model 9 | chain A',
 '2LFU.pdb | model 10 | chain A']
['ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniaghiapaddddddfklnoambZZ',
 'ZZpcfblcffbccdddddeehjacdddddddddfklggcddddddddddddddfblghiadddddddddfklopadddddehpmmmnopcddddddeehiacdddfblopadcddddddfklpaccdddddfklmlmgcdehiaddddddfklmmgopZZ',
 'ZZmgghiafbbccdddddehjbdcdddddddddfklggcddddddddddddddfbfghpacddddddddfklopadddddehiaklmmmgcdddddeehiaddddfkbgciacdddddefklpaccddddddfkgojbdfehpaddddddfkbccfbgZZ',
 'ZZcghiacfkbacdddddfbhpacdddddddddfklmcfdddddddddddddehiacddddddddddddfknopadddddfkpamlnopaddddddehjaccdddfklnopacddddddfklmpccdddddddehiabghehiaddddddfklpccfkZZ',
 'ZZpaehiehkaccdddddehjbccdddddddddfklggcddddddddddddddfbhpadddddddddddfklopadddddehiamlmmpccdddddeehiadddddfbacddcddddddfklmaccddddddfbgghiafehiadddddddfklpacfZZ',
 'ZZmghbacfkbccdddddeehpacdddddddddfklggcdddddddddddddehiacadddddddddddfklopadddddehiaklnopcddddddeehiadddehjlnopacddddddfklmaccddddehiaehbgcdehiadddddddfehjlpcZZ',
 'ZZcchbacfkbccdddddfehpacdddddddddfklggcdddddddddddddddehjapadddddddddfknopadddddfklmmmnopcddddddehjiddddddfknopacddddddfklpaccdddddfklmaacdfehpadddddehjblckknZZ',
 'ZZcehjdeehiacdjdddedjbdcdddddddddfklggcdddddddddddddddbfblbacddddddddfklopacddddehiamlnopaddddddehjacddddfehpaaccdddddefklpaccdddddfklmbfbehehiaddddddffkgoiehZZ',
 'ZZpccdjdfkbccdddddehhpacdddddddddfklggcdddddddddddddehiacbdcdddddddddfklopadddddehiammnopcddddddeejiadddehjlgobacddddddfklmpccddddehiacbcbdfehpadddddehjklmklmZZ',
 'ZZccfklcfkbccdddddehjbdcdddddddddfklggcdddddddddddddehiapaccdddddddddfklopadddddehjamlnopaddddddehjddcdddfbfghpacddddddfklpaccddddddfbcfbacfehpadddddddekpghiaZZ']
Read 10 chain(s) in 2LFU.pdb

In [4]:
!cat output.fasta
!rm output.fasta


>2LFU.pdb | model 1 | chain A
ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcd
dddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddf
klmaccddddddfbgniaghiapaddddddfklnoambZZ
>2LFU.pdb | model 2 | chain A
ZZpcfblcffbccdddddeehjacdddddddddfklggcddddddddddddddfblghia
dddddddddfklopadddddehpmmmnopcddddddeehiacdddfblopadcddddddf
klpaccdddddfklmlmgcdehiaddddddfklmmgopZZ
>2LFU.pdb | model 3 | chain A
ZZmgghiafbbccdddddehjbdcdddddddddfklggcddddddddddddddfbfghpa
cddddddddfklopadddddehiaklmmmgcdddddeehiaddddfkbgciacdddddef
klpaccddddddfkgojbdfehpaddddddfkbccfbgZZ
>2LFU.pdb | model 4 | chain A
ZZcghiacfkbacdddddfbhpacdddddddddfklmcfdddddddddddddehiacddd
dddddddddfknopadddddfkpamlnopaddddddehjaccdddfklnopacddddddf
klmpccdddddddehiabghehiaddddddfklpccfkZZ
>2LFU.pdb | model 5 | chain A
ZZpaehiehkaccdddddehjbccdddddddddfklggcddddddddddddddfbhpadd
dddddddddfklopadddddehiamlmmpccdddddeehiadddddfbacddcddddddf
klmaccddddddfbgghiafehiadddddddfklpacfZZ
>2LFU.pdb | model 6 | chain A
ZZmghbacfkbccdddddeehpacdddddddddfklggcdddddddddddddehiacadd
dddddddddfklopadddddehiaklnopcddddddeehiadddehjlnopacddddddf
klmaccddddehiaehbgcdehiadddddddfehjlpcZZ
>2LFU.pdb | model 7 | chain A
ZZcchbacfkbccdddddfehpacdddddddddfklggcdddddddddddddddehjapa
dddddddddfknopadddddfklmmmnopcddddddehjiddddddfknopacddddddf
klpaccdddddfklmaacdfehpadddddehjblckknZZ
>2LFU.pdb | model 8 | chain A
ZZcehjdeehiacdjdddedjbdcdddddddddfklggcdddddddddddddddbfblba
cddddddddfklopacddddehiamlnopaddddddehjacddddfehpaaccdddddef
klpaccdddddfklmbfbehehiaddddddffkgoiehZZ
>2LFU.pdb | model 9 | chain A
ZZpccdjdfkbccdddddehhpacdddddddddfklggcdddddddddddddehiacbdc
dddddddddfklopadddddehiammnopcddddddeejiadddehjlgobacddddddf
klmpccddddehiacbcbdfehpadddddehjklmklmZZ
>2LFU.pdb | model 10 | chain A
ZZccfklcfkbccdddddehjbdcdddddddddfklggcdddddddddddddehiapacc
dddddddddfklopadddddehjamlnopaddddddehjddcdddfbfghpacddddddf
klpaccddddddfbcfbacfehpadddddddekpghiaZZ

Sequences can be written one at a time using the pbxplore.io.write_fasta_entry() function.


In [5]:
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/2LFU.pdb', '2LFU.pdb')

with open('output.fasta', 'w') as outfile:
    for chain_name, chain in pbx.chains_from_files([pdb_name]):
        dihedrals = chain.get_phi_psi_angles()
        pb_seq = pbx.assign(dihedrals)
        pbx.io.write_fasta_entry(outfile, pb_seq, chain_name)


Read 10 chain(s) in 2LFU.pdb

In [6]:
!cat output.fasta
!rm output.fasta


>2LFU.pdb | model 1 | chain A
ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcd
dddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddf
klmaccddddddfbgniaghiapaddddddfklnoambZZ
>2LFU.pdb | model 2 | chain A
ZZpcfblcffbccdddddeehjacdddddddddfklggcddddddddddddddfblghia
dddddddddfklopadddddehpmmmnopcddddddeehiacdddfblopadcddddddf
klpaccdddddfklmlmgcdehiaddddddfklmmgopZZ
>2LFU.pdb | model 3 | chain A
ZZmgghiafbbccdddddehjbdcdddddddddfklggcddddddddddddddfbfghpa
cddddddddfklopadddddehiaklmmmgcdddddeehiaddddfkbgciacdddddef
klpaccddddddfkgojbdfehpaddddddfkbccfbgZZ
>2LFU.pdb | model 4 | chain A
ZZcghiacfkbacdddddfbhpacdddddddddfklmcfdddddddddddddehiacddd
dddddddddfknopadddddfkpamlnopaddddddehjaccdddfklnopacddddddf
klmpccdddddddehiabghehiaddddddfklpccfkZZ
>2LFU.pdb | model 5 | chain A
ZZpaehiehkaccdddddehjbccdddddddddfklggcddddddddddddddfbhpadd
dddddddddfklopadddddehiamlmmpccdddddeehiadddddfbacddcddddddf
klmaccddddddfbgghiafehiadddddddfklpacfZZ
>2LFU.pdb | model 6 | chain A
ZZmghbacfkbccdddddeehpacdddddddddfklggcdddddddddddddehiacadd
dddddddddfklopadddddehiaklnopcddddddeehiadddehjlnopacddddddf
klmaccddddehiaehbgcdehiadddddddfehjlpcZZ
>2LFU.pdb | model 7 | chain A
ZZcchbacfkbccdddddfehpacdddddddddfklggcdddddddddddddddehjapa
dddddddddfknopadddddfklmmmnopcddddddehjiddddddfknopacddddddf
klpaccdddddfklmaacdfehpadddddehjblckknZZ
>2LFU.pdb | model 8 | chain A
ZZcehjdeehiacdjdddedjbdcdddddddddfklggcdddddddddddddddbfblba
cddddddddfklopacddddehiamlnopaddddddehjacddddfehpaaccdddddef
klpaccdddddfklmbfbehehiaddddddffkgoiehZZ
>2LFU.pdb | model 9 | chain A
ZZpccdjdfkbccdddddehhpacdddddddddfklggcdddddddddddddehiacbdc
dddddddddfklopadddddehiammnopcddddddeejiadddehjlgobacddddddf
klmpccddddehiacbcbdfehpadddddehjklmklmZZ
>2LFU.pdb | model 10 | chain A
ZZccfklcfkbccdddddehjbdcdddddddddfklggcdddddddddddddehiapacc
dddddddddfklopadddddehjamlnopaddddddehjddcdddfbfghpacddddddf
klpaccddddddfbcfbacfehpadddddddekpghiaZZ

By default, the lines in fasta files are wrapped at 60 characters, as defined in pbxplore.io.fasta.FASTA_WIDTH. Both pbxplore.io.write_fasta() and pbxplore.io.write_fasta_entry() have an optional width argument that controls the wrapping.


In [7]:
print(pb_sequences[0])


ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniaghiapaddddddfklnoambZZ

In [8]:
with open('output.fasta', 'w') as outfile:
    for width in (60, 70, 80):
        pbx.io.write_fasta_entry(outfile, pb_sequences[0],
                                        'width={} blocks'.format(width),
                                        width=width)

In [9]:
!cat output.fasta
!rm output.fasta


>width=60 blocks
ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcd
dddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddf
klmaccddddddfbgniaghiapaddddddfklnoambZZ
>width=70 blocks
ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddf
klopadddddfhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniagh
iapaddddddfklnoambZZ
>width=80 blocks
ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddfklopaddddd
fhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniaghiapaddddddfklnoambZZ
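
The wrapping shown above can be reproduced in a few lines of plain Python. The sketch below is only an illustration of the idea, not PBxplore's implementation: the helper names wrap_sequence() and format_fasta_entry() are hypothetical, and the default width of 60 simply mirrors the value quoted above.

```python
def wrap_sequence(sequence, width=60):
    """Split a sequence into chunks of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta_entry(sequence, name, width=60):
    """Return a fasta entry (header line + wrapped sequence) as a string.

    Hypothetical helper written for this illustration only.
    """
    lines = ['>' + name] + wrap_sequence(sequence, width)
    return '\n'.join(lines) + '\n'


# Toy PB-like sequence of 14 characters, wrapped at 5 characters per line
example = 'ZZ' + 'd' * 10 + 'ZZ'
print(format_fasta_entry(example, 'toy sequence', width=5), end='')
```

The last line of an entry is simply shorter when the sequence length is not a multiple of the width, which is what the 60- and 70-character outputs above show.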

Dihedral angles

The phi and psi dihedral angles are needed to assign protein block sequences. Once these angles are computed, it is sometimes convenient to store them in a file. This can be done easily.


In [10]:
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/2LFU.pdb', '2LFU.pdb')

with open('output.phipsi', 'w') as outfile:
    for chain_name, chain in pbx.chains_from_files([pdb_name]):
        dihedral = chain.get_phi_psi_angles()
        for res in sorted(dihedral):
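            # Compare with None explicitly so that an angle of exactly 0.0 is still written as a number
            phi = "{:8.2f}".format(dihedral[res]["phi"]) if dihedral[res]["phi"] is not None else "    None"
            psi = "{:8.2f}".format(dihedral[res]["psi"]) if dihedral[res]["psi"] is not None else "    None"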
            print("{} {:6d} {} {} ".format(chain_name, res, phi, psi), file=outfile)


Read 10 chain(s) in 2LFU.pdb

Note that it is better to write the dihedral angles of each PDB file or frame as soon as they are computed, because storing all of them in a list has a high memory cost.

The output is formatted with one line per residue. The first column repeats the name given to the chain; it is followed by the residue id, then by the phi and psi angles. If an angle is not defined, 'None' is written instead.


In [11]:
!head output.phipsi
!tail output.phipsi
!rm output.phipsi


2LFU.pdb | model 1 | chain A    276     None    44.77 
2LFU.pdb | model 1 | chain A    277   -81.67    15.27 
2LFU.pdb | model 1 | chain A    278   -79.03   -21.52 
2LFU.pdb | model 1 | chain A    279  -144.43    64.08 
2LFU.pdb | model 1 | chain A    280   -84.23   144.90 
2LFU.pdb | model 1 | chain A    281    70.83   -64.01 
2LFU.pdb | model 1 | chain A    282  -107.77    93.35 
2LFU.pdb | model 1 | chain A    283   -71.42   108.03 
2LFU.pdb | model 1 | chain A    284   -72.99    99.31 
2LFU.pdb | model 1 | chain A    285   -92.93     2.32 
2LFU.pdb | model 10 | chain A    426   -88.16   127.99 
2LFU.pdb | model 10 | chain A    427  -107.40  -161.76 
2LFU.pdb | model 10 | chain A    428   -72.12   -14.94 
2LFU.pdb | model 10 | chain A    429    66.62   -81.17 
2LFU.pdb | model 10 | chain A    430  -141.67   111.76 
2LFU.pdb | model 10 | chain A    431   -69.37   141.19 
2LFU.pdb | model 10 | chain A    432    76.08   -75.91 
2LFU.pdb | model 10 | chain A    433  -149.23   167.34 
2LFU.pdb | model 10 | chain A    434   -63.43   -27.06 
2LFU.pdb | model 10 | chain A    435  -166.70     None 
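
The line format described above can be read back with plain Python. The following is a minimal sketch, assuming the exact layout written by the cell above (chain name, residue id, phi, psi, with 'None' for an undefined angle). The function name parse_phipsi_line() is hypothetical and is not part of PBxplore.

```python
def parse_phipsi_line(line):
    """Parse one line of the phi/psi file written above.

    The chain name may itself contain spaces, so the three numeric
    fields are split off from the right-hand side of the line.
    """
    name, res_id, phi, psi = line.rsplit(None, 3)

    def to_angle(field):
        # 'None' marks an undefined angle (first phi, last psi of a chain)
        return None if field == 'None' else float(field)

    return name.strip(), int(res_id), to_angle(phi), to_angle(psi)


# Two lines copied from the output above
lines = [
    '2LFU.pdb | model 1 | chain A    276     None    44.77 ',
    '2LFU.pdb | model 1 | chain A    277   -81.67    15.27 ',
]
for line in lines:
    print(parse_phipsi_line(line))
```

Splitting from the right is a deliberate choice here: it keeps the parser independent of how many words the chain name contains.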

Read fasta files

We may also want to read back the sequences that we wrote to files. PBxplore provides a function to read fasta files: the pbxplore.io.read_fasta() function.


In [12]:
def pdb_to_fasta_pb(pdb_path, fasta_path):
    """
    Write a fasta file with all the PB sequences from a PDB
    """
    with open(fasta_path, 'w') as outfile:
        for chain_name, chain in pbx.chains_from_files([pdb_path]):
            dihedrals = chain.get_phi_psi_angles()
            pb_seq = pbx.assign(dihedrals)
            pbx.io.write_fasta_entry(outfile, pb_seq, chain_name)

# Write a fasta file
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/2LFU.pdb', '2LFU.pdb')
pdb_to_fasta_pb(pdb_name, 'output.fasta')

# Read a list of headers and a list of sequences from a fasta file
names, sequences = pbx.io.read_fasta('output.fasta')

print('names:')
pprint(names)
print('sequences:')
pprint(sequences)

!rm output.fasta


read 10 sequences in output.fasta
names:
['2LFU.pdb | model 1 | chain A',
 '2LFU.pdb | model 2 | chain A',
 '2LFU.pdb | model 3 | chain A',
 '2LFU.pdb | model 4 | chain A',
 '2LFU.pdb | model 5 | chain A',
 '2LFU.pdb | model 6 | chain A',
 '2LFU.pdb | model 7 | chain A',
 '2LFU.pdb | model 8 | chain A',
 '2LFU.pdb | model 9 | chain A',
 '2LFU.pdb | model 10 | chain A']
sequences:
['ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniaghiapaddddddfklnoambZZ',
 'ZZpcfblcffbccdddddeehjacdddddddddfklggcddddddddddddddfblghiadddddddddfklopadddddehpmmmnopcddddddeehiacdddfblopadcddddddfklpaccdddddfklmlmgcdehiaddddddfklmmgopZZ',
 'ZZmgghiafbbccdddddehjbdcdddddddddfklggcddddddddddddddfbfghpacddddddddfklopadddddehiaklmmmgcdddddeehiaddddfkbgciacdddddefklpaccddddddfkgojbdfehpaddddddfkbccfbgZZ',
 'ZZcghiacfkbacdddddfbhpacdddddddddfklmcfdddddddddddddehiacddddddddddddfknopadddddfkpamlnopaddddddehjaccdddfklnopacddddddfklmpccdddddddehiabghehiaddddddfklpccfkZZ',
 'ZZpaehiehkaccdddddehjbccdddddddddfklggcddddddddddddddfbhpadddddddddddfklopadddddehiamlmmpccdddddeehiadddddfbacddcddddddfklmaccddddddfbgghiafehiadddddddfklpacfZZ',
 'ZZmghbacfkbccdddddeehpacdddddddddfklggcdddddddddddddehiacadddddddddddfklopadddddehiaklnopcddddddeehiadddehjlnopacddddddfklmaccddddehiaehbgcdehiadddddddfehjlpcZZ',
 'ZZcchbacfkbccdddddfehpacdddddddddfklggcdddddddddddddddehjapadddddddddfknopadddddfklmmmnopcddddddehjiddddddfknopacddddddfklpaccdddddfklmaacdfehpadddddehjblckknZZ',
 'ZZcehjdeehiacdjdddedjbdcdddddddddfklggcdddddddddddddddbfblbacddddddddfklopacddddehiamlnopaddddddehjacddddfehpaaccdddddefklpaccdddddfklmbfbehehiaddddddffkgoiehZZ',
 'ZZpccdjdfkbccdddddehhpacdddddddddfklggcdddddddddddddehiacbdcdddddddddfklopadddddehiammnopcddddddeejiadddehjlgobacddddddfklmpccddddehiacbcbdfehpadddddehjklmklmZZ',
 'ZZccfklcfkbccdddddehjbdcdddddddddfklggcdddddddddddddehiapaccdddddddddfklopadddddehjamlnopaddddddehjddcdddfbfghpacddddddfklpaccddddddfbcfbacfehpadddddddekpghiaZZ']
Read 10 chain(s) in 2LFU.pdb
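
To make the shape of the result concrete (a list of names and a list of sequences in the same order, with wrapped lines joined back together), here is a minimal fasta reader in plain Python. It is a sketch written for illustration, not the code behind pbxplore.io.read_fasta(); the function name read_fasta_handle() is hypothetical, and the input is an in-memory toy file.

```python
import io


def read_fasta_handle(handle):
    """Read fasta records from an open text handle.

    Returns two lists of equal length: the header names (without the
    leading '>') and the sequences with their wrapped lines joined.
    """
    names, sequences = [], []
    chunks = None  # lines of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if chunks is not None:
                sequences.append(''.join(chunks))
            names.append(line[1:].strip())
            chunks = []
        elif chunks is not None:
            chunks.append(line)
    if chunks is not None:
        sequences.append(''.join(chunks))
    return names, sequences


# Toy fasta content: the first sequence is wrapped over two lines
toy = io.StringIO('>first\nZZdd\nddZZ\n>second\nZZmmZZ\n')
names, sequences = read_fasta_handle(toy)
print(names)
print(sequences)
```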

If the sequences we want to read are spread across several fasta files, we can use the pbxplore.io.read_several_fasta() function, which takes a list of fasta file paths as argument instead of a single path.


In [13]:
# Write several fasta files
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/1BTA.pdb', '1BTA.pdb')
pdb_to_fasta_pb(pdb_name, '1BTA.fasta')
pdb_name, _ = urllib.request.urlretrieve('https://files.rcsb.org/view/2LFU.pdb', '2LFU.pdb')
pdb_to_fasta_pb(pdb_name, '2FLU.fasta')

# Read the fasta files
names, sequences = pbx.io.read_several_fasta(['1BTA.fasta', '2FLU.fasta'])

# Print the first entries
print('names:')
pprint(names[:5])
print('sequences:')
pprint(sequences[:5])

!rm 1BTA.fasta 2FLU.fasta


Read 1 chain(s) in 1BTA.pdb
read 1 sequences in 1BTA.fasta
read 10 sequences in 2FLU.fasta
names:
['1BTA.pdb | chain A',
 '2LFU.pdb | model 1 | chain A',
 '2LFU.pdb | model 2 | chain A',
 '2LFU.pdb | model 3 | chain A',
 '2LFU.pdb | model 4 | chain A']
sequences:
['ZZdddfklonbfklmmmmmmmmnopafklnoiaklmmmmmnoopacddddddehkllmmmmngoilmmmmmmmmmmmmnopacdcddZZ',
 'ZZbghiacfkbccdddddehiadddddddddddfklggcdddddddddddddehifbdcddddddddddfklopadddddfhpamlnopcddddddehjadddddehjacbddddddddfklmaccddddddfbgniaghiapaddddddfklnoambZZ',
 'ZZpcfblcffbccdddddeehjacdddddddddfklggcddddddddddddddfblghiadddddddddfklopadddddehpmmmnopcddddddeehiacdddfblopadcddddddfklpaccdddddfklmlmgcdehiaddddddfklmmgopZZ',
 'ZZmgghiafbbccdddddehjbdcdddddddddfklggcddddddddddddddfbfghpacddddddddfklopadddddehiaklmmmgcdddddeehiaddddfkbgciacdddddefklpaccddddddfkgojbdfehpaddddddfkbccfbgZZ',
 'ZZcghiacfkbacdddddfbhpacdddddddddfklmcfdddddddddddddehiacddddddddddddfknopadddddfkpamlnopaddddddehjaccdddfklnopacddddddfklmpccdddddddehiabghehiaddddddfklpccfkZZ']
Read 10 chain(s) in 2LFU.pdb