Mapping TFBS to alignments

Dictionary postions with an alignment

Me and Bronski's conversation on how to do this

Me: Hey! I want to map nucleotide sequence position after an alignment. I know you have done this before. So I would rather not reinvent the wheel. You did a dictionary in python, but how? Can I see your script? If this feature is embedded in a larger program it might be easier to just explain your strategy.

Bronski: So the strategy is to loop through an aligned sequence and create a dictionary where the keys are the original indices and the values are the indices in the alignment.

Here’s a simple example:


In [10]:
aligned_seq = 'AGC---TTCATCA'
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1
        
print(remap_dict)


{0: 0, 1: 1, 2: 2, 3: 6, 4: 7, 5: 8, 6: 9, 7: 10, 8: 11, 9: 12}

My Attempt with ES2

Breakdown of what I have to do:

  1. Read in alignment file.
  2. seperate each sequence into it's own sequence
    • make dictionary for each sequence
    • print out sequence?
      • run TFBS finder for each sequence
  3. Make vector of each sequence that says presence or absence at each position.
  4. Figure out a way to visualize this.

Read in Alignment File

Use Bio.AlignIO.read()

  • The first argument is a handle to read the data from, typically an open file (see Section 24.1), or a filename.
  • The second argument is a lower case string specifying the alignment format. As in Bio.SeqIO we don’t try and guess the file format for you! See http://biopython.org/wiki/AlignIO for a full listing of supported formats.

In [20]:
from Bio import AlignIO
alignment = AlignIO.read("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")
print(alignment)


SingleLetterAlphabet() alignment with 9 rows and 1136 columns
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...C-- ludwig_eve-striped-2||MEMB002F|+
ATATAACCCAATAATTTGAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002A|+
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB003C|-
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002C|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003B|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003F|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCCAGAAG...C-- ludwig_eve-striped-2||MEMB003D|-
ATATAACCCAATAATTTGAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002D|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002E|-

In [12]:
for record in alignment:
    print(record.id)


ludwig_eve-striped-2||MEMB002F|+
ludwig_eve-striped-2||MEMB002A|+
ludwig_eve-striped-2||MEMB003C|-
ludwig_eve-striped-2||MEMB002C|+
ludwig_eve-striped-2||MEMB003B|+
ludwig_eve-striped-2||MEMB003F|+
ludwig_eve-striped-2||MEMB003D|-
ludwig_eve-striped-2||MEMB002D|+
ludwig_eve-striped-2||MEMB002E|-

Buuuuuuuut, we don't really need the alignment as an alignment per se. But it is important for viewing and testing later. We need to have each seperate sequence, So I am going to use SeqIO.parse.


In [13]:
from Bio import SeqIO

# read in alignment as a list of sequences
records = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))

In [14]:
# Testing with the first sequence
seqTest = records[0]
#print(seqTest.seq)
print(type(seqTest))


<class 'Bio.SeqRecord.SeqRecord'>

In [15]:
# Turn just the sequence into a string instead of fasta sequence
aligned_seq = str(seqTest.seq)
print(type(aligned_seq)) # check


<type 'str'>

Notes on loop

  • enumerate(): prints out numbers counting up
  • xInd is the keys that were enumerated.
  • then the remap_dict[counter] = xInd makes the dictionary x is the nucleotide

In [33]:
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0

for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1
        
#checking dictionary created
print(len(remap_dict)) # should be length of alignment
print(remap_dict[40]) #should print the value of the number key
print(type(remap_dict[40])) #Check data type


905
43
<type 'int'>

Putting the Remap together with TFBS

The last part is to create the vector that should span the entire alignment printing 1 if the position has a bicoid site or 0 if not.


In [52]:
## Attempt at vector

bcdSites = [0] * len(aligned_seq)

#from loctaingTFB.ipy
TFBS = [10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]

#Need to make positive.
TFBS_pos = [abs(k) for k in TFBS]

# Or do I need to subtract from the end position...
# something like this
for i in TFBS:
    if i < 0:
        len(aligned_seq - i)
    else: 
        i


print(TFBS)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-138c293461b0> in <module>()
     13 for i in TFBS:
     14     if i < 0:
---> 15         len(aligned_seq - i)
     16     else:
     17         i

TypeError: unsupported operand type(s) for -: 'str' and 'int'

In [ ]:


In [ ]: