Mapping TFBS to alignments

Dictionary postions with an alignment

Me and Bronski's conversation on how to do this

Me: Hey! I want to map nucleotide sequence position after an alignment. I know you have done this before. So I would rather not reinvent the wheel. You did a dictionary in python, but how? Can I see your script? If this feature is embedded in a larger program it might be easier to just explain your strategy.

Bronski: So the strategy is to loop through an aligned sequence and create a dictionary where the keys are the original indices and the values are the indices in the alignment.

Here’s a simple example:


In [71]:
aligned_seq = 'AGC---TTCATCA'
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0
for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1
        
print(remap_dict)


{0: 0, 1: 1, 2: 2, 3: 6, 4: 7, 5: 8, 6: 9, 7: 10, 8: 11, 9: 12}

My Attempt with ES2

Breakdown of what I have to do:

  1. Read in alignment file.
  2. seperate each sequence into it's own sequence
    • make dictionary for each sequence
    • print out sequence?
      • run TFBS finder for each sequence
  3. Make vector of each sequence that says presence or absence at each position.
  4. Figure out a way to visualize this.

Read in Alignment File

Use Bio.AlignIO.read()

  • The first argument is a handle to read the data from, typically an open file (see Section 24.1), or a filename.
  • The second argument is a lower case string specifying the alignment format. As in Bio.SeqIO we don’t try and guess the file format for you! See http://biopython.org/wiki/AlignIO for a full listing of supported formats.

In [72]:
from Bio import AlignIO
alignment = AlignIO.read("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta")
print(alignment)


SingleLetterAlphabet() alignment with 9 rows and 1136 columns
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...C-- ludwig_eve-striped-2||MEMB002F|+
ATATAACCCAATAATTTGAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002A|+
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB003C|-
ATATAACCCAATAATTTTAACTAACTCGCAGGA---GCAAGAAG...CTG ludwig_eve-striped-2||MEMB002C|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003B|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...C-- ludwig_eve-striped-2||MEMB003F|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCCAGAAG...C-- ludwig_eve-striped-2||MEMB003D|-
ATATAACCCAATAATTTGAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002D|+
ATATAACCCAATAATTTTAACTAACTCGCAGGAGCGGCAAGAAG...CTA ludwig_eve-striped-2||MEMB002E|-

In [73]:
for record in alignment:
    print(record.id)


ludwig_eve-striped-2||MEMB002F|+
ludwig_eve-striped-2||MEMB002A|+
ludwig_eve-striped-2||MEMB003C|-
ludwig_eve-striped-2||MEMB002C|+
ludwig_eve-striped-2||MEMB003B|+
ludwig_eve-striped-2||MEMB003F|+
ludwig_eve-striped-2||MEMB003D|-
ludwig_eve-striped-2||MEMB002D|+
ludwig_eve-striped-2||MEMB002E|-

Buuuuuuuut, we don't really need the alignment as an alignment per se. But it is important for viewing and testing later. We need to have each seperate sequence, So I am going to use SeqIO.parse.


In [74]:
from Bio import SeqIO

# read in alignment as a list of sequences
records = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))

In [75]:
# Testing with the first sequence
seqTest = records[0]
#print(seqTest.seq)
print(type(seqTest))


<class 'Bio.SeqRecord.SeqRecord'>

In [76]:
# Turn just the sequence into a string instead of fasta sequence
aligned_seq = str(seqTest.seq)
print(type(aligned_seq)) # check


<type 'str'>

Notes on loop

  • enumerate(): prints out numbers counting up
  • xInd is the keys that were enumerated.
  • then the remap_dict[counter] = xInd makes the dictionary x is the nucleotide

In [83]:
remap_dict = {}
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']
counter = 0

for xInd, x in enumerate(aligned_seq):    
    if x in nuc_list:
        remap_dict[counter] = xInd
        counter += 1

#checking dictionary created
print(len(remap_dict)) # should be length of alignment
print(remap_dict[40]) #should print the value of the number key
print(type(remap_dict[40])) #Check data type


{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 36, 34: 37, 35: 38, 36: 39, 37: 40, 38: 41, 39: 42, 40: 43, 41: 44, 42: 45, 43: 46, 44: 47, 45: 48, 46: 49, 47: 59, 48: 60, 49: 61, 50: 62, 51: 63, 52: 64, 53: 65, 54: 66, 55: 67, 56: 68, 57: 69, 58: 70, 59: 71, 60: 72, 61: 73, 62: 74, 63: 75, 64: 76, 65: 77, 66: 78, 67: 79, 68: 80, 69: 81, 70: 82, 71: 83, 72: 84, 73: 85, 74: 86, 75: 87, 76: 88, 77: 89, 78: 90, 79: 91, 80: 92, 81: 93, 82: 94, 83: 97, 84: 98, 85: 99, 86: 100, 87: 101, 88: 102, 89: 103, 90: 104, 91: 105, 92: 106, 93: 139, 94: 140, 95: 141, 96: 142, 97: 143, 98: 144, 99: 145, 100: 146, 101: 147, 102: 148, 103: 149, 104: 150, 105: 151, 106: 152, 107: 153, 108: 154, 109: 160, 110: 161, 111: 162, 112: 163, 113: 164, 114: 165, 115: 166, 116: 167, 117: 168, 118: 169, 119: 170, 120: 171, 121: 180, 122: 181, 123: 182, 124: 183, 125: 184, 126: 185, 127: 186, 128: 187, 129: 188, 130: 189, 131: 190, 132: 191, 133: 192, 134: 200, 135: 201, 136: 202, 137: 203, 138: 204, 139: 205, 140: 206, 141: 207, 142: 208, 143: 209, 144: 210, 145: 211, 146: 212, 147: 213, 148: 214, 149: 215, 150: 216, 151: 217, 152: 218, 153: 219, 154: 220, 155: 221, 156: 222, 157: 223, 158: 224, 159: 225, 160: 226, 161: 227, 162: 228, 163: 243, 164: 244, 165: 245, 166: 246, 167: 247, 168: 248, 169: 249, 170: 250, 171: 251, 172: 252, 173: 253, 174: 254, 175: 255, 176: 256, 177: 257, 178: 258, 179: 259, 180: 260, 181: 261, 182: 262, 183: 263, 184: 264, 185: 265, 186: 266, 187: 267, 188: 268, 189: 269, 190: 270, 191: 271, 192: 272, 193: 273, 194: 274, 195: 275, 196: 276, 197: 277, 198: 278, 199: 279, 200: 280, 201: 281, 202: 282, 203: 283, 204: 284, 205: 285, 206: 286, 207: 287, 208: 288, 209: 289, 210: 290, 211: 291, 212: 292, 213: 293, 214: 294, 215: 295, 216: 296, 217: 316, 218: 317, 219: 318, 220: 319, 221: 320, 222: 321, 223: 322, 224: 323, 225: 324, 226: 325, 227: 326, 228: 327, 229: 328, 230: 329, 231: 330, 232: 331, 233: 332, 234: 333, 235: 334, 236: 335, 237: 336, 238: 337, 239: 338, 240: 339, 241: 340, 242: 341, 243: 342, 244: 343, 245: 344, 246: 345, 247: 346, 248: 347, 249: 348, 250: 349, 251: 350, 252: 351, 253: 352, 254: 353, 255: 354, 256: 355, 257: 356, 258: 357, 259: 358, 260: 359, 261: 360, 262: 361, 263: 362, 264: 363, 265: 364, 266: 365, 267: 366, 268: 367, 269: 368, 270: 369, 271: 370, 272: 371, 273: 372, 274: 373, 275: 374, 276: 375, 277: 376, 278: 377, 279: 378, 280: 379, 281: 380, 282: 381, 283: 382, 284: 383, 285: 384, 286: 385, 287: 386, 288: 387, 289: 388, 290: 389, 291: 397, 292: 398, 293: 399, 294: 400, 295: 401, 296: 402, 297: 403, 298: 404, 299: 405, 300: 406, 301: 407, 302: 408, 303: 409, 304: 410, 305: 411, 306: 413, 307: 414, 308: 415, 309: 416, 310: 417, 311: 418, 312: 419, 313: 420, 314: 421, 315: 422, 316: 423, 317: 424, 318: 425, 319: 426, 320: 427, 321: 429, 322: 430, 323: 431, 324: 432, 325: 433, 326: 434, 327: 435, 328: 436, 329: 437, 330: 438, 331: 439, 332: 440, 333: 441, 334: 442, 335: 443, 336: 444, 337: 445, 338: 446, 339: 447, 340: 448, 341: 449, 342: 450, 343: 451, 344: 452, 345: 453, 346: 454, 347: 455, 348: 456, 349: 457, 350: 458, 351: 459, 352: 474, 353: 475, 354: 476, 355: 477, 356: 478, 357: 479, 358: 480, 359: 481, 360: 482, 361: 483, 362: 484, 363: 485, 364: 486, 365: 487, 366: 488, 367: 489, 368: 490, 369: 491, 370: 492, 371: 493, 372: 494, 373: 495, 374: 496, 375: 497, 376: 498, 377: 499, 378: 500, 379: 501, 380: 502, 381: 503, 382: 504, 383: 505, 384: 506, 385: 507, 386: 508, 387: 509, 388: 510, 389: 511, 390: 512, 391: 516, 392: 517, 393: 518, 394: 519, 395: 520, 396: 521, 397: 522, 398: 523, 399: 524, 400: 525, 401: 526, 402: 528, 403: 529, 404: 530, 405: 531, 406: 532, 407: 533, 408: 534, 409: 535, 410: 536, 411: 537, 412: 538, 413: 539, 414: 540, 415: 541, 416: 542, 417: 543, 418: 544, 419: 545, 420: 546, 421: 547, 422: 550, 423: 551, 424: 552, 425: 553, 426: 554, 427: 555, 428: 556, 429: 560, 430: 561, 431: 562, 432: 563, 433: 564, 434: 565, 435: 566, 436: 567, 437: 568, 438: 569, 439: 570, 440: 571, 441: 574, 442: 575, 443: 576, 444: 577, 445: 578, 446: 579, 447: 580, 448: 581, 449: 582, 450: 583, 451: 584, 452: 588, 453: 589, 454: 590, 455: 591, 456: 592, 457: 593, 458: 594, 459: 595, 460: 596, 461: 597, 462: 598, 463: 599, 464: 600, 465: 601, 466: 602, 467: 603, 468: 604, 469: 605, 470: 606, 471: 607, 472: 608, 473: 609, 474: 610, 475: 611, 476: 612, 477: 613, 478: 614, 479: 615, 480: 616, 481: 617, 482: 618, 483: 619, 484: 620, 485: 621, 486: 622, 487: 623, 488: 624, 489: 625, 490: 626, 491: 627, 492: 628, 493: 631, 494: 632, 495: 633, 496: 635, 497: 636, 498: 637, 499: 663, 500: 664, 501: 665, 502: 666, 503: 667, 504: 668, 505: 669, 506: 670, 507: 671, 508: 672, 509: 673, 510: 674, 511: 675, 512: 676, 513: 677, 514: 678, 515: 679, 516: 680, 517: 681, 518: 682, 519: 683, 520: 684, 521: 685, 522: 686, 523: 687, 524: 688, 525: 689, 526: 690, 527: 691, 528: 692, 529: 693, 530: 694, 531: 695, 532: 696, 533: 697, 534: 698, 535: 699, 536: 700, 537: 701, 538: 702, 539: 703, 540: 704, 541: 705, 542: 706, 543: 707, 544: 708, 545: 709, 546: 710, 547: 711, 548: 712, 549: 713, 550: 714, 551: 715, 552: 716, 553: 717, 554: 718, 555: 719, 556: 720, 557: 721, 558: 722, 559: 723, 560: 724, 561: 725, 562: 726, 563: 739, 564: 740, 565: 741, 566: 742, 567: 743, 568: 744, 569: 745, 570: 746, 571: 747, 572: 748, 573: 749, 574: 750, 575: 751, 576: 752, 577: 753, 578: 754, 579: 755, 580: 756, 581: 759, 582: 760, 583: 761, 584: 762, 585: 763, 586: 764, 587: 765, 588: 766, 589: 767, 590: 768, 591: 769, 592: 770, 593: 771, 594: 772, 595: 773, 596: 774, 597: 775, 598: 776, 599: 777, 600: 778, 601: 779, 602: 780, 603: 781, 604: 782, 605: 783, 606: 784, 607: 785, 608: 786, 609: 787, 610: 788, 611: 789, 612: 790, 613: 791, 614: 792, 615: 793, 616: 794, 617: 795, 618: 796, 619: 797, 620: 798, 621: 806, 622: 807, 623: 808, 624: 809, 625: 810, 626: 812, 627: 813, 628: 814, 629: 815, 630: 816, 631: 817, 632: 818, 633: 819, 634: 820, 635: 821, 636: 822, 637: 823, 638: 824, 639: 825, 640: 826, 641: 827, 642: 829, 643: 830, 644: 831, 645: 832, 646: 833, 647: 834, 648: 835, 649: 836, 650: 837, 651: 838, 652: 839, 653: 840, 654: 841, 655: 842, 656: 843, 657: 844, 658: 845, 659: 846, 660: 847, 661: 848, 662: 849, 663: 850, 664: 851, 665: 852, 666: 853, 667: 854, 668: 855, 669: 861, 670: 862, 671: 863, 672: 864, 673: 865, 674: 866, 675: 867, 676: 868, 677: 869, 678: 872, 679: 873, 680: 874, 681: 875, 682: 876, 683: 877, 684: 878, 685: 879, 686: 880, 687: 881, 688: 882, 689: 883, 690: 892, 691: 893, 692: 894, 693: 895, 694: 896, 695: 897, 696: 898, 697: 899, 698: 900, 699: 901, 700: 902, 701: 903, 702: 904, 703: 905, 704: 906, 705: 907, 706: 908, 707: 909, 708: 910, 709: 911, 710: 912, 711: 913, 712: 914, 713: 929, 714: 930, 715: 931, 716: 932, 717: 933, 718: 934, 719: 935, 720: 936, 721: 937, 722: 938, 723: 939, 724: 947, 725: 948, 726: 949, 727: 950, 728: 951, 729: 952, 730: 953, 731: 954, 732: 955, 733: 956, 734: 957, 735: 960, 736: 961, 737: 962, 738: 963, 739: 964, 740: 965, 741: 966, 742: 967, 743: 968, 744: 969, 745: 970, 746: 971, 747: 972, 748: 973, 749: 974, 750: 975, 751: 976, 752: 977, 753: 978, 754: 979, 755: 980, 756: 981, 757: 982, 758: 983, 759: 984, 760: 985, 761: 986, 762: 987, 763: 988, 764: 989, 765: 990, 766: 991, 767: 992, 768: 993, 769: 994, 770: 995, 771: 996, 772: 997, 773: 998, 774: 999, 775: 1000, 776: 1001, 777: 1002, 778: 1003, 779: 1004, 780: 1005, 781: 1006, 782: 1007, 783: 1008, 784: 1009, 785: 1010, 786: 1011, 787: 1012, 788: 1013, 789: 1014, 790: 1015, 791: 1016, 792: 1017, 793: 1018, 794: 1019, 795: 1020, 796: 1023, 797: 1024, 798: 1026, 799: 1027, 800: 1028, 801: 1029, 802: 1030, 803: 1031, 804: 1032, 805: 1033, 806: 1034, 807: 1035, 808: 1036, 809: 1037, 810: 1038, 811: 1039, 812: 1040, 813: 1041, 814: 1042, 815: 1043, 816: 1044, 817: 1045, 818: 1046, 819: 1047, 820: 1048, 821: 1049, 822: 1050, 823: 1051, 824: 1052, 825: 1053, 826: 1054, 827: 1055, 828: 1056, 829: 1057, 830: 1058, 831: 1059, 832: 1060, 833: 1061, 834: 1063, 835: 1064, 836: 1065, 837: 1066, 838: 1067, 839: 1068, 840: 1069, 841: 1070, 842: 1071, 843: 1072, 844: 1073, 845: 1074, 846: 1075, 847: 1076, 848: 1077, 849: 1078, 850: 1079, 851: 1080, 852: 1081, 853: 1082, 854: 1083, 855: 1084, 856: 1085, 857: 1086, 858: 1087, 859: 1088, 860: 1089, 861: 1090, 862: 1091, 863: 1092, 864: 1093, 865: 1094, 866: 1095, 867: 1096, 868: 1097, 869: 1098, 870: 1099, 871: 1100, 872: 1101, 873: 1102, 874: 1103, 875: 1104, 876: 1105, 877: 1106, 878: 1107, 879: 1108, 880: 1109, 881: 1110, 882: 1111, 883: 1112, 884: 1113, 885: 1114, 886: 1115, 887: 1116, 888: 1117, 889: 1118, 890: 1119, 891: 1120, 892: 1121, 893: 1122, 894: 1123, 895: 1124, 896: 1125, 897: 1126, 898: 1127, 899: 1128, 900: 1129, 901: 1130, 902: 1131, 903: 1132, 904: 1133}
905
43
<type 'int'>

We need two sequences. One that is not the alignment.

Putting the Remap together with TFBS

The last part is to create the vector that should span the entire alignment printing 1 if the position has a bicoid site or 0 if not.


In [78]:
## Attempt at vector

bcdSites = [0] * len(aligned_seq)

#from loctaingTFB.ipy
TFBS = [10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]

#Need to make positive. This is not what I need.
#TFBS_pos = [abs(k) for k in TFBS]

print((TFBS))
m = 7

# This is the range of the motif
for pos in TFBS:
    print(aligned_seq[pos:pos+m])


[10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
ATAATTT
CCTCG--
--AACTG
--CTGTG
ACAG---
TCCATCC
-------
-------
-------
GAACGGT
GAGACAG
G------
-------
AACAGGC
AGTTGGG
TA-----
-TAATCC
GGGATTA
GCCGAGG
CACCTCA
CTTGGTA
CGATCC-
TGCGCCA
AAAGTCA
AATAAAT
GTC-CCA
CCC-TAA
CC-TAAT
------T

Now I need to make a new vector that says if the bicoid site is present or absent on the position. Can the negative positions be used in a query of the dictionary? Likely not.


In [118]:
print(type(TFBS))
print(type(remap_dict))
print(TFBS)

# Okay, the problem is the negative numbers
another_key = [82, 85, 98]
print(len(remap_dict))

# So I need to convert the negative number first.
print([remap_dict[x] for x in another_key])


<type 'list'>
<type 'dict'>
[10, 102, 137, -741, -680, -595, 309, -497, -485, 429, 453, 459, 465, -376, -347, -339, -308, 593, 600, -289, 613, 623, -240, 679, -128, -77, 825, 826, 886]
905
[94, 99, 144]

In [128]:
# Working on converting TFBS negative numbers. 
TFBS_2 = []
for x in TFBS:
    if x < 0:
       TFBS_2.append(905 + x)
    else:
       TFBS_2.append(x)

print(TFBS_2)


[10, 102, 137, 164, 225, 310, 309, 408, 420, 429, 453, 459, 465, 529, 558, 566, 597, 593, 600, 616, 613, 623, 665, 679, 777, 828, 825, 826, 886]