Bring in Alignment for mapping

This program will map TFBS using the Biopython's motif package.

Inputs:

  1. before alignment (fasta)
  2. after alignment (fasta)
  3. TFBS Position Frequency Matrix.

Outputs:

  1. .csv file that outputs found TFBSs at each position, if any, in alignment (position, score, species, raw_position, strand motif_found)
  2. .csv file that outputs only TFBS found.

To Do:

  • [x] Fix bug where species are labelled wrong only in reverse strand (how is this possible)?
    • I arbitrarily set the length. An artifact when I was looking at one sequence at a time.
  • [ ] Why are there NaNs when I pivot the table? Weird.
  • [x] Make Vector that attaches
  • [ ] Attach real species name
  • [ ] Loop through all files in directory
  • [ ] Write file name in last column
  • [ ] Append to a file
  • [ ] Have all of the important checks read out into a file

In [1]:
from Bio import motifs
from Bio import SeqIO 
from Bio.Seq import Seq
from Bio.Seq import MutableSeq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC, generic_dna, generic_protein
from collections import defaultdict
import re
import pandas as pd
import numpy as np
import os, sys

In [2]:
#####################
## sys Inputs - to do
#####################

## So I can run in shell and loop through sequences AND motifs to get a giant dataset
## Need to get the alignment to raw sequence

## read in alignment as a list of sequences
# alignment = list(SeqIO.parse(sys.argv[1], "fasta"))
# motif = motifs.read(open(sys.argv[2],"pfm")

alignment = list(SeqIO.parse("../data/fasta/output_ludwig_eve-striped-2.fa", "fasta"))
motif = motifs.read(open("../data/PWM/transpose_fm/bcd_FlyReg.fm"),"pfm")

raw_sequences = []
for record in alignment:
    raw_sequences.append(SeqRecord(record.seq.ungap("-"), id = record.id))

In [12]:
##################
## Input 1 - Alignment Input 
##################

## Sort alphabetically
alignment = [f for f in sorted(alignment, key=lambda x : x.id)]

## Make all ids lowercase

for record in alignment:
    record.id = record.id.lower()

## Check 
print("Found %i records in alignment file" % len(alignment))

## Sequence Length should be the same for all alignment sequences
for seq in alignment:
    print len(seq)


Found 9 records in alignment file
1136
1136
1136
1136
1136
1136
1136
1136
1136

In [3]:
#####################
## Input 2 - Raw Sequences Input
#####################

print("Found %i records in raw sequence file" % len(raw_sequences))

## make all IUPAC.IUPACUnambiguousDNA()
raw_sequences_2 = []
for seq in raw_sequences:
    raw_sequences_2.append(Seq(str(seq.seq), IUPAC.IUPACUnambiguousDNA()))


Found 9 records in raw sequence file

In [4]:
#####################
## Input 3 - Motif Input
#####################
## motif.weblogo("mymotif.png")
print(motif.counts)
pwm = motif.counts.normalize(pseudocounts=0.0) # Doesn't change from pwm
pssm = pwm.log_odds()
print(pssm) # Why do I need log odds exactly?
motif_length = len(motif) #for later retrival of nucleotide sequence


        0      1      2      3      4      5      6      7
A:   0.19   0.17   0.88   0.92   0.04   0.04   0.06   0.12
C:   0.37   0.08   0.04   0.02   0.02   0.87   0.52   0.25
G:   0.08   0.04   0.04   0.04   0.33   0.02   0.08   0.37
T:   0.37   0.71   0.04   0.02   0.62   0.08   0.35   0.27

        0      1      2      3      4      5      6      7
A:  -0.38  -0.53   1.82   1.88  -2.70  -2.70  -2.12  -1.12
C:   0.55  -1.70  -2.70  -3.70  -3.70   1.79   1.05  -0.00
G:  -1.70  -2.70  -2.70  -2.70   0.39  -3.70  -1.70   0.55
T:   0.55   1.51  -2.70  -3.70   1.30  -1.70   0.47   0.11


In [15]:
######################
## Searching for Motifs in Sequences
######################

## Returns a list of arrays with a score for each position
## This give the score for each position
## If you print the length you get the length of the sequence minus TFBS length. 

## Forward stand
pssm_list = [ ]
for seq in raw_sequences_2:
    pssm_list.append(pssm.calculate(seq))

In [16]:
########################## 
## Automatic Calculation of threshold
##########################

## Ideal to find something that automatically calculates, as
## opposed to having human choosing.

## Approximate calculation of appropriate thresholds for motif finding 
## Patser Threshold
## It selects such a threshold that the log(fpr)=-ic(M) 
## note: the actual patser software uses natural logarithms instead of log_2, so the numbers 
## are not directly comparable. 

distribution = pssm.distribution(background=motif.background, precision=10**4)
patser_threshold = distribution.threshold_patser() #for use later

print("Patser Threshold is %5.3f" % patser_threshold) # Calculates Paster threshold.


Patser Threshold is 3.262

In [17]:
###################################
## Searching for motif in all raw_sequences
#################################

raw_id = []
for seq in raw_sequences:
    raw_id.append(seq.id)

record_length = []
for record in raw_sequences_2:
    record_length.append(len(record))

position_list = []
for i in range(0,8):
    for position, score in pssm.search(raw_sequences_2[i], threshold = patser_threshold):
        positions = {'species': raw_id[i], 'score':score, 'position':position, 'seq_len': record_length[i] }
        position_list.append(positions)
        
position_DF = pd.DataFrame(position_list)

In [18]:
#############################
## Add strand and pos position information as columns to position_DF
#############################
position_list_pos = []
for i, x in enumerate(position_DF['position']):
    if x < 0:
       position_list_pos.append(position_DF.loc[position_DF.index[i], 'seq_len'] + x)
    else:
       position_list_pos.append(x)

## append to position_DF
position_DF['raw_position'] = position_list_pos

## print position_DF['raw_position']
    
## strand Column
strand = []
for x in position_DF['position']:
    if x < 0:
       strand.append("negative")
    else:
       strand.append("positive")
    
## append to position_DF
position_DF['strand'] = strand

## motif_found column
## First turn into a list of strings
raw_sequences_2_list = []
for seq in raw_sequences_2:
    raw_sequences_2_list.append(str(seq))

## Now get all motifs found in sequences
# motif_found = []
# for x in position_DF['position']:
    # motif_found.append(raw_sequences_2_list[i][x:x + motif_length])

## append to position_DF
# position_DF['motif_found'] = motif_found

print position_DF

## Check
## len(motif_found)    
## print(motif_found) 
## print(position_DF)


     position      score  seq_len                           species  \
0          10   5.013668      905  ludwig_eve-striped-2||MEMB002F|+   
1        -804   3.414712      905  ludwig_eve-striped-2||MEMB002F|+   
2         145  10.457056      905  ludwig_eve-striped-2||MEMB002F|+   
3        -740   8.946094      905  ludwig_eve-striped-2||MEMB002F|+   
4        -664   3.572565      905  ludwig_eve-striped-2||MEMB002F|+   
5        -598   3.594098      905  ludwig_eve-striped-2||MEMB002F|+   
6        -580   5.417715      905  ludwig_eve-striped-2||MEMB002F|+   
7         389   3.824103      905  ludwig_eve-striped-2||MEMB002F|+   
8         420   4.672100      905  ludwig_eve-striped-2||MEMB002F|+   
9        -479   4.374594      905  ludwig_eve-striped-2||MEMB002F|+   
10        430   3.794091      905  ludwig_eve-striped-2||MEMB002F|+   
11       -448   9.909568      905  ludwig_eve-striped-2||MEMB002F|+   
12        460   4.094957      905  ludwig_eve-striped-2||MEMB002F|+   
13       -436   5.787577      905  ludwig_eve-striped-2||MEMB002F|+   
14        502   3.793224      905  ludwig_eve-striped-2||MEMB002F|+   
15        523   3.774115      905  ludwig_eve-striped-2||MEMB002F|+   
16       -357   9.909568      905  ludwig_eve-striped-2||MEMB002F|+   
17        551   4.094957      905  ludwig_eve-striped-2||MEMB002F|+   
18       -313   3.769742      905  ludwig_eve-striped-2||MEMB002F|+   
19       -305   5.431520      905  ludwig_eve-striped-2||MEMB002F|+   
20       -275   3.832565      905  ludwig_eve-striped-2||MEMB002F|+   
21       -274   5.505177      905  ludwig_eve-striped-2||MEMB002F|+   
22        635   5.016483      905  ludwig_eve-striped-2||MEMB002F|+   
23        641  10.016483      905  ludwig_eve-striped-2||MEMB002F|+   
24        649   3.357991      905  ludwig_eve-striped-2||MEMB002F|+   
25       -220   3.824103      905  ludwig_eve-striped-2||MEMB002F|+   
26       -138   5.920559      905  ludwig_eve-striped-2||MEMB002F|+   
27        -96   3.995421      905  ludwig_eve-striped-2||MEMB002F|+   
28        813   3.391992      905  ludwig_eve-striped-2||MEMB002F|+   
29        -82   6.380240      905  ludwig_eve-striped-2||MEMB002F|+   
..        ...        ...      ...                               ...   
205       847   3.285077      898  ludwig_eve-striped-2||MEMB003D|-   
206       888   3.875044      898  ludwig_eve-striped-2||MEMB003D|-   
207        10   5.454241      913  ludwig_eve-striped-2||MEMB002D|+   
208       174  10.457056      913  ludwig_eve-striped-2||MEMB002D|+   
209      -719   8.946094      913  ludwig_eve-striped-2||MEMB002D|+   
210      -652   5.243600      913  ludwig_eve-striped-2||MEMB002D|+   
211      -585   3.594098      913  ludwig_eve-striped-2||MEMB002D|+   
212      -567   6.417715      913  ludwig_eve-striped-2||MEMB002D|+   
213       350   3.702168      913  ludwig_eve-striped-2||MEMB002D|+   
214       444   3.794091      913  ludwig_eve-striped-2||MEMB002D|+   
215      -439   9.909568      913  ludwig_eve-striped-2||MEMB002D|+   
216       477   4.094957      913  ludwig_eve-striped-2||MEMB002D|+   
217      -427   5.787577      913  ludwig_eve-striped-2||MEMB002D|+   
218       518   5.020957      913  ludwig_eve-striped-2||MEMB002D|+   
219      -376   4.736640      913  ludwig_eve-striped-2||MEMB002D|+   
220      -329   9.909568      913  ludwig_eve-striped-2||MEMB002D|+   
221       587   4.094957      913  ludwig_eve-striped-2||MEMB002D|+   
222       597   3.346478      913  ludwig_eve-striped-2||MEMB002D|+   
223      -294   4.374594      913  ludwig_eve-striped-2||MEMB002D|+   
224      -262   4.389700      913  ludwig_eve-striped-2||MEMB002D|+   
225       655   5.016483      913  ludwig_eve-striped-2||MEMB002D|+   
226       661  10.016483      913  ludwig_eve-striped-2||MEMB002D|+   
227      -241   8.959556      913  ludwig_eve-striped-2||MEMB002D|+   
228       676   8.997031      913  ludwig_eve-striped-2||MEMB002D|+   
229       760   3.559102      913  ludwig_eve-striped-2||MEMB002D|+   
230      -115   5.920559      913  ludwig_eve-striped-2||MEMB002D|+   
231       844   3.391992      913  ludwig_eve-striped-2||MEMB002D|+   
232       -59   6.380240      913  ludwig_eve-striped-2||MEMB002D|+   
233       858   5.612093      913  ludwig_eve-striped-2||MEMB002D|+   
234       901   4.422532      913  ludwig_eve-striped-2||MEMB002D|+   

     raw_position    strand  
0              10  positive  
1             101  negative  
2             145  positive  
3             165  negative  
4             241  negative  
5             307  negative  
6             325  negative  
7             389  positive  
8             420  positive  
9             426  negative  
10            430  positive  
11            457  negative  
12            460  positive  
13            469  negative  
14            502  positive  
15            523  positive  
16            548  negative  
17            551  positive  
18            592  negative  
19            600  negative  
20            630  negative  
21            631  negative  
22            635  positive  
23            641  positive  
24            649  positive  
25            685  negative  
26            767  negative  
27            809  negative  
28            813  positive  
29            823  negative  
..            ...       ...  
205           847  positive  
206           888  positive  
207            10  positive  
208           174  positive  
209           194  negative  
210           261  negative  
211           328  negative  
212           346  negative  
213           350  positive  
214           444  positive  
215           474  negative  
216           477  positive  
217           486  negative  
218           518  positive  
219           537  negative  
220           584  negative  
221           587  positive  
222           597  positive  
223           619  negative  
224           651  negative  
225           655  positive  
226           661  positive  
227           672  negative  
228           676  positive  
229           760  positive  
230           798  negative  
231           844  positive  
232           854  negative  
233           858  positive  
234           901  positive  

[235 rows x 6 columns]

In [19]:
##################
## get alignment position 
#################

## Need to map to the sequence alignment position

remap_list = []
nuc_list = ['A', 'a', 'G', 'g', 'C', 'c', 'T', 't', 'N', 'n']


positions = {'score':score, 'position':position, 'species': i}
position_list.append(positions)

for i in range(0,9):
    counter = 0
    for xInd, x in enumerate(alignment[i].seq):    
        if x in nuc_list:
            remaps = {'raw_position': counter, 'align_position':xInd, 'species':alignment[i].id}
            counter += 1
            remap_list.append(remaps)
            
remap_DF = pd.DataFrame(remap_list)

## Check
## print_full(remap_DF)

In [24]:
## Merge both datasets

## Check first
## print(position_DF.shape)
## print(remap_DF.shape)

## Merge - all sites
TFBS_map_DF_all = pd.merge(position_DF, remap_DF, on=['species', 'raw_position'], how='outer')

## Sort
TFBS_map_DF_all = TFBS_map_DF_all.sort_values(by=['species','align_position'], ascending=[True, True])

## Check
## print_full(TFBS_map_DF_all)
## print(TFBS_map_DF_all.shape)

# Merge - only signal 
TFBS_map_DF_only_signal = pd.merge(position_DF, remap_DF, on=['species', 'raw_position'], how='inner')
TFBS_map_DF_only_signal = TFBS_map_DF_only_signal.sort_values(by=['species','align_position'], ascending=[True, True])

## To quickly check if species share similar TFBS positions
## print_full(TFBS_map_DF_only_signal.sort_values(by=['raw_position'], ascending=[True]))


## Check
## print_full(TFBS_map_DF_only_signal)
## print(TFBS_map_DF_only_signal.shape)

print(TFBS_map_DF_all)

## Write out Files
## TFBS_map_DF_all.to_csv('../data/outputs/TFBS_map_DF_all_bicoid_test.csv', sep='\t', na_rep="NA")


      position      score  seq_len                           species  \
34          10   5.454241      868  ludwig_eve-striped-2||MEMB002A|+   
35         144  10.457056      868  ludwig_eve-striped-2||MEMB002A|+   
36        -697   8.946094      868  ludwig_eve-striped-2||MEMB002A|+   
37        -625   5.243600      868  ludwig_eve-striped-2||MEMB002A|+   
38        -536   6.417715      868  ludwig_eve-striped-2||MEMB002A|+   
39         336   3.702168      868  ludwig_eve-striped-2||MEMB002A|+   
40         377   3.491528      868  ludwig_eve-striped-2||MEMB002A|+   
41        -469   3.414712      868  ludwig_eve-striped-2||MEMB002A|+   
42        -461   3.296591      868  ludwig_eve-striped-2||MEMB002A|+   
43         423   3.794091      868  ludwig_eve-striped-2||MEMB002A|+   
44        -416   9.909568      868  ludwig_eve-striped-2||MEMB002A|+   
45         455   4.094957      868  ludwig_eve-striped-2||MEMB002A|+   
46        -404   5.787577      868  ludwig_eve-striped-2||MEMB002A|+   
47         521   3.793224      868  ludwig_eve-striped-2||MEMB002A|+   
48        -301   9.909568      868  ludwig_eve-striped-2||MEMB002A|+   
49         570   3.509995      868  ludwig_eve-striped-2||MEMB002A|+   
50         599   3.794091      868  ludwig_eve-striped-2||MEMB002A|+   
51         639   7.881554      868  ludwig_eve-striped-2||MEMB002A|+   
52         669   3.412068      868  ludwig_eve-striped-2||MEMB002A|+   
53         687   3.285077      868  ludwig_eve-striped-2||MEMB002A|+   
54        -166   4.473469      868  ludwig_eve-striped-2||MEMB002A|+   
55        -116   5.920559      868  ludwig_eve-striped-2||MEMB002A|+   
56         -58   6.380240      868  ludwig_eve-striped-2||MEMB002A|+   
57         814   5.505177      868  ludwig_eve-striped-2||MEMB002A|+   
58         815   3.285077      868  ludwig_eve-striped-2||MEMB002A|+   
59         856   3.875044      868  ludwig_eve-striped-2||MEMB002A|+   
87          10   5.013668      862  ludwig_eve-striped-2||MEMB002C|+   
88         150  10.457056      862  ludwig_eve-striped-2||MEMB002C|+   
89        -678   8.946094      862  ludwig_eve-striped-2||MEMB002C|+   
90        -606   5.243600      862  ludwig_eve-striped-2||MEMB002C|+   
...        ...        ...      ...                               ...   
8231       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8232       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8233       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8234       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8235       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8236       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8237       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8238       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8239       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8240       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8241       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8242       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8243       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8244       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8245       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8246       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8247       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8248       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8249       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8250       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8251       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8252       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8253       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8254       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8255       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8256       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8257       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8258       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8259       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   
8260       NaN        NaN      NaN  ludwig_eve-striped-2||memb003f|+   

      raw_position    strand  align_position  file  
34              10  positive             NaN  file  
35             144  positive             NaN  file  
36             171  negative             NaN  file  
37             243  negative             NaN  file  
38             332  negative             NaN  file  
39             336  positive             NaN  file  
40             377  positive             NaN  file  
41             399  negative             NaN  file  
42             407  negative             NaN  file  
43             423  positive             NaN  file  
44             452  negative             NaN  file  
45             455  positive             NaN  file  
46             464  negative             NaN  file  
47             521  positive             NaN  file  
48             567  negative             NaN  file  
49             570  positive             NaN  file  
50             599  positive             NaN  file  
51             639  positive             NaN  file  
52             669  positive             NaN  file  
53             687  positive             NaN  file  
54             702  negative             NaN  file  
55             752  negative             NaN  file  
56             810  negative             NaN  file  
57             814  positive             NaN  file  
58             815  positive             NaN  file  
59             856  positive             NaN  file  
87              10  positive             NaN  file  
88             150  positive             NaN  file  
89             184  negative             NaN  file  
90             256  negative             NaN  file  
...            ...       ...             ...   ...  
8231           838       NaN            1078  file  
8232           839       NaN            1079  file  
8233           840       NaN            1080  file  
8234           841       NaN            1081  file  
8235           842       NaN            1082  file  
8236           843       NaN            1083  file  
8237           844       NaN            1084  file  
8238           845       NaN            1085  file  
8239           846       NaN            1112  file  
8240           847       NaN            1113  file  
8241           848       NaN            1114  file  
8242           849       NaN            1115  file  
8243           850       NaN            1116  file  
8244           851       NaN            1117  file  
8245           852       NaN            1118  file  
8246           853       NaN            1119  file  
8247           854       NaN            1120  file  
8248           855       NaN            1121  file  
8249           856       NaN            1122  file  
8250           857       NaN            1123  file  
8251           858       NaN            1124  file  
8252           859       NaN            1125  file  
8253           860       NaN            1126  file  
8254           861       NaN            1127  file  
8255           862       NaN            1128  file  
8256           863       NaN            1129  file  
8257           864       NaN            1130  file  
8258           865       NaN            1131  file  
8259           866       NaN            1132  file  
8260           867       NaN            1133  file  

[8261 rows x 8 columns]

In [21]:
###############################
## Print binary info for each species
################################

# Create new column, 1 if TFBS is present, 0 if absent
TFBS_map_DF_all['presence'] = 0 # For some reason you have to initaite the column first.
TFBS_map_DF_all['presence'] = TFBS_map_DF_all.apply(lambda x: x.notnull(), axis=1)
TFBS_map_DF_all['presence'] = TFBS_map_DF_all['presence'].astype(int)

## Check   
## print_full(TFBS_map_DF_all)

## Create new dataframe

## Check First
## list(TFBS_map_DF_all.columns)

TFBS_map_DF_binary = TFBS_map_DF_all[['species', 'presence', 'align_position', 'strand']].copy()

## Subset on strand

TFBS_map_DF_binary_positive = TFBS_map_DF_binary['strand'] == "positive"
TFBS_map_DF_binary_negative = TFBS_map_DF_binary['strand'] == "negative"

## Check
print(TFBS_map_DF_binary)

## Now long to wide
## [ ] NaNs are introduced here and not sure why and it is super annoying.
## - Maybe it has something to do with the negative and positive strands. These should be subsetted first.
## There should be two presense and absence...or maybe a 2 presented? This get back to the range issue. 
## We should have the 1s maybe represent TFBS range, not just starting position.

TFBS_map_DF_binary = TFBS_map_DF_binary.pivot_table(index='species', columns='align_position', values='presence')

print(TFBS_map_DF_binary.iloc[:,6:40])


                               species  presence  align_position    strand
34    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
35    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
36    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
37    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
38    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
39    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
40    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
41    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
42    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
43    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
44    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
45    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
46    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
47    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
48    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
49    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
50    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
51    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
52    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
53    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
54    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
55    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
56    ludwig_eve-striped-2||MEMB002A|+         1             NaN  negative
57    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
58    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
59    ludwig_eve-striped-2||MEMB002A|+         1             NaN  positive
87    ludwig_eve-striped-2||MEMB002C|+         1             NaN  positive
88    ludwig_eve-striped-2||MEMB002C|+         1             NaN  positive
89    ludwig_eve-striped-2||MEMB002C|+         1             NaN  negative
90    ludwig_eve-striped-2||MEMB002C|+         1             NaN  negative
...                                ...       ...             ...       ...
8231  ludwig_eve-striped-2||memb003f|+         0            1078       NaN
8232  ludwig_eve-striped-2||memb003f|+         0            1079       NaN
8233  ludwig_eve-striped-2||memb003f|+         0            1080       NaN
8234  ludwig_eve-striped-2||memb003f|+         0            1081       NaN
8235  ludwig_eve-striped-2||memb003f|+         0            1082       NaN
8236  ludwig_eve-striped-2||memb003f|+         0            1083       NaN
8237  ludwig_eve-striped-2||memb003f|+         0            1084       NaN
8238  ludwig_eve-striped-2||memb003f|+         0            1085       NaN
8239  ludwig_eve-striped-2||memb003f|+         0            1112       NaN
8240  ludwig_eve-striped-2||memb003f|+         0            1113       NaN
8241  ludwig_eve-striped-2||memb003f|+         0            1114       NaN
8242  ludwig_eve-striped-2||memb003f|+         0            1115       NaN
8243  ludwig_eve-striped-2||memb003f|+         0            1116       NaN
8244  ludwig_eve-striped-2||memb003f|+         0            1117       NaN
8245  ludwig_eve-striped-2||memb003f|+         0            1118       NaN
8246  ludwig_eve-striped-2||memb003f|+         0            1119       NaN
8247  ludwig_eve-striped-2||memb003f|+         0            1120       NaN
8248  ludwig_eve-striped-2||memb003f|+         0            1121       NaN
8249  ludwig_eve-striped-2||memb003f|+         0            1122       NaN
8250  ludwig_eve-striped-2||memb003f|+         0            1123       NaN
8251  ludwig_eve-striped-2||memb003f|+         0            1124       NaN
8252  ludwig_eve-striped-2||memb003f|+         0            1125       NaN
8253  ludwig_eve-striped-2||memb003f|+         0            1126       NaN
8254  ludwig_eve-striped-2||memb003f|+         0            1127       NaN
8255  ludwig_eve-striped-2||memb003f|+         0            1128       NaN
8256  ludwig_eve-striped-2||memb003f|+         0            1129       NaN
8257  ludwig_eve-striped-2||memb003f|+         0            1130       NaN
8258  ludwig_eve-striped-2||memb003f|+         0            1131       NaN
8259  ludwig_eve-striped-2||memb003f|+         0            1132       NaN
8260  ludwig_eve-striped-2||memb003f|+         0            1133       NaN

[8261 rows x 4 columns]
align_position                    6   7   8   9   10  11  12  13  14  15 ...  \
species                                                                  ...   
ludwig_eve-striped-2||memb002a|+   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb002c|+   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb002d|+   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb002e|-   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb002f|+   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb003b|+   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb003c|-   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb003d|-   0   0   0   0   0   0   0   0   0   0 ...   
ludwig_eve-striped-2||memb003f|+   0   0   0   0   0   0   0   0   0   0 ...   

align_position                    30  31  32  33  34  35  36  37  38  39  
species                                                                   
ludwig_eve-striped-2||memb002a|+   0   0   0 NaN NaN NaN   0   0   0   0  
ludwig_eve-striped-2||memb002c|+   0   0   0 NaN NaN NaN   0   0   0   0  
ludwig_eve-striped-2||memb002d|+   0   0   0   0   0   0   0   0   0   0  
ludwig_eve-striped-2||memb002e|-   0   0   0   0   0   0   0   0   0   0  
ludwig_eve-striped-2||memb002f|+   0   0   0 NaN NaN NaN   0   0   0   0  
ludwig_eve-striped-2||memb003b|+   0   0   0   0   0   0   0   0   0   0  
ludwig_eve-striped-2||memb003c|-   0   0   0 NaN NaN NaN   0   0   0   0  
ludwig_eve-striped-2||memb003d|-   0   0   0   0   0   0   0   0   0   0  
ludwig_eve-striped-2||memb003f|+   0   0   0   0   0   0   0   0   0   0  

[9 rows x 34 columns]

In [22]:
####################
## Attach input files name as a column
#####################

## Ideally I would attach the file name of the 1. raw sequence and 2. the motif being tested