Bringing it all together: cross referencing data files


To wrap things up, we'll go over one big example that integrates a little bit of everything we've learned so far. This example will demonstrate a very common task in genomics: cross referencing data between different text files.

The situation:

I'm interested in looking at the distribution of translation start sites (TSS) along the length of mRNAs. I already have a file with the locations of all the TSS in mouse (all_start_sites.txt). However, first I need to normalize each TSS position by gene length so that I can graph the relative position of TIS across all genes in my dataset. I can easily do this by dividing the start site position by the full length of the gene. Unfortunately, though, I have gene lengths stored in a separate file (transc_lengths.txt), so I need to somehow integrate these two pieces of information (TSS position and gene length) across the two files.

Here's a picture version of what I want to do:

And here a few lines from each of the two files I have:

all_start_sites.txt

knownGene   GeneName    InitCodon[nt]   DisttoCDS[codons]   FramevsCDS  InitContext[-3to+4] CDSLength[codons]   HarrPeakStart   HarrPeakWidth   #HarrReads  PeakScore   Codon   Product
uc007afd.1  Mrpl15  248 79  1   AATATGG 15  247 2   368 2.61    aug internal-out-of-frame
uc007afh.1  Lypla1  36  5   0   AACATGT 225 34  4   783 3.27    aug n-term-trunc
uc007afi.1  Tcea1   28  -24 0   GGCTTGT 325 27  3   446 1.43    nearcog n-term-ext
uc007afi.1  Tcea1   100 0   0   GCCATGG 301 99  3   3852    3.79    aug canonical
uc007afn.1  Atp6v1h 100 -13 -1  GCTATCC 10  99  3   728 0.77    nearcog uorf
uc007afn.1  Atp6v1h 149 3   0   AAGATGG 480 147 3   1407    1.36    aug n-term-trunc
uc007agb.1  Pcmtd1  120 -97 -1  GCGCTGG 45  119 3   65  0.75    nearcog uorf
...

transc_lengths.txt

SeqID   Len
uc009gmc.1  4900
uc008mue.1  459
uc007hzr.1  4578
uc007gtm.1  1257
uc007axo.1  2311
uc007wps.1  2694
uc007gqc.1  30
uc009smc.1  1530
...

The common piece of information between these files is the gene ID (knownGene and SeqID), so this is what we will use to match up start sites to gene lengths. Note that we can not count on the gene IDs to be in the same order in the two files, so we will need some other way of matching up the data. (Something like a lookup table, perhaps..)

The plan

When working with more than one file, it sometimes helps to write down a step-by-step plan before you start coding. One way to do this is to write out your plan in the form of code comments -- then, once you're ready to code, you can just fill in what's needed under each comment.

Here's my step-by-step plan for tackling this problem:

1. Open length file
2. For each line in length file: 
    a. Extract the gene id (1st column) and length (2nd column)
    b. Store lengths in dictionary based on gene id (where gene IDs are the keys and lengths are the values)
3. Open output file
4. Open TSS file
5. For each line in tss file:
    a. Extract the id (1st column) and start site (3rd column)
    b. Using the id, lookup the length of the gene from the hash
    c. Divide the start position by the length of the gene
    d. Print the result to the output file

The code

And here's the final code. (Side note: I heavily commented this code for educational purposes, but usually you would not need so many comments!)


In [ ]:
# input files
tssFile = "all_start_sites.txt"
lenFile = "transc_lengths.txt"

# output files
normOut = "normalized_tss.txt"

# lookup table (dictionary) for transc lengths
lengthDict = {}


# read in lengths, store in dictionary
ins = open(lenFile, 'r')
ins.readline()                              # don't forget to skip the header!

for line in ins:
    line = line.rstrip('\r\n')
    
    linePartsList = line.split()            # split on whitespace to break line into separate pieces of data
    geneID = linePartsList[0]               # gene id is the first column, so it's stored in index 0 of linePartsList
    geneLength = int(linePartsList[1])      # gene length = second column, so it's stored in index 1. Note conversion to int!!
    
    lengthDict[geneID] = geneLength         # store len in hash labeled by the id

ins.close()


# read in TSS, use stored lengths to normalize, then print. No need to store TSS.
skipCount = 0 

output = open(normOut, 'w')
output.write("RelativePos\n")               # write header line for output file. remember to add a \n character to the end.

ins = open(tssFile, 'r')
ins.readline()                              # skip header in input file

for line in ins:
    line = line.rstrip('\r\n')
    
    linePartsList = line.split()            # split on whitespace to break line into separate pieces of data
    geneID = linePartsList[0]               # gene id is the first column, so it's the first element in the list
    tss = int(linePartsList[2])             # TSS = third column, so it's third in the list. Note conversion to int!!
    
    if geneID in lengthDict:                # make sure there is an entry in lengthDict for this id. See note below (##)
        fullLen = lengthDict[geneID]        # using the gene id as the key, lookup the length of this gene in the dictionary
        norm = tss / float(fullLen)         # get relative position of TSS. We convert one number to float to avoid int division
        output.write(str(norm) + "\n")      # write the result to output file. Note conversion to string and \n on end.
    else:
        print ">Warning: geneID", geneID, "was not in length lookup table. This gene will be skipped."
        skipCount = skipCount + 1

ins.close()
output.close()

print "Skipped", skipCount, "entries in TSS file."

(##) Note: Why do we have the line "if geneID in lengthDict:"?

If there are gene IDs in the TSS file that were not included in the length file, then we will get an error when we try to look up that ID in the lengthDict (because it's just not there). As you can see from the output above, this is exactly what would have happened to us had we not included this if statement (try removing it yourself and see what happens).

Whenever you think there is a chance this sort of thing might happen (and there almost always is in genomics, unfortunately) just add a quick if statement like the one here, which will let you skip over any IDs that would cause an error.

It's also usually a good idea to print out a message whenever you skip data, like we do in the else: statement above. However, this can sometimes cause your code output to get out of hand (as it does here), so sometimes it's better to just print the count of how many data points were skipped.

Code output

The main output of this code is a new file called normalized_tss.txt that contains all of the normalized TSS positions in a single column. Here's what it looks like:

RelativePos
0.0147118921128
0.0506072874494
0.0754048582996
0.0229226361032
0.0506208213945
0.0787010506208
0.140783190067
0.170009551098
0.0329929300864
...

These data can then be plotted in the form of a histogram. We won't go over plotting here, but here's an example of a plot generated using ggplot2 in R:

From this we can conclude that the majority of TSS occur in the first 10% of the gene length, which is pretty much what we would expect.