Extract-Transform-Load Script

Extract-Transform-Load Scripts (ETLS) are common tools in data management. The purpose of ETLS is to gather relevant data (both direct and inferred) from public databases and capture important features in a possibly different data structure schema for specific analysis.

PubMed Central ETLS Example

This script will Extract data from the CSV files provided to us by Stanford, Transform the data into a format usable by GeneDive, and then Load the data into the GeneDive sqlite database.

Whenever new data is obtained for GeneDive, this process should be run against that dataset.


In [1]:
import re
import sqlite3

In [2]:
INTERACTIONS_FILE = "pmc.csv"
DELIMITER = "\t"
DATABASE = "data.sqlite"

If write is false, the script will run but not write anything to the database. This keeps it safe while you're nosing around, and can also be useful if you need to re-generate the complete typeahead/adjacency files.


In [3]:
WRITE = True

In [4]:
conn = sqlite3.connect(DATABASE)
cursor = conn.cursor()

Map the columns as they appear in the file to the correct values.


In [5]:
interactions = []

with open(INTERACTIONS_FILE) as file:
    for line in file:
        line = line[:-1]
        line = line.split(DELIMITER)
        
        interaction = {
          "journal": line[0],
          "article_id": line[1],
          "pubmed_id": line[2],
          "sentence_id": line[3],
          "mention1_offset": line[5],
          "mention2_offset": line[4],
          "mention1": line[6],
          "mention2": line[7],
          "geneids1": line[8],
          "geneids2": line[9],
          "probability": line[10],
          "excerpt": line[11]    
        }
        
        interactions.append(interaction)

Remove any interactions for which the a gene traces to multiple IDs.


In [6]:
interactions = [i for i in interactions if ( ';' not in i['geneids1'] and ';' not in i['geneids2'] ) ]

GeneDive expects the target genes in the excerpt to be wrapped in pound signs. This is important because a sentence may mention the target gene multiple times, so we need to use the offset data her to make sure we tag the right mention.


In [7]:
for i in interactions:
    excerpt = i['excerpt']
    
    excerpt = re.sub('"', '', excerpt)
    tokens = excerpt.split(" ")
    offset1 = int(i['mention1_offset'])
    offset2 = int(i['mention2_offset'])

    tokens[offset1] = "".join(["#",tokens[offset1],"#"])
    tokens[offset2] = "".join(["#",tokens[offset2],"#"])
    
    i['excerpt'] = " ".join(tokens)

Specific for PMC Data

We didn't get Journal Data - we need to extract it from the article titles. Comment out the next section if journal titles were included.


In [8]:
for i in interactions:
    journal = i['article_id'].split("_")
    x = 0

    while x < len(journal):
        if journal[x][:2] == "19" or journal[x][:2] == "20":
            journal = " ".join(journal[:x])
            break
            
        x+= 1

    i['journal'] = journal

Our insert statement - probably don't need to touch this


In [9]:
INTERACTIONS_WRITE = '''insert into interactions ( journal, article_id, pubmed_id, sentence_id, mention1_offset, mention2_offset, mention1, mention2, geneids1, geneids2, probability, context, section, reactome ) values ( "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}", "{}" );'''

In [10]:
for interaction in interactions:
    statement = INTERACTIONS_WRITE.format(
        interaction['journal'],
        interaction['article_id'],
        interaction['pubmed_id'],
        interaction['sentence_id'],
        interaction['mention1_offset'],
        interaction['mention2_offset'],
        interaction['mention1'],
        interaction['mention2'],
        interaction['geneids1'],
        interaction['geneids2'],
        interaction['probability'],
        interaction['excerpt'],
        "Unknown",
        0
    )
    
    cursor.execute(statement)

if WRITE:
    conn.commit()

conn.close()

In [ ]: