Download data from APD database

Note that this recipe is a work in progress.

Tools

In this recipe we use Selenium to download the HTML source returned by a web server whose query must be typed into an input box.

We use pandas to read the Excel files that contain the query sequences, and the time module to pause while each result page finishes loading.


In [18]:
from selenium import webdriver
import pandas as pd
import time

Launch the Firefox web browser.


In [43]:
driver = webdriver.Firefox()

Using the get function, we tell Firefox to visit the web page (shown below).

The web page has only an input box, a submit button, and a clear button.


In [ ]:
url = 'http://aps.unmc.edu/AP/prediction/prediction_main.php'  # the url of the main page
driver.get(url)

In [48]:
from IPython.display import HTML  ## for displaying the webpage in IFrame

In [50]:
HTML('<iframe src=http://aps.unmc.edu/AP/prediction/prediction_main.php width=700 height=500></iframe>')


Out[50]:

Read query sequences from an Excel file.


In [26]:
table = pd.read_excel('T_cell_epitope_positive.xlsx')

We then iterate over the rows of the table (a DataFrame) to get each epitope ID and its query sequence.

The query sequence is entered in the input box using the send_keys function, and the submit function is called to submit the query.

The script waits 5 seconds for the result page to load, saves the HTML source to an output file named after the epitope ID, and then proceeds to the next iteration.


In [27]:
for _, row in table.iterrows():
    # first column holds the epitope ID, third column holds the query sequence
    epitope_id, seq = row.iloc[0], row.iloc[2]
    driver.get(url)  # load the main page
    # find the input element by its name attribute; find_element_by_name
    # is no longer available in recent Selenium 4 releases
    input_element = driver.find_element('name', 'input')
    input_element.send_keys(seq)  # enter the query sequence in the input box
    input_element.submit()  # submit the query sequence to the server
    time.sleep(5)  # wait 5 seconds for the result page to load
    # save the HTML source to a file named after the epitope ID
    with open('data/t_cell_positive/%s.txt' % epitope_id, 'w') as op:
        op.write(driver.page_source)
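
A fixed sleep is either longer than needed or too short when the server is slow. As a minimal sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are assumptions made for this illustration and are not part of the original recipe; Selenium offers a comparable facility in WebDriverWait.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call `predicate` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll_interval)


# Possible use inside the loop, in place of time.sleep(5). The marker text
# is hypothetical: it stands for any string that appears only on the result page.
# wait_until(lambda: 'RESULT MARKER' in driver.page_source, timeout=30)
```

The condition is passed in as a plain callable, so the same helper works for any check against the driver, and a timeout surfaces as an exception instead of silently saving a half-loaded page.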

We do the same thing for two more Excel files. In both of them the query sequence is in the second column.


In [ ]:
table = pd.read_excel('data/t_cell_epitope_negative.xlsx')

In [ ]:
url = 'http://aps.unmc.edu/AP/prediction/prediction_main.php'

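for _, row in table.iterrows():
    # first column holds the epitope ID, second column holds the query sequence
    epitope_id, seq = row.iloc[0], row.iloc[1]
    driver.get(url)  # load the main page
    input_element = driver.find_element('name', 'input')  # find the input element by name
    input_element.send_keys(seq)  # enter the query sequence
    input_element.submit()  # submit the query
    time.sleep(5)  # wait 5 seconds for the result page to load
    # save the HTML source to a file named after the epitope ID
    with open('data/t_cell_negative/%s.txt' % epitope_id, 'w') as op:
        op.write(driver.page_source)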

In [44]:
table = pd.read_excel('data/B_cell_epitope_negative.xlsx')

In [45]:
url = 'http://aps.unmc.edu/AP/prediction/prediction_main.php'

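for _, row in table.iterrows():
    # first column holds the epitope ID, second column holds the query sequence
    epitope_id, seq = row.iloc[0], row.iloc[1]
    driver.get(url)  # load the main page
    input_element = driver.find_element('name', 'input')  # find the input element by name
    input_element.send_keys(seq)  # enter the query sequence
    input_element.submit()  # submit the query
    time.sleep(3)  # wait 3 seconds for the result page to load
    # save the HTML source to a file named after the epitope ID
    with open('data/b_cell_negative/%s.txt' % epitope_id, 'w') as op:
        op.write(driver.page_source)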
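
The three loops above differ only in the input table, the sequence column, and the output directory, so they could be folded into one function. The sketch below is one way to do that under stated assumptions: save_pages and fetch_html are hypothetical names introduced here, and the browser interaction is passed in as a callable so the file-writing logic can be exercised without a running browser.

```python
from pathlib import Path


def save_pages(records, out_dir, fetch_html):
    """Fetch one page per (epitope_id, sequence) pair and save it to disk.

    `records` is any iterable of (epitope_id, sequence) pairs and
    `fetch_html` is a callable that takes a sequence and returns HTML text.
    Both names are assumptions made for this illustration.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    saved = []
    for epitope_id, seq in records:
        html = fetch_html(seq)
        path = out_dir / ('%s.txt' % epitope_id)  # file named after the epitope ID
        path.write_text(html, encoding='utf-8')
        saved.append(path)
    return saved


# Possible use with the objects defined earlier in this recipe:
#
# def fetch_html(seq):
#     driver.get(url)
#     box = driver.find_element('name', 'input')
#     box.send_keys(seq)
#     box.submit()
#     time.sleep(5)
#     return driver.page_source
#
# pairs = zip(table.iloc[:, 0], table.iloc[:, 1])
# save_pages(pairs, 'data/t_cell_negative', fetch_html)
```

Unlike the loops above, this version creates the output directory when it does not exist, so the first write cannot fail on a missing folder.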

Quit Firefox.


In [ ]:
driver.quit()