Reverse engineering a dynamic web page


Initialization

Import modules needed below. It is assumed a downloader.py module is in the working directory.


In [ ]:
import os, json
import lxml.html
import cssselect
import pprint

from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

# go to working dir, where a module downloader.py should exist
os.chdir(r'C:\Users\ps\Desktop\python\work\web scraping')

from downloader import Downloader                  # script to download html page
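
For reference, here is a minimal sketch of the interface Downloader is assumed to expose: a callable that takes a URL and returns the raw response bytes. The real downloader.py likely adds caching, throttling and retries; this stand-in is purely hypothetical.


In [ ]:
import urllib2                                    # Python 2, matching PyQt4

class Downloader:
    # Hypothetical minimal stand-in for downloader.py
    def __call__(self, url):
        return urllib2.urlopen(url).read()        # raw bytes of the response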

In [ ]:
base = r'http://example.webscraping.com/'
surl = r'http://example.webscraping.com/search'     # url for example site's search page
durl = r'http://example.webscraping.com/dynamic'    # url for JavaScript example

First attempt

Just simple web scraping: download the HTML page and search for the result links with the CSS selector div#results a.


In [ ]:
D = Downloader()
html = D(surl)
tree = lxml.html.fromstring(html.decode('utf-8'))     # decode the raw bytes before parsing
tree.cssselect('div#results a')

Clearly nothing was found. That is because the data is not present in the downloaded page: the actual data is fetched by a GET request that JavaScript code in the page sends to the server after loading. This technique is called AJAX, for Asynchronous JavaScript and XML: a way of transferring data between client and server without reloading the whole HTML page.

We can emulate the JavaScript by sending the server the same URL the script would have requested, like this:


In [ ]:
url = base + 'ajax/search.json?page=0&page_size=10&search_term=a'
html = D(url)
res = json.loads(html.decode('utf-8'))
pprint.pprint(res)

As expected, we got the desired response. Notice that the AJAX endpoint returns its data in JSON format.
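
Since the response is JSON, we could also walk through the full result set by incrementing the page parameter. A rough sketch; the records and num_pages keys here are assumptions based on this site's response, so verify them against the pprint output above:


In [ ]:
# Sketch: collect every page of results. 'records' and 'num_pages'
# are assumed key names; check them against the pprint output above.
all_records = []
page = 0
while True:
    url = base + 'ajax/search.json?page=%d&page_size=10&search_term=a' % page
    res = json.loads(D(url).decode('utf-8'))
    all_records.extend(res.get('records', []))
    page += 1
    if page >= res.get('num_pages', 1):
        break
print(len(all_records))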

The question is: how do we find the right URL in the first place? (In practice, watching the network traffic in a browser's developer tools reveals such AJAX requests.) Alternatively, we can skip hunting for the URL and let a real JavaScript engine generate it for us.

Using WebKit

WebKit is a web-rendering engine that executes JavaScript. We use it to simulate a legitimate web browser: the page's JavaScript runs, sends its request to the server, and receives the correct response for us.

First let's try to fetch data from this new URL in the traditional manner:


In [ ]:
html = D(durl)
tree = lxml.html.fromstring(html.decode('utf-8'))
tree.cssselect('#result')[0].text_content()

Hm, nothing. As expected: the results are filled in client-side by JavaScript after the page loads, so they are absent from the raw HTML. Let's try again, this time using WebKit.

To do that we assemble a handful of Qt objects that together act like a web browser:

The Qt framework requires a QApplication to be created first, to initialize the application's event handling.


In [ ]:
app = QApplication([])

Next create a container for web documents:


In [ ]:
webview = QWebView()

A local event loop is created next:


In [ ]:
loop = QEventLoop()

The loadFinished signal of the QWebView object is connected to the quit slot of QEventLoop, so that when the web page finishes loading the event loop is stopped. The URL to load must be wrapped in a QUrl object, which is passed to QWebView's load() method.


In [ ]:
webview.loadFinished.connect(loop.quit)
webview.load(QUrl(durl))

The QWebView load method is asynchronous: execution passes immediately to the next line while the web page is still being loaded. Since we want to wait until the page is fully loaded, loop.exec_() is called to start the event loop.

We are almost done; our makeshift web browser is ready to rumble!


In [ ]:
loop.exec_()

After the web page has been completely loaded, the event loop exits and execution moves on to the next line, where the resulting HTML can be extracted as usual:


In [ ]:
html = webview.page().mainFrame().toHtml()
tree = lxml.html.fromstring(html)
tree.cssselect('#result')[0].text_content()

This code is packaged in the script webkit_render.py.
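
For reference, a minimal sketch of how that script might package the steps above into a single function (the actual webkit_render.py may differ):


In [ ]:
def render_webkit(url):
    # Sketch: load url in WebKit and return the rendered HTML.
    # Note: only one QApplication may exist per process.
    app = QApplication([])
    webview = QWebView()
    loop = QEventLoop()
    webview.loadFinished.connect(loop.quit)   # stop the loop once loaded
    webview.load(QUrl(url))
    loop.exec_()                              # block until loadFinished fires
    return webview.page().mainFrame().toHtml()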

Cool, heh?

Using Selenium

The advantage of using WebKit directly is full control to customize the browser renderer to behave as we need it to. If such flexibility is not needed, Selenium is a convenient alternative.

Let's redo the previous example using Selenium and its API. The first step is to connect it to a web browser (Firefox in this example).


In [ ]:
from selenium import webdriver
driver = webdriver.Firefox()

When this command is run, an empty browser window pops up.

This is handy because after each command this window can be checked to see whether Selenium worked as expected.

Next we load a web page in the browser using the get() method:


In [ ]:
driver.get(surl)

Next we enter a search term into the Name box. To locate the box we use its ID, 'search_term'; once it is found, the send_keys() method simulates actual keyboard input. To match all countries we enter the '.' metacharacter.


In [ ]:
driver.find_element_by_id('search_term').send_keys('.')

We want to get all countries in a single search. To achieve that we set the page size to 1000, which is accomplished by rewriting the text of an option in the page_size select box using JavaScript:


In [ ]:
js = "document.getElementById('page_size').options[1].text = '1000'"
driver.execute_script(js)
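
As an aside, if one of the page sizes already offered were sufficient, Selenium's Select helper could pick it without resorting to JavaScript. A sketch (the options actually available on this page are an assumption):


In [ ]:
from selenium.webdriver.support.ui import Select

# Hypothetical alternative: choose the largest page size already offered.
select = Select(driver.find_element_by_id('page_size'))
select.select_by_index(len(select.options) - 1)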

Next we locate the Search button by its ID, 'search', and simulate clicking it with the click() method.


In [ ]:
driver.find_element_by_id('search').click()

We need to wait for the AJAX request to complete before using the results. That is done with an implicit wait: the driver keeps polling for requested elements for up to the given number of seconds before giving up:


In [ ]:
driver.implicitly_wait(30)                             # 30 seconds

Next we search for the desired elements like before:


In [ ]:
links = driver.find_elements_by_css_selector('#results a')

Then we create a list of countries by extracting the text from each link that was found:


In [ ]:
countries = [link.text for link in links]
pprint.pprint(countries)

Now we have the response we wanted. Before we leave, close the browser window we created:


In [ ]:
driver.close()
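
As with WebKit, the whole session can be wrapped in one helper. A minimal sketch of the steps above (the function name is ours, not from any script mentioned here):


In [ ]:
def scrape_countries(url):
    # Sketch: drive Firefox through the search form and collect country names.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        driver.find_element_by_id('search_term').send_keys('.')
        driver.execute_script(
            "document.getElementById('page_size').options[1].text = '1000'")
        driver.find_element_by_id('search').click()
        driver.implicitly_wait(30)            # poll up to 30 s for elements
        links = driver.find_elements_by_css_selector('#results a')
        return [link.text for link in links]
    finally:
        driver.close()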

How cool is that, heh?