Lessons Learned Scraping The Web

Hacking public, inflexible APIs and discovering concurrency

Andres De Castro

https://github.com/andres-de-castro/scraping

Live presentation made with RISE

* https://github.com/damianavila/RISE

The Problem

  • Obtuse APIs
  • 10k+ requests -> hours of completion time
  • JavaScript-rendered pages
  • Authentication/Rate Limiting

A simple example

  • Our data is tabular: it lives in a table element (td cells) on a webpage
  • Interfacing with a server usually requires the following (sketched below):
    • An HTTP request handler (requests / urllib libraries)
    • An HTML parser (bs4 / lxml)
    • A container to store, modify and view the data (pandas)
  • We will interact with Morningstar's 'API'
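
Spelled out, the three pieces look roughly like this (a minimal sketch; the URL and table lookup are placeholders, not a real endpoint):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://example.com/some-table-page'   # hypothetical page with a table

response = requests.get(url)                  # 1. HTTP request handler
soup = BeautifulSoup(response.content, 'lxml')  # 2. HTML parser
table = soup.find('table')                    # locate the table element
df = pd.read_html(str(table))[0]              # 3. container to store/modify/view

When the page is a clean table, pandas can collapse all three steps into a single read_html call: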

In [1]:
import pandas as pd

url = 'http://performance.morningstar.com/Performance/stock/split-history.action?&t=AAPL'

pd.read_html(url)[0]


Out[1]:
Date Ratio
0 06/09/2014 7:1
1 02/28/2005 2:1
2 06/21/2000 2:1
3 06/16/1987 2:1

A not-so-simple example

Target data lives in a table element

http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show #1301


In [2]:
# The naive approach

url = 'http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=1301'

try:
    pd.read_html(url)
except Exception as e:
    print(e)


HTTP Error 502: Bad Gateway

Frustration!

Lesson Learned #1 - APIs aren't flexible

Thought Process

  • Perhaps the web server knows it is Python making the request
  • Can we trick the web server into thinking it is a web browser making the requests?
  • Clock is ticking...

Enter Selenium

  • A fully featured web driver
  • Uses Firefox by default
  • Used by QAs everywhere
  • Headless option with xvfb
  • Seems like a good solution, right? (a minimal sketch follows)
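
Roughly what the Selenium approach looks like (a minimal sketch, not the production script; the explicit wait and the table lookup are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()   # a full Firefox process per scraper
try:
    driver.get('http://quote.jpx.co.jp/jpx/template/quote.cgi'
               '?F=tmp/e_stock_detail&MKTN=T&QCODE=1301')
    # Block until the target table has actually rendered,
    # otherwise we hit "element not found" errors
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    html = driver.page_source   # hand the rendered HTML to bs4 / pandas
finally:
    driver.quit()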

Results

  • A full day of development time
  • Three days of debugging
  • Requesting a web element that hasn't loaded yet -> error
  • High overhead from the full Firefox process
  • Multiple try-excepts -> Multiple edge cases
  • Completion time (3800 stocks) ~1.2 hours
  • Our users want today's data ASAP

Lesson Learned #2

Use the right tool for the job

Rethinking the approach

Let's revisit our target

http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show

Examine our request

%%bash

curl 'http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=1301' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www2.tse.or.jp/tseHpFront/JJK020010Action.do' -H 'Cookie: TS4be622=5de6667395943132172f01acdabc66df16cd3f45e0bd3db2578e4e0e' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed

Lesson Learned #3

Chrome DevTools / Firefox's Firebug are your best friends

What next?

  • Parse the headers manually into requests
  • For loop through all the stock indices
  • Seems like a good idea?

Parse request headers

  • Use the Network tab in Chrome / Firefox
  • Extract the headers (akin to "Copy as cURL")
  • Feed the headers as a dictionary into the requests module (see the sketch below)
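
In code that amounts to a plain dictionary handed to requests (a sketch reusing the headers from the cURL command above; the cookie is long stale):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Referer': 'http://www2.tse.or.jp/tseHpFront/JJK020010Action.do',
    'Cookie': 'TS4be622=5de6667395943132172f01acdabc66df16cd3f45e0bd3db2578e4e0e',
}

r = requests.get('http://quote.jpx.co.jp/jpx/template/quote.cgi'
                 '?F=tmp/e_stock_detail&MKTN=T&QCODE=1301',
                 headers=headers)
print(r.status_code)   # expect 200 rather than the earlier 502, assuming the cookie is still valid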

My suggestion

http://curl.trillworks.com/

  • Feed it a cURL request
  • It returns nicely formatted code for use with requests

In [7]:
import pandas as pd 

import requests
from bs4 import BeautifulSoup
from io import StringIO

codes = ['9986', '9987', '9989', '9990'] #'9991', '9992', '9993', '9994', '9995', '9996']
for code in codes:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Cookie': '__utma=139475176.428689694.1438095265.1439320455.1440102255.14; __utmz=139475176.1440102255.14.6.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); TS4be622=c6390468d7aed6d150c549c11b5dbc654181b62eb149119556167434',
        'Referer': 'http://www2.tse.or.jp/tseHpFront/JJK020010Action.do'
    }

    payload = {'F': 'tmp/e_stock_detail',
               'MKTN': 'T',
               'QCODE': str(code)
               }

    r = requests.post('http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=' + str(code), data=payload, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # The OHLCV history is embedded as CSV text in a hidden input's value attribute
    values = soup.find(id="histData")['value']

    df = pd.read_csv(StringIO(values), sep=",", header=None, index_col=0)
    df = df.drop(df.columns[-1], axis=1)   # drop the trailing empty column
    df.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
    df.index.names = ['Date']

df.tail(10)


Out[7]:
Open High Low Close Volume
Date
2016/07/06 945.0 962.0 935.0 961.0 178500
2016/07/07 961.0 972.0 938.0 945.0 99200
2016/07/08 940.0 947.0 901.0 904.0 238600
2016/07/11 917.0 986.0 917.0 964.0 220400
2016/07/12 982.0 1018.0 980.0 1015.0 144400
2016/07/13 1033.0 1050.0 1016.0 1040.0 188900
2016/07/14 1033.0 1048.0 1007.0 1030.0 106600
2016/07/15 1026.0 1052.0 1013.0 1019.0 109200
2016/07/19 1019.0 1038.0 1001.0 1018.0 75900
2016/07/20 1007.0 1009.0 981.0 1006.0 94300

Still too slow for production needs

  • Completion time ~40 minutes (only 33% faster than the Selenium approach)
  • A single timeout or redirect -> the entire job fails

Concurrency

  • The process spends most of its time waiting on each request's completion
  • We'll exploit the ability to have multiple requests in flight at once
  • Pass a collection of URLs + a function to transform the data received

Problems

  • Production using Ubuntu LTS -> Restricted to py2.6 & 3.4
  • The async/await syntax wasn't added until Python 3.5
  • Luckily for us we have the Twisted/Tornado frameworks (which run on both Python 2.x and 3.x)

Lesson Learned #4

Most of the hard work has already been done for you

i.e. don't reinvent the wheel


In [9]:
import sys

from tornado import gen, ioloop
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.queues import Queue

class Scraper():

    def __init__(self):
        # Shared work queue and a single non-blocking HTTP client
        self.queue = Queue()
        self.http_client = AsyncHTTPClient()

    @gen.coroutine
    def read(self, destinations):
        # Producer: enqueue every target URL
        for url in destinations:
            yield self.queue.put(url)

    @gen.coroutine
    def get(self, transform, headers, connect_timeout, request_timeout):
        # Worker: pull a URL, fetch it, hand the body to `transform`
        while True:
            url = yield self.queue.get()
            try:
                request = HTTPRequest(url,
                                      connect_timeout=connect_timeout,
                                      request_timeout=request_timeout,
                                      method="GET",
                                      headers=headers)
            except Exception as e:
                sys.stderr.write('Destination {0} returned error {1}'.format(url, str(e) + '\n'))
                self.queue.task_done()
                continue

            future = self.http_client.fetch(request)

            def done_callback(future):
                try:
                    body = future.result().body
                    url = future.result().effective_url
                    transform(body, url=url)
                except Exception as e:
                    sys.stderr.write('Fetch failed: {0}\n'.format(str(e)))
                finally:
                    # Always mark the item done so queue.join() can complete
                    self.queue.task_done()

            future.add_done_callback(done_callback)
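
The class only defines the producer (read) and worker (get) coroutines; driving it might look something like this (a sketch: the transform function, worker count and timeouts are illustrative, not the values used in tse.py):

from tornado import gen, ioloop

def save_row(body, url=None):
    # Placeholder transform: parse `body` and append a row to tse.csv
    pass

@gen.coroutine
def main():
    scraper = Scraper()
    urls = ['http://quote.jpx.co.jp/jpx/template/quote.cgi'
            '?F=tmp/e_stock_detail&MKTN=T&QCODE={0}'.format(code)
            for code in ['1301', '1305', '1306']]   # the full run covers ~3800 codes
    yield scraper.read(urls)
    # Spawn concurrent workers that all share the same queue
    for _ in range(50):
        scraper.get(save_row, headers={}, connect_timeout=10, request_timeout=30)
    yield scraper.queue.join()   # resolves once task_done() has run for every URL

ioloop.IOLoop.current().run_sync(main)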

In [5]:
%%bash 

time python tse.py


4930  has returned 0 values. check if deprecated
6200  has returned 0 values. check if deprecated
6531  has returned 0 values. check if deprecated
3470  has returned 0 values. check if deprecated
3469  has returned 0 values. check if deprecated
3471  has returned 0 values. check if deprecated
3544  has returned 0 values. check if deprecated

real	3m49.665s
user	2m29.972s
sys	0m2.451s

In [6]:
pd.read_csv('tse.csv').head(10)


Out[6]:
Stock Date Open High Low Close Volume
0 3936 2016/07/20 10500.0 10590.0 10280.0 10280.0 4200
1 3934 2016/07/20 2380.0 2425.0 2380.0 2395.0 3300
2 3932 2016/07/20 2874.0 3220.0 2870.0 3220.0 657400
3 3935 2016/07/20 2407.0 2688.0 2380.0 2594.0 118300
4 3929 2016/07/20 1210.0 1250.0 1208.0 1219.0 6100
5 9966 2016/07/20 1681.0 1704.0 1677.0 1684.0 3300
6 9967 2016/07/20 NaN NaN NaN NaN 0
7 9969 2016/07/20 495.0 495.0 493.0 494.0 1900
8 9974 2016/07/20 4090.0 4200.0 4070.0 4200.0 15200
9 9976 2016/07/20 659.0 659.0 659.0 659.0 32000

In closing

On Concurrency

* David Beazley's PyCon talk Concurrency From The Ground Up
* https://www.youtube.com/watch?v=MCs5OvhV9S4
* Tornado library

Dealing with js rendered webpages

* PhantomJS (ghost.py)
* Selenium (a minimal sketch below)
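
For reference, a minimal sketch of fetching a JS-rendered page with PhantomJS via Selenium (assumes the phantomjs binary is on the PATH; later Selenium releases deprecate PhantomJS support):

from selenium import webdriver

driver = webdriver.PhantomJS()   # headless browser, no X server needed
try:
    driver.get('http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show')
    rendered_html = driver.page_source   # JavaScript has been executed at this point
finally:
    driver.quit()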