Lessons Learned Scraping The Web

Hacking public, inflexible APIs and discovering concurrency

Andres De Castro

https://github.com/andres-de-castro/scraping

Live presentation made with RISE

* https://github.com/damianavila/RISE

The Problem

  • Obtuse APIs
  • 10k+ requests -> hours of completion time
  • JavaScript-rendered pages
  • Authentication/Rate Limiting

A simple example

  • Our data is tabular: it lives in a table element (td cells) on a webpage
  • Interfacing with a server usually requires the following (sketched below):
    • An HTTP request handler (requests / urllib libraries)
    • An HTML parser (bs4 / lxml)
    • A container to store, modify and view the data (pandas)
  • We will interact with Morningstar's 'API'
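
Spelled out, the three pieces look roughly like this (a minimal sketch; the URL and table lookup are placeholders, not a real endpoint):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://example.com/some-table-page'   # hypothetical page with a table

response = requests.get(url)                  # 1. HTTP request handler
soup = BeautifulSoup(response.content, 'lxml')  # 2. HTML parser
table = soup.find('table')                    # locate the table element
df = pd.read_html(str(table))[0]              # 3. container to store/modify/view

When the page is a clean table, pandas can collapse all three steps into a single read_html call: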

In [1]:
import pandas as pd

url = 'http://performance.morningstar.com/Performance/stock/split-history.action?&t=AAPL'

pd.read_html(url)[0]


Out[1]:
Date Ratio
0 06/09/2014 7:1
1 02/28/2005 2:1
2 06/21/2000 2:1
3 06/16/1987 2:1

A not-so-simple example

Target data lives in a table element

http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show #1301


In [2]:
# The naive approach

url = 'http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=1301'

try:
    pd.read_html(url)
except Exception as e:
    print(e)


HTTP Error 502: Bad Gateway

Frustration!

Lesson Learned #1 - APIs aren't flexible

Thought Process

  • Perhaps the web server knows it is Python making the request
  • Can we trick the web server into thinking it is a web browser making the requests?
  • Clock is ticking...

Enter Selenium

  • A fully featured web driver
  • Uses Firefox by default
  • Used by QAs everywhere
  • Headless option with xvfb
  • Seems like a good solution, right? (a minimal sketch follows)
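
Roughly what the Selenium approach looks like (a minimal sketch, not the production script; the explicit wait and the table lookup are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()   # a full Firefox process per scraper
try:
    driver.get('http://quote.jpx.co.jp/jpx/template/quote.cgi'
               '?F=tmp/e_stock_detail&MKTN=T&QCODE=1301')
    # Block until the target table has actually rendered,
    # otherwise we hit "element not found" errors
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    html = driver.page_source   # hand the rendered HTML to bs4 / pandas
finally:
    driver.quit()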

Results

  • A full day of development time
  • Three days of debugging
  • Requesting a web element that hasn't loaded yet -> error
  • High overhead from the full Firefox process
  • Multiple try-excepts -> Multiple edge cases
  • Completion time (3800 stocks) ~1.2 hours
  • Our users want today's data ASAP

Lesson Learned #2

Use the right tool for the job

Rethinking the approach

Let's revisit our target

http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show

Examine our request

%%bash

curl 'http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=1301' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www2.tse.or.jp/tseHpFront/JJK020010Action.do' -H 'Cookie: TS4be622=5de6667395943132172f01acdabc66df16cd3f45e0bd3db2578e4e0e' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed

Lesson Learned #3

Chrome DevTools / Firefox's Firebug are your best friends

What next?

  • Parse the headers manually into requests
  • For loop through all the stock indices
  • Seems like a good idea?

Parse request headers

  • Use the Network tab in Chrome / Firefox
  • Extract the headers (akin to "Copy as cURL")
  • Feed the headers as a dictionary into the requests module (see the sketch below)
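
In code that amounts to a plain dictionary handed to requests (a sketch reusing the headers from the cURL command above; the cookie is long stale):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Referer': 'http://www2.tse.or.jp/tseHpFront/JJK020010Action.do',
    'Cookie': 'TS4be622=5de6667395943132172f01acdabc66df16cd3f45e0bd3db2578e4e0e',
}

r = requests.get('http://quote.jpx.co.jp/jpx/template/quote.cgi'
                 '?F=tmp/e_stock_detail&MKTN=T&QCODE=1301',
                 headers=headers)
print(r.status_code)   # expect 200 rather than the earlier 502, assuming the cookie is still valid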

My suggestion

http://curl.trillworks.com/

  • Feed it a cURL request
  • It returns nicely formatted code for use with requests

In [7]:
import pandas as pd 

import requests
from bs4 import BeautifulSoup
from io import StringIO

codes = ['9986', '9987', '9989', '9990'] #'9991', '9992', '9993', '9994', '9995', '9996']
for code in codes:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Cookie': '__utma=139475176.428689694.1438095265.1439320455.1440102255.14; __utmz=139475176.1440102255.14.6.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); TS4be622=c6390468d7aed6d150c549c11b5dbc654181b62eb149119556167434',
        'Referer': 'http://www2.tse.or.jp/tseHpFront/JJK020010Action.do'
    }

    payload = {'F': 'tmp/e_stock_detail',
               'MKTN': 'T',
               'QCODE': str(code)
               }

    r = requests.post('http://quote.jpx.co.jp/jpx/template/quote.cgi?F=tmp/e_stock_detail&MKTN=T&QCODE=' + str(code), data=payload, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # The OHLCV history is embedded as CSV text in a hidden input's value attribute
    values = soup.find(id="histData")['value']

    df = pd.read_csv(StringIO(values), sep=",", header=None, index_col=0)
    df = df.drop(df.columns[-1], axis=1)   # drop the trailing empty column
    df.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
    df.index.names = ['Date']

df.tail(10)


Out[7]:
Open High Low Close Volume
Date
2016/07/06 945.0 962.0 935.0 961.0 178500
2016/07/07 961.0 972.0 938.0 945.0 99200
2016/07/08 940.0 947.0 901.0 904.0 238600
2016/07/11 917.0 986.0 917.0 964.0 220400
2016/07/12 982.0 1018.0 980.0 1015.0 144400
2016/07/13 1033.0 1050.0 1016.0 1040.0 188900
2016/07/14 1033.0 1048.0 1007.0 1030.0 106600
2016/07/15 1026.0 1052.0 1013.0 1019.0 109200
2016/07/19 1019.0 1038.0 1001.0 1018.0 75900
2016/07/20 1007.0 1009.0 981.0 1006.0 94300

Still too slow for production needs

  • Completion time ~40 minutes (only 33% faster than the Selenium approach)
  • A single timeout or redirect -> the entire job fails

Concurrency

  • The process spends most of its time waiting on each request's completion
  • We'll exploit the ability to have multiple requests in flight at once
  • Pass a collection of URLs + a function to transform the data received

Problems

  • Production using Ubuntu LTS -> Restricted to py2.6 & 3.4
  • The async/await syntax wasn't added until Python 3.5
  • Luckily for us we have the Twisted/Tornado frameworks (which run on both Python 2.x and 3.x)

Lesson Learned #4

Most of the hard work has already been done for you

i.e. don't reinvent the wheel


In [9]:
import sys

from tornado import gen, ioloop
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.queues import Queue

class Scraper():

    def __init__(self):
        # Shared work queue and a single non-blocking HTTP client
        self.queue = Queue()
        self.http_client = AsyncHTTPClient()

    @gen.coroutine
    def read(self, destinations):
        # Producer: enqueue every target URL
        for url in destinations:
            yield self.queue.put(url)

    @gen.coroutine
    def get(self, transform, headers, connect_timeout, request_timeout):
        # Worker: pull a URL, fetch it, hand the body to `transform`
        while True:
            url = yield self.queue.get()
            try:
                request = HTTPRequest(url,
                                      connect_timeout=connect_timeout,
                                      request_timeout=request_timeout,
                                      method="GET",
                                      headers=headers)
            except Exception as e:
                sys.stderr.write('Destination {0} returned error {1}'.format(url, str(e) + '\n'))
                self.queue.task_done()
                continue

            future = self.http_client.fetch(request)

            def done_callback(future):
                try:
                    body = future.result().body
                    url = future.result().effective_url
                    transform(body, url=url)
                except Exception as e:
                    sys.stderr.write('Fetch failed: {0}\n'.format(str(e)))
                finally:
                    # Always mark the item done so queue.join() can complete
                    self.queue.task_done()

            future.add_done_callback(done_callback)
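
The class only defines the producer (read) and worker (get) coroutines; driving it might look something like this (a sketch: the transform function, worker count and timeouts are illustrative, not the values used in tse.py):

from tornado import gen, ioloop

def save_row(body, url=None):
    # Placeholder transform: parse `body` and append a row to tse.csv
    pass

@gen.coroutine
def main():
    scraper = Scraper()
    urls = ['http://quote.jpx.co.jp/jpx/template/quote.cgi'
            '?F=tmp/e_stock_detail&MKTN=T&QCODE={0}'.format(code)
            for code in ['1301', '1305', '1306']]   # the full run covers ~3800 codes
    yield scraper.read(urls)
    # Spawn concurrent workers that all share the same queue
    for _ in range(50):
        scraper.get(save_row, headers={}, connect_timeout=10, request_timeout=30)
    yield scraper.queue.join()   # resolves once task_done() has run for every URL

ioloop.IOLoop.current().run_sync(main)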

In [5]:
%%bash 

time python tse.py


4930  has returned 0 values. check if deprecated
6200  has returned 0 values. check if deprecated
6531  has returned 0 values. check if deprecated
3470  has returned 0 values. check if deprecated
3469  has returned 0 values. check if deprecated
3471  has returned 0 values. check if deprecated
3544  has returned 0 values. check if deprecated

real	3m49.665s
user	2m29.972s
sys	0m2.451s

In [6]:
pd.read_csv('tse.csv').head(10)


Out[6]:
Stock Date Open High Low Close Volume
0 3936 2016/07/20 10500.0 10590.0 10280.0 10280.0 4200
1 3934 2016/07/20 2380.0 2425.0 2380.0 2395.0 3300
2 3932 2016/07/20 2874.0 3220.0 2870.0 3220.0 657400
3 3935 2016/07/20 2407.0 2688.0 2380.0 2594.0 118300
4 3929 2016/07/20 1210.0 1250.0 1208.0 1219.0 6100
5 9966 2016/07/20 1681.0 1704.0 1677.0 1684.0 3300
6 9967 2016/07/20 NaN NaN NaN NaN 0
7 9969 2016/07/20 495.0 495.0 493.0 494.0 1900
8 9974 2016/07/20 4090.0 4200.0 4070.0 4200.0 15200
9 9976 2016/07/20 659.0 659.0 659.0 659.0 32000

In closing

On Concurrency

* David Beazley's PyCon talk Concurrency From The Ground Up
* https://www.youtube.com/watch?v=MCs5OvhV9S4
* Tornado library

Dealing with js rendered webpages

* PhantomJS (ghost.py)
* Selenium (a minimal sketch below)
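
For reference, a minimal sketch of fetching a JS-rendered page with PhantomJS via Selenium (assumes the phantomjs binary is on the PATH; later Selenium releases deprecate PhantomJS support):

from selenium import webdriver

driver = webdriver.PhantomJS()   # headless browser, no X server needed
try:
    driver.get('http://www2.tse.or.jp/tseHpFront/JJK020010Action.do?Show=Show')
    rendered_html = driver.page_source   # JavaScript has been executed at this point
finally:
    driver.quit()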