Overview -- Web Scraping

We will be using the Wikipedia API to download articles from the web. Eventually we will be performing NLP and machine learning on the articles to do awesome things like document clustering, document filtering/classification, network analysis, and some indexing. For this sprint we are just concerned with getting some data.

Goals

  • Get experience using an API to access Wikipedia articles
  • Store the retrieved articles and metadata in MongoDB
  • Use regular expressions in Python to search for all articles that contain the word 'Zipf' or 'Zipfian'
  • Augment the article content with contextual information from its external links
  • Have FUN! (that's an order)

Data Sources (ranked by ease of use... usually)

DaaS -- Data as a service

Bulk Downloads -- just like the good ol' days

APIs -- public and hidden

DIY

  • Web scraping
  • Manual downloads

If you have any other favorite data sources, please post them to Piazza!

Exercise: Wikipedia++


In [113]:
# import Python's standard library modules for regular expressions and json
import re
import json

# This is useful to fix malformed urls 
import urlparse

# import the Image display module
from IPython.display import Image

# inline allows us to embed matplotlib figures directly into the IPython notebook
%pylab inline


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['text', 'title']
`%pylab --no-import-all` prevents importing * from pylab and numpy

Step 1: Access the Wikipedia API

Lucky for us, the Wikipedia API is well documented, and you do not need an API key to access it (isn't that nice of them?). Go ahead, give it a spin!


In [114]:
# import the Requests HTTP library
import requests

# A User-Agent header is required by the Wikipedia API.
headers = {'User-Agent': 'DataWrangling/1.1 (http://zipfianacademy.com; class@zipfianacademy.com)'}

In [115]:
# Fetch a page and examine the result
url = 'http://en.wikipedia.org/w/api.php'

# parameters for the API request: parse the "Zipf's law" page and return its links as JSON
payload = {'action':'parse', 'page':'Zipf\'s_law', 'prop':'links', 'format':'json'}

# make the request
r = requests.post(url, data=payload, headers=headers)

HINT: Check out the parse action.
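
Once the request comes back, it is worth poking around in the JSON before going further. Here is a minimal sketch of that inspection (the exact keys under 'parse' depend on which prop values you asked for):

In [ ]:
# the interesting payload lives under the 'parse' key of the JSON response
result = r.json()
print result.keys()

# with prop=links, 'parse' contains the page 'title' and a list of 'links'
print result['parse'].keys()

# each entry in 'links' is a dict whose '*' key holds the linked article's title
for item in result['parse']['links'][:5]:
    print item['*']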

Step 2: Persistence in MongoDB

Now that you have some experience with the API and can successfully access articles with associated metadata, it is time to start storing them in MongoDB!

You should have a MongoDB daemon running on your Vagrant machine. This is where you will store all of your data, but be aware of how many articles you are crawling.

One article = ~120 kilobytes. 500MB / 120KB ≈ 4,250 articles.
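
If you want to sanity-check that estimate, the arithmetic is a one-liner (120 KB is just the rough average assumed above):

In [ ]:
# rough capacity check: how many ~120 KB articles fit in 500 MB?
print 500 * 1024 / 120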


In [116]:
# import MongoDB modules
from pymongo import MongoClient
from bson.objectid import ObjectId

# connect to the hosted MongoDB instance
client = MongoClient('mongodb://localhost:27017/')

# connect to the wikipedia database: if it does not exist, it will be created automatically -- one reason why MongoDB can be nice.
db = client.wikipedia

Each database has a number of collections, analogous to SQL tables. Each collection is made up of documents, analogous to rows in a SQL table. And each document has fields, analogous to SQL columns. The MongoDB docs also have a more comprehensive comparison.
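
To make the analogy concrete, here is a minimal round trip through a throwaway collection (the collection name 'scratch' and the document fields are just placeholders):

In [ ]:
# a collection named 'scratch' inside the wikipedia database, created lazily on first insert
scratch = db.scratch

# a document is just a dict of fields; insert() returns the generated _id
doc_id = scratch.insert({'title': 'test article', 'length_kb': 120})

# fetch it back by _id, much like looking up a row by primary key in SQL
print scratch.find_one({'_id': doc_id})

# drop the throwaway collection when done
scratch.drop()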

Now that you can store and retrieve articles in the Mongo Database, it is time to iterate!

Step 3: Retrieve and store every article (with associated metadata) within 2 hops of the 'Zipf's law' article.

Do not follow external links; only follow linked Wikipedia articles.

HINT: The Zipf's Law article should be located at: 'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Zipf's%20law'
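
One way to think about this step is as a tiny breadth-first crawl: hop 1 is the set of articles linked from 'Zipf's law', and hop 2 is everything those articles link to. Below is a rough sketch of that control flow; it reuses the headers and db objects defined above, and the get_article helper is just an illustration. Be warned that the full two-hop neighborhood is large; the cells that follow only walk the first hop, which already yields a couple of hundred articles.

In [ ]:
def get_article(title):
    # fetch one article via the parse action and return the decoded JSON
    payload = {'action': 'parse', 'page': title, 'format': 'json'}
    resp = requests.post('http://en.wikipedia.org/w/api.php',
                         data=payload, headers=headers, allow_redirects=True)
    return resp.json()

seen = set()
frontier = ["Zipf's law"]

for hop in range(3):                      # levels 0, 1, 2: the seed plus two hops of links
    next_frontier = []
    for title in frontier:
        if title in seen:
            continue
        seen.add(title)
        article = get_article(title)
        if 'parse' not in article:        # pages that fail to parse come back without 'parse'
            continue
        db.wikipedia.insert(article)
        if hop < 2:                       # do not expand links beyond the second hop
            next_frontier.extend(item['*'] for item in article['parse'].get('links', []))
    frontier = next_frontier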


In [7]:
# grab the list of linked Wikipedia article titles from the API result
# (each entry of r.json()['parse']['links'] is a dict whose '*' key holds the title)

links = [item['*'] for item in r.json()['parse']['links']]

In [13]:
# clear the collection, then iterate over each link and store the returned document in MongoDB
db.wikipedia.remove()

for i, link in enumerate(links, 1):
    # print a progress counter every 10 articles
    if i % 10 == 0:
        print i

    url = 'http://en.wikipedia.org/w/api.php'

    # parameters for the API request: parse this linked article and return the result as JSON
    payload = {'action':'parse', 'page':link, 'format':'json'}

    # make the request, following redirects, and store the parsed JSON document
    req = requests.post(url, data=payload, headers=headers, allow_redirects=True)
    db.wikipedia.insert(req.json())


10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230

In [14]:
db.wikipedia.count()


Out[14]:
239

Step 4: Find all articles that mention 'Zipf' or 'Zipfian' (case insensitive)

We will now get some practice with regular expressions by searching the content of the articles for the terms 'Zipf' or 'Zipfian'. We only want articles that mention these terms in the displayed text, however, so we must first strip out the HTML tags and keep only the visible text. Beautiful Soup makes this almost trivial. Explore the documentation to find out how to do this effortlessly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Test out your regular expressions before you run them over every document in your database: http://pythex.org/. Here is some useful documentation on regular expressions in Python: http://docs.python.org/2/howto/regex.html

Once you have identified the relevant articles, save them to a file for now; we do not need to persist them in the database (but you can if you want).
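
For example, a quick sanity check of the pattern against a few made-up strings before sweeping the whole collection (the sample sentences below are invented):

In [ ]:
# 'zipf' with re.IGNORECASE also matches 'Zipfian', 'ZIPF', etc.
pattern = re.compile('zipf', re.IGNORECASE)

samples = ["The ZIPFIAN distribution shows up everywhere.",   # should match
           "Word frequencies follow Zipf's law.",             # should match
           "Nothing to see here."]                            # should not match

for s in samples:
    print bool(pattern.search(s)), '--', s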


In [117]:
import re

In [118]:
# import the Beautiful Soup module
from bs4 import BeautifulSoup

# case-insensitive pattern that matches both 'Zipf' and 'Zipfian'
p = re.compile('zipf', re.IGNORECASE)
zipf_titles = []

# strip the HTML from each stored article and search the visible text
for doc in db.wikipedia.find():

    html = doc['parse']['text']['*']

    soup = BeautifulSoup(html)
    text = soup.get_text()
    if p.search(text):
        zipf_titles.append(doc['parse']['title'])

In [17]:
len(zipf_titles)


Out[17]:
180
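
As suggested above, the matched titles can be dumped to a flat file for later use; a minimal sketch using the json module imported earlier (the filename is arbitrary):

In [ ]:
# write the matched article titles to disk as a JSON array
with open('zipf_titles.json', 'w') as f:
    json.dump(zipf_titles, f, indent=2)

# read them back to confirm the round trip
with open('zipf_titles.json') as f:
    print len(json.load(f)), 'titles saved'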

Step 5: Augmentation! Time to remix the web... or rather just Wikipedia. But hey, isn't Wikipedia the web?

We want to augment our Zipfian Wikipedia articles with content from the WWW at large. Stepping out of the walled garden of collaboratively edited document safety... let us scrape! For each of the articles we found to contain 'Zipf' or 'Zipfian', we want to know what the web has to say. For each of the external links of said articles, fetch the linked webpage and extract the <title> and <meta name="keywords"> from the HTML. Beautiful Soup will probably help you a lot here.

You still have to watch out for pages without keywords or a title.
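
One defensive way to pull those two pieces out of an arbitrary page is a small helper along these lines (the function name and the None defaults are just one possible convention):

In [ ]:
def extract_title_and_keywords(html):
    # return the <title> text and the keywords meta content, or None when either is missing
    soup = BeautifulSoup(html)

    title_tag = soup.find('title')
    title = title_tag.get_text() if title_tag else None

    meta_tag = soup.find('meta', attrs={'name': 'keywords'})
    keywords = meta_tag.get('content') if meta_tag else None

    return title, keywords

# e.g. extract_title_and_keywords(requests.get('http://zipfianacademy.com', timeout=1).content)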

Once you have extracted this information, update the stored document in your database. Add a field called 'extraexternal' that contains the additional contextual information. 'extraexternal' should be an array of JSON objects, each of which has the keys:

  • 'url' : the url of the page
  • 'title' : the title of the page
  • 'keywords' : the keywords from the meta tag

Example:

    {

     ...

     'displaytitle': "Zipf's law",
     'externallinks': [...],
     'text': {
                '*': '<table class="infobox bordered" style="width:325px; max-width:325px; font-size:95%; text-align: left;">\n<caption>Zipf\'s law</caption>\n<tr...'
              },
     'extraexternal' : [{
                          'url' : 'http://zipfianacademy.com',
                          'title' : 'Teaching the Long Tail | Zipfian Academy',
                          'keywords' : 'data, datascience, science, bootcamp, training, hadoop, big, bigdata, boot, camp, machine...'
                        }, ... ],
     ...
    }

In [119]:
# iterate over each article that contains 'Zipf' or 'Zipfian'
# (only the first 5 matched titles here -- drop the slice to process them all)

for i, title in enumerate(zipf_titles[:5], 1):
    print i, ':', title

    doc = db.wikipedia.find_one({'parse.title':title})
    links = doc['parse']['externallinks']
    print len(links), ' external links'

    extraexternal = []

    for link in links:
        # fix protocol-relative links such as '//example.com/...'
        if 'http' not in link:
            link = 'http:' + link

        print link

        try:
            r_link = requests.get(link, timeout=1)

            # skip anything that is not an HTML page
            if 'html' not in r_link.headers.get('content-type', ''):
                continue

            soup = BeautifulSoup(r_link.content)

            link_external = {'url': r_link.url}

            if soup.find('title'):
                link_external['title'] = soup.find('title').get_text()

            if soup.select('meta[name="keywords"]'):
                link_external['keywords'] = soup.select('meta[name="keywords"]')[0].attrs['content']
            else:
                link_external['keywords'] = None

            extraexternal.append(link_external)
        except Exception:
            # ignore timeouts, connection errors, and unparseable pages
            pass

    # store the contextual information on the article document
    db.wikipedia.update({'_id': doc['_id']}, {'$set': {'parse.extraexternal': extraexternal}})


1 : Wishart distribution
8  external links
http://dx.doi.org/10.1093%2Fbiomet%2F20A.1-2.32
http://www.zentralblatt-math.org/zmath/en/search/?format=complete&q=an:54.0565.02
http://www.jstor.org/stable/2331939
http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176325375
http://www.jstor.org/stable/2346290
http://dx.doi.org/10.1214%2Faop%2F1176990455
http://dx.doi.org/10.1007%2FBF01078179
http://www.jstor.org/pss/2283988
2 : Wrapped exponential distribution
2  external links
http://www.pstat.ucsb.edu/faculty/jammalam/html/Some%20Publications/2004_WrappedSkewFamilies_Comm..pdf
http://dx.doi.org/10.1081%2FSTA-200026570
3 : Von Mises–Fisher distribution
2  external links
http://dx.doi.org/10.1007%2Fs00180-011-0232-x
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.186.1887&rep=rep1&type=pdf
4 : Natural exponential family
0  external links
5 : List of probability distributions
0  external links

In [124]:
cursor = db.wikipedia.find({'parse.title':'Wishart distribution'})
cursor[0]['parse']['extraexternal']


Out[124]:
[{u'keywords': None,
  u'title': u'Sign In ',
  u'url': u'http://biomet.oxfordjournals.org/content/20A/1-2/32'},
 {u'keywords': None,
  u'title': u'zbMATH - the first resource for mathematics',
  u'url': u'http://zbmath.org/?format=complete&q=an:54.0565.02'},
 {u'keywords': None,
  u'title': u'Uhlig\n\t\t\t\t\t\t\t:\n\t\t\t\t\t\tOn Singular Wishart and Singular Multivariate Beta Distributions',
  u'url': u'http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176325375'},
 {u'keywords': None,
  u'title': u'Peddada\n\t\t\t\t\t\t\t,\n\t\t\t\t\t\tRichards\n\t\t\t\t\t\t\t:\n\t\t\t\t\t\tProof of a Conjecture of M. L. Eaton on the Characteristic Function of the Wishart Distribution',
  u'url': u'http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aop/1176990455'},
 {u'keywords': None,
  u'title': u'Invariant generalized functions in homogeneous domains - Springer',
  u'url': u'http://link.springer.com/article/10.1007%2FBF01078179'}]

Congratulations!

You have made it to the end (hopefully successfully). Now that you have your data and have contextualized it with information from the web, you can start performing some interesting analyses on it. Some ideas to get you started: