We will be using the Wikipedia API to download articles from the web. Eventually we will be performing NLP and machine learning on the articles to do awesome things like document clustering, document filtering/classification, network analysis, and some indexing. For this sprint we are just concerned with getting some data.
In [113]:
# import Python's standard library modules for regular expressions and json
import re
import json
# This is useful to fix malformed urls
import urlparse
# import the Image display module
from IPython.display import Image
# inline allows us to embed matplotlib figures directly into the IPython notebook
%pylab inline
Lucky for us the Wikipedia API is well documented. And you do not need an API key to access it (isn't that nice of them). Go ahead, give it a spin!
In [114]:
# import the Requests HTTP library
import requests
# A User-Agent header is required for the Wikipedia API.
headers = {'User-Agent': 'DataWrangling/1.1 (http://zipfianacademy.com; class@zipfianacademy.com)'}
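If you want to give the API a quick spin before settling on the parse action, a simple search query is an easy first request. This is only an illustrative sketch; the search term and parameters below are not part of the exercise:
In [ ]:
# a throwaway query just to see the API respond
r_test = requests.get('http://en.wikipedia.org/w/api.php',
                      params={'action': 'query', 'list': 'search',
                              'srsearch': 'Zipf\'s law', 'format': 'json'},
                      headers=headers)
print r_test.status_code
print [hit['title'] for hit in r_test.json()['query']['search'][:3]]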
In [115]:
# Experiment with fetching one or two pages and examining the result (fill in URL and payload)
url = 'http://en.wikipedia.org/w/api.php'
# parameters for the API request
payload = {'action':'parse', 'page':'Zipf\'s_law', 'prop':'links', 'format':'json'}
# make the request
r = requests.post(url, data=payload, headers=headers)
HINT: Check out the parse action of the API.
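It is worth peeking at what came back before going further. A minimal sketch, assuming the request above succeeded; the exact keys under 'parse' depend on which prop values you requested:
In [ ]:
# inspect the structure of the parse result
data = r.json()
print data['parse'].keys()
# each link entry is a small dict; the '*' key holds the linked article's title
print data['parse']['links'][:3]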
Now that you have some experience with the API and can successfully access articles with associated metadata, it is time to start storing them in MongoDB!
You should have a MongoDB daemon running on your vagrant machine. It is here that you will be storing all of your data, but be aware of how many articles you are crawling.
In [116]:
# import MongoDB modules
from pymongo import MongoClient
from bson.objectid import ObjectId
# connect to the hosted MongoDB instance
client = MongoClient('mongodb://localhost:27017/')
# connect to the wikipedia database: if it does not exist it will be created automatically -- one reason why MongoDB can be nice.
db = client.wikipedia
Each database has a number of collections, analogous to SQL tables. Each collection is comprised of documents, analogous to rows in a SQL table. And each document has fields, analogous to SQL columns. The MongoDB docs also have a more comprehensive comparison.
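To make the analogy concrete, here is a throwaway round trip on a scratch collection (the collection and field names are made up purely for illustration):
In [ ]:
# a collection, like the database, is created lazily on first insert
db.scratch.insert({'title': 'test article', 'length': 42})
# find_one returns a single matching document as a dict-like object (or None)
print db.scratch.find_one({'title': 'test article'})
# drop the scratch collection so it does not linger
db.scratch.drop()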
Now that you can store and retrieve articles in the Mongo Database, it is time to iterate!
Do not follow external links, only linked Wikipedia articles
HINT: The Zipf's Law article should be located at: 'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Zipf's%20law'
In [7]:
# grab the list of linked Wikipedia articles from the API result
links = [item['*'] for item in r.json()['parse']['links']]
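The link objects returned by the parse action also carry a namespace field, so if you want to skip Talk:, Category:, and other non-article pages you can filter on it. A sketch, assuming each item exposes an 'ns' key as described in the API documentation:
In [ ]:
# keep only main-namespace (ns == 0) links, i.e. ordinary Wikipedia articles
article_links = [item['*'] for item in r.json()['parse']['links'] if item.get('ns') == 0]
print len(links), 'total links,', len(article_links), 'article links'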
In [13]:
# iterate over each link and store the returned document in MongoDB
db.wikipedia.remove()
i=1
for link in links:
    if i % 10 == 0:
        print i
    i += 1
    url = 'http://en.wikipedia.org/w/api.php'
    # parameters for the API request
    payload = {'action': 'parse', 'page': link, 'format': 'json'}
    # make the request
    req = requests.post(url, data=payload, headers=headers, allow_redirects=True)
    db.wikipedia.insert(req.json())
In [14]:
db.wikipedia.count()
Out[14]:
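The count tells you how many documents made it in, but it is also worth pulling one back out to confirm the shape of what you stored. A minimal sketch; the exact keys depend on what the parse action returned for that page:
In [ ]:
# grab an arbitrary stored document and look at its structure
doc = db.wikipedia.find_one()
print doc['parse']['title']
print doc['parse'].keys()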
We will get some practice now with regular expressions in order to search the content of the articles for the terms Zipf or Zipfian. We only want articles that mention these terms in the displayed text, however, so we must first remove all the unnecessary HTML tags and only keep what is in between the relevant tags. Beautiful Soup makes this almost trivial. Explore the documentation to find how to do this effortlessly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Test out your Regular Expressions before you run them over every document you have in your database: http://pythex.org/. Here is some useful documentation on regular expressions in Python: http://docs.python.org/2/howto/regex.html
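For example, a quick local sanity check is to compile a pattern once and try it on a couple of toy strings before running it over the database; the pattern below is just one option that matches both 'Zipf' and 'Zipfian':
In [ ]:
# case-insensitive pattern; 'zipf' also matches inside 'Zipfian'
pattern = re.compile(r'zipf', re.IGNORECASE)
print bool(pattern.search('This article discusses the Zipfian distribution.'))  # True
print bool(pattern.search('Nothing to see here.'))                              # False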
Once you have identified the relevant articles, save them to a file for now; we do not need to persist them in the database (but you can if you want).
In [117]:
import re
In [118]:
# import the Beautiful Soup module
from bs4 import BeautifulSoup
p = re.compile('zipf',re.IGNORECASE)
zipf_titles = []
for cursor in db.wikipedia.find():
    html = cursor['parse']['text']['*']
    soup = BeautifulSoup(html)
    text = soup.getText()
    if p.search(text):
        zipf_titles.append(cursor['parse']['title'])
In [17]:
len(zipf_titles)
Out[17]:
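As suggested above, you can dump the matched titles to a file so they survive outside the database. A minimal sketch using the json module imported earlier; the filename is arbitrary:
In [ ]:
# persist the matched article titles to disk as a JSON list
with open('zipf_titles.json', 'w') as f:
    json.dump(zipf_titles, f)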
We want to augment our Zipfian Wikipedia articles with content from the WWW at large. Stepping out of the walled garden of collaboratively edited document safety... let us scrape! For each of the articles we found to contain 'Zipf' or 'Zipfian', we want to know what the web has to say. For each of the external links of said articles, fetch the linked webpage and extract the <title> and <meta name="keywords"> from the HTML. Beautiful Soup would probably help you a lot here.
You still have to watch out for pages without keywords or a title.
Once you have extracted this information, update the corresponding document in your database. Add a field called 'extraexternal' that contains the additional contextual information. 'extraexternal' should be an array of JSON objects, each of which has the keys 'url', 'title', and 'keywords':
Example:
{
    ...
    'displaytitle': "Zipf's law",
    'externallinks': [...],
    'text': {
        '*': '<table class="infobox bordered" style="width:325px; max-width:325px; font-size:95%; text-align: left;">\n<caption>Zipf\'s law</caption>\n<tr...'
    },
    'extraexternal': [{
        'url': 'http://zipfianacademy.com',
        'title': 'Teaching the Long Tail | Zipfian Academy',
        'keywords': 'data, datascience, science, bootcamp, training, hadoop, big, bigdata, boot, camp, machine...'
    }, ... ]
    ...
}
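Before looping over every external link of every matched article, it can help to test the title/keywords extraction on a single page. A sketch under the assumption that the page returns ordinary HTML; the URL here is just an example:
In [ ]:
# fetch one page and pull out its <title> and <meta name="keywords"> content
resp = requests.get('http://zipfianacademy.com', headers=headers, timeout=5)
page = BeautifulSoup(resp.content)
title_tag = page.find('title')
keywords_tag = page.find('meta', attrs={'name': 'keywords'})
print title_tag.get_text() if title_tag else None
print keywords_tag['content'] if keywords_tag else None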
In [119]:
# iterate over each article that contains 'Zipf' or 'Zipfian' and fetch its external links
i = 1
for title in zipf_titles[:5]:
    print i, ':', title
    links = db.wikipedia.find_one({'parse.title': title})['parse']['externallinks']
    print len(links), ' external links'
    extraexternal = []
    for link in links:
        # fix protocol-relative links (e.g. '//example.com')
        if 'http' not in link:
            link = 'http:' + link
        print link
        try:
            r_link = requests.get(link, timeout=1)
            if 'html' in r_link.headers['content-type']:
                html = r_link.content
                link_external = {}
                soup = BeautifulSoup(html)
                link_external['url'] = r_link.url
                if soup.find('title'):
                    link_external['title'] = soup.find('title').get_text()
                if soup.select('meta[name="keywords"]'):
                    link_external['keywords'] = soup.select('meta[name="keywords"]')[0].attrs['content']
                else:
                    link_external['keywords'] = None
                extraexternal.append(link_external)
        except:
            # skip links that time out or otherwise fail
            pass
    objectid = db.wikipedia.find_one({'parse.title': title})['_id']
    db.wikipedia.update({'_id': objectid}, {'$set': {'parse.extraexternal': extraexternal}})
    i += 1
In [124]:
cursor = db.wikipedia.find({'parse.title':'Wishart distribution'})
cursor[0]['parse']['extraexternal']
Out[124]:
In [ ]: