In the beautiful_soup.ipynb notebook, I showed how BeautifulSoup can be used to parse messy HTML, to extract information, and to act as a rudimentary web crawler. I used The Guardian as an illustrative example of how this can be achieved. The reason for choosing The Guardian is that they provide a REST API to their servers. With this it is possible to perform specific queries on their servers, and to receive current information from their servers in the format described by their API guide (i.e. JSON):
http://open-platform.theguardian.com/
In order to use their API, you will need to register for an API key. At the time of writing (Feb 1, 2017) this was an automated process that could be completed at
http://open-platform.theguardian.com/access/
The API is documented here:
http://open-platform.theguardian.com/documentation/
and Python bindings to their API are provided by The Guardian here:
https://github.com/prabhath6/theguardian-api-python
and these can easily be integrated into a web crawler based on API calls rather than on HTML parsing.
We use four parameters in our queries here:
section
: the section of the newspaper that we are interested in querying. In this case I'm looking in the technology section.
order-by
: I have specified that the newest items should be closer to the front of the query list.
api-key
: I have left this as test (which works here), but for real deployment of such a spider a real API key should be specified.
page-size
: the number of results to return.
In [2]:
from __future__ import print_function
import requests
import json
In [3]:
url = 'https://content.guardianapis.com/sections?api-key=test'
req = requests.get(url)
src = req.text
In [12]:
json.loads(src)['response']['status']
Out[12]:
In [11]:
sections = json.loads(src)['response']
print(sections.keys())
In [13]:
print(json.dumps(sections['results'][0], indent=2, sort_keys=True))
In [14]:
for result in sections['results']:
    if 'tech' in result['id'].lower():
        print(result['webTitle'], result['apiUrl'])
In [23]:
# Specify the arguments
args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
    'q': 'privacy%20AND%20data'
}
# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url,
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)
# Make the request and extract the source
req = requests.get(url)
src = req.text
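As an aside, requests can build the query string itself through its params argument, which also takes care of URL-encoding; the q value can then be written as plain text instead of being pre-encoded with %20. A minimal sketch equivalent to the manual construction above:
In [ ]:
# Sketch: let requests encode the query string (equivalent to the URL built above)
params = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
    'q': 'privacy AND data'  # requests URL-encodes the spaces for us
}
req_alt = requests.get('https://content.guardianapis.com/search', params=params)
print(req_alt.url)  # inspect the URL that requests actually constructed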
In [24]:
print('Number of bytes received:', len(src))
The API returns JSON, so we parse this using the in-built JSON library. The API specifies that all data are returned within the response key, even under failure. Therefore, I have immediately descended to the response field.
In [25]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))
In [26]:
assert response['status'] == 'ok'
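Since the aim is a crawler built around API calls, it is worth noting that search results are paginated. Assuming the response carries a pages field and the endpoint accepts a page parameter (an assumption here, based on the API documentation rather than anything printed above), a sketch of walking through several pages of a query might look like this:
In [ ]:
import time

# Hedged sketch: an API-based crawl over the pages of a query.
# Assumes the response contains a 'pages' field and the endpoint accepts
# a 'page' parameter; reuses base_url and args from the cells above.
all_results = []
page, total_pages = 1, 1
while page <= total_pages:
    page_url = '{}?{}&page={}'.format(
        base_url,
        '&'.join('{}={}'.format(kk, vv) for kk, vv in args.items()),
        page
    )
    page_response = json.loads(requests.get(page_url).text)['response']
    total_pages = min(page_response.get('pages', 1), 5)  # cap the sketch at 5 pages
    all_results.extend(page_response['results'])
    page += 1
    time.sleep(1)  # be polite to the server
print(len(all_results), 'results collected')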
The API standard states that the results will be found in the results field under the response field. Furthermore, the URLs will be found in the webUrl field, and the title will be found in the webTitle field.
First let's look to see what a single result looks like in full, and then I will print a restricted set of parameters for the full set of results.
In [27]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))
In [46]:
for result in response['results']:
print(result['webUrl'][:70], result['webTitle'][:20])
Let's now request a specific piece of content from the API. We select the ith result from the above response and get its apiUrl and id:
In [50]:
i = 0
api_url = response['results'][i]['apiUrl']
api_id = response['results'][i]['id']
print(api_url)
print(api_id)
We then use the id to construct a search URL string to request this piece of content from the API. (Note that you need to include the api-key in the search; this is what I forgot in the lecture. You also need to specify if you want to include data fields other than the article metadata, e.g. body and headline are included in the example below.)
In [53]:
base_url = "https://content.guardianapis.com/search?"
search_string = "ids=%s&api-key=test&show-fields=headline,body" %api_id
url = base_url + search_string
print(url)
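The URL above goes through the search endpoint with an ids filter. As an aside, the apiUrl attached to each result looks like a per-item endpoint, so the same fields can presumably be requested from it directly; this is an assumption based on the documentation linked above rather than something shown here:
In [ ]:
# Hedged sketch: query the per-item endpoint given by apiUrl directly,
# assuming it accepts the same api-key and show-fields parameters.
item_req = requests.get(api_url, params={'api-key': 'test',
                                         'show-fields': 'headline,body'})
item_response = json.loads(item_req.text)['response']
print(sorted(item_response.keys()))  # inspect what this endpoint returns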
In [54]:
req = requests.get(url)
src = req.text
In [57]:
response = json.loads(src)['response']
assert response['status'] == 'ok'
In [60]:
print(response['results'][0]['fields']['headline'])
In [62]:
body = response['results'][0]['fields']['body']
print(body)
We can now do some simple text processing on the article text, e.g. counting the word frequencies:
In [94]:
words = body.replace('<p>','').replace('</p>','').split()
print(len(words))
unique_words = list(set(words))
print(len(unique_words))
#count_dictionary = {word: count for word, count in zip(words, [words.count(w) for w in words])}
count_dictionary = {'word': unique_words, 'count': [words.count(w) for w in unique_words]}
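As an alternative to the list.count approach above, which rescans the word list once per unique word, the standard library's collections.Counter builds the same counts in a single pass; a short sketch:
In [ ]:
from collections import Counter

# Alternative sketch: count word frequencies in one pass over the list
word_counts = Counter(words)
print(word_counts.most_common(10))  # the ten most frequent words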
In [96]:
import pandas as pd
In [97]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
Out[97]:
So we have a dataframe with word occurrence frequencies in the article. But punctuation is messing this up: for example, we see that "again." appears once, as does "providers,".
One option to fix this would be to strip out the punctuation using Python string manipulation (a sketch of that option follows the regex explanation below). But you could also use regular expressions to remove the punctuation. Below is a hacky example, but you can probably find a better solution.
In [112]:
import re  # imports the regular expression library
words_wo_punctuation = re.sub(r'[^\w\s]', '', body.replace('<p>', '').replace('</p>', '')).split()
Note that the regex r'[^\w\s]' matches anything in body that is not a word character (\w) or a whitespace character (\s), and re.sub replaces each match with the empty string ''.
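For comparison, here is a sketch of the plain string-manipulation option mentioned earlier, using str.translate with the punctuation characters listed in the string module (this assumes Python 3, and string.punctuation only covers ASCII punctuation, so the result may differ slightly from the regex version):
In [ ]:
import string

# Sketch: strip punctuation with str.translate instead of a regex (Python 3)
stripped = body.replace('<p>', '').replace('</p>', '')
stripped = stripped.translate(str.maketrans('', '', string.punctuation))
print(len(set(stripped.split())))  # number of unique words, for comparison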
In [113]:
unique_words = list(set(words_wo_punctuation))
print(len(unique_words))
count_dictionary = {'word': unique_words, 'count': [words_wo_punctuation.count(w) for w in unique_words]}
In [117]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
Out[117]:
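Pandas can also do the counting directly: converting the word list to a Series and calling value_counts gives the same table without building the dictionary by hand; a short sketch:
In [ ]:
# Sketch: let pandas count the words directly
word_series = pd.Series(words_wo_punctuation)
word_series.value_counts().head(10)  # the ten most frequent words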