The Guardian API

In the beautiful_soup.ipynb notebook, I showed how BeautifulSoup can be used to parse messy HTML, to extract information, and to act as a rudimentary web crawler. I used The Guardian as an illustrative example of how this can be achieved. I chose The Guardian because they provide a REST API to their servers. With this it is possible to perform specific queries on their servers, and to receive current information from their servers in the format specified by their API guide (i.e. JSON).

In order to use their API, you will need to register for an API key. At the time of writing (Feb 1, 2017) this was an automated process that could be completed at

The API is documented here:

and Python bindings to their API are provided by The Guardian here

and these can easily be integrated into a web-crawler based on API calls, rather than being based on HTML parsing, etc.

We use four parameters in our queries here:

  1. section: the section of the newspaper that we are interested in querying. In this case I'm looking in the technology section.

  2. order-by: I have specified that the newest items should appear first in the list of results.

  3. api-key: I have left this as test (which works here), but for real deployment of such a spider a real API key should be specified

  4. page-size: The number of results to return.

In [2]:
from __future__ import print_function

import requests 
import json

Inspect all sections and search for technology-based sections

In [3]:
url = ''
req = requests.get(url)
src = req.text

In [11]:
sections = json.loads(src)['response']
print(sections.keys())

dict_keys(['status', 'userTier', 'total', 'results'])

In [13]:
print(json.dumps(sections['results'][0], indent=2, sort_keys=True))

  "apiUrl": "",
  "editions": [
      "apiUrl": "",
      "code": "default",
      "id": "about",
      "webTitle": "About",
      "webUrl": ""
  "id": "about",
  "webTitle": "About",
  "webUrl": ""

In [14]:
for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print(result['webTitle'], result['apiUrl'])


Manual query on whole API

In [23]:
# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': 'test', 
    'page-size': '100',
    'q': 'privacy%20AND%20data'
}

# Construct the URL
base_url = ''
url = '{}?{}'.format(
    base_url,
    '&'.join(['{}={}'.format(kk, vv) for kk, vv in args.items()])
)

# Make the request and extract the source
req = requests.get(url) 
src = req.text
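
Joining `key=value` pairs by hand works, but it requires pre-escaped values (note the `%20`s in the query above). The standard library's `urllib.parse.urlencode` builds the same query string and handles the escaping itself. A minimal sketch, reusing the arguments above with an unescaped query value:

```python
from urllib.parse import urlencode, quote

# Same arguments as above, but the query value is left unescaped
args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
    'q': 'privacy AND data',
}

# urlencode escapes each value: spaces become '+' by default,
# or '%20' if quote is passed as the quoting function
print(urlencode(args))
print(urlencode(args, quote_via=quote))
```

The second form matches the `%20` style used in the manual URL above.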

In [24]:
print('Number of bytes received:', len(src))

Number of bytes received: 59513

The API returns JSON, so we parse it using the built-in json library. The API specifies that all data are returned within the response key, even on failure. Therefore, I have immediately descended to the response field.

Parsing the JSON

In [25]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))

The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']
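
The pages, currentPage and pageSize fields make it possible to walk through all matching results, not just the first page. A sketch of the page arithmetic, using a hypothetical helper (to fetch further pages you would add a 'page' entry to the args dict):

```python
import math

def iter_page_numbers(total, page_size):
    """Yield the 1-based page numbers needed to cover `total` results."""
    pages = math.ceil(total / page_size)
    for page in range(1, pages + 1):
        yield page

# e.g. 253 matching articles at 100 per page need pages 1, 2 and 3
print(list(iter_page_numbers(253, 100)))  # [1, 2, 3]
```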

Verifying the status code

It is important to verify that the status message is 'ok' before continuing: if it is not, no 'real' data will have been received.

In [26]:
assert response['status'] == 'ok'
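
A bare assert stops the notebook but tells you nothing about why the call failed. A small hedged sketch of a friendlier check (`check_response` is a hypothetical helper; the assumption that failures carry a `message` field is mine, not taken from the API guide):

```python
def check_response(response):
    """Raise a descriptive error if the API reports a failure.

    `response` is the parsed dict found under the 'response' key;
    failed calls are assumed to carry a 'message' field.
    """
    if response.get('status') != 'ok':
        raise RuntimeError(
            'API error: status={!r}, message={!r}'.format(
                response.get('status'), response.get('message')))
    return response

check_response({'status': 'ok'})  # passes silently
```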

Listing the results

The API standard states that the results will be found in the results field under the response field. Furthermore, the URLs will be found in the webUrl field, and the title will be found in the webTitle field.

First let's look at what a single result looks like in full, and then I will print a restricted set of parameters for the full set of results.

In [27]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))

  "apiUrl": "",
  "id": "technology/2020/feb/04/google-software-glitch-sent-some-users-videos-to-strangers",
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News",
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2020-02-04T12:56:37Z",
  "webTitle": "Google software glitch sent some users' videos to strangers",
  "webUrl": ""

In [46]:
for result in response['results']: 
    print(result['webUrl'][:70], result['webTitle'][:20])

Google software glit
Mike Pompeo restates
Facebook pays $550m
Boris Johnson gets f
Quick, cheap to make
...
US justice departmen

(output truncated: one line per article, with titles clipped to 20 characters)

Let's now request a specific piece of content from the API.

We select the ith result from the above response and get its apiUrl and id:

In [50]:
i = 0
api_url = response['results'][i]['apiUrl']
api_id = response['results'][i]['id']


We then use the id to construct a search URL string to request this piece of content from the API.

(Note that you need to include the api-key in the search; this is what I forgot in the lecture. You also need to specify whether you want to include data fields other than the article metadata, e.g. body and headline are included in the example below.)

In [53]:
base_url = ""
search_string = "ids=%s&api-key=test&show-fields=headline,body" %api_id

url = base_url + search_string

In [54]:
req = requests.get(url) 
src = req.text

In [57]:
response = json.loads(src)['response']
assert response['status'] == 'ok'

In [60]:
print(response['results'][0]['fields']['headline'])

Google software glitch sent some users' videos to strangers

In [62]:
body = response['results'][0]['fields']['body']
print(body)

<p>Google has said a software bug resulted in some users’ personal videos being emailed to strangers.</p> <p>The flaw affected users of Google Photos who requested to export their data in late November. For four days the export tool wrongly added videos to unrelated users’ archives.</p> <p>As a result, private videos may have been sent to strangers, while downloaded archives may not have been complete.</p> <p>“We are notifying people about a bug that may have affected users who used Google Takeout to export their Google Photos content between November 21 and November 25,” a Google spokesperson said.</p> <p>“These users may have received either an incomplete archive, or videos – not photos – that were not theirs. We fixed the underlying issue and have conducted an in-depth analysis to help prevent this from ever happening again. We are very sorry this happened.”</p> <p>The company emphasised that the bug only affected users of Google Takeout, a download-your-data tool, and not users of Google Photos more broadly. It did not give specific numbers on how many users were affected but said it was less than 0.01% of Google Photos users.</p> <p>Google said it had self-reported to the Irish data protection commissioner, who oversees the company in the EU and has the power to levy fines of up to 4% of annual turnover for breaches of the General Data Protection Regulation.</p> <p>Javvad Malik, a security awareness advocate at KnowBe4, praised Google for the relative speed of its response but added: “While the issue was limited to videos being incorrectly shared when downloading an archive, it is a data breach and impacted the privacy of users.</p> <p>“Many users trust cloud providers, especially for photos and videos which are automatically backed up from mobile devices. 
It is imperative that cloud providers maintain that trust through robust security measures that allow users to restore their data, while at the same time ensuring data is kept safe from accidental or malicious leaks.”</p> <p>Google Photos offers unlimited cloud storage of images, with the trade-off that Google has access to the pictures and can use them to train its machine learning models.</p> <p>This week the company started trialling a service in North America in which users can have 10 of their photos each month chosen algorithmically and printed and posted to them for $8 (£6.15) a month.</p> <p>On Monday Google revealed its <a href="">quarterly earnings</a> and disclosed for the first time how much revenue it makes from YouTube advertising. The video sharing site raised more than $1bn a month last year, Google said.</p>
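
The body field is raw HTML. The cells below strip tags with chained .replace calls, which only handles `<p>` and `</p>`; the standard library's html.parser can drop every tag. A minimal sketch (TextExtractor is a made-up name, and the input string is a shortened stand-in for the article body):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

extractor = TextExtractor()
extractor.feed('<p>Google has said a <a href="">software bug</a> occurred.</p>')
print(extractor.text())  # Google has said a software bug occurred.
```

Unlike the .replace approach, this also removes the `<a href="">` links embedded in the article body.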

We can now do some simple text processing on the article text, e.g. count the word frequencies:

In [94]:
words = body.replace('<p>','').replace('</p>','').split()
unique_words = list(set(words))
#count_dictionary = {word: count for word, count in zip(words, [words.count(w) for w in words])}
count_dictionary = {'word': unique_words, 'count': [words.count(w) for w in unique_words]}
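
The dict built above calls words.count once per unique word, which scans the whole list each time (quadratic in the number of words). The standard library's collections.Counter produces the same counts in a single pass. A small sketch with a made-up word list standing in for the article's words:

```python
from collections import Counter

# Hypothetical sample standing in for the article's word list
words = ['the', 'bug', 'the', 'users', 'the', 'users']

counts = Counter(words)  # one pass over the list

# most_common gives (word, count) pairs, highest count first
print(counts.most_common(2))  # [('the', 3), ('users', 2)]
```

A Counter is itself a dict, so it can be fed straight into pd.DataFrame if you want the same table as below.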


In [96]:
import pandas as pd

In [97]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)

word count
137 the 16
65 to 14
23 Google 14
5 of 11
33 a 10
... ... ...
1 again. 1
40 On 1
122 providers, 1
123 Malik, 1
120 numbers 1

240 rows × 2 columns

So we have a dataframe with word occurrence frequencies for the article.

But there is punctuation messing this up. For example, we see that again. appears once, as does providers,.

One option to fix this would be to strip out the punctuation using Python string manipulation. But you could also use regular expressions to remove the punctuation. Below is a hacky example, but you can probably find a better solution.

In [112]:
import re  ## imports the regular expression library
words_wo_punctuation = re.sub(r'[^\w\s]','',body.replace('<p>','').replace('</p>','')).split()

Note that the regex r'[^\w\s]' substitutes anything in body that is not a word character \w or a whitespace character \s with the empty string ''.

In [113]:
unique_words = list(set(words_wo_punctuation))
count_dictionary = {'word': unique_words, 'count': [words_wo_punctuation.count(w) for w in unique_words]}


In [117]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)

word count
135 the 16
62 to 14
20 Google 14
196 users 13
3 of 11
... ... ...
89 KnowBe4 1
90 result 1
91 4 1
92 resulted 1
224 001 1

225 rows × 2 columns

In [ ]: