In the beautiful_soup.ipynb notebook, I showed how BeautifulSoup can be used to parse messy HTML, to extract information, and to act as a rudimentary web crawler. I used The Guardian as an illustrative example of how this can be achieved. The reason for choosing The Guardian is that they provide a REST API to their servers. With this it is possible to perform specific queries on their servers, and to receive current information from their servers in the format described by their API guide (i.e. JSON):
http://open-platform.theguardian.com/
In order to use their API, you will need to register for an API key. At the time of writing (Feb 1, 2017) this was an automated process that could be completed at
http://open-platform.theguardian.com/access/
The API is documented here:
http://open-platform.theguardian.com/documentation/
and Python bindings to their API are provided by The Guardian here:
https://github.com/prabhath6/theguardian-api-python
and these can easily be integrated into a web crawler based on API calls rather than on HTML parsing.
We use four parameters in our queries here:
section
: the section of the newspaper that we are interested in querying. In this case I'm looking in the technology section.
order-by
: I have specified that the newest items should be closer to the front of the query list.
api-key
: I have left this as test (which works here), but for real deployment of such a spider a real API key should be specified.
page-size
: the number of results to return.
In [2]:
from __future__ import print_function
import requests
import json
In [3]:
url = 'https://content.guardianapis.com/sections?api-key=test'
req = requests.get(url)
src = req.text
In [12]:
json.loads(src)['response']['status']
Out[12]:
In [11]:
sections = json.loads(src)['response']
print(sections.keys())
In [13]:
print(json.dumps(sections['results'][0], indent=2, sort_keys=True))
In [14]:
for result in sections['results']:
    if 'tech' in result['id'].lower():
        print(result['webTitle'], result['apiUrl'])
In [23]:
# Specify the arguments
args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
    'q': 'privacy%20AND%20data'
}
# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url,
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)
# Make the request and extract the source
req = requests.get(url)
src = req.text
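As an aside, requests can build the query string itself through its params argument, which also takes care of URL-encoding; the q value can then be written as plain text instead of being pre-encoded with %20. A minimal sketch equivalent to the manual construction above:
In [ ]:
# Sketch: let requests encode the query string (equivalent to the URL built above)
params = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',
    'page-size': '100',
    'q': 'privacy AND data'  # requests URL-encodes the spaces for us
}
req_alt = requests.get('https://content.guardianapis.com/search', params=params)
print(req_alt.url)  # inspect the URL that requests actually constructed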
In [24]:
print('Number of bytes received:', len(src))
The API returns JSON, so we parse this using the in-built JSON library. The API specifies that all data are returned within the response key, even under failure. Therefore, I have immediately descended to the response field.
In [25]:
response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))
In [26]:
assert response['status'] == 'ok'
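Since the aim is a crawler built around API calls, it is worth noting that search results are paginated. Assuming the response carries a pages field and the endpoint accepts a page parameter (an assumption here, based on the API documentation rather than anything printed above), a sketch of walking through several pages of a query might look like this:
In [ ]:
import time

# Hedged sketch: an API-based crawl over the pages of a query.
# Assumes the response contains a 'pages' field and the endpoint accepts
# a 'page' parameter; reuses base_url and args from the cells above.
all_results = []
page, total_pages = 1, 1
while page <= total_pages:
    page_url = '{}?{}&page={}'.format(
        base_url,
        '&'.join('{}={}'.format(kk, vv) for kk, vv in args.items()),
        page
    )
    page_response = json.loads(requests.get(page_url).text)['response']
    total_pages = min(page_response.get('pages', 1), 5)  # cap the sketch at 5 pages
    all_results.extend(page_response['results'])
    page += 1
    time.sleep(1)  # be polite to the server
print(len(all_results), 'results collected')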
The API standard states that the results will be found in the results field under the response field. Furthermore, the URLs will be found in the webUrl field, and the title will be found in the webTitle field.
First let's look to see what a single result looks like in full, and then I will print a restricted set of parameters for the full set of results.
In [27]:
print(json.dumps(response['results'][0], indent=2, sort_keys=True))
In [46]:
for result in response['results']:
print(result['webUrl'][:70], result['webTitle'][:20])
Let's now request a specific piece of content from the API. We select the ith result from the above response and get its apiUrl and id:
In [50]:
i = 0
api_url = response['results'][i]['apiUrl']
api_id = response['results'][i]['id']
print(api_url)
print(api_id)
We then use the id to construct a search URL string to request this piece of content from the API. (Note that you need to include the api-key in the search; this is what I forgot in the lecture. You also need to specify if you want to include data fields other than the article metadata, e.g. body and headline are included in the example below.)
In [53]:
base_url = "https://content.guardianapis.com/search?"
search_string = "ids=%s&api-key=test&show-fields=headline,body" %api_id
url = base_url + search_string
print(url)
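The URL above goes through the search endpoint with an ids filter. As an aside, the apiUrl attached to each result looks like a per-item endpoint, so the same fields can presumably be requested from it directly; this is an assumption based on the documentation linked above rather than something shown here:
In [ ]:
# Hedged sketch: query the per-item endpoint given by apiUrl directly,
# assuming it accepts the same api-key and show-fields parameters.
item_req = requests.get(api_url, params={'api-key': 'test',
                                         'show-fields': 'headline,body'})
item_response = json.loads(item_req.text)['response']
print(sorted(item_response.keys()))  # inspect what this endpoint returns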
In [54]:
req = requests.get(url)
src = req.text
In [57]:
response = json.loads(src)['response']
assert response['status'] == 'ok'
In [60]:
print(response['results'][0]['fields']['headline'])
In [62]:
body = response['results'][0]['fields']['body']
print(body)
We can now do some simple text processing on the article text, e.g. counting the word frequencies:
In [94]:
words = body.replace('<p>','').replace('</p>','').split()
print(len(words))
unique_words = list(set(words))
print(len(unique_words))
#count_dictionary = {word: count for word, count in zip(words, [words.count(w) for w in words])}
count_dictionary = {'word': unique_words, 'count': [words.count(w) for w in unique_words]}
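As an alternative to the list.count approach above, which rescans the word list once per unique word, the standard library's collections.Counter builds the same counts in a single pass; a short sketch:
In [ ]:
from collections import Counter

# Alternative sketch: count word frequencies in one pass over the list
word_counts = Counter(words)
print(word_counts.most_common(10))  # the ten most frequent words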
In [96]:
import pandas as pd
In [97]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
Out[97]:
So we have a dataframe with word occurrence frequencies in the article. But punctuation is messing this up: for example, we see that "again." appears once, as does "providers,".
One option to fix this would be to strip out the punctuation using Python string manipulation (a sketch of that option follows the regex explanation below). But you could also use regular expressions to remove the punctuation. Below is a hacky example, but you can probably find a better solution.
In [112]:
import re  # imports the regular expression library
words_wo_punctuation = re.sub(r'[^\w\s]', '', body.replace('<p>', '').replace('</p>', '')).split()
Note that the regex r'[^\w\s]' matches anything in body that is not a word character (\w) or a whitespace character (\s), and re.sub replaces each match with the empty string ''.
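For comparison, here is a sketch of the plain string-manipulation option mentioned earlier, using str.translate with the punctuation characters listed in the string module (this assumes Python 3, and string.punctuation only covers ASCII punctuation, so the result may differ slightly from the regex version):
In [ ]:
import string

# Sketch: strip punctuation with str.translate instead of a regex (Python 3)
stripped = body.replace('<p>', '').replace('</p>', '')
stripped = stripped.translate(str.maketrans('', '', string.punctuation))
print(len(set(stripped.split())))  # number of unique words, for comparison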
In [113]:
unique_words = list(set(words_wo_punctuation))
print(len(unique_words))
count_dictionary = {'word': unique_words, 'count': [words_wo_punctuation.count(w) for w in unique_words]}
In [117]:
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
Out[117]:
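Pandas can also do the counting directly: converting the word list to a Series and calling value_counts gives the same table without building the dictionary by hand; a short sketch:
In [ ]:
# Sketch: let pandas count the words directly
word_series = pd.Series(words_wo_punctuation)
word_series.value_counts().head(10)  # the ten most frequent words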