Practical use of the Jupyter notebook

Second motivation: learning Python by web scraping

Scraping data from the WHO


In [2]:
Image("img/init.png")


Out[2]:

Expected results


In [3]:
Image("img/target_result.png")


Out[3]:

Techniques used

  • Regular expressions
  • Pythonic / functional programming:
    • use lists (iterables):
      • avoid looping over indices whenever possible
      • list comprehensions
    • lambda expressions
    • essentially pipe and map (based on cytoolz); see the short sketch after this list
  • Python web scraping:
    • lxml (Python library)
    • XPath (to select web page content)
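
A minimal sketch of how cytoolz.pipe chains functions; the sample string and lambdas below are made up for illustration, not taken from the scraped data:

from cytoolz import pipe

# pipe(value, f, g, ...) computes g(f(value)), reading left to right
cleaned = pipe("\n 22 April 2016\tZika virus infection \n",
               lambda s: s.strip(),             # drop surrounding whitespace
               lambda s: s.replace("\t", " "))  # normalise the tab
print(cleaned)  # '22 April 2016 Zika virus infection'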

In [4]:
# FOR WEB SCRAPING
from lxml import html
import requests

# FOR FUNCTIONAL PROGRAMMING
import cytoolz  # pipe

# FOR DATA WRANGLING
import pandas as pd  # R-like dataframes
import re            # regular expressions

# TO INSERT IMAGES
from IPython.display import Image

Data wrangling in action


In [44]:
### Target URL
outbreakNewsURL = "http://www.who.int/csr/don/archive/disease/zika-virus-infection/en/"

# Download the page and parse it into an lxml element tree
page = requests.get(outbreakNewsURL)
tree = html.fromstring(page.content)

# '//li' selects every <li> element anywhere in the document
newsXPath = '//li'
zikaNews = tree.xpath(newsXPath)
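
Because '//li' matches every list item on the page (navigation menus included), the result needs further filtering below. A narrower XPath could do part of that work up front; the '/csr/don/' link prefix used here is an assumption about how the WHO archive structures its links, so treat this as an untested sketch:

# Sketch: keep only <li> items containing a link to an outbreak-news URL
newsOnlyXPath = '//li[a[contains(@href, "/csr/don/")]]'
zikaNewsOnly = tree.xpath(newsOnlyXPath)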

In [21]:
### Store the text content of each <li> node in a list
zikaNews_dirty = [p.text_content() for p in zikaNews]

In [22]:
# Print elements 1 to 19
zikaNews_dirty[1:20] # omitting the first element


Out[22]:
['\n        Navigation Alt+1\n      ',
 '\n        Content Alt+2\n      ',
 '\n      \n          Home\n        \n      ',
 '\n        \n          Health topics\n        \n      ',
 '\n        \n          Data\n        \n      ',
 '\n        \n          Media centre\n        \n      ',
 '\n        \n          Publications\n        \n      ',
 '\n        \n          Countries\n        \n      ',
 '\n        \n          Programmes\n        \n      ',
 '\n        \n          Governance\n        \n      ',
 '\n        \n          About WHO\n        \n      ',
 'Home\n',
 'Ebola outbreak\n',
 'Alert and response operations\n',
 'Diseases\n',
 'Biorisk reduction\n',
 '\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea\n',
 '\n21 April 2016\n\t\t\tZika virus infection – Peru\n',
 '\n20 April 2016\n\t\t\tZika virus infection – Saint Lucia\n']
The extracted content still contains a lot of noise

In [9]:
Image("img/flatten_tree_data.png")


Out[9]:

In [23]:
# Extract only the items containing the pattern "Zika virus infection "
#sample= '\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea - USA\n'
keywdEN = "Zika virus infection "
zikaNews_content = [s for s in zikaNews_dirty if re.search(keywdEN, s)]

In [24]:
zikaNews_content[0:10] # first 10 elements


Out[24]:
['\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea\n',
 '\n21 April 2016\n\t\t\tZika virus infection – Peru\n',
 '\n20 April 2016\n\t\t\tZika virus infection – Saint Lucia\n',
 '\n15 April 2016\n\t\t\tZika virus infection – Chile\n',
 '\n12 April 2016\n\t\t\tZika virus infection – Viet Nam\n',
 '\n29 March 2016\n\t\t\tZika virus infection – Dominica and Cuba\n',
 '\n7 March 2016\n\t\t\tZika virus infection – Argentina and France\n',
 '\n4 March 2016\n\t\t\tZika virus infection – Netherlands - Sint Maarten\n',
 '\n1 March 2016\n\t\t\tZika virus infection – Saint Vincent and the Grenadines\n',
 '\n29 February 2016\n\t\t\tZika virus infection – Trinidad and Tobago\n']
Use of lambda functions and piping

In [25]:
#### Use of lambdas (avoids defining verbose named functions with def)
substituteUnicodeDash = lambda s: re.sub(u'–', "@", s)  # replace the en dash with "@"
substituteNonUnicode = lambda s: re.sub(r"\s", " ", s)  # replace every whitespace character with a plain space
removeSpace = lambda s: s.strip()                       # strip leading/trailing whitespace
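
For comparison, the def-based equivalent of removeSpace looks like this (the function name below is just illustrative):

# def-based equivalent of the removeSpace lambda above
def remove_space(s):
    return s.strip()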

In [27]:
# Use pipe to chain the lambda functions within a list comprehension
### This should feel familiar to R users of dplyr's %>%
zikaNews_dirty = [cytoolz.pipe(s, 
                               removeSpace, 
                               substituteNonUnicode) 
                  for s in zikaNews_content]

In [28]:
# List comprehension
zikaNews_dirty = [s.split("Zika virus infection") for s in zikaNews_dirty ]

In [31]:
zikaNews_dirty[0:10]


Out[31]:
[['22 April 2016    ', ' – Papua New Guinea'],
 ['21 April 2016    ', ' – Peru'],
 ['20 April 2016    ', ' – Saint Lucia'],
 ['15 April 2016    ', ' – Chile'],
 ['12 April 2016    ', ' – Viet Nam'],
 ['29 March 2016    ', ' – Dominica and Cuba'],
 ['7 March 2016    ', ' – Argentina and France'],
 ['4 March 2016    ', ' – Netherlands - Sint Maarten'],
 ['1 March 2016    ', ' – Saint Vincent and the Grenadines'],
 ['29 February 2016    ', ' – Trinidad and Tobago']]
Further clean-up: using the Pandas library

This part makes extensive use of the Pandas library.


In [71]:
# Structure data into a Pandas dataframe
zika = pd.DataFrame(zikaNews_dirty, columns = ["Date","Locations"])

In [72]:
zika.head(n=20)


Out[72]:
Date Locations
0 22 April 2016 – Papua New Guinea
1 21 April 2016 – Peru
2 20 April 2016 – Saint Lucia
3 15 April 2016 – Chile
4 12 April 2016 – Viet Nam
5 29 March 2016 – Dominica and Cuba
6 7 March 2016 – Argentina and France
7 4 March 2016 – Netherlands - Sint Maarten
8 1 March 2016 – Saint Vincent and the Grenadines
9 29 February 2016 – Trinidad and Tobago
10 22 February 2016 – Netherlands - Bonaire and Aruba
11 12 February 2016 – United States of America
12 8 February 2016 – Maldives
13 8 February 2016 – Region of the Americas
14 29 January 2016 – United States of America - United States Vi...
15 27 January 2016 – Dominican Republic
16 21 January 2016 – France - Saint Martin and Guadeloupe
17 21 January 2016 – Haiti
18 20 January 2016 – Bolivia
19 20 January 2016 – Guyana, Barbados and Ecuador

In [73]:
### Remove the leading dash from zika["Locations"]
# Step 1: split into a list of tokens via str.split()
# Step 2: drop the first element (the dash) with list[1:]
# Step 3: rebuild the string with ' '.join(list[1:])


# Step 1: split into a list of tokens via str.split()
zika["Split_Locations"] = pd.Series(zika["Locations"].iloc[i].split() for i in range(len(zika)))
# Step 2: drop the first element (the dash) with list[1:]
zika["Split_Locations"] = pd.Series([s[1:] for s in zika["Split_Locations"]])
# Step 3: rebuild the string with ' '.join(list[1:])
zika["Split_Locations"] = pd.Series([" ".join(s) for s in zika["Split_Locations"]])
# Split country from territory on "-"
zika["Split_Locations"] = pd.Series([s.split("-") for s in zika["Split_Locations"]])
# Split the date into tokens
zika["Split_Date"] = pd.Series([s.split() for s in zika["Date"]])

In [74]:
# Show the first 10 rows using HEAD
zika.head(n=10)


Out[74]:
Date Locations Split_Locations Split_Date
0 22 April 2016 – Papua New Guinea [Papua New Guinea] [22, April, 2016]
1 21 April 2016 – Peru [Peru] [21, April, 2016]
2 20 April 2016 – Saint Lucia [Saint Lucia] [20, April, 2016]
3 15 April 2016 – Chile [Chile] [15, April, 2016]
4 12 April 2016 – Viet Nam [Viet Nam] [12, April, 2016]
5 29 March 2016 – Dominica and Cuba [Dominica and Cuba] [29, March, 2016]
6 7 March 2016 – Argentina and France [Argentina and France] [7, March, 2016]
7 4 March 2016 – Netherlands - Sint Maarten [Netherlands , Sint Maarten] [4, March, 2016]
8 1 March 2016 – Saint Vincent and the Grenadines [Saint Vincent and the Grenadines] [1, March, 2016]
9 29 February 2016 – Trinidad and Tobago [Trinidad and Tobago] [29, February, 2016]

In [75]:
### Extract Day / Month / Year from the Split_Date column; each row is of the form ['21', 'January', '2016']
zika["Day"] = pd.Series(zika["Split_Date"].iloc[i][0] for i in range(len(zika)))
zika["Month"] = pd.Series(zika["Split_Date"].iloc[i][1] for i in range(len(zika)))
zika["Year"] = pd.Series(zika["Split_Date"].iloc[i][2] for i in range(len(zika)))

In [76]:
# Show the first 10 rows using HEAD
zika.head(n=10)


Out[76]:
Date Locations Split_Locations Split_Date Day Month Year
0 22 April 2016 – Papua New Guinea [Papua New Guinea] [22, April, 2016] 22 April 2016
1 21 April 2016 – Peru [Peru] [21, April, 2016] 21 April 2016
2 20 April 2016 – Saint Lucia [Saint Lucia] [20, April, 2016] 20 April 2016
3 15 April 2016 – Chile [Chile] [15, April, 2016] 15 April 2016
4 12 April 2016 – Viet Nam [Viet Nam] [12, April, 2016] 12 April 2016
5 29 March 2016 – Dominica and Cuba [Dominica and Cuba] [29, March, 2016] 29 March 2016
6 7 March 2016 – Argentina and France [Argentina and France] [7, March, 2016] 7 March 2016
7 4 March 2016 – Netherlands - Sint Maarten [Netherlands , Sint Maarten] [4, March, 2016] 4 March 2016
8 1 March 2016 – Saint Vincent and the Grenadines [Saint Vincent and the Grenadines] [1, March, 2016] 1 March 2016
9 29 February 2016 – Trinidad and Tobago [Trinidad and Tobago] [29, February, 2016] 29 February 2016

In [77]:
# Extract Country (first element of Split_Locations) and Territory (last element)
zika["Country"] = pd.Series(zika["Split_Locations"].iloc[i][0] for i in range(len(zika)))
zika["Territory"] = pd.Series(zika["Split_Locations"].iloc[i][-1] for i in range(len(zika)))

In [78]:
# Show the first 20 rows using HEAD
zika[['Split_Locations','Country','Territory']].head(20)


Out[78]:
Split_Locations Country Territory
0 [Papua New Guinea] Papua New Guinea Papua New Guinea
1 [Peru] Peru Peru
2 [Saint Lucia] Saint Lucia Saint Lucia
3 [Chile] Chile Chile
4 [Viet Nam] Viet Nam Viet Nam
5 [Dominica and Cuba] Dominica and Cuba Dominica and Cuba
6 [Argentina and France] Argentina and France Argentina and France
7 [Netherlands , Sint Maarten] Netherlands Sint Maarten
8 [Saint Vincent and the Grenadines] Saint Vincent and the Grenadines Saint Vincent and the Grenadines
9 [Trinidad and Tobago] Trinidad and Tobago Trinidad and Tobago
10 [Netherlands , Bonaire and Aruba] Netherlands Bonaire and Aruba
11 [United States of America] United States of America United States of America
12 [Maldives] Maldives Maldives
13 [Region of the Americas] Region of the Americas Region of the Americas
14 [United States of America , United States Vir... United States of America United States Virgin Islands
15 [Dominican Republic] Dominican Republic Dominican Republic
16 [France , Saint Martin and Guadeloupe] France Saint Martin and Guadeloupe
17 [Haiti] Haiti Haiti
18 [Bolivia] Bolivia Bolivia
19 [Guyana, Barbados and Ecuador] Guyana, Barbados and Ecuador Guyana, Barbados and Ecuador
Last clean-up: remove the "Territory" when it is the same as the "Country" (e.g. Haiti / Haiti)

How to write ugly, terse code with Python

In [83]:
zika["Territory"] =pd.Series(zika["Territory"][i] 
                             if zika["Territory"][i] != zika["Country"][i]  
                             else " " for i in range(len(zika))
                            )
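
For readers who prefer something less terse, the same clean-up can be expressed with a boolean mask and .loc (a sketch against the same zika dataframe):

# Blank out Territory wherever it merely repeats Country
zika.loc[zika["Territory"] == zika["Country"], "Territory"] = " "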

In [84]:
# Show the first 20 rows using HEAD
zika[['Split_Locations','Country','Territory']].head(20)


Out[84]:
Split_Locations Country Territory
0 [Papua New Guinea] Papua New Guinea
1 [Peru] Peru
2 [Saint Lucia] Saint Lucia
3 [Chile] Chile
4 [Viet Nam] Viet Nam
5 [Dominica and Cuba] Dominica and Cuba
6 [Argentina and France] Argentina and France
7 [Netherlands , Sint Maarten] Netherlands Sint Maarten
8 [Saint Vincent and the Grenadines] Saint Vincent and the Grenadines
9 [Trinidad and Tobago] Trinidad and Tobago
10 [Netherlands , Bonaire and Aruba] Netherlands Bonaire and Aruba
11 [United States of America] United States of America
12 [Maldives] Maldives
13 [Region of the Americas] Region of the Americas
14 [United States of America , United States Vir... United States of America United States Virgin Islands
15 [Dominican Republic] Dominican Republic
16 [France , Saint Martin and Guadeloupe] France Saint Martin and Guadeloupe
17 [Haiti] Haiti
18 [Bolivia] Bolivia
19 [Guyana, Barbados and Ecuador] Guyana, Barbados and Ecuador