Practical use of the Jupyter notebook

Second motivation: learning Python by web scraping

Scraping data from the WHO


In [2]:
Image("img/init.png")


Out[2]:

Expected results


In [3]:
Image("img/target_result.png")


Out[3]:

Techniques used

  • Regular expressions
  • Pythonic / functional programming:
    • use lists (iterables):
      • avoid looping over indices whenever possible
      • list comprehensions
    • lambda expressions
    • essentially pipe and map (based on cytoolz); see the short sketch after this list
  • Python web scraping:
    • lxml (Python library)
    • XPath (to select web page content)
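
A minimal sketch of how cytoolz.pipe chains functions; the sample string and lambdas below are made up for illustration, not taken from the scraped data:

from cytoolz import pipe

# pipe(value, f, g, ...) computes g(f(value)), reading left to right
cleaned = pipe("\n 22 April 2016\tZika virus infection \n",
               lambda s: s.strip(),             # drop surrounding whitespace
               lambda s: s.replace("\t", " "))  # normalise the tab
print(cleaned)  # '22 April 2016 Zika virus infection'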

In [4]:
# FOR WEB SCRAPING
from lxml import html
import requests

# FOR FUNCTIONAL PROGRAMMING
import cytoolz  # pipe

# FOR DATA WRANGLING
import pandas as pd  # R-like dataframes
import re            # regular expressions

# TO INSERT IMAGES
from IPython.display import Image

Data wrangling in action


In [44]:
### Target URL
outbreakNewsURL = "http://www.who.int/csr/don/archive/disease/zika-virus-infection/en/"

# Download the page and parse it into an lxml element tree
page = requests.get(outbreakNewsURL)
tree = html.fromstring(page.content)

# '//li' selects every <li> element anywhere in the document
newsXPath = '//li'
zikaNews = tree.xpath(newsXPath)
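
Because '//li' matches every list item on the page (navigation menus included), the result needs further filtering below. A narrower XPath could do part of that work up front; the '/csr/don/' link prefix used here is an assumption about how the WHO archive structures its links, so treat this as an untested sketch:

# Sketch: keep only <li> items containing a link to an outbreak-news URL
newsOnlyXPath = '//li[a[contains(@href, "/csr/don/")]]'
zikaNewsOnly = tree.xpath(newsOnlyXPath)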

In [21]:
### Store the text content of each <li> node in a list
zikaNews_dirty = [p.text_content() for p in zikaNews]

In [22]:
# Print elements 1 to 19
zikaNews_dirty[1:20] # omitting the first element


Out[22]:
['\n        Navigation Alt+1\n      ',
 '\n        Content Alt+2\n      ',
 '\n      \n          Home\n        \n      ',
 '\n        \n          Health topics\n        \n      ',
 '\n        \n          Data\n        \n      ',
 '\n        \n          Media centre\n        \n      ',
 '\n        \n          Publications\n        \n      ',
 '\n        \n          Countries\n        \n      ',
 '\n        \n          Programmes\n        \n      ',
 '\n        \n          Governance\n        \n      ',
 '\n        \n          About WHO\n        \n      ',
 'Home\n',
 'Ebola outbreak\n',
 'Alert and response operations\n',
 'Diseases\n',
 'Biorisk reduction\n',
 '\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea\n',
 '\n21 April 2016\n\t\t\tZika virus infection – Peru\n',
 '\n20 April 2016\n\t\t\tZika virus infection – Saint Lucia\n']
The extracted content still contains a lot of noise

In [9]:
Image("img/flatten_tree_data.png")


Out[9]:

In [23]:
# Extract only the items containing the pattern "Zika virus infection "
#sample= '\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea - USA\n'
keywdEN = "Zika virus infection "
zikaNews_content = [s for s in zikaNews_dirty if re.search(keywdEN, s)]

In [24]:
zikaNews_content[0:10] # first 10 elements


Out[24]:
['\n22 April 2016\n\t\t\tZika virus infection – Papua New Guinea\n',
 '\n21 April 2016\n\t\t\tZika virus infection – Peru\n',
 '\n20 April 2016\n\t\t\tZika virus infection – Saint Lucia\n',
 '\n15 April 2016\n\t\t\tZika virus infection – Chile\n',
 '\n12 April 2016\n\t\t\tZika virus infection – Viet Nam\n',
 '\n29 March 2016\n\t\t\tZika virus infection – Dominica and Cuba\n',
 '\n7 March 2016\n\t\t\tZika virus infection – Argentina and France\n',
 '\n4 March 2016\n\t\t\tZika virus infection – Netherlands - Sint Maarten\n',
 '\n1 March 2016\n\t\t\tZika virus infection – Saint Vincent and the Grenadines\n',
 '\n29 February 2016\n\t\t\tZika virus infection – Trinidad and Tobago\n']
Use of lambda functions and piping

In [25]:
#### Use of lambdas (avoids defining verbose named functions with def)
substituteUnicodeDash = lambda s: re.sub(u'–', "@", s)  # replace the en dash with "@"
substituteNonUnicode = lambda s: re.sub(r"\s", " ", s)  # replace every whitespace character with a plain space
removeSpace = lambda s: s.strip()                       # strip leading/trailing whitespace
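
For comparison, the def-based equivalent of removeSpace looks like this (the function name below is just illustrative):

# def-based equivalent of the removeSpace lambda above
def remove_space(s):
    return s.strip()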

In [27]:
# Use pipe to chain the lambda functions within a list comprehension
### This should feel familiar to R users of dplyr's %>%
zikaNews_dirty = [cytoolz.pipe(s, 
                               removeSpace, 
                               substituteNonUnicode) 
                  for s in zikaNews_content]

In [28]:
# List comprehension
zikaNews_dirty = [s.split("Zika virus infection") for s in zikaNews_dirty ]

In [31]:
zikaNews_dirty[0:10]


Out[31]:
[['22 April 2016    ', ' – Papua New Guinea'],
 ['21 April 2016    ', ' – Peru'],
 ['20 April 2016    ', ' – Saint Lucia'],
 ['15 April 2016    ', ' – Chile'],
 ['12 April 2016    ', ' – Viet Nam'],
 ['29 March 2016    ', ' – Dominica and Cuba'],
 ['7 March 2016    ', ' – Argentina and France'],
 ['4 March 2016    ', ' – Netherlands - Sint Maarten'],
 ['1 March 2016    ', ' – Saint Vincent and the Grenadines'],
 ['29 February 2016    ', ' – Trinidad and Tobago']]
Further clean-up: using the Pandas library

This part makes extensive use of the Pandas library.


In [71]:
# Structure data into a Pandas dataframe
zika = pd.DataFrame(zikaNews_dirty, columns = ["Date","Locations"])

In [72]:
zika.head(n=20)


Out[72]:
Date Locations
0 22 April 2016 – Papua New Guinea
1 21 April 2016 – Peru
2 20 April 2016 – Saint Lucia
3 15 April 2016 – Chile
4 12 April 2016 – Viet Nam
5 29 March 2016 – Dominica and Cuba
6 7 March 2016 – Argentina and France
7 4 March 2016 – Netherlands - Sint Maarten
8 1 March 2016 – Saint Vincent and the Grenadines
9 29 February 2016 – Trinidad and Tobago
10 22 February 2016 – Netherlands - Bonaire and Aruba
11 12 February 2016 – United States of America
12 8 February 2016 – Maldives
13 8 February 2016 – Region of the Americas
14 29 January 2016 – United States of America - United States Vi...
15 27 January 2016 – Dominican Republic
16 21 January 2016 – France - Saint Martin and Guadeloupe
17 21 January 2016 – Haiti
18 20 January 2016 – Bolivia
19 20 January 2016 – Guyana, Barbados and Ecuador

In [73]:
### Remove the leading dash from zika["Locations"]
# Step 1: split into a list of tokens via str.split()
# Step 2: drop the first element (the dash) with list[1:]
# Step 3: rebuild the string with ' '.join(list[1:])


# Step 1: split into a list of tokens via str.split()
zika["Split_Locations"] = pd.Series(zika["Locations"].iloc[i].split() for i in range(len(zika)))
# Step 2: drop the first element (the dash) with list[1:]
zika["Split_Locations"] = pd.Series([s[1:] for s in zika["Split_Locations"]])
# Step 3: rebuild the string with ' '.join(list[1:])
zika["Split_Locations"] = pd.Series([" ".join(s) for s in zika["Split_Locations"]])
# Split country from territory on "-"
zika["Split_Locations"] = pd.Series([s.split("-") for s in zika["Split_Locations"]])
# Split the date into tokens
zika["Split_Date"] = pd.Series([s.split() for s in zika["Date"]])

In [74]:
# Show the first 10 rows using HEAD
zika.head(n=10)


Out[74]:
Date Locations Split_Locations Split_Date
0 22 April 2016 – Papua New Guinea [Papua New Guinea] [22, April, 2016]
1 21 April 2016 – Peru [Peru] [21, April, 2016]
2 20 April 2016 – Saint Lucia [Saint Lucia] [20, April, 2016]
3 15 April 2016 – Chile [Chile] [15, April, 2016]
4 12 April 2016 – Viet Nam [Viet Nam] [12, April, 2016]
5 29 March 2016 – Dominica and Cuba [Dominica and Cuba] [29, March, 2016]
6 7 March 2016 – Argentina and France [Argentina and France] [7, March, 2016]
7 4 March 2016 – Netherlands - Sint Maarten [Netherlands , Sint Maarten] [4, March, 2016]
8 1 March 2016 – Saint Vincent and the Grenadines [Saint Vincent and the Grenadines] [1, March, 2016]
9 29 February 2016 – Trinidad and Tobago [Trinidad and Tobago] [29, February, 2016]

In [75]:
### Extract Day / Month / Year from the Split_Date column; each row is of the form ['21', 'January', '2016']
zika["Day"] = pd.Series(zika["Split_Date"].iloc[i][0] for i in range(len(zika)))
zika["Month"] = pd.Series(zika["Split_Date"].iloc[i][1] for i in range(len(zika)))
zika["Year"] = pd.Series(zika["Split_Date"].iloc[i][2] for i in range(len(zika)))

In [76]:
# Show the first 10 rows using HEAD
zika.head(n=10)


Out[76]:
Date Locations Split_Locations Split_Date Day Month Year
0 22 April 2016 – Papua New Guinea [Papua New Guinea] [22, April, 2016] 22 April 2016
1 21 April 2016 – Peru [Peru] [21, April, 2016] 21 April 2016
2 20 April 2016 – Saint Lucia [Saint Lucia] [20, April, 2016] 20 April 2016
3 15 April 2016 – Chile [Chile] [15, April, 2016] 15 April 2016
4 12 April 2016 – Viet Nam [Viet Nam] [12, April, 2016] 12 April 2016
5 29 March 2016 – Dominica and Cuba [Dominica and Cuba] [29, March, 2016] 29 March 2016
6 7 March 2016 – Argentina and France [Argentina and France] [7, March, 2016] 7 March 2016
7 4 March 2016 – Netherlands - Sint Maarten [Netherlands , Sint Maarten] [4, March, 2016] 4 March 2016
8 1 March 2016 – Saint Vincent and the Grenadines [Saint Vincent and the Grenadines] [1, March, 2016] 1 March 2016
9 29 February 2016 – Trinidad and Tobago [Trinidad and Tobago] [29, February, 2016] 29 February 2016

In [77]:
# Extract Country (first element of Split_Locations) and Territory (last element)
zika["Country"] = pd.Series(zika["Split_Locations"].iloc[i][0] for i in range(len(zika)))
zika["Territory"] = pd.Series(zika["Split_Locations"].iloc[i][-1] for i in range(len(zika)))

In [78]:
# Show the first 20 rows using HEAD
zika[['Split_Locations','Country','Territory']].head(20)


Out[78]:
Split_Locations Country Territory
0 [Papua New Guinea] Papua New Guinea Papua New Guinea
1 [Peru] Peru Peru
2 [Saint Lucia] Saint Lucia Saint Lucia
3 [Chile] Chile Chile
4 [Viet Nam] Viet Nam Viet Nam
5 [Dominica and Cuba] Dominica and Cuba Dominica and Cuba
6 [Argentina and France] Argentina and France Argentina and France
7 [Netherlands , Sint Maarten] Netherlands Sint Maarten
8 [Saint Vincent and the Grenadines] Saint Vincent and the Grenadines Saint Vincent and the Grenadines
9 [Trinidad and Tobago] Trinidad and Tobago Trinidad and Tobago
10 [Netherlands , Bonaire and Aruba] Netherlands Bonaire and Aruba
11 [United States of America] United States of America United States of America
12 [Maldives] Maldives Maldives
13 [Region of the Americas] Region of the Americas Region of the Americas
14 [United States of America , United States Vir... United States of America United States Virgin Islands
15 [Dominican Republic] Dominican Republic Dominican Republic
16 [France , Saint Martin and Guadeloupe] France Saint Martin and Guadeloupe
17 [Haiti] Haiti Haiti
18 [Bolivia] Bolivia Bolivia
19 [Guyana, Barbados and Ecuador] Guyana, Barbados and Ecuador Guyana, Barbados and Ecuador
Last clean-up: remove the "Territory" when it is the same as the "Country" (e.g. Haiti / Haiti)

How to write ugly, terse code with Python

In [83]:
zika["Territory"] =pd.Series(zika["Territory"][i] 
                             if zika["Territory"][i] != zika["Country"][i]  
                             else " " for i in range(len(zika))
                            )
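
For readers who prefer something less terse, the same clean-up can be expressed with a boolean mask and .loc (a sketch against the same zika dataframe):

# Blank out Territory wherever it merely repeats Country
zika.loc[zika["Territory"] == zika["Country"], "Territory"] = " "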

In [84]:
# Show the first 20 rows using HEAD
zika[['Split_Locations','Country','Territory']].head(20)


Out[84]:
Split_Locations Country Territory
0 [Papua New Guinea] Papua New Guinea
1 [Peru] Peru
2 [Saint Lucia] Saint Lucia
3 [Chile] Chile
4 [Viet Nam] Viet Nam
5 [Dominica and Cuba] Dominica and Cuba
6 [Argentina and France] Argentina and France
7 [Netherlands , Sint Maarten] Netherlands Sint Maarten
8 [Saint Vincent and the Grenadines] Saint Vincent and the Grenadines
9 [Trinidad and Tobago] Trinidad and Tobago
10 [Netherlands , Bonaire and Aruba] Netherlands Bonaire and Aruba
11 [United States of America] United States of America
12 [Maldives] Maldives
13 [Region of the Americas] Region of the Americas
14 [United States of America , United States Vir... United States of America United States Virgin Islands
15 [Dominican Republic] Dominican Republic
16 [France , Saint Martin and Guadeloupe] France Saint Martin and Guadeloupe
17 [Haiti] Haiti
18 [Bolivia] Bolivia
19 [Guyana, Barbados and Ecuador] Guyana, Barbados and Ecuador