XML example and exercise

study examples of accessing nodes in an XML tree structure
work on the exercise, to be completed and submitted

reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
data source: http://www.dbis.informatik.uni-goettingen.de/Mondial



In [24]:

    
from xml.etree import ElementTree as ET
import pandas as pd

XML example

for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
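
As a minimal sketch of the traversal methods used in the cells below, the following parses a small inline XML string instead of the Mondial file. The element names mirror those in the dataset, but the sample document itself (`SAMPLE_XML`) is made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, shaped like the Mondial data
SAMPLE_XML = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
  </country>
  <country car_code="GR">
    <name>Greece</name>
    <province>
      <name>Attiki</name>
      <city><name>Athina</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE_XML)

# find() returns the first matching direct child
first_country_name = root.find('country').find('name').text

# findall() with a path only follows the steps given: country, then city
direct_cities = [c.find('name').text for c in root.findall('country/city')]

# iter() walks the whole subtree, so cities nested in a province are included
all_cities = [c.find('name').text for c in root.iter('city')]

print(first_country_name)
print(direct_cities)
print(all_cities)
```

The difference between `direct_cities` and `all_cities` matters whenever an element can appear at more than one depth in the tree.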



In [2]:

    
document_tree = ET.parse( './data/mondial_database_less.xml' )



In [3]:

    
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)









    



Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra



In [4]:

    
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() walks the whole subtree below each country
    # (getiterator() is deprecated and was removed in Python 3.9)
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))









    



* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [21]:

    
document = ET.parse( './data/mondial_database.xml' )

1. 10 countries with the lowest infant mortality rates



In [57]:

    
names = []
infant_mortalities = []
for element in document.iterfind('country[infant_mortality]'):
    names.append(element.find('name').text)
    infant_mortalities.append(float(element.find('infant_mortality').text))
    
results = pd.DataFrame({'name': names, 'infant_mortality':infant_mortalities})
results.sort_values('infant_mortality').head(10)









    Out[57]:

         infant_mortality            name
    36               1.81          Monaco
    90               2.13           Japan
    109              2.48         Bermuda
    34               2.48          Norway
    98               2.53       Singapore
    35               2.60          Sweden
    8                2.63  Czech Republic
    72               2.73       Hong Kong
    73               3.13           Macao
    39               3.15         Iceland
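
The path `country[infant_mortality]` relies on an ElementTree predicate: it keeps only the `country` elements that have an `infant_mortality` child. A minimal sketch of that filtering, using made-up sample values rather than the dataset and plain `sorted` in place of pandas:

```python
from xml.etree import ElementTree as ET

# Made-up sample; the names and rates are for illustration only
root = ET.fromstring(
    "<mondial>"
    "<country><name>A</name><infant_mortality>4.5</infant_mortality></country>"
    "<country><name>B</name></country>"
    "<country><name>C</name><infant_mortality>2.0</infant_mortality></country>"
    "</mondial>"
)

# [infant_mortality] keeps only countries that have that child element,
# so country B is skipped instead of raising an AttributeError
rates = [
    (c.find('name').text, float(c.find('infant_mortality').text))
    for c in root.iterfind('country[infant_mortality]')
]

# sort ascending by rate; the slice plays the role of head(10)
lowest = sorted(rates, key=lambda pair: pair[1])[:10]
print(lowest)
```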

2. 10 cities with the largest population



In [73]:

    
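cities = []
populations = []
# note: 'country[city]' and 'city[population]' only match cities that are
# direct children of a country; cities nested inside a province are not visited
for element in document.iterfind('country[city]'):
    for sub in element.iterfind('city[population]'):
        cities.append(sub.find('name').text)
        # take the last <population> entry, assumed to be the latest estimate
        pops = [int(p.text) for p in sub.findall('population')]
        populations.append(pops[-1])

results = pd.DataFrame({'city': cities, 'population': populations})
results.sort_values('population', ascending=False).head(10)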









    Out[73]:

                    city  population
    165            Seoul     9708483
    154       Al Qahirah     8471859
    75           Bangkok     7506700
    123        Hong Kong     7055071
    87       Ho Chi Minh     5968384
    201        Singapore     5076700
    153  Al Iskandariyah     4123869
    205       New Taipei     3939305
    166            Busan     3403135
    102        Pyongyang     3255288
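
A path step such as `city` only matches direct children, whereas `.//city` matches at any depth. The sketch below shows the descendant form on a made-up document. It also picks the latest estimate by a `year` attribute instead of by position; that `population` elements carry such an attribute is an assumption here, and `latest_population` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Made-up sample; the 'year' attribute on <population> is an assumption
root = ET.fromstring("""
<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Alpha</name>
      <population year="2011">500</population>
      <population year="2001">400</population>
    </city>
    <province>
      <name>P</name>
      <city>
        <name>Beta</name>
        <population year="1990">900</population>
        <population year="2010">1200</population>
      </city>
    </province>
  </country>
</mondial>
""")

def latest_population(city):
    """Return the population value whose year attribute is highest."""
    newest = max(city.findall('population'), key=lambda p: int(p.get('year')))
    return int(newest.text)

# './/city[population]' matches cities at any depth that have a population child
sizes = {c.find('name').text: latest_population(c)
         for c in root.iterfind('.//city[population]')}
print(sizes)
```

Selecting by year rather than taking the last entry avoids depending on the order in which the estimates happen to be listed.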

3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)



In [154]:

    
ethnic_groups = [e.text for e in document.findall('.//ethnicgroup')]
ethnic_dict = dict.fromkeys(ethnic_groups, 0)
for element in document.iterfind('country[ethnicgroup]'):
    # latest population figure for the country (last <population> entry)
    population = [int(p.text) for p in element.findall('population')][-1]
    for group in element.findall('ethnicgroup'):
        # 'percentage' is given in percent, so divide by 100 to get head counts
        ethnic_dict[group.text] += float(group.get('percentage')) * population / 100

round(pd.Series(ethnic_dict)).astype(int).sort_values(ascending=False).head(10)



    Out[154]:



Han Chinese    1245058800
Indo-Aryan      871815583
European        494872220
African         318325120
Dravidian       302713744
Mestizo         157734355
Bengali         146776917
Russian         131856996
Japanese        126534212
Malay           121993550
dtype: int64
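
A sketch of the weighting arithmetic behind this total, assuming the `percentage` attribute is expressed on a 0 to 100 scale: each country contributes population × percentage / 100 to a group. The figures below are made up for two hypothetical countries.

```python
# Made-up figures for two hypothetical countries
countries = [
    {'population': 1000, 'groups': {'A': 60.0, 'B': 40.0}},
    {'population': 500, 'groups': {'A': 10.0, 'C': 90.0}},
]

totals = {}
for country in countries:
    for group, percentage in country['groups'].items():
        # percent -> head count: population * percentage / 100
        head_count = country['population'] * percentage / 100
        totals[group] = totals.get(group, 0) + head_count

print(totals)
```

Group A receives 600 people from the first country and 50 from the second, 650 in total. Omitting the division by 100 would inflate every total by a factor of 100.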

4. name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [ ]:

    
# convert country code
country_dict = {}
for element in document.iterfind('country'):
    country_dict[element.get('car_code')] = element.find('name').text



In [136]:

    
# find longest river
river_name = ''
river_code = ''
river_length = 0.0
for element in document.iterfind('river[length]'):
    if float(element.find('length').text) > river_length:
        river_length = float(element.find('length').text) 
        river_name = element.find('name').text 
        river_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in river_code.split(' ')])        
print('longest river \n    name: {}\n    countries: {}'.format(river_name, countries))









    



longest river 
    name: Amazonas
    countries: Colombia, Brazil, Peru



In [137]:

    
# find largest lake
lake_name = ''
lake_code = ''
lake_area = 0.0
for element in document.iterfind('lake[area]'):
    if float(element.find('area').text) > lake_area:
        lake_area = float(element.find('area').text) 
        lake_name = element.find('name').text 
        lake_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in lake_code.split(' ')])        
print('largest lake\n    name: {}\n    countries: {}'.format(lake_name, countries))









    



largest lake
    name: Caspian Sea
    countries: Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan



In [152]:

    
# find highest airport elevation
airport_name = ''
airport_code = ''
airport_elevation = 0.0
for element in document.iterfind('airport[elevation]'):
    if element.find('elevation').text is None:
        continue
    if float(element.find('elevation').text) > airport_elevation:
        airport_elevation = float(element.find('elevation').text) 
        airport_name = element.find('name').text 
        airport_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in airport_code.split(' ')])        
print('highest airport elevation\n    name: {}\n    countries: {}'.format(airport_name, countries))









    



highest airport elevation
    name: El Alto Intl
    countries: Bolivia
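
The three searches above follow the same pattern, so they could be folded into one helper built on `max`. The following is only a sketch under that assumption: `largest` is a hypothetical function, and the sample document and country names are made up.

```python
from xml.etree import ElementTree as ET

# Made-up sample document; names and values are for illustration only
root = ET.fromstring("""
<mondial>
  <country car_code="NN"><name>Northland</name></country>
  <country car_code="SS"><name>Southland</name></country>
  <river country="NN SS"><name>Long</name><length>900</length></river>
  <river country="NN"><name>Short</name><length>100</length></river>
  <river country="SS"><name>Unknown</name><length/></river>
</mondial>
""")

# map car_code -> country name, as in the lookup cell above
country_names = {c.get('car_code'): c.find('name').text
                 for c in root.iterfind('country')}

def largest(tree_root, tag, field, names):
    """Return (name, [countries]) of the element with the largest numeric field."""
    candidates = [e for e in tree_root.iterfind('{}[{}]'.format(tag, field))
                  if e.find(field).text is not None]  # skip empty elements
    best = max(candidates, key=lambda e: float(e.find(field).text))
    codes = best.get('country').split()  # attribute holds space-separated codes
    return best.find('name').text, [names[code] for code in codes]

river_name, river_countries = largest(root, 'river', 'length', country_names)
print(river_name, river_countries)
```

The same call with `('lake', 'area')` or `('airport', 'elevation')` would cover the other two cases. Using `max` over the candidates also avoids the `0.0` starting value, which would give a wrong answer if every value were negative.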
