XML example and exercise


  • Study examples of accessing nodes in an XML tree structure
  • Work on the exercise to be completed and submitted



In [24]:
from xml.etree import ElementTree as ET
import pandas as pd

XML example


In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)


Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra

In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() visits every descendant <city>; getiterator() was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))


* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
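
The example above relies on three ElementTree lookups that are easy to confuse. The sketch below runs on a small made-up XML string (the country and city names are placeholders, not taken from the Mondial file) and shows how they differ: `find` returns the first matching direct child, `iterfind` yields all matching direct children, and `iter` walks every descendant.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, only for illustration
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
  <country>
    <name>Borduria</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# find: first direct child with the given tag
first_country = root.find('country')
first_name = first_country.find('name').text

# iterfind: every direct child matching the path
direct_cities = [c.find('name').text for c in first_country.iterfind('city')]

# iter: every descendant with the given tag, at any depth
all_cities = [c.find('name').text for c in first_country.iter('city')]

print(first_name)
print(direct_cities)
print(all_cities)
```

Here `direct_cities` misses the city nested inside the province, while `all_cities` includes it.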

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [21]:
document = ET.parse( './data/mondial_database.xml' )

1. 10 countries with the lowest infant mortality rates


In [57]:
names = []
infant_mortalities = []
for element in document.iterfind('country[infant_mortality]'):
    names.append(element.find('name').text)
    infant_mortalities.append(float(element.find('infant_mortality').text))
    
results = pd.DataFrame({'name': names, 'infant_mortality':infant_mortalities})
results.sort_values('infant_mortality').head(10)


Out[57]:
     infant_mortality  name
36               1.81  Monaco
90               2.13  Japan
109              2.48  Bermuda
34               2.48  Norway
98               2.53  Singapore
35               2.60  Sweden
8                2.63  Czech Republic
72               2.73  Hong Kong
73               3.13  Macao
39               3.15  Iceland

2. 10 cities with the largest population


In [73]:
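cities = []
populations = []
# Note: 'city' matches only direct children of <country>;
# any cities nested deeper in the tree are skipped by this search.
for element in document.iterfind('country[city]'):
    for sub in element.iterfind('city[population]'):
        cities.append(sub.find('name').text)
        # the last <population> child is taken as the latest estimate
        pops = [int(p.text) for p in sub.findall('population')]
        populations.append(pops[-1])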

results = pd.DataFrame({'city': cities, 'population': populations}) 
results.sort_values('population', ascending=False).head(10)


Out[73]:
     city             population
165  Seoul               9708483
154  Al Qahirah          8471859
75   Bangkok             7506700
123  Hong Kong           7055071
87   Ho Chi Minh         5968384
201  Singapore           5076700
153  Al Iskandariyah     4123869
205  New Taipei          3939305
166  Busan               3403135
102  Pyongyang           3255288
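
The search above uses `country[city]` followed by `city[population]`, which only reaches cities that are direct children of a country element. If the data also nests cities inside other elements such as provinces (an assumption about the file layout, not something shown above), a descendant path like `.//city[population]` would pick those up as well. A minimal sketch on invented data, with a hypothetical helper `latest_populations`:

```python
from xml.etree import ElementTree as ET

# Invented sample data; names and numbers are placeholders
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city>
      <name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="2011">900</population>
      </city>
      <city>
        <name>Delta</name>
      </city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

def latest_populations(path):
    """Map city name -> last listed population for cities matching path."""
    result = {}
    for city in root.iterfind(path):
        counts = [int(p.text) for p in city.findall('population')]
        result[city.find('name').text] = counts[-1]
    return result

# only cities that are direct children of <country>
direct_only = latest_populations('country/city[population]')
# cities at any depth below the root
any_depth = latest_populations('.//city[population]')

print(direct_only)
print(any_depth)
```

In both cases the `[population]` predicate drops cities without a population entry, so no extra check for missing values is needed.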

3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)


In [154]:
ethnic_groups = [e.text for e in document.findall('.//ethnicgroup')]
ethnic_dict = dict.fromkeys(ethnic_groups, 0)
for element in document.iterfind('country[ethnicgroup]'):
    # the last <population> child is taken as the latest estimate
    population = [int(p.text) for p in element.findall('population')][-1]
    groups = [e.text for e in element.findall('ethnicgroup')]
    percentages = [float(perc.get('percentage')) for perc in element.findall('ethnicgroup')]
    for i, group in enumerate(groups):
        # percentage is on a 0-100 scale, so convert it to a fraction first
        ethnic_dict[group] += percentages[i] / 100 * population

round(pd.Series(ethnic_dict)).astype(int).sort_values(ascending=False).head(10)


Out[154]:
Han Chinese    1245058800
Indo-Aryan      871815583
European        494872220
African         318325120
Dravidian       302713744
Mestizo         157734355
Bengali         146776917
Russian         131856996
Japanese        126534212
Malay           121993550
dtype: int64
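
The sum over countries can be checked by hand on a tiny example. The sketch below assumes that the `percentage` attribute is given on a 0-100 scale, so each share is divided by 100 before it is multiplied by the country's population. The sample data and group names are invented for illustration.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Invented sample data; not taken from the Mondial file
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2000">900</population>
    <population year="2010">1000</population>
    <ethnicgroup percentage="75">GroupA</ethnicgroup>
    <ethnicgroup percentage="25">GroupB</ethnicgroup>
  </country>
  <country>
    <name>Borduria</name>
    <population year="2010">400</population>
    <ethnicgroup percentage="50">GroupA</ethnicgroup>
    <ethnicgroup percentage="50">GroupC</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
totals = defaultdict(float)

for country in root.iterfind('country[ethnicgroup]'):
    # last <population> child is treated as the latest estimate
    latest = int(country.findall('population')[-1].text)
    for group in country.iterfind('ethnicgroup'):
        share = float(group.get('percentage')) / 100  # assumed 0-100 scale
        totals[group.text] += share * latest

# rank groups by summed population, largest first
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

GroupA appears in both countries, so its total is 0.75 * 1000 + 0.50 * 400 = 950.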

4. name and country of a) longest river, b) largest lake and c) airport at highest elevation


In [ ]:
# convert country code
country_dict = {}
for element in document.iterfind('country'):
    country_dict[element.get('car_code')] = element.find('name').text

In [136]:
# find longest river
river_name = ''
river_code = ''
river_length = 0.0
for element in document.iterfind('river[length]'):
    if float(element.find('length').text) > river_length:
        river_length = float(element.find('length').text) 
        river_name = element.find('name').text 
        river_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in river_code.split(' ')])        
print('longest river \n    name: {}\n    countries: {}'.format(river_name, countries))


longest river 
    name: Amazonas
    countries: Colombia, Brazil, Peru

In [137]:
# find largest lake
lake_name = ''
lake_code = ''
lake_area = 0.0
for element in document.iterfind('lake[area]'):
    if float(element.find('area').text) > lake_area:
        lake_area = float(element.find('area').text) 
        lake_name = element.find('name').text 
        lake_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in lake_code.split(' ')])        
print('largest lake\n    name: {}\n    countries: {}'.format(lake_name, countries))


largest lake
    name: Caspian Sea
    countries: Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan

In [152]:
# find highest airport elevation
airport_name = ''
airport_code = ''
airport_elevation = 0.0
for element in document.iterfind('airport[elevation]'):
    if element.find('elevation').text is None:
        continue
    if float(element.find('elevation').text) > airport_elevation:
        airport_elevation = float(element.find('elevation').text) 
        airport_name = element.find('name').text 
        airport_code = element.get('country')
        
countries = ', '.join([country_dict[c] for c in airport_code.split(' ')])        
print('highest airport elevation\n    name: {}\n    countries: {}'.format(airport_name, countries))


highest airport elevation
    name: El Alto Intl
    countries: Bolivia
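
The three searches above share one pattern: scan the elements of a given tag, compare a numeric child, and remember the winner. One possible way to express this once is a small helper built on `max()`. The function name `largest_by` and the sample data are made up for this sketch; it is not part of the original exercise.

```python
from xml.etree import ElementTree as ET

# Invented sample data; not taken from the Mondial file
SAMPLE = """
<mondial>
  <country car_code="A"><name>Aland</name></country>
  <country car_code="B"><name>Borduria</name></country>
  <river country="A B"><name>Long</name><length>500</length></river>
  <river country="A"><name>Short</name><length>20.5</length></river>
  <river country="B"><name>Unknown</name><length/></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# map car_code -> country name, as in the conversion cell above
country_names = {c.get('car_code'): c.find('name').text
                 for c in root.iterfind('country')}

def largest_by(tree_root, tag, field):
    """Return (name, [country names]) for the <tag> element with the largest <field>."""
    candidates = [
        e for e in tree_root.iterfind('{}[{}]'.format(tag, field))
        if e.find(field).text is not None  # skip empty values
    ]
    best = max(candidates, key=lambda e: float(e.find(field).text))
    codes = best.get('country').split()
    return best.find('name').text, [country_names[c] for c in codes]

print(largest_by(root, 'river', 'length'))
```

The same call with `('lake', 'area')` or `('airport', 'elevation')` would cover the other two parts, provided those elements follow the same layout.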