XML example and exercise


  • study examples of accessing nodes in XML tree structure
  • work on exercise to be completed and submitted



In [2]:
from xml.etree import ElementTree as ET

XML example


In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)


Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra

In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':', end=''),
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print(capitals_string[:-2])


* Albania:Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:Skopje, Kumanovo
* Serbia:Beograd, Novi Sad, Niš
* Montenegro:Podgorica
* Kosovo:Prishtine
* Andorra:Andorra la Vella

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [6]:
document = ET.parse( './data/mondial_database.xml' )

In [7]:
#Question 1: Find 10 countries with the lowest infant mortality rates

import pandas as pd
import numpy as np

#Create a dataframe with columns 'Country' and 'Mortality'
country = []
mortality = []
root = document.getroot()
for element in root.findall('country'):
    country.append(element.find('name').text)
    if element.find('infant_mortality') is not None and element.find('infant_mortality').text is not None:
        mortality.append(float(element.find('infant_mortality').text))
    else:
        mortality.append(np.NaN)
data = pd.DataFrame(list(zip(country,mortality)))
data.columns = ['Country', 'Mortality']

#Find 10 countries with the lowest infant mortality rates
data.sort_values('Mortality').iloc[:10,]


Out[7]:
Country Mortality
38 Monaco 1.81
98 Japan 2.13
117 Bermuda 2.48
36 Norway 2.48
106 Singapore 2.53
37 Sweden 2.60
10 Czech Republic 2.63
78 Hong Kong 2.73
79 Macao 3.13
44 Iceland 3.15

In [54]:
#Question 2: Find 10 cities with the largest population

city = []
population = []
year=[]

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.iter('city'):
          for subelement2 in subelement.findall('population'):
                  city.append(subelement.find('name').text + ', ' + this_country)
                  population.append(float(subelement2.text))
                  year.append(float(subelement2.get('year')))
            
data_city = pd.DataFrame(list(zip(city, population, year)))
data_city.columns = ['City','Population','Year']

#Based on 'data_city', create a dataframe 'data_city_latest' that only include the rows of the most recent year
max_year = data_city.groupby('City')['Year'].max().to_frame().reset_index()
data_city_latest = pd.merge(max_year, data_city, how='left', on=['City', 'Year'])            
         
#Find 10 cities with the largest population
data_city_latest.sort_values('Population', ascending=False).iloc[:10,]


Out[54]:
City Year Population
2421 Shanghai, China 2010.0 22315474.0
1145 Istanbul, Turkey 2012.0 13710512.0
1768 Mumbai, India 2011.0 12442373.0
1755 Moskva, Russia 2013.0 11979529.0
288 Beijing, China 2010.0 11716620.0
2578 São Paulo, Brazil 2010.0 11152344.0
2674 Tianjin, China 2010.0 11090314.0
957 Guangzhou, China 2010.0 11071424.0
699 Delhi, India 2011.0 11034555.0
2434 Shenzhen, China 2010.0 10358381.0

In [29]:
#Question 3: Find 10 ethnic groups with the largest overall populations 

#Create a dataframe 'data_ethnic' with columns 'Country', 'Ethnicity' and 'Percentage'
country = []
ethnic = []
percentage = []

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.findall('ethnicgroup'):
        country.append(this_country)
        ethnic.append(subelement.text)
        percentage.append(float(subelement.get('percentage')))

data_ethnic = pd.DataFrame(list(zip(country, ethnic, percentage)))
data_ethnic.columns = ['Country','Ethnicity','Percentage']

#Create a dataframe 'data_pop' with columns 'Country', 'Population' and 'Year'
country = []
population = []
year = []

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.findall('population'):
        country.append(this_country)
        population.append(float(subelement.text))
        year.append(float(subelement.get('year')))

data_pop = pd.DataFrame(list(zip(country, population, year)))
data_pop.columns = ['Country','Population','Year']

#Based on 'data_pop', create a dataframe 'data_pop_latest' that only include the rows of the most recent year
max_year = data_pop.groupby('Country')['Year'].max().to_frame().reset_index()
data_pop_latest = pd.merge(max_year, data_pop, how='left', on=['Country', 'Year'])

#Merge dataframes 'data_ethnic' and 'data_pop_latest'. 
#The new dataframe contains columns 'Country','Ethnicity','Percentage','Population', and 'Year'
data = pd.merge(data_ethnic, data_pop_latest, how='left', on=['Country'])

#Find 10 ethnic groups with the largest overall populations 
data['Count']=data['Percentage']*data['Population']*0.01
data.groupby('Ethnicity')['Count'].sum().to_frame().reset_index().sort_values(by='Count', ascending=False).iloc[:10,]


Out[29]:
Ethnicity Count
113 Han Chinese 1.245059e+09
120 Indo-Aryan 8.718156e+08
89 European 4.948722e+08
2 African 3.183251e+08
77 Dravidian 3.027137e+08
176 Mestizo 1.577344e+08
42 Bengali 1.467769e+08
217 Russian 1.318570e+08
128 Japanese 1.265342e+08
163 Malay 1.219936e+08

In [11]:
# Question 4 (a) name and country of longest river

# Create a dictionary with country code as the ID and country name as the value
country_dict = {}

for element in root.findall('country'):
      this_country = element.find('name').text  
      this_country_id = element.get('car_code') 
      country_dict[this_country_id] = this_country

#Create a dataframe with columns 'River', 'Country ID', and 'Length'
river = []
country_ids = []
length = []

for element in root.findall('river'):
    river.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('length') is not None and element.find('length').text is not None:
        length.append(float(element.find('length').text))
    else:
        length.append(np.NaN)

data_river = pd.DataFrame(list(zip(river, country_ids, length)))
data_river.columns = ['River','Country ID','Length'] 


#Find the name and country of longest river
this_river = data_river.sort_values(by='Length', ascending=False)
countries_ids = this_river.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The longest river is ' + this_river.iloc[:1,0].values[0]  +', which crosses ' + ', '.join(countries))


The longest river is Amazonas, which crosses Colombia, Brazil, Peru

In [14]:
# Question 4 (b) name and country of largest lake 
    
#Create a dataframe with columns 'Lake', 'Country ID', and 'Area'
lake = []
country_ids = []
area = []

for element in root.findall('lake'):
    lake.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('area') is not None and element.find('area').text is not None:
        area.append(float(element.find('area').text))
    else:
        area.append(np.NaN)

data_lake = pd.DataFrame(list(zip(lake, country_ids, area)))
data_lake.columns = ['Lake','Country ID','Area'] 

#Find the name and country of largest lake 
this_lake = data_lake.sort_values(by='Area', ascending=False)
countries_ids = this_lake.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The largest lake is ' + this_lake.iloc[:1,0].values[0]  +' crosses ' + ', '.join(countries))


The largest lake is Caspian Sea crosses Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan

In [15]:
# Question 4 (c) name and country of airport at highest elevation

#Create a dataframe with columns 'Airport', 'Country ID', and 'Elevation'
airport = []
country_ids = []
elevation = []

for element in root.findall('airport'):
    airport.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('elevation') is not None and element.find('elevation').text is not None:
        elevation.append(float(element.find('elevation').text))
    else:                               
        elevation.append(np.NaN)

data_airport = pd.DataFrame(list(zip(airport, country_ids, elevation)))
data_airport.columns = ['Airport','Country ID','Elevation'] 

# Find the name and country of airport at highest elevation
this_airport = data_airport.sort_values(by='Elevation', ascending=False)
countries_ids = this_airport.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The airport at highest elevation is ' + this_airport.iloc[:1,0].values[0]  +' crosses ' + ', '.join(countries))


The airport at highest elevation is El Alto Intl crosses Bolivia

In [ ]: