XML example and exercise

study examples of accessing nodes in XML tree structure
work on exercise to be completed and submitted

reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
data source: http://www.dbis.informatik.uni-goettingen.de/Mondial



In [2]:

    
from xml.etree import ElementTree as ET

XML example

for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html



In [3]:

    
document_tree = ET.parse( './data/mondial_database_less.xml' )



In [4]:

    
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)









    



Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra



In [5]:

    
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':', end=''),
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print(capitals_string[:-2])









    



* Albania:Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:Skopje, Kumanovo
* Serbia:Beograd, Novi Sad, Niš
* Montenegro:Podgorica
* Kosovo:Prishtine
* Andorra:Andorra la Vella

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

10 countries with the lowest infant mortality rates
10 cities with the largest population
10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [6]:

    
document = ET.parse( './data/mondial_database.xml' )



In [7]:

    
#Question 1: Find 10 countries with the lowest infant mortality rates

import pandas as pd
import numpy as np

#Create a dataframe with columns 'Country' and 'Mortality'
country = []
mortality = []
root = document.getroot()
for element in root.findall('country'):
    country.append(element.find('name').text)
    if element.find('infant_mortality') is not None and element.find('infant_mortality').text is not None:
        mortality.append(float(element.find('infant_mortality').text))
    else:
        mortality.append(np.NaN)
data = pd.DataFrame(list(zip(country,mortality)))
data.columns = ['Country', 'Mortality']

#Find 10 countries with the lowest infant mortality rates
data.sort_values('Mortality').iloc[:10,]









    Out[7]:






  
    
      
      Country
      Mortality
    
  
  
    
      38
      Monaco
      1.81
    
    
      98
      Japan
      2.13
    
    
      117
      Bermuda
      2.48
    
    
      36
      Norway
      2.48
    
    
      106
      Singapore
      2.53
    
    
      37
      Sweden
      2.60
    
    
      10
      Czech Republic
      2.63
    
    
      78
      Hong Kong
      2.73
    
    
      79
      Macao
      3.13
    
    
      44
      Iceland
      3.15



In [54]:

    
#Question 2: Find 10 cities with the largest population

city = []
population = []
year=[]

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.iter('city'):
          for subelement2 in subelement.findall('population'):
                  city.append(subelement.find('name').text + ', ' + this_country)
                  population.append(float(subelement2.text))
                  year.append(float(subelement2.get('year')))
            
data_city = pd.DataFrame(list(zip(city, population, year)))
data_city.columns = ['City','Population','Year']

#Based on 'data_city', create a dataframe 'data_city_latest' that only include the rows of the most recent year
max_year = data_city.groupby('City')['Year'].max().to_frame().reset_index()
data_city_latest = pd.merge(max_year, data_city, how='left', on=['City', 'Year'])            
         
#Find 10 cities with the largest population
data_city_latest.sort_values('Population', ascending=False).iloc[:10,]









    Out[54]:






  
    
      
      City
      Year
      Population
    
  
  
    
      2421
      Shanghai, China
      2010.0
      22315474.0
    
    
      1145
      Istanbul, Turkey
      2012.0
      13710512.0
    
    
      1768
      Mumbai, India
      2011.0
      12442373.0
    
    
      1755
      Moskva, Russia
      2013.0
      11979529.0
    
    
      288
      Beijing, China
      2010.0
      11716620.0
    
    
      2578
      São Paulo, Brazil
      2010.0
      11152344.0
    
    
      2674
      Tianjin, China
      2010.0
      11090314.0
    
    
      957
      Guangzhou, China
      2010.0
      11071424.0
    
    
      699
      Delhi, India
      2011.0
      11034555.0
    
    
      2434
      Shenzhen, China
      2010.0
      10358381.0



In [29]:

    
#Question 3: Find 10 ethnic groups with the largest overall populations 

#Create a dataframe 'data_ethnic' with columns 'Country', 'Ethnicity' and 'Percentage'
country = []
ethnic = []
percentage = []

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.findall('ethnicgroup'):
        country.append(this_country)
        ethnic.append(subelement.text)
        percentage.append(float(subelement.get('percentage')))

data_ethnic = pd.DataFrame(list(zip(country, ethnic, percentage)))
data_ethnic.columns = ['Country','Ethnicity','Percentage']

#Create a dataframe 'data_pop' with columns 'Country', 'Population' and 'Year'
country = []
population = []
year = []

for element in root.findall('country'):
    this_country = element.find('name').text
    for subelement in element.findall('population'):
        country.append(this_country)
        population.append(float(subelement.text))
        year.append(float(subelement.get('year')))

data_pop = pd.DataFrame(list(zip(country, population, year)))
data_pop.columns = ['Country','Population','Year']

#Based on 'data_pop', create a dataframe 'data_pop_latest' that only include the rows of the most recent year
max_year = data_pop.groupby('Country')['Year'].max().to_frame().reset_index()
data_pop_latest = pd.merge(max_year, data_pop, how='left', on=['Country', 'Year'])

#Merge dataframes 'data_ethnic' and 'data_pop_latest'. 
#The new dataframe contains columns 'Country','Ethnicity','Percentage','Population', and 'Year'
data = pd.merge(data_ethnic, data_pop_latest, how='left', on=['Country'])

#Find 10 ethnic groups with the largest overall populations 
data['Count']=data['Percentage']*data['Population']*0.01
data.groupby('Ethnicity')['Count'].sum().to_frame().reset_index().sort_values(by='Count', ascending=False).iloc[:10,]









    Out[29]:






  
    
      
      Ethnicity
      Count
    
  
  
    
      113
      Han Chinese
      1.245059e+09
    
    
      120
      Indo-Aryan
      8.718156e+08
    
    
      89
      European
      4.948722e+08
    
    
      2
      African
      3.183251e+08
    
    
      77
      Dravidian
      3.027137e+08
    
    
      176
      Mestizo
      1.577344e+08
    
    
      42
      Bengali
      1.467769e+08
    
    
      217
      Russian
      1.318570e+08
    
    
      128
      Japanese
      1.265342e+08
    
    
      163
      Malay
      1.219936e+08



In [11]:

    
# Question 4 (a) name and country of longest river

# Create a dictionary with country code as the ID and country name as the value
country_dict = {}

for element in root.findall('country'):
      this_country = element.find('name').text  
      this_country_id = element.get('car_code') 
      country_dict[this_country_id] = this_country

#Create a dataframe with columns 'River', 'Country ID', and 'Length'
river = []
country_ids = []
length = []

for element in root.findall('river'):
    river.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('length') is not None and element.find('length').text is not None:
        length.append(float(element.find('length').text))
    else:
        length.append(np.NaN)

data_river = pd.DataFrame(list(zip(river, country_ids, length)))
data_river.columns = ['River','Country ID','Length'] 


#Find the name and country of longest river
this_river = data_river.sort_values(by='Length', ascending=False)
countries_ids = this_river.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The longest river is ' + this_river.iloc[:1,0].values[0]  +', which crosses ' + ', '.join(countries))









    



The longest river is Amazonas, which crosses Colombia, Brazil, Peru



In [14]:

    
# Question 4 (b) name and country of largest lake 
    
#Create a dataframe with columns 'Lake', 'Country ID', and 'Area'
lake = []
country_ids = []
area = []

for element in root.findall('lake'):
    lake.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('area') is not None and element.find('area').text is not None:
        area.append(float(element.find('area').text))
    else:
        area.append(np.NaN)

data_lake = pd.DataFrame(list(zip(lake, country_ids, area)))
data_lake.columns = ['Lake','Country ID','Area'] 

#Find the name and country of largest lake 
this_lake = data_lake.sort_values(by='Area', ascending=False)
countries_ids = this_lake.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The largest lake is ' + this_lake.iloc[:1,0].values[0]  +' crosses ' + ', '.join(countries))









    



The largest lake is Caspian Sea crosses Russia, Azerbaijan, Kazakhstan, Iran, Turkmenistan



In [15]:

    
# Question 4 (c) name and country of airport at highest elevation

#Create a dataframe with columns 'Airport', 'Country ID', and 'Elevation'
airport = []
country_ids = []
elevation = []

for element in root.findall('airport'):
    airport.append(element.find('name').text)
    country_ids.append(element.get('country'))
    if element.find('elevation') is not None and element.find('elevation').text is not None:
        elevation.append(float(element.find('elevation').text))
    else:                               
        elevation.append(np.NaN)

data_airport = pd.DataFrame(list(zip(airport, country_ids, elevation)))
data_airport.columns = ['Airport','Country ID','Elevation'] 

# Find the name and country of airport at highest elevation
this_airport = data_airport.sort_values(by='Elevation', ascending=False)
countries_ids = this_airport.iloc[:1,1].values[0].split(' ')
countries = [country_dict[c] for c in countries_ids ]

print('The airport at highest elevation is ' + this_airport.iloc[:1,0].values[0]  +' crosses ' + ', '.join(countries))









    



The airport at highest elevation is El Alto Intl crosses Bolivia



In [ ]:

	Country	Mortality
38	Monaco	1.81
98	Japan	2.13
117	Bermuda	2.48
36	Norway	2.48
106	Singapore	2.53
37	Sweden	2.60
10	Czech Republic	2.63
78	Hong Kong	2.73
79	Macao	3.13
44	Iceland	3.15

	City	Year	Population
2421	Shanghai, China	2010.0	22315474.0
1145	Istanbul, Turkey	2012.0	13710512.0
1768	Mumbai, India	2011.0	12442373.0
1755	Moskva, Russia	2013.0	11979529.0
288	Beijing, China	2010.0	11716620.0
2578	São Paulo, Brazil	2010.0	11152344.0
2674	Tianjin, China	2010.0	11090314.0
957	Guangzhou, China	2010.0	11071424.0
699	Delhi, India	2011.0	11034555.0
2434	Shenzhen, China	2010.0	10358381.0

	Ethnicity	Count
113	Han Chinese	1.245059e+09
120	Indo-Aryan	8.718156e+08
89	European	4.948722e+08
2	African	3.183251e+08
77	Dravidian	3.027137e+08
176	Mestizo	1.577344e+08
42	Bengali	1.467769e+08
217	Russian	1.318570e+08
128	Japanese	1.265342e+08
163	Malay	1.219936e+08