XML example and exercise

study examples of accessing nodes in an XML tree structure
work on the exercise, to be completed and submitted

reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
data source: http://www.dbis.informatik.uni-goettingen.de/Mondial




XML example

for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
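The basic access patterns can be sketched on a small inline document. The sample XML below is made up for illustration (the country names and the `car_code` values are hypothetical), so it only loosely resembles the real data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely shaped like the Mondial data.
SAMPLE = """
<mondial>
  <country car_code="FR1">
    <name>Freedonia</name>
    <city><name>Alpha</name></city>
  </country>
  <country car_code="SY1">
    <name>Sylvania</name>
    <city><name>Beta</name></city>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Iterating over an element yields its direct children.
child_tags = [child.tag for child in root]

# find() returns the first matching direct child; .text is its text content.
country_names = [c.find('name').text for c in root.iterfind('country')]

# get() reads an XML attribute (returns None if it is absent).
codes = [c.get('car_code') for c in root.iterfind('country')]

# findall() returns every matching direct child as a list.
cities_per_country = {
    c.find('name').text: len(c.findall('city')) for c in root.iterfind('country')
}

# iter() walks the whole subtree in document order.
all_city_names = [city.find('name').text for city in root.iter('city')]

print(child_tags)
print(country_names)
print(codes)
print(cities_per_country)
print(all_city_names)
```

With a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.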




XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

10 countries with the lowest infant mortality rates
10 cities with the largest population
10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
name and country of a) the longest river, b) the largest lake and c) the airport at the highest elevation



In [22]:

    
# Answer to Exercise 1 (Find 10 countries with the lowest infant mortality rates)

import pandas as pd
import numpy as np
from xml.etree import ElementTree as ET

document_tree = ET.parse('./data/mondial_database.xml')

# Set up an empty dataframe as a placeholder for the country and infant_mortality columns
country_df = pd.DataFrame(columns=["country", "infant_mortality"])

# Iterate through the XML tree and get each country's name and infant mortality rate.
# Store these in a dataframe so they can be sorted easily.
for element in document_tree.iterfind('country'):
    country = element.find('name').text
    # Reset per country so that a missing value is recorded as NaN instead of
    # being carried over from the previous country.
    infant_mortality = np.nan
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for subelement in element.iter('infant_mortality'):
        infant_mortality = float(subelement.text)

    country_df.loc[len(country_df)] = [country, infant_mortality]

# Sort in ascending order (the default) and show the ten lowest rates
country_df.sort_values(by='infant_mortality').head(10)









Out[22]:

	country	infant_mortality
38	Monaco	1.81
98	Japan	2.13
36	Norway	2.48
117	Bermuda	2.48
106	Singapore	2.53
37	Sweden	2.60
10	Czech Republic	2.63
78	Hong Kong	2.73
79	Macao	3.13
44	Iceland	3.15
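Appending rows one at a time with `.loc[len(df)]` makes the dataframe grow on every step. A minimal alternative sketch, using only the standard library, collects the rows in a list first and skips countries that have no `infant_mortality` element. The sample XML and the helper name `lowest_infant_mortality` are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the second country has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>Freedonia</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Sylvania</name></country>
  <country><name>Osterlich</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return the n (name, rate) pairs with the lowest rates (hypothetical helper)."""
    rows = []
    for country in root.iterfind('country'):
        # findtext() returns None when the child element is missing.
        rate = country.findtext('infant_mortality')
        if rate is None:
            continue
        rows.append((country.findtext('name'), float(rate)))
    return sorted(rows, key=lambda row: row[1])[:n]

result = lowest_infant_mortality(ET.fromstring(SAMPLE), n=2)
print(result)
```

If a dataframe is still wanted, the collected list can be passed to `pd.DataFrame(rows, columns=[...])` in one step.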



In [7]:

    
# Answer to Exercise 2 (Find 10 cities with the largest populations)

import pandas as pd
from xml.etree import ElementTree as ET

document_tree = ET.parse('./data/mondial_database.xml')

# Set up an empty dataframe as a placeholder for the city and population columns
city_df = pd.DataFrame(columns=["city", "population"])

# Iterate through the XML tree and get each city's name and population.
# Cities sit below the country elements, so loop over countries first;
# iter('city') then finds cities at any depth under each country.
for country in document_tree.iterfind('country'):
    for city in country.iter('city'):
        cityname = city.find('name').text
        # Reset per city so that a city without population data gets NaN
        # instead of the previous city's value. When a city has several
        # population entries, the last one in document order is kept.
        population = float('nan')
        for pop in city.iterfind('population'):
            population = float(pop.text)

        city_df.loc[len(city_df)] = [cityname, population]

# Sort in descending order and show the ten largest cities
city_df.sort_values(by='population', ascending=False).head(10)
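The choice between `iterfind('city')` and `iter('city')` matters when cities are nested below another element. The sketch below shows the difference on a made-up sample. It also picks the newest population explicitly, assuming each `population` element carries a `year` attribute (an assumption about the data, as is the helper name `latest_population`).

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the `year` attribute on <population> is an assumption.
SAMPLE = """
<country>
  <name>Freedonia</name>
  <province>
    <city>
      <name>Alpha</name>
      <population year="2011">300</population>
      <population year="2001">250</population>
    </city>
  </province>
  <city>
    <name>Beta</name>
  </city>
</country>
"""

country = ET.fromstring(SAMPLE)

# iterfind('city') matches direct children only.
direct = [c.find('name').text for c in country.iterfind('city')]

# iter('city') matches descendants at any depth, in document order.
nested = [c.find('name').text for c in country.iter('city')]

def latest_population(city):
    """Return the population with the highest year, or None if there is none."""
    entries = city.findall('population')
    if not entries:
        return None
    newest = max(entries, key=lambda p: int(p.get('year', 0)))
    return float(newest.text)

populations = {c.find('name').text: latest_population(c) for c in country.iter('city')}

print(direct)
print(nested)
print(populations)
```

Selecting by `year` does not depend on the order of the entries in the file, which taking the last entry does.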



In [16]:

    
# Answer to Exercise 3 (Find 10 ethnic groups with the largest overall populations,
# summing the best/latest estimates over all countries)

import pandas as pd
from xml.etree import ElementTree as ET

document_tree = ET.parse('./data/mondial_database.xml')

# Set up an empty dataframe as a placeholder for the country, ethnicity and population columns
pop_df = pd.DataFrame(columns=["Country", "Ethnicity", "Population"])

# Strategy: The XML file gives population figures per country, not per ethnic group.
#           It does list each country's ethnic groups with their percentage share,
#           so each group's population is computed from the country total.

# Iterate through the XML tree and get each country's name and ethnic group populations.
for country in document_tree.iterfind('country'):
    countryname = country.find('name').text
    # Reset per country so a missing value is not carried over from the previous
    # country. The last population entry in document order is kept.
    countrypop = 0.0
    for pop in country.iterfind('population'):
        countrypop = float(pop.text)

    for ethnicgrp in country.iterfind('ethnicgroup'):
        ethnicgrpname = ethnicgrp.text

        # Calculate the ethnic group population.
        # Formula = country population * ethnic group percentage / 100
        ethnicgrppop = round(float(ethnicgrp.attrib['percentage']) * int(countrypop) * 0.01)

        # Save these values in the dataframe
        pop_df.loc[len(pop_df)] = [countryname, ethnicgrpname, ethnicgrppop]

# Group the data by ethnic group irrespective of country and sum the populations.
# Then display the ten ethnic groups with the largest totals.
pop_df.groupby('Ethnicity')[['Population']].sum().sort_values(by='Population', ascending=False).head(10)









Out[16]:

	Population
Ethnicity
Han Chinese	1.245059e+09
Indo-Aryan	8.718156e+08
European	4.948722e+08
African	3.183251e+08
Dravidian	3.027137e+08
Mestizo	1.577344e+08
Bengali	1.467769e+08
Russian	1.318570e+08
Japanese	1.265342e+08
Malay	1.219936e+08
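The percentage arithmetic can be checked by hand on a tiny made-up sample. This sketch sums each group's share across countries with a `defaultdict` instead of a dataframe. The country and group names are hypothetical, and no rounding is applied.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample with two countries sharing one ethnic group.
SAMPLE = """
<mondial>
  <country>
    <name>Freedonia</name>
    <population>1000</population>
    <ethnicgroup percentage="60">Alphan</ethnicgroup>
    <ethnicgroup percentage="40">Betan</ethnicgroup>
  </country>
  <country>
    <name>Sylvania</name>
    <population>500</population>
    <ethnicgroup percentage="50">Alphan</ethnicgroup>
  </country>
</mondial>
"""

totals = defaultdict(float)
for country in ET.fromstring(SAMPLE).iterfind('country'):
    populations = country.findall('population')
    if not populations:
        continue  # no country total, so no group estimate is possible
    # Take the last entry in document order, as the answer above does.
    country_pop = float(populations[-1].text)
    for group in country.iterfind('ethnicgroup'):
        # group population = country population * percentage / 100
        totals[group.text] += country_pop * float(group.get('percentage')) / 100

ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Here Alphan totals 1000 * 60 / 100 + 500 * 50 / 100 = 850 and Betan totals 1000 * 40 / 100 = 400.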



Output of Exercise 2 (10 cities with the largest populations):

	city	population
1341	Shanghai	22315474.0
771	Istanbul	13710512.0
1527	Mumbai	12442373.0
479	Moskva	11979529.0
1340	Beijing	11716620.0
2810	São Paulo	11152344.0
1342	Tianjin	11090314.0
1064	Guangzhou	11071424.0
1582	Delhi	11034555.0
1067	Shenzhen	10358381.0
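The fourth exercise item (longest river, largest lake, airport at the highest elevation) has no answer cell above. A possible approach is sketched here on made-up data. The tag and attribute names (`river`/`length`, `lake`/`area`, `airport`/`elevation`, and a `country` attribute holding country codes) are assumptions about the file's schema and should be checked against the real data. The helper `largest` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample. Tag and attribute names are assumptions about the schema.
SAMPLE = """
<mondial>
  <river country="AA BB"><name>Long River</name><length>900</length></river>
  <river country="AA"><name>Short River</name><length>120</length></river>
  <river country="BB"><name>Unmeasured River</name></river>
  <lake country="BB"><name>Big Lake</name><area>5000</area></lake>
  <lake country="AA"><name>Small Lake</name><area>40</area></lake>
  <airport country="AA"><name>High Field</name><elevation>3200</elevation></airport>
  <airport country="BB"><name>Low Field</name><elevation>15</elevation></airport>
</mondial>
"""

def largest(root, tag, measure):
    """Return (value, name, country codes) for the element of type `tag`
    with the largest `measure` child, or None if none has a value."""
    best = None
    for element in root.iter(tag):
        value = element.findtext(measure)
        if not value:
            continue  # skip elements with a missing or empty measurement
        candidate = (float(value), element.findtext('name'), element.get('country'))
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

root = ET.fromstring(SAMPLE)
longest_river = largest(root, 'river', 'length')
largest_lake = largest(root, 'lake', 'area')
highest_airport = largest(root, 'airport', 'elevation')

print(longest_river)
print(largest_lake)
print(highest_airport)
```

If the `country` attribute holds codes, not names, a lookup from each country's code to its `name` element would be needed to report the country name.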
