XML example and exercise


  • study examples of accessing nodes in an XML tree structure
  • work on the exercise to be completed and submitted



XML example
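
A minimal sketch of the node-access calls that the exercise below relies on, run on a made-up inline snippet rather than the real file. The sample countries, the `car_code` attribute, and the nesting of cities inside provinces are assumptions for illustration only; `find`, `findall`, `iter`, `iterfind`, and `attrib` are standard `xml.etree.ElementTree` features.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; the real file is data/mondial_database.xml.
SAMPLE = """
<mondial>
  <country car_code="AA">
    <name>Aland</name>
    <population year="2001">1000</population>
    <population year="2011">1200</population>
    <city><name>Alpha</name></city>
  </country>
  <country car_code="BB">
    <name>Borduria</name>
    <province><city><name>Beta</name></city></province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# find() returns the first matching direct child, or None if there is none.
first_country = root.find('country')
first_name = first_country.find('name').text

# findall() returns a list of all matching direct children.
populations = [int(p.text) for p in first_country.findall('population')]

# iter() walks all descendants, so it also reaches cities nested in provinces.
city_names = [c.find('name').text for c in root.iter('city')]

# attrib is a plain dict of the element's XML attributes.
codes = [c.attrib['car_code'] for c in root.iterfind('country')]

# A missing child yields None rather than raising, so check before using .text.
missing = root.findall('country')[1].find('population')

print(first_name, populations, city_names, codes, missing)
```

The `None` result in the last step matters in practice: calling `.text` on it raises `AttributeError`, so loops over real data should test for it first.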



XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [22]:
# Answer to Exercise 1 (Find 10 countries with the lowest infant mortality rates)

import pandas as pd
import numpy as np
from xml.etree import ElementTree as ET

document_tree = ET.parse( './data/mondial_database.xml' )

# Collect one [country, infant_mortality] row per country, then build the
# dataframe in a single call instead of growing it row by row.
rows = []

# Iterate through the xml tree and get each country name and its mortality rate.
for element in document_tree.iterfind('country'):
    country = element.find('name').text
    # getiterator() was removed in Python 3.9; find() with './/' also searches
    # all descendants and returns None when nothing matches.
    rate = element.find('.//infant_mortality')
    # Record NaN when the element is missing so that the previous country's
    # value is not carried over; sort_values places NaN last.
    infant_mortality = float(rate.text) if rate is not None else np.nan
    rows.append([country, infant_mortality])

country_df = pd.DataFrame(rows, columns = ["country","infant_mortality"])

# Sort data and find top ten countries in ascending order (default)
country_df.sort_values(by = 'infant_mortality').head(10)


Out[22]:
            country  infant_mortality
38           Monaco              1.81
98            Japan              2.13
36           Norway              2.48
117         Bermuda              2.48
106       Singapore              2.53
37           Sweden              2.60
10   Czech Republic              2.63
78        Hong Kong              2.73
79            Macao              3.13
44          Iceland              3.15
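
As a side note on the final step, `DataFrame.nsmallest` is an alternative to sorting and taking the head, and it leaves out rows whose value is missing. The sketch below uses made-up country names and rates purely for illustration; `NaN` stands in for a country that has no infant mortality figure.

```python
import pandas as pd

# Hypothetical rates; NaN marks a country with no infant mortality figure.
records = [
    {"country": "A", "infant_mortality": 4.2},
    {"country": "B", "infant_mortality": float("nan")},
    {"country": "C", "infant_mortality": 1.9},
    {"country": "D", "infant_mortality": 3.1},
]

# Building the frame in one call avoids growing it row by row.
df = pd.DataFrame(records)

# nsmallest returns the n lowest values in ascending order and skips NaN rows
# as long as enough non-missing rows are available.
lowest = df.nsmallest(2, "infant_mortality")
print(lowest)
```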

In [7]:
# Answer to Exercise 2 (Find 10 cities with largest populations)

import pandas as pd
from xml.etree import ElementTree as ET

document_tree = ET.parse( './data/mondial_database.xml' )

# Collect one [city, population] row per city, then build the dataframe in a
# single call instead of growing it row by row.
rows = []

# Iterate through the xml tree and get each city name and its population.
# Loop over the country elements first, since they sit at the top level;
# iter() then also reaches cities nested deeper inside each country.
for country in document_tree.iterfind('country'):
    for city in country.iter('city'):
        cityname = city.find('name').text
        populations = city.findall('population')
        # Use the last listed population figure. Record NaN when a city has
        # none so that the previous city's value is not carried over.
        population = float(populations[-1].text) if populations else float('nan')
        rows.append([cityname, population])

city_df = pd.DataFrame(rows, columns = ["city","population"])

# Sort data and find top ten cities in descending order (NaN rows sort last)
city_df.sort_values(by = 'population', ascending = False).head(10)


Out[7]:
           city  population
1341   Shanghai  22315474.0
771    Istanbul  13710512.0
1527     Mumbai  12442373.0
479      Moskva  11979529.0
1340    Beijing  11716620.0
2810  São Paulo  11152344.0
1342    Tianjin  11090314.0
1064  Guangzhou  11071424.0
1582      Delhi  11034555.0
1067   Shenzhen  10358381.0
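
The cell above takes the last `population` element listed for each city and treats it as the latest estimate. If the elements carry a `year` attribute (an assumption here, not something the notebook confirms), the latest figure could instead be picked explicitly, which does not depend on the order of the elements. A minimal sketch on a made-up city record, with `latest_population` as a hypothetical helper name:

```python
from xml.etree import ElementTree as ET

# Hypothetical city record; the year attribute is an assumption about the data.
CITY = ET.fromstring("""
<city>
  <name>Alpha</name>
  <population year="2011">1200</population>
  <population year="1991">800</population>
  <population year="2001">1000</population>
</city>
""")

def latest_population(city):
    """Return the population with the highest year attribute, or None if absent."""
    entries = city.findall('population')
    if not entries:
        return None
    # Entries without a year are treated as year 0, so any dated entry wins.
    newest = max(entries, key=lambda p: int(p.attrib.get('year', 0)))
    return float(newest.text)

print(latest_population(CITY))
```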

In [16]:
# Answer to Exercise 3 (Find 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries))

import pandas as pd
from xml.etree import ElementTree as ET

document_tree = ET.parse( './data/mondial_database.xml' )

# Collect one [country, ethnicity, population] row per ethnic group entry.
rows = []

# Strategy: This xml file lists population by country, not by ethnic group.
#           However, it does list the ethnic groups within each country and
#           the percentage of the population that each one makes up.
#           So we compute each ethnic population from those two values.

# Iterate through the xml tree and get each country name and its ethnic populations.
for country in document_tree.iterfind('country'):
    countryname = country.find('name').text
    populations = country.findall('population')
    if not populations:
        # Without a population figure there is nothing to apportion; skipping
        # avoids reusing the previous country's value.
        continue
    # Use the last listed population figure for the country.
    countrypop = float(populations[-1].text)

    for ethnicgrp in country.iterfind('ethnicgroup'):
        ethnicgrpname = ethnicgrp.text

        # Calculate the ethnic population.
        # Formula = country population * ethnic group percentage / 100
        ethnicgrppop = float(round(float(ethnicgrp.attrib['percentage']) * int(countrypop) * 0.01))

        rows.append([countryname, ethnicgrpname, ethnicgrppop])

pop_df = pd.DataFrame(rows, columns = ["Country","Ethnicity","Population"])

# Group the data by ethnic group irrespective of the country, summing only the
# Population column, then display the ten ethnic groups with the largest totals.
pop_df.groupby('Ethnicity')[['Population']].sum().sort_values(by = 'Population', ascending=False).head(10)


Out[16]:
               Population
Ethnicity
Han Chinese  1.245059e+09
Indo-Aryan   8.718156e+08
European     4.948722e+08
African      3.183251e+08
Dravidian    3.027137e+08
Mestizo      1.577344e+08
Bengali      1.467769e+08
Russian      1.318570e+08
Japanese     1.265342e+08
Malay        1.219936e+08
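
Item 4 of the exercise is not answered in this notebook. One possible starting point is sketched below on a made-up sample. The tag names (`river`, `length`), the space-separated `country` attribute, and the `car_code` attribute are assumptions about the file layout that should be checked against the real data, and `largest` is a hypothetical helper name.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; tag and attribute names are assumptions about the file layout.
SAMPLE = ET.fromstring("""
<mondial>
  <country car_code="AA"><name>Aland</name></country>
  <country car_code="BB"><name>Borduria</name></country>
  <river country="AA BB"><name>Long</name><length>900</length></river>
  <river country="AA"><name>Short</name><length>120</length></river>
  <river country="BB"><name>Unmeasured</name></river>
</mondial>
""")

def largest(root, tag, measure):
    """Return (name, [country names]) for the `tag` element with the largest `measure`."""
    # Map country codes to names so the answer can report readable countries.
    names = {c.attrib['car_code']: c.find('name').text for c in root.iter('country')}
    # Skip elements that lack the measured child.
    candidates = [e for e in root.iter(tag) if e.find(measure) is not None]
    best = max(candidates, key=lambda e: float(e.find(measure).text))
    # Fall back to the raw code when it has no matching country element.
    codes = best.attrib.get('country', '').split()
    return best.find('name').text, [names.get(code, code) for code in codes]

print(largest(SAMPLE, 'river', 'length'))
```

If the real file uses analogous tags for lakes and airports, the same helper could be called with a different tag and measure, for example an area for lakes and an elevation for airports.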
