XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [218]:
document = ET.parse( './data/mondial_database.xml' )
import pandas as pd

In [253]:
root = document.getroot()

Not all the entries have an infant mortality rate element. So we need to make sure loop loops for the element named 'infant_mortality'.


In [252]:
#get infant mortality of each country, add to heap if under capacity 
#otherwise check if new value is greater than smallest.
inf_mort = dict()
for element in document.iterfind('country'):
    for subelement in element.iterfind('infant_mortality'):
        inf_mort[element.find('name').text] = float(subelement.text)

In [343]:
infmort_df = pd.DataFrame.from_dict(inf_mort, orient ='index')
infmort_df.columns = ['infant_mortality']
infmort_df.index.names = ['country']
infmort_df.sort_values(by = 'infant_mortality', ascending = True).head(10)


Out[343]:
infant_mortality
country
Monaco 1.81
Japan 2.13
Norway 2.48
Bermuda 2.48
Singapore 2.53
Sweden 2.60
Czech Republic 2.63
Hong Kong 2.73
Macao 3.13
Iceland 3.15

Thus we have the countries with the ten lowest reported infant mortality rate element values (in order). To get the top ten populations by city, we have to make sure we get all cities, not just the elements directly under a country, and to keep track of the various population subelements, which all have the same name.


In [342]:
current_pop = 0
current_pop_year = 0
citypop = dict()

for country in document.iterfind('country'):
    for city in country.getiterator('city'):
        
        for subelement in city.iterfind('population'):
            #compare attributes of identically named subelements. Use this to hold onto the latest pop estimate. 
            if int(subelement.attrib['year']) > current_pop_year:
                current_pop = int(subelement.text)
                current_pop_year = int(subelement.attrib['year'])
                
        citypop[city.findtext('name')] = current_pop
        current_pop = 0
        current_pop_year = 0
  
    
citypop_df = pd.DataFrame.from_dict(citypop, orient ='index')
citypop_df.columns = ['population']
citypop_df.index.names = ['city']
citypop_df.sort_values(by = 'population', ascending = False).head(10)


Out[342]:
population
city
Shanghai 22315474
Istanbul 13710512
Mumbai 12442373
Moskva 11979529
Beijing 11716620
São Paulo 11152344
Tianjin 11090314
Guangzhou 11071424
Delhi 11034555
Shenzhen 10358381

Top ten cities in the world by population as reported by the database.


In [341]:
ethn = dict()
current_pop = 0
current_pop_year = 0
for country in document.iterfind('country'):
    for population in country.getiterator('population'):
        #compare attributes of identically named subelements. Use this to hold onto the latest pop estimate.
        #Probably faster way to do this if sure of tree structure (i.e. last element is always latest) 
        if int(population.attrib['year']) > current_pop_year:
                current_pop = int(population.text)
                current_pop_year = int(population.attrib['year'])
    for ethn_gp in country.iterfind('ethnicgroup'):
        if ethn_gp.text in ethn:
            ethn[ethn_gp.text] += current_pop*float(ethn_gp.attrib['percentage'])/100
        else:
            ethn[ethn_gp.text] = current_pop*float(ethn_gp.attrib['percentage'])/100
    current_pop = 0
    current_pop_year = 0

ethnic_df = pd.DataFrame.from_dict(ethn, orient ='index')
ethnic_df.columns = ['population']
ethnic_df.index.names = ['ethnic_group']
ethnic_df.groupby(ethnic_df.index).sum().sort_values(by = 'population', ascending = False).head(10)


Out[341]:
population
ethnic_group
Han Chinese 1.245059e+09
Indo-Aryan 8.718156e+08
European 4.948722e+08
African 3.183251e+08
Dravidian 3.027137e+08
Mestizo 1.577344e+08
Bengali 1.467769e+08
Russian 1.304840e+08
Japanese 1.265342e+08
Malay 1.219936e+08

Largest ethnic groups by population, based on the latest estimates from each country. Finally, we look for the longest river, largest lake, and highest airport. We can take advantage of the intelligent attributes included in the database already. Playing around with the river elements, we see that while the long rivers may have multiple 'located' subelements, for each country, the river element itself has a country attribute which lists the country codes all together. This simplifies the problem. We assume there are no ties... simply because it's a bit quicker and because the coincidence seems a bit ridiculous.


In [382]:
river_ctry=None
river_name= None
lake_ctry= None
lake_name= None
airport_ctry= None
airport_name = None
river_length= 0
lake_area = 0
airport_elv = 0

for river in document.iterfind('river'):
    for length in river.iterfind('length'):
        if river_length < float(length.text):
            river_length=float(length.text)
            river_ctry= river.attrib['country']
            river_name = river.findtext('name')


for lake in document.iterfind('lake'):
    for area in lake.iterfind('area'):
        if lake_area < float(area.text):
            lake_area=float(area.text)
            lake_ctry= lake.attrib['country']
            lake_name = lake.findtext('name')

for airport in document.iterfind('airport'):
    for elv in airport.iterfind('elevation'):
        #apprarently there is an airport in the database with an elevation tag an no entry. 
        #Probably should have been doing this previously
        if (elv.text is not None) and (airport_elv < float(elv.text)):
            airport_elv=float(elv.text)
            airport_ctry= airport.attrib['country']
            airport_name = airport.findtext('name') 
            
data = [[lake_name,river_name,airport_name],[lake_ctry,river_ctry,airport_ctry],[lake_area,river_length,airport_elv]]
df = pd.DataFrame(data, columns = ['Largest Lake','Longest River','Highest Airport'],index=['Name','Location (Country Code)','Metric Value'])
df


Out[382]:
Largest Lake Longest River Highest Airport
Name Caspian Sea Amazonas El Alto Intl
Location (Country Code) R AZ KAZ IR TM CO BR PE BOL
Metric Value 386400 6448 4063

In [ ]: