XML exercise

Using data in 'data/mondial_database.xml', the examples above, and refering to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

10 countries with the lowest infant mortality rates
10 cities with the largest population
10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [218]:

    
document = ET.parse( './data/mondial_database.xml' )
import pandas as pd



In [253]:

    
root = document.getroot()

Not all the entries have an infant mortality rate element. So we need to make sure loop loops for the element named 'infant_mortality'.



In [252]:

    
#get infant mortality of each country, add to heap if under capacity 
#otherwise check if new value is greater than smallest.
inf_mort = dict()
for element in document.iterfind('country'):
    for subelement in element.iterfind('infant_mortality'):
        inf_mort[element.find('name').text] = float(subelement.text)



In [343]:

    
infmort_df = pd.DataFrame.from_dict(inf_mort, orient ='index')
infmort_df.columns = ['infant_mortality']
infmort_df.index.names = ['country']
infmort_df.sort_values(by = 'infant_mortality', ascending = True).head(10)









    Out[343]:






  
    
      
      infant_mortality
    
    
      country
      
    
  
  
    
      Monaco
      1.81
    
    
      Japan
      2.13
    
    
      Norway
      2.48
    
    
      Bermuda
      2.48
    
    
      Singapore
      2.53
    
    
      Sweden
      2.60
    
    
      Czech Republic
      2.63
    
    
      Hong Kong
      2.73
    
    
      Macao
      3.13
    
    
      Iceland
      3.15

Thus we have the countries with the ten lowest reported infant mortality rate element values (in order). To get the top ten populations by city, we have to make sure we get all cities, not just the elements directly under a country, and to keep track of the various population subelements, which all have the same name.



In [342]:

    
current_pop = 0
current_pop_year = 0
citypop = dict()

for country in document.iterfind('country'):
    for city in country.getiterator('city'):
        
        for subelement in city.iterfind('population'):
            #compare attributes of identically named subelements. Use this to hold onto the latest pop estimate. 
            if int(subelement.attrib['year']) > current_pop_year:
                current_pop = int(subelement.text)
                current_pop_year = int(subelement.attrib['year'])
                
        citypop[city.findtext('name')] = current_pop
        current_pop = 0
        current_pop_year = 0
  
    
citypop_df = pd.DataFrame.from_dict(citypop, orient ='index')
citypop_df.columns = ['population']
citypop_df.index.names = ['city']
citypop_df.sort_values(by = 'population', ascending = False).head(10)

Top ten cities in the world by population as reported by the database.



In [341]:

    
ethn = dict()
current_pop = 0
current_pop_year = 0
for country in document.iterfind('country'):
    for population in country.getiterator('population'):
        #compare attributes of identically named subelements. Use this to hold onto the latest pop estimate.
        #Probably faster way to do this if sure of tree structure (i.e. last element is always latest) 
        if int(population.attrib['year']) > current_pop_year:
                current_pop = int(population.text)
                current_pop_year = int(population.attrib['year'])
    for ethn_gp in country.iterfind('ethnicgroup'):
        if ethn_gp.text in ethn:
            ethn[ethn_gp.text] += current_pop*float(ethn_gp.attrib['percentage'])/100
        else:
            ethn[ethn_gp.text] = current_pop*float(ethn_gp.attrib['percentage'])/100
    current_pop = 0
    current_pop_year = 0

ethnic_df = pd.DataFrame.from_dict(ethn, orient ='index')
ethnic_df.columns = ['population']
ethnic_df.index.names = ['ethnic_group']
ethnic_df.groupby(ethnic_df.index).sum().sort_values(by = 'population', ascending = False).head(10)









    Out[341]:






  
    
      
      population
    
    
      ethnic_group
      
    
  
  
    
      Han Chinese
      1.245059e+09
    
    
      Indo-Aryan
      8.718156e+08
    
    
      European
      4.948722e+08
    
    
      African
      3.183251e+08
    
    
      Dravidian
      3.027137e+08
    
    
      Mestizo
      1.577344e+08
    
    
      Bengali
      1.467769e+08
    
    
      Russian
      1.304840e+08
    
    
      Japanese
      1.265342e+08
    
    
      Malay
      1.219936e+08

Largest ethnic groups by population, based on the latest estimates from each country. Finally, we look for the longest river, largest lake, and highest airport. We can take advantage of the intelligent attributes included in the database already. Playing around with the river elements, we see that while the long rivers may have multiple 'located' subelements, for each country, the river element itself has a country attribute which lists the country codes all together. This simplifies the problem. We assume there are no ties... simply because it's a bit quicker and because the coincidence seems a bit ridiculous.



In [382]:

    
river_ctry=None
river_name= None
lake_ctry= None
lake_name= None
airport_ctry= None
airport_name = None
river_length= 0
lake_area = 0
airport_elv = 0

for river in document.iterfind('river'):
    for length in river.iterfind('length'):
        if river_length < float(length.text):
            river_length=float(length.text)
            river_ctry= river.attrib['country']
            river_name = river.findtext('name')


for lake in document.iterfind('lake'):
    for area in lake.iterfind('area'):
        if lake_area < float(area.text):
            lake_area=float(area.text)
            lake_ctry= lake.attrib['country']
            lake_name = lake.findtext('name')

for airport in document.iterfind('airport'):
    for elv in airport.iterfind('elevation'):
        #apprarently there is an airport in the database with an elevation tag an no entry. 
        #Probably should have been doing this previously
        if (elv.text is not None) and (airport_elv < float(elv.text)):
            airport_elv=float(elv.text)
            airport_ctry= airport.attrib['country']
            airport_name = airport.findtext('name') 
            
data = [[lake_name,river_name,airport_name],[lake_ctry,river_ctry,airport_ctry],[lake_area,river_length,airport_elv]]
df = pd.DataFrame(data, columns = ['Largest Lake','Longest River','Highest Airport'],index=['Name','Location (Country Code)','Metric Value'])
df









    Out[382]:






  
    
      
      Largest Lake
      Longest River
      Highest Airport
    
  
  
    
      Name
      Caspian Sea
      Amazonas
      El Alto Intl
    
    
      Location (Country Code)
      R AZ KAZ IR TM
      CO BR PE
      BOL
    
    
      Metric Value
      386400
      6448
      4063



In [ ]:

	population
city
Shanghai	22315474
Istanbul	13710512
Mumbai	12442373
Moskva	11979529
Beijing	11716620
São Paulo	11152344
Tianjin	11090314
Guangzhou	11071424
Delhi	11034555
Shenzhen	10358381

	infant_mortality
country
Monaco	1.81
Japan	2.13
Norway	2.48
Bermuda	2.48
Singapore	2.53
Sweden	2.60
Czech Republic	2.63
Hong Kong	2.73
Macao	3.13
Iceland	3.15

	population
ethnic_group
Han Chinese	1.245059e+09
Indo-Aryan	8.718156e+08
European	4.948722e+08
African	3.183251e+08
Dravidian	3.027137e+08
Mestizo	1.577344e+08
Bengali	1.467769e+08
Russian	1.304840e+08
Japanese	1.265342e+08
Malay	1.219936e+08

	Largest Lake	Longest River	Highest Airport
Name	Caspian Sea	Amazonas	El Alto Intl
Location (Country Code)	R AZ KAZ IR TM	CO BR PE	BOL
Metric Value	386400	6448	4063