Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find
In [218]:
import xml.etree.ElementTree as ET
import pandas as pd

document = ET.parse('./data/mondial_database.xml')
In [253]:
root = document.getroot()
Not all country entries have an infant mortality element, so the inner loop iterates only over subelements named 'infant_mortality'; a country without one is simply skipped.
In [252]:
# Collect the infant mortality rate of each country that reports one.
# Countries without an 'infant_mortality' element are skipped by the inner loop.
inf_mort = dict()
for element in document.iterfind('country'):
    for subelement in element.iterfind('infant_mortality'):
        inf_mort[element.find('name').text] = float(subelement.text)
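The effect of a missing element can be seen on a small hand-written fragment. This is only a sketch: the XML string, the country names, and the rate are made up for illustration and are not taken from mondial_database.xml. It uses find() with an explicit None check, which is an alternative to the inner iterfind() loop above.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-country fragment; only the first country reports a rate.
SAMPLE = """
<mondial>
  <country><name>Aland</name><infant_mortality>3.5</infant_mortality></country>
  <country><name>Borduria</name></country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
rates = {}
for country in root.iterfind('country'):
    # find() returns None when the child is absent, so test before reading .text.
    node = country.find('infant_mortality')
    if node is not None:
        rates[country.findtext('name')] = float(node.text)

print(rates)
```

The country without an 'infant_mortality' child never reaches the float() call, so it does not appear in the result.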
In [343]:
infmort_df = pd.DataFrame.from_dict(inf_mort, orient ='index')
infmort_df.columns = ['infant_mortality']
infmort_df.index.names = ['country']
infmort_df.sort_values(by = 'infant_mortality', ascending = True).head(10)
Out[343]:
This gives the ten countries with the lowest reported infant mortality rates, in ascending order. To find the ten most populous cities, we have to collect every city, not just the city elements directly under a country, and we have to choose among each city's 'population' subelements, which all share the same tag name and are told apart by their 'year' attribute.
In [342]:
current_pop = 0
current_pop_year = 0
citypop = dict()
for country in document.iterfind('country'):
    # iter() walks all descendants, so nested cities are included.
    # (It replaces getiterator(), which is deprecated.)
    for city in country.iter('city'):
        for subelement in city.iterfind('population'):
            # Compare the 'year' attribute of the identically named subelements
            # to hold onto the latest population estimate.
            if int(subelement.attrib['year']) > current_pop_year:
                current_pop = int(subelement.text)
                current_pop_year = int(subelement.attrib['year'])
        # Note: cities that share a name overwrite each other in this dict.
        citypop[city.findtext('name')] = current_pop
        # Reset for the next city.
        current_pop = 0
        current_pop_year = 0
citypop_df = pd.DataFrame.from_dict(citypop, orient='index')
citypop_df.columns = ['population']
citypop_df.index.names = ['city']
citypop_df.sort_values(by='population', ascending=False).head(10)
Out[342]:
Top ten cities in the world by population as reported by the database.
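The "latest estimate" selection in the cell above can also be written with max() and a key function. The sketch below runs on a hand-written fragment. The city names, years, and populations are invented, the helper name latest_population is hypothetical, and the nesting of a city under a 'province' element is an assumption about the tree layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one city directly under a country, one nested deeper.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city>
      <name>Porto Alpha</name>
      <population year="2001">1000</population>
      <population year="2011">1500</population>
    </city>
    <province>
      <city>
        <name>Beta Town</name>
        <population year="2011">700</population>
        <population year="1991">900</population>
      </city>
    </province>
  </country>
</mondial>
"""


def latest_population(city):
    """Return the population with the highest 'year' attribute, or None."""
    estimates = city.findall('population')
    if not estimates:
        return None
    newest = max(estimates, key=lambda p: int(p.attrib['year']))
    return int(newest.text)


root = ET.fromstring(SAMPLE)
# iter() visits every descendant, so the nested city is found too.
latest = {c.findtext('name'): latest_population(c) for c in root.iter('city')}
print(latest)
```

Selecting by year rather than by position means the result does not depend on the order in which the estimates appear in the file.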
In [341]:
ethn = dict()
current_pop = 0
current_pop_year = 0
for country in document.iterfind('country'):
    # Look only at the country's own 'population' children. Iterating over all
    # descendants would also pick up province and city populations.
    for population in country.iterfind('population'):
        # Compare the 'year' attribute of the identically named subelements
        # to hold onto the latest population estimate.
        # There is probably a faster way if the tree structure is guaranteed
        # (i.e. the last element is always the latest).
        if int(population.attrib['year']) > current_pop_year:
            current_pop = int(population.text)
            current_pop_year = int(population.attrib['year'])
    for ethn_gp in country.iterfind('ethnicgroup'):
        if ethn_gp.text in ethn:
            ethn[ethn_gp.text] += current_pop*float(ethn_gp.attrib['percentage'])/100
        else:
            ethn[ethn_gp.text] = current_pop*float(ethn_gp.attrib['percentage'])/100
    # Reset for the next country.
    current_pop = 0
    current_pop_year = 0
ethnic_df = pd.DataFrame.from_dict(ethn, orient='index')
ethnic_df.columns = ['population']
ethnic_df.index.names = ['ethnic_group']
ethnic_df.groupby(ethnic_df.index).sum().sort_values(by='population', ascending=False).head(10)
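The weighting used in the cell above (population times percentage, summed per group across countries) can be checked by hand on a tiny example. The figures and group labels below are invented for illustration; a defaultdict stands in for the explicit if/else on dictionary membership.

```python
from collections import defaultdict

# Hypothetical (latest population, {group: percentage}) pairs, one per country.
countries = [
    (1000, {'A': 60.0, 'B': 40.0}),
    (500, {'A': 10.0, 'C': 90.0}),
]

totals = defaultdict(float)
for population, groups in countries:
    for group, percentage in groups.items():
        # Each group's head count in this country is its share of the population.
        totals[group] += population * percentage / 100

# Rank groups from largest to smallest estimated head count.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Group 'A' appears in both countries, so its total is 600 + 50 = 650, which places it ahead of 'C' (450) and 'B' (400).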
Out[341]:
These are the largest ethnic groups by population, based on the latest population estimate from each country.
Finally, we look for the longest river, the largest lake, and the highest airport. We can take advantage of attributes that are already present in the database. Inspecting the river elements shows that although a long river may have several 'located' subelements, one for each country, the river element itself has a 'country' attribute that lists all of its country codes together. This simplifies the problem. We assume there are no ties, both because it is a bit quicker and because an exact tie seems very unlikely.
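As a sketch of how the 'country' attribute can be used, the fragment below picks the longest river and splits its attribute into individual codes. The XML is hand-written, the names, codes, and lengths are invented, and the whitespace-separated format of the attribute is an assumption about the database.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one multi-country river, one short river, one with no length.
SAMPLE = """
<mondial>
  <river country="AA BB CC">
    <name>Long River</name>
    <located country="AA"/>
    <located country="BB"/>
    <located country="CC"/>
    <length>2500</length>
  </river>
  <river country="AA">
    <name>Short River</name>
    <length>300</length>
  </river>
  <river country="BB"><name>Unmeasured River</name></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)
# Keep only rivers that actually report a length.
measured = [r for r in root.iterfind('river') if r.findtext('length')]
longest = max(measured, key=lambda r: float(r.findtext('length')))
# Assumed format: country codes separated by whitespace in a single attribute.
codes = longest.attrib['country'].split()
print(longest.findtext('name'))
print(codes)
```

Reading the codes from the river element avoids walking the 'located' subelements at all.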
In [382]:
river_ctry = None
river_name = None
lake_ctry = None
lake_name = None
airport_ctry = None
airport_name = None
river_length = 0
lake_area = 0
airport_elv = 0
for river in document.iterfind('river'):
    for length in river.iterfind('length'):
        if river_length < float(length.text):
            river_length = float(length.text)
            river_ctry = river.attrib['country']
            river_name = river.findtext('name')
for lake in document.iterfind('lake'):
    for area in lake.iterfind('area'):
        if lake_area < float(area.text):
            lake_area = float(area.text)
            lake_ctry = lake.attrib['country']
            lake_name = lake.findtext('name')
for airport in document.iterfind('airport'):
    for elv in airport.iterfind('elevation'):
        # Apparently there is an airport in the database with an elevation tag and no entry.
        # The earlier loops should probably have guarded against this as well.
        if (elv.text is not None) and (airport_elv < float(elv.text)):
            airport_elv = float(elv.text)
            airport_ctry = airport.attrib['country']
            airport_name = airport.findtext('name')
data = [[lake_name, river_name, airport_name],
        [lake_ctry, river_ctry, airport_ctry],
        [lake_area, river_length, airport_elv]]
df = pd.DataFrame(data,
                  columns=['Largest Lake', 'Longest River', 'Highest Airport'],
                  index=['Name', 'Location (Country Code)', 'Metric Value'])
df
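The guard against an empty 'elevation' tag noted in the comments above could be factored into a small helper and reused for lengths and areas as well. This is a minimal sketch: the helper name text_as_float is hypothetical, and the airport names and values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a normal entry, an empty tag, and a missing tag.
SAMPLE = """
<mondial>
  <airport country="AA"><name>High Field</name><elevation>4000</elevation></airport>
  <airport country="BB"><name>Blank Field</name><elevation/></airport>
  <airport country="CC"><name>No Tag Field</name></airport>
</mondial>
"""


def text_as_float(parent, tag):
    """Return the child's text as a float, or None if the child or its text is missing."""
    # findtext() gives None for a missing child and '' for an empty one.
    text = parent.findtext(tag)
    if text is None or not text.strip():
        return None
    return float(text)


root = ET.fromstring(SAMPLE)
elevations = {a.findtext('name'): text_as_float(a, 'elevation')
              for a in root.iterfind('airport')}
print(elevations)
```

Callers can then skip None values before comparing, instead of repeating the check inside every loop.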
Out[382]: