XML example and exercise

study examples of accessing nodes in XML tree structure
work on exercise to be completed and submitted

reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
data source: http://www.dbis.informatik.uni-goettingen.de/Mondial



In [3]:

    
from xml.etree import ElementTree as ET

XML example

for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html



In [2]:

    
document_tree = ET.parse( './data/mondial_database_less.xml' )



In [3]:

    
document_tree.getroot()[0].attrib









    Out[3]:





{'area': '28750',
 'capital': 'cty-Albania-Tirane',
 'car_code': 'AL',
 'memberships': 'org-BSEC org-CEI org-CD org-SELEC org-CE org-EAPC org-EBRD org-EITI org-FAO org-IPU org-IAEA org-IBRD org-ICC org-ICAO org-ICCt org-Interpol org-IDA org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-ITUC org-IDB org-MIGA org-NATO org-OSCE org-OPCW org-OAS org-OIC org-PCA org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WFTU org-WHO org-WIPO org-WMO org-UNWTO org-WTO'}



In [4]:

    
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)









    



Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra



In [5]:

    
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() replaces the deprecated getiterator() and walks all descendant cities
    city_names = [subelement.find('name').text for subelement in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))









    



* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
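The loop above collects every city below a country, including cities nested inside other elements. A minimal sketch of the difference between searching direct children and searching the whole subtree follows. The XML sample and the names `SAMPLE`, `root_demo`, `direct_cities` and `all_cities` are made up for illustration and are not taken from the Mondial file.

```python
from xml.etree import ElementTree as ET

# Made-up sample: one city directly under the country, one nested in a province.
SAMPLE = """
<mondial>
  <country car_code="XX">
    <name>Exampleland</name>
    <city><name>Capital City</name></city>
    <province>
      <name>North</name>
      <city><name>Northtown</name></city>
    </province>
  </country>
</mondial>
"""

root_demo = ET.fromstring(SAMPLE)
country = root_demo.find('country')

# findall('city') only matches direct children of the country element.
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter('city') walks the whole subtree, so nested cities are included.
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)  # ['Capital City']
print(all_cities)     # ['Capital City', 'Northtown']
```

Which of the two is appropriate depends on whether nested cities should count for the question being asked.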



In [21]:

    
document_tree.getroot()[0].find('population').attrib









    Out[21]:





{'measured': 'est.', 'year': '1950'}



In [24]:

    
for i in document_tree.getroot()[0].findall('population'):
    print(str(i.text) +  " year: "+ str(i.get('year')))









    



1214489 year: 1950
1618829 year: 1960
2138966 year: 1970
2734776 year: 1980
3446882 year: 1990
3249136 year: 1997
3304948 year: 2000
3069275 year: 2001
2800138 year: 2011



In [16]:

    
for child in document_tree.getroot():
    print(child.find('name').text + ' infant : '+child.find('infant_mortality').text)









    



Albania infant : 13.19
Greece infant : 4.78
Macedonia infant : 7.9
Serbia infant : 6.16






    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-73bf8ab058a5> in <module>()
      1 for child in document_tree.getroot():
----> 2     print(child.find('name').text + ' infant : '+child.find('infant_mortality').text)

AttributeError: 'NoneType' object has no attribute 'text'
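The traceback above occurs because find() returns None when a country has no infant_mortality child, and None has no .text attribute. One possible way to guard against this is findtext(), which returns a default value instead. The sample XML and the names `SAMPLE`, `root_demo` and `rows` are made up for this sketch.

```python
from xml.etree import ElementTree as ET

# Made-up sample: the second country has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>Aland</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Bland</name></country>
</mondial>
"""

root_demo = ET.fromstring(SAMPLE)

rows = []
for country in root_demo.findall('country'):
    name = country.findtext('name')
    # findtext() returns the default (None here) when the child is missing,
    # so there is no attribute access on None.
    rate = country.findtext('infant_mortality', default=None)
    rows.append((name, float(rate) if rate is not None else None))

print(rows)  # [('Aland', 4.5), ('Bland', None)]
```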



In [6]:

    
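# look for infant_mortality inside each child element of the first country
# (this searches one level too deep, so every lookup returns None)
for child in document_tree.getroot()[0]:
    print(child.find('infant_mortality'))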









    



None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None



In [37]:

    
testroot=document_tree.getroot()



In [28]:

    
for child in testroot.findall(".//infant_mortality/.."):
    print(child.find('name').text + ' infant : '+str(float(child.find('infant_mortality').text)))









    



Albania infant : 13.19
Greece infant : 4.78
Macedonia infant : 7.9
Serbia infant : 6.16
Andorra infant : 3.69



In [69]:

    
for child in testroot.findall("./country"):
    print([(i.text,i.get('year')) for i in child.findall("population[@year='2011']")])









    



[('2800138', '2011')]
[('10816286', '2011')]
[('2059794', '2011')]
[('7120666', '2011')]
[('620029', '2011')]
[('1733872', '2011')]
[('78115', '2011')]
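The two cells above rely on the limited XPath support in ElementTree: a trailing `..` selects the parent of a match, and `[@attr='value']` filters on an attribute. A small sketch on made-up data (the sample XML and the names `SAMPLE`, `root_demo`, `with_rate` and `pop_2011` are assumptions for illustration):

```python
from xml.etree import ElementTree as ET

# Made-up sample with two population entries for one country.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2001">100</population>
    <population year="2011">120</population>
    <infant_mortality>4.5</infant_mortality>
  </country>
  <country>
    <name>Bland</name>
    <population year="2011">50</population>
  </country>
</mondial>
"""

root_demo = ET.fromstring(SAMPLE)

# './/infant_mortality/..' selects every element that has an infant_mortality child.
with_rate = [c.findtext('name') for c in root_demo.findall('.//infant_mortality/..')]

# "population[@year='2011']" keeps only population children whose year attribute is '2011'.
pop_2011 = {c.findtext('name'): int(c.find("population[@year='2011']").text)
            for c in root_demo.findall('country')}

print(with_rate)  # ['Aland']
print(pop_2011)
```

Selecting the parent first means countries without the element never enter the loop, which avoids the None check entirely.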



In [74]:

    
for child in testroot[0].findall('./ethnicgroup'):
    print(child.text)









    



Albanian
Greek



In [162]:

    
(ET.tostring(root))[:2717100].rfind('river')









    Out[162]:





2716891



In [163]:

    
(ET.tostring(root))[2716850:2716950]









    Out[163]:





'ntry="H" id="island-MargitSziget" river="river-Donau">\n      <name>Margit Sziget</name>\n      <locat'



In [183]:

    
for i in root.findall('./river')[:10]:
    print(i.tag + ' ' + i.find('name').text)









    



river Thjorsa
river Joekulsa a Fjoellum
river Glomma
river Lagen
river Goetaaelv
river Klaraelv
river Umeaelv
river Dalaelv
river Vaesterdalaelv
river Oesterdalaelv



In [256]:

    
for i in root.findall('./country/airport')[:10]:
    print(list(i))



In [248]:

    
for i in root.findall('.//airport/..')[:10]:
    print(i.tag)









    



mondial



In [257]:

    
root.find('.//gmtOffset/..').tag









    Out[257]:





'airport'



In [271]:

    
float(root.find('.//airport/latitude').text)









    Out[271]:





34.210017



In [204]:

    
root.find('./river').find('source').attrib









    Out[204]:





{'country': 'IS'}



In [61]:

    
ET.tostring(testroot[6].find('./ethnicgroup'))









    Out[61]:





'<ethnicgroup percentage="43">Spanish</ethnicgroup>\n      '



In [107]:

    
p = []
for child in testroot.findall('.//ethnicgroup/..//population/..'):
    p += [[i.text,float(i.get('percentage')),child.find('name').text,float(child.find("population[@year='2011']").text)] for i in child.findall('ethnicgroup')]



In [108]:

    
p









    Out[108]:





[['Albanian', 95.0, 'Albania', 2800138.0],
 ['Greek', 3.0, 'Albania', 2800138.0],
 ['Greek', 93.0, 'Greece', 10816286.0],
 ['Macedonian', 64.2, 'Macedonia', 2059794.0],
 ['Albanian', 25.2, 'Macedonia', 2059794.0],
 ['Turkish', 3.9, 'Macedonia', 2059794.0],
 ['Gypsy', 2.7, 'Macedonia', 2059794.0],
 ['Serb', 1.8, 'Macedonia', 2059794.0],
 ['Serb', 82.9, 'Serbia', 7120666.0],
 ['Montenegrin', 0.9, 'Serbia', 7120666.0],
 ['Hungarian', 3.9, 'Serbia', 7120666.0],
 ['Roma', 1.4, 'Serbia', 7120666.0],
 ['Bosniak', 1.8, 'Serbia', 7120666.0],
 ['Croat', 1.1, 'Serbia', 7120666.0],
 ['Montenegrin', 43.0, 'Montenegro', 620029.0],
 ['Serb', 32.0, 'Montenegro', 620029.0],
 ['Bosniak', 8.0, 'Montenegro', 620029.0],
 ['Albanian', 5.0, 'Montenegro', 620029.0],
 ['Albanian', 92.0, 'Kosovo', 1733872.0],
 ['Serbian', 5.0, 'Kosovo', 1733872.0],
 ['Spanish', 43.0, 'Andorra', 78115.0],
 ['Andorran', 33.0, 'Andorra', 78115.0],
 ['Portuguese', 11.0, 'Andorra', 78115.0],
 ['French', 2.0, 'Andorra', 78115.0],
 ['African', 5.0, 'Andorra', 78115.0]]



In [93]:

    
import pandas as pd
testeth=pd.DataFrame(p,columns=['ethnicgroup','percentage','country','cpop'])
testeth['epop']=testeth['percentage']*testeth['cpop']/100.0
testeth[['ethnicgroup','epop']].groupby('ethnicgroup').sum().sort_values('epop',ascending=False)









    Out[93]:






  
    
      
ethnicgroup          epop
Greek        10143150.120
Serb          6138517.686
Albanian      4805362.878
Macedonian    1322387.748
Montenegrin    330698.464
Hungarian      277705.974
Bosniak        177774.308
Roma            99689.324
Serbian         86693.600
Turkish         80331.966
Croat           78327.326
Gypsy           55614.438
Spanish         33589.450
Andorran        25777.950
Portuguese       8592.650
African          3905.750
French           1562.300
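The epop column is each group's percentage applied to the country population, summed per group. As a hedged sketch, the same arithmetic can be written without pandas. The three rows are copied from the list p shown earlier, and the names `rows`, `totals` and `ranked` are introduced here only for illustration.

```python
from collections import defaultdict

# Rows copied from the list p above: [group, percentage, country, country population]
rows = [
    ['Albanian', 95.0, 'Albania', 2800138.0],
    ['Greek', 3.0, 'Albania', 2800138.0],
    ['Greek', 93.0, 'Greece', 10816286.0],
]

# Sum percentage * population / 100 per ethnic group.
totals = defaultdict(float)
for group, percentage, _country, country_pop in rows:
    totals[group] += percentage * country_pop / 100.0

# Rank groups by estimated population, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for group, epop in ranked:
    print('%-10s %.2f' % (group, epop))
```

With only these three rows the Greek total matches the table (about 10143150.12), while the Albanian total is lower than in the table because the rows for the other countries are left out.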

XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

10 countries with the lowest infant mortality rates
10 cities with the largest population
10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
name and country of a) longest river, b) largest lake and c) airport at highest elevation



In [4]:

    
import pandas as pd



In [5]:

    
document = ET.parse( './data/mondial_database.xml' )



In [6]:

    
root=document.getroot()



In [7]:

    
# countries that have an infant_mortality child, sorted ascending by rate
infant = pd.DataFrame([[child.find('name').text, float(child.find('infant_mortality').text)]
                       for child in root.findall(".//infant_mortality/..")],
                      columns=['country','infant mortality'])
infant.sort_values('infant mortality')[:10]









    Out[7]:






  
    
      
            country  infant mortality
36           Monaco              1.81
90            Japan              2.13
109         Bermuda              2.48
34           Norway              2.48
98        Singapore              2.53
35           Sweden              2.60
8    Czech Republic              2.63
72        Hong Kong              2.73
73            Macao              3.13
39          Iceland              3.15



In [8]:

    
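# only cities directly under a country; find() takes each city's first population entry
cities = pd.DataFrame([[child.find('name').text, int(child.find('population').text)]
                       for child in root.findall("./country/city/population/..")],
                      columns=['city','pop'])
cities.sort_values('pop',ascending=False)[:10]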









    Out[8]:






  
    
      
                city       pop
165            Seoul  10229262
123        Hong Kong   7055071
154       Al Qahirah   6053000
75           Bangkok   5876000
87       Ho Chi Minh   3924435
166            Busan   3813814
205       New Taipei   3722082
84             Hanoi   3056146
153  Al Iskandariyah   2917000
204           Taipei   2626138



In [9]:

    
p = []
for child in root.findall('.//ethnicgroup/..//population/..'):
    p += [[i.text,float(i.get('percentage')),child.find('name').text,float(child.findall("population")[-1].text)] for i in child.findall('ethnicgroup')]




In [12]:

    
eth=pd.DataFrame(p,columns=['ethnic group','percentage','country','cpop'])
eth['epop']=eth['percentage']*eth['cpop']/100.0
eth[['ethnic group','epop']].groupby('ethnic group').sum().sort_values('epop',ascending=False)[:10]









    Out[12]:






  
    
      
ethnic group          epop
Han Chinese   1.245059e+09
Indo-Aryan    8.718156e+08
European      4.948722e+08
African       3.183251e+08
Dravidian     3.027137e+08
Mestizo       1.577344e+08
Bengali       1.467769e+08
Russian       1.318570e+08
Japanese      1.265342e+08
Malay         1.219936e+08

Create CAR code dictionary



In [13]:

    
codedict={child.get('car_code'):child.find('name').text for child in root.findall('./country')}
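The dictionary above maps each car_code attribute to the country name, so that country codes stored on other elements can be translated back to names. A minimal sketch of that lookup on made-up data (the sample XML, the river, and the names `SAMPLE`, `root_demo`, `code_to_name` and `source_name` are assumptions for illustration):

```python
from xml.etree import ElementTree as ET

# Made-up sample: two countries and one river whose source refers to a country code.
SAMPLE = """
<mondial>
  <country car_code="AL"><name>Albania</name></country>
  <country car_code="GR"><name>Greece</name></country>
  <river id="river-demo"><name>Demo River</name><source country="GR"/></river>
</mondial>
"""

root_demo = ET.fromstring(SAMPLE)

# Map car_code attribute -> country name.
code_to_name = {c.get('car_code'): c.findtext('name')
                for c in root_demo.findall('country')}

# Translate the code stored on the river's source element into a country name.
river = root_demo.find('river')
source_code = river.find('source').get('country')
source_name = code_to_name.get(source_code, 'unknown')

print(source_name)  # Greece
```

Using dict.get() with a fallback avoids a KeyError if an element refers to a code that is not in the dictionary.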

Create DF of all rivers and lengths



In [14]:

    
rivers=pd.DataFrame([[i.find('name').text,float(i.find('length').text),i.find('source').get('country')] for i in root.findall('./river/name/../length/..')], columns=['rname','length','scountry'])

Find the river with the longest length



In [15]:

    
maxriver=rivers.loc[rivers['length'].idxmax()]
print('The longest river is the '+maxriver['rname']+", with its source located in "+codedict[maxriver['scountry']])



The longest river is the Amazonas, with its source located in Peru



In [16]:

    
lakes=pd.DataFrame([[i.find('name').text,float(i.find('area').text),i.find('located').get('country')] for i in root.findall('./lake/name/../area/../located/..')], columns=['lname','area','lco'])



In [17]:

    
maxlake=lakes.loc[lakes['area'].idxmax()]
print('The largest lake is the '+maxlake['lname']+", located in "+codedict[maxlake['lco']])









    



The largest lake is the Caspian Sea, located in Russia

I don't know how we're supposed to find the airport's country: as far as I can tell, the data file doesn't carry country info as attributes or children of the airports, and the airports are children of the root of the tree rather than of the country elements. They do have latitude and longitude data.
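One way to check what information an element type carries is to collect every attribute name and child tag that occurs on it. The sketch below does this on made-up airport elements. The sample XML and the names `SAMPLE`, `root_demo`, `attribute_names` and `child_tags` are assumptions for illustration and say nothing about what the real Mondial file contains.

```python
from xml.etree import ElementTree as ET

# Made-up sample: two airport elements with different attributes and children.
SAMPLE = """
<mondial>
  <airport iatacode="AAA" country="XX">
    <name>Demo Field</name><latitude>1.0</latitude><elevation>10</elevation>
  </airport>
  <airport iatacode="BBB">
    <name>Other Field</name><longitude>2.0</longitude>
  </airport>
</mondial>
"""

root_demo = ET.fromstring(SAMPLE)

attribute_names = set()
child_tags = set()
for airport in root_demo.iter('airport'):
    # attrib is a dict of the element's XML attributes
    attribute_names.update(airport.attrib.keys())
    # iterating an element yields its direct children
    child_tags.update(child.tag for child in airport)

print(sorted(attribute_names))  # ['country', 'iatacode']
print(sorted(child_tags))       # ['elevation', 'latitude', 'longitude', 'name']
```

Running the same loop against the real tree would show whether any attribute or child could be joined to the country code dictionary.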



In [64]:

    
airport=pd.DataFrame([[i.find('name').text,i.find('elevation').text,i.find('latitude').text,i.find('longitude').text] for i in root.findall('./airport/name/../latitude/../longitude/../elevation/..')], columns=['aname','elevation','latitude','longitude'])



In [67]:

    
airport['elevation']=airport['elevation'].astype(float)



In [66]:

    
maxairport=airport.loc[airport['elevation'].idxmax()]
print('The highest airport is '+maxairport['aname']+", located at latitude "+maxairport['latitude']+", longitude "+ maxairport['longitude'])









    



The highest airport is El Alto Intl, located at latitude -16.513339, longitude -68.192256


