XML example and exercise


  • study examples of accessing nodes in XML tree structure
  • work on exercise to be completed and submitted



In [3]:
from xml.etree import ElementTree as ET

XML example


In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
document_tree.getroot()[0].attrib


Out[3]:
{'area': '28750',
 'capital': 'cty-Albania-Tirane',
 'car_code': 'AL',
 'memberships': 'org-BSEC org-CEI org-CD org-SELEC org-CE org-EAPC org-EBRD org-EITI org-FAO org-IPU org-IAEA org-IBRD org-ICC org-ICAO org-ICCt org-Interpol org-IDA org-IFRCS org-IFC org-IFAD org-ILO org-IMO org-IMF org-IOC org-IOM org-ISO org-OIF org-ITU org-ITUC org-IDB org-MIGA org-NATO org-OSCE org-OPCW org-OAS org-OIC org-PCA org-UN org-UNCTAD org-UNESCO org-UNIDO org-UPU org-WCO org-WFTU org-WHO org-WIPO org-WMO org-UNWTO org-WTO'}

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)


Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra

In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() replaces the deprecated getiterator() and visits every nested <city>
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))


* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella

In [21]:
document_tree.getroot()[0].find('population').attrib


Out[21]:
{'measured': 'est.', 'year': '1950'}

In [24]:
for i in document_tree.getroot()[0].findall('population'):
    print(str(i.text) +  " year: "+ str(i.get('year')))


1214489 year: 1950
1618829 year: 1960
2138966 year: 1970
2734776 year: 1980
3446882 year: 1990
3249136 year: 1997
3304948 year: 2000
3069275 year: 2001
2800138 year: 2011

In [16]:
for child in document_tree.getroot():
    print(child.find('name').text + ' infant : '+child.find('infant_mortality').text)


Albania infant : 13.19
Greece infant : 4.78
Macedonia infant : 7.9
Serbia infant : 6.16
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-73bf8ab058a5> in <module>()
      1 for child in document_tree.getroot():
----> 2     print(child.find('name').text + ' infant : '+child.find('infant_mortality').text)

AttributeError: 'NoneType' object has no attribute 'text'
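
The traceback arises because `find` returns `None` when a country has no `infant_mortality` child. A minimal sketch of guarding against that follows, using a small inline document made up for illustration; the element names mirror the data file, while the helper name `infant_mortality_by_country` is an assumption.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document: element names follow the data file,
# Albania's rate is the value printed above, Montenegro has no rate.
SAMPLE = """
<mondial>
  <country><name>Albania</name><infant_mortality>13.19</infant_mortality></country>
  <country><name>Montenegro</name></country>
</mondial>
"""

def infant_mortality_by_country(root):
    """Return {country name: rate}, skipping countries without a rate."""
    rates = {}
    for country in root.findall('country'):
        node = country.find('infant_mortality')  # None when the child is absent
        if node is not None:
            rates[country.find('name').text] = float(node.text)
    return rates

sample_rates = infant_mortality_by_country(ET.fromstring(SAMPLE))
print(sample_rates)
```

The explicit `is not None` check matters here: an element without children is falsy in ElementTree, so a plain `if node:` would also skip elements that exist but only hold text.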

In [6]:
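# iterate over the children of the first country (not over the countries);
# none of them has an infant_mortality child, so every lookup returns None
for child in document_tree.getroot()[0]:
    print(child.find('infant_mortality'))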


None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None

In [37]:
testroot=document_tree.getroot()

In [28]:
for child in testroot.findall(".//infant_mortality/.."):
    print(child.find('name').text + ' infant : '+str(float(child.find('infant_mortality').text)))


Albania infant : 13.19
Greece infant : 4.78
Macedonia infant : 7.9
Serbia infant : 6.16
Andorra infant : 3.69

In [69]:
for child in testroot.findall("./country"):
    print([(i.text,i.get('year')) for i in child.findall("population[@year='2011']")])


[('2800138', '2011')]
[('10816286', '2011')]
[('2059794', '2011')]
[('7120666', '2011')]
[('620029', '2011')]
[('1733872', '2011')]
[('78115', '2011')]

In [74]:
for child in testroot[0].findall('./ethnicgroup'):
    print(child.text)


Albanian
Greek

In [162]:
# note: `root` is the root of the full data file, defined in the exercise section below
(ET.tostring(root))[:2717100].rfind('river')


Out[162]:
2716891

In [163]:
(ET.tostring(root))[2716850:2716950]


Out[163]:
'ntry="H" id="island-MargitSziget" river="river-Donau">\n      <name>Margit Sziget</name>\n      <locat'

In [183]:
for i in root.find('./river'):
    if i.tag!='located_at':
        print(i.tag + ' ' + i.find('name').text)


river Thjorsa
river Joekulsa a Fjoellum
river Glomma
river Lagen
river Goetaaelv
river Klaraelv
river Umeaelv
river Dalaelv
river Vaesterdalaelv
river Oesterdalaelv

In [256]:
for i in root.findall('./country/airport')[:10]:
    print(list(i))

In [248]:
for i in root.findall('.//airport/..')[:10]:
    print(i.tag)


mondial

In [257]:
root.find('.//gmtOffset/..').tag


Out[257]:
'airport'

In [271]:
float(root.find('.//airport/latitude').text)


Out[271]:
34.210017

In [204]:
root.find('./river').find('source').attrib


Out[204]:
{'country': 'IS'}

In [61]:
ET.tostring(testroot[6].find('./ethnicgroup'))


Out[61]:
'<ethnicgroup percentage="43">Spanish</ethnicgroup>\n      '

In [107]:
p = []
for child in testroot.findall('.//ethnicgroup/..//population/..'):
    p += [[i.text,float(i.get('percentage')),child.find('name').text,float(child.find("population[@year='2011']").text)] for i in child.findall('ethnicgroup')]

In [108]:
p


Out[108]:
[['Albanian', 95.0, 'Albania', 2800138.0],
 ['Greek', 3.0, 'Albania', 2800138.0],
 ['Greek', 93.0, 'Greece', 10816286.0],
 ['Macedonian', 64.2, 'Macedonia', 2059794.0],
 ['Albanian', 25.2, 'Macedonia', 2059794.0],
 ['Turkish', 3.9, 'Macedonia', 2059794.0],
 ['Gypsy', 2.7, 'Macedonia', 2059794.0],
 ['Serb', 1.8, 'Macedonia', 2059794.0],
 ['Serb', 82.9, 'Serbia', 7120666.0],
 ['Montenegrin', 0.9, 'Serbia', 7120666.0],
 ['Hungarian', 3.9, 'Serbia', 7120666.0],
 ['Roma', 1.4, 'Serbia', 7120666.0],
 ['Bosniak', 1.8, 'Serbia', 7120666.0],
 ['Croat', 1.1, 'Serbia', 7120666.0],
 ['Montenegrin', 43.0, 'Montenegro', 620029.0],
 ['Serb', 32.0, 'Montenegro', 620029.0],
 ['Bosniak', 8.0, 'Montenegro', 620029.0],
 ['Albanian', 5.0, 'Montenegro', 620029.0],
 ['Albanian', 92.0, 'Kosovo', 1733872.0],
 ['Serbian', 5.0, 'Kosovo', 1733872.0],
 ['Spanish', 43.0, 'Andorra', 78115.0],
 ['Andorran', 33.0, 'Andorra', 78115.0],
 ['Portuguese', 11.0, 'Andorra', 78115.0],
 ['French', 2.0, 'Andorra', 78115.0],
 ['African', 5.0, 'Andorra', 78115.0]]

In [93]:
import pandas as pd

testeth=pd.DataFrame(p,columns=['ethnicgroup','percentage','country','cpop'])
testeth['epop']=testeth['percentage']*testeth['cpop']/100.0
testeth[['ethnicgroup','epop']].groupby('ethnicgroup').sum().sort_values('epop',ascending=False)


Out[93]:
                     epop
ethnicgroup
Greek        10143150.120
Serb          6138517.686
Albanian      4805362.878
Macedonian    1322387.748
Montenegrin    330698.464
Hungarian      277705.974
Bosniak        177774.308
Roma            99689.324
Serbian         86693.600
Turkish         80331.966
Croat           78327.326
Gypsy           55614.438
Spanish         33589.450
Andorran        25777.950
Portuguese       8592.650
African          3905.750
French           1562.300

XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

  1. 10 countries with the lowest infant mortality rates
  2. 10 cities with the largest population
  3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
  4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [4]:
import pandas as pd

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
root=document.getroot()

In [7]:
pd.DataFrame([[child.find('name').text, float(child.find('infant_mortality').text)] for child in root.findall(".//infant_mortality/..")], columns=['country','infant mortality']).sort_values('infant mortality')[:10]


Out[7]:
     country         infant mortality
36   Monaco          1.81
90   Japan           2.13
109  Bermuda         2.48
34   Norway          2.48
98   Singapore       2.53
35   Sweden          2.60
8    Czech Republic  2.63
72   Hong Kong       2.73
73   Macao           3.13
39   Iceland         3.15

In [8]:
pd.DataFrame([[child.find('name').text, int(child.find('population').text)] for child in root.findall("./country/city/population/..")], columns=['city','pop']).sort_values('pop',ascending=False)[:10]


Out[8]:
     city                  pop
165  Seoul            10229262
123  Hong Kong         7055071
154  Al Qahirah        6053000
75   Bangkok           5876000
87   Ho Chi Minh       3924435
166  Busan             3813814
205  New Taipei        3722082
84   Hanoi             3056146
153  Al Iskandariyah   2917000
204  Taipei            2626138

In [9]:
p = []
for child in root.findall('.//ethnicgroup/..//population/..'):
    p += [[i.text,float(i.get('percentage')),child.find('name').text,float(child.findall("population")[-1].text)] for i in child.findall('ethnicgroup')]

In [12]:
eth=pd.DataFrame(p,columns=['ethnic group','percentage','country','cpop'])
eth['epop']=eth['percentage']*eth['cpop']/100.0
eth[['ethnic group','epop']].groupby('ethnic group').sum().sort_values('epop',ascending=False)[:10]


Out[12]:
                      epop
ethnic group
Han Chinese   1.245059e+09
Indo-Aryan    8.718156e+08
European      4.948722e+08
African       3.183251e+08
Dravidian     3.027137e+08
Mestizo       1.577344e+08
Bengali       1.467769e+08
Russian       1.318570e+08
Japanese      1.265342e+08
Malay         1.219936e+08

Create CAR code dictionary


In [13]:
codedict={child.get('car_code'):child.find('name').text for child in root.findall('./country')}

Create DF of all rivers and lengths


In [14]:
rivers=pd.DataFrame([[i.find('name').text,float(i.find('length').text),i.find('source').get('country')] for i in root.findall('./river/name/../length/..')], columns=['rname','length','scountry'])

Find the river with the longest length


In [15]:
maxriver=rivers.iloc[rivers['length'].idxmax()]
print('The longest river is the '+maxriver['rname']+", with its source located in "+codedict[maxriver['scountry']])


The longest river is the Amazonas, with its source located in Peru

In [16]:
lakes=pd.DataFrame([[i.find('name').text,float(i.find('area').text),i.find('located').get('country')] for i in root.findall('./lake/name/../area/../located/..')], columns=['lname','area','lco'])

In [17]:
maxlake=lakes.iloc[lakes['area'].idxmax()]
print('The largest lake is the '+maxlake['lname']+", located in "+codedict[maxlake['lco']])


The largest lake is the Caspian Sea, located in Russia

I don't know how we're supposed to find the airport's country: the data file doesn't have country info as attributes or children of the airports, and the airports are not children of the country elements (they sit directly under the root of the tree). They do have latitude and longitude data.
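
One way to check what location information an element carries is to list its attribute names and child tags. A sketch on a made-up airport element follows; the `iatacode` attribute, the values, and the helper `describe` are assumptions for illustration and are not taken from the data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical airport element, shaped after the child tags used in this notebook.
SAMPLE = """
<airport iatacode="XXX">
  <name>Example Field</name>
  <latitude>1.0</latitude>
  <longitude>2.0</longitude>
  <elevation>3</elevation>
</airport>
"""

def describe(element):
    """Return (sorted attribute names, child tags in document order)."""
    return sorted(element.attrib), [child.tag for child in element]

attrs, children = describe(ET.fromstring(SAMPLE))
print(attrs, children)
```

Calling `describe` on an airport element from the real file would show directly whether any attribute or child refers to a country.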


In [64]:
airport=pd.DataFrame([[i.find('name').text,i.find('elevation').text,i.find('latitude').text,i.find('longitude').text] for i in root.findall('./airport/name/../latitude/../longitude/../elevation/..')], columns=['aname','elevation','latitude','longitude'])

In [67]:
airport['elevation']=airport['elevation'].astype(float)

In [66]:
maxairport=airport.iloc[airport['elevation'].idxmax()]
print('The highest airport is '+maxairport['aname']+", located at latitude "+maxairport['latitude']+", longitude "+ maxairport['longitude'])


The highest airport is El Alto Intl, located at latitude -16.513339, longitude -68.192256
