Intro to XML

Design Goals

  • platform-independent data transfer
  • easy to write code to read/write
  • document validation
  • human readable
  • support a wide variety of apps

  • See the official XML Origin and Goals

  • "Free as in beer" means free in the sense of costing no money; gratis.

Benefits of XML

  • robust parsers exist in most languages
  • we can focus on our app
  • free

XML Design Principles

  • we can build databases to support various types of queries
  • we can piece together data from different sources
  • XML can be converted into different formats with no loss of information (see the sketch below)
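
To illustrate that last point, here is a minimal sketch (not part of the course code) that walks an ElementTree element into a nested Python structure; it ignores details like tail text and comments, so it is illustrative rather than truly lossless:

In [ ]:
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    # Recursively capture tag, attributes, text, and children.
    return {
        "tag": elem.tag,
        "attrib": elem.attrib,
        "text": (elem.text or "").strip(),
        "children": [element_to_dict(child) for child in elem]
    }

print(element_to_dict(ET.fromstring('<person id="1"><name>Ada</name></person>')))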

XML in practice

  • most things can be represented by a list of properties and their values
  • nested structures

Fundamentals of XML

  • an XML document is made up of elements
  • XML can be document-oriented
  • XML documents can also be data-oriented (see the example below)
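
To make the distinction concrete, here is a small made-up example: the first snippet is document-oriented (markup mixed into running text), the second is data-oriented (a regular, record-like structure). Both parse the same way:

In [ ]:
import xml.etree.ElementTree as ET

# Document-oriented: mixed content, tags embedded in prose.
doc = ET.fromstring('<p>XML is <b>flexible</b> and readable.</p>')

# Data-oriented: a regular record with one field per tag.
rec = ET.fromstring('<author><fnm>Omer</fnm><snm>Mei-Dan</snm></author>')

print(doc.text)              # 'XML is '
print(rec.find('snm').text)  # 'Mei-Dan'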

Parsing XML


In [1]:
import xml.etree.ElementTree as ET
import pprint

In [2]:
def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()

In [3]:
article_file = 'exampleResearchArticle.xml'
root = get_root(article_file)

In [4]:
for child in root:
    print(child.tag)


ui
ji
fm
bdy
bm
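
Besides .tag, each element also exposes .attrib (a dict of its attributes) and .text; we'll rely on both below.

In [ ]:
print(root.tag)
print(root.attrib)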

We can use XPath expressions to select nested tags:


In [5]:
for a in root.findall("./fm/bibl/aug/au"):
    email = a.find("email")
    if email is not None:
        print(email.text)


omer@extremegate.com
mcarmont@hotmail.com
laver17@gmail.com
nyska@internet-zahav.net
kammarh@gmail.com
gideon.mann.md@gmail.com
barns.nz@gmail.com
eukots@gmail.com

Let's try to get all of the author data from the file. We may need to look at the ElementTree documentation.


In [6]:
def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
            "fnm": author.find('fnm').text,
            "snm": author.find('snm').text,
            "email": author.find('email').text,
            "insr": []
        }
        # Each <insr> tag carries an institution id in its iid attribute.
        for insr in author.findall("insr"):
            data['insr'].append(insr.attrib['iid'])

        authors.append(data)

    return authors
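
Note that get_authors assumes every au element has fnm, snm, and email children; calling .text on a missing child would raise an AttributeError. A more defensive variant could go through a small helper (a sketch; this particular file doesn't need it):

In [ ]:
def get_text(elem, tag):
    # Return the child element's text, or None if that child is absent.
    child = elem.find(tag)
    return child.text if child is not None else None

# e.g. "fnm": get_text(author, 'fnm') inside get_authors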

In [7]:
def test1():
    solution = [{'insr': ['I1'], 'fnm': 'Omer', 'snm': 'Mei-Dan', 'email': 'omer@extremegate.com'},
                {'insr': ['I2'], 'fnm': 'Mike', 'snm': 'Carmont', 'email': 'mcarmont@hotmail.com'},
                {'insr': ['I3', 'I4'], 'fnm': 'Lior', 'snm': 'Laver', 'email': 'laver17@gmail.com'},
                {'insr': ['I3'], 'fnm': 'Meir', 'snm': 'Nyska', 'email': 'nyska@internet-zahav.net'},
                {'insr': ['I8'], 'fnm': 'Hagay', 'snm': 'Kammar', 'email': 'kammarh@gmail.com'},
                {'insr': ['I3', 'I5'], 'fnm': 'Gideon', 'snm': 'Mann', 'email': 'gideon.mann.md@gmail.com'},
                {'insr': ['I6'], 'fnm': 'Barnaby', 'snm': 'Clarck', 'email': 'barns.nz@gmail.com'},
                {'insr': ['I7'], 'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com'}]
    
    root = get_root(article_file)
    data = get_authors(root)

    assert data[0] == solution[0]
    assert data[1]["fnm"] == solution[1]["fnm"]
    assert data[1]["insr"] == solution[1]["insr"]

In [8]:
test1()
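
Since we imported pprint earlier, we can pretty-print the parsed author data to eyeball it:

In [ ]:
pprint.pprint(get_authors(root))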

Data Wrangling Procedure - Screen Scraping

  • check the TranStats website's page source in your browser!
  • build a list of carrier values
  • build a list of airport values
  • make HTTP requests to download all the data
  • then parse the data files

Extracting entities

Look at BeautifulSoup


In [12]:
from bs4 import BeautifulSoup

In [13]:
def options(soup, element_id):
    # Collect the value attribute of every <option> under the given element.
    option_values = []
    select = soup.find(id=element_id)
    for option in select.find_all('option'):
        option_values.append(option['value'])
    return option_values

In [15]:
soup = BeautifulSoup(open('air_trans_home.html'), "lxml")

In [44]:
carrierList = options(soup, 'CarrierList')

In [54]:
airportList = options(soup, 'AirportList')

In [56]:
def removeAll(_list):
    # Iterate over a copy so we can safely remove from the original list.
    for value in _list[:]:
        if value.startswith("All"):
            _list.remove(value)

In [57]:
removeAll(carrierList)
removeAll(airportList)

In [58]:
print(len(carrierList))
print(len(airportList))


15
1190

Beginning to Build Our HTTP Requests

  • when scraping, we need to understand how the website expects requests to be formed
  • the network tab of the browser's developer tools is useful for that: the browser knows how to make the request, so we copy it
  • emulate it in code
  • if something blows up, look at your HTTP traffic
  • iterate until it works

Wireshark could be useful


In [29]:
def getValueOfId(soup, id):
    # Return the value attribute of the element with the given id
    # (the ASP.NET hidden form fields we need to echo back).
    return soup.find(id=id)['value']

In [38]:
import requests

transtats_url = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2"

s = requests.Session()

# GET the form page first to pick up the ASP.NET hidden fields.
r = s.get(transtats_url)
soup = BeautifulSoup(r.text, "lxml")

# POST the form back with our carrier/airport selection plus the hidden fields.
r = s.post(transtats_url,
           data={'AirportList': "BOS",
                 'CarrierList': "VX",
                 'Submit': 'Submit',
                 "__VIEWSTATEGENERATOR": getValueOfId(soup, '__VIEWSTATEGENERATOR'),
                 "__EVENTTARGET": "",
                 "__EVENTARGUMENT": "",
                 "__EVENTVALIDATION": getValueOfId(soup, '__EVENTVALIDATION'),
                 "__VIEWSTATE": getValueOfId(soup, '__VIEWSTATE')})

with open('virgin_and_logan_airport.html', 'w') as f:
    f.write(r.text)
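
The full download would repeat this POST for every carrier/airport pair collected earlier, saving one file per pair. A sketch, assuming the session and helpers above and the data/CARRIER-AIRPORT.html naming used below (a more robust version would refresh the hidden form fields from each response):

In [ ]:
for carrier in carrierList:
    for airport in airportList:
        r = s.post(transtats_url,
                   data={'AirportList': airport,
                         'CarrierList': carrier,
                         'Submit': 'Submit',
                         "__VIEWSTATEGENERATOR": getValueOfId(soup, '__VIEWSTATEGENERATOR'),
                         "__EVENTTARGET": "",
                         "__EVENTARGUMENT": "",
                         "__EVENTVALIDATION": getValueOfId(soup, '__EVENTVALIDATION'),
                         "__VIEWSTATE": getValueOfId(soup, '__VIEWSTATE')})
        # Save one file per pair, e.g. data/FL-ATL.html
        with open('data/{}-{}.html'.format(carrier, airport), 'w') as f:
            f.write(r.text)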

In [71]:
soup = BeautifulSoup(open("data/FL-ATL.html"), "lxml")

In [75]:
table = soup.find("table", id="DataGrid1")

In [85]:
rows = table.find_all("tr")[1:]
for row in rows:
    # Grab the text of each cell; the header row was skipped above.
    row_data = [td.string for td in row.find_all("td")]

    # Skip the yearly TOTAL rows.
    if row_data[1] == 'TOTAL':
        continue

    print({
        'year': int(row_data[0]),
        'month': int(row_data[1]),
        'flights': {
            'domestic': int(row_data[2].replace(",", "")),
            'international': int(row_data[3].replace(",", ""))
        }
    })


{'flights': {'international': 92565, 'domestic': 815489}, 'month': 10, 'year': 2002}
{'flights': {'international': 91342, 'domestic': 766775}, 'month': 11, 'year': 2002}
{'flights': {'international': 96881, 'domestic': 782175}, 'month': 12, 'year': 2002}
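
To process every downloaded file the same way, the row-parsing logic above can be wrapped into a function; a sketch, assuming each saved page carries the same DataGrid1 table:

In [ ]:
def extract_flights(fname):
    # Parse one downloaded page into a list of per-month flight records.
    soup = BeautifulSoup(open(fname), "lxml")
    table = soup.find("table", id="DataGrid1")
    records = []
    for row in table.find_all("tr")[1:]:
        row_data = [td.string for td in row.find_all("td")]
        if row_data[1] == 'TOTAL':   # skip the yearly TOTAL rows
            continue
        records.append({
            'year': int(row_data[0]),
            'month': int(row_data[1]),
            'flights': {
                'domestic': int(row_data[2].replace(",", "")),
                'international': int(row_data[3].replace(",", ""))
            }
        })
    return records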