Introduction to Web Data Formats

Workshop on Web Scraping and Text Processing with Python

by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com

Disclaimer: The code examples presented in this workshop are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.

1. Introduction

In this document, we will look at some common data formats encountered on the Internet. We will work extensively with some of these formats in the workshop. Data is available on the web in different formats: HTML, XML, JSON, YAML and more. XML, JSON and YAML are popular data exchange formats for structured or semi-structured text data. For example, in the workshop, we will be interacting with Twitter content, which is available as JSON strings. The term structured, as applied to datasets in the present context, implies that the organization of data elements is well-defined and predictable. This makes extracting and working with structured content far easier than with unstructured content. Let's look at these data formats and some examples in more detail.

2. XML Document Format

Here, we will examine the anatomy of an XML document that represents the US House of Representatives roll call for the Hurricane Sandy relief bill.

An XML document consists of content (text, images, etc.) and markup, that is, special characters and syntactic structures that annotate the content. The content is what we are interested in reading, and the markup constructs describe the structure of the document so that it becomes machine-readable. Listed below are the basic elements of XML that you will need to consider when using Python to read an XML document:

The XML declaration is a line, such as <?xml version="1.0" encoding="UTF-8"?>, that may appear at the very start of an XML document. The declaration is optional. If present, it may contain an attribute that specifies the character encoding of the document content, which can be useful for scraping purposes.
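As a minimal sketch of why the declared encoding matters, the example here builds a tiny made-up XML document (the <city> element and its text are assumptions for illustration, not part of the roll-call data) whose declaration names the ISO-8859-1 encoding. It then lets Python's standard xml.etree.ElementTree module decode the bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; the declaration names the encoding
# that is used to turn the text into bytes on the next line.
xml_text = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Málaga</city>'
xml_bytes = xml_text.encode("iso-8859-1")

# When the parser is given bytes, it reads the declaration and decodes
# the content with the encoding named there.
root = ET.fromstring(xml_bytes)
decoded_city = root.text
print(decoded_city)
```

Passing the raw bytes (rather than an already decoded string) is what lets the parser act on the declaration.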

Tags are constructs that logically mark-up elements of the XML document. The name of the tag is enclosed in angle brackets. HTML and XML tags come in pairs - <tag> and </tag> - with the former annotating the start of the element and the latter annotating the end. If the tag is marking up an empty element, then the opening and closing set is fused into a single annotation - <tag/>. The snippet below is from an XML document describing the US House vote on the Hurricane Sandy Relief bill (http://en.wikipedia.org/wiki/Hurricane_Sandy_relief_bill). The full XML document is available at https://www.govtrack.us/data/congress/113/votes/2013/h7/data.xml. As shown here, the XML document contains various tags that correspond to particular details of the vote such as the roll-call information in the <roll> tag, the question being voted on in the <question> tag and so on.



<roll where="house" session="113" year="2013" roll="7" source="house.gov"
    datetime="2013-01-04T11:22:00-05:00" updated="2013-07-18T22:42:27-04:00"
    aye="354" nay="67" nv="8" present="0">
    <category>passage-suspension</category>
    <type>On Motion to Suspend the Rules and Pass</type>
    <question>On Motion to Suspend the Rules and Pass: H R 41 To temporarily increase the
        borrowing authority of the Federal Emergency Management Agency for carrying out
        the National Flood Insurance Program</question>
    <required>2/3</required>
    <result>Passed</result>
    <bill session="113" type="h" number="41"/>
    <option key="+">Yea</option>
    <option key="-">Nay</option>
    <option key="P">Present</option>
    <option key="0">Not Voting</option>
    <voter id="400004" vote="+" value="Yea" state="AL"/>
    <voter id="400006" vote="+" value="Yea" state="LA"/>
    <voter id="412500" vote="+" value="Yea" state="NV"/>
    ...
</roll>

Furthermore, tags can be nested within each other. This is demonstrated in the roll-call example above, where the <category>, <type>, <question>, <result>, etc. tags are nested within the parent <roll> tag. The nested or child tags annotate more specialized information about the document element. In this example, the <roll> tag annotates the complete roll-call element, while child tags such as <question> and <result> point to specific information about the roll call.

At the lowest level of tag nesting in the XML document, if the leaf tag is not empty, then it will normally annotate a text element with substantive content. For example, the <question> tag in the example above annotates a text element describing the question being voted on. In addition to attribute values (see next item), most of our scraping efforts are geared towards extracting this content and tags act as guides/references for the web-scraping code.
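The following is a minimal sketch of how the text inside leaf tags might be read from Python, assuming the standard xml.etree.ElementTree module (the workshop itself may use a different library). ROLL_XML is a shortened copy of the roll-call snippet, with the long <question> text and the voter entries left out for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the roll-call snippet (question and voters omitted).
ROLL_XML = """<roll where="house" session="113" year="2013" roll="7">
    <category>passage-suspension</category>
    <type>On Motion to Suspend the Rules and Pass</type>
    <required>2/3</required>
    <result>Passed</result>
    <bill session="113" type="h" number="41"/>
</roll>"""

root = ET.fromstring(ROLL_XML)

# Iterating over an element yields its direct children in document order.
child_tags = [child.tag for child in root]

# find() returns the first matching child element; .text is its text content.
result_text = root.find("result").text
category_text = root.findtext("category")

# An empty element such as <bill .../> has no text content, so .text is None.
bill_text = root.find("bill").text

print(child_tags)
print(result_text, category_text, bill_text)
```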

Attributes are name-value pairs that qualify tags. (Note that the value in the name-value pair is enclosed in quotes.) The <voter> tag in the roll-call example above is qualified by four attributes: id gives the voter id, vote and value both describe the vote cast (as a key such as + and as a label such as Yea), and state gives the voter’s state. Note that although the <voter> tag has attributes, it is also an empty element, i.e., it does not encapsulate text or child elements. In fact, since this is an empty element, the starting and ending <voter> tags are fused into one tag with a forward slash to denote that the tag is marking up an empty element - <voter id=... />.

Attributes are useful in guiding us towards content of interest in two ways. First, we might be interested in content contained in tags with specific attribute values. Second, in cases such as the roll-call example, the attribute value itself may be of substantive interest. For example, if we wanted to list the legislators who voted for the Hurricane Sandy Relief bill, we need only extract the id attribute of the <voter> tags whose vote attribute is "+" (Yea) in the roll-call XML example.
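Both uses of attributes can be sketched with the standard xml.etree.ElementTree module: filtering tags by an attribute value, and reading an attribute value as the data of interest. The three Yea voters are copied from the roll-call snippet. The fourth <voter> entry (id "000000", state "ZZ") is a made-up placeholder, added only so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

ROLL_XML = """<roll where="house" session="113" year="2013" roll="7" aye="354" nay="67">
    <voter id="400004" vote="+" value="Yea" state="AL"/>
    <voter id="400006" vote="+" value="Yea" state="LA"/>
    <voter id="412500" vote="+" value="Yea" state="NV"/>
    <voter id="000000" vote="-" value="Nay" state="ZZ"/>
</roll>"""  # the last voter is hypothetical, not real roll-call data

root = ET.fromstring(ROLL_XML)

# Keep only the <voter> elements whose vote attribute is "+", then read
# their id attribute. Attribute values are always returned as strings.
yea_ids = [
    voter.get("id")
    for voter in root.iter("voter")
    if voter.get("vote") == "+"
]

# Attributes on the parent element are read the same way.
aye_count = int(root.get("aye"))

print(yea_ids, aye_count)
```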

Given an XML document and the concepts of tags, attributes and content, we can discern the elements of the XML document, which are its logical units. An element comprises the starting and ending tags, attributes, any nested tags and textual content. For example, in the roll-call example, the <roll> tag along with all its sub-structure and content is considered to be an element, and similarly the <question> tag is a child element of <roll>. Among the child elements of <roll>, one might be interested in the string contained in the <question> tag. Often such text strings are what we are trying to get at in our web scraping endeavors. The tags and attributes that encapsulate the content of interest serve as machine-readable guides that programmatically lead us to it. Using a scripting language such as Python, we can automate such content extraction across thousands of XML documents or more.

3. HTML vs XML

The majority of text content on the web is available as HTML web pages. Concepts such as elements, tags and attributes, which we saw for XML documents in the previous section, also apply to HTML documents. However, HTML differs from XML in that it is primarily meant for publishing content to the web rather than serving as a data exchange format. As a result, HTML web pages often lack a strict structure (schema). This can make traversing the HTML tag hierarchy from a Python script quite tricky. We will learn some techniques in the workshop which should make this task easier. Secondly, the unstructured nature of HTML documents affects the re-usability of any Python script. For example, even if a script works for one HTML document on a website, it may not fully work with another HTML document that exists at the same level of hierarchy on the website, such as one providing the same data for another year.
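To illustrate the lack of strict structure, here is a small sketch that feeds the same loosely written HTML fragment to a strict XML parser and to the lenient HTMLParser class from Python's standard library. The fragment and the TextCollector class are made up for this illustration, and the workshop may rely on other tools for HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: the <p> and <br> tags are never closed, which
# browsers tolerate but which is not well-formed XML.
SLOPPY_HTML = "<html><body><p>Roll call 7<br><p>Result: <b>Passed</b></body></html>"


class TextCollector(HTMLParser):
    """Record every start tag and every non-blank piece of text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())


# The strict XML parser rejects the fragment because of the unclosed tags.
try:
    ET.fromstring(SLOPPY_HTML)
    parsed_as_xml = True
except ET.ParseError:
    parsed_as_xml = False

# The HTML parser simply reports tags and text as it encounters them.
collector = TextCollector()
collector.feed(SLOPPY_HTML)
collector.close()

print(parsed_as_xml)
print(collector.start_tags)
print(collector.text_chunks)
```

Note that HTMLParser reports a flat stream of events rather than a tree, so any notion of nesting has to be reconstructed by the script.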

4. JSON Data Exchange Format

JSON (JavaScript Object Notation) is a popular data exchange format that is increasingly chosen over XML (e.g. Twitter uses JSON). JSON can be used for representing structured or semi-structured data. JSON data consists of objects whose properties are listed as name-value pairs. Objects are enclosed in curly brackets, and objects can be nested within objects in a JSON document just as elements can be nested within elements in XML documents. Here is a JSON snippet corresponding to the earlier Hurricane Sandy relief bill roll-call XML example.



{
  "bill": {
    "congress": 113, 
    "number": 41, 
    "type": "hr"
  }, 
  "category": "passage-suspension", 
  "chamber": "h", 
  "congress": 113, 
  "date": "2013-01-04T11:22:00-05:00", 
  "number": 7, 
  "question": "On Motion to Suspend the Rules and Pass: H R 41 To temporarily increase the borrowing authority of the Federal Emergency  Management Agency for carrying out the National Flood Insurance Program", 
  "requires": "2/3", 
  "result": "Passed", 
  "result_text": "Passed", 
  "session": "2013", 
  "source_url": "http://clerk.house.gov/evs/2013/roll007.xml", 
  "subject": "To temporarily increase the borrowing authority of the Federal Emergency Management Agency for carrying out the National Flood Insurance Program", 
  "type": "On Motion to Suspend the Rules and Pass", 
  "updated_at": "2014-02-17T08:57:08-05:00", 
  "vote_id": "h7-113.2013", 
  "votes": {
    "Nay": [
      {
        "display_name": "Amash", 
        "id": "A000367", 
        "party": "R", 
        "state": "MI"
      }, 
      {
        "display_name": "Barr", 
        "id": "B001282", 
        "party": "R", 
        "state": "KY"
      }, 
      ...
    ], 
    "Present": [], 
    "Yea": [
      {
        "display_name": "Aderholt", 
        "id": "A000055", 
        "party": "R", 
        "state": "AL"
      }, 
      {
        "display_name": "Alexander", 
        "id": "A000361", 
        "party": "R", 
        "state": "LA"
      }, 
    ...
    ]
  }
}

Here we have the roll-call object enclosed in the outermost curly brackets with many nested objects such as the bill object with values for its congress, number and type properties. As we saw in the XML example, the roll-call object has the question property which specifies the motion being voted on and the result property which contains the result of the roll call.

In the JSON snippet above, we also note that the votes object has one property for each type of vote: Nay, Present and Yea. The value of each property is a list of voter objects enclosed in square brackets; the Present list is empty here. Each voter object has values specified for its display_name, id, party and state properties, and these are encapsulated as colon-separated name-value pairs within curly braces. Various tools have emerged to parse, emit and validate JSON, and we will look at one such Python module in the workshop.
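As a minimal sketch of parsing such a document, the example here uses the json module from Python's standard library (the module used in the workshop may differ). ROLL_JSON is a trimmed copy of the snippet above: several properties are left out, and only the voters shown in the snippet are included, so the counts printed at the end are not the real vote totals.

```python
import json

# Trimmed copy of the roll-call snippet; only the listed voters appear.
ROLL_JSON = """
{
  "bill": {"congress": 113, "number": 41, "type": "hr"},
  "category": "passage-suspension",
  "result": "Passed",
  "votes": {
    "Nay": [
      {"display_name": "Amash", "id": "A000367", "party": "R", "state": "MI"},
      {"display_name": "Barr", "id": "B001282", "party": "R", "state": "KY"}
    ],
    "Present": [],
    "Yea": [
      {"display_name": "Aderholt", "id": "A000055", "party": "R", "state": "AL"},
      {"display_name": "Alexander", "id": "A000361", "party": "R", "state": "LA"}
    ]
  }
}
"""

# JSON objects become Python dicts and JSON arrays become Python lists.
roll = json.loads(ROLL_JSON)

# Nested objects are reached by chaining keys.
bill_number = roll["bill"]["number"]

# Each entry in a vote list is a dict describing one voter.
nay_names = [voter["display_name"] for voter in roll["votes"]["Nay"]]

# Number of voters listed under each option in this trimmed copy.
votes_listed = {option: len(voters) for option, voters in roll["votes"].items()}

print(bill_number, nay_names, votes_listed)
```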

5. YAML Data Exchange Format

YAML (“YAML Ain’t Markup Language”) is another data-oriented format. Its nested structure resembles JSON, but in its usual indented style YAML does without enclosures such as brackets and tags, and quotes are needed only occasionally (for instance, to keep a value such as '02029' from being read as a number). YAML uses associative arrays to represent name-value pairs that specify properties, and lists when multiple elements are involved. Nesting hierarchies are expressed by indentation, much as indentation is used in Python. A snippet of a YAML document giving details of the social media presence of some U.S. legislators who voted Nay in the previous roll-call example is listed below. The complete YAML document can be obtained from https://github.com/unitedstates/congress-legislators/blob/master/legislators-social-media.yaml



- id:
    bioguide: A000367
    thomas: '02029'
    govtrack: 412438
  social:
    twitter: repjustinamash
    facebook: repjustinamash
    youtube: repjustinamash
    facebook_id: '173604349345646'
    youtube_id: UCeg6HhoCXrS8xpON9dxtZgA
- id:
    bioguide: B001282
    thomas: '02131'
    govtrack: 412541
  social:
    twitter: RepAndyBarr
    facebook: RepAndyBarr
    youtube: RepAndyBarr
    facebook_id: '457461137635018'
    youtube_id: UCVL2s6x7f7H0ZJ-uwU0pQ6Q

The YAML snippet above contains a list of two elements, each starting with a hyphen. Each element contains the nested elements id and social. These two sub-elements are themselves associative arrays (dictionaries in Python). The associative array corresponding to the id sub-element contains name-value pairs identifying the legislator, such as the bioguide ID, THOMAS ID, etc., while the social sub-element contains name-value pairs describing the legislator's social media presence.
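To make the mapping to Python concrete, the sketch here writes out by hand the nested structure that a YAML parser would be expected to return for the snippet above: a list of dictionaries. It deliberately avoids a YAML library so that it runs with the standard library alone. With the third-party PyYAML package, a call such as yaml.safe_load should yield an equivalent structure, but that is an assumption about the tooling rather than something this document prescribes.

```python
# Hand-written Python equivalent of the YAML snippet. Quoted values such as
# '02029' stay strings, while the unquoted govtrack values are integers.
legislators = [
    {
        "id": {"bioguide": "A000367", "thomas": "02029", "govtrack": 412438},
        "social": {
            "twitter": "repjustinamash",
            "facebook": "repjustinamash",
            "youtube": "repjustinamash",
            "facebook_id": "173604349345646",
            "youtube_id": "UCeg6HhoCXrS8xpON9dxtZgA",
        },
    },
    {
        "id": {"bioguide": "B001282", "thomas": "02131", "govtrack": 412541},
        "social": {
            "twitter": "RepAndyBarr",
            "facebook": "RepAndyBarr",
            "youtube": "RepAndyBarr",
            "facebook_id": "457461137635018",
            "youtube_id": "UCVL2s6x7f7H0ZJ-uwU0pQ6Q",
        },
    },
]

# Walk the list and pull one value out of each nested dictionary.
twitter_by_bioguide = {
    person["id"]["bioguide"]: person["social"]["twitter"]
    for person in legislators
}

print(twitter_by_bioguide)
```

The bioguide values A000367 and B001282 match the id values of the two Nay voters in the JSON roll-call snippet, which is how the two datasets can be linked.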


