Workshop on Web Scraping and Text Processing with Python
by Radhika Saksena, Princeton University, saksena@princeton.edu, radhika.saksena@gmail.com
Disclaimer: The code examples presented in this workshop are for educational purposes only. Please seek advice from a legal expert about the legal implications of using this code for web scraping.
In this document, we will look at some common data formats encountered on the Internet. We will be extensively working with some of these document formats in the workshop. Data is available on the web in different formats - HTML, XML, JSON, YAML and more. XML, JSON and YAML are popular data exchange formats for structured/semi-structured text data. For example, in the workshop, we will be interacting with Twitter content, which is available as JSON strings. The term structured, as applied to datasets in the present context, implies that the organization of data elements is well-defined and predictable. This makes extracting and working with structured content far more easier than with unstructured content. Let's look at these data formats and some examples in more detail.
Here, we will examine the anatomy of an XML document that represents the US House of Representatives roll call for the Hurricane Sandy relief bill.
An XML document consists of markup which are special characters and syntactic structures that mark-up the content (text, images, etc.) of the XML document. The content is what we are interested in reading and the mark-up constructs describe the structure of the document so that it becomes machine-readable. Listed below are the basic elements of XML that you will need to consider when using Python to read an XML document:
<tag> and </tag> - with the former annotating the start of the element and the latter annotating the end. If the tag is marking up an empty element, then the opening and closing set is fused into a single annotation - <tag/>. The snippet below is from an XML document describing the US House vote on the Hurricane Sandy Relief bill (http://en.wikipedia.org/wiki/Hurricane_Sandy_relief_bill). The full XML document is available at https://www.govtrack.us/data/congress/113/votes/2013/h7/data.xml. As shown here, the XML document contains various tags that correspond to particular details of the vote such as the roll-call information in the <roll> tag, the question being voted on in the <question> tag and so on.
In [ ]:
<roll where="house" session="113" year="2013" roll="7" source="house.gov"
datetime="2013-01-04T11:22:00-05:00" updated="2013-07-18T22:42:27-04:00"
aye="354" nay="67" nv="8" present="0">
<category>passage-suspension</category>
<type>On Motion to Suspend the Rules and Pass</type>
<question>On Motion to Suspend the Rules and Pass: H R 41 To temporarily increase the
borrowing authority of the Federal Emergency Management Agency for carrying out
the National Flood Insurance Program</question>
<required>2/3</required>
<result>Passed</result>
<bill session="113" type="h" number="41"/>
<option key="+">Yea</option>
<option key="-">Nay</option>
<option key="P">Present</option>
<option key="0">Not Voting</option>
<voter id="400004" vote="+" value="Yea" state="AL"/>
<voter id="400006" vote="+" value="Yea" state="LA"/>
<voter id="412500" vote="+" value="Yea" state="NV"/>
...
</roll>
<category>, <type>, <question>, <result>, etc. tags are nested within the parent <roll> tag. The nested or child tags annotate more specialized information about the document element. In this example the <roll> tag annotates the complete roll-call element, while the child tags such as <question> and <result> point to specific information about the roll-call.<question> tag in the example above annotates a text element describing the question being voted on. In addition to attribute values (see next item), most of our scraping efforts are geared towards extracting this content and tags act as guides/references for the web-scraping code.<voter> tag in the roll-call example above is qualified by three attributes describing (a) the voter id, (b) the type of vote/value and (c) the voter’s state. Note that although the <voter> tag has attributes, it is also an empty element, i.e., it does not encapsulate text or child elements. In fact, since this is an empty element, the starting and ending <voter> tags are fused into one tag with a forward slash to denote that the tag is marking up an empty element - <voter id=... />.<voter> tags from the roll-call XML example.<roll> tag along with all its sub-structure and content is considered to be an element and similarly the <question> tag is a child element of <roll>. Among the child elements of <roll>, one might be interested in the string contained in the <question> tag. Often such text strings are what we are trying to get at in our web scraping endeavors. The tags and attributes, that encapsulate the content of interest, serve as machine-readable guides that programmatically lead us to it. Using scripting languages, such as Pythons, we can automate such content extraction across thousands and even more XML documents.Majority of text content on the web is available as HTML web pages. Concepts such as elements, tags, attributes that we saw for XML documents, in the previous section, also apply to HTML documents. However, HTML differs from XML in that it is primarily meant for publishing content to the web rather than as a data exchange format. As a result there is often a lack of strict structure (schema) in HTML web pages. This can make traversing the HTML tag hierarchy from a Python script quite tricky. We will learn some techniques in the workshop which should make this task easier. Secondly, the unstructured nature of HTML documents affects the re-usability of any Python script. For example, even if a script works for one HTML document on a website, it may not fully work with another HTML document that exists at the same level of hierarchy on the website, for example, providing the same data for another year, .
JSON (JavaScript-Object Notation) is a popular data exchange format and is increasingly becoming the format of choice over the XML format (e.g. Twitter uses JSON). JSON can be used for representing structured or semi-structured data. JSON data comprises of objects whose properties are listed as name-value pairs. Objects are enclosed in curly brackets and there can be nested objects in the JSON document just as elements can be nested within elements in XML documents. So here is a JSON snippet corresponding to the earlier Hurricane Sandy relief bill roll-call XML example - now in JSON format.
In [ ]:
{
"bill": {
"congress": 113,
"number": 41,
"type": "hr"
},
"category": "passage-suspension",
"chamber": "h",
"congress": 113,
"date": "2013-01-04T11:22:00-05:00",
"number": 7,
"question": "On Motion to Suspend the Rules and Pass: H R 41 To temporarily increase the borrowing authority of the Federal Emergency Management Agency for carrying out the National Flood Insurance Program",
"requires": "2/3",
"result": "Passed",
"result_text": "Passed",
"session": "2013",
"source_url": "http://clerk.house.gov/evs/2013/roll007.xml",
"subject": "To temporarily increase the borrowing authority of the Federal Emergency Management Agency for carrying out the National Flood Insurance Program",
"type": "On Motion to Suspend the Rules and Pass",
"updated_at": "2014-02-17T08:57:08-05:00",
"vote_id": "h7-113.2013",
"votes": {
"Nay": [
{
"display_name": "Amash",
"id": "A000367",
"party": "R",
"state": "MI"
},
{
"display_name": "Barr",
"id": "B001282",
"party": "R",
"state": "KY"
},
...
}
"Present": [],
"Yea": [
{
"display_name": "Aderholt",
"id": "A000055",
"party": "R",
"state": "AL"
},
{
"display_name": "Alexander",
"id": "A000361",
"party": "R",
"state": "LA"
},
...
]
}
}
In [ ]:
- id:
bioguide: A000367
thomas: ’02029’
govtrack: 412438
social:
twitter: repjustinamash
facebook: repjustinamash
youtube: repjustinamash
facebook_id: ’173604349345646’
youtube_id: UCeg6HhoCXrS8xpON9dxtZgA
- id:
bioguide: B001282
thomas: ’02131’
govtrack: 412541
social:
twitter: RepAndyBarr
facebook: RepAndyBarr
youtube: RepAndyBarr
facebook_id: ’457461137635018’
youtube_id: UCVL2s6x7f7H0ZJ-uwU0pQ6Q
In [ ]: