Web scraping in Python

I find sports more exciting when I know what to expect. How many points is our opponent scoring relative to how they did in the past? Is our offense shooting lights out, or are they running up the score against a struggling defense? For many sports, Las Vegas sports books provide the information I want with the line (the expected difference between the home team's score and the visitor's score) and the total (the sum of the expected scores). For example, at game time on Monday, March 11th, Iona's men's basketball team was a four-point favorite (-4) over Manhattan, with a total of 117. While there may be some biases in these numbers, they are fairly accurate guides to what to expect.

At the moment, I'm interested in women's college lacrosse. Unfortunately, Vegas doesn't provide lines for lacrosse (or any college sport besides football and men's basketball). Laxpower has a power rating, which is a numerical measure of the strength of a team. Comparing two teams' ratings and adding a bit for home field advantage does give a pretty good line, but they don't provide total projections. Since expected point differences and totals for lacrosse aren't available anywhere else, I decided to create my own lines and totals for entertainment and educational purposes. You can see the results.

Creating lines involves finding and downloading game data, developing a power rating model to rank each team, and using that model to predict future games. This general process isn't limited to constructing sports power rankings; these are the same general steps used to scrape data from websites for any sort of quantitative analysis. It's the same process my coauthors and I used, for example, to [analyze](http://nealcaren.web.unc.edu/files/2012/05/smoc.pdf) a white racist web forum.

Luckily, Laxpower has all the game information, both for games played and games scheduled. I wanted to cycle through each of the team pages to get that information, but first I needed to know the URLs for all those pages. Conveniently, the ranking page lists all the teams along with links to their pages, so I can grab the URLs from there. This two-step process, finding the URLs you want to download and then visiting each of them, is fairly common for this kind of web scraping research.

I start by visiting the page I want to scrape in my browser and finding an example of the type of information I want to get, such as the text "North Carolina", which I know, based on clicking it, leads to all the game information for the North Carolina team. Again in my browser, I viewed the source for the ranking page, that is, the raw HTML. In this view, I searched for "North Carolina" again so I could get a sense of what each link looked like in the code.
Fortunately, the page had two pieces of information for each team listed in a way that was very easy to extract. Each link began with a quotation mark and ended in PHP followed by a quotation mark. (Actually, this is just the relative path of the URL; I'll fill in the beginning part of the URL later.) This was followed by a >, the school's name, and then a <. This is a situation where a simple regular expression would allow me to pull out the information I needed.

To get a list that contains all the URLs, I could take advantage of the uniform way they were listed. In the Python variant of regular expressions, the powerful combination .*? will match any character, repeated any number of times, until it runs into something else. So searching a text for My .*? dog would grab all the text between My and dog. In my case, I wanted to extract all the instances of text that occurred between a quotation mark and PHP followed by a quotation mark, so I could search for instances of ".*?PHP" in the page's text.
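As a quick toy illustration (not from the lacrosse data), here is the difference the non-greedy ? makes:

import re

sample = 'My big dog and My small dog'
print re.findall('My .* dog', sample)   # greedy: ['My big dog and My small dog']
print re.findall('My .*? dog', sample)  # non-greedy: ['My big dog', 'My small dog']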


In [7]:
import urllib2
import re

teams_html=urllib2.urlopen('http://www.laxpower.com/update13/binwom/rating01.php').read()
teams=re.findall('".*?PHP"',teams_html)
print teams[:5]


['"XMADXX.PHP"', '"XUFLXX.PHP"', '"XNWSXX.PHP"', '"XUNCXX.PHP"', '"XSYRXX.PHP"']

This is pretty good, but I don't want the quotation marks. I can be pickier about what I extract by using parentheses, which instruct re to return only the part of the match between the parentheses.


In [8]:
teams=re.findall('"(.*?PHP)"',teams_html)
print teams[:5]


['XMADXX.PHP', 'XUFLXX.PHP', 'XNWSXX.PHP', 'XUNCXX.PHP', 'XSYRXX.PHP']

As I noted above, right next to this is the school's name. I can extract it as well by extending the regular expression.


In [9]:
teams=re.findall('"(.*?PHP)">(.*?)<',teams_html)
print teams[:5]


[('XMADXX.PHP', 'Maryland'), ('XUFLXX.PHP', 'Florida'), ('XNWSXX.PHP', 'Northwestern'), ('XUNCXX.PHP', 'North Carolina'), ('XSYRXX.PHP', 'Syracuse')]

Adding >(.*?)< had the effect of extending the search and returning everything between the greater-than and less-than signs. This is returned as a list of tuples. Note that regular expressions are complicated, and more often than not they will return either nothing or the entire text of the document. Trial, error, and reading are the only ways forward.

I want to remove any duplicates by turning the returned list into a set, and then back into a list.


In [10]:
print len(teams)
teams=list(set(teams))
print len(teams)


200
100

I also want to store it in a more useful format. I might forget later on whether the team or the URL was first in the tuple. Can't go wrong with a list of dictionaries.


In [11]:
teams=[{'team id':t[0],'team name':t[1]} for t in teams]
print teams[:5]


[{'team name': 'Lehigh', 'team id': 'XLEHXX.PHP'}, {'team name': 'Columbia', 'team id': 'XCMBXX.PHP'}, {'team name': 'Boston University', 'team id': 'XBOUXX.PHP'}, {'team name': 'Princeton', 'team id': 'XPRIXX.PHP'}, {'team name': 'Quinnipiac', 'team id': 'XQUIXX.PHP'}]

Now that I know all the teams and where to get information about them, I want to go to each of those pages and get the information about each game: who, when, where, and, if it has already been played, what the score was. A quick look at the source for a page shows that the information is stored in an HTML table. This is good news, since data presented in other ways can be hard to get, and some formats, such as content displayed using Flash, can be impossible.

I'm going to use the BeautifulSoup module to help parse the HTML. Regular expressions can get you pretty far, but modules like BeautifulSoup can save you a lot of time. They are much easier to use if you already know things like what a DOM element is, but are still usable for those who don't code web pages.

After downloading, opening, and soupifying the page (see the function below), you can extract the table with a simple table = soup.find("table"), while rows = table.find_all("tr") will identify each of the rows. You might not want the first or last rows, depending on how the information is presented, so you can slice the list by appending something like [1:], which skips the first row, or [:-1], which drops the last one.

Within each row, you can extract a list of the cells with cells = row.findAll('td'). Another powerful feature of BeautifulSoup is that you can strip out the HTML formatting with `.get_text()`, which is a lot more efficient than a complicated regular expression that might not always work. In my case, I'm not going to sort through the contents of each of the cells here. Since I want all the information, I'm just going to dump it to a file and organize it later.
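Putting those pieces together, a minimal sketch of the pattern looks something like this (with a tiny made-up table standing in for a real page, just for illustration):

from bs4 import BeautifulSoup

toy_html = '<table><tr><th>Opponent</th><th>Score</th></tr><tr><td>Example College</td><td>10</td></tr></table>'
soup = BeautifulSoup(toy_html)
table = soup.find("table")
for row in table.find_all("tr")[1:]:
    # [1:] skips the header row; each remaining row becomes a list of cell strings
    cells = [cell.get_text() for cell in row.find_all("td")]
    print cells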

My `get_team_page` function downloads the page, extracts the contents of all the informative rows of the table, and returns them as a list of lists. In retrospect, this should probably be split into two functions, one that downloads the page and another that extracts the table information. That second function would be useful in other contexts, so I could reuse it in other projects.


In [12]:
from bs4 import BeautifulSoup

def get_team_page(team):
    # build the URL for the team's detailed schedule page
    team_url='http://www.laxpower.com/update13/binwom/%s?tab=detail' % team['team id']
    team_html=urllib2.urlopen(team_url).read()
    soup = BeautifulSoup(team_html.decode('utf-8', 'ignore'))
    # the game information is in the first table on the page
    table = soup.find("table")
    rows=[]
    # skip the header rows at the top and the last row of the table
    for row in table.find_all("tr")[3:-1]:
        # grab the text of each cell, stripping a stray character sequence
        data=[d.get_text().replace(u'\xa0\x96','') for d in row.findAll('td')]
        # keep the team name plus date, home/away, opponent, and scores
        outline=[team['team name']]+data[:5]
        rows.append([i.encode('utf-8') for i in outline])
    return rows

For maximum flexibility, I want to output all the data to a tab-separated file. I do this with the `csv` module. The `\t` tells the writer to use a tab instead of the default comma between items.


In [13]:
import csv
outfile=csv.writer(open('lax_13.tsv','wb'),delimiter='\t')

In order to be polite to the website, I want to pause a second between each page. Usually I try to save the contents of each page locally so that I only have to download it once. In this case, I'll be running the script every day and I want the most recent results, so I'm not going to save each page. One problem with this script is that the function above will crash if the web server is down or it hits any other sort of HTTP error. A better function would wrap the urllib2.urlopen() call in a try/except block so that it can skip over those pages, if you think that is acceptable. Otherwise, you might have it load the most recent locally saved version whenever it can't download the page. It all depends on what the data is and what you want to do with it.
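For example, a more forgiving version of the download step might look something like the sketch below. The fetch_page helper is hypothetical and isn't used in the loop that follows (which still calls the original get_team_page); it simply returns None instead of crashing when a request fails, so the calling code can decide whether to skip that team or fall back to a saved copy.

import urllib2

def fetch_page(url):
    # return the page's HTML, or None if the request fails for any reason
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        return None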

The loop that goes through each team, downloads the page, returns the table, and then writes the results to a file has a print line so that I can watch it go and see if it's getting hung up anywhere. I've commented it out here because it made this page too long.


In [14]:
from time import sleep

for team in teams:
    #print team['team name']
    game_history = get_team_page(team)
    outfile.writerows(game_history)
    #pause to be polite to the website
    sleep(1)

The resulting file, lax_13.tsv, can be read in any statistical program or in Excel if you want to do your analysis there. To check that it worked, we can open it up and print out the first few rows.


In [15]:
game_info=csv.reader(open('lax_13.tsv','rb'), delimiter='\t')

for row in list(game_info)[:5]:
    print row


['Lehigh', '02/16', 'H', 'Villanova', '10', '9']
['Lehigh', '02/23', 'A', 'Temple', '7', '14']
['Lehigh', '02/27', 'A', 'Binghamton', '11', '10']
['Lehigh', '03/05', 'H', 'Delaware', '7', '17']
['Lehigh', '03/09', 'A', 'Navy', '4', '14']

Looks good to me.

