I find sports more exciting when I know what to expect. How many points is our opponent scoring relative to how they did in the past? Is our offense shooting lights out, or are they running up the score against a struggling defense? For many sports, Las Vegas sportsbooks provide the information I want with the line (the expected difference between the home team's score and the visitor's score) and the total (the sum of the expected scores). For example, at game time on Monday, March 11th, Iona's men's basketball team was a four-point favorite (-4) over Manhattan, with a total of 117. While there may be some biases in these numbers, they are fairly accurate guides to what to expect.
At the moment, I'm interested in women's college lacrosse. Unfortunately, Vegas doesn't provide lines for lacrosse (or any college sport other than football and men's basketball). Laxpower has a power rating, which is a numerical measure of the strength of a team. Comparing two teams' ratings and adding a bit for home-field advantage does give a pretty good line, but they don't provide total projections. Since expected point differences and totals for lacrosse aren't available anywhere else, I decided to create my own lines and totals for entertainment and educational purposes. You can see the results.
Creating lines involves finding and downloading game data, developing a power rating model to rank each team, and using those models to predict future games. This general process isn't limited to constructing sports power rankings; these are the same general steps used to scrape data from websites for any sort of quantitative analysis. It's the same process my coauthors and I used, for example, to [analyze](http://nealcaren.web.unc.edu/files/2012/05/smoc.pdf) a white racist web forum.
Luckily, Laxpower has all the game information, both for games played and scheduled. I wanted to cycle through each of the team pages to get the information, but first I needed to know the URLs for all those pages. Fortunately, the ranking page has all the teams listed along with links to their pages, so I can grab the information from there. This two-step process, finding the URLs you want to download and then visiting each of them, is fairly common for this kind of web scraping research.
I start by visiting the page I want to scrape in my browser and finding an example of the type of information I want to get, such as the text "North Carolina", which I know, based on clicking it, leads to all the game information for the North Carolina team. Again in my browser, I viewed the source for the ranking page, that is, the raw HTML. In this view, I searched for "North Carolina" again so I could get a sense of what each link looked like in the code. Fortunately, the page had two pieces of information for each team listed in a way that was very easy to extract. Each link began with a `"` and ended in `PHP"`. (Actually, this is just the relative path to the URL; I'll fill in the beginning part of the URL later.) This was followed by a `>`, the school's name, and then a `<`. This is a situation where a simple regular expression would allow me to pull out the information I needed.
To get my list of all the URLs, I could take advantage of the uniform way they were listed. In the Python variant of regular expressions, the powerful combination `.*?` will find any character, repeated any number of times, until it runs into something else. So searching a text for `My .*? dog` would grab all the text between `My ` and ` dog`. In my case, I wanted to extract all the instances of text that occurred between a quotation mark and `PHP` followed by a quotation mark, so I could search for instances of `".*?PHP"` in the page's text.
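To make that concrete, here is a tiny illustration of the non-greedy match on a made-up sentence (invented for the example, nothing from the Laxpower page):

import re

# The sentence is invented purely to show how the non-greedy .*? behaves.
sentence = 'My big brown dog chased my neighbor, and my neighbor has a dog too.'

# .*? expands only as far as it needs to reach the next ' dog', so the match stops there.
print re.findall('My .*? dog', sentence)
# ['My big brown dog']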
In [7]:
import urllib2
import re
teams_html=urllib2.urlopen('http://www.laxpower.com/update13/binwom/rating01.php').read()
teams=re.findall('".*?PHP"',teams_html)
print teams[:5]
This is pretty good, but I don't want the quotation marks. I can be pickier about what I extract by using parentheses, which instruct `re` to return only the part of the match that falls between them.
In [8]:
teams=re.findall('"(.*?PHP)"',teams_html)
print teams[:5]
As I noted above, next to this is also the school's name. I can extract this as well by extending the `re` statement.
In [9]:
teams=re.findall('"(.*?PHP)">(.*?)<',teams_html)
print teams[:5]
Adding `>(.*?)<` had the effect of extending the search and returning everything between the greater-than and less-than signs. This is returned as a list of tuples. Note that regular expressions are complicated and, more often than not, will return either nothing or the entire text of the document. Trial, error, and reading are the only way forward.
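One common way things go wrong is using the greedy `.*` instead of the non-greedy `.*?`. On a made-up fragment of HTML (invented here for illustration, not copied from the ranking page), the difference looks like this:

import re

# Two links in a made-up fragment; the file names are invented for the example.
snippet = '<a href="UNC.PHP">North Carolina</a> <a href="MD.PHP">Maryland</a>'

# Greedy: .* runs out to the last PHP" it can find, lumping both links together.
print re.findall('"(.*PHP)"', snippet)
# ['UNC.PHP">North Carolina</a> <a href="MD.PHP']

# Non-greedy: .*? stops at the first PHP" after each opening quote.
print re.findall('"(.*?PHP)"', snippet)
# ['UNC.PHP', 'MD.PHP']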
I want to remove any duplicates by turning the returned list into a set, and then back into a list.
In [10]:
print len(teams)
teams=list(set(teams))
print len(teams)
I also want to store it in a more useful format. I might forget later on whether the team or the URL was first in the tuple. Can't go wrong with a list of dictionaries.
In [11]:
teams=[{'team id':t[0],'team name':t[1]} for t in teams]
print teams[:5]
Now that I know all the teams and where to get information about them, I want to go to each of those pages and get the information about each game: who, when, where, and, if it has already been played, what the score was. A quick look at the source for a team page shows that the information is stored in an HTML table. This is good news, since some other ways of presenting data on a page can be hard to extract, and some, such as information displayed using Flash, can be impossible.
I'm going to use the `BeautifulSoup` module to help parse the HTML. Regular expressions can get you pretty far, but modules like `BeautifulSoup` can save you a lot of time. They are much easier to use if you already know things like what a DOM element is, but they are still usable for those who don't code web pages.
After downloading, opening, and soupifying the page (see the function below), you can extract the table with a simple `table = soup.find("table")`, while `rows = table.find_all("tr")` will identify each of the rows. You might not want the first or last rows, depending on how the information is presented, so you can slice the list by appending something like `[1:]`, which will start after the first row.
Within each row, you can extract a list of the cells with `cells = row.findAll('td')`. Another powerful feature of `BeautifulSoup` is that you can get rid of the HTML formatting with `.get_text()`, which is a lot more efficient than a complicated regular expression that might not always work. In my case, I'm not going to sort through the contents of each of the cells here. Since I want all the information, I'm just going to dump it to a file and organize it later.
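Before getting to the real function, here is a small self-contained sketch of those `BeautifulSoup` calls on a made-up table (the dates, opponents, and scores are invented for the example, not taken from Laxpower):

from bs4 import BeautifulSoup

# A tiny invented schedule table, just to demonstrate find(), find_all(), and get_text().
html = '''<table>
<tr><th>Date</th><th>Opponent</th><th>Score</th></tr>
<tr><td>02/16</td><td><b>Opponent A</b></td><td>16-8</td></tr>
<tr><td>02/23</td><td><b>Opponent B</b></td><td>14-6</td></tr>
</table>'''

soup = BeautifulSoup(html)
table = soup.find("table")

# Skip the header row with [1:], then strip the HTML out of each cell.
for row in table.find_all("tr")[1:]:
    print [cell.get_text() for cell in row.findAll('td')]

Each pass through the loop prints the list of cell texts for one game row.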
My `get_team_page` function downloads the page, extracts the contents of all the informative rows of the table, and returns them as a list of lists. In retrospect, this should probably be split into two functions, one that downloads the page and another that extracts the table information. That second function would be useful in other contexts, so I could reuse it in other projects.
In [12]:
from bs4 import BeautifulSoup
def get_team_page(team):
    # Build the URL for the team's detailed schedule page and download it
    team_url='http://www.laxpower.com/update13/binwom/%s?tab=detail' % team['team id']
    team_html=urllib2.urlopen(team_url).read()
    soup = BeautifulSoup(team_html.decode('utf-8', 'ignore'))
    # The game information lives in the first table on the page
    table = soup.find("table")
    rows=[]
    # Keep only the informative rows: drop the first three and the last
    for row in table.find_all("tr")[3:-1]:
        # Strip out the HTML and a stray character combination from each cell
        data=[d.get_text().replace(u'\xa0\x96','') for d in row.findAll('td')]
        # Keep the team name plus the first five columns of game information
        outline=[team['team name']]+data[:5]
        rows.append([i.encode('utf-8') for i in outline])
    return rows
For maximum flexibility, I want to output all the data to a tab-separated file. I do this with the `csv` module. The `\t` tells the writer to use a tab instead of the default comma between items.
In [13]:
import csv
outfile=csv.writer(open('lax_13.tsv','wb'),delimiter='\t')
In order to be polite to the website, I want to pause a second between each page. Usually I try to save the contents of each page locally so that I only have to download it once. In this case, I'll be running the script every day and I want the most recent results, so I'm not going to save each page. One problem with this script is that the function above will crash if the web server is down or returns any other sort of HTTP error. A better function would put the `urllib2.urlopen()` call inside a `try:` block so that it can skip over those pages, if you think that is acceptable. Otherwise, you might have it load the most recent locally saved version of any page it can't download. It all depends on what the data is and what you want to do with it.
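As a sketch of that first option, a small wrapper along these lines (hypothetical, not part of the script above) would let the loop skip a page that fails to download instead of crashing:

import urllib2

def fetch_page(url):
    # Return the page's HTML, or None if the request fails for any reason.
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        # URLError also covers HTTP errors and a down or unreachable server.
        print 'Could not download %s' % url
        return None

The calling code would then check for `None` before trying to parse the page.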
The loop that goes through each team, downloads the page, returns the table, and then writes the results to a file has a `print` line so that I can watch it go and see if it's getting hung up anywhere. I've commented it out here because it made this page too long.
In [14]:
from time import sleep
for team in teams:
    #print team['team name']
    game_history = get_team_page(team)
    outfile.writerows(game_history)
    # pause to be polite to the website
    sleep(1)
The resulting file, `lax_13.tsv`, can be read in any statistical program or in Excel if you want to do your analysis there. To check that it worked, we can open it up and print out the first few rows.
In [15]:
game_info=csv.reader(open('lax_13.tsv','rb'), delimiter='\t')
for row in list(game_info)[:5]:
print row
Looks good to me.