Scrape PyCon Talk Data from the Web

Second step in creating a talk recommender for PyCon: get last year's talk data for a bigger corpus.
The rest of the project can be found on GitHub: https://github.com/mikecunha/pycon_reco

Dependencies

Version info from watermark and pip freeze is at the end of the notebook.


In [1]:
from datetime import datetime
from time import sleep
import re
from pprint import PrettyPrinter
from urllib.request import urlopen

from bs4 import BeautifulSoup # http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4.element import NavigableString
from markdown import markdown
import pyprind  # progress bar, e.g. here: http://nbviewer.ipython.org/github/rasbt/pyprind/blob/master/examples/pyprind_demo.ipynb
import pandas as pd

In [5]:
talk_sched_html = urlopen("https://us.pycon.org/2014/schedule/talks/")
tut_sched_html = urlopen("https://us.pycon.org/2014/schedule/tutorials/")

talk_sched_soup = BeautifulSoup( talk_sched_html.read(), "html.parser" )
tut_sched_soup = BeautifulSoup( tut_sched_html.read(), "html.parser" )

In [6]:
to_scrape = []

talk_links = talk_sched_soup.select("td.slot-talk span.title a")
tut_links = tut_sched_soup.select("td.slot-tutorial span.title a")

for t in talk_links + tut_links:
    to_scrape.append( t.attrs.get('href') )
    
list(enumerate(to_scrape))[-5:]


Out[6]:
[(126, '/2014/schedule/presentation/67/'),
 (127, '/2014/schedule/presentation/78/'),
 (128, '/2014/schedule/presentation/80/'),
 (129, '/2014/schedule/presentation/55/'),
 (130, '/2014/schedule/presentation/69/')]
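
The CSS selectors above grab the anchor inside each talk's title span. A minimal sketch of how `select` behaves, run against an invented fragment that mimics the schedule markup (the HTML here is illustrative, not copied from pycon.org):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one cell of the schedule table
html = """
<table><tr>
  <td class="slot-talk"><span class="title">
    <a href="/2014/schedule/presentation/67/">Example Talk</a>
  </span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as the notebook: descend td -> span -> a by class
links = [a.attrs.get("href") for a in soup.select("td.slot-talk span.title a")]
print(links)  # ['/2014/schedule/presentation/67/']
```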

In [7]:
# Scrape all the talk html pages
soups = {}
perc = pyprind.ProgPercent( len(to_scrape) )

for relative_url in to_scrape:
    
    perc.update()
    
    uri = "https://us.pycon.org" + relative_url
    talk_html = urlopen( uri )
    soups[uri] = BeautifulSoup( talk_html.read() )
    
    sleep(0.5) # Be nice.


[100 %] elapsed[sec]: 208.604 | ETA[sec]: 0.000 
Total time elapsed: 208.604 sec
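
The loop above has no error handling, so one failed request kills the whole scrape partway through. A hedged sketch of a fetch helper with a timeout and simple retry/backoff (the `fetch` name and its parameters are my own, not from the notebook):

```python
from time import sleep
from urllib.request import urlopen
from urllib.error import URLError

def fetch(url, retries=3, delay=1.0, timeout=10):
    """Fetch a URL as bytes, retrying on transient failures."""
    for attempt in range(retries):
        try:
            return urlopen(url, timeout=timeout).read()
        except URLError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(delay * (attempt + 1))  # simple linear backoff
```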

In [8]:
talks = []

for uri, soup in soups.items():
    
    talk = {}
    
    content = soup.find(attrs={"class":"box-content"})

    elements = content.find_all("dd")
    talk['level'], talk['category'] = [ e.get_text(strip=True) for e in elements ]

    elements = content.find_all("h4")
    talk['str_time'], talk['author'] = [ e.get_text(strip=True) for e in elements ]

    talk['desc'] = soup.find(attrs={"class":"description"}).get_text(strip=True)
    
    # Abstracts contain some unparsed markdown: render it, then strip the tags
    abstract = soup.find(attrs={"class":"abstract"}).get_text(strip=True)
    html = markdown( abstract )
    abstract = ''.join(BeautifulSoup( html, "html.parser" ).find_all(string=True))
    
    talk['abstract'] = abstract.replace("\n"," ")

    talk['title'] = content.find("h2").get_text(strip=True)
    
    talks.append( talk )
    
talks = pd.DataFrame( talks )
talks.head()


Out[8]:
abstract author category desc level str_time title
0 Generators: The Final Frontier This tutorial e... David Beazley Python Core (language, stdlib, etc.) Python generators have long been useful for so... Experienced Thursday\n 9 a.m.–12:20 p.m. Generators: The Final Frontier
1 To be useful and trustworthy, specifications a... Catherine Devlin Education Code executes. Docs just sit there looking pr... Intermediate Sunday\n 1:50 p.m.–2:20 p.m. See Docs Run. Run, Docs, Run!
2 This tutorial is a systematic introduction to ... Mike Müller Python Core (language, stdlib, etc.) Descriptors and metaclasses are advanced Pytho... Intermediate Wednesday\n 9 a.m.–12:20 p.m. Descriptors and Metaclasses - Understanding an...
3 Since I started doing free-lance python web-ap... Kate Heddleston Best Practices & Patterns This is a talk about building full-stack pytho... Intermediate Friday\n 11:30 a.m.–noon So you want to be a full-stack developer? How ...
4 There are a lot of things to think about when ... Carl Meyer Best Practices & Patterns Got some code that you've written that would b... Novice Sunday\n 2:30 p.m.–3 p.m. Set your code free: releasing and maintaining ...
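
The abstract cleanup inside the loop above is a small trick worth seeing in isolation: render the Markdown to HTML, then keep only the text nodes (assumes the same `markdown` package as the notebook; the sample string is invented):

```python
from bs4 import BeautifulSoup
from markdown import markdown

raw = "Generators are *useful* for `iteration`."
html = markdown(raw)  # '<p>Generators are <em>useful</em> for <code>iteration</code>.</p>'

# Joining the text nodes drops the tags but keeps their contents
plain = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))
print(plain)  # Generators are useful for iteration.
```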

In [11]:
day_to_date = {'Wednesday': 'Apr 9 2014 ',
               'Thursday': 'Apr 10 2014 ',
               'Friday': 'Apr 11 2014 ',
               'Saturday': 'Apr 12 2014 ',
               'Sunday': 'Apr 13 2014 ',
               }

def parse_dt( dt ):
    """ Convert string to datetime """
    
    day, t = [ x.strip() for x in dt.split('\n') ]
    
    start, end = [ x.replace('.', '').replace(' ','').upper() for x in t.split("–") ]
    
    if end == "NOON":
        end = "12:00PM"
    elif ':' not in end:
        end = end[:-2] + ":00" + end[-2:]  # "9AM" -> "9:00AM", "10AM" -> "10:00AM"
    if ':' not in start:
        start = start[:-2] + ":00" + start[-2:]
    
    try:
        start = datetime.strptime( day_to_date[day] + start + ' EDT', '%b %d %Y %I:%M%p %Z' )
    except ValueError:
        print("error converting start time:", start)
    try:
        end = datetime.strptime( day_to_date[day] + end + ' EDT', '%b %d %Y %I:%M%p %Z' )
    except ValueError:
        print("error converting end time:", end)
    
    return day, start, end
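
The time-normalization branch is the fiddly part of `parse_dt`; pulled out on its own it looks like this (the `norm` helper is illustrative, not part of the notebook):

```python
def norm(t):
    """Normalize a schedule time fragment to H:MM(A|P)M form."""
    t = t.replace('.', '').replace(' ', '').upper()  # "9 a.m." -> "9AM"
    if t == "NOON":
        return "12:00PM"
    if ':' not in t:
        # Insert ":00" before the AM/PM suffix; works for 1- and 2-digit hours
        return t[:-2] + ":00" + t[-2:]
    return t

print(norm("9 a.m."))     # 9:00AM
print(norm("10 a.m."))    # 10:00AM
print(norm("noon"))       # 12:00PM
print(norm("1:50 p.m."))  # 1:50PM
```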

In [12]:
talks["weekday"], talks["start_dt"], talks["end_dt"] = zip(*talks["str_time"].map(parse_dt))

del talks["str_time"]
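
The `zip(*...)` line above transposes a Series of 3-tuples into three parallel columns. The same idiom on a toy Series (data invented for illustration):

```python
import pandas as pd

s = pd.Series(["a:1", "b:2"])

# map yields a Series of tuples; zip(*...) transposes them into parallel tuples,
# one per output column, ready to assign back to the DataFrame
letters, nums = zip(*s.map(lambda x: tuple(x.split(':'))))
print(letters)  # ('a', 'b')
print(nums)     # ('1', '2')
```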

In [13]:
talks.head()


Out[13]:
abstract author category desc level title weekday start_dt end_dt
0 Generators: The Final Frontier This tutorial e... David Beazley Python Core (language, stdlib, etc.) Python generators have long been useful for so... Experienced Generators: The Final Frontier Thursday 2014-04-10 09:00:00 2014-04-10 12:20:00
1 To be useful and trustworthy, specifications a... Catherine Devlin Education Code executes. Docs just sit there looking pr... Intermediate See Docs Run. Run, Docs, Run! Sunday 2014-04-13 13:50:00 2014-04-13 14:20:00
2 This tutorial is a systematic introduction to ... Mike Müller Python Core (language, stdlib, etc.) Descriptors and metaclasses are advanced Pytho... Intermediate Descriptors and Metaclasses - Understanding an... Wednesday 2014-04-09 09:00:00 2014-04-09 12:20:00
3 Since I started doing free-lance python web-ap... Kate Heddleston Best Practices & Patterns This is a talk about building full-stack pytho... Intermediate So you want to be a full-stack developer? How ... Friday 2014-04-11 11:30:00 2014-04-11 12:00:00
4 There are a lot of things to think about when ... Carl Meyer Best Practices & Patterns Got some code that you've written that would b... Novice Set your code free: releasing and maintaining ... Sunday 2014-04-13 14:30:00 2014-04-13 15:00:00

Ignore the old keynotes for now and just save the talk and tutorial data.


In [33]:
talks.to_csv( 'data/pycon_talks_2014.csv', sep="\t", index=False )
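
Later steps in the project can read the file back with the matching separator; a sketch using an in-memory buffer in place of the file (the real call would point at `data/pycon_talks_2014.csv`):

```python
import io
import pandas as pd

# Stand-in for the saved TSV; note the datetimes round-trip as strings
# unless you ask read_csv to parse them
tsv = "title\tstart_dt\nSee Docs Run\t2014-04-13 13:50:00\n"
df = pd.read_csv(io.StringIO(tsv), sep='\t', parse_dates=['start_dt'])
print(df['start_dt'].dtype)  # datetime64[ns]
```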

System Info


In [33]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark


15/03/2015 15:25:15

CPython 3.4.2
IPython 2.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

In [1]:
import pip
sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()])


Out[1]:
['beautifulsoup4==4.3.2',
 'bottle==0.12.8',
 'certifi==14.05.14',
 'd3py==0.2.3',
 'db.py==0.3.1',
 'decorator==3.4.0',
 'gensim==0.10.3',
 'gnureadline==6.3.3',
 'ipython==2.3.0',
 'jdcal==1.0',
 'jinja2==2.7.3',
 'markdown==2.6',
 'markupsafe==0.23',
 'matplotlib==1.4.2',
 'mpld3==0.2',
 'networkx==1.9.1',
 'nltk==3.0.0',
 'nose==1.3.4',
 'numpy==1.9.0',
 'oauthlib==0.7.2',
 'openpyxl==2.1.5',
 'pandas==0.15.0',
 'patsy==0.3.0',
 'prettytable==0.7.2',
 'progressbar33==2.4',
 'pygments==2.0.2',
 'pymongo==2.7.2',
 'pyparsing==2.0.3',
 'pyprind==2.9.1',
 'python-dateutil==2.2',
 'pytz==2014.7',
 'pyzmq==14.4.1',
 'qgrid==0.1.1',
 'requests-oauthlib==0.4.1',
 'requests==2.3.0',
 'scikit-learn==0.15.2',
 'scipy==0.14.0',
 'six==1.7.3',
 'statsmodels==0.6.1',
 'tornado==4.0.2',
 'tweepy==2.3',
 'vincent==0.4.4']

In [ ]: