Scrape PyCon Talk Data from the Web

Second step in creating a talk recommender for PyCon: get last year's talk data for a bigger corpus.
The rest of the project can be found on GitHub: https://github.com/mikecunha/pycon_reco

Dependencies

Version info from watermark and pip freeze is at the end of the notebook.


In [1]:
from datetime import datetime
from time import sleep
import re
from pprint import PrettyPrinter
from urllib.request import urlopen

from bs4 import BeautifulSoup # http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4.element import NavigableString
from markdown import markdown
import pyprind  # progress bar, e.g. here: http://nbviewer.ipython.org/github/rasbt/pyprind/blob/master/examples/pyprind_demo.ipynb
import pandas as pd

In [5]:
talk_sched_html = urlopen("https://us.pycon.org/2014/schedule/talks/")
tut_sched_html = urlopen("https://us.pycon.org/2014/schedule/tutorials/")

talk_sched_soup = BeautifulSoup( talk_sched_html.read(), "html.parser" )
tut_sched_soup = BeautifulSoup( tut_sched_html.read(), "html.parser" )

In [6]:
to_scrape = []

talk_links = talk_sched_soup.select("td.slot-talk span.title a")
tut_links = tut_sched_soup.select("td.slot-tutorial span.title a")

for t in talk_links + tut_links:
    to_scrape.append( t.attrs.get('href') )
    
list(enumerate(to_scrape))[-5:]


Out[6]:
[(126, '/2014/schedule/presentation/67/'),
 (127, '/2014/schedule/presentation/78/'),
 (128, '/2014/schedule/presentation/80/'),
 (129, '/2014/schedule/presentation/55/'),
 (130, '/2014/schedule/presentation/69/')]
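
The CSS selectors above grab the anchor inside each talk's title span. A minimal sketch of how `select` behaves, run against an invented fragment that mimics the schedule markup (the HTML here is illustrative, not copied from pycon.org):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one cell of the schedule table
html = """
<table><tr>
  <td class="slot-talk"><span class="title">
    <a href="/2014/schedule/presentation/67/">Example Talk</a>
  </span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as the notebook: descend td -> span -> a by class
links = [a.attrs.get("href") for a in soup.select("td.slot-talk span.title a")]
print(links)  # ['/2014/schedule/presentation/67/']
```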

In [7]:
# Scrape all the talk html pages
soups = {}
perc = pyprind.ProgPercent( len(to_scrape) )

for relative_url in to_scrape:
    
    perc.update()
    
    uri = "https://us.pycon.org" + relative_url
    talk_html = urlopen( uri )
    soups[uri] = BeautifulSoup( talk_html.read() )
    
    sleep(0.5) # Be nice.


[100 %] elapsed[sec]: 208.604 | ETA[sec]: 0.000 
Total time elapsed: 208.604 sec
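
The loop above has no error handling, so one failed request kills the whole scrape partway through. A hedged sketch of a fetch helper with a timeout and simple retry/backoff (the `fetch` name and its parameters are my own, not from the notebook):

```python
from time import sleep
from urllib.request import urlopen
from urllib.error import URLError

def fetch(url, retries=3, delay=1.0, timeout=10):
    """Fetch a URL as bytes, retrying on transient failures."""
    for attempt in range(retries):
        try:
            return urlopen(url, timeout=timeout).read()
        except URLError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(delay * (attempt + 1))  # simple linear backoff
```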

In [8]:
talks = []

for uri, soup in soups.items():
    
    talk = {}
    
    content = soup.find(attrs={"class":"box-content"})

    elements = content.find_all("dd")
    talk['level'], talk['category'] = [ e.get_text(strip=True) for e in elements ]

    elements = content.find_all("h4")
    talk['str_time'], talk['author'] = [ e.get_text(strip=True) for e in elements ]

    talk['desc'] = soup.find(attrs={"class":"description"}).get_text(strip=True)
    
    # Abstracts contain some unparsed markdown: render it, then strip the tags
    abstract = soup.find(attrs={"class":"abstract"}).get_text(strip=True)
    html = markdown( abstract )
    abstract = ''.join(BeautifulSoup( html, "html.parser" ).find_all(string=True))
    
    talk['abstract'] = abstract.replace("\n"," ")

    talk['title'] = content.find("h2").get_text(strip=True)
    
    talks.append( talk )
    
talks = pd.DataFrame( talks )
talks.head()


Out[8]:
abstract author category desc level str_time title
0 Generators: The Final Frontier This tutorial e... David Beazley Python Core (language, stdlib, etc.) Python generators have long been useful for so... Experienced Thursday\n 9 a.m.–12:20 p.m. Generators: The Final Frontier
1 To be useful and trustworthy, specifications a... Catherine Devlin Education Code executes. Docs just sit there looking pr... Intermediate Sunday\n 1:50 p.m.–2:20 p.m. See Docs Run. Run, Docs, Run!
2 This tutorial is a systematic introduction to ... Mike Müller Python Core (language, stdlib, etc.) Descriptors and metaclasses are advanced Pytho... Intermediate Wednesday\n 9 a.m.–12:20 p.m. Descriptors and Metaclasses - Understanding an...
3 Since I started doing free-lance python web-ap... Kate Heddleston Best Practices & Patterns This is a talk about building full-stack pytho... Intermediate Friday\n 11:30 a.m.–noon So you want to be a full-stack developer? How ...
4 There are a lot of things to think about when ... Carl Meyer Best Practices & Patterns Got some code that you've written that would b... Novice Sunday\n 2:30 p.m.–3 p.m. Set your code free: releasing and maintaining ...
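
The abstract cleanup inside the loop above is a small trick worth seeing in isolation: render the Markdown to HTML, then keep only the text nodes (assumes the same `markdown` package as the notebook; the sample string is invented):

```python
from bs4 import BeautifulSoup
from markdown import markdown

raw = "Generators are *useful* for `iteration`."
html = markdown(raw)  # '<p>Generators are <em>useful</em> for <code>iteration</code>.</p>'

# Joining the text nodes drops the tags but keeps their contents
plain = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))
print(plain)  # Generators are useful for iteration.
```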

In [11]:
day_to_date = {'Wednesday': 'Apr 9 2014 ',
               'Thursday': 'Apr 10 2014 ',
               'Friday': 'Apr 11 2014 ',
               'Saturday': 'Apr 12 2014 ',
               'Sunday': 'Apr 13 2014 ',
               }

def parse_dt( dt ):
    """ Convert string to datetime """
    
    day, t = [ x.strip() for x in dt.split('\n') ]
    
    start, end = [ x.replace('.', '').replace(' ','').upper() for x in t.split("–") ]
    
    if end == "NOON":
        end = "12:00PM"
    elif ':' not in end:
        end = end[:-2] + ":00" + end[-2:]  # "9AM" -> "9:00AM", "10AM" -> "10:00AM"
    if ':' not in start:
        start = start[:-2] + ":00" + start[-2:]
    
    try:
        start = datetime.strptime( day_to_date[day] + start + ' EDT', '%b %d %Y %I:%M%p %Z' )
    except ValueError:
        print("error converting start time:", start)
    try:
        end = datetime.strptime( day_to_date[day] + end + ' EDT', '%b %d %Y %I:%M%p %Z' )
    except ValueError:
        print("error converting end time:", end)
    
    return day, start, end
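
The time-normalization branch is the fiddly part of `parse_dt`; pulled out on its own it looks like this (the `norm` helper is illustrative, not part of the notebook):

```python
def norm(t):
    """Normalize a schedule time fragment to H:MM(A|P)M form."""
    t = t.replace('.', '').replace(' ', '').upper()  # "9 a.m." -> "9AM"
    if t == "NOON":
        return "12:00PM"
    if ':' not in t:
        # Insert ":00" before the AM/PM suffix; works for 1- and 2-digit hours
        return t[:-2] + ":00" + t[-2:]
    return t

print(norm("9 a.m."))     # 9:00AM
print(norm("10 a.m."))    # 10:00AM
print(norm("noon"))       # 12:00PM
print(norm("1:50 p.m."))  # 1:50PM
```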

In [12]:
talks["weekday"], talks["start_dt"], talks["end_dt"] = zip(*talks["str_time"].map(parse_dt))

del talks["str_time"]
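
The `zip(*...)` line above transposes a Series of 3-tuples into three parallel columns. The same idiom on a toy Series (data invented for illustration):

```python
import pandas as pd

s = pd.Series(["a:1", "b:2"])

# map yields a Series of tuples; zip(*...) transposes them into parallel tuples,
# one per output column, ready to assign back to the DataFrame
letters, nums = zip(*s.map(lambda x: tuple(x.split(':'))))
print(letters)  # ('a', 'b')
print(nums)     # ('1', '2')
```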

In [13]:
talks.head()


Out[13]:
abstract author category desc level title weekday start_dt end_dt
0 Generators: The Final Frontier This tutorial e... David Beazley Python Core (language, stdlib, etc.) Python generators have long been useful for so... Experienced Generators: The Final Frontier Thursday 2014-04-10 09:00:00 2014-04-10 12:20:00
1 To be useful and trustworthy, specifications a... Catherine Devlin Education Code executes. Docs just sit there looking pr... Intermediate See Docs Run. Run, Docs, Run! Sunday 2014-04-13 13:50:00 2014-04-13 14:20:00
2 This tutorial is a systematic introduction to ... Mike Müller Python Core (language, stdlib, etc.) Descriptors and metaclasses are advanced Pytho... Intermediate Descriptors and Metaclasses - Understanding an... Wednesday 2014-04-09 09:00:00 2014-04-09 12:20:00
3 Since I started doing free-lance python web-ap... Kate Heddleston Best Practices & Patterns This is a talk about building full-stack pytho... Intermediate So you want to be a full-stack developer? How ... Friday 2014-04-11 11:30:00 2014-04-11 12:00:00
4 There are a lot of things to think about when ... Carl Meyer Best Practices & Patterns Got some code that you've written that would b... Novice Set your code free: releasing and maintaining ... Sunday 2014-04-13 14:30:00 2014-04-13 15:00:00

Ignore the old keynotes for now and just save the talk and tutorial data.


In [33]:
talks.to_csv( 'data/pycon_talks_2014.csv', sep="\t", index=False )
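
Later steps in the project can read the file back with the matching separator; a sketch using an in-memory buffer in place of the file (the real call would point at `data/pycon_talks_2014.csv`):

```python
import io
import pandas as pd

# Stand-in for the saved TSV; note the datetimes round-trip as strings
# unless you ask read_csv to parse them
tsv = "title\tstart_dt\nSee Docs Run\t2014-04-13 13:50:00\n"
df = pd.read_csv(io.StringIO(tsv), sep='\t', parse_dates=['start_dt'])
print(df['start_dt'].dtype)  # datetime64[ns]
```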

System Info


In [33]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark


15/03/2015 15:25:15

CPython 3.4.2
IPython 2.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

In [1]:
import pip
sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()])


Out[1]:
['beautifulsoup4==4.3.2',
 'bottle==0.12.8',
 'certifi==14.05.14',
 'd3py==0.2.3',
 'db.py==0.3.1',
 'decorator==3.4.0',
 'gensim==0.10.3',
 'gnureadline==6.3.3',
 'ipython==2.3.0',
 'jdcal==1.0',
 'jinja2==2.7.3',
 'markdown==2.6',
 'markupsafe==0.23',
 'matplotlib==1.4.2',
 'mpld3==0.2',
 'networkx==1.9.1',
 'nltk==3.0.0',
 'nose==1.3.4',
 'numpy==1.9.0',
 'oauthlib==0.7.2',
 'openpyxl==2.1.5',
 'pandas==0.15.0',
 'patsy==0.3.0',
 'prettytable==0.7.2',
 'progressbar33==2.4',
 'pygments==2.0.2',
 'pymongo==2.7.2',
 'pyparsing==2.0.3',
 'pyprind==2.9.1',
 'python-dateutil==2.2',
 'pytz==2014.7',
 'pyzmq==14.4.1',
 'qgrid==0.1.1',
 'requests-oauthlib==0.4.1',
 'requests==2.3.0',
 'scikit-learn==0.15.2',
 'scipy==0.14.0',
 'six==1.7.3',
 'statsmodels==0.6.1',
 'tornado==4.0.2',
 'tweepy==2.3',
 'vincent==0.4.4']

In [ ]: