Scrape PyCon Talk Data from the Web

This notebook is the first step in building a talk recommender for PyCon.
The rest of the project can be found on GitHub: https://github.com/mikecunha/pycon_reco

Dependencies

Version information from watermark and pip is listed at the end of the notebook.


In [2]:
from datetime import datetime
from time import sleep
import re
from pprint import PrettyPrinter
from urllib.request import urlopen

from bs4 import BeautifulSoup # http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4.element import NavigableString
from markdown import markdown
import pyprind  # progress bar, e.g. here: http://nbviewer.ipython.org/github/rasbt/pyprind/blob/master/examples/pyprind_demo.ipynb
import pandas as pd

In [7]:
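sched_html = urlopen("https://us.pycon.org/2015/schedule/")

# urlopen raises HTTPError for most failures; this guards any other non-200 reply
if sched_html.status != 200:
    raise RuntimeError('Error fetching schedule: %s' % sched_html.status)

# Name the parser explicitly so results don't depend on which parsers are installed
sched_soup = BeautifulSoup( sched_html.read(), "html.parser" )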

In [8]:
to_scrape = []

talk_links = sched_soup.select("td.slot-talk span.title a")
tut_links = sched_soup.select("td.slot-tutorial span.title a")

for t in talk_links + tut_links:
    to_scrape.append( t.attrs.get('href') )
    
list(enumerate(to_scrape))[-5:]


Out[8]:
[(126, '/2015/schedule/presentation/330/'),
 (127, '/2015/schedule/presentation/322/'),
 (128, '/2015/schedule/presentation/318/'),
 (129, '/2015/schedule/presentation/299/'),
 (130, '/2015/schedule/presentation/466/')]
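
The hrefs collected above are site-relative, and `attrs.get('href')` returns None for an anchor without one. A minimal sketch of making them absolute and dropping duplicates before scraping; the helper name `absolute_unique` and the sample list are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://us.pycon.org"


def absolute_unique(hrefs, base=BASE_URL):
    """Return absolute URLs in first-seen order, skipping missing hrefs."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:          # attrs.get('href') yields None when absent
            continue
        url = urljoin(base, href)
        if url not in seen:   # keep only the first occurrence
            seen.add(url)
            urls.append(url)
    return urls


# Illustrative input: one duplicate and one missing href
sample = ['/2015/schedule/presentation/330/',
          None,
          '/2015/schedule/presentation/330/',
          '/2015/schedule/presentation/322/']
urls = absolute_unique(sample)
```

With this in place the scraping loop could iterate over full URLs directly instead of concatenating strings.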

In [9]:
# Scrape all the talk html pages
soups = {}
perc = pyprind.ProgPercent( len(to_scrape) )

for relative_url in to_scrape:
    
    perc.update()
    
    uri = "https://us.pycon.org" + relative_url
    talk_html = urlopen( uri )
    soups[uri] = BeautifulSoup( talk_html.read(), "html.parser" )
    
    sleep(0.5) # Be nice.


[100 %] elapsed[sec]: 407.227 | ETA[sec]: 0.000 
Total time elapsed: 407.227 sec
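
A run of this length (about seven minutes) is lost if a single request fails partway through. The following is a sketch, under assumptions, of a retrying fetch that could stand in for the bare `urlopen` call; `fetch_with_retry`, its parameters, and the stand-in opener are hypothetical names introduced here for illustration.

```python
import io
from time import sleep
from urllib.error import URLError
from urllib.request import urlopen


def fetch_with_retry(url, opener=urlopen, retries=3, delay=0.5):
    """Return the body of `url`, retrying when the opener raises URLError."""
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except URLError as err:  # HTTPError is a subclass of URLError
            last_error = err
            if attempt < retries - 1:
                sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error


# Offline demonstration: a stand-in opener that fails twice, then succeeds.
attempts = []

def flaky_opener(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise URLError("temporary failure")
    return io.BytesIO(b"<html></html>")

body = fetch_with_retry("https://example.invalid/talk/1/",
                        opener=flaky_opener, delay=0)
```

Passing the opener as an argument keeps the function testable without touching the network; in the notebook it would simply default to `urlopen`.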

In [17]:
talks = []

for uri, soup in soups.items():
    
    talk = {}
    
    content = soup.find(attrs={"class":"box-content"})

    elements = content.find_all("dd")
    talk['level'], talk['category'] = [ e.get_text(strip=True) for e in elements ]

    elements = content.find_all("h4")
    talk['str_time'], talk['author'] = [ e.get_text(strip=True) for e in elements ]

    talk['desc'] = soup.find(attrs={"class":"description"}).get_text(strip=True)
    
    # Abstracts contain some unparsed markdown
    abstract = soup.find(attrs={"class":"abstract"}).get_text(strip=True)
    html = markdown( abstract )
    abstract = ''.join(BeautifulSoup(html, "html.parser").findAll(text=True))
    
    talk['abstract'] = abstract.replace("\n"," ")

    talk['title'] = content.find("h2").get_text(strip=True)
    
    talks.append( talk )
    
talks = pd.DataFrame( talks )
talks.head()


Out[17]:
  | abstract | author | category | desc | level | str_time | title
0 | Without virtual environments, your installed l... | Renee Chu,Matt Makai | Python Core (language, stdlib, etc.) | Even though it’s possible to program without u... | Novice | Friday\n 4:30 p.m.–5 p.m. | Don't Make Us Say We Told You So: virtualenv f...
1 | overview This tutorial aims to teach participa... | Andrew Seier,Étienne Tétreault-Pinard,Marianne... | Python Libraries | From Python basics to NYT-quality graphics, we... | Novice | Thursday\n 9 a.m.–12:20 p.m. | Making Beautiful Graphs in Python and Sharing ...
2 | Distributed systems are a fairly advanced fiel... | lvh | Best Practices & Patterns | A very brief introduction to the theory and pr... | Intermediate | Friday\n 1:40 p.m.–2:25 p.m. | Distributed Systems 101
3 | Are you interested in learning how to orchestr... | Luke Sneeringer | Systems Administration | Interested in Ansible, or in server orchestrat... | Intermediate | Thursday\n 9 a.m.–12:20 p.m. | Ansible 101
4 | Software engineers are never done learning as ... | Sasha Laundy | Community | Software engineers are never done learning sin... | Novice | Saturday\n 4:30 p.m.–5 p.m. | Your Brain's API: Giving and Getting Technical...

In [19]:
day_to_date = {'Wednesday': 'Apr 8 2015 ',
               'Thursday': 'Apr 9 2015 ',
               'Friday': 'Apr 10 2015 ',
               'Saturday': 'Apr 11 2015 ',
               'Sunday': 'Apr 12 2015 ',
               }

def parse_dt( dt ):
    """ Convert a schedule time string to (day, start datetime, end datetime) """
    
    day, t = [ x.strip() for x in dt.split('\n') ]
    
    def normalize( s ):
        """ '9 a.m.' -> '9:00AM', 'Noon' -> '12:00PM' """
        s = s.replace('.', '').replace(' ','').upper()
        if s == "NOON":
            return "12:00PM"
        if ':' not in s:
            # keep the whole hour, so '10AM' becomes '10:00AM' rather than '1:00AM'
            s = s[:-2] + ":00" + s[-2:]
        return s
    
    start, end = [ normalize(x) for x in t.split("–") ]
    
    # strptime's %Z only recognises the local zone's names (plus UTC/GMT) and
    # returns a naive datetime either way, so the 'EDT' suffix is dropped
    try:
        start = datetime.strptime( day_to_date[day] + start, '%b %d %Y %I:%M%p' )
    except ValueError:
        print ("error converting start time: ", start)
    try:    
        end = datetime.strptime( day_to_date[day] + end, '%b %d %Y %I:%M%p' )
    except ValueError:
        print ("error converting end time: ", end)
    
    return day, start, end
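
As an alternative to chained string replacements, the clock strings can be normalised with one regular expression. This is a sketch rather than the notebook's method: `normalize_clock`, `parse_range` and `CLOCK_RE` are names made up here, and the accepted formats ('9 a.m.', '12:20 p.m.', 'Noon') are assumed from the schedule strings shown earlier.

```python
import re
from datetime import datetime

# hour, optional ':minutes', then a.m./p.m. with or without dots
CLOCK_RE = re.compile(r"^(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\.?$", re.IGNORECASE)


def normalize_clock(text):
    """Turn '9 a.m.', '12:20 p.m.' or 'Noon' into a '%I:%M%p' string."""
    text = text.strip()
    if text.lower() == "noon":
        return "12:00PM"
    match = CLOCK_RE.match(text)
    if match is None:
        raise ValueError("unrecognised time: %r" % text)
    hour, minute, meridiem = match.groups()
    return "%s:%s%sM" % (hour, minute or "00", meridiem.upper())


def parse_range(date_str, time_range):
    """Parse 'start–end' (en dash) on the given date into two datetimes."""
    start, end = (normalize_clock(part) for part in time_range.split("–"))
    fmt = "%b %d %Y %I:%M%p"
    return (datetime.strptime(date_str + " " + start, fmt),
            datetime.strptime(date_str + " " + end, fmt))


start_dt, end_dt = parse_range("Apr 9 2015", "9 a.m.–12:20 p.m.")
```

Raising on an unrecognised string, instead of printing and carrying on, makes a malformed schedule entry visible immediately.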

In [20]:
talks["weekday"], talks["start_dt"], talks["end_dt"] = zip(*talks["str_time"].map(parse_dt))

del talks["str_time"]

In [21]:
talks.head()


Out[21]:
  | abstract | author | category | desc | level | title | weekday | start_dt | end_dt
0 | Without virtual environments, your installed l... | Renee Chu,Matt Makai | Python Core (language, stdlib, etc.) | Even though it’s possible to program without u... | Novice | Don't Make Us Say We Told You So: virtualenv f... | Friday | 2015-04-10 16:30:00 | 2015-04-10 17:00:00
1 | overview This tutorial aims to teach participa... | Andrew Seier,Étienne Tétreault-Pinard,Marianne... | Python Libraries | From Python basics to NYT-quality graphics, we... | Novice | Making Beautiful Graphs in Python and Sharing ... | Thursday | 2015-04-09 09:00:00 | 2015-04-09 12:20:00
2 | Distributed systems are a fairly advanced fiel... | lvh | Best Practices & Patterns | A very brief introduction to the theory and pr... | Intermediate | Distributed Systems 101 | Friday | 2015-04-10 13:40:00 | 2015-04-10 14:25:00
3 | Are you interested in learning how to orchestr... | Luke Sneeringer | Systems Administration | Interested in Ansible, or in server orchestrat... | Intermediate | Ansible 101 | Thursday | 2015-04-09 09:00:00 | 2015-04-09 12:20:00
4 | Software engineers are never done learning as ... | Sasha Laundy | Community | Software engineers are never done learning sin... | Novice | Your Brain's API: Giving and Getting Technical... | Saturday | 2015-04-11 16:30:00 | 2015-04-11 17:00:00

Keynotes are not among the talk and tutorial links collected above, so their information has to be scraped differently:


In [3]:
# Grab HTML from the keynote page, which holds the speakers' bios:
key_html = urlopen( "https://us.pycon.org/2015/events/keynotes/" )
key_soup = BeautifulSoup( key_html.read(), "html.parser" )

In [23]:
auth_info = {}

# The page has few distinguishing tags, so locate each speaker with a
# regex on the unrendered markdown headings ('## Name') left in the text
for author in key_soup.findAll(text=re.compile('.*##[^\n]*')):
    
    start_tag = author.find_next('div')
    # the bio text is between these two tags
    stop_tag = author.find_next('p')
    
    desc = ''
    for elem in start_tag.next_elements:
        if elem == stop_tag:
            break
        elif isinstance(elem, NavigableString ):
            desc += elem.string
    
    talk_name, desc = desc.strip().split("\n", 1)
    author = author.strip("\r\n #")
    
    # Deal with unique format of one author
    if author == "Gabriella Coleman":
        talk_name, desc = desc.strip().split("\n", 1)
    
    auth_info[ author ] = { 'desc': desc.strip(),
                            'title': talk_name, }

pp = PrettyPrinter(indent=4)
pp.pprint( auth_info )


{   'Catherine Bracy': {   'desc': 'Catherine oversees Code for America’s '
                                   'civic engagement portfolio, including '
                                   'the Brigade program. She also founded '
                                   'and runs Code for All, Code for '
                                   'America’s international partnership '
                                   'program.\r\n'
                                   '\r\n'
                                   'Until November 2012, she was Director '
                                   'of Obama for America’s technology field '
                                   'office in San Francisco, the first of '
                                   'its kind in American political history. '
                                   'She was responsible for organizing '
                                   'technologists to volunteer their skills '
                                   'to the campaign’s technology and '
                                   'digital efforts.\r\n'
                                   '\r\n'
                                   'Prior to joining the campaign, she ran '
                                   'the Knight Foundation’s 2011 News '
                                   'Challenge and before that was the '
                                   'administrative director at Harvard’s '
                                   'Berkman Center for Internet & Society. '
                                   'She is on the board of directors at the '
                                   'Citizen Engagement Lab and the Public '
                                   'Laboratory.',
                           'title': 'Director, Code for America'},
    'Gabriella Coleman': {   'desc': 'Gabriella (Biella) Coleman holds the '
                                     'Wolfe Chair in Scientific and '
                                     'Technological Literacy at McGill '
                                     'University. Trained as a cultural '
                                     'anthropologist, she researches, '
                                     'writes, and teaches on computer '
                                     'hackers and digital activism and is '
                                     'the author of two books.\r\n'
                                     '\r\n'
                                     'Her first book, Coding Freedom: The '
                                     'Ethics and Aesthetics of Hacking, was '
                                     'published with Princeton University '
                                     'Press in 2013 and her most recent '
                                     'book, Hacker, Hoaxer, Whistleblower, '
                                     'Spy: The Many Faces of Anonymous, '
                                     'published by Verso, has been named to '
                                     "Kirkus Reviews' Best Books of "
                                     '2014.\r\n'
                                     '\r\n'
                                     'You can learn more about her work on '
                                     'her website: '
                                     'http://gabriellacoleman.org/.',
                             'title': 'Author & Professor'},
    'Gary Bernhardt': {   'desc': 'Gary Bernhardt is a creator and '
                                  'destroyer of software compelled to '
                                  'understand both sides of heated software '
                                  'debates: Vim and Emacs; Python and Ruby; '
                                  'Git and Mercurial. He runs  Destroy All '
                                  'Software, which publishes advanced '
                                  'screencasts for serious developers '
                                  'covering Unix, OO design, TDD, and '
                                  'dynamic languages.',
                          'title': 'Closing Keynote'},
    'Guido van Rossum': {   'desc': 'Guido van Rossum is the author of the '
                                    'Python programming language. He '
                                    'continues to serve as the "Benevolent '
                                    'Dictator For Life" (BDFL), meaning '
                                    'that he continues to oversee the '
                                    'Python development process, making '
                                    'decisions where necessary. He is '
                                    'currently employed by Dropbox.',
                            'title': "Python's Creator"},
    'Jacob Kaplan-Moss': {   'desc': 'Jacob Kaplan-Moss is the co-creator '
                                     'of Django and the founder of the '
                                     'Django Software Foundation. He has '
                                     'over a decade of experience as a web, '
                                     'open source, and Python developer. He '
                                     'is currently Director of Security at '
                                     'Heroku.',
                             'title': "Django's Co-creator"},
    'Julia Evans': {   'desc': 'Julia Evans is a programmer & data '
                               'scientist based in Montréal, Quebec. She '
                               'loves coding, math, playing with datasets, '
                               'teaching programming, open source '
                               'communities, and late night discussions on '
                               'how to dismantle oppression. She '
                               'co-organizes PyLadies Montréal and Montréal '
                               'All-Girl Hack Night.',
                       'title': 'Opening Statements'},
    'Van Lindberg': {   'desc': 'Van Lindberg is Vice President of '
                                'Intellectual Property at Rackspace. He is '
                                'trained as a computer engineer and lawyer, '
                                'but what he does best is “translate” to '
                                'help businesses, techies and attorneys '
                                'understand each other. Van likes working '
                                'with both computer code and legal code. '
                                'For the past several years, he has been '
                                'using natural language processing and '
                                'graph theory to help him digest and map '
                                'the U.S. Patent Database. Van is currently '
                                'chairman of the board of the Python '
                                'Software Foundation, as well as the author '
                                'of Intellectual Property and Open Source.',
                        'title': 'PSF Chair'}}
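
The same name / title / bio split can be illustrated on plain text, without the tag walking. A minimal sketch, assuming each speaker block starts with a '## Name' heading followed by a title line and then the bio; the sample text and the helper `split_bios` are invented for illustration and do not reproduce the real page.

```python
import re

# Invented sample in the assumed layout: heading, title line, bio paragraphs
SAMPLE = """## Ada Example
Hypothetical Title

First paragraph of the bio.

Second paragraph.
## Bob Sample
Another Title

Bio text.
"""


def split_bios(text):
    """Map each '## Name' heading to its title line and remaining bio."""
    bios = {}
    # A capture group makes re.split keep the names:
    # ['', name1, body1, name2, body2, ...]
    parts = re.split(r"^##[ \t]*(.+?)[ \t]*$", text, flags=re.MULTILINE)
    for name, body in zip(parts[1::2], parts[2::2]):
        title, _, desc = body.strip().partition("\n")
        bios[name] = {"title": title.strip(), "desc": desc.strip()}
    return bios


bios = split_bios(SAMPLE)
```

A speaker whose block deviates from this layout, like the special case handled in the loop above, would still need its own branch.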

In [29]:
# Get datetimes to go along with them
weekdays = {0: 'Monday',
            1: 'Tuesday',
            2: 'Wednesday',
            3: 'Thursday',
            4: 'Friday',
            5: 'Saturday',
            6: 'Sunday',
            }

key_talks = []

for day_soup in sched_soup.findAll("h3"):
    
    day = day_soup.get_text(strip=True)
    day = day.replace(',','').replace('April','Apr')
    
    days_table = day_soup.findNext("table")
    keynotes = days_table.select("td.slot-lightning")
    
    for key in keynotes:
        
        key_title = key.get_text()
        
        if key_title.find('Keynote') > -1 or key_title.find('Opening') > -1:
            
            # No 'EDT'/%Z here: strptime only recognises the local zone's names
            # (plus UTC/GMT) and returns a naive datetime either way
            start_t = key.findPrevious("td").get_text(strip=True)
            start_t = datetime.strptime( day +' '+ start_t, '%b %d %Y %I:%M%p' )
            
            end_t = key.findNext("td").get_text(strip=True)
            end_t = datetime.strptime( day +' '+ end_t, '%b %d %Y %I:%M%p' )
            
            dow = weekdays[ start_t.weekday() ]
            
            category, author = key_title.strip().split(' - ', 1)
            
            author = author.split('- ',1)[0].strip()
            
            talk = {'start_dt': start_t, 
                    'end_dt': end_t,
                    'weekday': dow,
                    'author': author,
                    'category': category,
                    }
            
            # Add in keynote titles and descriptions
            # (update() avoids reusing the loop variable name `key`)
            talk.update( auth_info[author] )
            
            key_talks.append( talk )
            
key_talks = pd.DataFrame( key_talks )
key_talks


Out[29]:
  | author | category | desc | end_dt | start_dt | title | weekday
0 | Julia Evans | Opening Statements | Julia Evans is a programmer & data scientist b... | 2015-04-10 09:30:00 | 2015-04-10 09:00:00 | Opening Statements | Friday
1 | Catherine Bracy | Keynote | Catherine oversees Code for America’s civic en... | 2015-04-10 10:10:00 | 2015-04-10 09:30:00 | Director, Code for America | Friday
2 | Guido van Rossum | Keynote | Guido van Rossum is the author of the Python p... | 2015-04-11 09:40:00 | 2015-04-11 09:00:00 | Python's Creator | Saturday
3 | Gabriella Coleman | Keynote | Gabriella (Biella) Coleman holds the Wolfe Cha... | 2015-04-11 10:20:00 | 2015-04-11 09:40:00 | Author & Professor | Saturday
4 | Van Lindberg | Keynote | Van Lindberg is Vice President of Intellectual... | 2015-04-12 09:20:00 | 2015-04-12 09:00:00 | PSF Chair | Sunday
5 | Jacob Kaplan-Moss | Keynote | Jacob Kaplan-Moss is the co-creator of Django ... | 2015-04-12 10:00:00 | 2015-04-12 09:20:00 | Django's Co-creator | Sunday
6 | Gary Bernhardt | Keynote | Gary Bernhardt is a creator and destroyer of s... | 2015-04-12 15:50:00 | 2015-04-12 15:10:00 | Closing Keynote | Sunday
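
The category and author columns above come from splitting the slot text on ' - '. The sketch below isolates that step in a helper that does not raise when a slot has no separator. The function name `split_slot_title` and the sample strings are assumptions inferred from the split calls, not text copied from the schedule page.

```python
def split_slot_title(text):
    """Split 'Category - Author[ - extra]' into (category, author).

    Returns an empty author when there is no ' - ' separator.
    """
    category, _, rest = text.strip().partition(" - ")
    # Drop anything after a further '- ' while leaving hyphenated
    # surnames such as 'Kaplan-Moss' intact (no space after the hyphen).
    author = rest.split("- ", 1)[0].strip()
    return category.strip(), author


examples = [split_slot_title(s) for s in (
    "Keynote - Jacob Kaplan-Moss",
    "  Opening Statements - Julia Evans ",
    "Lightning Talks",
)]
```

Using `partition` instead of unpacking the result of `split` means an unexpected slot yields an empty author rather than a ValueError, which the caller can then skip.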

In [30]:
# Keynotes have no 'abstract' or 'level', so those cells are NaN after the concat.
# Selecting talks.columns at the end restores the column order.
combined_talks = pd.concat([key_talks, talks], ignore_index=True)[talks.columns]
combined_talks.tail()


Out[30]:
    | abstract | author | category | desc | level | title | weekday | start_dt | end_dt
133 | Setting the scene My boss alerted me to an art... | A. Jesse Jiryu Davis | Python Libraries | Your Python program is too slow, and you need ... | Intermediate | Python Performance Profiling: The Guts And The... | Sunday | 2015-04-12 13:50:00 | 2015-04-12 14:20:00
134 | This tutorial is a systematic introduction to ... | Mike Müller | Python Core (language, stdlib, etc.) | Descriptors and metaclasses are advanced Pytho... | Experienced | Descriptors and Metaclasses - Understanding an... | Thursday | 2015-04-09 09:00:00 | 2015-04-09 12:20:00
135 | Using examples from real-code, show what reall... | Raymond Hettinger | Best Practices & Patterns | Distillation of knowledge gained from a decade... | Intermediate | Beyond PEP 8 -- Best practices for beautiful i... | Friday | 2015-04-10 12:10:00 | 2015-04-10 12:55:00
136 | The goal of static code analysis is to generat... | Andreas Dewes | Best Practices & Patterns | Static code analysis is an useful tool that ca... | Intermediate | Learning from other's mistakes: Data-driven an... | Saturday | 2015-04-11 11:30:00 | 2015-04-11 12:00:00
137 | In many ways Python is very similar to other p... | Stuart Williams | Python Core (language, stdlib, etc.) | This tutorial is for developers who've been us... | Intermediate | Python Epiphanies | Wednesday | 2015-04-08 13:20:00 | 2015-04-08 16:40:00

In [31]:
combined_talks.to_csv( 'data/pycon_talks_2015.csv', sep="\t", index=False )
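
Note that the file is tab-separated despite its .csv extension, so whatever reads it back has to be told the delimiter and has to re-parse the datetime columns. A sketch using the standard-library csv module rather than pandas; `read_talks` is a hypothetical helper, the sample row is an in-memory stand-in for the real file, and the 'YYYY-MM-DD HH:MM:SS' layout is assumed from the DataFrame output above.

```python
import csv
import io
from datetime import datetime

DT_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout of the datetime columns


def read_talks(handle):
    """Read tab-separated talk rows, converting the datetime columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in ("start_dt", "end_dt"):
            row[column] = datetime.strptime(row[column], DT_FORMAT)
        rows.append(row)
    return rows


# In-memory stand-in for data/pycon_talks_2015.csv with a single row
sample = io.StringIO(
    "title\tweekday\tstart_dt\tend_dt\n"
    "Ansible 101\tThursday\t2015-04-09 09:00:00\t2015-04-09 12:20:00\n"
)
loaded = read_talks(sample)
```

For the real file the handle would come from `open('data/pycon_talks_2015.csv', newline='')` in place of the StringIO object.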

System Info


In [33]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark


15/03/2015 15:25:15

CPython 3.4.2
IPython 2.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit

In [1]:
import pip
sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()])


Out[1]:
['beautifulsoup4==4.3.2',
 'bottle==0.12.8',
 'certifi==14.05.14',
 'd3py==0.2.3',
 'db.py==0.3.1',
 'decorator==3.4.0',
 'gensim==0.10.3',
 'gnureadline==6.3.3',
 'ipython==2.3.0',
 'jdcal==1.0',
 'jinja2==2.7.3',
 'markdown==2.6',
 'markupsafe==0.23',
 'matplotlib==1.4.2',
 'mpld3==0.2',
 'networkx==1.9.1',
 'nltk==3.0.0',
 'nose==1.3.4',
 'numpy==1.9.0',
 'oauthlib==0.7.2',
 'openpyxl==2.1.5',
 'pandas==0.15.0',
 'patsy==0.3.0',
 'prettytable==0.7.2',
 'progressbar33==2.4',
 'pygments==2.0.2',
 'pymongo==2.7.2',
 'pyparsing==2.0.3',
 'pyprind==2.9.1',
 'python-dateutil==2.2',
 'pytz==2014.7',
 'pyzmq==14.4.1',
 'qgrid==0.1.1',
 'requests-oauthlib==0.4.1',
 'requests==2.3.0',
 'scikit-learn==0.15.2',
 'scipy==0.14.0',
 'six==1.7.3',
 'statsmodels==0.6.1',
 'tornado==4.0.2',
 'tweepy==2.3',
 'vincent==0.4.4']
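
`pip.get_installed_distributions()` worked with the pip of the time but is not available as a public API in pip 10 and later. A sketch of producing a similar list with `importlib.metadata` from the standard library (Python 3.8+); `installed_versions` is a name chosen here for illustration.

```python
from importlib.metadata import distributions


def installed_versions():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.add("%s==%s" % (name.lower(), dist.version))
    return sorted(pins)


versions = installed_versions()
```

Running `pip freeze` from a shell gives much the same information without any Python code.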
