Scrape PyCon Talk Data from the Web

This is the first step in creating a talk recommender for PyCon.
The rest of the project can be found on GitHub: https://github.com/mikecunha/pycon_reco

Dependencies

Version info from watermark and a pip package listing is at the end of the notebook.



In [2]:

    
from datetime import datetime
from time import sleep
import re
from pprint import PrettyPrinter
from urllib.request import urlopen

from bs4 import BeautifulSoup # http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4.element import NavigableString
from markdown import markdown
import pyprind  # progress bar, e.g. here: http://nbviewer.ipython.org/github/rasbt/pyprind/blob/master/examples/pyprind_demo.ipynb
import pandas as pd



In [7]:

    
sched_html = urlopen("https://us.pycon.org/2015/schedule/")

if sched_html.status != 200:
    print('Error: ', sched_html.status)
else:
    # name the parser explicitly so results don't depend on what is installed
    sched_soup = BeautifulSoup(sched_html.read(), "html.parser")



In [8]:

    
to_scrape = []

talk_links = sched_soup.select("td.slot-talk span.title a")
tut_links = sched_soup.select("td.slot-tutorial span.title a")

for t in talk_links + tut_links:
    to_scrape.append( t.attrs.get('href') )
    
list(enumerate(to_scrape))[-5:]









    Out[8]:





[(126, '/2015/schedule/presentation/330/'),
 (127, '/2015/schedule/presentation/322/'),
 (128, '/2015/schedule/presentation/318/'),
 (129, '/2015/schedule/presentation/299/'),
 (130, '/2015/schedule/presentation/466/')]
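The scraped `href` values are site-relative paths. A minimal sketch of turning them into absolute URLs with the standard library's `urljoin`, which also copes with a link that is already absolute. The names `BASE` and `absolutize` are chosen here for illustration and are not part of the notebook:

```python
from urllib.parse import urljoin

BASE = "https://us.pycon.org"  # illustrative constant, not from the notebook

def absolutize(hrefs, base=BASE):
    """Resolve each href against the site root."""
    return [urljoin(base, h) for h in hrefs]

urls = absolutize(["/2015/schedule/presentation/330/",
                   "https://us.pycon.org/2015/schedule/presentation/322/"])
print(urls)
```

Plain string concatenation works just as well as long as every link is relative; `urljoin` only matters if the page ever mixes in absolute links.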



In [9]:

    
# Scrape all the talk html pages
soups = {}
perc = pyprind.ProgPercent( len(to_scrape) )

for relative_url in to_scrape:
    
    perc.update()
    
    uri = "https://us.pycon.org" + relative_url
    talk_html = urlopen( uri )
    soups[uri] = BeautifulSoup( talk_html.read(), "html.parser" )
    
    sleep(0.5) # Be nice.









    



[100 %] elapsed[sec]: 407.227 | ETA[sec]: 0.000 
Total time elapsed: 407.227 sec
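A scrape that runs for several minutes loses all progress if one request fails partway through. The following is a sketch, under assumptions, of a loop that retries a failed page and records the ones that never succeed. `fetch_all`, `fetch`, `delay` and `retries` are hypothetical names; the fetch function is passed in so the loop can be tried without network access:

```python
from time import sleep

def fetch_all(urls, fetch, delay=0.5, retries=2):
    """Fetch each URL with `fetch`, retrying on OSError.

    Returns (pages, failed): a dict of url -> body and a list of
    URLs that still failed after all attempts.
    """
    pages, failed = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                break
            except OSError:  # urllib.error.URLError is a subclass of OSError
                sleep(delay)
        else:
            # the inner loop never hit `break`, so every attempt failed
            failed.append(url)
        sleep(delay)  # stay polite between requests
    return pages, failed

# Offline stand-in for a real fetcher
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if url.endswith("bad/"):
        raise OSError("unreachable")
    return "<html>" + url + "</html>"

pages, failed = fetch_all(["a/", "bad/"], flaky, delay=0)
print(failed)
```

For a real run, `fetch` could be something like `lambda u: urlopen(u).read()`.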



In [17]:

    
talks = []

for uri, soup in soups.items():
    
    talk = {}
    
    content = soup.find(attrs={"class":"box-content"})

    elements = content.find_all("dd")
    talk['level'], talk['category'] = [ e.get_text(strip=True) for e in elements ]

    elements = content.find_all("h4")
    talk['str_time'], talk['author'] = [ e.get_text(strip=True) for e in elements ]

    talk['desc'] = soup.find(attrs={"class":"description"}).get_text(strip=True)
    
    # Abstracts contain some unparsed markdown
    abstract = soup.find(attrs={"class":"abstract"}).get_text(strip=True)
    html = markdown( abstract )
    abstract = ''.join(BeautifulSoup(html, "html.parser").findAll(text=True))
    
    talk['abstract'] = abstract.replace("\n"," ")

    talk['title'] = content.find("h2").get_text(strip=True)
    
    talks.append( talk )
    
talks = pd.DataFrame( talks )
talks.head()









    Out[17]:






  
    
      
	abstract	author	category	desc	level	str_time	title
0	Without virtual environments, your installed l...	Renee Chu,Matt Makai	Python Core (language, stdlib, etc.)	Even though it’s possible to program without u...	Novice	Friday\n            4:30 p.m.–5 p.m.	Don't Make Us Say We Told You So: virtualenv f...
1	overview This tutorial aims to teach participa...	Andrew Seier,Étienne Tétreault-Pinard,Marianne...	Python Libraries	From Python basics to NYT-quality graphics, we...	Novice	Thursday\n            9 a.m.–12:20 p.m.	Making Beautiful Graphs in Python and Sharing ...
2	Distributed systems are a fairly advanced fiel...	lvh	Best Practices & Patterns	A very brief introduction to the theory and pr...	Intermediate	Friday\n            1:40 p.m.–2:25 p.m.	Distributed Systems 101
3	Are you interested in learning how to orchestr...	Luke Sneeringer	Systems Administration	Interested in Ansible, or in server orchestrat...	Intermediate	Thursday\n            9 a.m.–12:20 p.m.	Ansible 101
4	Software engineers are never done learning as ...	Sasha Laundy	Community	Software engineers are never done learning sin...	Novice	Saturday\n            4:30 p.m.–5 p.m.	Your Brain's API: Giving and Getting Technical...
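The abstract cleanup renders the leftover Markdown to HTML and then keeps only the text nodes. The second half of that step can be sketched with the standard library alone. This swaps `html.parser` in for BeautifulSoup purely for illustration, and `TextExtractor` and `strip_tags` are hypothetical names:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the concatenated text content of `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

text = strip_tags("<h2>Overview</h2>\n<p>Learn <em>virtualenv</em> basics.</p>")
print(text.replace("\n", " "))
```

Headings lose their markup but keep their words, which is why an abstract in the table can begin with a bare word such as "overview".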



In [19]:

    
day_to_date = {'Wednesday': 'Apr 8 2015 ',
               'Thursday': 'Apr 9 2015 ',
               'Friday': 'Apr 10 2015 ',
               'Saturday': 'Apr 11 2015 ',
               'Sunday': 'Apr 12 2015 ',
               }

def parse_dt( dt ):
    """ Convert string to datetime """
    
    day, t = [ x.strip() for x in dt.split('\n') ]
    
    start, end = [ x.replace('.', '').replace(' ','').upper() for x in t.split("–") ]
    
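    if end == "NOON":
        end = "12:00PM"
    elif end.find(':') < 0:
        # keep every hour digit: "10AM" -> "10:00AM", not "1:00AM"
        end = end[:-2] + ":00" + end[-2:]
    if start == "NOON":
        start = "12:00PM"
    elif start.find(':') < 0:
        start = start[:-2] + ":00" + start[-2:]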
    
    # strptime's %Z only accepts UTC, GMT and the local machine's zone names,
    # so parse the times as naive conference-local datetimes instead
    try:
        start = datetime.strptime( day_to_date[day] + start, '%b %d %Y %I:%M%p' )
    except ValueError:
        print("error converting start time: ", start)
    try:
        end = datetime.strptime( day_to_date[day] + end, '%b %d %Y %I:%M%p' )
    except ValueError:
        print("error converting end time: ", end)
    
    return day, start, end
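The schedule writes times in several shapes ("9 a.m.", "12:20 p.m.", "noon"), so the fiddly part of the conversion is normalizing them into something `strptime` accepts. A minimal sketch of that normalization on its own, with hypothetical helper names `normalize_time` and `to_datetime`, producing naive datetimes:

```python
from datetime import datetime

def normalize_time(label):
    """Turn labels like '9 a.m.', '12:20 p.m.' or 'noon' into '%I:%M%p' form."""
    s = label.replace(".", "").replace(" ", "").upper()
    if s == "NOON":
        return "12:00PM"
    if ":" not in s:
        # insert minutes before the AM/PM suffix, keeping all hour digits
        s = s[:-2] + ":00" + s[-2:]
    return s

def to_datetime(date_str, label):
    """Combine a date such as 'Apr 10 2015' with a schedule time label."""
    return datetime.strptime(date_str + " " + normalize_time(label),
                             "%b %d %Y %I:%M%p")

print(normalize_time("10 a.m."))
print(to_datetime("Apr 10 2015", "4:30 p.m."))
```

Slicing off the last two characters rather than taking only the first one matters for two-digit hours such as "10 a.m.".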



In [20]:

    
talks["weekday"], talks["start_dt"], talks["end_dt"] = zip(*talks["str_time"].map(parse_dt))

del talks["str_time"]
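The `zip(*...)` idiom above transposes a sequence of `(day, start, end)` tuples into three parallel sequences, one per new column. A small illustration with plain tuples (the sample values are made up for the example):

```python
rows = [("Friday", "16:30", "17:00"),
        ("Thursday", "09:00", "12:20")]

# one tuple per row in, one tuple per column out
weekday, start, end = zip(*rows)
print(weekday)
print(start)
```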



In [21]:

    
talks.head()









    Out[21]:






  
    
      
	abstract	author	category	desc	level	title	weekday	start_dt	end_dt
0	Without virtual environments, your installed l...	Renee Chu,Matt Makai	Python Core (language, stdlib, etc.)	Even though it’s possible to program without u...	Novice	Don't Make Us Say We Told You So: virtualenv f...	Friday	2015-04-10 16:30:00	2015-04-10 17:00:00
1	overview This tutorial aims to teach participa...	Andrew Seier,Étienne Tétreault-Pinard,Marianne...	Python Libraries	From Python basics to NYT-quality graphics, we...	Novice	Making Beautiful Graphs in Python and Sharing ...	Thursday	2015-04-09 09:00:00	2015-04-09 12:20:00
2	Distributed systems are a fairly advanced fiel...	lvh	Best Practices & Patterns	A very brief introduction to the theory and pr...	Intermediate	Distributed Systems 101	Friday	2015-04-10 13:40:00	2015-04-10 14:25:00
3	Are you interested in learning how to orchestr...	Luke Sneeringer	Systems Administration	Interested in Ansible, or in server orchestrat...	Intermediate	Ansible 101	Thursday	2015-04-09 09:00:00	2015-04-09 12:20:00
4	Software engineers are never done learning as ...	Sasha Laundy	Community	Software engineers are never done learning sin...	Novice	Your Brain's API: Giving and Getting Technical...	Saturday	2015-04-11 16:30:00	2015-04-11 17:00:00

Keynote information has to be scraped differently:



In [3]:

    
# grab html from the keynote page with bios of the speakers:
key_html = urlopen( "https://us.pycon.org/2015/events/keynotes/" )
key_soup = BeautifulSoup( key_html.read(), "html.parser" )



In [23]:

    
auth_info = {}

# not as many unique tags for soup, so find them using regex on 
# the markdown that is present
for author in key_soup.findAll(text=re.compile('.*##[^\n]*')):
    
    start_tag = author.find_next('div')
    # the bio text is between these two tags
    stop_tag = author.find_next('p')
    
    desc = ''
    for elem in start_tag.next_elements:
        if elem == stop_tag:
            break
        elif isinstance(elem, NavigableString ):
            desc += elem.string
    
    talk_name, desc = desc.strip().split("\n", 1)
    author = author.strip("\r\n #")
    
    # Deal with unique format of one author
    if author == "Gabriella Coleman":
        talk_name, desc = desc.strip().split("\n", 1)
    
    auth_info[ author ] = { 'desc': desc.strip(),
                            'title': talk_name, }

pp = PrettyPrinter(indent=4)
pp.pprint( auth_info )









    



{   'Catherine Bracy': {   'desc': 'Catherine oversees Code for America’s '
                                   'civic engagement portfolio, including '
                                   'the Brigade program. She also founded '
                                   'and runs Code for All, Code for '
                                   'America’s international partnership '
                                   'program.\r\n'
                                   '\r\n'
                                   'Until November 2012, she was Director '
                                   'of Obama for America’s technology field '
                                   'office in San Francisco, the first of '
                                   'its kind in American political history. '
                                   'She was responsible for organizing '
                                   'technologists to volunteer their skills '
                                   'to the campaign’s technology and '
                                   'digital efforts.\r\n'
                                   '\r\n'
                                   'Prior to joining the campaign, she ran '
                                   'the Knight Foundation’s 2011 News '
                                   'Challenge and before that was the '
                                   'administrative director at Harvard’s '
                                   'Berkman Center for Internet & Society. '
                                   'She is on the board of directors at the '
                                   'Citizen Engagement Lab and the Public '
                                   'Laboratory.',
                           'title': 'Director, Code for America'},
    'Gabriella Coleman': {   'desc': 'Gabriella (Biella) Coleman holds the '
                                     'Wolfe Chair in Scientific and '
                                     'Technological Literacy at McGill '
                                     'University. Trained as a cultural '
                                     'anthropologist, she researches, '
                                     'writes, and teaches on computer '
                                     'hackers and digital activism and is '
                                     'the author of two books.\r\n'
                                     '\r\n'
                                     'Her first book, Coding Freedom: The '
                                     'Ethics and Aesthetics of Hacking, was '
                                     'published with Princeton University '
                                     'Press in 2013 and her most recent '
                                     'book, Hacker, Hoaxer, Whistleblower, '
                                     'Spy: The Many Faces of Anonymous, '
                                     'published by Verso, has been named to '
                                     "Kirkus Reviews' Best Books of "
                                     '2014.\r\n'
                                     '\r\n'
                                     'You can learn more about her work on '
                                     'her website: '
                                     'http://gabriellacoleman.org/.',
                             'title': 'Author & Professor'},
    'Gary Bernhardt': {   'desc': 'Gary Bernhardt is a creator and '
                                  'destroyer of software compelled to '
                                  'understand both sides of heated software '
                                  'debates: Vim and Emacs; Python and Ruby; '
                                  'Git and Mercurial. He runs  Destroy All '
                                  'Software, which publishes advanced '
                                  'screencasts for serious developers '
                                  'covering Unix, OO design, TDD, and '
                                  'dynamic languages.',
                          'title': 'Closing Keynote'},
    'Guido van Rossum': {   'desc': 'Guido van Rossum is the author of the '
                                    'Python programming language. He '
                                    'continues to serve as the "Benevolent '
                                    'Dictator For Life" (BDFL), meaning '
                                    'that he continues to oversee the '
                                    'Python development process, making '
                                    'decisions where necessary. He is '
                                    'currently employed by Dropbox.',
                            'title': "Python's Creator"},
    'Jacob Kaplan-Moss': {   'desc': 'Jacob Kaplan-Moss is the co-creator '
                                     'of Django and the founder of the '
                                     'Django Software Foundation. He has '
                                     'over a decade of experience as a web, '
                                     'open source, and Python developer. He '
                                     'is currently Director of Security at '
                                     'Heroku.',
                             'title': "Django's Co-creator"},
    'Julia Evans': {   'desc': 'Julia Evans is a programmer & data '
                               'scientist based in Montréal, Quebec. She '
                               'loves coding, math, playing with datasets, '
                               'teaching programming, open source '
                               'communities, and late night discussions on '
                               'how to dismantle oppression. She '
                               'co-organizes PyLadies Montréal and Montréal '
                               'All-Girl Hack Night.',
                       'title': 'Opening Statements'},
    'Van Lindberg': {   'desc': 'Van Lindberg is Vice President of '
                                'Intellectual Property at Rackspace. He is '
                                'trained as a computer engineer and lawyer, '
                                'but what he does best is “translate” to '
                                'help businesses, techies and attorneys '
                                'understand each other. Van likes working '
                                'with both computer code and legal code. '
                                'For the past several years, he has been '
                                'using natural language processing and '
                                'graph theory to help him digest and map '
                                'the U.S. Patent Database. Van is currently '
                                'chairman of the board of the Python '
                                'Software Foundation, as well as the author '
                                'of Intellectual Property and Open Source.',
                        'title': 'PSF Chair'}}
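The keynote page carries unrendered Markdown, so the speaker names are found by looking for `##` headings and the title is taken to be the first line of text that follows. A sketch of those string operations on a made-up sample string (the layout of `raw` is an assumption modeled on the output above, not a copy of the real page):

```python
import re

# assumed layout: heading line, title line, then the bio
raw = ("\r\n## Guido van Rossum\r\n"
       "Python's Creator\r\n"
       "Guido van Rossum is the author of the Python programming language.\r\n")

heading = re.search(r"##[^\n]*", raw).group(0)
author = heading.strip("\r\n #")          # drop the markers and line endings

body = raw.split(heading, 1)[1].strip()    # everything after the heading
title, desc = body.split("\n", 1)          # first line is the title
title, desc = title.strip(), desc.strip()

print(author, "|", title)
```

A speaker whose block has an extra leading line needs one more `split("\n", 1)`, which is what the special case in the loop handles.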



In [29]:

    
# Get datetimes to go along with them
weekdays = {0: 'Monday',
            1: 'Tuesday',
            2: 'Wednesday',
            3: 'Thursday',
            4: 'Friday',
            5: 'Saturday',
            6: 'Sunday',
            }

key_talks = []

for day_soup in sched_soup.findAll("h3"):
    
    day = day_soup.get_text(strip=True)
    day = day.replace(',','').replace('April','Apr')
    
    days_table = day_soup.findNext("table")
    keynotes = days_table.select("td.slot-lightning")
    
    for key in keynotes:
        
        key_title = key.get_text()
        
        if key_title.find('Keynote') > -1 or key_title.find('Opening') > -1:
            
            # strptime's %Z only accepts UTC, GMT and the local machine's zone
            # names, so parse these as naive conference-local datetimes
            start_t = key.findPrevious("td").get_text(strip=True)
            start_t = datetime.strptime( day +' '+ start_t, '%b %d %Y %I:%M%p' )
            
            end_t = key.findNext("td").get_text(strip=True)
            end_t = datetime.strptime( day +' '+ end_t, '%b %d %Y %I:%M%p' )
            
            dow = weekdays[ start_t.weekday() ]
            
            category, author = key_title.strip().split(' - ', 1)
            
            author = author.split('- ',1)[0].strip()
            
            talk = {'start_dt': start_t, 
                    'end_dt': end_t,
                    'weekday': dow,
                    'author': author,
                    'category': category,
                    }
            
            # Add in keynote titles and descriptions
            # use `field` so the outer loop variable `key` is not shadowed
            for field, val in auth_info[author].items():
                talk[field] = val
            
            key_talks.append( talk )
            
key_talks = pd.DataFrame( key_talks )
key_talks









    Out[29]:






  
    
      
	author	category	desc	end_dt	start_dt	title	weekday
0	Julia Evans	Opening Statements	Julia Evans is a programmer & data scientist b...	2015-04-10 09:30:00	2015-04-10 09:00:00	Opening Statements	Friday
1	Catherine Bracy	Keynote	Catherine oversees Code for America’s civic en...	2015-04-10 10:10:00	2015-04-10 09:30:00	Director, Code for America	Friday
2	Guido van Rossum	Keynote	Guido van Rossum is the author of the Python p...	2015-04-11 09:40:00	2015-04-11 09:00:00	Python's Creator	Saturday
3	Gabriella Coleman	Keynote	Gabriella (Biella) Coleman holds the Wolfe Cha...	2015-04-11 10:20:00	2015-04-11 09:40:00	Author & Professor	Saturday
4	Van Lindberg	Keynote	Van Lindberg is Vice President of Intellectual...	2015-04-12 09:20:00	2015-04-12 09:00:00	PSF Chair	Sunday
5	Jacob Kaplan-Moss	Keynote	Jacob Kaplan-Moss is the co-creator of Django ...	2015-04-12 10:00:00	2015-04-12 09:20:00	Django's Co-creator	Sunday
6	Gary Bernhardt	Keynote	Gary Bernhardt is a creator and destroyer of s...	2015-04-12 15:50:00	2015-04-12 15:10:00	Closing Keynote	Sunday
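Two small steps in the keynote loop can be shown in isolation: splitting a schedule cell into category and speaker, and deriving the weekday name from the parsed start time. In this sketch the cell text `" Keynote - Guido van Rossum "` is an assumed shape inferred from the split logic, `split_keynote` is a hypothetical helper, and `strftime("%A")` is used in place of the lookup dictionary (it returns English names under the default C locale):

```python
from datetime import datetime

def split_keynote(cell_text):
    """Split 'Category - Speaker[- extra]' into (category, speaker)."""
    category, author = cell_text.strip().split(" - ", 1)
    return category, author.split("- ", 1)[0].strip()

category, author = split_keynote(" Keynote - Guido van Rossum ")

start = datetime.strptime("Apr 10 2015 9:00AM", "%b %d %Y %I:%M%p")
weekday = start.strftime("%A")  # locale-dependent weekday name

print(category, author, weekday)
```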



In [30]:

    
# specifying columns at the end preserves column order
combined_talks = pd.concat([key_talks, talks], ignore_index=True, )[talks.columns]
combined_talks.tail()









    Out[30]:






  
    
      
	abstract	author	category	desc	level	title	weekday	start_dt	end_dt
133	Setting the scene My boss alerted me to an art...	A. Jesse Jiryu Davis	Python Libraries	Your Python program is too slow, and you need ...	Intermediate	Python Performance Profiling: The Guts And The...	Sunday	2015-04-12 13:50:00	2015-04-12 14:20:00
134	This tutorial is a systematic introduction to ...	Mike Müller	Python Core (language, stdlib, etc.)	Descriptors and metaclasses are advanced Pytho...	Experienced	Descriptors and Metaclasses - Understanding an...	Thursday	2015-04-09 09:00:00	2015-04-09 12:20:00
135	Using examples from real-code, show what reall...	Raymond Hettinger	Best Practices & Patterns	Distillation of knowledge gained from a decade...	Intermediate	Beyond PEP 8 -- Best practices for beautiful i...	Friday	2015-04-10 12:10:00	2015-04-10 12:55:00
136	The goal of static code analysis is to generat...	Andreas Dewes	Best Practices & Patterns	Static code analysis is an useful tool that ca...	Intermediate	Learning from other's mistakes: Data-driven an...	Saturday	2015-04-11 11:30:00	2015-04-11 12:00:00
137	In many ways Python is very similar to other p...	Stuart Williams	Python Core (language, stdlib, etc.)	This tutorial is for developers who've been us...	Intermediate	Python Epiphanies	Wednesday	2015-04-08 13:20:00	2015-04-08 16:40:00
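The keynote frame has no `abstract` or `level` columns, so concatenating it with the talks frame leaves those cells missing for the keynote rows, and indexing with `talks.columns` puts the result back in the talks column order. A toy illustration with two one-row frames (the values are trimmed stand-ins for the real data):

```python
import pandas as pd

talks = pd.DataFrame([{"abstract": "Distributed systems are...",
                       "author": "lvh",
                       "level": "Intermediate",
                       "title": "Distributed Systems 101"}])
key_talks = pd.DataFrame([{"author": "Guido van Rossum",
                           "title": "Python's Creator"}])

# columns missing from one frame are filled with NaN in its rows
combined = pd.concat([key_talks, talks], ignore_index=True)[talks.columns]
print(combined)
```

Anything downstream that treats `abstract` or `level` as text will need to handle those missing values for the keynote rows.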



In [31]:

    
combined_talks.to_csv( 'data/pycon_talks_2015.csv', sep="\t", index=False )
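Note that the file is tab-separated despite its `.csv` extension, and the datetime columns are written out as plain strings. A sketch of reading such a file back with the standard library and restoring the datetimes; the in-memory string stands in for the real file and contains only a few of its columns:

```python
import csv
import io
from datetime import datetime

# stand-in for open('data/pycon_talks_2015.csv'), trimmed to three columns
tsv = ("title\tstart_dt\tend_dt\n"
       "Distributed Systems 101\t2015-04-10 13:40:00\t2015-04-10 14:25:00\n")

FMT = "%Y-%m-%d %H:%M:%S"
rows = []
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    row["start_dt"] = datetime.strptime(row["start_dt"], FMT)
    row["end_dt"] = datetime.strptime(row["end_dt"], FMT)
    rows.append(row)

minutes = (rows[0]["end_dt"] - rows[0]["start_dt"]).total_seconds() / 60
print(rows[0]["title"], minutes)
```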

System Info



In [33]:

    
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark









    



15/03/2015 15:25:15

CPython 3.4.2
IPython 2.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit



In [1]:

    
import pip
sorted(["%s==%s" % (i.key, i.version) for i in pip.get_installed_distributions()])









    Out[1]:





['beautifulsoup4==4.3.2',
 'bottle==0.12.8',
 'certifi==14.05.14',
 'd3py==0.2.3',
 'db.py==0.3.1',
 'decorator==3.4.0',
 'gensim==0.10.3',
 'gnureadline==6.3.3',
 'ipython==2.3.0',
 'jdcal==1.0',
 'jinja2==2.7.3',
 'markdown==2.6',
 'markupsafe==0.23',
 'matplotlib==1.4.2',
 'mpld3==0.2',
 'networkx==1.9.1',
 'nltk==3.0.0',
 'nose==1.3.4',
 'numpy==1.9.0',
 'oauthlib==0.7.2',
 'openpyxl==2.1.5',
 'pandas==0.15.0',
 'patsy==0.3.0',
 'prettytable==0.7.2',
 'progressbar33==2.4',
 'pygments==2.0.2',
 'pymongo==2.7.2',
 'pyparsing==2.0.3',
 'pyprind==2.9.1',
 'python-dateutil==2.2',
 'pytz==2014.7',
 'pyzmq==14.4.1',
 'qgrid==0.1.1',
 'requests-oauthlib==0.4.1',
 'requests==2.3.0',
 'scikit-learn==0.15.2',
 'scipy==0.14.0',
 'six==1.7.3',
 'statsmodels==0.6.1',
 'tornado==4.0.2',
 'tweepy==2.3',
 'vincent==0.4.4']


