Gaps in O'Reilly's Python repertoire

Tanya Schlusser 11 December 2014

Slides prepared using IPython Notebook. (Awesome quick tutorial... and how to 'Markdown')

Following along? Clone this: https://github.com/tanyaschlusser/ipython_talk__OReilly_python_books

Motivation

But

So we

What to do?

Option 1

Option 2

Option Oh Heck Yeah

And the reaction?

Not naming names... B---- R--

Well I say

Well, it would be a haul ...

...wouldn't be able to get a drink for the next 18 - 36 months...

But what to write?

There are over 120 Python publications by O'Reilly alone

Approach it systematically:

  1. See what's out there now
  2. Think about what we can contribute
  3. Lots of writing

So, what's out there now?

Of course we're going to use Python to find out. And of course the universe is only the size of O'Reilly.


In [1]:
# See what's out there now. Pull the:
#  -- media type (book | video)
#  -- title
#  -- publication date
import requests
from bs4 import BeautifulSoup

books_uri = "http://shop.oreilly.com/category/browse-subjects/programming/python.do?sortby=publicationDate&page=%d"

In [2]:
# Loop over all of the pages
results = []
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        yr = b.find("span", "directorydate").string.strip().split()
        while not yr[-1].isdigit():
            yr.pop()
        yr = int(yr[-1])
        title = b.find("div", "thumbheader").text.strip()
        url = b.find("div", "thumbheader").find("a")["href"]
        hasvideo = "Video" in b.text
        results.append(dict(year=yr, title=title, hasvideo=hasvideo))
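A quick sanity check on the scrape (a sketch that was not in the original deck; the sample values in the comment are only illustrative):

# Each entry in `results` is a plain dict, roughly
#   {'year': 2014, 'title': 'Some Book Title', 'hasvideo': False}
print len(results)
print results[:2]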

In [3]:
# Want to
#  -- plot year over year number of books
#      ++ stacked plot with video + print
#  -- Get all the different words in the titles
#     ++ count them
#     ++ and order by frequency
#
# Use the Matplotlib magic command. Magic commands start with '%'.
# This sets up to plot inline. It doesn't import anything...
# Or use %pylab inline -- this apparently imports a lot of things into
# the global namespace
#
%matplotlib inline

In [4]:
# For year over year I need pandas.DataFrame.groupby
# For stacked plot I need matplotlib.pyplot
# Plain dictionary for the word counts
#
import matplotlib.pyplot as plt
import pandas as pd

In [5]:
# Year over year -- number of publications by 'video' and 'print'.
#
df = pd.DataFrame(results)
byyear = pd.crosstab(df["year"],df["hasvideo"])
byyear.rename(columns={True:'video', False:'print'}, inplace=True)
byyear.plot(kind="area", xlim=(2000,2014), title="Ever increasing publications")


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x108cc9210>

Um

That actually wasn't very satisfying...


In [6]:
# Out of curiosity, what happened in 2010?
df[df["year"]==2010]


Out[6]:
hasvideo title year
77 False Python 2.6 Text Processing Beginner's Guide 2010
78 False Programming Python, 4th Edition 2010
79 False Python Geospatial Development 2010
80 False wxPython 2.8 Application Development Cookbook 2010
81 False Python 2.6 Graphics Cookbook 2010
82 False Head First Python 2010
83 False Real World Instrumentation with Python 2010
84 False Python Text Processing with NLTK 2.0 Cookbook 2010
85 False MySQL for Python 2010
86 False Python Multimedia 2010
87 True Practical Python Programming: Callbacks 2010
88 False Python 3 Object Oriented Programming 2010
89 False Spring Python 1.1 2010
90 False Blender 2.49 Scripting 2010
91 False Professional IronPython 2010
92 False Grok 1.0 Web Development 2010
93 False Beginning Python 2010
94 False Python Testing 2010

In [7]:
# Break up the titles and count words in the titles.
#  -- Need a regex for the punctuation (commas and colons)
#  -- Need a stemmer for plurals, possessives, and verb conjugations
import re
from nltk.stem.porter import PorterStemmer

space_or_punct = re.compile("[\s,:\.]+")
stemmer = PorterStemmer()

title_words = {}
for r in results:
    title = space_or_punct.split( r["title"].lower() )
    stemmed_title = (stemmer.stem(t) for t in title)
    for t in stemmed_title:
        # don't retain version or release numbers
        if t and not t[0].isdigit():  # skip empty tokens from leading/trailing punctuation
            if t not in title_words:
                title_words[t] = 1
            else:
                title_words[t] += 1

print "Total distinct words in the titles:", len(title_words), "\n"
print "\t".join(title_words.keys())
print"\n"
print "\n".join(r["title"] for r in results if not r["hasvideo"])


Total distinct words in the titles: 158 

** Truncated for brevity **
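For anyone unfamiliar with stemming, a tiny illustration of what the Porter stemmer does to the title words (a sketch; exact outputs can vary with the NLTK version):

# Plurals and -ing forms collapse to a shared stem,
# e.g. 'cookbooks' -> 'cookbook', 'programming' -> 'program'
for word in ["cookbooks", "programming", "learning"]:
    print word, "->", stemmer.stem(word)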


In [11]:
# That was useless -- almost every word except for "Python" and "Learn"
# shows up only in one title. Lame.
#
# Loop over all of the pages and get the book descriptions.
# Maybe see if some things in the descriptions show common topics
# (hoping for 'introductory' or 'web development' or 'machine learning')
import nltk
# Before running this cell, fetch the stopword corpus once with nltk.download('stopwords')
# (or run nltk.download() and select 'stopwords' from the corpora list)
from nltk.corpus import stopwords

english_stops = stopwords.words("English")
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        title = b.find("div", "thumbheader").text.strip()
        # Only look at the books
        if not "Video" in b.text:
            print ".",
            url = "http://shop.oreilly.com" + b.find("div", "thumbheader").find("a")["href"]
            result2 = requests.get(url)
            soup2 = BeautifulSoup(result2.text)
            description = soup2.find("div", "detail-description-content").text
            description_results[title] = set([stemmer.stem(word)
                                          for word in space_or_punct.split(description.lower())
                                          if word not in english_stops])
print


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [12]:
# Try clustering:
#   -- Distance between two book descriptions is the percent overlap
#      of words in both their descriptions
#   -- Arbitrarily (from qualitative looking) decide on a threshold of
#      17% overlap in descriptions for both books to be 'similar'
#      and look at what we get
percent_overlap = []
min_intersections = []
max_intersections = []
avg_intersections = []
similar_books = {}

sorted_titles = sorted(description_results.keys())
for i in range(len(sorted_titles)):
    this_description = description_results[sorted_titles[i]]
    
    def get_percent_overlap(title):
        intersection_size = len(this_description.intersection(description_results[title]))
        union_size = len(this_description.union(description_results[title]))
        return (intersection_size * 100) / union_size
        
    percent_overlap.append([get_percent_overlap(t) for t in sorted_titles])
    similar_books[sorted_titles[i]] = [
            t for t in sorted_titles
            if get_percent_overlap(t) > 17 and t != sorted_titles[i]
    ]
    min_intersections.append(round(min(percent_overlap[-1])))
    max_intersections.append(round(max(percent_overlap[-1])))
    avg_intersections.append(round(float(sum(percent_overlap[-1])) / len(sorted_titles)))

print "\n".join("\n%s\n%s" % (k, "|".join(v)) for k, v in similar_books.iteritems())


** Truncated for brevity **
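Side note: the "percent overlap" above is just the Jaccard index of the two description word sets, scaled to 0-100. As a standalone helper it would look something like this (a sketch, equivalent in spirit to get_percent_overlap):

def jaccard_percent(a, b):
    # size of the intersection over size of the union, as an integer percent
    return (100 * len(a & b)) / len(a | b)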


In [13]:
from scipy.cluster.hierarchy import linkage, dendrogram

In [14]:
plt.figure(figsize=(5,20))
data_link = linkage(percent_overlap, method='single', metric='euclidean')

den = dendrogram(data_link,labels=sorted_titles, orientation="left")

plt.ylabel('Samples', fontsize=9)
plt.xlabel('Distance')
plt.suptitle('Books clustered by description similarity', fontweight='bold', fontsize=14);
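Note that linkage above is fed the rows of percent_overlap as if they were Euclidean feature vectors. An alternative (only a sketch, not what produced the dendrogram here) is to treat 100 - overlap as a precomputed distance and pass scipy a condensed distance matrix:

import numpy as np
from scipy.spatial.distance import squareform

dist = 100 - np.array(percent_overlap, dtype=float)
dist = (dist + dist.T) / 2     # make sure the matrix is symmetric
np.fill_diagonal(dist, 0)      # self-distance must be exactly zero
condensed = squareform(dist, checks=False)
alt_link = linkage(condensed, method='average')
dendrogram(alt_link, labels=sorted_titles, orientation="left")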



In [15]:
# The output of linkage is uninterpretable for a human.
# Nest them in a JSON for readability
# print data_link

human_links = dict(enumerate(sorted_titles))
index = len(human_links)
for link in data_link:
    left = int(link[0])   # linkage returns floats; cast so the dict keys stay ints
    right = int(link[1])
    human_links[index] = {left:human_links[left], right:human_links[right]}
    del(human_links[left], human_links[right])
    index += 1

import json
with open("data/hclust.json", "w") as outfile:
    outfile.write(json.dumps(human_links))

#print human_links

In [9]:
from IPython.display import HTML

container = """
    <a href='http://bl.ocks.org/mbostock/4063570'>Layout attribution: Michael Bostock</a>
    <div id='display_container'></div>"""

with open("data/d3-stacked-Tree.js") as infile:
    display_js = infile.read()

with open("data/human_hclust.json") as infile:
    the_json = infile.read()

HTML(container + display_js % the_json )


Out[9]:
Layout attribution: Michael Bostock

Quick interpretation

It makes sense that topics have little overlap. Otherwise why write a different book? Do we have anything to contribute?

  • Pure programming
    • Unless we write our own library, I think this space is full
    • Plus, any book should be better than Udacity / existing blogs, or why do it?
  • Games / Web
    • Lots of expertise here, right? (Not me, but what percent of ChiPy does this?)
  • Scientific / Hobby
    • The other half of ChiPy? The finance guys? (What percent?)
    • We could actually contact the PyMC guy Thomas Wiecki and offer to help him out

Thank you -- what next?

  • Contact me if you want to do something
  • Ideas welcome
  • This may or may not happen
    • depends on interest
    • and whether interest can be sustained

There was support for a 'personality piece'

  • About local Chicago successes (Thanks Don Sheu!)
  • Idea:
    • Python in production seems newer, and we have a handful of local companies who have gone this route
    • A recent Berkeley paper summarizes today's enterprise analytics pipeline:
      • Three types of people:
        • App users (e.g. Business Objects, Qlikview)
        • Scripters (e.g. SAS)
        • Hackers (whole workflow: SQL (100%) + Python(63%) + various)
      • Some issues:
        • Visualization at the data exploration stage
        • Managing the workflow
        • Isolated inaccessible 3rd party data
    • We've (as a community) figured out how to do this -- could this be our book?
      • How different successful local firms use Python in production

Postscript: IPython evangelism

Making this deck in IPython was life-changing -- Python talks belong in IPython.

How to:

  1. Make a .github.io repo (instructions)
  2. Make the notebook

     pip install ipython
     ipython notebook  # Make something.
    
    

    Remember to identify the slide type for each cell (the Cell Toolbar's 'Slideshow' option):

  3. Convert to html slideshow

     export PREFIX=http://cdn.jsdelivr.net/reveal.js/2.6.2
     ipython nbconvert <my_notebook>.ipynb \
         --to slides \
         --reveal-prefix ${PREFIX}
  4. Add the new slides to the .github.io repo. GitHub Pages serves them statically.

  5. Wait about 10 minutes and the slides are live.

Also: