Gaps in O'Reilly's Python repertoire

Tanya Schlusser 11 December 2014

Slides prepared using IPython Notebook. (Awesome quick tutorial... and how to 'Markdown')

Following along? Clone this: github/tanyaschlusser/ipython_talk__OReilly_python_books



There are over 120 Python publications by O'Reilly alone

Approach it systematically:

  1. See what's out there now
  2. Think about what we can contribute
  3. Lots of writing

So, what's out there now?

Of course we're going to use Python to find out. And, for this talk, the universe is only as big as O'Reilly's catalog.

In [1]:
# See what's out there now. Pull the:
#  -- media type (book | video)
#  -- title
#  -- publication date
import requests
from bs4 import BeautifulSoup

books_uri = ""

In [2]:
# Loop over all of the pages
results = []
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        yr = b.find("span", "directorydate").string.strip().split()
        # Drop trailing tokens until the last one is the numeric year
        while not yr[-1].isdigit():
            yr.pop()
        yr = int(yr[-1])
        title = b.find("div", "thumbheader").text.strip()
        url = b.find("div", "thumbheader").find("a")["href"]
        hasvideo = "Video" in b.text
        results.append(dict(year=yr, title=title, hasvideo=hasvideo))
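
The year-parsing step in the loop above can be tried on its own. This is a minimal sketch, assuming the date text ends with a year that may be followed by non-numeric tokens; the helper name parse_year and the sample strings are hypothetical and not taken from the scraped pages.

```python
def parse_year(date_text):
    """Return the last all-digit token in date_text as an int, or None."""
    tokens = date_text.strip().split()
    # Drop trailing tokens that are not numbers (hypothetical sample formats)
    while tokens and not tokens[-1].isdigit():
        tokens.pop()
    return int(tokens[-1]) if tokens else None


print(parse_year("  June 2010 "))
print(parse_year("December 2014 (Early Release)"))
print(parse_year("Coming soon"))
```

Returning None instead of raising keeps one odd date string from stopping the whole scrape; the caller can decide whether to skip that entry.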

In [3]:
# Want to
#  -- plot year over year number of books
#      ++ stacked plot with video + print
#  -- Get all the different words in the titles
#     ++ count them
#     ++ and order by frequency
# Use the Matplotlib magic command. Magic commands start with '%'.
# This sets up to plot inline. It doesn't import anything...
# Or use %pylab inline -- this apparently imports a lot of things into
# the global namespace
%matplotlib inline

In [18]:
# For year over year I need pandas.DataFrame.groupby
# For stacked plot I need matplotlib.pyplot
# Plain dictionary for the word counts
import pylab
import matplotlib.pyplot as plt
import pandas as pd

In [19]:
# Year over year -- number of publications by 'video' and 'print'.
df = pd.DataFrame(results)
byyear = pd.crosstab(df["year"],df["hasvideo"])
byyear.rename(columns={True:'video', False:'print'}, inplace=True)
byyear.plot(kind="area", xlim=(2000,2014), title="Ever increasing publications")

[Figure: stacked area plot "Ever increasing publications", print and video titles per year, 2000 to 2014]


That actually wasn't very satisfying...

In [6]:
# Out of curiosity, what happened in 2010?
df[df["year"] == 2010]

    hasvideo  title                                          year
77  False     Programming Python, 4th Edition                2010
78  False     Python 2.6 Text Processing Beginner's Guide    2010
79  False     Python Geospatial Development                  2010
80  False     wxPython 2.8 Application Development Cookbook  2010
81  False     Python 2.6 Graphics Cookbook                   2010
82  False     Real World Instrumentation with Python         2010
83  False     Head First Python                              2010
84  False     Python Text Processing with NLTK 2.0 Cookbook  2010
85  False     MySQL for Python                               2010
86  False     Python Multimedia                              2010
87  True      Practical Python Programming: Callbacks        2010
88  False     Python 3 Object Oriented Programming           2010
89  False     Spring Python 1.1                              2010
90  False     Blender 2.49 Scripting                         2010
91  False     Professional IronPython                        2010
92  False     Grok 1.0 Web Development                       2010
93  False     Beginning Python                               2010
94  False     Python Testing                                 2010

In [7]:
# Break up the titles and count words in the titles.
#  -- Need a regex for the punctuation (commas and colons)
#  -- Need a stemmer for plurals, possessives, and verb conjugations
import re
from nltk.stem.porter import PorterStemmer

space_or_punct = re.compile(r"[\s,:\.]+")
stemmer = PorterStemmer()

title_words = {}
for r in results:
    title = space_or_punct.split( r["title"].lower() )
    stemmed_title = (stemmer.stem(t) for t in title)
    for t in stemmed_title:
        # don't retain version or release numbers (or empty strings)
        if t and not t[0].isdigit():
            if t not in title_words:
                title_words[t] = 1
            else:
                title_words[t] += 1

print "Total distinct words in the titles:", len(title_words), "\n"
print "\t".join(title_words.keys())
print "\n".join(r["title"] for r in results if not r["hasvideo"])

Total distinct words in the titles: 158 

expert	text	sage	violent	mysql	languag	blender	xml	web	flask	sympi	intermedi	to	program	black	instrument	[how-to]	hat	introduc	real	applic	maya	get	python	next-gener	financ	autom	pocket	framework	game	hotshot	world	introduct	gray	edit	gui	scikit-learn	win32	stuff	nutshel	pygam	vision	compil	kivi	spring	stat	wxpython	video	bore	orient	librari	freecad	&	network	for	pattern	multimedia	playground	refer	machin	model	cython	raspberri	matplotlib	standard	test-driven	ipython	on	pypars	instant	forens	hdf5	of	script	social	studio	head	think	first	and	beginner'	app	profession	api	array	guid	use	opencv	bioinformat	interact	numpi	system	start	secret	master	natur	pi	analysi	web2pi	tool	probabilist	visual	sqlalchemi	tkinter	ironpython	with	starter	twist	graphic	geospati	work	learn	design	object-ori	comput	cherrypi	creat	process	lightweight	mine	agent	an	high	aw	in	crash	cours	develop	how-to	callback	perform	make	pysid	complex	build	test	complet	grok	begin	express	oscon	mongodb	object	regular	data	parallel	kid	a	essenti	practic	algorithm	django	fluent	dummi	programm	the	cookbook	nltk
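
The counting loop in cell In [7] can also be written with collections.Counter. This is a sketch, not the code used for the output above: the function name count_title_words and the sample titles are made up, and the stemmer is a parameter that defaults to "no stemming" so the example runs without NLTK. Passing PorterStemmer().stem would bring it closer to the original cell.

```python
import re
from collections import Counter

SPACE_OR_PUNCT = re.compile(r"[\s,:.]+")


def count_title_words(titles, stem=lambda word: word):
    """Count (optionally stemmed) words across titles, skipping version numbers."""
    counts = Counter()
    for title in titles:
        for token in SPACE_OR_PUNCT.split(title.lower()):
            # Skip empty strings left by leading/trailing punctuation,
            # and tokens that start with a digit (versions, editions)
            if token and not token[0].isdigit():
                counts[stem(token)] += 1
    return counts


# Hypothetical titles, for illustration only
sample_titles = ["Learning Python, 5th Edition", "Python Cookbook", "Learning Flask"]
counts = count_title_words(sample_titles)
print(counts.most_common(2))
```

Counter also gives the "order by frequency" step for free through most_common().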

In [8]:
# That was useless -- almost every word except for "Python" and "Learn"
# shows up only in one title. Lame.
# Loop over all of the pages and get the book descriptions.
# Maybe see if some things in the descriptions show common topics
# (hoping for 'introductory' or 'web development' or 'machine learning')
import nltk
# Before running the code below, run nltk.download() and select 'stopwords' from the corpora
from nltk.corpus import stopwords

english_stops = stopwords.words("english")
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        title = b.find("div", "thumbheader").text.strip()
        # Only look at the books
        if not "Video" in b.text:
            print ".",
            url = "" + b.find("div", "thumbheader").find("a")["href"]
            result2 = requests.get(url)
            soup2 = BeautifulSoup(result2.text)
            description = soup2.find("div", "detail-description-content").text
            description_results[title] = set([stemmer.stem(word)
                                          for word in space_or_punct.split(description.lower())
                                          if word not in english_stops])

In [9]:
# Try clustering:
#   -- Distance between two book descriptions is the percent overlap
#      of words in both their descriptions
#   -- Arbitrarily (from qualitative looking) decide on a threshold of
#      17% overlap in descriptions for both books to be 'similar'
#      and look at what we get
percent_overlap = []
min_intersections = []
max_intersections = []
avg_intersections = []
similar_books = {}

sorted_titles = sorted(description_results.keys())
for i in range(len(sorted_titles)):
    this_description = description_results[sorted_titles[i]]
    def get_percent_overlap(title):
        intersection_size = len(this_description.intersection(description_results[title]))
        union_size = len(this_description.union(description_results[title]))
        return (intersection_size * 100) / union_size
    percent_overlap.append([get_percent_overlap(t) for t in sorted_titles])
    similar_books[sorted_titles[i]] = [
            t for t in sorted_titles
            if get_percent_overlap(t) > 17 and t != sorted_titles[i]
    ]
    avg_intersections.append(round(100 * sum(percent_overlap[-1]) / len(sorted_titles)))

print "\n".join("\n%s\n%s" % (k, "|".join(v)) for k, v in similar_books.iteritems())
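
The overlap measure described in the comments above is intersection size over union size (the Jaccard index), expressed as a percentage. Here is a small worked sketch on made-up word sets; the book names and words are hypothetical, and the 17% cutoff is the one chosen in the cell above.

```python
def percent_overlap(words_a, words_b):
    """Jaccard similarity of two word sets, as a percentage (0 to 100)."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(union)


# Hypothetical stemmed descriptions, for illustration only
descriptions = {
    "Book A": {"python", "web", "django", "templat"},
    "Book B": {"python", "web", "flask"},
    "Book C": {"numpi", "array", "python"},
}

similar = {}
for title in sorted(descriptions):
    similar[title] = [
        other for other in sorted(descriptions)
        if other != title
        and percent_overlap(descriptions[title], descriptions[other]) > 17
    ]
    print("%s -> %s" % (title, similar[title]))
```

Book A and Book B share 2 of 5 distinct words (40%), so they count as similar; Book A and Book C share 1 of 6 (about 16.7%), which falls just under the threshold.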

In [10]:
from scipy.cluster.hierarchy import linkage, dendrogram

In [17]:
data_link = linkage(percent_overlap, method='single', metric='euclidean')

den = dendrogram(data_link,labels=sorted_titles, orientation="left")

plt.ylabel('Samples', fontsize=9)
plt.suptitle('Books clustered by description similarity', fontweight='bold', fontsize=14);

[Figure: dendrogram "Books clustered by description similarity", one leaf per book title]
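
The linkage call in cell In [17] treats each row of the overlap table as a feature vector and measures Euclidean distance between rows. One alternative, shown only as a sketch and not as the method used in this talk, is to turn percent overlap into a distance (100 minus overlap) and give SciPy a condensed distance matrix. The 3 x 3 overlap values below are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric percent-overlap table for three books
overlap = np.array([
    [100.0, 40.0, 10.0],
    [40.0, 100.0, 20.0],
    [10.0, 20.0, 100.0],
])

# Turn similarity into distance; the diagonal becomes 0
distance = 100.0 - overlap
# squareform converts the square matrix into the condensed form linkage accepts
condensed = squareform(distance)
links = linkage(condensed, method="single")
print(links)
```

Each row of links holds the two cluster ids that were merged, the distance at which they merged, and the size of the new cluster.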

In [20]:
# The output of linkage is uninterpretable for a human.
# Nest them in a JSON for readability
# print data_link

human_links = dict(enumerate(sorted_titles))
index = len(human_links)
for link in data_link:
    left = link[0]
    right = link[1]
    human_links[index] = {left:human_links[left], right:human_links[right]}
    del(human_links[left], human_links[right])
    index += 1

import json
with open("data/hclust.json", "w") as outfile:
    json.dump(human_links, outfile)

#print human_links
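
The nesting idea in cell In [20] can be sketched without SciPy by feeding in linkage-style rows directly. The function name nest_linkage, the labels, and the rows are hypothetical. The "name" and "children" keys are an assumption (a layout that many D3 tree examples read), not something confirmed by the files used in this talk.

```python
import json


def nest_linkage(labels, links):
    """Build a nested {"name", "children"} tree from linkage rows.

    Each row is (left, right, distance, size). Merged clusters get the
    ids len(labels), len(labels) + 1, ... in row order.
    """
    nodes = {i: {"name": label} for i, label in enumerate(labels)}
    next_id = len(labels)
    for row in links:
        left, right = int(row[0]), int(row[1])
        nodes[next_id] = {
            "name": "cluster %d" % next_id,
            "distance": float(row[2]),
            "children": [nodes.pop(left), nodes.pop(right)],
        }
        next_id += 1
    # After the last merge only the root is left
    return nodes[next_id - 1]


# Hypothetical labels and linkage rows for three books
labels = ["Book A", "Book B", "Book C"]
links = [(0, 1, 60.0, 2), (2, 3, 80.0, 3)]
tree = nest_linkage(labels, links)
print(json.dumps(tree, indent=2, sort_keys=True))
```

Casting the ids to int matters because SciPy returns them as floats; integer ids keep the dictionary lookups and the JSON output tidy.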

In [23]:
from IPython.display import HTML

container = """
    <script type="text/javascript" src=""></script>
    <a href=''>Attribution: Michael Bostock</a>
    <div id='display_container'></div>"""

with open("data/d3-stacked-Tree.js") as infile:
    display_js = infile.read()

with open("data/human_hclust.json") as infile:
    the_json = infile.read()

HTML(container + display_js % the_json )

Attribution: Michael Bostock

Quick interpretation

It makes sense that topics have little overlap. Otherwise why write a different book? Themes:

  • Pure programming (47)
    • Beginner (26): Intro to | Learning << insert subject here >>
    • Advanced (17): Testing | Optimization
    • Concise (4): Cookbooks | Essentials | Pocket refs
  • Games / Web (30)
    • Visualization (6): Maya, Blender, CAD
    • Development Tools (19): Django, Flask, Grok, Kivy, etc
    • Data (5): SQL | NoSQL | HDF5 | AWS
  • Scientific / Hobby (31)
    • Base tools (14): scikit, NumPy, nltk
    • Hardware interface (4): (Embedded code / instrumentation)
    • Applications (13): Stats / Machine Learning, Geospatial, Financial, Security

Thank you -- what next?

  • Contact me if you want to do something
  • Ideas welcome
  • This may or may not happen
    • depends on interest
    • and whether interest can be sustained