Tanya Schlusser 11 December 2014
Slides prepared using IPython Notebook. (Awesome quick tutorial... and how to 'Markdown')
Following along? Clone this: github/tanyaschlusser/ipython_talk__OReilly_python_books
In [1]:
# See what's out there now. Pull the:
# -- media type (book | video)
# -- title
# -- publication date
import requests
from bs4 import BeautifulSoup
books_uri = "http://shop.oreilly.com/category/browse-subjects/programming/python.do?sortby=publicationDate&page=%d"
In [2]:
# Loop over all of the pages
results = []
description_results = {}
for page in range(1, 5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        # Publication year: the last numeric token in the 'directorydate' span
        yr = b.find("span", "directorydate").string.strip().split()
        while not yr[-1].isdigit():
            yr.pop()
        yr = int(yr[-1])
        title = b.find("div", "thumbheader").text.strip()
        url = b.find("div", "thumbheader").find("a")["href"]
        hasvideo = "Video" in b.text
        results.append(dict(year=yr, title=title, hasvideo=hasvideo))
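As an aside (not part of the original talk), the bare requests.get calls assume every request succeeds. A minimal defensive sketch with a timeout and a status check -- fetch_page is a hypothetical helper, not defined anywhere above:

import requests

def fetch_page(uri_template, page, timeout=10):
    # Fetch one catalog page; raise on HTTP errors instead of
    # silently parsing an error page.
    response = requests.get(uri_template % page, timeout=timeout)
    response.raise_for_status()
    return response.text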
In [3]:
# Want to
# -- plot year over year number of books
# ++ stacked plot with video + print
# -- Get all the different words in the titles
# ++ count them
# ++ and order by frequency
#
# Use the Matplotlib magic command. Magic commands start with '%'.
# This sets up to plot inline. It doesn't import anything...
# Or use %pylab inline -- this apparently imports a lot of things into
# the global namespace
#
%matplotlib inline
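For reference, %pylab inline is roughly equivalent to the following (a sketch; the exact imports depend on the IPython version):

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from pylab import *  # this is the part that floods the global namespace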
In [18]:
# For year over year I need pandas.DataFrame.groupby
# For stacked plot I need matplotlib.pyplot
# Plain dictionary for the word counts
#
import pylab
import matplotlib.pyplot as plt
import pandas as pd
In [19]:
# Year over year -- number of publications by 'video' and 'print'.
#
df = pd.DataFrame(results)
byyear = pd.crosstab(df["year"], df["hasvideo"])
byyear.rename(columns={True: 'video', False: 'print'}, inplace=True)
byyear.plot(kind="area", xlim=(2000, 2014), title="Ever increasing publications")
Out[19]: [stacked area chart: Python publications per year, print vs. video]
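The earlier comment mentions pandas.DataFrame.groupby, but the cell ends up using pd.crosstab. A rough groupby equivalent, assuming the same df (a sketch, untested against the original data):

byyear_alt = df.groupby(["year", "hasvideo"]).size().unstack().fillna(0)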
In [6]:
# Out of curiosity, what happened in 2010?
df[df["year"]==2010]
Out[6]: [DataFrame listing the books published in 2010]
In [7]:
# Break up the titles and count words in the titles.
# -- Need a regex for the punctuation (commas and colons)
# -- Need a stemmer for plurals, possessives, and verb conjugations
import re
from nltk.stem.porter import PorterStemmer

space_or_punct = re.compile(r"[\s,:\.]+")
stemmer = PorterStemmer()
title_words = {}
for r in results:
    title = space_or_punct.split(r["title"].lower())
    stemmed_title = (stemmer.stem(t) for t in title)
    for t in stemmed_title:
        # don't retain version or release numbers
        if not t[0].isdigit():
            if t not in title_words:
                title_words[t] = 1
            else:
                title_words[t] += 1

print "Total distinct words in the titles:", len(title_words), "\n"
print "\t".join(title_words.keys())
print "\n"
print "\n".join(r["title"] for r in results if not r["hasvideo"])
In [8]:
# That was useless -- almost every word except for "Python" and "Learn"
# shows up in only one title. Lame.
#
# Loop over all of the pages again and get the book descriptions.
# Maybe the descriptions reveal common topics
# (hoping for 'introductory' or 'web development' or 'machine learning')
import nltk
# Before running this you need to do nltk.download() and select
# 'stopwords' from the corpus
from nltk.corpus import stopwords

english_stops = stopwords.words("english")  # NLTK corpus names are lowercase
description_results = {}
for page in range(1, 5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        title = b.find("div", "thumbheader").text.strip()
        # Only look at the books
        if "Video" not in b.text:
            print ".",  # progress marker
            url = "http://shop.oreilly.com" + b.find("div", "thumbheader").find("a")["href"]
            result2 = requests.get(url)
            soup2 = BeautifulSoup(result2.text)
            description = soup2.find("div", "detail-description-content").text
            description_results[title] = set(
                stemmer.stem(word)
                for word in space_or_punct.split(description.lower())
                if word not in english_stops)
print
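One more aside: this loop fires dozens of requests at the shop in quick succession. A polite-scraping variant would pause between detail pages (a sketch; the one-second delay is an arbitrary choice):

import time
import requests

def polite_get(url, delay=1.0):
    # Pause before each request so the loop does not hammer the server.
    time.sleep(delay)
    return requests.get(url)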
In [9]:
# Try clustering:
# -- Distance between two book descriptions is the percent overlap
#    of words in both their descriptions
# -- Arbitrarily (from qualitative looking) decide on a threshold of
#    17% overlap in descriptions for both books to be 'similar'
#    and look at what we get
percent_overlap = []
min_intersections = []
max_intersections = []
avg_intersections = []
similar_books = {}
sorted_titles = sorted(description_results.keys())
for i in range(len(sorted_titles)):
    this_description = description_results[sorted_titles[i]]

    def get_percent_overlap(title):
        intersection_size = len(this_description.intersection(description_results[title]))
        union_size = len(this_description.union(description_results[title]))
        return 100.0 * intersection_size / union_size

    percent_overlap.append([get_percent_overlap(t) for t in sorted_titles])
    similar_books[sorted_titles[i]] = [
        t for t in sorted_titles
        if get_percent_overlap(t) > 17 and t != sorted_titles[i]
    ]
    min_intersections.append(round(min(percent_overlap[-1])))
    max_intersections.append(round(max(percent_overlap[-1])))
    # average percent overlap against all titles
    avg_intersections.append(round(sum(percent_overlap[-1]) / len(sorted_titles)))

print "\n".join("\n%s\n%s" % (k, "|".join(v)) for k, v in similar_books.iteritems())
In [10]:
from scipy.cluster.hierarchy import linkage, dendrogram
In [17]:
plt.figure(figsize=(5, 20))
data_link = linkage(percent_overlap, method='single', metric='euclidean')
den = dendrogram(data_link, labels=sorted_titles, orientation="left")
plt.ylabel('Samples', fontsize=9)
plt.xlabel('Distance')
plt.suptitle('Books clustered by description similarity', fontweight='bold', fontsize=14);
Out[17]: [dendrogram: books clustered by description similarity]
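A note on the linkage call above: it treats each row of percent_overlap as a feature vector and clusters on euclidean distances between those rows. To cluster on the pairwise similarities directly, one alternative (a sketch, not what the talk does) converts the similarity matrix to a condensed distance matrix first:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Similarity (percent shared words) -> distance; the diagonal is 0
# because every description overlaps itself 100%.
distances = 100.0 - np.array(percent_overlap)
condensed = squareform(distances, checks=False)
data_link_alt = linkage(condensed, method='single')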
In [20]:
# The output of linkage is hard for a human to interpret.
# Nest it into a JSON structure for readability.
# print data_link
human_links = dict(enumerate(sorted_titles))
index = len(human_links)
for link in data_link:
    # linkage returns float indices; cast so the dict keys stay integers
    left = int(link[0])
    right = int(link[1])
    human_links[index] = {left: human_links[left], right: human_links[right]}
    del human_links[left], human_links[right]
    index += 1

import json
with open("data/hclust.json", "w") as outfile:
    outfile.write(json.dumps(human_links))
#print human_links
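The nested result looks roughly like this (made-up titles and indices, just to show the shape; json.dumps turns the integer keys into strings):

{
  "61": {
    "59": {"12": "Learning Python", "44": "Python Pocket Reference"},
    "33": "Python Cookbook"
  },
  ...
}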
In [23]:
from IPython.display import HTML
container = """
<script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
<a href='http://bl.ocks.org/mbostock/4063570'>Attribution: Michael Bostock</a>
<div id='display_container'></div>"""
with open("data/d3-stacked-Tree.js") as infile:
display_js = infile.read()
with open("data/human_hclust.json") as infile:
the_json = infile.read()
HTML(container + display_js % the_json )
Out[23]: [interactive d3 tree of the clustered book titles]
It makes sense that topics have little overlap. Otherwise why write a different book? Themes: