Python for fun and profit

Juan Luis Cano Rodríguez
Madrid, 2016-05-13 @ ETS Asset Management Factory

Outline

  • Introduction
  • Python for Data Science
  • Python for IT
  • General advice
  • Conclusions

Outline

  • Introduction
  • Python for Data Science

    • Interactive computation with Jupyter
    • Numerical analysis with NumPy, SciPy
    • Visualization with matplotlib and others
    • Data manipulation with pandas
    • Machine Learning with scikit-learn
  • Python for IT

    • Data gathering with Requests and Scrapy
    • Information extraction with lxml, BeautifulSoup and others
    • User interfaces with PyQt, xlwings and others
    • Other: memcached, SOA
  • General advice

    • Python packaging
    • The future of Python
  • Conclusions

>>> print(self)

  • Almost Aerospace Engineer
  • Quant Developer for BBVA at Indizen
  • Writer and furious tweeter at Pybonacci
  • Chair and BDFL of the Python España non-profit
  • Co-creator and charismatic leader of AeroPython (*not the Lorena Barba course)
  • When time permits (rarely), writes some open source Python code

Python for Data Science

  • Python is a dynamic, interpreted* language that is easy to learn
  • Very popular in science, research
  • Rich ecosystem of packages that interoperate
  • Compiled languages (Fortran, C/C++) are wrapped and exposed from Python through convenient interfaces

Jupyter

  • Interactive computation environment in a browser
  • Traces its roots to IPython, created in 2001
  • Nowadays it's language-agnostic (40 languages)

Jupyter

It's a notebook!

  • Code is computed in cells
  • These can contain text, code, images, videos...
  • All resulting plots can be integrated in the interface
  • We can export it to different formats using nbconvert or from the UI

It's interactive!


In [30]:
from ipywidgets import interact, fixed

In [34]:
from sympy import init_printing, Symbol, Eq, factor
init_printing(use_latex=True)

x = Symbol('x')

def factorit(n):
    return Eq(x**n-1, factor(x**n-1))

In [35]:
interact(factorit, n=(2,40))


$$x^{9} - 1 = \left(x - 1\right) \left(x^{2} + x + 1\right) \left(x^{6} + x^{3} + 1\right)$$

In [28]:
# Import matplotlib (plotting), skimage (image processing) and interact (user interfaces)
# This enables their use in the Notebook.
%matplotlib inline
from matplotlib import pyplot as plt

from skimage import data
from skimage.feature import blob_doh
from skimage.color import rgb2gray

# Extract the first 500px square of the Hubble Deep Field.
image = data.hubble_deep_field()[0:500, 0:500]
image_gray = rgb2gray(image)

def plot_blobs(max_sigma=30, threshold=0.1, gray=False):
    """
    Plot the image and the blobs that have been found.
    """
    blobs = blob_doh(image_gray, max_sigma=max_sigma, threshold=threshold)
    
    fig, ax = plt.subplots(figsize=(8,8))
    ax.set_title('Galaxies in the Hubble Deep Field')
    
    if gray:
        ax.imshow(image_gray, interpolation='nearest', cmap='gray_r')
        circle_color = 'red'
    else:
        ax.imshow(image, interpolation='nearest')
        circle_color = 'yellow'
    for blob in blobs:
        y, x, r = blob
        c = plt.Circle((x, y), r, color=circle_color, linewidth=2, fill=False)
        ax.add_patch(c)

In [29]:
interact(plot_blobs, max_sigma=(10, 40, 2), threshold=(0.005, 0.02, 0.001))


NumPy

  • N-dimensional data structure.
  • Homogeneously typed.
  • Efficient!

A universal function (or ufunc for short) is a function that operates on ndarrays: a "vectorized function".


In [8]:
import numpy as np

In [9]:
my_list = list(range(0, 100000))
res1 = %timeit -o sum(my_list)


1000 loops, best of 3: 1.14 ms per loop

In [10]:
array = np.arange(0, 100000)
res2 = %timeit -o np.sum(array)


10000 loops, best of 3: 61.1 µs per loop

In [11]:
res1.best / res2.best


Out[11]:
18.68618617922427

NumPy is much more:

  • Advanced manipulation tricks: broadcasting, fancy indexing
  • Functions: generalized linear algebra, Fast Fourier transforms
  • Use case:
    • In-memory, fits-in-my-computer, homogeneous data
    • Easily vectorized operations
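The broadcasting and fancy indexing tricks mentioned above fit in a few lines; a minimal sketch:

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4)
# grid with no explicit loop or tiling.
col = np.arange(3).reshape(3, 1)
row = np.arange(4) * 10
grid = col + row          # shape (3, 4); grid[i, j] == i + 10 * j

# Fancy indexing: pick arbitrary positions with an integer array,
# or filter with a boolean mask.
a = np.array([5, 10, 15, 20, 25])
picked = a[[0, 2, 4]]     # array([ 5, 15, 25])
masked = a[a > 12]        # array([15, 20, 25])
```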

SciPy

General purpose scientific computing library

  • scipy.linalg: linear algebra, using the optimized ATLAS LAPACK and BLAS libraries
  • scipy.stats: distributions, statistical functions...
  • scipy.integrate: integration of functions and ODEs
  • scipy.optimize: local and global optimization, fitting, root finding...
  • scipy.interpolate: interpolation, splines...
  • scipy.fftpack: Fourier transforms
  • scipy.signal: signal processing
  • scipy.special: special functions
  • scipy.io: reading/writing scientific data formats
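Two of these submodules in action; a minimal sketch using quad (from scipy.integrate) and brentq (from scipy.optimize):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# Definite integral of sin(x) over [0, pi]; the exact value is 2.
value, abs_error = quad(np.sin, 0, np.pi)

# Root of cos(x) in [0, 2]; the exact value is pi / 2.
root = brentq(np.cos, 0, 2)
```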

matplotlib

  • The father of all Python visualization packages
  • Modeled after the MATLAB plotting API
  • Powerful and versatile, but often complex and not so well documented
  • Undergoing a deep default style change

In [2]:
# This line integrates matplotlib with the notebook
%matplotlib inline

import matplotlib.pyplot as plt

In [3]:
import numpy as np
x = np.linspace(-2, 10)
plt.plot(x, np.sin(x) / x)


Out[3]:
[<matplotlib.lines.Line2D at 0x7f860913cac8>]

In [4]:
def g(x, y):
    return np.cos(x) + np.sin(y) ** 2

x = np.linspace(-2, 3, 1000)
y = np.linspace(-2, 3, 1000)

xx, yy = np.meshgrid(x, y)
zz = g(xx, yy)

fig = plt.figure(figsize=(6, 6))
cs = plt.contourf(xx, yy, zz, np.linspace(-1, 2, 13), cmap=plt.cm.viridis)
plt.colorbar()

cs = plt.contour(xx, yy, zz, np.linspace(-1, 2, 13), colors='k')

plt.clabel(cs)

plt.xlabel("x")
plt.ylabel("y")
plt.title(r"Function $g(x, y) = \cos{x} + \sin^2{y}$")
plt.close()

In [5]:
fig


Out[5]:
[figure: filled contour plot of g(x, y) with labeled contour lines]
There are many alternatives to matplotlib, each one with its use cases, design decisions, and tradeoffs. Here are some of them:

  • seaborn: High level layer on top of matplotlib, easier API and beautiful defaults for common visualizations
  • ggplot: For those who prefer R-like plotting (API and appearance)
  • plotly: 2D and 3D interactive plots in the browser as a web service
  • Bokeh: targets modern web browsers and big data
  • pyqtgraph: Qt embedding, realtime plots

Others: pygal, mpld3, bqplot...

Use the best tool for the job! And in case of doubt, just get matplotlib :)

pandas

  • High-performance, easy-to-use data structures and data analysis tools
  • Inspired by R DataFrames
  • Not just NumPy on steroids
  • Input/Output functions for a variety of formats
  • SQL and query-like operations

In [2]:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df



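A minimal sketch of the query- and SQL-like operations mentioned above, with a tiny made-up frame:

```python
import pandas as pd

# A small frame with a categorical key (the values are made up).
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# WHERE-like filtering...
high = df.query("value > 1.5")

# ...and GROUP BY-like aggregation.
means = df.groupby("group")["value"].mean()
```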

scikit-learn and others

  • scikit-learn: A high quality machine learning package
  • Theano: primitives for building neural networks
  • TensorFlow: Google's take on machine learning and deep learning
  • keras: deep learning built on top of Theano and TensorFlow

Many possibilities!
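All scikit-learn estimators share a fit/predict interface; here is that interface sketched by hand as a toy nearest-centroid classifier in plain NumPy (the class is invented for illustration, not a scikit-learn API):

```python
import numpy as np

class NearestCentroidSketch:
    """Toy classifier exposing the scikit-learn-style fit/predict API."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One centroid (feature-wise mean) per class.
        self.centroids_ = np.array([X[y == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, X):
        # Distance of every sample to every centroid; pick the closest.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :],
                           axis=2)
        return self.classes_[d.argmin(axis=1)]

# Two well-separated clusters as training data.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
model = NearestCentroidSketch().fit(X, y)
labels = model.predict(np.array([[0.05, 0.1], [5.1, 5.0]]))
```

Swapping this toy class for any real scikit-learn estimator leaves the surrounding code unchanged, which is what makes the library so pleasant to experiment with.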

Infrastructure

Information retrieval and extraction

  • Web scraping: Requests, Scrapy
  • Information extraction: lxml, json, BeautifulSoup, pyparsing
    • Many options depending on the specific format
  • Cache systems: memcached, redis-py
    • Python wrappers for existing, mature systems
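Of the extraction tools above, json ships with the standard library; pulling fields out of an API-style response takes a few lines (the payload here is invented):

```python
import json

# The sort of body Requests or Scrapy would hand back from an API.
payload = """
{"results": [{"name": "ETS", "city": "Madrid"},
             {"name": "Pybonacci", "city": null}]}
"""

data = json.loads(payload)
# JSON null comes through as Python None, so a truthiness check skips it.
cities = [item["city"] for item in data["results"] if item["city"]]
```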

User interfaces and applications

  • GUI toolkits
    • PyQt is the most powerful, but watch out for its license terms
    • Other: Tkinter, PyGTK...
    • Are desktop apps dead anyway? Perhaps move to browser?
  • Service-oriented architectures (SOA)
    • Flask: a small web framework focused on small services
    • Django + Django REST framework: More complex, difficult to master, very powerful
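The kind of little service Flask excels at can be sketched with nothing but the standard library's WSGI machinery (the route and payload are made up); Flask adds routing, request parsing and much more on top of this same callable convention:

```python
import json
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """A minimal WSGI application exposing one JSON endpoint."""
    if environ["PATH_INFO"] == "/status":
        body = json.dumps({"status": "ok"}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# Exercise the app directly, without starting a real server.
environ = {}
setup_testing_defaults(environ)
environ["PATH_INFO"] = "/status"

status_seen = []
def start_response(status, headers):
    status_seen.append(status)

body = b"".join(app(environ, start_response))
```

A real deployment would hand app to wsgiref.simple_server.make_server (or any WSGI server); with Flask the same endpoint is a two-line @app.route function.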

Python + Excel

  • Most interesting option: xlwings
    • Possibility to create User Defined Functions (UDFs) in Python to be used from Excel
    • Also, call VBA Subs and Functions from Python!
    • Creation of Excel add-ins
    • Support for pandas DataFrames, charts, CSE formulas and more
    • Debugging helpers, easier than you might think!
  • Challenges: deployment
  • Other options: openpyxl, xlwt, XlsxWriter

General advice

Python packaging

Remember this picture? It's so 2012!

And by the way, a bit too optimistic:

  • setuptools is back
  • easy_install is not gone
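On the bright side, the minimal metadata setuptools needs has not changed much; a sketch of a setup.py (name, version and dependencies are placeholders):

```python
# setup.py -- consumed by `pip install .` or `python setup.py sdist`.
from setuptools import setup, find_packages

setup(
    name="mypackage",              # placeholder project name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["numpy"],    # runtime dependencies
)
```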

The future of Python

  • Origins: 1995-2005
    • Python is first used in science
    • Thin wrappers around LAPACK and other Netlib libraries
    • No notion of modern good practices, GitHub, or the like
    • Several array packages
    • In 2001 IPython and matplotlib appear
    • Year 2002 marks the beginning of the digital age
    • In 2005 NumPy is born to rule them all
  • Growth: 2005-2015
    • The community starts gaining traction
    • GitHub brings a new era of online collaboration; development pace accelerates
    • In 2008 pandas is born
    • In 2012 IPython receives $1.15M from the Sloan Foundation and Continuum Analytics is born
    • Big Data starts getting more and more attention

"Prediction is very difficult, especially about the future."

  • Future: 2015-2025?
    • In 2016 Continuum brings Python to HDFS
    • Highly opinionated: Oracle v. Google poses a threat to Java
    • The Jupyter project diversifies, different notebook interfaces and use cases appear
    • Replacements for NumPy and pandas are developed (DyND, Blaze)
    • Julia and other competitors mature
    • Python 2 will reach EOL in 2020

Muchas gracias 😊

Per Python ad astra!