Attribute: Close to Data


In [1]:
from IPython.display import display, Image, HTML
from talktools import website, nbviewer

Overview

All computations in scientific computing and data science involve data at some level. Here are some important questions about the human experience of working with data.

  • How many steps does it take to visualize a dataset, and how difficult are they?
  • How many steps does it take to explore a dataset, and how difficult are they? Exploration is the act of building intuition about a dataset. It typically involves visualization, but also includes iteration and computation.
  • How many steps does it take to transform, analyze, summarize, or model a dataset, and how difficult are they?

The number of steps required for each of these activities is determined by the software tool.

  • Many difficult steps $\rightarrow$ far from the data
  • Few easy steps $\rightarrow$ close to the data

These are attributes of software tools designed to work with data.

Here are some examples:

  • Far from the data: C, C++, Fortran, Vim, Emacs, Make
  • Close to the data: Excel, d3, Matlab, Mathematica

We are trying to make IPython as close to the data as possible.

Visualize, Interact, Compute

Data exploration is an iterative process that involves quick, repeated passes at visualization, interaction and computation.


In [2]:
Image('images/VizInteractCompute.png')


Out[2]:

Right now this cycle is still really painful. For most users, the time scale for this cycle is typically minutes or even hours. We are working to bring it down to seconds for all users.

Visualization (not plotting)

Viewing raw data in a textual format is a difficult way to explore data.


In [3]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [4]:
a = np.random.rand(25); a


Out[4]:
array([ 0.12175105,  0.08334302,  0.05530447,  0.66302344,  0.50851338,
        0.65687547,  0.72115499,  0.69456045,  0.87594787,  0.94138312,
        0.51143503,  0.14200351,  0.95525064,  0.59265453,  0.55305521,
        0.28802671,  0.91027987,  0.97277739,  0.25178276,  0.55624818,
        0.65159251,  0.46324681,  0.0921479 ,  0.24735489,  0.98898735])

To really see the data, you need richer representations.


In [5]:
hist(a)


Out[5]:
(array([5, 0, 3, 0, 3, 3, 4, 1, 1, 5]),
 array([ 0.05530447,  0.14867276,  0.24204104,  0.33540933,  0.42877762,
         0.52214591,  0.6155142 ,  0.70888248,  0.80225077,  0.89561906,
         0.98898735]),
 <a list of 10 Patch objects>)

This might be a plot, but could also be a static image, HTML, audio sample, etc.
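
To make the output of `hist(a)` above less opaque: the first array holds the count in each of 10 equal-width bins, and the second holds the 11 bin edges. The sketch below recomputes both in plain Python. `equal_width_histogram` is a hypothetical helper, the sample is freshly generated rather than the array shown above, and values that fall exactly on a bin edge may be binned slightly differently than NumPy would bin them.

```python
import random


def equal_width_histogram(values, bins=10):
    """Count values in equal-width bins spanning min(values)..max(values).

    Assumes the values are not all identical (the bin width must be non-zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # bins + 1 edges: the left edge of every bin, plus the right edge of the last
    edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts, edges


random.seed(0)
sample = [random.random() for _ in range(25)]
counts, edges = equal_width_histogram(sample)
print(counts)
```

Even this tiny summary is easier to read than 25 raw floats, which is the point of moving from text to a richer representation.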

Since 2011, IPython has had a rich display system that allows Python objects to declare non-textual representations that can be displayed in the Notebook. These rich representations include:

  • PNG/JPEG
  • HTML
  • JavaScript
  • LaTeX
  • SVG
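
As a minimal sketch of how an object can declare such a representation: IPython looks for specially named methods, such as `_repr_html_`, on the object being displayed. The `ColorSwatch` class below is a hypothetical example, not part of IPython.

```python
class ColorSwatch:
    """Hypothetical example class with both a text and an HTML representation."""

    def __init__(self, name, hex_code):
        self.name = name
        self.hex_code = hex_code

    def __repr__(self):
        # Plain-text fallback, used in a terminal
        return "ColorSwatch(%r, %r)" % (self.name, self.hex_code)

    def _repr_html_(self):
        # Rich representation picked up by the Notebook's display system
        return '<span style="background:%s; padding:0 1em"></span> %s' % (
            self.hex_code,
            self.name,
        )


swatch = ColorSwatch("teal", "#008080")
html_repr = swatch._repr_html_()
print(html_repr)
```

In a Notebook, leaving `swatch` as the last expression in a cell should render the colored box instead of the plain-text repr.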

These rich representations are displayed using IPython's display function:


In [6]:
from IPython.display import display, Audio, Latex

Throughout this talk I have been using the Image and HTML representations of objects. Here is an example of a Python object with an HTML5 audio representation:


In [7]:
a = Audio('data/Bach Cello Suite #3.wav')

In [8]:
display(a)


And a LaTeX representation of Maxwell's equations:


In [9]:
Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0 
\end{eqnarray}""")


Out[9]:
\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{eqnarray}

But this rich display system only covers part of the iterative cycle of data exploration. Most importantly, we need to interact with the data.

Interaction

For IPython 2.0 (scheduled for release in January 2014) we have developed a layered architecture for building interactivity into the Notebook. This architecture allows Python and JavaScript code to communicate seamlessly and in real time.


In [10]:
Image('images/WidgetArch.png')


Out[10]:
  • Comm: This layer allows real-time, asynchronous, bi-directional JSON messaging between your Python objects in the IPython Kernel and JavaScript. Comm instances are very lightweight; they all share an existing pair of WebSocket/ZeroMQ connections. This layer is documented in IPEP 21.
  • Widgets: Widgets synchronize Python models (traitlets) with JavaScript models (Backbone.js) and manage the lifecycle and parent/child relationships of JS/HTML views. This layer is documented in IPEP 23.
  • Interact: A high-level interface for quick-and-dirty data exploration.
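
The division of labor between the Comm and Widgets layers can be sketched with a toy model: two endpoints each hold a copy of some state and keep it in sync by exchanging JSON messages. This is purely illustrative. `ToyEndpoint` and the message layout are assumptions made for this sketch, not IPython's API, and the direct method call stands in for the WebSocket/ZeroMQ transport.

```python
import json


class ToyEndpoint:
    """Hypothetical stand-in for one side (kernel or browser) of a comm channel."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.peer = None

    def connect(self, other):
        # Link both directions so either side can push updates
        self.peer = other
        other.peer = self

    def set(self, key, value):
        """Change local state and push the change to the peer as a JSON message."""
        self.state[key] = value
        message = json.dumps({"method": "update", "state": {key: value}})
        self.peer.receive(message)

    def receive(self, message):
        """Apply a JSON update from the peer without echoing it back."""
        payload = json.loads(message)
        if payload["method"] == "update":
            self.state.update(payload["state"])


kernel_model = ToyEndpoint("python")
browser_model = ToyEndpoint("javascript")
kernel_model.connect(browser_model)

kernel_model.set("value", 0.25)   # Python code changes the model
browser_model.set("value", 0.75)  # e.g. the user drags a slider in the browser
print(kernel_model.state, browser_model.state)
```

After both updates the two copies agree, which is the property the Widgets layer maintains between its Python and JavaScript models.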

Let's go play!

Styling


In [11]:
%load_ext load_style

In [12]:
%load_style talk.css