In [1]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})


Out[1]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

In [2]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'linear',
              'start_slideshow_at': 'selected',
              'center': 'False',
})


Out[2]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

SciPy 2015

Austin, TX, July 6 - 12

Conference structure

This conference is different. The audience is mostly people working in science, but it also includes many people working in tech and software development. It's a very diverse, friendly crowd.

  • 2 days of tutorials
  • 3 days of main conference: 3 simultaneous tracks of talks + poster session + lightning talks
    • plenty of social events each evening
    • a job fair
  • 2 days of sprints

Tutorials

I attended a variety of tutorials. The ones I enjoyed the most were:

  • Machine Learning with Scikit-Learn (2 sessions)
  • Building Python Data Applications with Bokeh and Blaze

Machine Learning with Scikit-Learn

The full tutorial materials can be found here: https://github.com/amueller/scipy_2015_sklearn_tutorial

This was a 2-session (8-hour) tutorial covering a host of topics:

Morning Session

  • What is machine learning?
  • Supervised learning
    • Training and test data
    • Classification
    • Regression
  • Unsupervised Learning
    • Unsupervised transformers
    • Preprocessing and scaling
    • Dimensionality reduction
    • Clustering

Application: Classification of digits


In [3]:
from sklearn.datasets import load_digits
digits = load_digits()
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))


For example, we can train a Gaussian Naive Bayes classifier and use it to identify the digits in a held-out test set.


In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import train_test_split

In [6]:
digits.data.shape


Out[6]:
(1797, 64)

In [7]:
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=5)

# train the model
clf = GaussianNB()
clf.fit(X_train, y_train)

# use the model to predict the labels of the test data
predicted = clf.predict(X_test)
expected = y_test

How did we do?


In [8]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')
    
    # label the image with the target value
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')



In [9]:
print(clf.score(X_test, y_test))


0.811111111111
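
A single accuracy number hides which digits get confused with which. As a quick follow-up (not part of the tutorial materials, just a sketch using scikit-learn's metrics module), a confusion matrix breaks the errors down per class:

from sklearn.metrics import confusion_matrix

# rows: true digit, columns: predicted digit; off-diagonal entries are errors
print(confusion_matrix(expected, predicted))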

Application: Eigenfaces


In [10]:
from sklearn import datasets
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                                       data_home='../tutorials/scipy_2015_sklearn_tutorial/notebooks/datasets/')
lfw_people.data.shape


Out[10]:
(1288, 1850)

In [11]:
fig = plt.figure(figsize=(14, 4))
# plot several images
for i in range(20):
    ax = fig.add_subplot(2, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)


We can perform PCA to extract features from the set of images:


In [12]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                        lfw_people.data, 
                                        lfw_people.target, 
                                        random_state=0)

print(X_train.shape, X_test.shape)


(966, 1850) (322, 1850)

In [13]:
from sklearn import decomposition
pca = decomposition.RandomizedPCA(n_components=150, whiten=True)
pca.fit(X_train)


Out[13]:
RandomizedPCA(copy=True, iterated_power=3, n_components=150,
       random_state=None, whiten=True)

And we can view the mean face:


In [14]:
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)


Out[14]:
<matplotlib.image.AxesImage at 0x7fb8a4ffd668>

As well as the eigenfaces, if we like:


In [15]:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)
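
Those 150 components act as a compressed representation: each face can be projected down to 150 numbers and approximately reconstructed from them. Here is a sketch of the round trip (transform and inverse_transform are standard scikit-learn PCA methods; this snippet is mine, not the tutorial's):

# project the test faces down to 150 components, then map back to pixel space
X_reduced = pca.transform(X_test)             # shape: (n_samples, 150)
X_approx = pca.inverse_transform(X_reduced)   # shape: (n_samples, 1850)

# compare an original face with its 150-component reconstruction
fig, axes = plt.subplots(1, 2, figsize=(4, 3))
axes[0].imshow(X_test[0].reshape((50, 37)), cmap=plt.cm.bone)
axes[1].imshow(X_approx[0].reshape((50, 37)), cmap=plt.cm.bone)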


And plenty more example activities:

  • Methods: Text feature extraction, bag of words (see the sketch after this list)
  • Application: SMS spam detection
  • Summary: Model building and generalization
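
For those unfamiliar with the bag-of-words idea: each document is mapped to a vector of word counts over a shared vocabulary, which is what lets text flow into ordinary classifiers. A minimal illustration with scikit-learn's CountVectorizer (my example, not the tutorial's):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize, claim your prize now", "meeting moved to noon"]
vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: documents x vocabulary
print(vec.vocabulary_)        # maps each word to its column index
print(X.toarray())            # word counts per document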

Afternoon Session

  • Cross-Validation
  • Model Complexity: Overfitting and underfitting
  • Complexity of various model types
  • Grid search for adjusting hyperparameters
  • Basic regression with cross-validation
  • Application: Titanic survival with Random Forest
  • Building Pipelines (sketched after this list)
    • Motivation and Basics
    • Preprocessing and Classification
    • Grid-searching parameters of the feature extraction
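
The pipeline idea in one sketch: chain the preprocessing and the classifier into a single estimator, then let a grid search tune hyperparameters across the whole chain. This snippet uses current import paths; in 2015-era scikit-learn these utilities lived in sklearn.grid_search and sklearn.cross_validation:

from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2015

digits = load_digits()

# a pipeline behaves like one estimator: fit() scales the data, then trains the SVM
pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])

# the step-name__parameter syntax lets the grid search reach inside the pipeline
grid = GridSearchCV(pipe, param_grid={'svm__C': [0.1, 1, 10]}, cv=5)
grid.fit(digits.data, digits.target)
print(grid.best_params_)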

And plenty of examples:

  • Application: Image classification

  • Model complexity, learning curves and validation curves

  • In-Depth supervised models
    • Linear Models
    • Kernel SVMs
    • Trees and Forests
  • Learning with Big Data
    • Out-of-core learning
    • The hashing trick for large text corpora (see the sketch below)
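
The hashing trick replaces the vocabulary dictionary with a hash function, so the feature space has a fixed size and no vocabulary has to be held in memory; combined with partial_fit, this is what makes out-of-core text learning work. A minimal sketch (mine, not the tutorial's) with scikit-learn:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# no vocabulary is stored: words are hashed straight into 2**18 columns
vec = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

# stream over batches that never all fit in memory at once
batches = [(["free prize now", "lunch at noon"], [1, 0]),
           (["claim your prize", "see you at the meeting"], [1, 0])]
for texts, labels in batches:
    X = vec.transform(texts)   # stateless: no fit() needed
    clf.partial_fit(X, labels, classes=[0, 1])

print(clf.predict(vec.transform(["win a free prize"])))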

Building Python Data Applications with Blaze and Bokeh

The full tutorial materials can be found here: https://github.com/chdoig/scipy2015-blaze-bokeh

This session mainly covered Bokeh's API, which makes it easy to build interactive plots that render in the browser.


In [16]:
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
output_notebook()

# Get data
df = pd.read_csv('../tutorials/scipy2015-blaze-bokeh/data/Land_Ocean_Monthly_Anomaly_Average.csv')

# Process data
df['datetime'] = pd.to_datetime(df['datetime'])
df.head()


BokehJS successfully loaded.
Out[16]:
datetime anomaly uncert
0 1850-01-01 -0.699 0.411
1 1850-02-01 -0.210 0.469
2 1850-03-01 -0.349 0.377
3 1850-04-01 -0.625 0.319
4 1850-05-01 -0.594 0.317

In [17]:
# Create plot
f = figure(plot_width=800, plot_height=400)
f.line(df['datetime'], df['anomaly'], color='skyblue', legend='Temp')
f.line(df['datetime'], pd.rolling_mean(df['anomaly'], window=10), color='grey', legend='Rolling mean')
f.title = 'Temperature anomaly with time'

# Show plot
show(f)


My favorite talks

There were a ton of great talks, and quite a few were relevant to the stuff we do. Here's a selection of the ones that stood out to me, with links.

Dask

Dask provides data structures that support complex out-of-core operations through an API similar to NumPy and pandas.
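
To make that concrete, here is roughly what the dask.array interface looks like (a minimal sketch; the API mirrors NumPy, but work is split into chunks and only evaluated on .compute()):

import dask.array as da

# a 20000x20000 array that is never materialized in full;
# it is made of 1000x1000 numpy chunks
x = da.ones((20000, 20000), chunks=(1000, 1000))

# builds a task graph instead of computing immediately
result = (x + x.T).mean(axis=0)

# executes the graph chunk by chunk, keeping memory bounded
print(result.compute()[:5])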


In [18]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=7, seconds=45).total_seconds())
YouTubeVideo("1kkFZ4P-XHg", start=start, autoplay=0, theme="light",
             color="blue", height=400, width=700)


Out[18]:

HDF5 is Eating the World

A talk by the author of h5py on how to use HDF5 effectively.
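
For flavor, this is the kind of pattern the talk advocates (a sketch using h5py; the chunking and compression options shown here are standard h5py features, not specifics from the talk):

import numpy as np
import h5py

data = np.random.rand(1000, 1000)

# chunked, compressed storage lets you read slices without loading everything
with h5py.File('example.h5', 'w') as f:
    dset = f.create_dataset('measurements', data=data,
                            chunks=(100, 100), compression='gzip')
    dset.attrs['units'] = 'volts'   # metadata travels with the data

with h5py.File('example.h5', 'r') as f:
    block = f['measurements'][:10, :10]   # reads only the chunks it needs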

Deep Learning: Tips from the Road

A talk by Kyle Kastner (one of the scikit-learn tutorial instructors) on deep learning, its uses, and its limitations. A great, short intro to neural networks.


In [19]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=0, seconds=40).total_seconds())
YouTubeVideo("TBBtOeY2Q78", start=start, autoplay=0, theme="light", 
             color="red", height=400, width=700)


Out[19]:

Agent Based Modeling in Python with Mesa

A talk by the authors of Mesa, a new package that fills a hole in the Python ecosystem: building agent-based simulations.
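
For a taste of what that looks like, here is a minimal agent-based model in the style of Mesa's introductory wealth-transfer example (written from memory against Mesa's classic API, so treat the details as approximate):

import random
from mesa import Agent, Model
from mesa.time import RandomActivation

class MoneyAgent(Agent):
    """An agent that gives one unit of wealth to a random other agent."""
    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)
        self.wealth = 1

    def step(self):
        if self.wealth > 0:
            other = random.choice(self.model.schedule.agents)
            other.wealth += 1
            self.wealth -= 1

class MoneyModel(Model):
    """A fixed population of agents, activated in random order each step."""
    def __init__(self, n_agents):
        super().__init__()
        self.schedule = RandomActivation(self)
        for i in range(n_agents):
            self.schedule.add(MoneyAgent(i, self))

    def step(self):
        self.schedule.step()

model = MoneyModel(50)
for _ in range(100):
    model.step()
print(max(agent.wealth for agent in model.schedule.agents))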

VisPy: Harnessing The GPU For Fast, High Level Visualization

A great talk by Luke Campagnola on VisPy's ability to render complex visualizations in real time using the GPU and OpenGL.


In [20]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=11, seconds=43).total_seconds())
YouTubeVideo("_3YoaeoiIFI", start=start, autoplay=0, theme="light", 
             color="red", height=400, width=700)


Out[20]:

And many more!

Great keynotes!

There were three keynote talks, one given each day of the main conference. All three are worth watching.

Things I missed (but will watch on YouTube later)

Most important:

  • New contacts: I met a lot of people, and I'm already working with some of them on new projects.
  • New ideas: plenty of raw material to work with.
  • Inspiration! When you're excited about working on things, that's the first step to doing something cool.