In [1]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})


Out[1]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

In [2]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'linear',
              'start_slideshow_at': 'selected',
              'center': 'False',
})


Out[2]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

SciPy 2015

Austin, TX, July 6 - 12

Conference structure

This conference is different. The audience is mostly people working in science, but it also includes many people working in tech and software development. It's a very diverse, friendly crowd.

  • 2 days of tutorials
  • 3 days of main conference: 3 simultaneous tracks of talks + poster session + lightning talks
    • plenty of social events each evening
    • a job fair
  • 2 days of sprints

Tutorials

I attended a variety of tutorials. The ones I enjoyed the most were:

  • Machine Learning with Scikit-Learn (2 sessions)
  • Building Python Data Applications with Bokeh and Blaze

Machine Learning with Scikit-Learn

The full tutorial materials can be found here: https://github.com/amueller/scipy_2015_sklearn_tutorial

This was a 2-session (8-hour) tutorial covering a host of topics:

Morning Session

  • What is machine learning?
  • Supervised learning
    • Training and test data
    • Classification
    • Regression
  • Unsupervised Learning
    • Unsupervised transformers
    • Preprocessing and scaling
    • Dimensionality reduction
    • Clustering

Application: Classification of digits


In [3]:
from sklearn.datasets import load_digits
digits = load_digits()
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))


For example, we can train a Gaussian Naive Bayes classifier and use it to identify the digits in a held-out test set.


In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import train_test_split

In [6]:
digits.data.shape


Out[6]:
(1797, 64)

In [7]:
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=5)

# train the model
clf = GaussianNB()
clf.fit(X_train, y_train)

# use the model to predict the labels of the test data
predicted = clf.predict(X_test)
expected = y_test

How did we do?


In [8]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')
    
    # label the image with the target value
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')



In [9]:
print(clf.score(X_test, y_test))


0.811111111111
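
A single accuracy number hides which digits get confused with which. As a quick follow-up (not part of the tutorial materials, just a sketch using scikit-learn's metrics module), a confusion matrix breaks the errors down per class:

from sklearn.metrics import confusion_matrix

# rows: true digit, columns: predicted digit; off-diagonal entries are errors
print(confusion_matrix(expected, predicted))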

Application: Eigenfaces


In [10]:
from sklearn import datasets
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                                       data_home='../tutorials/scipy_2015_sklearn_tutorial/notebooks/datasets/')
lfw_people.data.shape


Out[10]:
(1288, 1850)

In [11]:
fig = plt.figure(figsize=(14, 4))
# plot several images
for i in range(20):
    ax = fig.add_subplot(2, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)


We can perform PCA to extract features from the set of images:


In [12]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                        lfw_people.data, 
                                        lfw_people.target, 
                                        random_state=0)

print(X_train.shape, X_test.shape)


(966, 1850) (322, 1850)

In [13]:
from sklearn import decomposition
pca = decomposition.RandomizedPCA(n_components=150, whiten=True)
pca.fit(X_train)


Out[13]:
RandomizedPCA(copy=True, iterated_power=3, n_components=150,
       random_state=None, whiten=True)

And we can view the mean face:


In [14]:
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)


Out[14]:
<matplotlib.image.AxesImage at 0x7fb8a4ffd668>

As well as the eigenfaces, if we like:


In [15]:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)
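
Those 150 components act as a compressed representation: each face can be projected down to 150 numbers and approximately reconstructed from them. Here is a sketch of the round trip (transform and inverse_transform are standard scikit-learn PCA methods; this snippet is mine, not the tutorial's):

# project the test faces down to 150 components, then map back to pixel space
X_reduced = pca.transform(X_test)             # shape: (n_samples, 150)
X_approx = pca.inverse_transform(X_reduced)   # shape: (n_samples, 1850)

# compare an original face with its 150-component reconstruction
fig, axes = plt.subplots(1, 2, figsize=(4, 3))
axes[0].imshow(X_test[0].reshape((50, 37)), cmap=plt.cm.bone)
axes[1].imshow(X_approx[0].reshape((50, 37)), cmap=plt.cm.bone)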


And plenty more example activities:

  • Methods: Text feature extraction, bag of words (see the sketch after this list)
  • Application: SMS spam detection
  • Summary: Model building and generalization
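
For those unfamiliar with the bag-of-words idea: each document is mapped to a vector of word counts over a shared vocabulary, which is what lets text flow into ordinary classifiers. A minimal illustration with scikit-learn's CountVectorizer (my example, not the tutorial's):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize, claim your prize now", "meeting moved to noon"]
vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: documents x vocabulary
print(vec.vocabulary_)        # maps each word to its column index
print(X.toarray())            # word counts per document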

Afternoon Session

  • Cross-Validation
  • Model Complexity: Overfitting and underfitting
  • Complexity of various model types
  • Grid search for adjusting hyperparameters
  • Basic regression with cross-validation
  • Application: Titanic survival with Random Forest
  • Building Pipelines (sketched after this list)
    • Motivation and Basics
    • Preprocessing and Classification
    • Grid-searching parameters of the feature extraction
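
The pipeline idea in one sketch: chain the preprocessing and the classifier into a single estimator, then let a grid search tune hyperparameters across the whole chain. This snippet uses current import paths; in 2015-era scikit-learn these utilities lived in sklearn.grid_search and sklearn.cross_validation:

from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2015

digits = load_digits()

# a pipeline behaves like one estimator: fit() scales the data, then trains the SVM
pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])

# the step-name__parameter syntax lets the grid search reach inside the pipeline
grid = GridSearchCV(pipe, param_grid={'svm__C': [0.1, 1, 10]}, cv=5)
grid.fit(digits.data, digits.target)
print(grid.best_params_)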

And plenty of examples:

  • Application: Image classification

  • Model complexity, learning curves and validation curves

  • In-Depth supervised models
    • Linear Models
    • Kernel SVMs
    • Trees and Forests
  • Learning with Big Data
    • Out-of-core learning
    • The hashing trick for large text corpora (see the sketch below)
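
The hashing trick replaces the vocabulary dictionary with a hash function, so the feature space has a fixed size and no vocabulary has to be held in memory; combined with partial_fit, this is what makes out-of-core text learning work. A minimal sketch (mine, not the tutorial's) with scikit-learn:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# no vocabulary is stored: words are hashed straight into 2**18 columns
vec = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

# stream over batches that never all fit in memory at once
batches = [(["free prize now", "lunch at noon"], [1, 0]),
           (["claim your prize", "see you at the meeting"], [1, 0])]
for texts, labels in batches:
    X = vec.transform(texts)   # stateless: no fit() needed
    clf.partial_fit(X, labels, classes=[0, 1])

print(clf.predict(vec.transform(["win a free prize"])))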

Building Python Data Applications with Blaze and Bokeh

The full tutorial materials can be found here: https://github.com/chdoig/scipy2015-blaze-bokeh

This session mainly covered Bokeh's API, which makes it easy to build interactive plots that render in the browser.


In [16]:
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
output_notebook()

# Get data
df = pd.read_csv('../tutorials/scipy2015-blaze-bokeh/data/Land_Ocean_Monthly_Anomaly_Average.csv')

# Process data
df['datetime'] = pd.to_datetime(df['datetime'])
df.head()


BokehJS successfully loaded.
Out[16]:
datetime anomaly uncert
0 1850-01-01 -0.699 0.411
1 1850-02-01 -0.210 0.469
2 1850-03-01 -0.349 0.377
3 1850-04-01 -0.625 0.319
4 1850-05-01 -0.594 0.317

In [17]:
# Create plot
f = figure(plot_width=800, plot_height=400)
f.line(df['datetime'], df['anomaly'], color='skyblue', legend='Temp')
f.line(df['datetime'], pd.rolling_mean(df['anomaly'], window=10), color='grey', legend='Rolling mean')
f.title = 'Temperature anomaly with time'

# Show plot
show(f)


My favorite talks

There were a ton of great talks, and quite a few were relevant to the stuff we do. Here's a selection of the ones that stood out to me, with links.

Dask

Dask provides data structures that support complex out-of-core operations through an API similar to NumPy and pandas.
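
To make that concrete, here is roughly what the dask.array interface looks like (a minimal sketch; the API mirrors NumPy, but work is split into chunks and only evaluated on .compute()):

import dask.array as da

# a 20000x20000 array that is never materialized in full;
# it is made of 1000x1000 numpy chunks
x = da.ones((20000, 20000), chunks=(1000, 1000))

# builds a task graph instead of computing immediately
result = (x + x.T).mean(axis=0)

# executes the graph chunk by chunk, keeping memory bounded
print(result.compute()[:5])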


In [18]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=7, seconds=45).total_seconds())
YouTubeVideo("1kkFZ4P-XHg", start=start, autoplay=0, theme="light",
             color="blue", height=400, width=700)


Out[18]:

HDF5 is Eating the World

A talk by the author of h5py on how to use HDF5 effectively.
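
For flavor, this is the kind of pattern the talk advocates (a sketch using h5py; the chunking and compression options shown here are standard h5py features, not specifics from the talk):

import numpy as np
import h5py

data = np.random.rand(1000, 1000)

# chunked, compressed storage lets you read slices without loading everything
with h5py.File('example.h5', 'w') as f:
    dset = f.create_dataset('measurements', data=data,
                            chunks=(100, 100), compression='gzip')
    dset.attrs['units'] = 'volts'   # metadata travels with the data

with h5py.File('example.h5', 'r') as f:
    block = f['measurements'][:10, :10]   # reads only the chunks it needs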

Deep Learning: Tips from the Road

A talk by Kyle Kastner (one of the scikit-learn tutorial instructors) on deep learning, its uses, and its limitations. A great, short intro to neural networks.


In [19]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=0, seconds=40).total_seconds())
YouTubeVideo("TBBtOeY2Q78", start=start, autoplay=0, theme="light", 
             color="red", height=400, width=700)


Out[19]:

Agent Based Modeling in Python with Mesa

A talk by the authors of Mesa, a new package that fills a hole in the Python ecosystem: building agent-based simulations.
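
For a taste of what that looks like, here is a minimal agent-based model in the style of Mesa's introductory wealth-transfer example (written from memory against Mesa's classic API, so treat the details as approximate):

import random
from mesa import Agent, Model
from mesa.time import RandomActivation

class MoneyAgent(Agent):
    """An agent that gives one unit of wealth to a random other agent."""
    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)
        self.wealth = 1

    def step(self):
        if self.wealth > 0:
            other = random.choice(self.model.schedule.agents)
            other.wealth += 1
            self.wealth -= 1

class MoneyModel(Model):
    """A fixed population of agents, activated in random order each step."""
    def __init__(self, n_agents):
        super().__init__()
        self.schedule = RandomActivation(self)
        for i in range(n_agents):
            self.schedule.add(MoneyAgent(i, self))

    def step(self):
        self.schedule.step()

model = MoneyModel(50)
for _ in range(100):
    model.step()
print(max(agent.wealth for agent in model.schedule.agents))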

VisPy: Harnessing The GPU For Fast, High Level Visualization

A great talk by Luke Campagnola on VisPy's ability to render complex visualizations in real time using the GPU and OpenGL.


In [20]:
from datetime import timedelta
from IPython.display import YouTubeVideo
start = int(timedelta(minutes=11, seconds=43).total_seconds())
YouTubeVideo("_3YoaeoiIFI", start=start, autoplay=0, theme="light", 
             color="red", height=400, width=700)


Out[20]:

And many more!

Great keynotes!

There were three keynote talks, one given each day of the main conference. All three are worth watching.

Things I missed (but will watch on YouTube later)

Most important:

  • New contacts: I met a lot of people, and I'm already working with some of them on new projects.
  • New ideas: plenty of raw material to work with.
  • Inspiration! When you're excited about working on things, that's the first step to doing something cool.