In [1]:
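# configure the livereveal slideshow extension: slide width and height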
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})


Out[1]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

In [2]:
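# configure the slideshow theme, transition, and start behaviour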
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'linear',
              'start_slideshow_at': 'selected',
              'center': 'False',
})


Out[2]:
{'center': 'False',
 'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear',
 'width': 1024}

SciPy 2015

Austin, TX, July 6–12

Conference structure

This conference is different. The audience is mostly people working in science, but it also draws many people from tech and software development. It's a very diverse crowd, and a friendly one.

  • 2 days of tutorials
  • 3 days of main conference: 3 simultaneous tracks of talks + poster session + lightning talks
    • plenty of social events each evening
    • a job fair
  • 2 days of sprints

Tutorials

I attended a variety of tutorials. The ones I enjoyed the most were:

  • Machine Learning with Scikit-Learn (2 sessions)
  • Building Python Data Applications with Bokeh and Blaze

Machine Learning with Scikit-Learn

The full tutorial materials can be found here: https://github.com/amueller/scipy_2015_sklearn_tutorial

This was a 2-session (8-hour) tutorial; it covered a host of topics:

Morning Session

  • What is machine learning?
  • Supervised learning
    • Training and test data
    • Classification
    • Regression
  • Unsupervised learning
    • Unsupervised transformers
    • Preprocessing and scaling
    • Dimensionality reduction
    • Clustering (a short sketch follows this list)
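
A quick taste of the unsupervised topics, since the application below is supervised: a minimal sketch of my own (not from the tutorial materials) that scales the digits data and clusters it with k-means:

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

digits = load_digits()

# scale each of the 64 pixel features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(digits.data)

# ask k-means for 10 clusters, hoping they line up with the 10 digits
labels = KMeans(n_clusters=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # how many images fell into each cluster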

Application: Classification of digits


In [3]:
from sklearn.datasets import load_digits
digits = load_digits()
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))


We can train, for example, a Gaussian Naive Bayes classifier to identify digits from a test set.


In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import train_test_split

In [6]:
digits.data.shape


Out[6]:
(1797, 64)

In [7]:
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=5)

# train the model
clf = GaussianNB()
clf.fit(X_train, y_train)

# use the model to predict the labels of the test data
predicted = clf.predict(X_test)
expected = y_test

How did we do?


In [8]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(24, 48):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')
    
    # label the image with the target value
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')



In [9]:
print(clf.score(X_test, y_test))


0.811111111111
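
An overall accuracy near 81% doesn't show which digits get confused with which. A confusion matrix makes that visible; this is a small addition of mine, not one of the original cells:

from sklearn.metrics import confusion_matrix

# rows are the true digits, columns the predicted digits
print(confusion_matrix(expected, predicted))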

Application: Eigenfaces


In [10]:
from sklearn import datasets
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                                       data_home='../tutorials/scipy_2015_sklearn_tutorial/notebooks/datasets/')
lfw_people.data.shape


Out[10]:
(1288, 1850)

In [11]:
fig = plt.figure(figsize=(14, 4))
# plot several images
for i in range(20):
    ax = fig.add_subplot(2, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)


We can perform PCA to extract features from the set of images:


In [12]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                        lfw_people.data, 
                                        lfw_people.target, 
                                        random_state=0)

print(X_train.shape, X_test.shape)


(966, 1850) (322, 1850)

In [13]:
from sklearn import decomposition
pca = decomposition.RandomizedPCA(n_components=150, whiten=True)
pca.fit(X_train)


Out[13]:
RandomizedPCA(copy=True, iterated_power=3, n_components=150,
       random_state=None, whiten=True)

And we can view the mean face:


In [14]:
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)


Out[14]:
<matplotlib.image.AxesImage at 0x7fb8a4ffd668>

As well as the eigenfaces, if we like:


In [15]:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)
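
The PCA features can then feed a classifier. Here is a minimal sketch in the spirit of the tutorial's eigenfaces example (the SVM hyperparameters are illustrative, not tuned values from the tutorial):

from sklearn import svm

# project both splits onto the 150 principal components
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# train a kernel SVM on the compressed representation
clf = svm.SVC(C=5., gamma=0.001)
clf.fit(X_train_pca, y_train)
print(clf.score(X_test_pca, y_test))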


And plenty more example activities:

  • Methods: Text feature extraction, bag of words (a minimal sketch follows this list)
  • Application: SMS spam detection
  • Summary: Model building and generalization
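
The bag-of-words idea reduces each text to a vector of token counts. A minimal sketch of my own using scikit-learn's CountVectorizer (toy strings, not the tutorial's SMS dataset):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['free prize, call now!!',
        'are we still meeting for lunch?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of token counts

print(vectorizer.vocabulary_)  # maps each token to its column index
print(X.toarray())             # one count vector per document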

Afternoon Session

  • Cross-Validation
  • Model Complexity: Overfitting and underfitting
  • Complexity of various model types
  • Grid search for adjusting hyperparameters
  • Basic regression with cross-validation
  • Application: Titanic survival with Random Forest
  • Building Pipelines (a minimal sketch follows the examples list below)
    • Motivation and basics
    • Preprocessing and classification
    • Grid-searching parameters of the feature extraction

And plenty of examples:

  • Application: Image classification

  • Model complexity, learning curves and validation curves

  • In-Depth supervised models
    • Linear Models
    • Kernel SVMs
    • Trees and Forests
  • Learning with Big Data
    • Out-of-core learning
    • The hashing trick for large text corpora
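
To make the pipeline and grid-search material concrete, here is a minimal sketch of my own (using the 2015-era sklearn.cross_validation and sklearn.grid_search modules, as elsewhere in these notes) that chains scaling with an SVM and grid-searches the SVM's C:

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# chain preprocessing and classification into a single estimator
pipe = make_pipeline(StandardScaler(), SVC())

# grid-search a parameter of the final step; pipeline parameters
# are named '<stepname>__<param>', hence 'svc__C'
grid = GridSearchCV(pipe, param_grid={'svc__C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))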

Building Python Data Applications with Blaze and Bokeh

The full tutorial materials can be found here: https://github.com/chdoig/scipy2015-blaze-bokeh

This session mainly covered Bokeh's API, which makes it easy to build interactive plots that render in the browser.


In [16]:
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
output_notebook()

# Get data
df = pd.read_csv('../tutorials/scipy2015-blaze-bokeh/data/Land_Ocean_Monthly_Anomaly_Average.csv')

# Process data
df['datetime'] = pd.to_datetime(df['datetime'])
df.head()
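
With the data in hand, a figure is only a few calls away. A minimal sketch of my own (2015-era Bokeh plot_width/plot_height arguments; the 'anomaly' column name is an assumption about this CSV):

# plot the monthly temperature anomaly over time
# (assumes the CSV has an 'anomaly' column alongside 'datetime')
p = figure(x_axis_type='datetime', plot_width=800, plot_height=300,
           title='Land-ocean monthly temperature anomaly')
p.line(df['datetime'], df['anomaly'], line_width=2)
show(p)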