Part 2 - 2 Point Stats

Overview

  • Continuous and categorical microstructure function

  • Artificial microstructures

  • 2 point stats

  • Dimensionality reduction

  • Dask

  • Classify microstructure


In [31]:
import numpy as np
from pymks.bases import PrimitiveBasis

In [32]:
?PrimitiveBasis

In [33]:
data_cat = np.array([[1, 2, 0]])

Use the primitive basis or microstructure function to discretise the data. This is the most straightforward basis to understand.

$ m_k[h; s] = \delta[h - \phi_k[s]] $

$ \delta [x] = \max \left( 1 - \left|\frac{x}{\Delta h}\right|, 0 \right) $


In [34]:
basis = PrimitiveBasis(n_states=3, domain=[0, 2])
data_disc = basis.discretize(data_cat)

Discretizing the data adds a new index. The last index has the same length as n_states.


In [35]:
print(data_cat.shape)
print(data_disc.shape)


(1, 3)
(1, 3, 3)

Categorical data now has digital representation.


In [36]:
print(data_cat[0, 0])
print(data_disc[0, 0, :])


1
[ 0.  1.  0.]

Try this with continuous data.


In [37]:
data_con = np.array([[0.1, 0.5, 0.3]])

In [38]:
basis = PrimitiveBasis(n_states=5, domain=[0, 1])
data_disc = basis.discretize(data_con)

In [39]:
print(data_con.shape)
print(data_disc.shape)


(1, 3)
(1, 3, 5)

The continuous data varies between 0 and 1. This is a first order discretized $\delta$ function.


In [40]:
print(data_con[0, 0])
print(data_disc[0, 0])


0.1
[ 0.6  0.4  0.   0.   0. ]

Using a Legendre basis to discretize data. Legendre polynomials are good for this discretization because they are orthogonal with respect to the L2-norm


In [41]:
data_leg = np.array([[-0.5, 0.1, 0.9]])

In [42]:
from pymks.bases import LegendreBasis
basis = LegendreBasis(n_states=4, domain=[-1, 1])

In [43]:
data_disc = basis.discretize(data_leg)
print(data_disc.shape)


(1, 3, 4)

In [44]:
print(data_disc)


[[[ 0.5     -0.75    -0.3125   1.53125]
  [ 0.5      0.15    -1.2125  -0.51625]
  [ 0.5      1.35     1.7875   1.65375]]]

We can see how these are constructed using Scipy's eval_legendre.


In [45]:
from scipy.special import eval_legendre

In [46]:
eval_legendre(np.arange(4), -0.5) * (np.arange(4) + 0.5)


Out[46]:
array([ 0.5    , -0.75   , -0.3125 ,  1.53125])

Construct Artificial Microstructures


In [47]:
from pymks.datasets import make_microstructure

In [48]:
?make_microstructure

Use a Fourier Gaussian to create artificial microstructures. Fourier Gaussian comes from numpy.ndimage.


In [49]:
data = make_microstructure(grain_size=(20, 20))

By default 10 microstructures are generated with make_microstructures.


In [50]:
print(data.shape)


(10, 101, 101)

View the results using Matplotlib's imshow


In [51]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['image.cmap'] = 'Paired'

plt.imshow(data[0]);


Fiber Microstructure

Construct a different microstructure using a fiber structure.


In [52]:
data_fiber = make_microstructure(grain_size=(95, 15))

View the results using imshow


In [53]:
plt.imshow(data_fiber[0])
plt.show()
plt.imshow(data_fiber[1])
plt.show()


2 Point Stats

Construct 150 data sets with different structures


In [54]:
np.random.seed(0)

data_blob = make_microstructure(n_samples=50, grain_size=(30, 30))
data_fiber_v = make_microstructure(n_samples=50, grain_size=(100, 10))
data_fiber_h = make_microstructure(n_samples=50, grain_size=(10, 100))

In [55]:
plt.imshow(data_blob[2]);



In [56]:
plt.imshow(data_fiber_v[1]);



In [57]:
plt.imshow(data_fiber_h[0]);



In [58]:
data_fiber_v.shape


Out[58]:
(50, 101, 101)

Concatenate our data into a single array.


In [59]:
data = np.concatenate([data_blob, data_fiber_v, data_fiber_h], axis=0)

In [60]:
data.shape


Out[60]:
(150, 101, 101)

View all the data using subplots with Matplotlib


In [61]:
fig = plt.figure(figsize=(12, 12))
for i in range(12):
    for j in range(12):
        count = i * 12 + j
        ax = fig.add_subplot(12, 12, count + 1)
        ax.set_aspect('equal')
        img = plt.imshow(data[count])
        plt.axis('off')
        img.set_cmap('Paired')


2 Point Stats

Use convolutions to construct the 2 point stats across each state space pairing. This is quite fast in Fourier space.


In [62]:
from pymks.stats import correlate

basis = PrimitiveBasis(n_states=2, domain=[0, 1])
data_corr = correlate(data, basis=basis)

In [63]:
data_corr.shape


Out[63]:
(150, 101, 101, 3)

The two point stats reflect the shape of the structures


In [64]:
fig, axs = plt.subplots(1, 3)

for i, ax in enumerate(axs):
    im = ax.imshow(data_corr[0, :, :, i], vmin=0.0, vmax=1.0, cmap="viridis")
fig.set_size_inches(15, 5)
fig.colorbar(im, ax=axs.ravel().tolist());



In [65]:
print(data_corr[0, :, :, 0] + data_corr[0, :, :, 1] + data_corr[0, :, :, 2] + data_corr[0, ::-1, ::-1, 2])


[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

In [66]:
fig, axs = plt.subplots(1, 3)

for i, ax in enumerate(axs):
    im = ax.imshow(data_corr[50, :, :, i], vmin=0.0, vmax=1.0, cmap="viridis")
fig.set_size_inches(15, 5)
fig.colorbar(im, ax=axs.ravel().tolist());



In [67]:
fig, axs = plt.subplots(1, 3)

for i, ax in enumerate(axs):
    im = ax.imshow(data_corr[-1, :, :, i], vmin=0.0, vmax=1.0, cmap="viridis")
fig.set_size_inches(15, 5)
fig.colorbar(im, ax=axs.ravel().tolist());



In [68]:
data_corr.shape


Out[68]:
(150, 101, 101, 3)

Dimensionality Reduction

We've created a huge number of features, over 30,000 per sample. We need to reduce those down to only a few features to do classification using LogisticRegression.


In [69]:
data_reshape = np.reshape(data_corr, (data_corr.shape[0], data_corr[0].size))

In [70]:
print(data_reshape.shape)


(150, 30603)

In [71]:
data_mean = data_reshape - np.mean(data_reshape, axis=1)[:, None]

In [72]:
from sklearn.decomposition import PCA

pca_model = PCA(n_components=5)

In [73]:
data_pca = pca_model.fit_transform(data_mean)

In [74]:
data_pca.shape


Out[74]:
(150, 5)

In [75]:
# NBVAL_IGNORE_OUTPUT

colors = ['r'] * 50 + ['b'] * 50 + ['g'] * 50

fig, axss = plt.subplots(5, 5)

for i, axs in enumerate(axss):
    for j, ax in enumerate(axs):
        ax.scatter(data_pca[:, i], data_pca[:, j], color=colors)
fig.set_size_inches(12, 12)
fig.show()


/home/wd15/miniconda/lib/python3.6/site-packages/matplotlib/figure.py:403: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "

In [76]:
classification = np.concatenate([np.zeros(50), np.ones(50), np.ones(50) + 1]).astype(int)

In [77]:
# NBVAL_SKIP

from bqplot import pyplot as bq_plt

bq_plt.figure(title="PCA2 v PCA3")
bq_plt.scatter(data_pca[:, 2], data_pca[:, 3], color=classification)
bq_plt.show()


Using Dask

Speed it up using Dask. This is an embarissingly parallel operation that is easy to paralleliz


In [78]:
# NBVAL_IGNORE_OUTPUT

%timeit correlate(data, basis=basis)


7.78 s ± 176 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [351]:
print(data.shape)


(150, 101, 101)

In [352]:
data_corr.shape


Out[352]:
(150, 101, 101, 3)

In [353]:
import dask.array as da

data_da = da.from_array(data, chunks=(5, 101, 101))

print(data_da.shape)
print(data_da.chunks)


(150, 101, 101)
((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (101,), (101,))

In [354]:
# NBVAL_IGNORE_OUTPUT

import dask.threaded
import dask.multiprocessing

def correlate_(data_):
    basis = PrimitiveBasis(n_states=2, domain=[0, 1])
    return correlate(data_, basis=basis)

data_da = da.from_array(data, chunks=(5, 101, 101))

map_ = da.map_blocks(correlate_, data_da, new_axis=1)

%timeit map_.compute(num_workers=2, get=dask.multiprocessing.get)


5.12 s ± 86.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [355]:
# NBVAL_IGNORE_OUTPUT

%timeit map_.compute(num_workers=4, get=dask.multiprocessing.get)


4.55 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [356]:
# NBVAL_IGNORE_OUTPUT

%timeit map_.compute(get=dask.threaded.get)


4.02 s ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Classification


In [357]:
from sklearn.linear_model import LogisticRegression

In [358]:
model = LogisticRegression(C=1)
model.fit(data_pca, classification)


Out[358]:
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [359]:
classification_pred = model.predict(data_pca)

In [360]:
# NBVAL_IGNORE_OUTPUT

from sklearn.metrics import confusion_matrix

confusion_matrix(classification, classification_pred)


Out[360]:
array([[42,  4,  4],
       [ 2, 48,  0],
       [ 0,  0, 50]])

Train Test Split


In [361]:
X, y = np.reshape(data, (data.shape[0], data[0].size)), classification

In [362]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [363]:
print(X_train.shape)


(112, 10201)

In [364]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=2)
train_index, test_index = next(skf.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

In [365]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def transform(X):
    X_ = np.reshape(X, (X.shape[0],) + data[0].shape)
    basis = PrimitiveBasis(n_states=2, domain=[0, 1])
    X_ = correlate(X, basis=basis)
    X_ = np.reshape(X_, (X_.shape[0], X_[0].size))
    return X_ - np.mean(X_, axis=1)[:, None]
    
transformer = FunctionTransformer(transform)

model = Pipeline([('transformer', transformer),
                  ('pca', PCA(n_components=5)),
                  ('classifier', LogisticRegression())])

In [366]:
# NBVAL_IGNORE_OUTPUT

model.fit(X_train, y_train)


Out[366]:
Pipeline(steps=[('transformer', FunctionTransformer(accept_sparse=False,
          func=<function transform at 0x7fc2b30dbd08>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y=False, validate=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_so...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [367]:
y_pred = model.predict(X_test)

In [368]:
# NBVAL_IGNORE_OUTPUT

mat = confusion_matrix(y_test, y_pred)
print(mat)


[[13  7  5]
 [ 0 25  0]
 [ 0  0 25]]

Cross-validate

cross_val_score uses StratifiedKFold for labeled data


In [369]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores)


[ 0.9         0.93333333  0.86666667  0.96666667  0.9       ]