Exploring Data Clustering

Approaches to explore: DBSCAN, HDBSCAN, and PCA, each applied below to the SAXS and P(r) curves.

Notes on DBSCAN:

There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built recursively by taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.

Any core sample is part of a cluster, by definition. Further, any cluster has at least min_samples points in it, following the definition of a core sample. Any sample that is not a core sample and is more than eps away from every core sample is considered an outlier by the algorithm.
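
As a concrete illustration of these definitions, here is a minimal sketch using scikit-learn's DBSCAN on a hypothetical toy array (not part of the analysis below): core samples are listed in core_sample_indices_ and noise points receive the label -1.


In [ ]:
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical toy data: a tight group of four 1-D points plus two isolated points
toy = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [9.0]])

toy_db = DBSCAN(eps=0.5, min_samples=3).fit(toy)
print(toy_db.core_sample_indices_)  # [0 1 2 3] -> the dense group are core samples
print(toy_db.labels_)               # [ 0  0  0  0 -1 -1] -> the isolated points are noise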

Consider instead using HDBSCAN, which selects the density threshold per cluster and does not require tuning eps.
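
HDBSCAN is tried near the end of this notebook; as a minimal sketch of its basic API (the array X here is a hypothetical placeholder), its main parameter is min_cluster_size rather than eps:


In [ ]:
import numpy as np
import hdbscan

X = np.random.rand(100, 5)  # hypothetical (n_samples, n_features) array

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)  # no eps to tune
cluster_labels = clusterer.fit_predict(X)        # noise points are labeled -1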


In [1]:
import glob
import os
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

In [2]:
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
output_notebook(resources=INLINE)



In [3]:
import bokeh.plotting
import bokeh.models   # Legend, Range1d
import bokeh.layouts # gridplot
from bokeh.palettes import Dark2_7 as palette

In [4]:
import datashader
import collections
import xarray
from datashader.bokeh_ext import InteractiveImage

Explore Euclidean distance versus fractional difference

Euclidean Distance: $$ d_{ab} = \sqrt{\sum^N_{i=1}(a_i-b_i)^2}$$

Fractional difference used in SasCalc: $$ pd_{ab} = \frac{1}{N}\sum_{i=1}^N{\frac{\left|a_i-b_i\right|}{b_i}}$$

Modified fractional difference: $$ pd_{ab} = \frac{1}{N}\sum_{i=1}^N{\frac{\left|a_i-b_i\right|}{(b_i + a_i)/2}}$$
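
As a quick check of the modified form for a single pair of values, $a=2$, $b=3$: $\left|2-3\right| / \left((2+3)/2\right) = 1/2.5 = 0.4$, which matches the calc_pd output further below.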


In [59]:
n = 100
a = 20 * np.ones(n)
b = np.linspace(1, n, n)
d = np.sqrt(np.square(a-b))  # point-wise Euclidean distance

pd_ab = np.abs(a-b)/b        # fractional difference relative to b
pd_ba = np.abs(a-b)/a        # fractional difference relative to a
mean_ab = (a + b) / 2
pd = np.abs(a-b)/mean_ab     # modified fractional difference

In [63]:
p = bokeh.plotting.figure(height=400)

p.line(b, d, color=palette[0], legend='Euclidean Distance', line_width=3)
p.line(b, pd_ab, color=palette[1], legend='Fractional Difference, AB', line_width=2)
p.line(b, pd_ba, color=palette[2], legend='Fractional Difference, BA', line_width=2)
p.line(b, pd, color=palette[3], legend='Modified Fractional Difference', line_width=2)

bokeh.plotting.show(p)



In [50]:
def calc_pd(a, b):
    mean_ab = (a + b) / 2.0
    pd = np.abs(a-b) / mean_ab
    return pd
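
If this fractional difference were used as the clustering metric instead of the Euclidean distance, one option (a sketch only; the analysis below sticks with Euclidean distances, and the eps value and curves array here are hypothetical placeholders) is to pass a callable to scipy's pdist and hand DBSCAN a precomputed distance matrix:


In [ ]:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

def pd_metric(u, v):
    # scalar modified fractional difference between two curves
    return np.mean(np.abs(u - v) / ((u + v) / 2.0))

# curves is a hypothetical (n_samples, n_features) array of I(Q) values
curves = np.random.rand(50, 18) + 0.1

pd_dist = squareform(pdist(curves, metric=pd_metric))
pd_labels = DBSCAN(eps=0.05, min_samples=2, metric='precomputed').fit_predict(pd_dist)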

In [55]:
calc_pd(2, 3)


Out[55]:
0.40000000000000002

In [56]:
calc_pd(2, 1)


Out[56]:
0.66666666666666663

Get the data

Load the SAS data

Begin by designating where the theoretical SAXS and P(r) data files are located.


In [16]:
import sys
sys.platform


Out[16]:
'linux2'

In [17]:
if 'linux' in sys.platform:
    saxs_dir = '/home/schowell/data/scratch/docking/xray'
    pr_dir = '/home/schowell/data/scratch/docking/pr'

elif sys.platform == 'darwin':
    saxs_dir = '/Users/schowell/scratch/docking/sascalc'
    pr_dir = '/Users/schowell/scratch/docking/pr'

sas_ext = '*.iq'
saxs_search = os.path.join(saxs_dir, sas_ext)
print(saxs_search)

pr_ext = '*.pr'
pr_search = os.path.join(pr_dir, pr_ext)
print(pr_search)


/home/schowell/data/scratch/docking/xray/*.iq
/home/schowell/data/scratch/docking/pr/*.pr

In [18]:
saxs_files = glob.glob(saxs_search)
pr_files = glob.glob(pr_search)
saxs_files.sort()
pr_files.sort()
n_saxs = len(saxs_files)
n_pr = len(pr_files)
print('Found {} SAXS data files'.format(n_saxs))
print('Found {} P(r) data files'.format(n_pr))


Found 60000 SAXS data files
Found 60000 P(r) data files

In [27]:
n = 1000

In [19]:
saxs_data = []

# load in the first data set to setup the q-mask
first_data = np.loadtxt(saxs_files[0])
q_mask = first_data[:, 0] <= 0.18  # only use data up to 0.18 1/A
first_data[:, 1] /= first_data[0, 1]  # normalize I(0), to 1
first_data = first_data[q_mask]
saxs_data.append(first_data[1:, 1])  # do not use I(0), it is the same for every dataset

# load in the rest of the data
for saxs_file in saxs_files[1:]:
    x_data = np.loadtxt(saxs_file)
    x_data[:, 1] /= x_data[0, 1]
    x_data = x_data[q_mask]
    assert np.allclose(x_data[0, 1], first_data[0, 1]), 'ERROR: data not normalized to I(0)'
    assert np.allclose(x_data[:, 0], first_data[:, 0]), 'ERROR: data not on same Q-grid'
    saxs_data.append(x_data[1:, 1])

saxs_data = np.array(saxs_data)

Store the Q-values


In [20]:
q_saxs = x_data[1:, 0]
print(q_saxs)


[ 0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11  0.12]

Load the P(r) data


In [21]:
pr_data_l = []
n_pr = len(pr_files)
n_r = np.empty(n_pr, dtype=int)

# load in all the data sets
for i, pr_file in enumerate(pr_files):
    n_data = np.loadtxt(pr_file, delimiter=',', dtype=int)
    pr_data_l.append(n_data[:, 1])
    n_r[i] = len(n_data)
    
r_max = n_r.max()
pr_data = np.zeros([n_pr, r_max], dtype=int)
for i, n_data in enumerate(pr_data_l):
    pr_data[i, :len(n_data)] = n_data
    
#pr_data = np.array(sans_data)

In [22]:
pr_data[:3, :8]


Out[22]:
array([[     0,  17675,  58008, 109234, 175999, 294267, 399870, 516175],
       [     0,  17666,  57908, 108736, 174919, 292554, 397581, 513548],
       [     0,  17663,  57873, 108699, 174595, 291759, 395975, 510582]])

Visualize the data

Create a dataframe for plotting purposes (each column will be a different I(Q) or P(r) curve)


In [30]:
n = 1000

In [28]:
saxs_df = pd.DataFrame(data=saxs_data[:n].T)
saxs_cols = ['c{}'.format(c) for c in saxs_df.columns]
saxs_df.columns = saxs_cols
saxs_df['q'] = q_saxs
saxs_df.head()


Out[28]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c991 c992 c993 c994 c995 c996 c997 c998 c999 q
0 0.937829 0.946701 0.934337 0.945253 0.946504 0.939226 0.940876 0.944312 0.948206 0.934415 ... 0.948131 0.934538 0.936238 0.938994 0.937654 0.946152 0.936084 0.938800 0.939500 0.01
1 0.775493 0.803533 0.765157 0.798308 0.802600 0.779377 0.783707 0.795408 0.808692 0.766193 ... 0.808336 0.764601 0.769203 0.778593 0.774023 0.801441 0.770877 0.778405 0.781446 0.02
2 0.571241 0.612756 0.557187 0.602897 0.610310 0.575323 0.578949 0.598959 0.621673 0.560933 ... 0.620719 0.553221 0.557918 0.573911 0.566143 0.608457 0.566521 0.574939 0.581969 0.03
3 0.385065 0.423077 0.372213 0.409542 0.418441 0.385450 0.383139 0.407020 0.433883 0.379456 ... 0.431974 0.363800 0.364644 0.383351 0.374407 0.416492 0.383075 0.386843 0.398251 0.04
4 0.251004 0.270319 0.240227 0.255502 0.263468 0.245648 0.235122 0.256795 0.280355 0.249179 ... 0.277272 0.230541 0.226645 0.242626 0.235429 0.262096 0.250377 0.248459 0.262434 0.05

5 rows × 1001 columns


In [29]:
pr_df = pd.DataFrame(data=pr_data[:n].T)
pr_cols = ['c{}'.format(c) for c in pr_df.columns]
pr_df.columns = pr_cols
r = np.arange(r_max)
pr_df['r'] = r
pr_df.head()


Out[29]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c991 c992 c993 c994 c995 c996 c997 c998 c999 r
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 17675 17666 17663 17663 17663 17661 17667 17667 17664 17663 ... 17663 17662 17660 17668 17665 17662 17659 17666 17673 1
2 58008 57908 57873 57934 57851 57887 57891 57874 57927 57881 ... 57891 57913 57840 57873 57877 57902 57880 57892 57944 2
3 109234 108736 108699 108908 108546 108838 108698 108689 108945 108768 ... 108880 108717 108602 108749 108707 108786 108792 108786 108921 3
4 175999 174919 174595 175255 174379 174754 174749 174991 175202 174674 ... 175249 174767 174179 174607 174689 174941 174949 175042 175356 4

5 rows × 1001 columns


In [31]:
q_range = q_saxs.min(), q_saxs.max()
pq_range = saxs_data.min(), saxs_data.max()

In [32]:
r_range = 0, r_max
pr_range = pr_data.min(), pr_data.max()

In [ ]:
%%time
saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')

In [ ]:
%%time
pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')

In [33]:
def saxs_image(q_range, pq_range, w, h):
    saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
    saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
    saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')
    return saxs_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, saxs_image)


Out[33]:

In [34]:
def pr_image(r_range, pr_range, w, h):
    pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, 
                                  plot_height=h, plot_width=w)
    pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
    pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
    pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')
    return pr_img

p = bokeh.plotting.figure(x_range=r_range, y_range=pr_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='r (A)', y_axis_label='P(r)')
InteractiveImage(p, pr_image)


Out[34]:

In [35]:
n = 10000

In [36]:
saxs_df = pd.DataFrame(data=saxs_data[:n].T)
saxs_cols = ['c{}'.format(c) for c in saxs_df.columns]
saxs_df.columns = saxs_cols
saxs_df['q'] = q_saxs
saxs_df.head()


Out[36]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c9991 c9992 c9993 c9994 c9995 c9996 c9997 c9998 c9999 q
0 0.937829 0.946701 0.934337 0.945253 0.946504 0.939226 0.940876 0.944312 0.948206 0.934415 ... 0.940958 0.939116 0.940864 0.937079 0.944755 0.936625 0.943202 0.946597 0.940724 0.01
1 0.775493 0.803533 0.765157 0.798308 0.802600 0.779377 0.783707 0.795408 0.808692 0.766193 ... 0.784456 0.778668 0.783708 0.773757 0.796629 0.770141 0.791879 0.802867 0.784213 0.02
2 0.571241 0.612756 0.557187 0.602897 0.610310 0.575323 0.578949 0.598959 0.621673 0.560933 ... 0.581632 0.573073 0.579025 0.570180 0.600095 0.558478 0.593637 0.610606 0.582745 0.03
3 0.385065 0.423077 0.372213 0.409542 0.418441 0.385450 0.383139 0.407020 0.433883 0.379456 ... 0.388851 0.380772 0.383195 0.385816 0.406344 0.363673 0.401851 0.418459 0.392598 0.04
4 0.251004 0.270319 0.240227 0.255502 0.263468 0.245648 0.235122 0.256795 0.280355 0.249179 ... 0.243540 0.238525 0.234777 0.251849 0.252838 0.224074 0.253418 0.262958 0.249527 0.05

5 rows × 10001 columns


In [37]:
pr_df = pd.DataFrame(data=pr_data[:n].T)
pr_cols = ['c{}'.format(c) for c in pr_df.columns]
pr_df.columns = pr_cols
r = np.arange(r_max)
pr_df['r'] = r
pr_df.head()


Out[37]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c9991 c9992 c9993 c9994 c9995 c9996 c9997 c9998 c9999 r
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 17675 17666 17663 17663 17663 17661 17667 17667 17664 17663 ... 17659 17666 17660 17666 17666 17669 17680 17667 17658 1
2 58008 57908 57873 57934 57851 57887 57891 57874 57927 57881 ... 57881 57894 57878 57895 57860 57924 58019 57878 57887 2
3 109234 108736 108699 108908 108546 108838 108698 108689 108945 108768 ... 108668 108707 108663 108779 108630 108610 108938 108694 108819 3
4 175999 174919 174595 175255 174379 174754 174749 174991 175202 174674 ... 174662 174934 174597 174950 174491 174623 175446 174665 174759 4

5 rows × 10001 columns


In [ ]:
def saxs_image(q_range, pq_range, w, h):
    saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
    saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
    saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')
    return saxs_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, saxs_image)


Out[ ]:

In [ ]:
def pr_image(r_range, pr_range, w, h):
    pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, 
                                  plot_height=h, plot_width=w)
    pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
    pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
    pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')
    return pr_img

p = bokeh.plotting.figure(x_range=r_range, y_range=pr_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='r (A)', y_axis_label='P(r)')
InteractiveImage(p, pr_image)

In [ ]:
saxs_img

In [ ]:
pr_img

Manipulate the data


In [ ]:
n_samples, n_features = saxs_data.shape # for PCA, should be (n_samples, n_features)
print('# samples: {}\n# features: {}'.format(n_samples, n_features))

Each row in the NumPy array is a separate scattering pattern or pair-distance distribution curve. Each column will be a feature we explore.


In [ ]:
print(saxs_data[:3, :5])
print(pr_data[:3, :5])

In [ ]:
min_vals = saxs_data.min(axis=0)
max_vals = saxs_data.max(axis=0)
saxs_range = max_vals - min_vals
print(saxs_range)

Notice how the variation depends significantly on Q.


In [ ]:
min_vals = pr_data.min(axis=0)
max_vals = pr_data.max(axis=0)
pr_range = max_vals - min_vals
print(pr_range)

In [ ]:
range_pq = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Data Range')
range_pr = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Data Range')

range_pq.line(q_saxs, saxs_range, legend='SAXS range', color=palette[0])

range_pr.line(r, pr_range, legend='P(r) range', color=palette[3])
range_pr.legend.location = "bottom_center"

plots = bokeh.layouts.gridplot([[range_pq, range_pr]])
show(plots)

Rescale the data

Originally used StandardScaler but changed to RobustScaler to avoid complications from outliers: RobustScaler centers on the median and scales by the interquartile range, so outliers do not skew the scaling the way they skew the mean and standard deviation used by StandardScaler.
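
A minimal sketch of the difference on hypothetical 1-D data with one large outlier (not part of the analysis): StandardScaler squeezes the ordinary points together because the outlier inflates the mean and standard deviation, while RobustScaler leaves them spread over roughly [-1, 1].


In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# hypothetical data: nine ordinary values plus one large outlier
toy = np.array([[1.0], [1.1], [0.9], [1.2], [0.8], [1.0], [1.1], [0.9], [1.0], [50.0]])

print(StandardScaler().fit_transform(toy).ravel().round(2))  # bulk compressed near -0.3
print(RobustScaler().fit_transform(toy).ravel().round(2))    # bulk spread over about [-1, 1]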


In [ ]:
from sklearn.preprocessing import StandardScaler, RobustScaler
saxs_scaler = RobustScaler()
sans_scaler = RobustScaler()
pr_scaler = RobustScaler()

In [ ]:
saxs_scaler.fit(saxs_data)
pr_scaler.fit(pr_data)

In [ ]:
scaled_saxs = saxs_scaler.transform(saxs_data)
scaled_pr = pr_scaler.transform(pr_data)
print(scaled_saxs[:3, :5])
print(scaled_pr[:3, :5])

In [ ]:
min_vals = scaled_saxs.min(axis=0)
max_vals = scaled_saxs.max(axis=0)
saxs_scaled_range = max_vals - min_vals

min_vals = scaled_pr.min(axis=0)
max_vals = scaled_pr.max(axis=0)
pr_scaled_range = max_vals - min_vals

print(saxs_scaled_range)
print(pr_scaled_range)

Notice how rescaling the data using the RobustScaler has removed the $Q$ or $r$ dependence from the ranges.


In [ ]:
range_pq0 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Range Before')
range_pr0 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Comparison')
range_pq1 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Comparison')
range_pr1 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Range After')

range_pq0.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])
range_pq1.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])

range_pq1.line(q_saxs, saxs_scaled_range, legend='SAXS after', color=palette[0], line_dash= [4, 2])

range_pr0.line(r, pr_range, legend='before', color=palette[3])
range_pr0.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])
range_pr1.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])

range_pr0.legend.location = "bottom_center"
range_pr1.legend.location = "top_left"

plots = bokeh.layouts.gridplot([[range_pq0, range_pr0],
                                [range_pq1, range_pr1]])
show(plots)

Note that for P(r), the last 5 points in almost every data set are zero, so the range did not change at all, as shown below.


In [ ]:
pr_data[:, -5:]

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

In [ ]:
def compare_q1_q2(q1, q2):
    plt.figure()
    plt.plot(saxs_data[:,q1], saxs_data[:,q2], 'bo', markersize=5)
    plt.title('original data')
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.figure()
    plt.plot(scaled_saxs[:,q1], scaled_saxs[:,q2], 'bo', markersize=5)
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.title('scaled data')

In [ ]:
compare_q1_q2(0, 8)

In [ ]:
plt.figure()
plt.plot(saxs_data[:,0], saxs_data[:,15], 'bo', markersize=5)
plt.title('original data')
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(15))
plt.figure()
plt.plot(scaled_saxs[:,0], scaled_saxs[:,17], 'bo', markersize=5)
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(17))
plt.title('scaled data')

In [ ]:
scaled_saxs.shape

In [ ]:
i0 = 2
i_compare = 0
for i0 in range(18):
    plt.figure()
    plt.plot(scaled_saxs[:,i0], scaled_saxs[:, i_compare], 'bo')
    plt.plot(scaled_saxs[112,i0], scaled_saxs[112, i_compare], 'rs')
    plt.plot(scaled_saxs[113,i0], scaled_saxs[113, i_compare], 'gs')
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))

DBSCAN


In [ ]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

We need to determine what distance, eps, to use as the neighborhood radius that defines core points. To choose it we will use the $k$-dist plot, the sorted distance from each sample to its $k^{th}$ nearest neighbor (a NearestNeighbors-based sketch follows this list):

  1. Plot the sorted $k$-dist
  2. Find the point of greatest curvature, i.e. the largest 2$^{nd}$ derivative
  3. Use that distance as eps
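
As an alternative to building the full distance matrix by hand (done below), here is a sketch of the same curve using scikit-learn's NearestNeighbors; the choice k = 2 is an assumption made to match min_samples = 2 used later:


In [ ]:
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 2  # assumed to match min_samples below; column 0 of the distances is the point itself
nn = NearestNeighbors(n_neighbors=k).fit(scaled_saxs)
nn_dist, _ = nn.kneighbors(scaled_saxs)
k_dist_sorted = np.sort(nn_dist[:, -1])  # sorted distance to the nearest other sample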

In [ ]:
from scipy.spatial.distance import pdist

In [ ]:
dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
dist[i[:, None] < i] = pdist(scaled_saxs)  # get the distances, consider other distance metrics

i_lower = np.tril_indices(n_samples, -1)
dist[i_lower] = dist.T[i_lower]  # mirror the upper triangle to make the matrix symmetric

np.fill_diagonal(dist, np.inf)  # fill the diagonal with large value for simplicity

In [ ]:
k_dist = dist.min(axis=1)  # distance from each sample to its nearest neighbor
k_dist.sort()

In [ ]:
k_dist

In [ ]:
saxs_data

In [ ]:
raw_dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
raw_dist[i[:, None] < i] = pdist(saxs_data)  # get the distances, consider other distance metrics

i_lower = np.tril_indices(n_samples, -1)
raw_dist[i_lower] = raw_dist.T[i_lower]  # mirror the upper triangle to make the matrix symmetric

np.fill_diagonal(raw_dist, np.inf)  # fill the diagonal with large value for simplicity

In [ ]:
raw_k_dist = raw_dist.min(axis=1)
raw_k_dist.sort()

In [ ]:
raw_k_dist

In [ ]:
k_plot = bokeh.plotting.figure(y_axis_label='distance')

k_plot.line(i, k_dist/k_dist.max(), color=palette[-1], legend='scaled')
k_plot.line(i, raw_k_dist/raw_k_dist.max(), color=palette[0], legend='raw')

'''
n = 16
exponents = np.polyfit(i, raw_k_dist/raw_k_dist.max(), n)
fit = np.poly1d(exponents)
y = fit(i)
k_plot.line(i, y, color=palette[1], legend='fit ({})'.format(n))

d1_norm = smooth_diff_n5(i, raw_k_dist/raw_k_dist.max())
k_plot.line(d1_norm[:, 0], d1_norm[:, 1], color=palette[2], legend='d1')

d1_fit = smooth_diff_n5(i, y)
k_plot.line(d1_fit[:, 0], d1_fit[:, 1], color=palette[3], legend='d1 fit')

d2 = smooth_diff_n5(d1_norm[:, 0], d1_norm[:, 1])
k_plot.line(d2[:, 0], d2[:, 1], color=palette[4], legend='d2')

d2_fit = smooth_diff_n5(d1_fit[:, 0], d1_fit[:, 1])
k_plot.line(d2_fit[:, 0], d2_fit[:, 1], color=palette[5], legend='d2 fit')

#k_plot.line(i[1:-1], d2_dist, color=palette[0], line_dash=[5,2], legend='scaled d2')
# k_plot.line(i[1:-1], d2_k_dist, color=palette[2], line_dash=[10,1], legend='scaled np d2')

# k_plot.line(i[1:], d_raw_k_dist, color=palette[3], line_dash=[10,1], legend='raw d')
# k_plot.line(i[1:-1], d2_raw_dist, color=palette[1], line_dash=[5,2], legend='raw d2')
'''           

k_plot.legend.location = "top_left"

bokeh.plotting.show(k_plot)

Find the elbow, or point of maximum curvature
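
One rough way to automate this is to take the point where the discrete second derivative of the sorted k-dist curve is largest; this is only a sketch (np.gradient is noisy on real data, which is what the smoothed-difference helpers defined below are for):


In [ ]:
import numpy as np

# k_dist is the sorted nearest-neighbor distance computed above
d1 = np.gradient(k_dist)
d2 = np.gradient(d1)
elbow_index = int(np.argmax(d2))
eps_candidate = k_dist[elbow_index]
print(elbow_index, eps_candidate)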


In [ ]:
x = np.arange(10)

In [ ]:
xx = np.vstack([x,x]).T

In [ ]:
def smooth_diff_n5(x, y):
    # smoothed first derivative over a 5-point window (requires evenly spaced x)
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'
    dy = (2 * (y[3:-1] - y[1:-3]) + y[4:] - y[:-4]) / (8 * h)
    
    dydx = np.vstack([x[2:-2], dy]).T
    return dydx

In [ ]:
def smooth_diff_n7(x, y):
    # smoothed first derivative over a 7-point window (requires evenly spaced x)
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'
    
    dy = (5 * (y[4:-2] - y[2:-4]) + 4 * (y[5:-1] - y[1:-5]) + y[6:] - y[:-6]) / (32 * h)
    
    dydx = np.vstack([x[3:-3], dy]).T
    return dydx

In [ ]:
d1_norm = smooth_diff_n5(i, raw_k_dist/raw_k_dist.max())

In [ ]:
d1_raw = smooth_diff_n5(i, raw_k_dist)

In [ ]:
x = np.linspace(0, 2*np.pi)
y = np.sin(x)
dydx = smooth_diff_n5(x, y)
d2ydx2 = smooth_diff_n5(dydx[:, 0], dydx[:, 1])

p = bokeh.plotting.figure()
p.line(x, y, legend='sin(x)', color=palette[0])
p.line(dydx[:, 0], dydx[:, 1], legend='cos(x)', color=palette[1])
p.line(d2ydx2[:, 0], d2ydx2[:, 1], legend='-sin(x)', color=palette[2])

bokeh.plotting.show(p)

In [ ]:
# Compute DBSCAN
## Tune these parameters to adjust cluster size ##
distance = 1  # <--need to explore tuning this parameter
min_samples = 2
##################################################
x_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_saxs)
x_core_samples_mask = np.zeros_like(x_db.labels_, dtype=bool)
x_core_samples_mask[x_db.core_sample_indices_] = True
saxs_labels = x_db.labels_ + 1 # shift labels so noise (-1) becomes 0; 0's are independent groups
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_saxs)
n_core_samples_mask = np.zeros_like(n_db.labels_, dtype=bool)
n_core_samples_mask[n_db.core_sample_indices_] = True
sans_labels = n_db.labels_ + 1 # shift labels so noise (-1) becomes 0; 0's are independent groups
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_pr)
p_core_samples_mask = np.zeros_like(p_db.labels_, dtype=bool)
p_core_samples_mask[p_db.core_sample_indices_] = True
pr_labels = p_db.labels_ + 1 # shift labels so noise (-1) becomes 0; 0's are independent groups
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)
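
sklearn.metrics was imported above but not yet used; as a rough sanity check (only meaningful when there are at least two distinct labels and not every point is its own group), the silhouette score summarizes how well separated the resulting clusters are:


In [ ]:
# silhouette requires at least 2 distinct labels and fewer labels than samples
if 1 < len(set(saxs_labels)) < len(saxs_labels):
    print('SAXS silhouette: {:.3f}'.format(metrics.silhouette_score(scaled_saxs, saxs_labels)))
if 1 < len(set(pr_labels)) < len(pr_labels):
    print('P(r) silhouette: {:.3f}'.format(metrics.silhouette_score(scaled_pr, pr_labels)))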

In [ ]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))

In [ ]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))

In [ ]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))

In [ ]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [ ]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)

In [ ]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))

In [ ]:
import hdbscan

In [ ]:
# Compute HDBSCAN for the SAXS data (the SANS and P(r) passes below still use DBSCAN)
## Tune these parameters to adjust cluster size ##
distance = 1  # <--need to explore tuning this parameter
min_samples = 2
##################################################
# HDBSCAN takes no eps and does not expose core_sample_indices_;
# its main parameter is min_cluster_size (set to min_samples here as a starting point)
x_db = hdbscan.HDBSCAN(min_cluster_size=min_samples, min_samples=min_samples).fit(scaled_saxs)
saxs_labels = x_db.labels_ + 1 # shift labels so noise (-1) becomes 0; 0's are independent groups
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_saxs)
n_core_samples_mask = np.zeros_like(n_db.labels_, dtype=bool)
n_core_samples_mask[n_db.core_sample_indices_] = True
sans_labels = n_db.labels_ + 1 # 0's are independent groups
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_pr)
p_core_samples_mask = np.zeros_like(p_db.labels_, dtype=bool)
p_core_samples_mask[p_db.core_sample_indices_] = True
pr_labels = p_db.labels_ + 1 # 0's are independent groups
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)

In [ ]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))

In [ ]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))

In [ ]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))

In [ ]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [ ]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)

In [ ]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))

Write DCD output


In [ ]:
import sasmol.sasmol as sasmol

dcd_fname = glob.glob('*.dcd')
assert len(dcd_fname) == 1, 'ERROR: unsure which dcd file to use: {}'.format(dcd_fname)
dcd_fname = dcd_fname[0]

pdb_fname = glob.glob('*.pdb')
assert len(pdb_fname) == 1, 'ERROR: unsure which pdb file to use: {}'.format(pdb_fname)
pdb_fname = pdb_fname[0]

In [ ]:
mol = sasmol.SasMol(0)
mol.read_pdb(pdb_fname)

In [ ]:
if not np.alltrue(sans_labels == saxs_labels):
    print('WARNING: labels do not match\nusing neutron labels')
labels = sans_labels

In [ ]:


In [ ]:
dcd_fname

In [ ]:
# create a dcd for every cluster with >1 frame
dcd_fnames = []
cluster_out_files = [] # dcds for clusters
unique_out_fname = '{}_unique.dcd'.format(dcd_fname[:-4])
dcd_out_file = mol.open_dcd_write(unique_out_fname) # dcd file for unique structures

dcd_in_file = mol.open_dcd_read(dcd_fname)

for i in xrange(len(unique)):
    dcd_fnames.append('{}_c{:02d}.dcd'.format(dcd_fname[:-4], i))
    cluster_out_files.append(mol.open_dcd_write(dcd_fnames[i]))

visited_cluster = set()
dcd_out_frame = 0
cluster_out_frame = np.zeros(len(unique), dtype=int)

for (i, label) in enumerate(labels):
    mol.read_dcd_step(dcd_in_file, i)
    if label == 0:
        dcd_out_frame += 1
        mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
    else:
        cluster_out_frame[label-1] += 1
        # print('adding frame to cluster {}'.format(label-1))
        # print(cluster_out_frame)
        mol.write_dcd_step(cluster_out_files[label-1], 0, cluster_out_frame[label-1])
        if label not in visited_cluster:
            visited_cluster.add(label)
            dcd_out_frame += 1
            mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
        
for cluster_out_file in cluster_out_files:
    mol.close_dcd_write(cluster_out_file)

mol.close_dcd_write(dcd_out_file)    
mol.close_dcd_read(dcd_in_file[0])

PCA Analysis


In [ ]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [ ]:
pca_orig = PCA()
pca_orig.fit(saxs_data)

In [ ]:
pca_scaled = PCA()
pca_scaled.fit(scaled_saxs)

In [ ]:
print(pca_orig.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)

In [ ]:
plt.figure()
plt.plot(q_saxs, pca_orig.explained_variance_ratio_, 'o', label='unscaled')
plt.plot(q_saxs, pca_scaled.explained_variance_ratio_, 's', label='scaled')
plt.legend()
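
As a follow-up sketch (purely illustrative; assumes scaled_saxs and the saxs_labels computed above are still in scope), project the scaled curves onto the first two principal components and color the points by cluster label:


In [ ]:
proj = PCA(n_components=2).fit_transform(scaled_saxs)

plt.figure()
plt.scatter(proj[:, 0], proj[:, 1], c=saxs_labels, cmap='Set1', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('scaled SAXS curves projected onto the first two PCs')
plt.colorbar(label='cluster label')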

In [ ]:


In [ ]:
from sklearn.datasets.samples_generator import make_blobs

In [ ]:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

In [ ]:
X_scaled = StandardScaler().fit_transform(X)

In [ ]:
X_range = X.max(axis=0) - X.min(axis=0)
print(X_range)

In [ ]:
X_scaled_range = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(X_scaled_range)

In [ ]:
X_s2 = StandardScaler().fit_transform(X)

In [ ]:
X_s2_range = X_s2.max(axis=0) - X_s2.min(axis=0)
print(X_s2_range)
