Previous analyses have been pairwise classifications (linear vs. lineage, early/late vs. split/coalescence, rectangular vs. square neighbor models). Here, we perform a multiclass analysis, and compare the performance of a gradient boosted classifier (GBM) against a simpler nearest-neighbor classifier over graph eigenvalues.
In [1]:
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cPickle as pickle
from copy import deepcopy
from sklearn.utils import shuffle
import sklearn_mmadsen.graphs as skmg
import sklearn_mmadsen.graphics as skmplt
%matplotlib inline
# plt.style.use("fivethirtyeight")
custom_style = {'axes.labelcolor': 'white',
                'xtick.color': 'white',
                'ytick.color': 'white'}
sns.set_style("darkgrid", rc=custom_style)
In [2]:
# Pickle files should be opened in binary mode
sc_1_3_graphs = pickle.load(open("train-cont-sc-1-3-graphs.pkl", 'rb'))
sc_1_3_labels = pickle.load(open("train-cont-sc-1-3-labels.pkl", 'rb'))
print "sc_1_3 cases: ", len(sc_1_3_graphs)
# sc_2_graphs = pickle.load(open("train-cont-sc-2-graphs.pkl", 'rb'))
# sc_2_labels = pickle.load(open("train-cont-sc-2-labels.pkl", 'rb'))
# print "sc_2 cases: ", len(sc_2_graphs)
sc_4_5_graphs = pickle.load(open("train-sc-4-5-cont-graphs.pkl", 'rb'))
sc_4_5_labels = pickle.load(open("train-sc-4-5-cont-labels.pkl", 'rb'))
print "sc_4_5 cases: ", len(sc_4_5_graphs)
NOTE: We remove sc-2 for the moment because its sample sizes are very small, which makes it hard to construct a reliable test set comparable to the other experiments. We will add it back once more simulations have been run.
Now we need to construct a single data set with a 10% test set split, and we'd like it to be fairly even among the class labels.
0 = Complete (fully connected)
1 = Lineage
2 = Rectangular nearest neighbor
3 = Square nearest neighbor
In [3]:
text_labels = ['complete', 'lineage-split', 'rect-nn', 'square-nn']
In [4]:
full_train_graphs = []
full_train_labels = []
full_test_graphs = []
full_test_labels = []
In [5]:
def add_to_dataset(graphs, labels):
    train_graphs, train_labels, test_graphs, test_labels = skmg.graph_train_test_split(graphs, labels, test_fraction=0.1)
    print "train size: %s" % len(train_graphs)
    print "test size: %s" % len(test_graphs)
    full_train_graphs.extend(train_graphs)
    full_train_labels.extend(train_labels)
    full_test_graphs.extend(test_graphs)
    full_test_labels.extend(test_labels)
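For readers without the sklearn_mmadsen package, here is a minimal sketch of what skmg.graph_train_test_split plausibly does; the shuffle-then-slice logic and the random_state are assumptions, not the helper's actual code:

from sklearn.utils import shuffle

def graph_train_test_split_sketch(graphs, labels, test_fraction=0.1):
    # Shuffle graphs and labels in unison, then hold out the final
    # fraction of cases as the test set (assumed behavior).
    graphs, labels = shuffle(graphs, labels, random_state=0)
    n_test = max(1, int(len(graphs) * test_fraction))
    return (graphs[:-n_test], labels[:-n_test],
            graphs[-n_test:], labels[-n_test:])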
In [6]:
add_to_dataset(sc_1_3_graphs, sc_1_3_labels)
#add_to_dataset(sc_2_graphs, sc_2_labels)
add_to_dataset(sc_4_5_graphs, sc_4_5_labels)
The goal here is to construct a standard training and test data matrix of numeric values, which will contain the sorted Laplacian eigenvalues of the graphs in each data set. One feature will thus represent the largest eigenvalue for each graph, a second feature will represent the second largest eigenvalue, and so on.
In [7]:
train_matrix = skmg.graphs_to_eigenvalue_matrix(full_train_graphs, num_eigenvalues=20)
test_matrix = skmg.graphs_to_eigenvalue_matrix(full_test_graphs, num_eigenvalues=20)
print train_matrix.shape
print test_matrix.shape
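Under the hood, the transformation presumably looks something like the sketch below: compute each graph's Laplacian spectrum, sort it in descending order, and zero-pad any graph with fewer nodes than the requested number of eigenvalues. This is an assumption about the sklearn_mmadsen helper, not its actual implementation:

def eigenvalue_matrix_sketch(graphs, num_eigenvalues=20):
    # One row per graph; column 0 holds the largest Laplacian
    # eigenvalue, column 1 the second largest, and so on.
    rows = []
    for g in graphs:
        evals = sorted(nx.laplacian_spectrum(g), reverse=True)
        row = np.zeros(num_eigenvalues)
        n = min(num_eigenvalues, len(evals))
        row[:n] = evals[:n]
        rows.append(row)
    return np.vstack(rows)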
We're going to be using a gradient boosted classifier, which has some of the best accuracy of any of the standard classifier methods. Ultimately we'll find the best hyperparameters using cross-validation, but first we just want to see whether the approach gets us in the right ballpark -- remember, we achieved roughly 80% accuracy with eigenvalue distance alone, so we have to be in that neighborhood or higher for the more complex model to be worth the effort of switching.
In [8]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [9]:
clf = GradientBoostingClassifier(n_estimators = 250)
clf.fit(train_matrix, full_train_labels)
Out[9]:
In [10]:
pred_label = clf.predict(test_matrix)
In [11]:
cm = confusion_matrix(full_test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
print cmdf
print classification_report(full_test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(full_test_labels, pred_label)
In [12]:
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
In [13]:
pipeline = Pipeline([
    ('clf', GradientBoostingClassifier())
])
params = {
    'clf__learning_rate': [5.0, 2.0, 1.0, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01],
    'clf__n_estimators': [10, 25, 50, 100, 250, 500]
}
grid_search = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1)
In [14]:
grid_search.fit(train_matrix, full_train_labels)
Out[14]:
In [15]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:")
best_params = grid_search.best_estimator_.get_params()
for param in sorted(params.keys()):
    print("param: %s: %r" % (param, best_params[param]))
In [16]:
pred_label = grid_search.predict(test_matrix)
In [17]:
cm = confusion_matrix(full_test_labels, pred_label)
cmdf = pd.DataFrame(cm)
cmdf.columns = map(lambda x: 'predicted {}'.format(x), cmdf.columns)
cmdf.index = map(lambda x: 'actual {}'.format(x), cmdf.index)
print cmdf
print classification_report(full_test_labels, pred_label)
print "Accuracy on test: %0.3f" % accuracy_score(full_test_labels, pred_label)
In [18]:
axis_labs = ['Predicted Model', 'Actual Model']
hmap = skmplt.confusion_heatmap(full_test_labels, pred_label, text_labels,
                                axis_labels=axis_labs, transparent=True,
                                reverse_color=True, filename="confusion-heatmap-sc1345.png")
NOTE: the above figure is transparent and white-on-dark for presentation purposes, so the white axis labels are present but may be hard to see against a light notebook background.
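For readers without the sklearn_mmadsen plotting helper, an approximately equivalent figure can be built directly with seaborn; this is a sketch of the idea, not the helper's actual code:

cm = confusion_matrix(full_test_labels, pred_label)
plt.figure(figsize=(8, 6))
# Annotated heatmap of counts, with model names on both axes
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=text_labels, yticklabels=text_labels)
plt.xlabel('Predicted Model')
plt.ylabel('Actual Model')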
I'm not at all sure we have a good catalog of models that would approximate the regional interaction network of the Lower Mississippi River Valley yet, but we do have three model classes that are easily distinguishable (the two nearest-neighbor models are essentially indistinguishable from one another in this analysis).
So let's see which model the optimized fit chooses. We do this by reading in the GML for the minmax-by-weight seriation solution from IDSS, converting it to the same kind of eigenvalue matrix as the training data, with the same number of eigenvalues (even though this PFG subset has 20 assemblages), and then using the fitted, optimized gradient boosted tree model to predict the model class and the class assignment probabilities.
In [19]:
pfg_graph = nx.read_gml("../../data/pfg-cpl-minmax-by-weight-continuity.gml")
In [20]:
# Must match the 20 eigenvalues used to build the training matrix
pfg_test_mat = skmg.graphs_to_eigenvalue_matrix([pfg_graph], num_eigenvalues=20)
In [21]:
pfg_predict = grid_search.predict(pfg_test_mat)
print pfg_predict
In [22]:
np.set_printoptions(precision=4)
np.set_printoptions(suppress=True)
probs = grid_search.predict_proba(pfg_test_mat)
probs
Out[22]:
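Since the column order of predict_proba follows the sorted integer class labels, we can pair each probability with the text labels defined earlier (a small convenience snippet added here for readability):

for lab, p in zip(text_labels, probs[0]):
    print "%12s: %0.3f" % (lab, p)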
Interestingly, there seems to be a vanishingly small probability that the PFG seriation is the result of a pure nearest neighbor model, although the square space fares marginally better than the long, thin space. Overwhelmingly, the predictive weight is on the lineage-splitting model, which makes sense given Carl Lipo's dissertation work. Less predictive probability goes to the complete graph/fully connected model, although at about 0.223 it is not negligible, so there is clearly something about densely connected graphs that resonates with the PFG situation (the lineage-splitting model is essentially a set of complete graphs within each lineage, too).
Clearly, however, we need a larger catalog of model classes and variants from which to select. That's next on the todo list.
As a cross-check on the boosted model, we can also run the nearest-neighbor classifier over graph eigenvalues mentioned at the outset:
In [23]:
gclf = skmg.GraphEigenvalueNearestNeighbors(n_neighbors=5)
gclf.fit(full_train_graphs, full_train_labels)
Out[23]:
In [24]:
gclf.predict([pfg_graph])[0]
Out[24]:
In [25]:
distances = gclf.predict_distance_to_train(pfg_graph)
In [26]:
distances.head()
Out[26]:
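The distances DataFrame pairs each training graph's model label with its spectral distance to the PFG graph. A plausible sketch of what predict_distance_to_train computes, assuming Euclidean distance in the same eigenvalue feature space (the real helper lives in sklearn_mmadsen):

query = skmg.graphs_to_eigenvalue_matrix([pfg_graph], num_eigenvalues=20)[0]
distances_sketch = pd.DataFrame({
    'model': [text_labels[l] for l in full_train_labels],
    'distance': np.linalg.norm(train_matrix - query, axis=1),
})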
In [27]:
g = sns.FacetGrid(distances, col="model", margin_titles=True)
bins = np.linspace(0, 10, 20)
g.map(sns.distplot, "distance", bins=bins, color="steelblue")
Out[27]:
In this set of analyses, I have the sense that we have good discrimination between models, but that the PFG empirical data set possibly lies outside the boundary of any of the network models, so the train/test split is crucial in determining how it classifies. I wonder whether we can visualize that, perhaps by doing dimensionality reduction on the eigenvalue data set and then seeing where the PFG data lies in the reduced manifold.
In [28]:
from sklearn import decomposition
from sklearn import manifold
In [29]:
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
In [30]:
X_tsne = tsne.fit_transform(train_matrix)
In [ ]:
plt.figure(figsize=(11,8.5))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=full_train_labels, cmap=plt.cm.Spectral)
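The scatter above embeds only the training data. To see where the PFG case lands, one option is to append its eigenvalue row before fitting, since this version of TSNE has no out-of-sample transform; the star marker below is just an illustrative choice:

combined = np.vstack([train_matrix, pfg_test_mat])
X_comb = manifold.TSNE(n_components=2, init='pca', random_state=0).fit_transform(combined)
plt.figure(figsize=(11, 8.5))
# Training graphs colored by model class; the final row is the PFG graph
plt.scatter(X_comb[:-1, 0], X_comb[:-1, 1], c=full_train_labels, cmap=plt.cm.Spectral)
plt.scatter(X_comb[-1, 0], X_comb[-1, 1], c='black', s=200, marker='*', label='PFG')
plt.legend()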
In [ ]:
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
X_tsne = tsne.fit_transform(train_matrix)
In [ ]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1, figsize=(8, 8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=25, azim=100)
# Use the 3D axes for the scatter; plt.scatter's third positional argument is marker size, not z
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=full_train_labels, cmap=plt.cm.Spectral)
In [ ]:
test_g = full_test_graphs[16]
plt.figure(figsize=(12,8))
# Strip the "assemblage-" prefix from node labels for readability
label_map = dict()
for n, d in test_g.nodes_iter(data=True):
    label_map[n] = d['label'].replace("assemblage-", "")
pos = nx.graphviz_layout(test_g, prog="neato")
nx.draw_networkx(test_g, pos, with_labels=True, labels=label_map)
plt.savefig("test_graph_sc1_5.png", transparent=False)
In [ ]:
full_test_labels[16]