In this section we will explore the use of similar-source, score-based, non-anchored LRs for forensic intelligence applications. Score-based LRs can be a valuable tool for relating forensic traces to each other.
We will use likelihood ratios as a method for exploring the connectivity between forensic traces, as well as the interactive process of asking different kinds of forensic questions and interpreting the results.
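To fix ideas, here is a minimal sketch of a score-based LR computation. Note that the score value and distribution parameters below are purely illustrative and are not taken from the workshop data:
In [ ]:
# A minimal sketch of a score-based LR: a score (here, a distance) between two traces
# is evaluated under score distributions modelled from a reference collection.
# NOTE: the score value and the distribution parameters are purely illustrative.
from scipy.stats import norm
score = 0.4                                   # hypothetical distance between two traces
f_same = norm.pdf(score, loc=0.3, scale=0.1)  # assumed same-source score model
f_diff = norm.pdf(score, loc=0.9, scale=0.2)  # assumed different-source score model
LR = f_same / f_diff                          # LR > 1 favours the same-source hypothesis
print(LR)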
Each cell contains some code that is executed when you select the cell and press:
CTRL + ENTER
After this workshop is over, you will be able to download this notebook any time at: http://www.score_LR_demo.pendingassimilation.com
Execute the following code cell to import all of the Python 3 libraries required to run our example.
This code will also change the colour scheme of all code cells so that you can more easily tell the difference between instruction cells (black text on a white background) and code cells (syntax-coloured text on a black background).
In [30]:
from IPython.core.display import HTML
import os
#def css_styling():
#    # response = urllib.request.urlopen('https://dl.dropboxusercontent.com/u/24373111/custom.css')
#    desktopFile = os.path.expanduser("~\Desktop\EAFS_LR_software\ForensicIntelligence\custom.css")
#    styles = open(desktopFile, "r").read()
#    return HTML(styles)
## importing nice mathy things
import numpy as np # numerical operators
import matplotlib.pyplot as plt # plotting operators
import matplotlib
import random # for random data generation
from sklearn.decomposition import PCA # PCA for plotting higher-dimensional data (scikit-learn replaces the removed matplotlib.mlab.PCA)
from scipy.stats import norm
import networkx as nx # for generating and plotting graphs
from scipy.spatial.distance import pdist, cdist, squareform # pairwise distance operators
from scipy import sparse as sp
#css_styling() # defines the style and colour scheme for the markdown and code cells.
We will generate a dataset meant to emulate the kind of intel available in pan-European drug seizure analysis.
The variables below define our dataset for this example, including the number of seizure batches, the number of samples per seizure batch, and the relative variability observed in the samples. Some noise parameters are introduced as well, and we assume a particular number of marker compounds is measured for each seizure.
Remember! This is only example data to illustrate the technique; you can come back here and change the nature of the example data at any time.
In [31]:
NUMBER_OF_CLASSES = 20 # number of different classes (seizure batches)
SAMPLES_PER_CLASS = 400 # we assume balanced classes
NOISINESS_INTRA_CLASS = 0.25 # expresses the spread within the classes (between 0.0 and 1.0 gives workable data)
NOISINESS_INTER_CLASS = 2.5 # expresses the space between the classes (between 5 and 10 times NOISINESS_INTRA_CLASS is nice)
DIMENSIONALITY = 10 # how many features are measured for this multivariate data
FAKE_DIMENSIONS = 2 # these features are drawn from a different (non-class-specific) distribution so they have no actual class-discriminating power
if NUMBER_OF_CLASSES * SAMPLES_PER_CLASS > 10000:
    print('Too many samples requested; please limit simulated data to 10000 samples to avoid slowing down the server kernel. Using default values.')
    NUMBER_OF_CLASSES = 20 # fall back to the defaults promised above
    SAMPLES_PER_CLASS = 400
print(NUMBER_OF_CLASSES)
In [32]:
## simulate interesting multiclass data
myData = np.empty((0, DIMENSIONALITY))
nonInformativeFeats = np.array(random.sample(range(DIMENSIONALITY), FAKE_DIMENSIONS))
labels = np.repeat(range(NUMBER_OF_CLASSES), SAMPLES_PER_CLASS) # integer class labels
for c in range(NUMBER_OF_CLASSES):
    A = np.random.rand(DIMENSIONALITY, DIMENSIONALITY)
    cov = np.dot(A, A.transpose()) * NOISINESS_INTRA_CLASS # ensure a positive semi-definite covariance matrix, with random relations between variables
    mean = np.random.uniform(-NOISINESS_INTER_CLASS, NOISINESS_INTER_CLASS, DIMENSIONALITY) # random n-dimensional mean in a space we can plot easily
    classSamples = np.random.multivariate_normal(mean, cov, SAMPLES_PER_CLASS)
    myData = np.append(myData, classSamples, axis=0) # concatenate this class onto the dataset
# substitute the non-informative dimensions with samples drawn from the same boring (class-independent) distribution
noise = np.random.multivariate_normal(np.zeros(FAKE_DIMENSIONS), np.identity(FAKE_DIMENSIONS), SAMPLES_PER_CLASS * NUMBER_OF_CLASSES)
myData[:, nonInformativeFeats.astype(int)] = noise
In [33]:
## plotting our dataset to see if it is what we expect
names_temp = dict(matplotlib.colors.cnames) # copy the named colours for plotting, so popitem() does not mutate the originals
col_means = myData.mean(axis=0, keepdims=True)
myData = myData - col_means # mean-centre the data before PCA
col_stds = myData.std(axis=0, keepdims=True)
myData = myData / col_stds # unit variance scaling
scores = PCA(n_components=2).fit_transform(myData) # PCA scores for the first two components (scikit-learn replaces matplotlib.mlab.PCA)
%matplotlib inline
fig = plt.figure()
ax = fig.add_subplot(111)
ax.axis('equal')
for i in range(NUMBER_OF_CLASSES):
    plt.plot(scores[labels == i, 0], scores[labels == i, 1], 'o', markersize=7, color=names_temp.popitem()[1], alpha=0.5)
# plot the classes after PCA just for a rough idea of their overlap
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.title('Transformed samples with class labels after PCA')
plt.show()
Here we must define three parameters that will affect the shape of our graph.
The holdout samples will be used later to draw graphs. We can verify whether useful intel is being provided by checking the original seizure identities of the holdout samples. The holdout set is a random sample from the full generated dataset.
In [34]:
DISTANCE_METRIC = 'canberra'
# this can be any metric supported by scipy.spatial.distance: euclidean, minkowski, cityblock, seuclidean, sqeuclidean,
# cosine, correlation, hamming, jaccard, chebyshev, canberra, braycurtis, mahalanobis, yule
DISTRIBUTION = 'normal' # the distribution used to model the score distributions
HOLDOUT_SIZE = 0.01 # fraction of the data withheld as "new" seizures
In [35]:
# optional feature selection/masking for asking different questions
sz = myData.shape
RELEVANT_FEATURES = range(0, sz[1]) # default: use all features
#### Later feature selection occurs here ####
#RELEVANT_FEATURES = [2,5,7] ####
#### Selected features specifically relevant to precursor chemical composition
#####################################
myData = myData[:, RELEVANT_FEATURES]
The following code cell executes the data division as defined above and reports the size of your reference collection, the size of the holdout set, and the dimensionality of the data. Additionally, a graphical sample of the data is displayed with variable magnitudes mapped to a cold-to-hot colour scale.
In [36]:
# divide the dataset
idx = np.random.choice(np.arange(NUMBER_OF_CLASSES * SAMPLES_PER_CLASS), int(NUMBER_OF_CLASSES * SAMPLES_PER_CLASS * HOLDOUT_SIZE), replace=False)
# a holdout set (fraction set by HOLDOUT_SIZE) is removed to demonstrate the LR values of new samples!
test_samples = myData[idx, :]
test_labels = labels[idx]
train_samples = np.delete(myData, idx, 0)
train_labels = np.delete(labels, idx)
print(train_samples.shape, 'is the size of the data used to model same-source and different-source distributions')
print(test_samples.shape, 'are the points we will evaluate to see LRs achieved')
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].imshow(train_samples[np.random.randint(train_samples.shape[0], size=DIMENSIONALITY * 2), :]) # a random sample of reference rows
axes[1].imshow(test_samples[np.random.randint(test_samples.shape[0], size=test_samples.shape[0]), :]) # a random sample of holdout rows
plt.show()
The following code cell does most of the work necessary to get from multivariate data to univariate score-based likelihood ratios.
In [37]:
# pairwise distance (score) calculations happen here
same_dists = np.empty((0, 1))
diff_dists = np.empty((0, 1))
for labInstance in np.unique(train_labels):
    dists = pdist(train_samples[train_labels == labInstance, :], DISTANCE_METRIC)
    # pdist already returns the condensed form (lower triangle), so there are no duplicate comparisons
    same_dists = np.append(same_dists, np.array(dists))
    del dists
    # compare only against higher-numbered classes so each different-source pair is counted once
    dists = cdist(train_samples[train_labels == labInstance, :], train_samples[train_labels > labInstance, :], DISTANCE_METRIC)
    diff_dists = np.append(diff_dists, np.array(dists).flatten())
minval = min(np.min(diff_dists), np.min(same_dists))
maxval = max(np.max(diff_dists), np.max(same_dists))
bins = np.linspace(minval, maxval, 100) # common plot range for both histograms
# plot the histograms to see the difference between the distributions
# same-source scores
mu_s, std_s = norm.fit(same_dists) # fit the same-source distances with a normal distribution
plt.hist(same_dists, bins, density=True, facecolor='green', alpha=0.65)
y_same = norm.pdf(bins, mu_s, std_s) # evaluate the fitted pdf over the plot range
l = plt.plot(bins, y_same, 'g--', linewidth=1)
# different-source scores
mu_d, std_d = norm.fit(diff_dists) # fit the different-source distances with a normal distribution
plt.hist(diff_dists, bins, density=True, facecolor='blue', alpha=0.65)
y_diff = norm.pdf(bins, mu_d, std_d) # evaluate the fitted pdf over the plot range
l = plt.plot(bins, y_diff, 'b--', linewidth=1)
print('same source comparisons made: ', same_dists.shape[0])
print('diff source comparisons made: ', diff_dists.shape[0])
The following code cell compares the new seizures (remember, we separated them into the holdout set early on) against the distributions that we modelled from our forensic reference collection, to determine how rare a particular score between two illicit drug profiles is in the context of our cumulative knowledge of the market as defined by our seizure collection.
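In symbols, writing $H_{same}$ and $H_{diff}$ for the same-source and different-source hypotheses, the value computed for each pairwise score $s$ is
$$\mathrm{LR}(s) = \frac{p(s \mid H_{same})}{p(s \mid H_{diff})} = \frac{\mathcal{N}(s;\ \mu_s, \sigma_s)}{\mathcal{N}(s;\ \mu_d, \sigma_d)},$$
using the normal parameters fitted in the previous cell. Values above 1 support the same-source hypothesis; values below 1 support the different-source hypothesis.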
In [38]:
print(' mu same: ', mu_s, ' std same: ', std_s)
print(' mu diff: ', mu_d, ' std diff: ', std_d)
newDists = squareform(pdist(test_samples, DISTANCE_METRIC)) # distances between the new samples (unknown group membership)
l = plt.plot(bins, y_diff, 'b-', linewidth=1)
l = plt.plot(bins, y_same, 'g-', linewidth=1)
l = plt.scatter(squareform(newDists), np.ones(squareform(newDists).shape[0]) * max(y_same))
# plot the new distances against the modelled distributions
lr_holder = []
for element in squareform(newDists):
    value = norm.pdf(element, mu_s, std_s) / norm.pdf(element, mu_d, std_d) # score-based LR for this pair
    lr_holder.append(value)
newDists[newDists == 0] = 0.000001 # avoid divide by zero
newDists = np.true_divide(newDists, newDists.max()) # normalise distances to 0.0-1.0 for use as edge weights
In the following code cell we must set the thresholds for drawing edges between seizures in a graph, based on their similarity and on the likelihood ratio of observing that similarity given that they originate from the same source. The output figure at the bottom of the previous code cell can help you decide on a realistic threshold.
The same points will be used for the graph based on similarity scores as for the graph based on likelihood ratios, so that the two can be compared.
In [39]:
# SET A THRESHOLD FOR EDGES:
EDGE_THRESHOLD_DISTANCE = 0.75 # edges with a normalised distance above this are removed
EDGE_THRESHOLD_LR = 1.0 # edges with an LR below this are removed
The following code block examines the pairwise similarity between holdout samples and then compares that similarity to the threshold you defined. The weight of each edge is determined by the magnitude of the similarity score (normalized to a range of 0.0-1.0).
In [40]:
# plot just the distance-based graph
G = nx.Graph() # empty graph
I, J, V = sp.find(newDists) # index pairs and their associated distances
G.add_weighted_edges_from(np.column_stack((I, J, V))) # distance as edge weight
edge_ind_collector = []
for e in G.edges(data=True):
    if G[e[0]][e[1]]['weight'] > EDGE_THRESHOLD_DISTANCE:
        edge_ind_collector.append(e) # collect edges that indicate weak linkage
G.remove_edges_from(edge_ind_collector) # remove them
pos = nx.spring_layout(G) # automate layout in 2-d space
G = nx.convert_node_labels_to_integers(G)
nx.draw(G, pos, node_size=200, edge_color='k', with_labels=True, linewidths=1) # draw
print('node indices:', G.nodes())
print(' seizure ids:', list(test_labels))
The following code block examines the likelihood ratio of observing a particular score between pairs of the holdout samples. The scores are compared against the distributions modelled using our forensic reference samples to determine how rare the observation of such a score is. The weight of each edge is determined by the magnitude of the likelihood ratio of observing the score given the following competing hypotheses:
H_same: the two seizures originate from the same source.
H_diff: the two seizures originate from different sources.
As with the graph based on similarity scores alone, edges whose weights indicate only weak linkage (here, an LR below the threshold) are removed before drawing.
In [41]:
# plot the likelihood-based graph
G2 = nx.Graph() # empty graph
I, J, V = sp.find(squareform(lr_holder)) # index pairs and their associated LR values
G2.add_weighted_edges_from(np.column_stack((I, J, V))) # LR value as edge weight
edge_ind_collector = []
for f in G2.edges(data=True):
    if G2[f[0]][f[1]]['weight'] < EDGE_THRESHOLD_LR:
        edge_ind_collector.append(f) # collect edges that indicate weak linkage
G2.remove_edges_from(edge_ind_collector) # remove them
pos = nx.spring_layout(G2) # automate layout in 2-d space
G2 = nx.convert_node_labels_to_integers(G2)
nx.draw(G2, pos, node_size=200, edge_color='k', with_labels=True, linewidths=1)
print('node indices:', G2.nodes())
print(' seizure ids:', list(test_labels))
Particular variables may be known to relate to characteristics of particular investigative interest. For example, some chemical characteristics quantified in illicit drug profiling may relate to the residual composition of specific precursor compounds.
We can articulate a new source-identity hypothesis in terms of the features we allow our model to use. For example, the feature selection performed thus far is defined in an earlier cell (In [35]):
#### Later feature selection occurs here ####
# RELEVANT_FEATURES = [2,5,7] ####
#### Selected features specifically relevant to precursor chemical composition
#####################################
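To ask the precursor-focused question, you could uncomment that line and re-run the notebook from that cell onward. A minimal sketch, where the feature indices [2, 5, 7] are just the illustrative ones from the comment above:
In [ ]:
# A minimal sketch of re-masking the data for a precursor-focused hypothesis.
# NOTE: the indices below are illustrative, taken from the commented-out line above.
RELEVANT_FEATURES = [2, 5, 7] # features assumed to reflect precursor chemical composition
myData = myData[:, RELEVANT_FEATURES]
# ...then re-run the cells from In [36] onward to rebuild the score distributions and both graphs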
Investigation of the characteristics of a graph that uses similar-source, score-based, non-anchored LRs as a linkage function can yield interesting discoveries pertaining to particular nodes. In this case nodes may represent pieces of evidence, or case dossiers where multiple types of information are considered as features in a multivariate model.
In our example the nodes represent drug seizures, and their features are impurity compounds resulting from chemical fingerprinting.
Some interesting characteristics of the graph might be its connected components (clusters of linked seizures), node centrality (seizures linked to many others), or shortest paths between nodes of interest. The NetworkX algorithms reference lists many more:
https://networkx.github.io/documentation/latest/reference/algorithms.html
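For instance, here is a minimal sketch of querying the LR-based graph built above. It assumes G2 and test_labels are still in memory; the NetworkX calls are standard, but how you interpret the output depends on the thresholds you chose:
In [ ]:
# explore the LR-based graph for investigative leads
components = [sorted(c) for c in nx.connected_components(G2)] # clusters of seizures linked by high LRs
print('linked groups (node indices):', components)
centrality = nx.degree_centrality(G2) # fraction of other nodes each seizure links to
hub = max(centrality, key=centrality.get) # the most connected seizure in the graph
print('most connected node:', hub, 'with centrality', centrality[hub])
print('its original seizure id:', list(test_labels)[hub])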