In this section we will explore the use of similar-source, score-based, non-anchored LRs for forensic intelligence applications. Score-based LRs can be a valuable tool for relating forensic traces to each other.
We will use likelihood ratios as a method for exploring the connectivity between forensic traces, as well as the interactive process of asking different kinds of forensic questions and interpreting the results.
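To fix ideas, here is a minimal sketch of a score-based LR computation. Note that the score value and distribution parameters below are purely illustrative and are not taken from the workshop data:
In [ ]:
# A minimal sketch of a score-based LR: a score (here, a distance) between two traces
# is evaluated under score distributions modelled from a reference collection.
# NOTE: the score value and the distribution parameters are purely illustrative.
from scipy.stats import norm
score = 0.4                                   # hypothetical distance between two traces
f_same = norm.pdf(score, loc=0.3, scale=0.1)  # assumed same-source score model
f_diff = norm.pdf(score, loc=0.9, scale=0.2)  # assumed different-source score model
LR = f_same / f_diff                          # LR > 1 favours the same-source hypothesis
print(LR)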
Each cell contains some code that is executed when you select the cell and press:
CTRL + ENTER
After this workshop is over, you will be able to download this notebook any time at: http://www.score_LR_demo.pendingassimilation.com
Execute the following code cell to import all of the Python 3 libraries required to run our example.
This code will also change the colour scheme of all code cells so that you can more easily tell the difference between instruction cells (black text on a white background) and code cells (syntax-coloured text on a black background).
In [30]:
from IPython.core.display import HTML
import os
#def css_styling():
#    # response = urllib.request.urlopen('https://dl.dropboxusercontent.com/u/24373111/custom.css')
#    desktopFile = os.path.expanduser("~\Desktop\EAFS_LR_software\ForensicIntelligence\custom.css")
#    styles = open(desktopFile, "r").read()
#    return HTML(styles)
## importing nice mathy things
import numpy as np # numerical operators
import matplotlib.pyplot as plt # plotting operators
import matplotlib
import random # for random data generation
from sklearn.decomposition import PCA # PCA for plotting higher-dimensional data (scikit-learn replaces the removed matplotlib.mlab.PCA)
from scipy.stats import norm
import networkx as nx # for generating and plotting graphs
from scipy.spatial.distance import pdist, cdist, squareform # pairwise distance operators
from scipy import sparse as sp
#css_styling() # defines the style and colour scheme for the markdown and code cells.
We will generate a dataset meant to emulate the kind of intel available in pan-European drug seizure analysis.
The variables below define our dataset for this example, including the number of seizure batches, the number of samples per seizure batch, and the relative variability observed in the samples. Some noise parameters are introduced as well, and we assume a particular number of marker compounds is measured for each seizure.
Remember! This is only example data to illustrate the technique; you can come back here and change the nature of the example data at any time.
In [31]:
NUMBER_OF_CLASSES = 20 # number of different classes (seizure batches)
SAMPLES_PER_CLASS = 400 # we assume balanced classes
NOISINESS_INTRA_CLASS = 0.25 # expresses the spread within the classes (between 0.0 and 1.0 gives workable data)
NOISINESS_INTER_CLASS = 2.5 # expresses the space between the classes (between 5 and 10 times NOISINESS_INTRA_CLASS is nice)
DIMENSIONALITY = 10 # how many features are measured for this multivariate data
FAKE_DIMENSIONS = 2 # these features are drawn from a different (non-class-specific) distribution so they have no actual class-discriminating power
if NUMBER_OF_CLASSES * SAMPLES_PER_CLASS > 10000:
    print('Too many samples requested; please limit simulated data to 10000 samples to avoid slowing down the server kernel. Using default values.')
    NUMBER_OF_CLASSES = 20 # fall back to the defaults promised above
    SAMPLES_PER_CLASS = 400
print(NUMBER_OF_CLASSES)
In [32]:
## simulate interesting multiclass data
myData = np.empty((0, DIMENSIONALITY))
nonInformativeFeats = np.array(random.sample(range(DIMENSIONALITY), FAKE_DIMENSIONS))
labels = np.repeat(range(NUMBER_OF_CLASSES), SAMPLES_PER_CLASS) # integer class labels
for c in range(NUMBER_OF_CLASSES):
    A = np.random.rand(DIMENSIONALITY, DIMENSIONALITY)
    cov = np.dot(A, A.transpose()) * NOISINESS_INTRA_CLASS # ensure a positive semi-definite covariance matrix, with random relations between variables
    mean = np.random.uniform(-NOISINESS_INTER_CLASS, NOISINESS_INTER_CLASS, DIMENSIONALITY) # random n-dimensional mean in a space we can plot easily
    classSamples = np.random.multivariate_normal(mean, cov, SAMPLES_PER_CLASS)
    myData = np.append(myData, classSamples, axis=0) # concatenate this class onto the dataset
# substitute the non-informative dimensions with samples drawn from the same boring (class-independent) distribution
noise = np.random.multivariate_normal(np.zeros(FAKE_DIMENSIONS), np.identity(FAKE_DIMENSIONS), SAMPLES_PER_CLASS * NUMBER_OF_CLASSES)
myData[:, nonInformativeFeats.astype(int)] = noise
In [33]:
## plotting our dataset to see if it is what we expect
names_temp = dict(matplotlib.colors.cnames) # copy the named colours for plotting, so popitem() does not mutate the originals
col_means = myData.mean(axis=0, keepdims=True)
myData = myData - col_means # mean-centre the data before PCA
col_stds = myData.std(axis=0, keepdims=True)
myData = myData / col_stds # unit variance scaling
scores = PCA(n_components=2).fit_transform(myData) # PCA scores for the first two components (scikit-learn replaces matplotlib.mlab.PCA)
%matplotlib inline
fig = plt.figure()
ax = fig.add_subplot(111)
ax.axis('equal')
for i in range(NUMBER_OF_CLASSES):
    plt.plot(scores[labels == i, 0], scores[labels == i, 1], 'o', markersize=7, color=names_temp.popitem()[1], alpha=0.5)
# plot the classes after PCA just for a rough idea of their overlap
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.title('Transformed samples with class labels after PCA')
plt.show()
Here we must define three parameters that will affect the shape of our graph.
The holdout samples will be used later to draw graphs. We can verify whether useful intel is being provided by checking the original seizure identities of the holdout samples. The holdout set is a random sample from the full generated dataset.
In [34]:
DISTANCE_METRIC = 'canberra'
# this can be any metric supported by scipy.spatial.distance: euclidean, minkowski, cityblock, seuclidean, sqeuclidean,
# cosine, correlation, hamming, jaccard, chebyshev, canberra, braycurtis, mahalanobis, yule
DISTRIBUTION = 'normal' # the distribution used to model the score distributions
HOLDOUT_SIZE = 0.01 # fraction of the data withheld as "new" seizures
In [35]:
# optional feature selection/masking for asking different questions
sz = myData.shape
RELEVANT_FEATURES = range(0, sz[1]) # default: use all features
#### Later feature selection occurs here ####
#RELEVANT_FEATURES = [2,5,7] ####
#### Selected features specifically relevant to precursor chemical composition
#####################################
myData = myData[:, RELEVANT_FEATURES]
The following code cell executes the data division as defined above and reports the size of your reference collection, the size of the holdout set, and the dimensionality of the data. Additionally, a graphical sample of the data is displayed with variable magnitudes mapped to a cold-to-hot colour scale.
In [36]:
# divide the dataset
idx = np.random.choice(np.arange(NUMBER_OF_CLASSES * SAMPLES_PER_CLASS), int(NUMBER_OF_CLASSES * SAMPLES_PER_CLASS * HOLDOUT_SIZE), replace=False)
# a holdout set (fraction set by HOLDOUT_SIZE) is removed to demonstrate the LR values of new samples!
test_samples = myData[idx, :]
test_labels = labels[idx]
train_samples = np.delete(myData, idx, 0)
train_labels = np.delete(labels, idx)
print(train_samples.shape, 'is the size of the data used to model same-source and different-source distributions')
print(test_samples.shape, 'are the points we will evaluate to see LRs achieved')
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].imshow(train_samples[np.random.randint(train_samples.shape[0], size=DIMENSIONALITY * 2), :]) # a random sample of reference rows
axes[1].imshow(test_samples[np.random.randint(test_samples.shape[0], size=test_samples.shape[0]), :]) # a random sample of holdout rows
plt.show()
The following code cell does most of the work necessary to get from multivariate data to univariate score-based likelihood ratios.
In [37]:
# pairwise distance (score) calculations happen here
same_dists = np.empty((0, 1))
diff_dists = np.empty((0, 1))
for labInstance in np.unique(train_labels):
    dists = pdist(train_samples[train_labels == labInstance, :], DISTANCE_METRIC)
    # pdist already returns the condensed form (lower triangle), so there are no duplicate comparisons
    same_dists = np.append(same_dists, np.array(dists))
    del dists
    # compare only against higher-numbered classes so each different-source pair is counted once
    dists = cdist(train_samples[train_labels == labInstance, :], train_samples[train_labels > labInstance, :], DISTANCE_METRIC)
    diff_dists = np.append(diff_dists, np.array(dists).flatten())
minval = min(np.min(diff_dists), np.min(same_dists))
maxval = max(np.max(diff_dists), np.max(same_dists))
bins = np.linspace(minval, maxval, 100) # common plot range for both histograms
# plot the histograms to see the difference between the distributions
# same-source scores
mu_s, std_s = norm.fit(same_dists) # fit the same-source distances with a normal distribution
plt.hist(same_dists, bins, density=True, facecolor='green', alpha=0.65)
y_same = norm.pdf(bins, mu_s, std_s) # evaluate the fitted pdf over the plot range
l = plt.plot(bins, y_same, 'g--', linewidth=1)
# different-source scores
mu_d, std_d = norm.fit(diff_dists) # fit the different-source distances with a normal distribution
plt.hist(diff_dists, bins, density=True, facecolor='blue', alpha=0.65)
y_diff = norm.pdf(bins, mu_d, std_d) # evaluate the fitted pdf over the plot range
l = plt.plot(bins, y_diff, 'b--', linewidth=1)
print('same source comparisons made: ', same_dists.shape[0])
print('diff source comparisons made: ', diff_dists.shape[0])
The following code cell compares the new seizures (remember, we separated them into the holdout set early on) against the distributions that we modelled from our forensic reference collection, to determine how rare a particular score between two illicit drug profiles is in the context of our cumulative knowledge of the market as defined by our seizure collection.
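In symbols, writing $H_{same}$ and $H_{diff}$ for the same-source and different-source hypotheses, the value computed for each pairwise score $s$ is
$$\mathrm{LR}(s) = \frac{p(s \mid H_{same})}{p(s \mid H_{diff})} = \frac{\mathcal{N}(s;\ \mu_s, \sigma_s)}{\mathcal{N}(s;\ \mu_d, \sigma_d)},$$
using the normal parameters fitted in the previous cell. Values above 1 support the same-source hypothesis; values below 1 support the different-source hypothesis.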
In [38]:
print(' mu same: ', mu_s, ' std same: ', std_s)
print(' mu diff: ', mu_d, ' std diff: ', std_d)
newDists = squareform(pdist(test_samples, DISTANCE_METRIC)) # distances between the new samples (unknown group membership)
l = plt.plot(bins, y_diff, 'b-', linewidth=1)
l = plt.plot(bins, y_same, 'g-', linewidth=1)
l = plt.scatter(squareform(newDists), np.ones(squareform(newDists).shape[0]) * max(y_same))
# plot the new distances against the modelled distributions
lr_holder = []
for element in squareform(newDists):
    value = norm.pdf(element, mu_s, std_s) / norm.pdf(element, mu_d, std_d) # score-based LR for this pair
    lr_holder.append(value)
newDists[newDists == 0] = 0.000001 # avoid divide by zero
newDists = np.true_divide(newDists, newDists.max()) # normalise distances to 0.0-1.0 for use as edge weights
In the following code cell we must set the thresholds for drawing edges between seizures in a graph, based on their similarity and on the likelihood ratio of observing that similarity given that they originate from the same source. The output figure at the bottom of the previous code cell can help you decide on a realistic threshold.
The same points will be used for the graph based on similarity scores as for the graph based on likelihood ratios, so that the two can be compared.
In [39]:
# SET A THRESHOLD FOR EDGES:
EDGE_THRESHOLD_DISTANCE = 0.75 # edges with a normalised distance above this are removed
EDGE_THRESHOLD_LR = 1.0 # edges with an LR below this are removed
The following code block examines the pairwise similarity between holdout samples and then compares that similarity to the threshold you defined. The weight of each edge is determined by the magnitude of the similarity score (normalized to a range of 0.0-1.0).
In [40]:
# plot just the distance-based graph
G = nx.Graph() # empty graph
I, J, V = sp.find(newDists) # index pairs and their associated distances
G.add_weighted_edges_from(np.column_stack((I, J, V))) # distance as edge weight
edge_ind_collector = []
for e in G.edges(data=True):
    if G[e[0]][e[1]]['weight'] > EDGE_THRESHOLD_DISTANCE:
        edge_ind_collector.append(e) # collect edges that indicate weak linkage
G.remove_edges_from(edge_ind_collector) # remove them
pos = nx.spring_layout(G) # automate layout in 2-d space
G = nx.convert_node_labels_to_integers(G)
nx.draw(G, pos, node_size=200, edge_color='k', with_labels=True, linewidths=1) # draw
print('node indices:', G.nodes())
print(' seizure ids:', list(test_labels))
The following code block examines the likelihood ratio of observing a particular score between pairs of the holdout samples. The scores are compared against the distributions modelled using our forensic reference samples to determine how rare the observation of such a score is. The weight of each edge is determined by the magnitude of the likelihood ratio of observing the score given the following competing hypotheses:
H_same: the two seizures originate from the same source.
H_diff: the two seizures originate from different sources.
As with the graph based on similarity scores alone, edges whose weights indicate only weak linkage (here, an LR below the threshold) are removed before drawing.
In [41]:
# plot the likelihood-based graph
G2 = nx.Graph() # empty graph
I, J, V = sp.find(squareform(lr_holder)) # index pairs and their associated LR values
G2.add_weighted_edges_from(np.column_stack((I, J, V))) # LR value as edge weight
edge_ind_collector = []
for f in G2.edges(data=True):
    if G2[f[0]][f[1]]['weight'] < EDGE_THRESHOLD_LR:
        edge_ind_collector.append(f) # collect edges that indicate weak linkage
G2.remove_edges_from(edge_ind_collector) # remove them
pos = nx.spring_layout(G2) # automate layout in 2-d space
G2 = nx.convert_node_labels_to_integers(G2)
nx.draw(G2, pos, node_size=200, edge_color='k', with_labels=True, linewidths=1)
print('node indices:', G2.nodes())
print(' seizure ids:', list(test_labels))
Particular variables may be known to relate to characteristics of particular investigative interest. For example, some chemical characteristics quantified in illicit drug profiling may relate to the residual composition of specific precursor compounds.
We can articulate a new source-identity hypothesis in terms of the features we allow our model to use. For example, the feature selection performed thus far is defined in an earlier cell (In [35]):
#### Later feature selection occurs here ####
# RELEVANT_FEATURES = [2,5,7] ####
#### Selected features specifically relevant to precursor chemical composition
#####################################
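To ask the precursor-focused question, you could uncomment that line and re-run the notebook from that cell onward. A minimal sketch, where the feature indices [2, 5, 7] are just the illustrative ones from the comment above:
In [ ]:
# A minimal sketch of re-masking the data for a precursor-focused hypothesis.
# NOTE: the indices below are illustrative, taken from the commented-out line above.
RELEVANT_FEATURES = [2, 5, 7] # features assumed to reflect precursor chemical composition
myData = myData[:, RELEVANT_FEATURES]
# ...then re-run the cells from In [36] onward to rebuild the score distributions and both graphs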
Investigation of the characteristics of a graph that uses similar-source, score-based, non-anchored LRs as a linkage function can yield interesting discoveries pertaining to particular nodes. In this case nodes may represent pieces of evidence, or case dossiers where multiple types of information are considered as features in a multivariate model.
In our example the nodes represent drug seizures, and their features are impurity compounds resulting from chemical fingerprinting.
Some interesting characteristics of the graph might be its connected components (clusters of linked seizures), node centrality (seizures linked to many others), or shortest paths between nodes of interest. The NetworkX algorithms reference lists many more:
https://networkx.github.io/documentation/latest/reference/algorithms.html
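For instance, here is a minimal sketch of querying the LR-based graph built above. It assumes G2 and test_labels are still in memory; the NetworkX calls are standard, but how you interpret the output depends on the thresholds you chose:
In [ ]:
# explore the LR-based graph for investigative leads
components = [sorted(c) for c in nx.connected_components(G2)] # clusters of seizures linked by high LRs
print('linked groups (node indices):', components)
centrality = nx.degree_centrality(G2) # fraction of other nodes each seizure links to
hub = max(centrality, key=centrality.get) # the most connected seizure in the graph
print('most connected node:', hub, 'with centrality', centrality[hub])
print('its original seizure id:', list(test_labels)[hub])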