In [49]:
import pdb #debugger
import math
import numpy as np
Many data mining problems can be described as fining similar sets:
We can get both false positive and false negative. To complete the task we will use 3 steps below
Concerting document to sets
for each hi:
compute hi(r)
for each column c:
if c==1 in row r:
for each hi:
if hi(r) < M(i,c):
M(i,c) = hi(r)
Distance describe similarities of the points. Distance have to:
The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify which of the following is a true statement about the number of pairs at a certain edit distance.
In [44]:
from itertools import combinations as Combinations
#https://github.com/ztane/python-Levenshtein
from Levenshtein import distance as EditDistance
from Levenshtein import editops as EditOps
from collections import Counter
inputWords = ['he', 'she', 'his', 'hers']
distanceGroups = []
for wordA,wordB in list(Combinations(inputWords,2)): #processing all combinations of words
editDistance = EditDistance(wordA,wordB)
print "Distance between %s and %s is: %i" %(wordA,wordB,editDistance)
print EditOps(wordA,wordB)
distanceGroups.append(editDistance)
#count them
Counter(distanceGroups)
Out[44]:
Note that question asked about insertion and deletion only, which is different to standard defintion as it ignores replacement. We can calculate this by hand using this presentation. This will change results for groups with replace giving us:
Consider the given matrix and perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. Which of the following is the correct minhash value of the stated column? Note: we give the minhash value in terms of the original name of the row, rather than the order of the row in the permutation. These two schemes are equivalent, since we only care whether hash values for two columns are equal, not what their actual values are.
In [120]:
M = np.matrix('0,1,1,0;\
1,0,1,1;\
0,1,0,1;\
0,0,1,0;\
1,0,1,0;\
0,1,0,0')
print M
mininhashOrder = [x -1 for x in [4,6,1,3,5,2]] #python idx from 0
print mininhashOrder
orderedM = M[mininhashOrder,:]
print orderedM
print "Note that answer is based on ORGINAL rows!"
minihashValues = [4,1,0,3]
for idx in range(len(minihashValues)):
print "Column c%i == R%i" %(idx+1,mininhashOrder[minihashValues[idx]]+1)
In [117]:
mininhashOrder[1]
Out[117]:
In [78]:
#http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
def CaclulateNgrams(document,ngram_size):
ngram = zip(*[document[idx:] for idx in range(ngram_size)])
return list(set(ngram)) #remove duplicates
In [84]:
documentA = 'ABRACADABRA'
documentB = 'BRICABRAC'
ngram_documentA = CaclulateNgrams(documentA,2)
ngram_documentB = CaclulateNgrams(documentB,2)
ngram_same = list(set(ngram_documentA)&set(ngram_documentB))
print ngram_documentA
print CaclulateNgrams(documentB,2)
print "We have %i ngrams for document A and %i for document B" %(len(ngram_documentA),len(ngram_documentB))
similiarityJaccard = len(ngram_same)/float(len(ngram_documentA)+len(ngram_documentB))
print "%i are in common between documents, leading to Jaccard similarity of %.2f" %(len(ngram_same),similiarityJaccard)
print ngram_same
Suppose we want to assign points to whichever of the points (0,0) or (100,40) is nearer. Depending on whether we use the L1 or L2 norm, a point (x,y) could be clustered with a different one of these two points. For this problem, you should work out the conditions under which a point will be assigned to (0,0) when the L1 norm is used, but assigned to (100,40) when the L2 norm is used. Identify one of those points from the list below.
In [171]:
from scipy.spatial.distance import cdist as Distance
def IsThePointCorrect(point):
pointA = [(0,0)]
pointB = [(100,40)]
#print Distance(pointA,point,'cityblock')[0]
#print Distance(pointB,point,'cityblock')[0]
#print Distance(pointA,point,'euclidean')[0]
#print Distance(pointB,point,'euclidean')[0]
if (Distance(pointA,point,'cityblock')[0]<Distance(pointB,point,'cityblock')[0]) \
& (Distance(pointB,point,'euclidean')[0]<Distance(pointA,point,'cityblock')[0]):
isThePointCorrect = True
else: isThePointCorrect = False
return isThePointCorrect
In [173]:
print IsThePointCorrect([(56,13)])
print IsThePointCorrect([(61,10)])
print IsThePointCorrect([(58,13)])
print IsThePointCorrect([(58,13)])
Suppose we have transactions that satisfy the following assumptions:
Integers are always represented by 4 bytes. Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers. As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minumum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.