Quiz - Week 5B

Q1.

We wish to cluster the following set of points:

into 10 clusters. We initially choose each of the green points (25,125), (44,105), (29,97), (35,63), (55,63), (42,57), (23,40), (64,37), (33,22), and (55,20) as a centroid. Assign each of the gold points to their nearest centroid.
- Note: the scales of the horizontal and vertical axes differ, so you really need to apply the formula for distance of points; you can't just "eyeball" it.
Then, recompute the centroids of each of the clusters. Do any of the points then get reassigned to a new cluster on the next round? Identify the true statement in the list below. Each statement refers either to a centroid AFTER recomputation of centroids (precise to one decimal place) or to a point that gets reclassified.



In [34]:

    
# Solution 
import numpy as np
import math
def dist(pt1, pt2):
    return math.sqrt( (pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2 )

pts1 = [ (25,125), (44,105), (29,97), (35, 63), (55, 63), (42, 57), (23, 40), (64,37), (33,22), (55,20) ]
pts2 = [ (28,145), (38,115), (50,130),(65,140), (55,118), (50, 90), (43, 83), (63,88), (50,60), (50,30) ]

clusters = [ [pt] for pt in pts1]
# simulate pass 1
for pt2 in pts2:
    minDist = 9999
    minIdx  = None
    
    # find the closest centroid
    for idx, pt1 in enumerate(pts1):
        if minDist > dist(pt1, pt2):
            minDist = dist(pt1, pt2)
            minIdx = idx
    
    clusters[minIdx].append(pt2)

    
centroids = [ [ sum(y) / float(len(y)) for y in zip(*parray) ] for parray in clusters]

# print centroids after RECOMPUTATION
print ", ".join([ "({0:.1f} {1:.1f})".format(*pt) for pt in centroids])









    



(34.3 133.3), (52.5 109.3), (36.0 90.0), (35.0 63.0), (52.5 61.5), (42.0 57.0), (23.0 40.0), (64.0 37.0), (33.0 22.0), (52.5 25.0)

Q2.

When performing a k-means clustering, success depends very much on the initially chosen points. Suppose that we choose two centroids (a,b) = (5,10) and (c,d) = (20,5), and the data truly belongs to two rectangular clusters, as suggested by the following diagram:

Under what circumstances will the initial clustering be successful?
- That is, under what conditions will all the yellow points be assigned to the centroid (5,10), while all of the blue points are assigned to cluster (20,5))?
- Identify in the list below, a pair of rectangles (described by their upper left corner, UL, and their lower-right corner LR) that are successfully clustered.



In [33]:

    
def assign_cluster(yellow_centroid, blue_centroid, ul, lr):
    if dist(ul, yellow_centroid) <= dist(ul, blue_centroid ) and \
       dist(lr, yellow_centroid) <= dist(lr, blue_centroid ) :
        print "yellow"
    elif dist(ul, yellow_centroid) > dist(ul, blue_centroid ) and \
       dist(lr, yellow_centroid) > dist(lr, blue_centroid ) :
        print "blue"
    else:
        print "none clustered"
        
#option 1
print "Option 1"
assign_cluster((5,10) , (20,5), (6,7), (11,14))
assign_cluster((5,10) , (20,5), (11,5), (17,2))

#option 2
print "Option 2"
assign_cluster((5,10) , (20,5), (6,7), (11,4))
assign_cluster((5,10) , (20,5), (14,10), (23,6))

#option 3
print "Option 3"
assign_cluster((5,10) , (20,5), (3,15), (13,7))
assign_cluster((5,10) , (20,5), (11,5), (17,2))

#option 4
print "Option 3"
assign_cluster((5,10) , (20,5), (3,15), (13,7))
assign_cluster((5,10) , (20,5), (14,10), (23,6))









    



Option 1
yellow
none clustered
Option 2
yellow
blue
Option 3
none clustered
none clustered
Option 3
none clustered
blue

Q3.

Suppose we apply the BALANCE algorithm with bids of 0 or 1 only, to a situation where advertiser A bids on query words x and y, while advertiser B bids on query words x and z. Both have a budget of $2.
Identify in the list below a sequence of four queries that will certainly be handled optimally by the algorithm.
- xxxz
- xzzz
- yzyy
- yxxz √

Q4.

The set cover problem is:
- given a list of sets, find a smallest collection of these sets such that every element in any of the sets is in at least one set of the collection.
- As we form a collection, we say an element is covered if it is in at least one set of the collection.
  - Note: In this problem, we shall represent sets by concatenating their elements, without brackets or commas.
  - For example, {A,B} will be represented simply as AB. There are many greedy algorithms that could be used to pick a collection of sets that is close to as small as possible.
- Here are some that you will consider in this problem.
  - Dumb: Select sets for the collection in the order in which they appear on the list. Stop when all elements are covered.
  - Simple: Consider sets in the order in which they appear on the list. When it is considered, select a set if it has at least one element that is not already covered. Stop when all elements are covered.
  - Largest-First: Consider sets in order of their size. If there are ties, break the tie in favor of the one that appears first on the list. When it is considered, select a set if it has at least one element that is not already covered. Stop when all elements are covered.
  - Most-Help: Consider sets in order of the number of elements they contain that are not already covered. If there are ties, break the tie in favor of the one that appears first on the list. Stop when all elements are covered.
- Here is a list of sets: AB, BC, CD, DE, EF, FG, GH, AH, ADG, ADF First, determine the optimum solution, that is, the fewest sets that can be selected for a collection that covers all eight elements A,B,...,H.
- Then, determine the sizes of the collections that will be constructed by each of the four algorithms mentioned above.
- Compute the ratio of the size returned by the algorithm to the optimum size, and identify one of these ratios in the list below, correct to two decimal places.

Solution 4.

Optimal Solution:
- one alternative - (ADG + BC + EF + GH): size = 4
Dumb:
- result: (AB + BC + CD + DE + EF + FG + GH): size = 7
Simple:
- result: (AB + BC + CD + DE + EF + FG + GH): size = 7
Largest-first:
- result: (ADG + AB + BC + DE + EF + GH): size = 6
Most-Help:
- result: (ADG + BC + EF + GH): size = 4

Q5.

This bipartite graph:
Has several perfect matchings. Find all the perfect matchings and then identify, in the list below, a pair of edges that can appear together in a perfect matching.

Solution 5.

Perfect Match:
- 1. a0 - b0, a1 - b2, a2 - b4, a3 - b1, a4 - b3
- 1. a0 - b1, a1 - b3, a2 - b0, a3 - b2, a4 - b4

Quiz - Week 5A

Q1.

Consider the diagonal matrix M =

1   0   0
0   2   0
0   0   0

Compute its Moore-Penrose pseudoinverse, and then identify, in the list below, the true statement about the elements of the pseudoinverse.



In [36]:

    
import numpy as np

mat = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 0]])
pinv = np.linalg.pinv(mat)
print pinv









    



[[ 1.   0.   0. ]
 [ 0.   0.5  0. ]
 [ 0.   0.   0. ]]

Q2.

An ad publisher selects three ads to place on each page, in order from the top. Click-through rates (CTR's) at each position differ for each advertiser, and each advertiser has a different CTR for each position. Each advertiser bids for click-throughs, and each advertiser has a daily budget, which may not be exceeded. When a click-through occurs, the advertiser pays the amount they bid. In one day, there are 101 click-throughs to be auctioned.
Here is a table of the bids, CTR's for positions 1, 2, and 3, and budget for each advertiser.


Ad  Bid     CTR1    CTR2    CTR3    Budget
A   $.10    .015    .010    .005    $1
B   $.09    .016    .012    .006    $2
C   $.08    .017    .014    .007    $3
D   $.07    .018    .015    .008    $4
E   $.06    .019    .016    .010    $5

The publisher uses the following strategy to allocate the three ad slots:
- Any advertiser whose budget is spent is ignored in what follows.
  - The first slot goes to the advertiser whose expected yield for the first slot (product of the bid and the CTR for the first slot) is the greatest. This advertiser is ignored in what follows.
  - The second slot goes to the advertiser whose expected yield for the second slot (product of the bid and the CTR for the second slot) is the greatest. This advertiser is ignored in what follows.
  - The third slot goes to the advertiser whose expected yield for the third slot (product of the bid and the CTR for the third slot) is the greatest.
- The same three advertisers get the three ad positions until one of two things happens:
  - An advertiser runs out of budget, or
  - All 101 click-throughs have been obtained.
- Either of these events ends one phase of the allocation. If a phase ends because an advertiser ran out of budget, then they are assumed to get all the clicks their budget buys. During the same phase, we calculate the number of click-throughs received by the other two advertisers by assuming that all three received click-throughs in proportion to their respective CTR's for their positions (round to the nearest integer). If click-throughs remain, the publisher reallocates all three slots and starts a new phase.
- If the phase ends because all click-throughs have been allocated, assume that the three advertisers received click-throughs in proportion to their respective CTR's (again, rounding if necessary).
- Your task is to simulate the allocation of slots and to determine how many click-throughs each of the five advertisers get.



In [49]:

    
import numpy as np
print "slot 1 ============"
slot1 = .10 * .015
print slot1
slot1 = .09 * .016
print slot1
slot1 = .08 * .017
print slot1
slot1 = .07 * .018
print slot1
slot1 = .06 * .019
print slot1

print "slot 2 ============"
slot2 = .10 * .10
print slot2
slot2 = .09 * .12
print slot2
slot2 = .08 * .14
print slot2
slot2 = .07 * .15
print slot2
slot2 = .06 * .16
print slot2

print "slot 3 ============"
slot3 = .10 * .005
print slot3
slot3 = .09 * .006
print slot3
slot3 = .08 * .007
print slot3
slot3 = .07 * .008
print slot3
slot3 = .06 * .010
print slot3

# Program for simulating the process









    



slot 1 ============
0.0015
0.00144
0.00136
0.00126
0.00114
slot 2 ============
0.01
0.0108
0.0112
0.0105
0.0096
slot 3 ============
0.0005
0.00054
0.00056
0.00056
0.0006

Solution 2.

First Slot Expectation Ranking - A: 0.0015, B: 0.00144, C: 0.00136, D: 0.00126, E: 0.00114
Second Slot Expectation Ranking - C: 0.0112, B: 0.0108, C: 0.0105, A: 0.01, E:0.0096
Third Slot Expectation Ranking - E: 0.0006, C: 0.00056,D: 0.00056 ,B: 0.00054, A: 0.0005

Q3.

In certain clustering algorithms, such as CURE, we need to pick a representative set of points in a supposed cluster, and these points should be as far away from each other as possible. That is, begin with the two furthest points, and at each step add the point whose minimum distance to any of the previously selected points is maximum.
Suppose you are given the following points in two-dimensional Euclidean space: x = (0,0); y = (10,10), a = (1,6); b = (3,7); c = (4,3); d = (7,7), e = (8,2); f = (9,5). Obviously, x and y are furthest apart, so start with these. You must add five more points, which we shall refer to as the first, second,..., fifth points in what follows. The distance measure is the normal Euclidean L2-norm. Which of the following is true about the order in which the five points are added?



In [44]:

    
clustered_pts = [(0, 0), (10, 10)]
unclustered_pts = [(1,6), (3,7), (4,3), (7,7), (8,2), (9,5)]

def dist(pt1, pt2):
    return (pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2

for i in xrange(5):
    
    representative = (0,0)
    maxDist = 0
    for pt1 in unclustered_pts:
        # find the closest centroid
        minDist = 9999
        for idx, pt2 in enumerate(clustered_pts):
            if minDist > float(dist(pt1, pt2)):
                minDist = float(dist(pt1, pt2))
    
        if maxDist < minDist:
            maxDist = minDist
            representative = pt1
    
    print maxDist
    print "point ({0}, {1}) added to cluster".format(*representative)
    unclustered_pts.remove(representative)
    clustered_pts.append(representative)









    



68.0
point (8, 2) added to cluster
50.0
point (3, 7) added to cluster
17.0
point (4, 3) added to cluster
16.0
point (7, 7) added to cluster
8.0
point (9, 5) added to cluster

Solution

First point = e
Second point = b
Third point = c
Fourth point = d
Fifth point = f



In [ ]: