Q2

In this question, we'll explore the basics of using NumPy arrays. We'll also start using functions as they were intended: incremental building blocks to simplify a larger task. This means the solution to each part may be used in future parts. I've tried to make these constituent components fairly easy, but if you run into problems, please ask for help!

Remember: when using a solution you wrote from a previous part, you DO NOT need to copy it from its original cell into the cell you're currently working on! By having clicked the "Play" button on the cell with the code you want to use, you've essentially "saved" it into Python, so all you have to do is call it like you would any other function; no need to copy/paste it!
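
As a minimal sketch of that building-block idea, the example below defines one function and then calls it from a second one. The names `double` and `double_then_add_one` are hypothetical and are not part of this assignment; in the notebook, the two definitions could just as well live in separate cells, as long as the first cell has been run.

```python
# Hypothetical helper, standing in for a solution from an earlier cell.
def double(x):
    return 2 * x

# A later function reuses the helper simply by calling it; nothing is copied.
def double_then_add_one(x):
    return double(x) + 1

result = double_then_add_one(5)
print(result)
```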

Part A

NumPy arrays are wonderful improvements over native Python lists for many reasons, the biggest of which is their ability to perform "vectorized" operations over entire arrays without having to write loops.
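
To illustrate what "vectorized" means, here is a small sketch that adds two arrays element by element, first with a single expression and then with an explicit loop for comparison. The arrays `a` and `b` are made-up values, and addition is used only as an illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Vectorized: one expression operates on every pair of elements at once.
vectorized_sum = a + b

# Equivalent explicit loop, shown only for comparison.
looped_sum = []
for i in range(len(a)):
    looped_sum.append(a[i] + b[i])

print(vectorized_sum)
print(looped_sum)
```

Both approaches produce the same numbers; the vectorized form is shorter and avoids the loop entirely.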

Write a function which takes two NumPy arrays as arguments and returns their difference (in order of the arguments themselves; if you're getting an AssertionError, try flipping the ordering of the arguments in your function).

Your function should:

  • be named difference
  • take two arguments: both NumPy arrays of floats
  • return 1 NumPy array, containing the element-wise difference of the two arrays (the second array subtracted from the first).

You will need to check if the arrays are the same length; if not, raise a ValueError.
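
As a sketch of the general pattern for rejecting bad input, the hypothetical function `first_element` below raises a ValueError when given an empty array. It performs a different check from the one this part asks for, and it assumes that reading the `.shape` attribute (a tuple of dimension sizes) does not count as calling a function.

```python
import numpy as np

def first_element(arr):
    # Hypothetical example: refuse empty arrays.
    # .shape is an attribute holding the size of each dimension.
    if arr.shape[0] == 0:
        raise ValueError("array must not be empty")
    return arr[0]

value = first_element(np.array([4.0, 5.0]))

# An empty array triggers the error, which the caller can catch.
try:
    first_element(np.array([]))
    raised = False
except ValueError:
    raised = True

print(value, raised)
```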

You cannot use any loops, built-in functions, or NumPy functions.


In [ ]:


In [ ]:
import numpy as np

np.random.seed(578435)
x11 = np.random.random(10)
x12 = np.random.random(10)
d1 = np.array([ 0.24542374,  0.19098998,  0.20645088,  0.49097139, -0.56594091,
       -0.13363814,  0.46859546, -0.32476466, -0.35938731,  0.17459786])
np.testing.assert_allclose(d1, difference(x11, x12))

In [ ]:
np.random.seed(85743)
x21 = np.random.random(20)
x22 = np.random.random(20)
d2 = np.array([-0.17964925, -0.57573602,  0.00109792, -0.06535934,  0.51321497,
        0.63854404, -0.17318834,  0.05553455,  0.08780665, -0.12503945,
        0.08794238, -0.53157235, -0.1133253 ,  0.34861933,  0.67987286,
        0.01188672,  0.2099561 , -0.40800005, -0.28166673, -0.35814679])
np.testing.assert_allclose(d2, difference(x21, x22), rtol = 1e-05)

In [ ]:
try:
    difference(np.array([1, 2, 3]), np.array([4, 5, 6, 7]))
except ValueError:
    assert True
else:
    assert False

Part B

Write a function which takes a NumPy array and returns another NumPy array with all its elements squared.

Your function should:

  • be named squares
  • take 1 argument: a NumPy array
  • return 1 value: a NumPy array where each element is the square of the corresponding element of the input array

You cannot use any loops, built-in functions, or NumPy functions.


In [ ]:


In [ ]:
import numpy as np

np.random.seed(13735)
x1 = np.random.random(10)
y1 = np.array([ 0.10729775,  0.01234453,  0.37878359,  0.12131263,  0.89916465,
        0.50676134,  0.9927178 ,  0.20673811,  0.88873398,  0.09033156])
np.testing.assert_allclose(y1, squares(x1), rtol = 1e-06)

In [ ]:
np.random.seed(7853)
x2 = np.random.random(35)
y2 = np.array([  7.70558043e-02,   1.85146792e-01,   6.98666869e-01,
         9.93510847e-02,   1.94026134e-01,   8.43335268e-02,
         1.84097846e-04,   3.74604155e-03,   7.52840504e-03,
         9.34739871e-01,   3.15736597e-01,   6.73512540e-02,
         9.61011706e-02,   7.99394100e-01,   2.18175433e-01,
         4.87808337e-01,   5.36032332e-01,   3.26047002e-01,
         8.86429452e-02,   5.66360150e-01,   9.06164054e-01,
         1.73105310e-01,   5.02681242e-01,   3.07929118e-01,
         7.08507520e-01,   4.95455022e-02,   9.89891434e-02,
         8.94874125e-02,   4.56261817e-01,   9.46454001e-01,
         2.62274636e-01,   1.79655411e-01,   3.81695141e-01,
         5.66890651e-01,   8.03936029e-01])
np.testing.assert_allclose(y2, squares(x2))

Part C

Write a function which computes the sum of the elements of a NumPy array.

Your function should:

  • be named sum_of_elements
  • take 1 argument: a NumPy array
  • return 1 floating-point value: the sum of the elements in the NumPy array

You cannot use any loops, but you can use the numpy.sum function.
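
For reference, here is a short sketch of how `numpy.sum` behaves on a small array. The array `values` is made up for this example.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.5])

# np.sum adds up every element and returns a single number.
total = np.sum(values)
print(total)
```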


In [ ]:


In [ ]:
import numpy as np

np.random.seed(7631)
x1 = np.random.random(483)
s1 = 233.48919473752667
np.testing.assert_allclose(s1, sum_of_elements(x1))

In [ ]:
np.random.seed(13275)
x2 = np.random.random(23)
s2 = 12.146235770777777
np.testing.assert_allclose(s2, sum_of_elements(x2))

Part D

You may not have realized it yet, but in the previous three parts, you've implemented almost all of what's needed to compute the Euclidean distance between two vectors, as represented with NumPy arrays. All you have to do now is link the code you wrote in the previous three parts together in the right order.

Write a function which takes two NumPy arrays and computes their distance. Your function should:

  • be named distance
  • take 2 arguments: both NumPy arrays of the same length
  • return 1 number: a non-negative floating-point value that is the distance between the two arrays

Remember how Euclidean distance $d$ between two vectors $\vec{a}$ and $\vec{b}$ is calculated:

$$ d(\vec{a}, \vec{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} $$

where $a_1$ and $b_1$ are the first elements of the arrays $\vec{a}$ and $\vec{b}$; $a_2$ and $b_2$ are the second elements, and so on.
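
As a quick check of the formula, take the illustrative vectors $\vec{a} = (1, 2)$ and $\vec{b} = (4, 6)$ (values chosen here only for the example):

$$ d(\vec{a}, \vec{b}) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5 $$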

You've already implemented everything except the square root; in addition to that, you just need to arrange the functions you've written in the correct order inside your distance function. Aside from calling your functions from Parts A-C, there is VERY LITTLE ORIGINAL CODE you'll need to write here! The tricky part is understanding how to make all these parts work together.

You cannot use any functions aside from those you've already written.
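
One possible way to take a square root without calling a function is Python's exponent operator, since raising a number to the power 0.5 is the same as taking its square root. The sketch below uses a made-up value `x`; whether this is the intended approach for this part is an assumption.

```python
x = 25.0

# Raising to the power 0.5 is mathematically the square root.
root = x ** 0.5
print(root)
```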


In [ ]:


In [ ]:
import numpy as np
import numpy.linalg as nla

np.random.seed(477582)
x11 = np.random.random(10)
x12 = np.random.random(10)
np.testing.assert_allclose(nla.norm(x11 - x12), distance(x11, x12))

In [ ]:
np.random.seed(54782)
x21 = np.random.random(584)
x22 = np.random.random(584)
np.testing.assert_allclose(nla.norm(x21 - x22), distance(x21, x22))

Part E

Now, you'll use your distance function to find the pair of vectors that are closest to each other. This is a very, very common problem in data science: finding a data point that is most similar to another data point.

In this problem, you'll write a function that takes two arguments: the data point you have (we'll call this the "reference data point"), and a list of data points you want to search. You'll loop through this list and, using your distance() function defined in Part D, compute the distance between the reference data point and each data point in the list, hunting for the one that gives you the smallest distance (meaning here that it is most similar to your reference data point).

Your function should:

  • be named similarity_search
  • take 2 arguments: a reference point (NumPy array), and a list of data points (list of NumPy arrays)
  • return 1 value: the smallest distance you could find between your reference data point and one of the data points in the list

For example, similarity_search([1, 1], [ [1, 1], [2, 2], [3, 3] ]) should return 0, since the closest data point in the list is its first element: an exact copy of the reference data point [1, 1]. The distance between a point and itself is always 0, and since distances are never negative, this is as small as a distance can get.

HINT: This really isn't much code at all! Conceptually it's nothing you haven't done before, either--it's very much like the question in Assignment 3 that asked you to write code to find the minimum value in a list. This just looks intimidating, because now you're dealing with NumPy arrays. If your solution goes beyond 10-15 lines of code, consider re-thinking the problem.
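
As a reminder of the "find the minimum" pattern mentioned in the hint, here is a sketch on a plain list of numbers. The function name `smallest_value` and the sample list are hypothetical; the version for this part would compare distances instead of the raw values.

```python
def smallest_value(numbers):
    # Start by assuming the first item is the smallest seen so far.
    smallest = numbers[0]
    # Compare every remaining item against the current best.
    for n in numbers[1:]:
        if n < smallest:
            smallest = n
    return smallest

result = smallest_value([7.5, 2.0, 9.1, 4.3])
print(result)
```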


In [ ]:


In [ ]:
import numpy as np

r1 = np.array([1, 1])
l1 = [np.array([1, 1]), np.array([2, 2]), np.array([3, 3])]
a1 = 0.0
np.testing.assert_allclose(a1, similarity_search(r1, l1))

In [ ]:
np.random.seed(7643)

r2 = np.random.random(2) * 100
l2 = [np.random.random(2) * 100 for i in range(100)]
a2 = 1.6077074397123927
np.testing.assert_allclose(a2, similarity_search(r2, l2))