# Recommendations based on frequent pairs

This notebook shows how to calculate frequent pairs based on shopping lists, using a scoring function that is relevant for recommendatins.

## Generating artificial shopping lists

``````

In [39]:

import numpy as np
from itertools import groupby, permutations, chain, islice

def get_random_element(x):
if x is None:
return -1
else:
return x[np.random.randint(len(x))]

def split_every(n, iterable):
i = iter(iterable)
piece = list(islice(i, n))
while piece:
yield piece
piece = list(islice(i, n))

``````

Let's generate some random shopping lists:

``````

In [40]:

num_unique_items =   5000
num_paired_items =    100
num_bought_items = 250000

``````

Select some random pairs which are usually bought together

``````

In [41]:

test_givenpairs = np.random.randint(num_unique_items, size=(num_paired_items, 2))
test_pair_dict = dict([(k, [itemgetter(1)(f) for f in v]) for k, v in groupby(sorted(test_givenpairs,key=itemgetter(0)), key = itemgetter(0))])
print "We generate example pairs which will always be bought together. Some examples:"
test_pair_dict.items()[0:5]

``````
``````

We generate example pairs which will always be bought together. Some examples:

Out[41]:

[(2048, [2804]), (3086, [1164]), (22, [4816]), (547, [3575]), (41, [2330])]

``````

Determine the items that are bought. By running them modulo their index, we create an exponential distribution

``````

In [42]:

test_purchases_bought = [j % (i+1) for i, j in enumerate(np.random.randint(num_unique_items, size=(num_bought_items)))]

``````

Now map all bought items to a random item which is by our definition bought together. If one is available

``````

In [43]:

vectorized_select_paired_items = np.vectorize(test_pair_dict.get)
vectorized_select_random_item = np.vectorize(get_random_element)
vectorized_select_random_item(  # select one item random from the list of paired items
vectorized_select_paired_items(  # map them to the list of paired items
test_purchases_bought # the items we bought
)
))
np.random.shuffle(test_purchases_pairs)

``````

Given these pairs, we unravel it to create one long list, which is then splitted into different shopping baskets

``````

In [44]:

test_sequence = test_purchases_pairs.ravel()
test_shopping_baskets = [nonempty for nonempty in [np.unique(a[a>0]).tolist() for a in [np.array(a) for a in list(split_every(6, test_sequence))]] if len(nonempty)>0]
print "Each few items are joined together to form a list. Because some items are removed, lists have variable length. Some examples:"
print "Total number of generated shopping lists = {}".format(len(test_shopping_baskets))

``````
``````

Each few items are joined together to form a list. Because some items are removed, lists have variable length. Some examples:
[[266, 2538, 3794, 4274], [1099, 4111, 4177], [1442, 3249, 4480], [3652, 3769, 4565], [347, 1253, 2298]]
Total number of generated shopping lists = 83334

``````

``````

In [45]:

print "There are in total {} shopping lists".format(shopping_baskets_count)

``````
``````

There are in total 83334 shopping lists

``````

## Step 1: Calculating item frequencies

Getting item buy frequency. While the shopping lists do not have to fit in memory, a list of unique items should

``````

In [46]:

def check_uniqueness(t):
if len(set(t))!=len(t):
raise ValueError("Items in a transaction must be unique but got {}".format(t))
return t
else:
return t

``````
``````

In [47]:

print "The most popular item is bought in {} shopping baskets".format(max(item_freq.values()))
print "As example, the first few items with their frequencies:"
print item_freq.items()[0:5]

``````
``````

The most popular item is bought in 126 shopping baskets
As example, the first few items with their frequencies:
[(1, 60L), (2, 59L), (3, 53L), (4, 69L), (5, 50L)]

``````

## Step 2: Calculating item pair scores

We use the following score: \$\$ score = \dfrac{\bigg(\dfrac{X\ and\ Y}{X}\bigg)}{\bigg(\dfrac{(not\ X)\ and\ Y}{not\ X}\bigg)}\$\$

``````

In [48]:

from __future__ import division
def calculate_score( xy, xy_count):
"""
xy is a tuple of item ids
xy_count is the observation count
calculates:
x and y / x
/
not x and y / not x"""
x_item, y_item = xy
x = item_freq[x_item]
y = item_freq[y_item]
x_y = xy_count
notx_y = y - x_y
if notx_y==0:
return (xy, np.Inf)
else:
return (xy, (notx/x) * (x_y/notx_y))

def all_pairs(x):
return list(permutations(x, 2)) # permutations also generates the pairs with _1 and _2 flipped

def as_key_with_value(i):
def as_key(x):
return x, i
return as_key

``````
``````

In [49]:

.flatMap(all_pairs)\
.map(as_key_with_value(1))\
.map(lambda x:calculate_score(*x))\
.cache()

``````

Now we have the score for every pair of products ever bought

``````

In [50]:

pairs_count = pairs.count()
print "There are in total {} pairs of bought products".format(pairs_count)
print "The first few pairs with their score:"
pairs.take(10)

``````
``````

There are in total 513862 pairs of bought products
The first few pairs with their score:

Out[50]:

[((3075, 3343), 41.066074950690336),
((4171, 535), 41.3125),
((2261, 2169), 26.089598997493734),
((810, 3728), 42.48877551020408),
((4785, 2165), 29.65811965811966),
((2635, 4663), 17.135802469135804),
((112, 4246), 41.647),
((4649, 717), 26.0531914893617),
((2514, 2968), 31.7472359893252),
((607, 3139), 39.06801125703565)]

``````

## Step 3, Option 1: Selecting pairs based on score threshold

As context, let's get the histogram

``````

In [61]:

pairs.map(lambda k_v:k_v[1]).filter(lambda score: not np.isinf(score)).histogram(10)

``````
``````

Out[61]:

([5.683559866129364,
250.08873329128113,
494.4939067164329,
738.8990801415846,
983.3042535667364,
1227.7094269918882,
1472.1146004170398,
1716.5197738421916,
1960.9249472673434,
2205.330120692495,
2449.735294117647],
[513662, 0, 0, 4, 4, 17, 43, 24, 9, 2])

``````
``````

In [62]:

frequent_pairs = pairs.filter(lambda k_v:k_v[1]>250).collect()
print "The number of frequent pairs = {}".format(len(frequent_pairs))

``````
``````

The number of frequent pairs = 200

``````
``````

In [52]:

frequent_pairs[0:5]

``````
``````

Out[52]:

[((4257, 2061), 1321.952380952381),
((3627, 479), 930.1998710509349),
((2061, 4257), inf),
((547, 3575), 1665.88),
((2926, 1436), inf)]

``````

## Step 3, Option 2: Selecting top N scoring cross selling items.

Here we define some helper functions to keep the highest N co occurring items

``````

In [53]:

def aggregate_zero():
return []

def aggregate_seq(n):
seq.append(item)
seq.sort(key=lambda x:x[1], reverse=True)
return seq[0:n]

def aggregate_combine(n):
def combine(seq1, seq2):
return sorted(seq1+seq2, key=lambda x:x[1], reverse=True)[0:n]
return combine

``````

Instead of hard thresholds, we can just find the most cross sellable product for each product. Some examples:

``````

In [54]:

item_with_cross_sells = pairs\
.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))\
.aggregateByKey(aggregate_zero(), aggregate_seq(5), aggregate_combine(5)).cache()

``````
``````

In [55]:

item_with_cross_sells.take(5)

``````
``````

Out[55]:

[(2048,
[(2804, 2191.3947368421054),
(3589, 53.53455480552877),
(3468, 44.03648863035431),
(2428, 44.03648863035431),
(4481, 42.66034836065574)]),
(3072,
[(63, 57.431724137931035),
(1603, 55.222811671087534),
(2161, 53.17752234993614),
(1396, 51.278325123152705),
(3171, 41.02266009852217)]),
(8,
[(4082, 60.69825072886297),
(897, 41.30853174603175),
(4982, 38.13095238095238),
(4459, 38.13095238095238),
(1525, 36.27090592334495)]),
(16,
[(2027, 54.422222222222224),
(599, 42.224137931034484),
(1828, 40.81666666666666),
(1460, 38.265625),
(3584, 34.98571428571429)]),
(344,
[(636, 59.140625),
(3622, 56.56929347826087),
(2246, 50.042067307692314),
(4331, 44.10487288135593),
(3839, 37.17410714285714)])]

``````

Let's find a perfect pair, one with score infinity

``````

In [56]:

perfect_pair = pairs.filter(lambda x: np.isinf(x[1])).take(1)[0]
perfect_pair

``````
``````

Out[56]:

((2061, 4257), inf)

``````

And show with which other items that occurs:

``````

In [57]:

item_with_cross_sells.lookup(perfect_pair[0][0])

``````
``````

Out[57]:

[[(4257, inf),
(1971, 39.45945945945946),
(2047, 30.416666666666664),
(3390, 28.07692307692308),
(2232, 25.17241379310345)]]

``````