Machine Learning

Instacart Market Basket Analysis: Feature Set


Ryan Alexander Alberts

7/10/2017

In this notebook, I want to run LightGBM to a provisional set of features and begin training and cross-validation.


  • Create the Training Feature Set:
    • Products
      • (user_id | unique product_id) tuples
      • product lists, totals/avgs.
      • encapsulating recency
      • No. of orders since last occurance
    • Customers
      • order count, recent reorder rate
      • buying behavior - time of day and week
      • Weekday vs. Weekend
    • Basket Size
      • max, min, avg. product count per customer
      • variability of product count across customer orders
    • 'None'
      • 'None' handling

  • Future Topics
    • weighted avg. product count (timeseries, frequency)
    • order Frequency / cyclicality
    • Macro-level trends in timeseries-data, like spikes in product count in first xx% or last xx% of all customers orders, corresonding to Summer or holidays

In [3]:
import pandas as pd
import lightgbm as lgb
import re
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")
import calendar

In [7]:
# First, let's import requisite files
orders      = pd.read_csv('../Instacart_Input/orders.csv')
prior_set   = pd.read_csv('../Instacart_Input/order_products__prior.csv')
train_set   = pd.read_csv('../Instacart_Input/order_products__train.csv')
aisles      = pd.read_csv('../Instacart_Input/aisles.csv')
departments = pd.read_csv('../Instacart_Input/departments.csv')
products    = pd.read_csv('../Instacart_Input/products.csv')

In [5]:
orders.set_index('order_id', inplace=True, drop=False)
prior_set                           = prior_set.join(orders, on='order_id', rsuffix='_')
prior_set.drop('order_id_', inplace=True, axis=1)

temp                                = pd.DataFrame()
temp['average_days_between_orders'] = orders.groupby('user_id')['days_since_prior_order'].mean().astype(np.float32)
temp['orders']                      = orders[orders['eval_set'] == 'prior'].groupby('user_id').size().astype(np.int16)

user_data = pd.DataFrame()
user_data['total_items']            = prior_set.groupby('user_id').size().astype(np.int16)
user_data['all_products']           = prior_set.groupby('user_id')['product_id'].apply(set)
user_data['total_unique_items']     = (user_data.all_products.map(len)).astype(np.int16)
user_data = user_data.join(temp)
user_data['avg_basket_size']        = (user_data.total_items / user_data.orders).astype(np.float32)

user_data.reset_index(inplace=True)
user_data.head(20)


Out[5]:
user_id total_items all_products total_unique_items average_days_between_orders orders avg_basket_size
0 1 59 {17122, 196, 26405, 13032, 39657, 12427, 25133... 18 19.000000 10 5.900000
1 2 195 {45066, 2573, 18961, 23, 32792, 22559, 13351, ... 102 16.285715 14 13.928572
2 3 88 {17668, 39190, 44683, 21903, 14992, 21137, 324... 33 12.000000 12 7.333333
3 4 18 {26576, 21573, 17769, 25623, 35469, 37646, 366... 17 17.000000 5 3.600000
4 5 37 {11777, 40706, 48775, 20754, 28289, 6808, 1398... 23 11.500000 4 9.250000
5 6 14 {40992, 27521, 20323, 48679, 8424, 45007, 2565... 12 13.333333 3 4.666667
6 7 206 {11520, 35333, 519, 10504, 45066, 13198, 10895... 68 10.450000 20 10.300000
7 8 49 {11136, 8193, 17794, 39812, 24838, 651, 26882,... 36 23.333334 3 16.333334
8 9 76 {8834, 2732, 38277, 30252, 5002, 11790, 38159,... 58 22.000000 3 25.333334
9 10 143 {36865, 20995, 13829, 43014, 18441, 47626, 564... 94 21.799999 5 28.600000
10 11 94 {17794, 8197, 30855, 5605, 33037, 30480, 43352... 61 18.714285 7 13.428572
11 12 74 {11520, 45056, 17794, 40377, 44422, 17159, 446... 61 26.000000 5 14.800000
12 13 81 {41351, 41480, 37385, 31372, 42125, 19474, 565... 29 7.666667 12 6.750000
13 14 210 {8193, 17923, 18439, 45066, 34827, 47117, 9239... 142 21.230770 13 16.153847
14 15 72 {11266, 37059, 196, 42862, 10441, 12427, 37710... 13 10.636364 22 3.272727
15 16 70 {15872, 28289, 17794, 43014, 651, 7948, 40706,... 46 19.333334 6 11.666667
16 17 294 {36736, 30992, 45190, 46844, 5128, 812, 40203,... 83 8.000000 40 7.350000
17 18 39 {2826, 25997, 22031, 29328, 21137, 40723, 4162... 29 5.833333 6 6.500000
18 19 204 {27138, 21001, 41483, 19468, 40974, 21011, 520... 133 9.333333 9 22.666666
19 20 22 {13575, 6184, 9387, 46061, 41400, 22362, 13914} 7 11.250000 4 5.500000

In [ ]:


In [8]:
train                              = orders[orders['eval_set'] == 'train']
train_user_orders                  = orders[orders['user_id'].isin(train['user_id'].values)]
train_user_orders                  = train_user_orders.merge(prior_set, on='order_id') 
train_user_orders                  = train_user_orders.merge(user_data, on='user_id')
train_user_orders                  = train_user_orders.merge(products, on='product_id')

temp                               = pd.DataFrame(train_user_orders.groupby(['user_id', 
                                                                             'product_id']
                                                                           ).size()).reset_index()
temp.columns                       = ['user_id', 'product_id', 'usr_order_instances']

train_df                           = train_user_orders.groupby(['user_id', 
                                                                'product_id']
                                                              ).mean().reset_index()
train_df.merge(temp,
               on=['user_id', 
                   'product_id']
              )
train_df                           = train_df.drop(['order_id', 
                                                    'order_number', 
                                                    'reordered', 
                                                   ], axis=1)

train_df['order_dow']              = train_df['order_dow'].astype(np.float32)
train_df['order_hour_of_day']      = train_df['order_hour_of_day'].astype(np.float32)
train_df['days_since_prior_order'] = train_df['days_since_prior_order'].astype(np.float32)
train_df['add_to_cart_order']      = train_df['add_to_cart_order'].astype(np.float32)
train_df['avg_basket_size']        = train_df['avg_basket_size'].astype(np.float32)
train_df['aisle_id']               = train_df['aisle_id'].astype(np.int16)
train_df['department_id']          = train_df['department_id'].astype(np.int16)
train_df.head()


Out[8]:
user_id product_id order_dow order_hour_of_day days_since_prior_order add_to_cart_order total_items total_unique_items average_days_between_orders orders avg_basket_size aisle_id department_id
0 1 196 2.500000 10.300000 19.555555 1.400000 59 18 19.0 10 5.9 77 7
1 1 10258 2.555556 10.555555 19.555555 3.333333 59 18 19.0 10 5.9 117 19
2 1 10326 4.000000 15.000000 28.000000 5.000000 59 18 19.0 10 5.9 24 4
3 1 12427 2.500000 10.300000 19.555555 3.300000 59 18 19.0 10 5.9 23 19
4 1 13032 2.666667 8.000000 21.666666 6.333333 59 18 19.0 10 5.9 121 14

In [ ]:
# I've previously created 20 test submissions without machine learning algorithms
# and I benefited frmo starting with the most recent orders to get F1 score 0.365+ (top 50%)
# So I'm including this feature:
#   Reorder rates (% of order that includes reordered products) for recent orders

order_reup = train_user_orders.groupby(['user_id', 'order_number']).mean()
last_order = train_user_orders.groupby(['user_id'])['order_number'].max()
d          = {}

for user, order in order_reup['reordered'].index.values:
    if user not in d:
        count   = 0
        d[user] = 0
    if ( (order > 1) & (order >= last_order[user] - 4) ):
        d[user] += order_reup['reordered'][(user, order)]
        count+=1
    if order == last_order[user]:
        d[user] /= count
d
# Add to train_df [Warning: LONG PROCESSING TIME...]
#train_df['recent_reorder_rate'] = 0
#for i in d.keys():
#    train_df.loc[train_df.user_id == i, 'recent_reorder_rate'] = d[i]

In [ ]:

---- Questions -----

  • 2-fold cross-validation? i.e. splitting training set into two groups roughly the size of the actual test set, and running 5-fold CV on each?

  • How do I know when I'm overfitting? I only have 5 submissions per day, and would like to be able to estimate it without submitting.

  • ensemble methods - LightGBM and XGBoost? Using predictions from first model as input? Ranking predictions, max/min/std of predictions?

  • Any resources for parameter tuning for LightGBM and XGBoost?

  • How can I stratify training data into sub-sets that reflect the general popluation?

  • Should I train using a separate validation set, or is cross-validation per above sufficient?

  • libFM and Factorizing machines?

---- Notes -----


In [ ]:
#user1 = orders[orders['user_id'] == 1]['order_id'].values
#prior_set[prior_set['order_id'].isin(user1)]

In [ ]:
#users['total_items'] = train_user_orders.groupby(['user_id', 'product_id']).size() #[train_user_orders['user_id'] == 1]
#users = pd.DataFrame()
#users['total_items'] = train_user_orders.groupby('product_id').size().astype(np.int16)
#users['product_set'] = train_user_orders.groupby('user_id')['product_id'].apply(set)
#user_array = train_user_orders.groupby('user_id').size().index.values
#for user in user_array:
#    users['total_uniqueItems'] = len(np.unique(train_user_orders[train_user_orders['user_id'] == user]['product_id']))

#orders[orders['order_id'] == 1187899]
#user_1 = orders[orders['user_id'] == 1].groupby('order_id').size()
#user_1.index.values



# 20.6M rows if you have unique rows for each (order_id | product) tuple
# vs. 
# 8.5M rows if you have unique rows for each (user_id | product) tuple
# User 1 has 18 unique products spread across 10 prior orders (not including train order 11)
#np.unique(train_user_orders[train_user_orders['user_id'] == 1]['product_id'])

In [ ]:
#train_df.to_csv('train_df_LightGBM_vXXXXX.csv', index=False)