BeaconML

Machine learning in action for iBeacon-based advertising

This is a simple demo showing machine learning in action for iBeacon-based advertising.

The source code for this demo is available on GitHub.

This IP[y] Notebook performs a step-by-step execution of the 'beacon_test.py' file with extra comments.

To simplify the machine learning process we're using the TinyLearn framework, which wraps around the Scikit-Learn and Pandas modules for easier classification tasks. The optimal ML algorithm and its parameters are selected automatically by CommonClassifier through cross-validation with GridSearchCV.

The training data is supplied as a CSV file containing statistics from a shopping mall where iBeacons are installed.

Every record in the CSV defines the parameters of a successful case: a visitor has entered a store or clicked on a mobile app's banner or button.

Our goal is to predict the 'Message Type' label from the supplied parameters, i.e. the mobile app's context. Such labels define the content type to be rendered on a smartphone:

  • Discount
  • Product Info
  • Joke

In [5]:
# Let's inspect this CSV file
import pandas as pd
some_data = pd.read_csv("../data/beacon_data.csv", header=0, index_col=None)
some_data.head(15)


Out[5]:
    Mobile Platform  Visitor Type  iBeacon Proximity Zone  Week-end / Holiday?  Time of the Day  Previous Worked?  Executed Action  Message Type
0   Android          New           Near                    Yes                  Noon             No                Click            Product Info
1   Android          Returned      Far                     Yes                  Morning          No                Click            Product Info
2   iOS              Frequent      Far                     No                   Noon             Yes               Enter Store      Joke
3   Android          New           Immediate               No                   Evening          No                Enter Store      Discount
4   iOS              New           Near                    No                   Morning          No                Enter Store      Product Info
5   iOS              Returned      Near                    No                   Noon             Yes               Enter Store      Product Info
6   iOS              Frequent      Near                    Yes                  Evening          Yes               Click            Product Info
7   iOS              Returned      Immediate               Yes                  Evening          No                Enter Store      Discount
8   Android          New           Immediate               Yes                  Evening          No                Click            Product Info
9   Android          Returned      Immediate               No                   Noon             Yes               Click            Discount
10  iOS              Returned      Far                     No                   Morning          Yes               Enter Store      Joke
11  iOS              New           Far                     No                   Noon             No                Enter Store      Discount
12  Android          New           Near                    No                   Morning          No                Enter Store      Discount
13  iOS              Returned      Near                    No                   Noon             Yes               Click            Discount
14  Android          Returned      Near                    No                   Morning          Yes               Click            Discount

We've loaded the CSV file into a Pandas DataFrame, which holds both the training and test data for our model.

Before we can start training, we need to encode the strings into numeric values using LabelEncoder.


In [7]:
# Encode strings from CSV into numeric values
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

# The single encoder is re-fitted on every column, so after the loop it stays
# fitted on the last column ('Message Type') - the notebook relies on this
# later to decode the predicted labels with inverse_transform.
for col_name in some_data:
    some_data[col_name] = enc.fit_transform(some_data[col_name])
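
A more robust variant (not executed here) keeps one LabelEncoder per column in a dict, so any column can be decoded back later. A minimal sketch, where the 'encoders' dict is our own illustrative name:

encoders = {}
for col_name in some_data:
    encoders[col_name] = LabelEncoder()
    some_data[col_name] = encoders[col_name].fit_transform(some_data[col_name])
# e.g. encoders['Message Type'].inverse_transform(predicted) decodes the labels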

Now we split the DataFrame into training and test datasets.


In [8]:
# Split off the training set - everything except the last 5 rows,
# which are held out as the test set
train_features, train_labels = some_data.iloc[:-5, :-1], some_data.iloc[:-5, -1]
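
The matching test split is taken from the last 5 rows. The names below are purely illustrative, since the notebook indexes the DataFrame directly in the later cells:

test_features, test_labels = some_data.iloc[-5:, :-1], some_data.iloc[-5:, -1]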

Let's execute the model training and print the results.


In [9]:
# Create an instance of CommonClassifier, which will use the default list of estimators.
# Removing the features with a weight smaller than 0.1.
from tinylearn import CommonClassifier

wrk = CommonClassifier(default=True, cv=3, reduce_func=lambda x: x < 0.1)
wrk.fit(train_features, train_labels)
wrk.print_fit_summary()


Selection summary based on GridSearchCV and 5 estimators.
Selected estimator 'ExtraTreeClassifier' with 0.714285714286 mean score.
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Other scores ...
Estimator 'RandomForestClassifier' has mean score 0.657142857143
Estimator 'LogisticRegression' has mean score 0.628571428571
Estimator 'SVC' has mean score 0.657142857143
Estimator 'SGDClassifier' has mean score 0.6

CommonClassifier has selected the 'ExtraTreesClassifier' estimator. Let's run the actual prediction of labels on the test data:


In [10]:
# Predicting and decoding the labels back to strings
print("\nPredicted data:")
predicted = wrk.predict(some_data.iloc[-5:, :-1])
print(enc.inverse_transform(predicted))


Predicted data:
['Discount' 'Product Info' 'Product Info' 'Discount' 'Product Info']

Pretty close to the actual labels ... with the following accuracy:


In [12]:
import numpy as np
print("\nActual accuracy: " +
      str(np.sum(predicted == some_data.iloc[-5:, -1])/predicted.size*100) + '%')


Actual accuracy: 80.0%
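
The same figure can be obtained with Scikit-Learn's built-in metric; a minimal equivalent sketch:

from sklearn.metrics import accuracy_score
print("Accuracy: " + str(accuracy_score(some_data.iloc[-5:, -1], predicted) * 100) + '%')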

Let's take a look at the internals of TinyLearn, and at CommonClassifier specifically:


In [ ]:
# %load tinylearn.py
# Copyright (c) 2015, Oleg Puzanov
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright notice,
#   this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above copyright notice,
#   this list of conditions and the following disclaimer in the documentation
#   and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

"""Helper classes for the basic classification tasks with Scikit-Learn and Pandas."""

import logging
import numpy as np
from fastdtw import fastdtw
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import (ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.svm import SVC
from sklearn.linear_model import (LogisticRegression,
                                  SGDClassifier)


class FeatureReducer(object):
    """ Removes the features (columns) from the supplied DataFrame according to the
        function 'reduce_func'.

        The default use case is about removing the features, which have a very small weight
        and won't be useful for classification tasks.

        Feature weighting is implemented using ExtraTreesClassifier.
    """
    def __init__(self, df_features, df_targets, reduce_func=None):
        self.df_features = df_features
        self.df_targets = df_targets
        self.reduce_func = reduce_func
        self.dropped_cols = []

    def reduce(self, n_estimators=10):
        total_dropped = 0
        self.dropped_cols = []

        if self.reduce_func is not None:
            clf = ExtraTreesClassifier(n_estimators=n_estimators)
            clf.fit(self.df_features, self.df_targets)

            for i in range(len(clf.feature_importances_)):
                if self.reduce_func(clf.feature_importances_[i]):
                    total_dropped += 1
                    logging.info("FeatureReducer: dropping column \'" +
                                 self.df_features.columns.values[i] + "\'")
                    self.dropped_cols.append(self.df_features.columns[i])

            [self.df_features.drop(c, axis=1, inplace=True) for c in self.dropped_cols]
        return total_dropped

    def print_weights(self, n_estimators=10):
        clf = ExtraTreesClassifier(n_estimators=n_estimators)
        clf.fit(self.df_features, self.df_targets)
        [print("Feature \'" + self.df_features.columns.values[i] + "\' has weight " +
               str(clf.feature_importances_[i])) for i in range(len(clf.feature_importances_))]
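
# Usage sketch for FeatureReducer (hypothetical DataFrames 'df_X' and 'df_y'):
#
#   reducer = FeatureReducer(df_X, df_y, reduce_func=lambda w: w < 0.1)
#   reducer.reduce(n_estimators=10)   # drops the low-weight columns in place
#   print(reducer.dropped_cols)       # names of the removed features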


class CrossValidator(object):
    """ Thin wrapper around 'cross_val_score' method of Scikit-Learn.
    """
    def __init__(self, estimator, df_features, df_targets, cv=5):
        self.scores = np.array([])
        self.estimator = estimator
        self.df_features = df_features
        self.df_targets = df_targets
        self.cv = cv

    def cross_validate(self):
        self.scores = cross_val_score(self.estimator, self.df_features, self.df_targets, cv=self.cv)
        return self.scores

    def print_summary(self):
        if self.scores.size == 0:
            print("No data, please execute 'cross_validate' at first.")
        else:
            print("Cross-validation summary for " + self.estimator.__class__.__name__)
            print("Mean score: %0.2f (+/- %0.2f)" % (self.scores.mean(), self.scores.std() * 2))
            [print("Score #" + i + ": %0.2f", self.scores[i]) for i in range(len(self.scores))]


class CvEstimatorSelector(object):
    """Executes the cross-validation procedures to discover the best performing estimator
       from the supplied ones.

       The best estimator is selected according to the highest mean score.
    """
    def __init__(self, df_features, df_targets, cv=5):
        self.scores = {}
        self.estimators = {}
        self.df_features = df_features
        self.df_targets = df_targets
        self.cv = cv
        self.selected_name = None

    def add_estimator(self, name, instance):
        self.estimators[name] = instance

    def select_estimator(self):
        self.selected_name = None
        largest_val = 0

        for name in self.estimators:
            c_val = CrossValidator(self.estimators[name], self.df_features, self.df_targets, self.cv)
            self.scores[name] = c_val.cross_validate().mean()
            logging.info("Mean score for \'" + name + "\' estimator is " + str(self.scores[name]))
            if largest_val < self.scores[name]:
                largest_val = self.scores[name]
                self.selected_name = name

        return self.selected_name

    def print_summary(self):
        if self.selected_name is None:
            print("No data, please execute 'select_estimator' at first.")
        else:
            print("Selection summary based on the cross-validation of " +
                  str(len(self.estimators)) + " estimators.")
            print("Selected estimator \'" + self.selected_name +
                  "\' with " + str(self.scores[self.selected_name]) + " mean score.")
            print("Other scores ...")
            [print("Estimator \'" + n + " \' has mean score " +
                   str(self.scores[n])) for n in self.estimators if (n != self.selected_name)]
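
# Usage sketch for CvEstimatorSelector (hypothetical 'df_X' / 'df_y'):
#
#   sel = CvEstimatorSelector(df_X, df_y, cv=5)
#   sel.add_estimator('SVC', SVC())
#   sel.add_estimator('RandomForestClassifier', RandomForestClassifier())
#   best_name = sel.select_estimator()   # name with the highest mean score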


class GridSearchEstimatorSelector(object):
    """Thin wrapper around GridSearchCV class of Scikit-Learn for discovering
       the best performing estimator.
    """
    def __init__(self, df_features, df_targets, cv=5):
        self.scores = {}
        self.estimators = {}
        self.df_features = df_features
        self.df_targets = df_targets
        self.cv = cv
        self.selected_name = None
        self.best_estimator = None

    def add_estimator(self, name, instance, params):
        self.estimators[name] = {'instance': instance, 'params': params}

    def select_estimator(self):
        self.selected_name = None
        largest_val = 0

        for name in self.estimators:
            est = self.estimators[name]
            clf = GridSearchCV(est['instance'], est['params'], cv=self.cv)
            clf.fit(self.df_features, self.df_targets)
            self.scores[name] = clf.best_score_
            logging.info("Best score for \'" + name + "\' estimator is " + str(clf.best_score_))
            if largest_val < self.scores[name]:
                largest_val = self.scores[name]
                self.selected_name = name
                self.best_estimator = clf.best_estimator_

        return self.selected_name

    def print_summary(self):
        if self.selected_name is None:
            print("No data, please execute 'select_estimator' at first.")
        else:
            print("Selection summary based on GridSearchCV and " +
                  str(len(self.estimators)) + " estimators.")
            print("Selected estimator \'" + self.selected_name +
                  "\' with " + str(self.scores[self.selected_name]) + " mean score.")
            print(self.best_estimator)
            print("\nOther scores ...")
            [print("Estimator \'" + n + "\' has mean score " +
                   str(self.scores[n])) for n in self.estimators.keys() if (n != self.selected_name)]
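
# Usage sketch for GridSearchEstimatorSelector (hypothetical 'df_X' / 'df_y');
# unlike CvEstimatorSelector, every estimator is registered together with a
# parameter grid for GridSearchCV:
#
#   sel = GridSearchEstimatorSelector(df_X, df_y, cv=3)
#   sel.add_estimator('SVC', SVC(), {'kernel': ["linear", "rbf"], 'C': [1, 10]})
#   sel.select_estimator()
#   predictions = sel.best_estimator.predict(df_X)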


class KnnDtwClassifier(BaseEstimator, ClassifierMixin):
    """Custom classifier implementation for Scikit-Learn using Dynamic Time Warping (DTW)
       and KNN (K-Nearest Neighbors) algorithms.

       This classifier can be used for labeling varying-length sequences, such as time series
       or motion data.

       The FastDTW library is used for faster DTW calculations - linear instead of quadratic
       complexity.
    """
    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors
        self.features = []
        self.labels = []

    def get_distance(self, x, y):
        return fastdtw(x, y)[0]

    def fit(self, X, y=None):
        for index, l in enumerate(y):
            self.features.append(X[index])
            self.labels.append(l)
        return self

    def predict(self, X):
        # Returns the labels of the 'n_neighbors' training sequences closest to X,
        # ordered from nearest to farthest (no majority vote is taken)
        dist = np.array([self.get_distance(X, seq) for seq in self.features])
        indices = dist.argsort()[:self.n_neighbors]
        return np.array(self.labels)[indices]

    def predict_ext(self, X):
        dist = np.array([self.get_distance(X, seq) for seq in self.features])
        indices = dist.argsort()[:self.n_neighbors]
        return (dist[indices],
                indices)
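
# Usage sketch for KnnDtwClassifier with variable-length sequences
# (hypothetical data):
#
#   knn = KnnDtwClassifier(n_neighbors=1)
#   knn.fit([[1, 2, 3, 4], [10, 11, 12]], ['walk', 'run'])
#   knn.predict([1, 2, 3])   # -> labels of the nearest training sequences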


class CommonClassifier(object):
    """Helper class to execute the common classification workflow - from training to prediction
       to metrics reporting with the popular ML algorithms, like SVM or Random Forest.

       Includes the default list of estimators with instances and parameters, which have been
       proven to work well.
    """
    def __init__(self, default=True, cv=5, reduce_func=None):
        self.cv = cv
        self.default = default
        self.reduce_func = reduce_func
        self.reducer = None
        self.grid_search = None

    def add_estimator(self, name, instance, params):
        self.grid_search.add_estimator(name, instance, params)

    def fit(self, X, y=None):
        if self.default:
            self.grid_search = GridSearchEstimatorSelector(X, y, self.cv)
            self.grid_search.add_estimator('SVC', SVC(), {'kernel': ["linear", "rbf"],
                                                          'C': [1, 5, 10, 50],
                                                          'gamma': [0.0, 0.001, 0.0001]})
            self.grid_search.add_estimator('RandomForestClassifier', RandomForestClassifier(),
                                           {'n_estimators': [5, 10, 20, 50]})
            # Registered under the (slightly misspelled) name 'ExtraTreeClassifier' -
            # this is the name that appears in the printed selection summary
            self.grid_search.add_estimator('ExtraTreeClassifier', ExtraTreesClassifier(),
                                           {'n_estimators': [5, 10, 20, 50]})
            self.grid_search.add_estimator('LogisticRegression', LogisticRegression(),
                                           {'C': [1, 5, 10, 50], 'solver': ["lbfgs", "liblinear"]})
            self.grid_search.add_estimator('SGDClassifier', SGDClassifier(),
                                           {'n_iter': [5, 10, 20, 50], 'alpha': [0.0001, 0.001],
                                            'loss': ["hinge", "modified_huber",
                                                     "huber", "squared_hinge", "perceptron"]})

        if self.reduce_func is not None:
            # Reduce the features in place; 'grid_search' references the same
            # DataFrame, so it sees the reduced feature set during selection
            self.reducer = FeatureReducer(X, y, self.reduce_func)
            self.reducer.reduce(10)

        return self.grid_search.select_estimator()

    def print_fit_summary(self):
        return self.grid_search.print_summary()

    def predict(self, X):
        if self.grid_search.selected_name is not None:
            # Drop the same columns that FeatureReducer removed during fit
            # (note: this modifies the supplied DataFrame in place)
            if self.reduce_func is not None and len(self.reducer.dropped_cols) > 0:
                X.drop(self.reducer.dropped_cols, axis=1, inplace=True)
            return self.grid_search.best_estimator.predict(X)
        else:
            return None