Definition

Project Overview

  • Project Origin: Kaggle - Facebook V: Predicting Check Ins
  • Description: Facebook and Kaggle launched a machine learning engineering competition to identify the correct place for check-ins. To give a flavor of what it takes to work with real data, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square, and the data was fabricated to resemble location signals coming from mobile devices.
  • Data sets:
    • train.csv, test.csv
      • row_id: id of the check-in event
      • x, y: coordinates
      • accuracy: location accuracy
      • time: timestamp
      • place_id: id of the business (this is the target I'm predicting)

Problem Statement

  • Definition:
    • Predict the top 3 places a person would most likely check in to (a small sketch of extracting top-3 predictions from class probabilities follows this list).
    • The total number of places on the map is over 100,000.
    • The probability of a place can be modeled as $$Pr(place\_id\ |\ x,\ y,\ time,\ accuracy)$$
    • This problem is categorized as supervised classification.
  • Proposed Solution:
    1. K-Nearest Neighbors
      • Since I do not have any assumptions about the data distribution, KNN could be a good starting point.
      • The weights of the coordinates should be much higher than those of the other features.
    2. Gaussian Naive Bayes
      • Looking at the histograms of the features, all of them are roughly bell-shaped, so it is worth trying this method to see how well it can perform.
    3. Kernel Density Estimation
      • Based on the result of GNB, the assumption that the conditional probabilities of the features are independent may hold for this problem.
      • KDE could help find more accurate estimates of the feature probabilities.
    4. Ensemble
      • As the final step, I try to ensemble the above three models with a customized method.
      • This method is inspired by Larry Freeman
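To make the prediction target concrete, below is a minimal sketch (using toy, made-up probabilities and place ids rather than the competition data) of how the top 3 place_ids can be extracted once any of the above models produces per-class probabilities:

import numpy as np

# Toy stand-ins: `proba` plays the role of a classifier's predicted class
# probabilities (one row per check-in) and `classes` the corresponding place_ids.
classes = np.array([111, 222, 333, 444])
proba = np.array([[0.10, 0.60, 0.25, 0.05],
                  [0.40, 0.05, 0.20, 0.35]])

# Indices of the 3 largest probabilities per row, in descending order
top3_idx = np.argsort(proba, axis=1)[:, ::-1][:, :3]
top3_place_ids = classes[top3_idx]   # shape: (n_check_ins, 3)
print(top3_place_ids)                # [[222 333 111]
                                     #  [111 444 333]]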

Metrics

  • Predictions are evaluated according to Mean Average Precision at 3 (a small sketch of computing this metric follows this list): $$MAP@3 = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{k=1}^{\min(3,n)} P(k)$$
    • |U| is the number of check-in events
    • P(k) is the precision at cutoff k
    • n is the number of predicted businesses.
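Since each check-in has exactly one true place_id, AP@3 reduces to the reciprocal of the rank at which the true id appears in the top 3 predictions (0 if it does not appear). A minimal sketch under that reading, using hypothetical toy ids (this is not the official evaluation code):

def apk3(actual, predicted):
    # Average precision at 3 for one check-in with a single true place_id:
    # 1/rank if the true id appears in the top 3 predictions, else 0.
    for rank, p in enumerate(predicted[:3], start=1):
        if p == actual:
            return 1.0 / rank
    return 0.0

def map3(actuals, predictions):
    # Mean Average Precision @3 over all check-ins.
    return sum(apk3(a, p) for a, p in zip(actuals, predictions)) / float(len(actuals))

# Toy example: true id ranked 1st, ranked 3rd, and missing -> (1 + 1/3 + 0) / 3
print(map3([11, 22, 33], [[11, 90, 91], [92, 93, 22], [94, 95, 96]]))  # ~0.444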

Analysis

Data Exploration

  • In this section, I explore the value distributions of the following features:
    • Coordinates
    • Accuracy
    • Place_ids

In [2]:
import math
import os
import pandas as pd
import numpy as np
import time
from datetime import timedelta
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")

# For checking progress
from tqdm import tqdm_notebook, tnrange

# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # For 3D scatter
import matplotlib.cm as cm # For colored labels
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 16

In [3]:
# Reading data
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')

In [5]:
# Count the occurrences of place_id
place_ranking = df_train.place_id.value_counts()

In [4]:
# Basic statistics of data
print 'Size of training data: {}'.format(df_train.shape)
print 'Size of testing data: {}'.format(df_test.shape)


Size of training data: (29118021, 6)
Size of testing data: (8607230, 5)

In [5]:
print "Description of training data: \n"
print df_train.describe()


Description of training data: 

             row_id             x             y      accuracy          time  \
count  2.911802e+07  2.911802e+07  2.911802e+07  2.911802e+07  2.911802e+07   
mean   1.455901e+07  4.999770e+00  5.001814e+00  8.284912e+01  4.170104e+05   
std    8.405649e+06  2.857601e+00  2.887505e+00  1.147518e+02  2.311761e+05   
min    0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00   
25%    7.279505e+06  2.534700e+00  2.496700e+00  2.700000e+01  2.030570e+05   
50%    1.455901e+07  5.009100e+00  4.988300e+00  6.200000e+01  4.339220e+05   
75%    2.183852e+07  7.461400e+00  7.510300e+00  7.500000e+01  6.204910e+05   
max    2.911802e+07  1.000000e+01  1.000000e+01  1.033000e+03  7.862390e+05   

           place_id  
count  2.911802e+07  
mean   5.493787e+09  
std    2.611088e+09  
min    1.000016e+09  
25%    3.222911e+09  
50%    5.518573e+09  
75%    7.764307e+09  
max    9.999932e+09  
  • We can see that the training and testing data are split by time.
    • Training: Range of time is [1, 786239]
    • Testing: Range of time is [786242, 1006589]

In [6]:
print "Description of testing data: \n"
print df_test.describe()
del df_test


Description of testing data: 

             row_id             x             y      accuracy          time
count  8.607230e+06  8.607230e+06  8.607230e+06  8.607230e+06  8.607230e+06
mean   4.303614e+06  4.991417e+00  5.006705e+00  9.265208e+01  8.904637e+05
std    2.484693e+06  2.866409e+00  2.886888e+00  1.242906e+02  6.446783e+04
min    0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  7.862420e+05
25%    2.151807e+06  2.517000e+00  2.502400e+00  4.200000e+01  8.332200e+05
50%    4.303614e+06  4.988000e+00  5.000900e+00  6.400000e+01  8.874620e+05
75%    6.455422e+06  7.463600e+00  7.505300e+00  7.900000e+01  9.454910e+05
max    8.607229e+06  1.000000e+01  1.000000e+01  1.026000e+03  1.006589e+06

Coordinate

  • We can see that the check-ins are roughly uniformly distributed on the map.
  • Because the dataset is huge, I only take a 1% random sample to investigate the distribution of x and y.

In [7]:
sns.jointplot(data=df_train.sample(frac=0.01), x='x',y='y',kind='hex', stat_func=None,
             xlim=(0,10),ylim=(0,10));
plt.title('Distribution of 1% random samples',x=-2.5,y=1.2,fontsize=18);


  • From the top 3 most popular places, we can see that the standard deviation of x is much larger than that of y.

In [8]:
data = df_train[df_train.place_id==place_ranking.index[0]]
data.plot(kind='scatter', x='x', y='y')
plt.title('Distribution at 1st popular place',fontsize=18)
plt.xlabel('X')
plt.ylabel('Y')
print 'Standard deviation of x: {}'.format(data.x.std())
print 'Standard deviation of y: {}'.format(data.y.std())
del data


Standard deviation of x: 0.356139513132
Standard deviation of y: 0.0115436699085

In [9]:
data = df_train[df_train.place_id==place_ranking.index[1]]
data.plot(kind='scatter', x='x', y='y')
plt.title('Distribution at 2nd popular place',fontsize=18)
plt.xlabel('X')
plt.ylabel('Y')
print 'Standard deviation of x: {}'.format(data.x.std())
print 'Standard deviation of y: {}'.format(data.y.std())
del data


Standard deviation of x: 0.240642261874
Standard deviation of y: 0.0157120481921

In [10]:
data = df_train[df_train.place_id==place_ranking.index[2]]
data.plot(kind='scatter', x='x', y='y')
plt.title('Distribution at 3rd popular place',fontsize=18)
plt.xlabel('X')
plt.ylabel('Y')
print 'Standard deviation of x: {}'.format(data.x.std())
print 'Standard deviation of y: {}'.format(data.y.std())
del data


Standard deviation of x: 0.340146567801
Standard deviation of y: 0.0176619907468

Accuracy

  • From the histogram of accuracy, we can see that the majority of values are under 200.
  • The meaning of accuracy is still vague.

In [11]:
_, AX = plt.subplots(nrows=2, sharey=True)
AX[0].set_title('Histogram of accuracy')
AX[0].set_xlabel('Accuracy')
AX[0].set_ylabel('Frequency')
df_train['accuracy'].hist(bins=100,ax=AX[0])
AX[1].set_title('Histogram of accuracy under 200')
AX[1].set_xlim((0,200))
AX[1].set_xlabel('Accuracy')
AX[1].set_ylabel('Frequency')
df_train.accuracy.hist(bins=100,ax=AX[1])
plt.tight_layout(h_pad=1.5)


Place Ids

  • From this section, we can see that there are a huge number of place ids and that they are not uniformly distributed. This means that any algorithm which trains using a one-vs-all approach would not be the best choice on this dataset.
  • What is the "one-vs-all approach"?
    • This approach trains a single classifier per class for a multiclass problem of n classes, using the samples of that class as positives and all other samples as negatives (a brief sketch follows this list).
    • Reference
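For illustration only, here is a small sketch of a one-vs-all (one-vs-rest) setup in scikit-learn on made-up toy data; with roughly 108,000 distinct place_ids, this scheme would have to fit one binary classifier per class, which is why it is a poor fit for this dataset (the base estimator and the toy data are assumptions for the sketch, not part of this project):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 2 features, 3 classes. On the real problem there are ~108,390
# classes, so one-vs-rest would have to fit ~108,390 binary models.
X = np.random.rand(300, 2)
y = np.random.choice([111, 222, 333], size=300)

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)
print(len(ovr.estimators_))  # one fitted binary classifier per class -> 3 here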

In [12]:
unique_id = df_train.place_id.unique()
print "Number of unique place id: {}, roughly {:.3f} % of traing data.".format(len(unique_id), 
                                                                               len(unique_id) * 100.0 / df_train.shape[0])
del unique_id


Number of unique place id: 108390, roughly 0.372 % of training data.

In [13]:
plt.xlabel('Frequency')
plt.ylabel('Place id')
place_ranking.head(5).plot.barh(xlim=(0,2000)).invert_yaxis()
plt.title('Top 5 most popular place id', fontsize=18);



In [14]:
plt.xlabel('Frequency')
plt.ylabel('Place id')
place_ranking.tail(5).plot.barh(xlim=(0,5))
plt.title('Bottom 5 least popular place id', fontsize=18);


Exploratory Visualization

  • In this section, I focus on presenting the following points:
    • Explain why the unit of time is likely minutes.
    • Explain why I use a log transform to convert accuracy.
    • Explain why time would be one of the most important features.

Explore Time

  • Because the time column is intentionally left vague, with no unit of time defined, this section provides two methods to conclude that the unit of time is likely minutes.
  • Method 1: Fold the timestamps into a one-week window in order to visualize the weekly cycle at the top 3 popular places. (Reference)

In [15]:
# 1st popular place : 8772469670
time = df_train[df_train.place_id==place_ranking.index[0]].time
time = time % (24*60*7) # Fold into a one-week cycle (assuming minutes)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.title('Weekly cycle of 1st popular place', fontsize=18)
time.hist(bins=100);



In [16]:
# 2nd popular place : 1623394281
time = df_train[df_train.place_id==place_ranking.index[1]].time
time = time % (24*60*7) # Fold into a one-week cycle (assuming minutes)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.title('Weekly cycle of 2nd place', fontsize=18)
time.hist(bins=100);



In [17]:
# 3rd popular place : 1308450003
time = df_train[df_train.place_id==place_ranking.index[2]].time
time = time % (24*60*7) # Fold into a one-week cycle (assuming minutes)
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.title('Weekly cycle of 3rd place', fontsize=18)
time.hist(bins=100);


  • Method 2: Use a Fourier transform of the time histogram to extract the dominant frequencies, and hence the periods, at the top 3 places. (Reference)

In [18]:
# 1st popular place : 8772469670
time = df_train[df_train.place_id==place_ranking.index[0]].time
# Histogram of time
hist = np.histogram(time,5000)
# FFT of Histogram
hist_fft = np.absolute(np.fft.fft(hist[0]))

plt.plot(hist_fft)
plt.xlim([0,1000])
plt.title('FFT of Histogram at 1st place', fontsize=18)
plt.xlabel('Frequency')

print "Period 1: {}, close to 10080 minutes a week.".format(time.max() / (hist_fft[2:200].argmax()+2.0))
print "Period 2: {}, close to 1440 minutes a day.".format(time.max() / (hist_fft[400:600].argmax()+400.0))


Period 1: 10151.96875, close to 10080 minutes a week.
Period 2: 1440.63414634, close to 1440 minutes a day.

In [19]:
# 2nd popular place : 1623394281
time = df_train[df_train.place_id==place_ranking.index[1]].time
# Histogram of time
hist = np.histogram(time,5000)
# FFT of Histogram
hist_fft = np.absolute(np.fft.fft(hist[0]))

plt.plot(hist_fft)
plt.xlim([0,1000])
plt.title('FFT of Histogram at 2nd place', fontsize=18)
plt.xlabel('Frequency')

print "Period 1: {}, close to 10080 minutes a week.".format(time.max() / (hist_fft[2:200].argmax()+2.0))
print "Period 2: {}, close to 1440 minutes a day.".format(time.max() / (hist_fft[400:600].argmax()+400.0))


Period 1: 10200.4155844, close to 10080 minutes a week.
Period 2: 1449.13653137, close to 1440 minutes a day.

In [20]:
# 3rd popular place : 1308450003
time = df_train[df_train.place_id==place_ranking.index[2]].time
# Histogram of time
hist = np.histogram(time,5000)
# FFT of Histogram
hist_fft = np.absolute(np.fft.fft(hist[0]))

plt.plot(hist_fft)
plt.xlim([0,1000])
plt.title('FFT of Histogram at 3rd place', fontsize=18)
plt.xlabel('Frequency')

print "Period 1: {}, close to 10080 minutes a week.".format(time.max() / (hist_fft[2:200].argmax()+2.0))
print "Period 2: {}, close to 1440 minutes a day.".format(time.max() / (hist_fft[300:500].argmax()+300.0))

del time


Period 1: 10138.9482759, close to 10080 minutes a week.
Period 2: 1451.99753086, close to 1440 minutes a day.
  • Supposing the unit of time is minutes, we can calculate the time span over which the data was collected.
  • It covers roughly one and a half years.

In [21]:
T = df_train.time.max() - df_train.time.min()
T = pd.Timedelta(minutes=T)
print "Time period of collecting data is {} days {} hours {} minutes".format(T.days, T.seconds/3600, T.seconds/60%60)
del T


Time period of collecting data is 545 days 23 hours 58 minutes
  • From the histograms below for the top 2 popular places, we can see that the check-ins are highly related to hour, weekday, and month. A likely reason is that places have their own opening and closing schedules.
  • The top 2 popular places look complementary in their opening hours.

In [22]:
# Histogram of hour and minute at the 1st popular place
place = place_ranking.index[0]
_, AX = plt.subplots(ncols=2, nrows=2, figsize=(15,6))
data = df_train[df_train.place_id==place].copy()
data['hour'] = ((data.time/60)%24).astype(int)
data.hour.hist(bins=100,ax=AX[0][0])
AX[0][0].set_title('Histogram of Hour at place {}'.format(place), fontsize=18)
AX[0][0].set_xlabel('Hour')
AX[0][0].set_ylabel('Frequency')

data['weekday'] = ((data.time/(60*24))%7).astype(int)
data.weekday.hist(bins=100,ax=AX[0][1])
AX[0][1].set_title('Histogram of Weekday at place {}'.format(place), fontsize=18)
AX[0][1].set_xlabel('Weekday')

data['minute'] = (data.time%60).astype(int)
data.minute.hist(bins=100,ax=AX[1][0])
AX[1][0].set_title('Histogram of Minute at place {}'.format(place), fontsize=18)
AX[1][0].set_xlabel('Minute')

data['month'] = (data.time/(60*24*30)%12).astype(int)
data.month.hist(bins=100,ax=AX[1][1])
AX[1][1].set_title('Histogram of Month at place {}'.format(place), fontsize=18)
AX[1][1].set_xlabel('Month')

plt.tight_layout(h_pad=1.5)



In [23]:
# Histogram of hour and minute at the 2nd popular place
place = place_ranking.index[1]
_, AX = plt.subplots(ncols=2, nrows=2, figsize=(15,6))
data = df_train[df_train.place_id==place].copy()
data['hour'] = ((data.time/60)%24).astype(int)
data.hour.hist(bins=100,ax=AX[0][0])
AX[0][0].set_title('Histogram of Hour at place {}'.format(place), fontsize=18)
AX[0][0].set_xlabel('Hour')
AX[0][0].set_ylabel('Frequency')

data['weekday'] = ((data.time/(60*24))%7).astype(int)
data.weekday.hist(bins=100,ax=AX[0][1])
AX[0][1].set_title('Histogram of Weekday at place {}'.format(place), fontsize=18)
AX[0][1].set_xlabel('Weekday')

data['minute'] = (data.time%60).astype(int)
data.minute.hist(bins=100,ax=AX[1][0])
AX[1][0].set_title('Histogram of Minute at place {}'.format(place), fontsize=18)
AX[1][0].set_xlabel('Minute')

data['month'] = (data.time/(60*24*30)%12).astype(int)
data.month.hist(bins=100,ax=AX[1][1])
AX[1][1].set_title('Histogram of Month at place {}'.format(place), fontsize=18)
AX[1][1].set_xlabel('Month')

plt.tight_layout(h_pad=1.5)

del data,place


Explore accuracy

  • CV: coefficient of variation, which is the ratio of the standard deviation to the mean (written as a formula below). The higher the coefficient of variation, the greater the level of dispersion around the mean.
  • From the two CV scores below, we can see that the log of accuracy has a much lower level of dispersion than the raw accuracy.
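  • Written out (with $$\sigma$$ and $$\mu$$ denoting the standard deviation and mean of the feature in question): $$CV = \frac{\sigma}{\mu}$$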

In [24]:
_, AX = plt.subplots(ncols=2, figsize=(15,6)) 
df_train.accuracy.hist(bins=100, ax=AX[0])
AX[0].set_title('Histogram of Accuracy', fontsize=18)
AX[0].set_xlabel('Accuracy')
AX[0].set_ylabel('Frequency')

np.log10(df_train.accuracy).hist(bins=100, ax=AX[1])
AX[1].set_title('Histogram of Log of Accuracy', fontsize=18)
AX[1].set_xlabel('Log of Accuracy')

print "CV of accuracy: {:.3f}".format(np.std(df_train.accuracy) / np.mean(df_train.accuracy))
print "CV of log of accuracy: {:.3f}".format(np.std(np.log10(df_train.accuracy)) / np.mean(np.log10(df_train.accuracy)))


CV of accuracy: 1.385
CV of log of accuracy: 0.292
  • The following are the CVs of accuracy at the top 3 popular places; all of them have a lower CV for the log of accuracy than for the raw accuracy.
  • In the Methodology section, we will see that we get a better result by taking the log of accuracy as a feature instead of the raw accuracy. The reason may be that the log of accuracy at each place has a lower level of dispersion, so the distributions of different places are less likely to overlap.

In [25]:
for i, order in zip(range(3), ['1st', '2nd', '3rd']):
    place = place_ranking.index[i]
    data = df_train[df_train.place_id==place]
    print "The {} place:".format(order)
    print "CV of accuracy: {:.3f}".format(np.std(data.accuracy) / np.mean(data.accuracy))
    print "CV of log of accuracy: {:.3f}".format(np.std(np.log10(data.accuracy)) / np.mean(np.log10(data.accuracy)))
    print "-"*20


The 1st place:
CV of accuracy: 1.290
CV of log of accuracy: 0.210
--------------------
The 2nd place:
CV of accuracy: 1.294
CV of log of accuracy: 0.259
--------------------
The 3rd place:
CV of accuracy: 1.357
CV of log of accuracy: 0.238
--------------------

Explore small grid

  • From this section, we can see that adding the dimension of time helps separate the clusters.
  • By zooming in on the map, I explore the data clusters in the x and y coordinates.
  • Some clusters can be found in the coordinates alone, but there is still quite a lot of overlap.
  • Inspired by Alexandru Papiu

In [26]:
small_grid = df_train[(df_train.x<0.1)&(df_train.y<0.1)].copy()
# Mapping each place id with one color
color = dict(zip(small_grid.place_id.unique(), cm.rainbow(np.linspace(0,1,small_grid.place_id.unique().shape[0]))))

In [27]:
f, ax = plt.subplots()
for place, group in small_grid.groupby('place_id'):
    group.plot(ax=ax, kind='scatter', x='x', y='y', color=color[place])
ax.set_title('Check-ins colored by place_id', fontsize=18);


It can be seen that the clusters are easier to separate at certain hours, because for some places check-ins occur much more often at specific hours of the day.


In [29]:
# Converted minutes into hours
small_grid.loc[:,'hour'] = ((small_grid.time /60)%24).astype(int)

In [30]:
ax = plt.figure(figsize=(15,8)).gca(projection='3d')
for place, group in small_grid.groupby('place_id'):
    ax.scatter(group.x,group.y,group.hour,c=color[place])
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Hour')
ax.set_title('3D Scatter of x, y and hour', fontsize=18);


From the histograms below, we can see that the check-ins at these two places are almost complementary in the time dimension: one is popular in the morning and the other at night. This helps double-check the idea of adding the time dimension.


In [31]:
print 'Top 3 popular places in the small grid'
for order, place in zip(['1st','2nd','3rd'], small_grid.place_id.value_counts().index[:3]):
    print '{} place: {}'.format(order, place)


Top 3 popular places in the small grid
1st place: 1006316884
2nd place: 5445221293
3rd place: 8378301865

In [32]:
f, (ax1,ax2) = plt.subplots(nrows=2, sharey=True,figsize=(15,10))
small_grid[small_grid.place_id==1006316884].hour.hist(bins=100,ax=ax1)
ax1.set_title('Histogram of check-ins at the 1st popular place', fontsize=18)
ax1.set_xlabel('Hour')
ax1.set_ylabel('Frequency')
small_grid[small_grid.place_id==8378301865].hour.hist(bins=100,ax=ax2)
ax2.set_title('Histogram of check-ins at the 3rd popular place', fontsize=18)
ax2.set_xlabel('Hour')
ax2.set_ylabel('Frequency')
del small_grid


Algorithms and Techniques

For this supervised classification problem, the following are the algorithms I use:

  • KNN (K-Nearest Neighbor)
    • Description:
      • When a prediction is required for an unseen data instance, the KNN algorithm searches through the training dataset for the k most similar instances.
      • The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used; for other types of data, such as categorical or binary data, the Hamming distance can be used.
      • It is a non-parametric learning algorithm, which means that KNN does not make any assumptions about the underlying data distribution.
      • It is also a lazy algorithm: it does not use the training data points to build a generalized model, and all computation is deferred to prediction time.
    • Discussion:
      • Since I do not have any assumptions about the data distribution, KNN could be a good starting point.
      • The weight of y should be higher than that of x, because the standard deviation of x is much larger than that of y.
      • Since I only have limited knowledge about each feature and do not know exactly how important they are, the weights will be set by trial and error.
      • Steps in Methodology:
        1. Dividing the whole map into many small cells.
        2. In each cell, splitting data by time into training and validation data sets.
        3. Training a KNN classifier in each cell without feature weighting, and calculating the accuracy.
        4. Training with feature weighting. The order of weights is y > x > year > hour > weekday > month > day.
    • Reference:
  • Gaussian Naive Bayes
    • Description:
      • It is a simple classifier based on applying Bayes' theorem with independence assumptions between the features, where the probability model of each feature is Gaussian.
      • Equation of Bayes Theorem $$P(y|X) = \frac{P(X|y)P(y)}{P(X)}$$
        • X: features, y: class
        • P(y|X): posterior probability
        • P(y): prior probability of class
        • P(X): prior probability of features
        • P(X|y): class conditional model, which is a Gaussian distribution
    • Discussion:
      • From the histograms below, we can see that all the feature distributions look roughly bell-shaped, so it is worth trying this method to see how well it can perform.
      • I assume the features are independent, although I have not proven this yet.
      • Steps in Methodology:
        1. Dividing the whole map into many small cells.
        2. In each cell, splitting data by time into training and validation data sets.
        3. Training a GNB classifier in each cell without feature selection, and calculating the accuracy.
        4. Training with feature selection. Selected features are x, y, hour, weekday, log of accuracy.
    • Reference:
  • Kernel Density Estimation
    • Description:
      • Kernel density estimation is a data-smoothing technique for estimating the probability density function of a random variable.
      • A major problem with histograms (upper-left panel) is that the choice of binning can have a disproportionate effect on the resulting visualization (upper-right panel). We can solve this problem and recover a smoother distribution by using KDE. Different kernels give different results: the bottom-left panel uses a top-hat kernel and the bottom-right panel uses a Gaussian kernel.
      • The bandwidth of the kernel is a free parameter which exerts a strong influence on the resulting estimate.
    • Discussion:
      • This problem can be approximately modeled as follows: $$P(place\_id\ |\ x,\ y,\ time,\ accuracy)$$ $$\propto P(x,\ y,\ time,\ accuracy\ |\ place\_id)*P(place\_id)$$ $$\approx P(x|place\_id)*P(y|place\_id)*P(time|place\_id)*P(accuracy|place\_id)*P(place\_id)$$
      • Based on the result of GNB, the assumption that the conditional probabilities of the features are independent could hold for this problem, so I try to use KDE to estimate all the conditional probabilities in order to get a more accurate result than GNB. There are two parameters in KDE that need to be addressed:
        • Kernel: The shape of kernel affects the smoothness of the resulting distribution. Here, I choose Gaussian as my kernel, simply because the result of GNB is not bad and GNB using Gauss