This notebook demonstrates the isoRandForest implementation of the iForest (Isolation Forest) algorithm by Liu et al. [2008].
Please be aware that the current implementation is meant to be a proof of concept, NOT a production-ready library. In fact, scikit-learn already has a faster and more feature-complete implementation of the algorithm available. This version is stripped down for the purpose of exposing the core idea of the algorithm.
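For reference, scikit-learn's version lives in sklearn.ensemble.IsolationForest. A roughly equivalent call might look like the sketch below (it is not used anywhere in this demo, and the data matrix X is just a stand-in):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.randn(1000, 2)  # stand-in for your own (n_samples, n_features) data
clf = IsolationForest(n_estimators=100, random_state=12345).fit(X)
# score_samples returns the opposite of the paper's anomaly score,
# so negate it to get "higher = more anomalous":
scores = -clf.score_samples(X)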
We begin by importing the necessary libraries for the demo. Note that the only actual dependency of isoRandForest is NumPy; the others are imported to make plotting more convenient (and prettier).
Be sure isoRandForest.py is in the same directory as this notebook before running.
In [12]:
import numpy as np
import pandas as pd
import ggplot as gg
import isoRandForest as irf
Here, we generate a log of simulated user activity. Our system will have two users: user 0 and user 1. The log will keep track of their login time, session length, and, of course, the user ID.
User 0 comprises the first 6000 rows. This user's login behavior is as follows:
- Login times are normally distributed with a mean of 50400 seconds after midnight (2:00 PM) and a standard deviation of 2400 seconds (40 minutes).
- Session lengths are exponentially distributed with a mean of 360 seconds (6 minutes).
User 1 comprises the final 4000 rows. This user's login behavior is as follows:
- Login times are normally distributed with a mean of 28800 seconds after midnight (8:00 AM) and a standard deviation of 3600 seconds (1 hour).
- Session lengths are exponentially distributed with a mean of 600 seconds (10 minutes).
Suppose user 0 is doing something they probably shouldn't be doing during a couple of sessions. For simplicity, we'll assume those sessions are the first 3 in the data set (since the data were simulated i.i.d., it really doesn't matter which 3 we pick). So our nefarious user 0 alters the log so that those sessions show up as belonging to user 1. For shame, user 0.
In [22]:
np.random.seed(11111)
example_data = np.zeros([10000, 3])
user0_mask = np.array(range(10000)) < 6000   # user 0: first 6000 rows
user1_mask = np.array(range(10000)) >= 6000  # user 1: final 4000 rows
# columns: login time (seconds after midnight), session length (seconds), user ID
example_data[user0_mask, 0] = 50400 + 2400 * np.random.randn(user0_mask.sum())
example_data[user0_mask, 1] = np.random.exponential(360, user0_mask.sum())
example_data[user1_mask, 0] = 28800 + 3600 * np.random.randn(user1_mask.sum())
example_data[user1_mask, 1] = np.random.exponential(600, user1_mask.sum())
example_data[user1_mask, 2] = 1
example_data[[0, 1, 2], 2] = 1  # user 0's tampering: relabel the first 3 sessions as user 1
df = pd.DataFrame(example_data)
df.columns = ['login_time', 'session_len', 'user']
#df.to_csv('user_activity_log_demo.csv', index=False)
In [23]:
gg.ggplot(gg.aes(x='login_time', y='session_len', color='user'), data=df) + gg.geom_point()
Out[23]:
Hmm. It looks like overplotting is keeping us from seeing the altered sessions. You win this round, user 0. Let's see if we can find user 0's altered sessions using isoRandForest.
Don't worry if it takes a while (~2-4 minutes) to fit the model. The implementation hasn't been optimized yet.
In [24]:
anom_forest = irf.IsoRandForest(random_seed=12345)
In [25]:
anom_forest.fit(example_data)
Moment of truth time. Let's see what the anomaly scores look like for the first few entries in the log file.
In [26]:
anom_forest.anom_scores[0:20]
Out[26]:
Hey! Look at that! Our first three entries have anomaly scores much higher than the remaining entries.
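Of course, in a real log the tampered rows wouldn't conveniently sit at the top. Since anom_scores behaves like a NumPy array (we sliced it above), you could rank every row by score instead:

top10 = np.argsort(anom_forest.anom_scores)[::-1][:10]  # most anomalous first
print(top10)
print(anom_forest.anom_scores[top10])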
Before we move forward, let's fit just a single random tree and see what happens (don't worry, this one runs almost instantly).
In [27]:
iTree = irf.IsoRandForest.IsoTree(max_depth=20)
iTree.fitTree(example_data)
In [36]:
print "Depth of row 0: %d\nDepth of row 10: %d" % (iTree.findDepth(example_data[0,:]), iTree.findDepth(example_data[10,:]))
What the numbers above mean is that, for this one random tree, it takes 13 random splits of the original data set to isolate the first row, whereas it takes 20 random splits (the maximum allowable depth of a single tree) to isolate the 11th row. (Strictly speaking, the 11th row wasn't quite isolated; we use "isolated" loosely to mean we've reached a leaf node of the tree.) This is the heart of the algorithm: anomalous points tend to be isolated in far fewer random splits than points buried deep inside a cluster.
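To make "random splits" concrete, here's a toy version of the isolation step (this is not isoRandForest's actual code, just a sketch of the idea): pick a random feature, pick a random cut point between that feature's minimum and maximum, keep whichever side of the cut still contains our point, and repeat until the point stands alone or we hit the depth cap.

def isolation_depth(X, x, max_depth=20, rng=np.random):
    # Count the random axis-aligned splits needed to isolate x within X.
    depth = 0
    while depth < max_depth and len(X) > 1:
        col = rng.randint(X.shape[1])             # random feature
        lo, hi = X[:, col].min(), X[:, col].max()
        if lo == hi:                              # degenerate feature; a real tree would re-pick
            break
        cut = rng.uniform(lo, hi)                 # random cut point
        X = X[X[:, col] < cut] if x[col] < cut else X[X[:, col] >= cut]
        depth += 1
    return depth

print(isolation_depth(example_data, example_data[0, :]))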
The anomaly score for a point is computed from the average of these depths across all trees in the forest, normalized to lie in the interval (0, 1): short average depths (easy to isolate) map toward 1, and long ones map toward 0.
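isoRandForest's internals aren't shown in this notebook, but for reference, the normalization in the original paper maps an average depth E[h(x)] over n points to s(x, n) = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree lookup. Here's a minimal sketch of that formula (the function names are mine, and the example numbers assume trees grown over all 10000 rows):

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    # Average path length of an unsuccessful search in a binary search
    # tree built on n points (Liu et al. [2008]); normalizes the depths.
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n)): quickly isolated points score
    # near 1; deeply buried points score well below 0.5.
    return 2.0 ** (-avg_depth / c(n))

print(anomaly_score(13, 10000))  # ~0.60 for the depth-13 row above
print(anomaly_score(20, 10000))  # ~0.45 for the depth-20 row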
In [46]:
df['user'] = df['user'].apply(lambda x: str(x))  # ggplot wants a discrete color scale
df['anom'] = '0'
anom_mask = anom_forest.anom_scores > .65        # flag scores above an eyeballed cutoff
df.loc[anom_mask, 'anom'] = '1'
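The 0.65 cutoff above was eyeballed from the scores we printed earlier. One quick sanity check before committing to a threshold is to see where it falls in the empirical score distribution:

print(np.percentile(anom_forest.anom_scores, [50, 90, 99, 99.9]))
print((anom_forest.anom_scores > .65).sum())  # number of rows that get flagged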
In [47]:
g = gg.ggplot(gg.aes(x='login_time', y='session_len', color='anom'), data=df) +\
gg.geom_point()
print g
Notice that above, most of our anomalous points are pretty far from the center of either user's cluster. However, three points sit well inside user 0's cluster. If those points really belonged to user 1, they'd be extreme outliers and definitely worth a closer look.
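To put a number on "pretty extreme": under the simulation's parameters, user 0's mean login time of 50400 seconds sits (50400 - 28800) / 3600 = 6 standard deviations above user 1's mean login time, so a genuine user 1 session landing there would be vanishingly unlikely.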
In [48]:
print df[anom_mask]
Look at that. Our first three rows have login times that are way too late in the day for user 1. You were a worthy adversary, user 0, but not worthy enough to best the iForest algorithm.