Your friend owns a nightclub, and the nightclub is suffering an epidemic of stolen phones. At least one thief has been frequenting her club and stealing her visitors' phones. Her club has a licence scanner at its entrance, that records the name and date-of-birth of everyone who enters the club - so she should have the personal details of the thief or thieves; it's just mixed in with the details of her honest customers. She heard you call yourself a "data scientist", so has asked you to come up with a ranked list of up to 20 suspects to give to the police.
She's given you:
visitor_log.csv - details of who visited the club and on what day (those visiting 2AM Tuesday are counted as visiting on Monday).
`theft_log.csv' - a list of days on which thefts were reported to occur (again, thefts after midnight are counted as the previous day - we're being nice to you)
She wants from you:
In [9]:
import pandas as pd
from matplotlib.pyplot import plot
import numpy as np
from scipy.stats import zscore,normaltest
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 15)
In [10]:
# First things first lets load the data
visitor_log = pd.read_csv('visitor_log.csv')
print visitor_log.shape
# After I loaded the first time the thief_log I noticed that there is no header, so I have changed the reading
# process adding the column name for the dates
thief_log = pd.read_csv('theft_log.csv', header=None,names=['theft_date'])
print thief_log.shape
In [11]:
# looking at the data
visitor_log.head()
Out[11]:
In [12]:
thief_log.head()
Out[12]:
In [13]:
# In a previous version of pandas there was a function called 'convert_objects', that function was used to
# infer automatically the type of the data in the columns
# Now that function is deprecated so I have to cast all the data to the correct type
visitor_log.visit_date = pd.to_datetime(visitor_log.visit_date)
visitor_log.dob = pd.to_datetime(visitor_log.dob)
visitor_log.name = visitor_log.name.astype(basestring)
visitor_log.dtypes
Out[13]:
In [14]:
thief_log.theft_date = pd.to_datetime(thief_log.theft_date)
thief_log.dtypes
Out[14]:
In [15]:
# Lets see if there is any duplicate or missing data and remove it
visitor_log = visitor_log.dropna().drop_duplicates()
thief_log.csv = thief_log.dropna().drop_duplicates()
print visitor_log.shape
print thief_log.shape
In [16]:
#Since there is the date record in each data set, it makes sense to merge them on the date.
# In this way we can later see who was in the nightclub when there was a robbery
log_merged = visitor_log.merge(thief_log,how='left', left_on='visit_date',right_on='theft_date')
In [17]:
# In order to plot, we need numerical values, so I am adding a boolean value that is 1 if there was a robbery and
# zero otherwise
log_merged['theft_bool']= log_merged['theft_date'].apply(lambda x: 0 if pd.isnull(x) else 1)
In [18]:
#Now it is possible to group the data by the visitor name and count how many times that visitor was at the club
# when there was a robbery.
# I sort the data in descending order and plot it
log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).plot(legend=False)
Out[18]:
In [19]:
# Lets have a look what is the density distribution of this data
log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).plot(legend=False,kind='kde')
Out[19]:
In [20]:
# Now lets count how many times a visitor was at the club during a robbery
# lets remove the visitors who were never at the club during a robbery
theft_sum = log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).reset_index()
theft_sum = theft_sum[theft_sum.theft_bool!=0]
theft_sum.columns = ['name','robbery_count']
theft_sum.shape
Out[20]:
In [21]:
# Karen Keeney seems our thief! But lets not jump to conclusions
theft_sum.head(20)
Out[21]:
In [22]:
#Lets have a look at the visitors who where at the club when there was no robbery
not_theft = log_merged[log_merged.theft_bool!=1]
not_theft.groupby('name').theft_bool.count().sort_values().plot(legend=False)
Out[22]:
In [23]:
# Lets store the previous count into a dataframe
not_theft_count = not_theft.groupby('name').theft_bool.count().sort_values().reset_index()
not_theft_count.columns=['name','not_robbery_count']
In [24]:
#Now its time to calculate some stats for the visitors!!
# I add the counts to the original dataset and make some new ones, the names should be self explanatory
final_stats = log_merged.merge(not_theft_count,how='left',left_on='name',right_on='name')
final_stats = final_stats.merge(theft_sum,how='left',left_on='name',right_on='name')
#here i replace missing values with zeros, this creates an 'artificial' value for the column theft_date
# but it doesnt matter since there is the theft_bool column now
final_stats.fillna(0,inplace=True)
final_stats['total_visits'] = final_stats.robbery_count + final_stats.not_robbery_count
final_stats['robbery_freq'] = final_stats.robbery_count/final_stats.total_visits
final_stats.head()
Out[24]:
In [25]:
#Lets have a look at the robbery frequency distribution
final_stats.robbery_freq.plot(legend=True,kind='density')
Out[25]:
In [26]:
#lets remove visitors who are clearly innocent
final_stats = final_stats[final_stats.robbery_freq!=0]
In [27]:
# Lets have another look at the distribution
final_stats.robbery_freq.plot(legend=True,kind='density')
Out[27]:
In [50]:
final_stats.sort_values(['robbery_freq','robbery_count'],ascending=False).head()
Out[50]:
In [29]:
final_stats.describe()
Out[29]:
In [30]:
# Now we want to confirm in a statistical way which visitors could be a thief
# Lets select just the numerical columns and calculate the z-score
numeric_cols = final_stats.drop('theft_date',axis=1).select_dtypes(include=[np.number]).columns
zscore_vals = final_stats[numeric_cols].apply(zscore).sort_values(['robbery_count','robbery_freq'],ascending=False)
zscore_vals.head()
Out[30]:
In [49]:
#the results have the original index of the visitor in the dataframe, lets show the top 50
idxs = zscore_vals.drop_duplicates().index.values
for i in idxs[:20]:
print final_stats.ix[i]
In [32]:
results = final_stats.join(zscore_vals,rsuffix='_zscore').sort_values(['robbery_count','theft_bool'],ascending=False).drop_duplicates('name')
In [48]:
results.sort_values(['robbery_count','robbery_freq'],ascending=False).head()
Out[48]:
In [34]:
#First lets select just the data when there was a theft and lets do the crosstab
robbery_dates = final_stats[final_stats.theft_bool==1]
cross_tab = pd.crosstab(robbery_dates['name'],robbery_dates['visit_date'], margins=True)
In [47]:
#Basically here we see a matrix like dataset. Every cell tells us if a visitor was at the club at that date
cross_tab.head()
Out[47]:
In [36]:
# We do the crosstab because it is easier to find out the gang members with data in this format!
cross_tab_mat=cross_tab.reset_index().values
In [53]:
#With data in this format we can calculate the correlation between visitors
#We need to transpose the dataframe because the correlation is calculated between the columns
cross_tab.transpose().corr().head()
Out[53]:
That is why we need to calculate the correlation manually. We calculate how many times two visitors were at the club when there was a theft, and we do this for every pair of visitors
In [37]:
#Lets count how many times two visitors were at the club on the same date, when a phone was stolen
from itertools import combinations
criminal_friends = []
for r1,r2 in combinations([x for x in cross_tab_mat],2):
matches = [i for i,j in zip(r1,r2) if i==j and i!=0]
criminal_friends.append((r1[0],r2[0], len(matches)))
In [38]:
#We clearly can see that Karen Keeney is the boss, and with her there are some other thiefs!
criminal_friends_ranking = sorted(criminal_friends, key=lambda x:x[2],reverse=True)
criminal_friends_ranking[:50]
Out[38]:
In [39]:
# Lets select the top names as possible thiefs
# I use 21 as a threshold because the number of names retrieved is less than 20
criminal_names = set()
for t in criminal_friends_ranking:
if t[2]>21:
criminal_names.add(t[0])
criminal_names.add(t[1])
print len(criminal_names)
In [40]:
#Finally here the list of possible thiefs of the nightclub!!
# If I have to choose, I would say that Karen Keeney is the boss! And the following names are most
# likely her accomplices
criminals = results[results.name.isin(criminal_names)].sort_values(['robbery_count','robbery_freq'],ascending=False)
In [43]:
criminals.head()
Out[43]:
In [46]:
criminals[['name','dob']]
Out[46]:
In [ ]:
In [ ]: