Stolen phones in a nightclub

Scenario

Your friend owns a nightclub, and the nightclub is suffering an epidemic of stolen phones. At least one thief has been frequenting her club and stealing her visitors' phones. Her club has a licence scanner at its entrance, that records the name and date-of-birth of everyone who enters the club - so she should have the personal details of the thief or thieves; it's just mixed in with the details of her honest customers. She heard you call yourself a "data scientist", so has asked you to come up with a ranked list of up to 20 suspects to give to the police.

She's given you: visitor_log.csv - details of who visited the club and on what day (those visiting 2AM Tuesday are counted as visiting on Monday). `theft_log.csv' - a list of days on which thefts were reported to occur (again, thefts after midnight are counted as the previous day - we're being nice to you)

She wants from you:

A list of ID details for the 20 most suspicious patrons, ranked from most-suspicious to least-suspicious.
If you think there are fewer than 20 thieves, a list of ID details for everyone that you think is a thief.

Please just follow the flow of the notebook!



In [9]:

    
import pandas as pd
from matplotlib.pyplot import plot
import numpy as np
from scipy.stats import zscore,normaltest
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 15)









    



Populating the interactive namespace from numpy and matplotlib



In [10]:

    
# First things first lets load the data
visitor_log = pd.read_csv('visitor_log.csv')
print visitor_log.shape

# After I loaded the first time the thief_log I noticed that there is no header, so I have changed the reading
# process adding the column name for the dates
thief_log = pd.read_csv('theft_log.csv', header=None,names=['theft_date'])
print thief_log.shape









    



(42727, 3)
(34, 1)



In [11]:

    
# looking at the data
visitor_log.head()









    Out[11]:






  
    
      
      visit_date
      name
      dob
    
  
  
    
      0
      2016-06-16
      Samuel Casanova
      1990-03-12
    
    
      1
      2016-06-16
      Gerald Orozco
      1990-10-05
    
    
      2
      2016-06-16
      Sharon Davis
      1990-10-18
    
    
      3
      2016-06-16
      Spencer Spencer
      1991-10-13
    
    
      4
      2016-06-16
      Steven Estevez
      1992-01-13



In [12]:

    
thief_log.head()



In [13]:

    
# In a previous version of pandas there was a function called 'convert_objects', that function was used to
# infer automatically the type of the data in the columns
# Now that function is deprecated so I have to cast all the data to the correct type
visitor_log.visit_date = pd.to_datetime(visitor_log.visit_date)
visitor_log.dob = pd.to_datetime(visitor_log.dob)
visitor_log.name = visitor_log.name.astype(basestring)
visitor_log.dtypes









    Out[13]:





visit_date    datetime64[ns]
name                  object
dob           datetime64[ns]
dtype: object



In [14]:

    
thief_log.theft_date = pd.to_datetime(thief_log.theft_date)
thief_log.dtypes









    Out[14]:





theft_date    datetime64[ns]
dtype: object



In [15]:

    
# Lets see if there is any duplicate or missing data and remove it
visitor_log = visitor_log.dropna().drop_duplicates()
thief_log.csv = thief_log.dropna().drop_duplicates()
print visitor_log.shape
print thief_log.shape









    



(42727, 3)
(34, 1)

All good! Now we can proceed



In [16]:

    
#Since there is the date record in each data set, it makes sense to merge them on the date.
# In this way we can later see who was in the nightclub when there was a robbery
log_merged = visitor_log.merge(thief_log,how='left', left_on='visit_date',right_on='theft_date')



In [17]:

    
# In order to plot, we need numerical values, so I am adding a boolean value that is 1 if there was a robbery and 
# zero otherwise
log_merged['theft_bool']= log_merged['theft_date'].apply(lambda x: 0 if pd.isnull(x) else 1)



In [18]:

    
#Now it is possible to group the data by the visitor name and count how many times that visitor was at the club
# when there was a robbery.
# I sort the data in descending order and plot it
log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).plot(legend=False)









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fecb95a41d0>

It seems we have some interesting insight here.

From the plot we can see that Karen Keeney is the visitor who was most often at the club when there was a robbery.



In [19]:

    
# Lets have a look what is the density distribution of this data
log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).plot(legend=False,kind='kde')









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fecb955e410>

Looks like a normal distribution!



In [20]:

    
# Now lets count how many times a visitor was at the club during a robbery
# lets remove the visitors who were never at the club during a robbery
theft_sum = log_merged.groupby('name').theft_bool.sum().sort_values(ascending=False).reset_index()
theft_sum = theft_sum[theft_sum.theft_bool!=0]
theft_sum.columns = ['name','robbery_count']
theft_sum.shape









    Out[20]:





(906, 2)



In [21]:

    
# Karen Keeney seems our thief! But lets not jump to conclusions
theft_sum.head(20)









    Out[21]:






  
    
      
      name
      robbery_count
    
  
  
    
      0
      Karen Keeney
      30
    
    
      1
      Patricia Whitley
      27
    
    
      2
      Christine Victoria
      27
    
    
      3
      Judith Sanders
      27
    
    
      4
      Henry Brunson
      27
    
    
      5
      Lynn Bernhart
      27
    
    
      6
      Daniel Laster
      26
    
    
      7
      Arthur Freund
      26
    
    
      8
      Dale Kirkwood
      26
    
    
      9
      Roger Ballance
      26
    
    
      10
      Suzanne Larson
      25
    
    
      11
      Michael Mcbride
      25
    
    
      12
      Kathleen Benzi
      25
    
    
      13
      Barbara Mchale
      25
    
    
      14
      Luz Connelly
      25
    
    
      15
      Adrienne Ruck
      24
    
    
      16
      Florine Kim
      24
    
    
      17
      Johnie Johnson
      24
    
    
      18
      Raymond Shannon
      24
    
    
      19
      John Davis
      24

Since there were in total 34 theft days, it is clear that there is more than one thief.

We will come back to this later on.



In [22]:

    
#Lets have a look at the visitors who where at the club when there was no robbery
not_theft = log_merged[log_merged.theft_bool!=1]
not_theft.groupby('name').theft_bool.count().sort_values().plot(legend=False)









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fecb314e210>



In [23]:

    
# Lets store the previous count into a dataframe
not_theft_count = not_theft.groupby('name').theft_bool.count().sort_values().reset_index()
not_theft_count.columns=['name','not_robbery_count']



In [24]:

    
#Now its time to calculate some stats for the visitors!!
# I add the counts to the original dataset and make some new ones, the names should be self explanatory
final_stats = log_merged.merge(not_theft_count,how='left',left_on='name',right_on='name')
final_stats = final_stats.merge(theft_sum,how='left',left_on='name',right_on='name')
#here i replace missing values with zeros, this creates an 'artificial' value for the column theft_date
# but it doesnt matter since there is the theft_bool column now
final_stats.fillna(0,inplace=True)
final_stats['total_visits'] = final_stats.robbery_count + final_stats.not_robbery_count
final_stats['robbery_freq'] = final_stats.robbery_count/final_stats.total_visits
final_stats.head()









    Out[24]:






  
    
      
      visit_date
      name
      dob
      theft_date
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
    
  
  
    
      0
      2016-06-16
      Samuel Casanova
      1990-03-12
      1970-01-01
      0
      38.0
      11.0
      49.0
      0.224490
    
    
      1
      2016-06-16
      Gerald Orozco
      1990-10-05
      1970-01-01
      0
      19.0
      2.0
      21.0
      0.095238
    
    
      2
      2016-06-16
      Sharon Davis
      1990-10-18
      1970-01-01
      0
      39.0
      1.0
      40.0
      0.025000
    
    
      3
      2016-06-16
      Spencer Spencer
      1991-10-13
      1970-01-01
      0
      91.0
      14.0
      105.0
      0.133333
    
    
      4
      2016-06-16
      Steven Estevez
      1992-01-13
      1970-01-01
      0
      55.0
      11.0
      66.0
      0.166667



In [25]:

    
#Lets have a look at the robbery frequency distribution
final_stats.robbery_freq.plot(legend=True,kind='density')









    Out[25]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fecb93e9f10>



In [26]:

    
#lets remove visitors who are clearly innocent
final_stats = final_stats[final_stats.robbery_freq!=0]



In [27]:

    
# Lets have another look at the distribution
final_stats.robbery_freq.plot(legend=True,kind='density')









    Out[27]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fecb930de90>



In [50]:

    
final_stats.sort_values(['robbery_freq','robbery_count'],ascending=False).head()









    Out[50]:






  
    
      
      visit_date
      name
      dob
      theft_date
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
    
  
  
    
      1334
      2016-06-22
      Oneida Randall
      1995-09-11
      2016-06-22
      1
      0.0
      11.0
      11.0
      1.0
    
    
      7496
      2016-07-21
      Oneida Randall
      1995-09-11
      2016-07-21
      1
      0.0
      11.0
      11.0
      1.0
    
    
      7941
      2016-07-23
      Oneida Randall
      1995-09-11
      2016-07-23
      1
      0.0
      11.0
      11.0
      1.0
    
    
      8366
      2016-07-25
      Oneida Randall
      1995-09-11
      2016-07-25
      1
      0.0
      11.0
      11.0
      1.0
    
    
      9253
      2016-07-29
      Oneida Randall
      1995-09-11
      2016-07-29
      1
      0.0
      11.0
      11.0
      1.0



In [29]:

    
final_stats.describe()









    Out[29]:






  
    
      
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
    
  
  
    
      count
      42313.000000
      42313.000000
      42313.000000
      42313.000000
      42313.000000
    
    
      mean
      0.172240
      59.889750
      12.253350
      72.143100
      0.172240
    
    
      std
      0.377594
      31.821258
      6.716742
      38.006944
      0.051084
    
    
      min
      0.000000
      0.000000
      1.000000
      2.000000
      0.025000
    
    
      25%
      0.000000
      34.000000
      7.000000
      41.000000
      0.147059
    
    
      50%
      0.000000
      58.000000
      12.000000
      70.000000
      0.167785
    
    
      75%
      0.000000
      80.000000
      17.000000
      95.000000
      0.195652
    
    
      max
      1.000000
      154.000000
      30.000000
      177.000000
      1.000000



In [30]:

    
# Now we want to confirm in a statistical way which visitors could be a thief
# Lets select just the numerical columns and calculate the z-score
numeric_cols = final_stats.drop('theft_date',axis=1).select_dtypes(include=[np.number]).columns
zscore_vals = final_stats[numeric_cols].apply(zscore).sort_values(['robbery_count','robbery_freq'],ascending=False)
zscore_vals.head()









    Out[30]:






  
    
      
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
    
  
  
    
      243
      2.192224
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      664
      -0.456158
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      875
      -0.456158
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      1292
      2.192224
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      1489
      -0.456158
      1.637612
      2.642183
      1.838025
      0.764001



In [49]:

    
#the results have the original index of the visitor in the dataframe, lets show the top 50 
idxs = zscore_vals.drop_duplicates().index.values
for i in idxs[:20]:
    print final_stats.ix[i]









    



visit_date           2016-06-17 00:00:00
name                        Karen Keeney
dob                  1993-12-25 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    112
robbery_count                         30
total_visits                         142
robbery_freq                    0.211268
Name: 243, dtype: object
visit_date           2016-06-19 00:00:00
name                        Karen Keeney
dob                  1993-12-25 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    112
robbery_count                         30
total_visits                         142
robbery_freq                    0.211268
Name: 664, dtype: object
visit_date           2016-06-16 00:00:00
name                      Judith Sanders
dob                  1993-08-26 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    112
robbery_count                         27
total_visits                         139
robbery_freq                    0.194245
Name: 15, dtype: object
visit_date           2016-06-17 00:00:00
name                      Judith Sanders
dob                  1993-08-26 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    112
robbery_count                         27
total_visits                         139
robbery_freq                    0.194245
Name: 240, dtype: object
visit_date           2016-06-17 00:00:00
name                       Lynn Bernhart
dob                  1995-11-13 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    116
robbery_count                         27
total_visits                         143
robbery_freq                    0.188811
Name: 296, dtype: object
visit_date           2016-06-18 00:00:00
name                       Lynn Bernhart
dob                  1995-11-13 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    116
robbery_count                         27
total_visits                         143
robbery_freq                    0.188811
Name: 505, dtype: object
visit_date           2016-06-17 00:00:00
name                    Patricia Whitley
dob                  1993-06-08 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    126
robbery_count                         27
total_visits                         153
robbery_freq                    0.176471
Name: 237, dtype: object
visit_date           2016-06-19 00:00:00
name                    Patricia Whitley
dob                  1993-06-08 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    126
robbery_count                         27
total_visits                         153
robbery_freq                    0.176471
Name: 654, dtype: object
visit_date           2016-06-16 00:00:00
name                  Christine Victoria
dob                  1996-11-26 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    139
robbery_count                         27
total_visits                         166
robbery_freq                    0.162651
Name: 160, dtype: object
visit_date           2016-06-17 00:00:00
name                  Christine Victoria
dob                  1996-11-26 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    139
robbery_count                         27
total_visits                         166
robbery_freq                    0.162651
Name: 370, dtype: object
visit_date           2016-06-16 00:00:00
name                       Henry Brunson
dob                  1996-06-21 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    142
robbery_count                         27
total_visits                         169
robbery_freq                    0.159763
Name: 125, dtype: object
visit_date           2016-06-17 00:00:00
name                       Henry Brunson
dob                  1996-06-21 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    142
robbery_count                         27
total_visits                         169
robbery_freq                    0.159763
Name: 340, dtype: object
visit_date           2016-06-20 00:00:00
name                       Daniel Laster
dob                  1995-12-31 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    109
robbery_count                         26
total_visits                         135
robbery_freq                    0.192593
Name: 947, dtype: object
visit_date           2016-06-26 00:00:00
name                       Daniel Laster
dob                  1995-12-31 00:00:00
theft_date           2016-06-26 00:00:00
theft_bool                             1
not_robbery_count                    109
robbery_count                         26
total_visits                         135
robbery_freq                    0.192593
Name: 2211, dtype: object
visit_date           2016-06-16 00:00:00
name                      Roger Ballance
dob                  1996-05-16 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    123
robbery_count                         26
total_visits                         149
robbery_freq                    0.174497
Name: 116, dtype: object
visit_date           2016-06-17 00:00:00
name                      Roger Ballance
dob                  1996-05-16 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    123
robbery_count                         26
total_visits                         149
robbery_freq                    0.174497
Name: 333, dtype: object
visit_date           2016-06-17 00:00:00
name                       Arthur Freund
dob                  1996-09-07 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    134
robbery_count                         26
total_visits                         160
robbery_freq                      0.1625
Name: 356, dtype: object
visit_date           2016-06-18 00:00:00
name                       Arthur Freund
dob                  1996-09-07 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    134
robbery_count                         26
total_visits                         160
robbery_freq                      0.1625
Name: 555, dtype: object
visit_date           2016-06-16 00:00:00
name                       Dale Kirkwood
dob                  1993-05-22 00:00:00
theft_date           1970-01-01 00:00:00
theft_bool                             0
not_robbery_count                    138
robbery_count                         26
total_visits                         164
robbery_freq                    0.158537
Name: 10, dtype: object
visit_date           2016-06-17 00:00:00
name                       Dale Kirkwood
dob                  1993-05-22 00:00:00
theft_date           2016-06-17 00:00:00
theft_bool                             1
not_robbery_count                    138
robbery_count                         26
total_visits                         164
robbery_freq                    0.158537
Name: 235, dtype: object



In [32]:

    
results = final_stats.join(zscore_vals,rsuffix='_zscore').sort_values(['robbery_count','theft_bool'],ascending=False).drop_duplicates('name')



In [48]:

    
results.sort_values(['robbery_count','robbery_freq'],ascending=False).head()









    Out[48]:






  
    
      
      visit_date
      name
      dob
      theft_date
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
      theft_bool_zscore
      not_robbery_count_zscore
      robbery_count_zscore
      total_visits_zscore
      robbery_freq_zscore
    
  
  
    
      243
      2016-06-17
      Karen Keeney
      1993-12-25
      2016-06-17
      1
      112.0
      30.0
      142.0
      0.211268
      2.192224
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      240
      2016-06-17
      Judith Sanders
      1993-08-26
      2016-06-17
      1
      112.0
      27.0
      139.0
      0.194245
      2.192224
      1.637612
      2.195532
      1.759091
      0.430758
    
    
      296
      2016-06-17
      Lynn Bernhart
      1995-11-13
      2016-06-17
      1
      116.0
      27.0
      143.0
      0.188811
      2.192224
      1.763315
      2.195532
      1.864337
      0.324394
    
    
      237
      2016-06-17
      Patricia Whitley
      1993-06-08
      2016-06-17
      1
      126.0
      27.0
      153.0
      0.176471
      2.192224
      2.077574
      2.195532
      2.127450
      0.082814
    
    
      370
      2016-06-17
      Christine Victoria
      1996-11-26
      2016-06-17
      1
      139.0
      27.0
      166.0
      0.162651
      2.192224
      2.486111
      2.195532
      2.469496
      -0.187726

Now that we have some more confirmation on which visitors could be a thief, lets find out who are the members of the gang



In [34]:

    
#First lets select just the data when there was a theft and lets do the crosstab
robbery_dates = final_stats[final_stats.theft_bool==1]
cross_tab = pd.crosstab(robbery_dates['name'],robbery_dates['visit_date'], margins=True)



In [47]:

    
#Basically here we see a matrix like dataset. Every cell tells us if a visitor was at the club at that date
cross_tab.head()









    Out[47]:






  
    
      visit_date
      2016-06-17 00:00:00
      2016-06-22 00:00:00
      2016-06-26 00:00:00
      2016-06-29 00:00:00
      2016-07-03 00:00:00
      2016-07-07 00:00:00
      2016-07-09 00:00:00
      2016-07-21 00:00:00
      2016-07-23 00:00:00
      2016-07-25 00:00:00
      ...
      2016-12-05 00:00:00
      2016-12-12 00:00:00
      2016-12-15 00:00:00
      2016-12-18 00:00:00
      2016-12-19 00:00:00
      2016-12-22 00:00:00
      2016-12-25 00:00:00
      2016-12-30 00:00:00
      2017-01-01 00:00:00
      All
    
    
      name
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Aaron Nelson
      0
      0
      0
      0
      1
      1
      0
      1
      0
      0
      ...
      0
      1
      0
      0
      0
      1
      1
      0
      0
      11
    
    
      Ada Tejeda
      1
      0
      1
      1
      1
      0
      1
      1
      1
      0
      ...
      1
      1
      0
      1
      0
      0
      0
      1
      0
      22
    
    
      Adam Berry
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      1
      0
      0
      0
      0
      0
      0
      1
      5
    
    
      Adam Dorsette
      0
      0
      1
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      0
      0
      0
      0
      0
      0
      0
      0
      4
    
    
      Adam Johnson
      0
      0
      1
      1
      0
      1
      1
      0
      1
      0
      ...
      0
      0
      0
      0
      0
      1
      1
      0
      1
      13
    
  

5 rows × 35 columns



In [36]:

    
# We do the crosstab because it is easier to find out the gang members with data in this format!
cross_tab_mat=cross_tab.reset_index().values



In [53]:

    
#With data in this format we can calculate the correlation between visitors
#We need to transpose the dataframe because the correlation is calculated between the columns
cross_tab.transpose().corr().head()









    Out[53]:






  
    
      name
      Aaron Nelson
      Ada Tejeda
      Adam Berry
      Adam Dorsette
      Adam Johnson
      Adrian Whapham
      Adrienne Raney
      Adrienne Ruck
      Alan Shoemake
      Alfredo Vinci
      ...
      Willie Gould
      Willis Stewart
      Woodrow Johnston
      Yvette Vasquez
      Yvonne Knowles
      Yvonne Kouba
      Zenobia Yahna
      Zofia Treinen
      Zoraida Hendrix
      All
    
    
      name
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Aaron Nelson
      1.000000
      0.959119
      0.877803
      0.840999
      0.949540
      0.930520
      0.959812
      0.962523
      0.895460
      0.877803
      ...
      0.959341
      0.909632
      0.940764
      0.652668
      0.925802
      0.860146
      0.943522
      0.955057
      0.884159
      0.968431
    
    
      Ada Tejeda
      0.959119
      1.000000
      0.917056
      0.905482
      0.965054
      0.962449
      0.987829
      0.989812
      0.908014
      0.908014
      ...
      0.980369
      0.938204
      0.954786
      0.702710
      0.948129
      0.898973
      0.972712
      0.987829
      0.883381
      0.991164
    
    
      Adam Berry
      0.877803
      0.917056
      1.000000
      0.842657
      0.911518
      0.875504
      0.908512
      0.899297
      0.889474
      0.852632
      ...
      0.909088
      0.865181
      0.895460
      0.619010
      0.917112
      0.852632
      0.884137
      0.908512
      0.797630
      0.918843
    
    
      Adam Dorsette
      0.840999
      0.905482
      0.842657
      1.000000
      0.865479
      0.837112
      0.907689
      0.903638
      0.797630
      0.797630
      ...
      0.876257
      0.852530
      0.840999
      0.605231
      0.855862
      0.797630
      0.867661
      0.895563
      0.779874
      0.897319
    
    
      Adam Johnson
      0.949540
      0.965054
      0.911518
      0.865479
      1.000000
      0.956638
      0.965654
      0.967948
      0.911518
      0.911518
      ...
      0.964053
      0.923244
      0.949540
      0.714485
      0.907690
      0.881421
      0.951812
      0.957549
      0.847086
      0.974999
    
  

5 rows × 907 columns

Unfortunatelly the data is too sparse and the correlation metric is not really helpfull to understand which visitors could be complices

That is why we need to calculate the correlation manually. We calculate how many times two visitors were at the club when there was a theft, and we do this for every pair of visitors



In [37]:

    
#Lets count how many times two visitors were at the club on the same date, when a phone was stolen
from itertools import combinations
criminal_friends = []
for r1,r2 in combinations([x for x in cross_tab_mat],2):
    matches = [i for i,j in zip(r1,r2) if i==j and i!=0]
    criminal_friends.append((r1[0],r2[0], len(matches)))



In [38]:

    
#We clearly can see that Karen Keeney is the boss, and with her there are some other thiefs!
criminal_friends_ranking = sorted(criminal_friends, key=lambda x:x[2],reverse=True)
criminal_friends_ranking[:50]









    Out[38]:





[('Judith Sanders', 'Karen Keeney', 25),
 ('Dale Kirkwood', 'Karen Keeney', 24),
 ('Henry Brunson', 'Judith Sanders', 24),
 ('Judith Sanders', 'Lynn Bernhart', 24),
 ('Karen Keeney', 'Lynn Bernhart', 24),
 ('Arthur Freund', 'Karen Keeney', 23),
 ('Arthur Freund', 'Roger Ballance', 23),
 ('Christine Victoria', 'Karen Keeney', 23),
 ('Dale Kirkwood', 'Roger Ballance', 23),
 ('Daniel Laster', 'Lynn Bernhart', 23),
 ('Daniel Laster', 'Roger Ballance', 23),
 ('Henry Brunson', 'Karen Keeney', 23),
 ('Henry Brunson', 'Lynn Bernhart', 23),
 ('Henry Brunson', 'Patricia Whitley', 23),
 ('Judith Sanders', 'Patricia Whitley', 23),
 ('Karen Keeney', 'Patricia Whitley', 23),
 ('Karen Keeney', 'Roger Ballance', 23),
 ('Lynn Bernhart', 'Patricia Whitley', 23),
 ('Arthur Freund', 'Christine Victoria', 22),
 ('Arthur Freund', 'Daniel Laster', 22),
 ('Barbara Mchale', 'Karen Keeney', 22),
 ('Christine Victoria', 'Henry Brunson', 22),
 ('Christine Victoria', 'Patricia Whitley', 22),
 ('Dale Kirkwood', 'Daniel Laster', 22),
 ('Dale Kirkwood', 'Judith Sanders', 22),
 ('Dale Kirkwood', 'Lynn Bernhart', 22),
 ('Daniel Laster', 'Karen Keeney', 22),
 ('Daniel Laster', 'Patricia Whitley', 22),
 ('Henry Brunson', 'Kathleen Benzi', 22),
 ('Henry Brunson', 'Roger Ballance', 22),
 ('Karen Keeney', 'Kathleen Benzi', 22),
 ('Karen Keeney', 'Luz Connelly', 22),
 ('Karen Keeney', 'Raymond Shannon', 22),
 ('Karen Keeney', 'Suzanne Larson', 22),
 ('Lynn Bernhart', 'Roger Ballance', 22),
 ('Adrienne Ruck', 'Karen Keeney', 21),
 ('Adrienne Ruck', 'Patricia Whitley', 21),
 ('Arthur Freund', 'Dale Kirkwood', 21),
 ('Arthur Freund', 'Henry Brunson', 21),
 ('Arthur Freund', 'Patricia Whitley', 21),
 ('Barbara Mchale', 'Henry Brunson', 21),
 ('Barbara Mchale', 'Judith Sanders', 21),
 ('Barbara Mchale', 'Suzanne Larson', 21),
 ('Cheryl Robinson', 'Karen Keeney', 21),
 ('Christine Victoria', 'Dale Kirkwood', 21),
 ('Christine Victoria', 'Daniel Laster', 21),
 ('Christine Victoria', 'Judith Sanders', 21),
 ('Christine Victoria', 'Luz Connelly', 21),
 ('Christine Victoria', 'Lynn Bernhart', 21),
 ('Christine Victoria', 'Mary Laura', 21)]



In [39]:

    
# Lets select the top names as possible thiefs
# I use 21 as a threshold because the number of names retrieved is less than 20
criminal_names = set()
for t in criminal_friends_ranking:
    if t[2]>21:
        criminal_names.add(t[0])
        criminal_names.add(t[1])
print len(criminal_names)



In [40]:

    
#Finally here the list of possible thiefs of the nightclub!!
# If I have to choose, I would say that Karen Keeney is the boss! And the following names are most
# likely her accomplices
criminals = results[results.name.isin(criminal_names)].sort_values(['robbery_count','robbery_freq'],ascending=False)



In [43]:

    
criminals.head()









    Out[43]:






  
    
      
      visit_date
      name
      dob
      theft_date
      theft_bool
      not_robbery_count
      robbery_count
      total_visits
      robbery_freq
      theft_bool_zscore
      not_robbery_count_zscore
      robbery_count_zscore
      total_visits_zscore
      robbery_freq_zscore
    
  
  
    
      243
      2016-06-17
      Karen Keeney
      1993-12-25
      2016-06-17
      1
      112.0
      30.0
      142.0
      0.211268
      2.192224
      1.637612
      2.642183
      1.838025
      0.764001
    
    
      240
      2016-06-17
      Judith Sanders
      1993-08-26
      2016-06-17
      1
      112.0
      27.0
      139.0
      0.194245
      2.192224
      1.637612
      2.195532
      1.759091
      0.430758
    
    
      296
      2016-06-17
      Lynn Bernhart
      1995-11-13
      2016-06-17
      1
      116.0
      27.0
      143.0
      0.188811
      2.192224
      1.763315
      2.195532
      1.864337
      0.324394
    
    
      237
      2016-06-17
      Patricia Whitley
      1993-06-08
      2016-06-17
      1
      126.0
      27.0
      153.0
      0.176471
      2.192224
      2.077574
      2.195532
      2.127450
      0.082814
    
    
      370
      2016-06-17
      Christine Victoria
      1996-11-26
      2016-06-17
      1
      139.0
      27.0
      166.0
      0.162651
      2.192224
      2.486111
      2.195532
      2.469496
      -0.187726



In [46]:

    
criminals[['name','dob']]









    Out[46]:






  
    
      
      name
      dob
    
  
  
    
      243
      Karen Keeney
      1993-12-25
    
    
      240
      Judith Sanders
      1993-08-26
    
    
      296
      Lynn Bernhart
      1995-11-13
    
    
      237
      Patricia Whitley
      1993-06-08
    
    
      370
      Christine Victoria
      1996-11-26
    
    
      340
      Henry Brunson
      1996-06-21
    
    
      2211
      Daniel Laster
      1995-12-31
    
    
      333
      Roger Ballance
      1996-05-16
    
    
      356
      Arthur Freund
      1996-09-07
    
    
      235
      Dale Kirkwood
      1993-05-22
    
    
      291
      Kathleen Benzi
      1995-09-15
    
    
      275
      Luz Connelly
      1995-03-26
    
    
      269
      Barbara Mchale
      1994-12-19
    
    
      371
      Suzanne Larson
      1996-12-02
    
    
      417
      Raymond Shannon
      1997-08-06



In [ ]:

Conclusion

The proposed list shows the possible thieves, but there is another road that could be explored.

It is possible that the thieves went to the club just once and stole the phones.

As a future exercise it would be interesting to cover this other scenario as well.



In [ ]:

	visit_date	name	dob
0	2016-06-16	Samuel Casanova	1990-03-12
1	2016-06-16	Gerald Orozco	1990-10-05
2	2016-06-16	Sharon Davis	1990-10-18
3	2016-06-16	Spencer Spencer	1991-10-13
4	2016-06-16	Steven Estevez	1992-01-13

	name	robbery_count
0	Karen Keeney	30
1	Patricia Whitley	27
2	Christine Victoria	27
3	Judith Sanders	27
4	Henry Brunson	27
5	Lynn Bernhart	27
6	Daniel Laster	26
7	Arthur Freund	26
8	Dale Kirkwood	26
9	Roger Ballance	26
10	Suzanne Larson	25
11	Michael Mcbride	25
12	Kathleen Benzi	25
13	Barbara Mchale	25
14	Luz Connelly	25
15	Adrienne Ruck	24
16	Florine Kim	24
17	Johnie Johnson	24
18	Raymond Shannon	24
19	John Davis	24

	visit_date	name	dob	theft_date	theft_bool	robbery_count	total_visits	robbery_freq
1334	2016-06-22	Oneida Randall	1995-09-11	2016-06-22	1	11.0	11.0	1.0
7496	2016-07-21	Oneida Randall	1995-09-11	2016-07-21	1	11.0	11.0	1.0
7941	2016-07-23	Oneida Randall	1995-09-11	2016-07-23	1	11.0	11.0	1.0
8366	2016-07-25	Oneida Randall	1995-09-11	2016-07-25	1	11.0	11.0	1.0
9253	2016-07-29	Oneida Randall	1995-09-11	2016-07-29	1	11.0	11.0	1.0

	theft_bool	not_robbery_count	robbery_count	total_visits	robbery_freq
count	42313.000000	42313.000000	42313.000000	42313.000000	42313.000000
mean	0.172240	59.889750	12.253350	72.143100	0.172240
std	0.377594	31.821258	6.716742	38.006944	0.051084
min	0.000000	0.000000	1.000000	2.000000	0.025000
25%	0.000000	34.000000	7.000000	41.000000	0.147059
50%	0.000000	58.000000	12.000000	70.000000	0.167785
75%	0.000000	80.000000	17.000000	95.000000	0.195652
max	1.000000	154.000000	30.000000	177.000000	1.000000

	theft_bool	not_robbery_count	robbery_count	total_visits	robbery_freq
243	2.192224	1.637612	2.642183	1.838025	0.764001
664	-0.456158	1.637612	2.642183	1.838025	0.764001
875	-0.456158	1.637612	2.642183	1.838025	0.764001
1292	2.192224	1.637612	2.642183	1.838025	0.764001
1489	-0.456158	1.637612	2.642183	1.838025	0.764001

	visit_date	name	dob	theft_date	theft_bool	not_robbery_count	robbery_count	total_visits	robbery_freq	theft_bool_zscore	not_robbery_count_zscore	robbery_count_zscore	total_visits_zscore	robbery_freq_zscore
243	2016-06-17	Karen Keeney	1993-12-25	2016-06-17	1	112.0	30.0	142.0	0.211268	2.192224	1.637612	2.642183	1.838025	0.764001
240	2016-06-17	Judith Sanders	1993-08-26	2016-06-17	1	112.0	27.0	139.0	0.194245	2.192224	1.637612	2.195532	1.759091	0.430758
296	2016-06-17	Lynn Bernhart	1995-11-13	2016-06-17	1	116.0	27.0	143.0	0.188811	2.192224	1.763315	2.195532	1.864337	0.324394
237	2016-06-17	Patricia Whitley	1993-06-08	2016-06-17	1	126.0	27.0	153.0	0.176471	2.192224	2.077574	2.195532	2.127450	0.082814
370	2016-06-17	Christine Victoria	1996-11-26	2016-06-17	1	139.0	27.0	166.0	0.162651	2.192224	2.486111	2.195532	2.469496	-0.187726

visit_date	2016-06-17 00:00:00	2016-06-22 00:00:00	2016-06-26 00:00:00	2016-06-29 00:00:00	2016-07-03 00:00:00	2016-07-07 00:00:00	2016-07-09 00:00:00	2016-07-21 00:00:00	2016-07-23 00:00:00	2016-07-25 00:00:00	...	2016-12-05 00:00:00	2016-12-12 00:00:00	2016-12-15 00:00:00	2016-12-18 00:00:00	2016-12-19 00:00:00	2016-12-22 00:00:00	2016-12-25 00:00:00	2016-12-30 00:00:00	2017-01-01 00:00:00	All
name
Aaron Nelson	0	0	0	0	1	1	0	1	0	0	...	0	1	0	0	0	1	1	0	0	11
Ada Tejeda	1	0	1	1	1	0	1	1	1	0	...	1	1	0	1	0	0	0	1	0	22
Adam Berry	0	0	0	0	0	0	0	0	0	0	...	1	1	0	0	0	0	0	0	1	5
Adam Dorsette	0	0	1	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	4
Adam Johnson	0	0	1	1	0	1	1	0	1	0	...	0	0	0	0	0	1	1	0	1	13