This is a test on basic data processing. The test is tool-agnostic; you may use any data processing tool you like. Typical examples would include, Excel, python / pandas, matlab, SQL etc. The test is timed to last approximately 2 hours. Points available for each question are marked in [brackets].
You have been given 3 csv files:
leads.csv. This is a list of fictitious company directors, with some basic data about them and their company.
calls.csv. This is a list of fictitious calls made by an outbound call centre. The call centre consists of several agents, who make calls one after the other. They don’t get to choose who to call, the system does. The objective of the call is to get the lead to signup on the website. When they finish a call, they mark down the outcome, from a fixed list of possible outcomes. Note that a single lead may be called multiple times.
signups.csv. This is a list of leads who signed up after being called by someone from the call centre. Each signup was risk assessed and either approved or rejected for a loan.
Orange
avg: 1.8
avg: 2.1
The agent RED had the most signups. I assumed that a good indication of signups is the number of INTERESTED Leads that each agent had and the difference between CALL BACK LATER and INTERESTED outcome. Agent RED had the biggest difference between these two outcomes.
Agent BLUE had the most signups per call with 0.26 ratio. But I would say that this ratio is not accurate (and is just a lucky strike) because the total number of calls made by agent BLUE is so much lower compared to other agents. I would say that agent RED is the one with the highest signups per call average among the agents.
Was not for all the agents.
The difference was not always statistically significant because the sample size of certain agents was too small.
north-west.
consultancy.
Given a lead has already expressed interest and signed up,
north-west.
Like before, the sample for some regions are too small to be statistically significant.
Suppose you wanted to pick the 1000 leads most likely to sign up (who have not been called so far), based only on age, sector and region.
Age 18-20 -- Region nort-west, south-west -- Sector food, retail, consultancy, wholesale.*
These are the top criteria for leads who signed up for the product.
Around 10% of the sample, so 100 leads.
I would chose agent RED because he has the highest, statistically significant, signup per call average.
In [30]:
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_ind
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 15)
In [2]:
calls = pd.read_csv('calls.csv')
leads = pd.read_csv('leads.csv')
signups = pd.read_csv('signups.csv')
In [6]:
print calls.shape
calls.head()
Out[6]:
In [7]:
print leads.shape
leads.head()
Out[7]:
In [8]:
print signups.shape
signups.head()
Out[8]:
In [9]:
# Which agent made the most calls?
calls.Agent.value_counts()
Out[9]:
In [33]:
# For the leads that received one or more calls, how many calls were received on average?
print 'Number of calls on average: ',calls.groupby('Phone Number')['Call Outcome'].count().mean()
calls.groupby('Phone Number')['Call Outcome'].count().hist()
Out[33]:
In [37]:
# For the leads that signed up, how many calls were received, on average?
leads_signedup_phones = leads[leads['Name'].isin(signups['Lead'].tolist())]['Phone Number'].tolist()
leads_signedup = calls[calls['Phone Number'].isin(leads_signedup_phones)]
print 'Number of calls on average: ',leads_signedup.groupby('Phone Number')['Call Outcome'].count().mean()
leads_signedup.groupby('Phone Number')['Call Outcome'].count().hist()
Out[37]:
In [39]:
# Which agent had the most signups?
# Which assumptions did you make? (note that there is a many-to-one relationship between calls and leads)
leads_signedup_phones = leads[leads['Name'].isin(signups['Lead'].tolist())]['Phone Number'].tolist()
leads_signedup = calls[calls['Phone Number'].isin(leads_signedup_phones)]
leads_signedup.groupby('Agent')['Call Outcome'].value_counts()
Out[39]:
In [89]:
# Which agent had the most signups per call?
leads_signedup_phones = leads[leads['Name'].isin(signups['Lead'].tolist())]['Phone Number'].tolist()
leads_signedup = calls[calls['Phone Number'].isin(leads_signedup_phones)]
print 'INTERESTED calls per agent for signed up Leads:\n',leads_signedup[leads_signedup['Call Outcome']=='INTERESTED'].groupby('Agent')['Call Outcome'].count()
print '\nTotal number of calls per agent:\n', calls.groupby('Agent')['Call Outcome'].count()
print '\nSignups per call per agent:\n',leads_signedup[leads_signedup['Call Outcome']=='INTERESTED'].groupby('Agent')['Call Outcome'].count()/calls.groupby('Agent')['Call Outcome'].count()
In [112]:
# Was the variation between the agents’ signups-per-call statistically significant?
#this function creates a binary distribution where 1 means that the call was successful (lead signed up) and 0 otherwise
def agentCallDist(row,signedLeads):
if row['Phone Number'] in signedLeads and row['Call Outcome']=='INTERESTED':
return 1
return 0
leads_signedup_phones = leads[leads['Name'].isin(signups['Lead'].tolist())]['Phone Number'].tolist()
signedup_calls = calls.copy(deep=True)
signedup_calls['Signup'] = signedup_calls.apply(lambda x: agentCallDist(x,leads_signedup_phones),axis=1)
grouped = signedup_calls.groupby('Agent')
agent_list = []
for agent,group in grouped:
agent_list.append((agent,group['Signup'].tolist()))
for a,b in combinations(agent_list,2):
print a[0]+' vs '+b[0]
print ttest_ind(a[1],b[1],equal_var=False)
In [ ]:
# Why?
# The difference was not always statistically significant because the sample size of certain agents was too small.
A statistically significant result isn’t attributed to chance and depends on two key variables: sample size and effect size.
Sample size refers to how large the sample for your experiment is. The larger your sample size, the more confident you can be in the result of the experiment.
Effect size refers to the size of the difference in results between the two sample sets and indicates practical significance.
P-value refers to the probability value of observing an effect from a sample. A p-value of < 0.05 is the conventional threshold for declaring statistical significance.
In [82]:
# A lead from which region is most likely to be “interested” in the product?
interested_phone_numbers = calls[calls['Call Outcome']=='INTERESTED']['Phone Number'].drop_duplicates().tolist()
interested_leads = leads[leads['Phone Number'].isin(interested_phone_numbers)]
interested_leads.groupby('Region')['Name'].count().sort_values(ascending=False)
Out[82]:
In [83]:
# A lead from which sector is most likely to be “interested” in the product?
interested_phone_numbers = calls[calls['Call Outcome']=='INTERESTED']['Phone Number'].drop_duplicates().tolist()
interested_leads = leads[leads['Phone Number'].isin(interested_phone_numbers)]
interested_leads.groupby('Sector')['Name'].count().sort_values(ascending=False)
Out[83]:
In [88]:
# Given a lead has already expressed interest and signed up,
# - signups from which region are most likely to be approved?
signups_approved = signups[signups['Approval Decision']=='APPROVED']['Lead'].tolist()
approved_leads = leads[leads['Name'].isin(signups_approved)]
(approved_leads.groupby('Region')['Name'].count().sort_values(ascending=False)/leads.groupby('Region')['Name'].count().sort_values(ascending=False)).sort_values(ascending=False)
Out[88]:
In [118]:
leads['Region'].value_counts()# The average signup per call among all the agents is 0.173
Out[118]:
In [117]:
# - Is this statistically significant? Why?
leads_signedup = leads.merge(signups,how='right',left_on='Name',right_on='Lead')
leads_signedup['Outcome'] = leads_signedup['Approval Decision'].apply(lambda x: 1 if x=='APPROVED' else 0)
grouped = leads_signedup.groupby('Region')
region_list=[]
for region,group in grouped:
region_list.append((region,group['Outcome'].tolist()))
for a,b in combinations(region_list,2):
print a[0]+' vs '+b[0]
print ttest_ind(a[1],b[1],equal_var=False)
In [ ]:
# Suppose you wanted to pick the 1000 leads most likely to sign up (who have not been called so far),
# based only on age, sector and region.
# What criteria would you use to pick those leads? [10]
# In what sense are those an optimal criteria set? [3]
# How many signups would you expect to get based on those called leads, assuming they were being called by random agents? [3]
# If you could choose the agents to make those calls, who would you choose? Why? [3]
In [127]:
#In order to answer the last set of question we need to investigate the data first.
# Lets see what are the age, sector and region for the signed up leads
signed_leads = leads.merge(signups, how='right',left_on='Name',right_on='Lead')
print 'Top Age for signups'
print signed_leads['Age'].value_counts().sort_values(ascending=False)[:10]
print '\nTop Region for signups'
print signed_leads['Region'].value_counts().sort_values(ascending=False)[:10]
print '\nTop Sector for signups'
print signed_leads['Sector'].value_counts().sort_values(ascending=False)[:10]
In [ ]:
# What criteria would you use to pick those leads?
# Age between 18-20
# Region north-west and south-west
# Sector food, retail, consultancy, wholesale
In [ ]:
# In what sense are those an optimal criteria set?
# Those are optimal criteria because most of the leads who signed up for the product had them.
In [168]:
# How many signups would you expect to get based on those called leads, assuming they were being called
# by random agents?
# Now lets see how many leads with those criteria signed up
optimal_leads_age = leads[leads['Age']<21 ]
optimal_leads_region = leads[leads['Region'].isin(['north-west','south-west']) ]
optimal_leads_sector = leads[leads['Sector'].isin(['food','retail','consultancy','wholesale'])]
optimal_leads = pd.concat([optimal_leads_age,optimal_leads_region,optimal_leads_sector])
optimal_leads = optimal_leads.drop_duplicates()
optimal_leads_selected = optimal_leads_region[optimal_leads_region['Sector'].isin(['food','retail','consultancy','wholesale'])]
optimal_leads_selected.sort_values(by=['Age'],inplace=True)
optimal_leads_selected[optimal_leads_selected['Name'].isin(signups['Lead'].tolist())].shape
#It turns out that 10% of leads with the selected criteria signed up for the product
Out[168]: