I would like to understand exactly why this project creates value. After discussion with Chris and various stakeholders, I see the value lying in three areas.
We discussed and defined a range of metrics thought to be potential predictive factors. These can be found on this Tableau Dashboard, in particular on the 1) Overview, 2) Categories, and 3) Companies tabs.
I came to this session equipped with technical questions, as well as general questions about logistics and ways of working. Some are below:
Q. What distinguishes China from the USA/Europe? Why would someone who typically attends Money 20/20 USA/Europe think about attending China?
A. It was clarified that the reasons for attending China concern anything related to doing business in China, such as networking with the local FinTech community, attending seminars and learning.
Q. Why would this approach help?
A. Firstly, it would help us be a more data-driven organisation. It has also been suggested that spamming delegates can have an adverse effect - people actually disengage, unsubscribe etc. (is this true?).
Q. What are some current approaches?
A. Generally, just try to target as many people as possible (mass-emailing), unless there are specific comms going out - e.g. European delegates might be sent something specific about Europe - but in general it's about hitting as many people as possible.
Q. Do you agree with the project proposal - i.e. the three points above?
A. Action: we will go away and write a structured proposal, laying out time-frames for data cleansing, modelling and results visualisation.
Q. What is the most effective way to work?
A. Weekly updates, and sharing of data/dashboards/ideas on a 'when it's relevant' basis.
In the points below, I outline key milestones that I wish to hit by the target date. I have excluded Visualisation and Testing as they are ongoing processes present at each stage.
I have started to source relevant data sets, and have identified with Collins/Muyideen where such data fields might come from. Certain fields are definitely available, some are available but need cleaning or mapping, some are available for the US but not Europe, and some were discarded entirely.
My initial impression of the data warehouse is positive: the data are dense and well used, all required data sets are keyed, and dimensions can be joined easily and efficiently. As I understand it, there are no issues with the functionality of the DW; rather, the issues lie with the data therein. There are, for example, Opportunities that are Closed Won but later become Closed Lost (and not simply cancellations), and Test Accounts that aren't easily identifiable.
I connected through Tableau and started exploring.
Initial concerns:
These concerns will likely lead to difficulties in joining data together - particularly as USA/Europe/China Salesforce data are stored in different DBs. Different companies were also responsible for collecting registration data (Lenos vs. Events1), which leads to different data being collected, different fields, different formats etc.

Lorna created an excellent table for USA delegates (2014 onwards), which links nicely to a delegate pass and an Opportunity ID. Collins has also worked on a number of tables powering various delegates/SPEX dashboards, but they are still being tested, do not contain all of the fields I need, only contain data for either Europe or USA, and cannot be linked to an Opportunity ID. I have tried to replicate Lorna's table for USA 2012-13 data, Europe 2016-18, and Asia, but this is proving a challenge.
At a later date, I became aware that all data will be migrated into a single Salesforce instance - which would make things easier. I am not currently aware of any existing DBs containing China data - need to speak to China team (Rannie, Alice) about this.
I have pieced together the USA 2014-18 data, along with USA 2012-13, Europe 2016-18 and Asia 2018. Despite its drawbacks, I opted to join a given Delegate (from FactEventMember) to an Opportunity based on the Event and Account IDs.
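To make the join concrete, here is a rough pandas sketch of it; the file names and column names (EventId, AccountId, etc.) are placeholders rather than the exact DW schema.

import pandas as pd

# Hypothetical extracts from the DW; real field names differ per Salesforce instance
fact_event_member = pd.read_csv('fact_event_member.csv')   # one row per delegate pass
opportunities = pd.read_csv('opportunities.csv')            # one row per Opportunity

# Join a delegate to an Opportunity on the Event and Account keys
delegate_opps = fact_event_member.merge(
    opportunities,
    on=['EventId', 'AccountId'],
    how='left',
    suffixes=('_delegate', '_opp'),
)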
While Collins has produced some excellent Views in the DW, they do not make it possible to link a given Delegate to a given Opportunity, and so I worked on this myself (cherry-picking from his and Lorna's prior work). They also only cover Europe, and so this process would have to be replicated on the US Salesforce instance regardless.
Ideally, the following processes would be projects in their own right. I am able to do them to some extent, but embedding them into automated processes requires more resources.
I have spoken with Collins and Diana, and we have run some checks. There are some awkward scenarios, and it is very difficult to be confident in all results. For example, if someone switches company, are they the same person or a different person? Intuition would suggest they are different people, but in some instances they are treated as the same. On the flip side, there are examples of the same person attending the same show under multiple email addresses, or two different people attending the same show as the same person. These all need to be considered, and take time to work through.

Because delegates can buy multiple passes (e.g. 10, for other people), it is not known in advance exactly who those 10 tickets will be given to. As such, it isn't possible to simply tie a given delegate back to a given ticket at a given price. Worse still, if the tickets are Sponsored there is no assigned price for them; they simply come within the overall package. There are also cases where an Opportunity has been won, the money is in the bank, but it is later Closed Lost (without being a cancellation). It would be wise to discard these individuals, but they cover a large number of important clients, so a price must be assigned to them instead. Either we could set the 'flat' average price, or we could assign the price of a ticket at the time of purchase. Either way, it will not be an accurate representation.
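As a minimal sketch of the duplicate-person checks described above (field names are illustrative, and the real logic also has to handle the company-switch cases), a first pass might look like:

import pandas as pd

delegates = pd.read_csv('delegates.csv')  # illustrative extract: one row per pass purchased

# Normalise the obvious fields before comparing
delegates['email_norm'] = delegates['Email'].str.strip().str.lower()
delegates['name_norm'] = delegates['FullName'].str.strip().str.lower()

# Same email attending the same show more than once: likely the same person, duplicated
delegates['possible_duplicate'] = delegates.duplicated(subset=['email_norm', 'EventId'], keep=False)

# Same name at the same show under different emails: needs manual review
multi_email = (delegates.groupby(['name_norm', 'EventId'])['email_norm']
                        .nunique()
                        .reset_index(name='n_emails'))
needs_review = multi_email[multi_email['n_emails'] > 1]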
Some of these issues can be found on this Tableau Dashboard, on the Data Issues tabs.
Thoughts about RapidMiner: if deployed correctly it could be incredibly useful, and complement all data analysis/science projects. It is difficult to use at present, however, since the trial version limits analysis to 10k rows.
Chris first suggested sending out an email campaign at this point. It could potentially be a powerful proxy for understanding delegates who would be interested in attending China. Obviously, the only way of knowing whether our results are correct is if we can successfully predict whether a given delegate actually buys a ticket - unlikely any time soon, since the show isn't until autumn.
Constructing this as a classic ML problem, we could send a campaign to our target audience with a survey attached asking 'Are you interested in attending China?'. Respondents who answer 'Yes' would have their y-variable labelled 1; those who answer 'No', 0. Starting with some simple regression techniques, we might get a feel for the most influential variables, and perhaps learn some more. In event analysis scenarios, attending the most recent show is a well-known predictor, for example.
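A minimal sketch of that framing, assuming a hypothetical extract of delegate features already joined to the survey responses (all column names below are placeholders):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('delegates_with_survey.csv')        # hypothetical joined extract

# Label: 1 if the delegate answered 'Yes' to the China survey, 0 if 'No'
df['y'] = (df['survey_response'] == 'Yes').astype(int)

# Placeholder feature names covering the kinds of metrics discussed above
feature_cols = ['attended_last_show', 'n_shows_attended', 'n_continents', 'china_seminar_flag']
X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['y'], test_size=0.2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Coefficients give a first feel for which variables carry the most weight
print(dict(zip(feature_cols, model.coef_[0])))
print('Hold-out accuracy:', model.score(X_test, y_test))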
However, we must note the inherent differences between the US/Europe/Asia shows, and China. They are different shows.
I have produced a dashboard that contains an overview of all Money 20/20 events that I was able to successfully patch together.
Dash: Link to Tableau Dashboard.
In two weeks, I have sourced, extracted, joined and processed (filtered, cleaned, deduplicated) all delegate data for all shows. It has also been pushed into Tableau, where visualisation has begun. I have created charts showing the respective shows attended by each delegate, the number of delegates attending multiple shows and multiple continents, how much money they spent, etc. I have not yet linked up Marketo and Seminar data, but those data are in one location, and so should be easy to incorporate.
I was then slightly concerned to receive the following emails:
Hi both,
Thank you for this. Can I just get clarification on when I’ll actually have a database to market to? We have our first milestone on March 23rd so I need data asap.
The plan makes sense in terms of a longer-term, robust set-up (although I’d like to talk through to make sure I fully understand), but it seems far too slow to get me to a place asap where I can actually market to the West. What can we do in the short-term to get data that we can at least test our way through?
Hi Jack,
I’m really concerned about the timings of the project – I need data today. I can’t wait until mid-May as I have a delegate milestone mid-March and we have to start marketing now. I need a workaround – a quick way to build a smaller database I can start testing with in the next week.
For me, this is where the project took a considerable change in direction - two weeks after submitting the proposal. I felt ahead of schedule, with a fairly solid data set of 30-40k historical delegates ready to be analysed. The next logical step would have been to test the data, after which I could begin compiling a list.
I was comforted by the fact that Therese agreed 'it makes sense long-term', but this seems to be the case with every project - something is needed now, but providing a short-term workaround comes at the cost of a robust, long-term strategy.
The main objective, stated clearly above, was to create a model that predicts the likelihood of attendance from key features in the historical data. From this point onwards, the intelligent layer was overlooked in favour of 'rule-of-thumb' metrics. I had spent a considerable amount of time bringing everything in our initial list into the mix. For example, the final list of delegates is based on just three data points: did Delegate A attend the same show more than once, did they attend multiple shows, and did they attend Seminars about China. This does not require data science skills, and could have been completed by any analyst.
I suppose this is where I began to find it difficult to deliver. On one hand I had pieced the data together and had only just started to analyse it (i.e. the interesting bit), but was essentially asked to do something else. For me, the value in the project lay with learning how heavily all of the metrics we have talked about influence the purchasing of tickets. If we can understand what would lead someone to buy a ticket, we can target our marketing far more effectively.
I have produced a dashboard of delegate attendance at Seminars at Money 20/20 Vegas 2017. This could help direct personalised marketing to delegates who attended specific seminars. It would be an interesting hypothesis to test whether the mean time-to-open, open rate and click rate were affected by emails that were tailored to user interests, inferred from the types of seminars they attended.
Link to Tableau Dashboard
The idea behind adding Seminar data as a feature is straightforward, but requires some scripting. This is where RapidMiner could have been really useful - rather than doing this ad hoc, I could have scheduled a job that connects to the view created in the DW and the Seminar data saved locally, processes and joins them, and outputs the results as a Tableau extract. Instead, I had to write my own Python script, output to CSV, and connect to that CSV in Tableau. This could potentially all be automated. There is a field in the data called 'Theme' which classifies each Seminar into a Theme (such as Blockchain, AI, Globalisation etc.). I appended a flag for each delegate if they attended a Globalisation seminar, and wrote another function that appends a flag if a delegate attended a Seminar with the word 'China' in it.
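The script itself amounts to little more than the sketch below. 'Theme' is a real field; the file and remaining column names are how I refer to them here rather than the exact source names.

import pandas as pd

seminars = pd.read_csv('seminar_attendance.csv')   # one row per delegate per seminar attended

# 'Theme' classifies each Seminar (Blockchain, AI, Globalisation, ...)
seminars['globalisation_flag'] = (seminars['Theme'] == 'Globalisation').astype(int)

# Flag seminars with 'China' anywhere in the title
seminars['china_flag'] = seminars['SeminarTitle'].str.contains('China', case=False, na=False).astype(int)

# Roll up to one row per delegate: did they attend *any* such seminar?
delegate_flags = (seminars.groupby('DelegateId')[['globalisation_flag', 'china_flag']]
                          .max()
                          .reset_index())

delegate_flags.to_csv('delegate_seminar_flags.csv', index=False)  # the CSV connected to in Tableau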
I sought to compute four metrics for each delegate across all Money 20/20 emails: Mean Time to Open, Mean Time to Click, Open Rate and Click Rate. I also appended four flags: one if the delegate opened the specific 'Are you Interested in China' campaign, another if they opened any China-related e-mail, and likewise for clicks on each.
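In rough pandas terms, assuming a Marketo activity extract with one row per delegate, per email, per activity (column names are illustrative; the 'Deliver'/'Open'/'Click' activity types mirror those used in the query later on), the metrics reduce to:

import pandas as pd

activity = pd.read_csv('marketo_activity.csv')   # DelegateId, EmailId, ActivityType, ActivityTime
activity['ActivityTime'] = pd.to_datetime(activity['ActivityTime'])

# One row per delegate per email, with the earliest time of each activity type
times = (activity.pivot_table(index=['DelegateId', 'EmailId'],
                              columns='ActivityType',
                              values='ActivityTime',
                              aggfunc='min')
                 .reset_index())

times['time_to_open'] = times['Open'] - times['Deliver']
times['time_to_click'] = times['Click'] - times['Deliver']

per_delegate = times.groupby('DelegateId').agg(
    mean_time_to_open=('time_to_open', 'mean'),
    mean_time_to_click=('time_to_click', 'mean'),
    open_rate=('Open', lambda s: s.notna().mean()),    # unique emails opened / unique emails delivered
    click_rate=('Click', lambda s: s.notna().mean()),
)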
Using these metrics in combination with Seminar Data could offer a rich picture of delegates - an 'Interest in China' score, perhaps.
An analysis of email correspondence can be found on this Tableau Dashboard.
It recently occurred to me that two definitions of Open Rate and Click Rate are being used across the business: one at the campaign level and one at the delegate level. For campaigns, I use the former; for delegates, the latter.
The former is calculated as follows: divide the total number of unique individuals who opened an e-mail by the total number of unique individuals who received that e-mail. Usually, everyone receives the campaign at the same time on the same day, and it is generally only sent once. The Open Rate depends on time - i.e. on the day of the campaign, 100 people might open the e-mail, 30 the day after, and 5 the day after that. This sort of decay in the number of unique opens is typical. Click Rate follows the same pattern, and can only be equal to or below the Open Rate at any time, since it is not possible to click an e-mail without first opening it.
The latter is calculated at the delegate level: the number of unique e-mails opened over a given period of time, divided by the number of unique e-mails received. It can be used as an indicator of customer engagement, but care clearly needs to be taken when analysing results. E.g. someone who has only ever opened and clicked a single email (having only received one) would have a 100% Open and Click Rate. Comparing this to someone with a 75% Open Rate and 30% Click Rate who has been delivered 100 e-mails would suggest, using these metrics alone, that the former is more engaged. The latter is, in fact, a far more consistent and reliable responder.
I confirmed to Chris that I would have a Final List by the end of the week (13th March). We concluded that sending out the campaign by the 19th/20th March would be a good target.
Update: it is now the 29th March (16 days).
Update: 5th April (23 days).
Update: 9th April (27 days)... still waiting.
Therese and Tim were asked for a creative pertaining to China, to which our survey would be appended. We discussed how many delegates should be in the target list. I didn't understand why we couldn't send the email to everyone - after all, we want to know about anyone and everyone we could target. I was made aware that the team didn't wish to cannibalise sales from Europe/Asia, but I did think it was implicit in the project that we wanted to understand the likelihood for all delegates. I am not a marketer, and did not foresee the significance of sending campaigns at particular times of year etc. Had we known this in advance, we could have been more strategic in our targeting, scheduling emails at different times, etc.
A list of 5000 was proposed, with 2500 sampled uniformly at random, and 2500 strategically targeted. This was later reduced to 3000, without explanation.
Hi both,
Right, there is a view from Tracey and co that we need to move a lot faster on data for both Europe and China. Conscious that the China project requires a fair bit more work.
Is the target group ready? And the survey?
Bearing in mind that we’ve already sent a really simplified version of a survey to our US audience to gauge interest, could we not move things along faster with a quick and basic approach to start with – could we not build a profile of the 160 or so people that said they were interested in China and extrapolate that to build data? And also look at customers who attended both the US and Asia shows?
As far as I am aware, the last paragraph is a different question. I also had to clarify what 'building data' meant - we can either analyse existing delegates, or we can use that knowledge to build a profile of an ideal delegate and buy data accordingly. By this point, my target list (1500 targeted, 1500 random) had been created, based on the three features discussed, i.e. whether someone attended an event more than once, attended different events, and attended seminars related to China and/or globalisation.
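For completeness, the list construction itself is simple enough to sketch in a few lines, assuming the delegate-level table built earlier with the three features as flag columns (the names, and the exact way the flags are combined, are illustrative):

import pandas as pd

delegates = pd.read_csv('delegate_features.csv')   # one row per unique delegate

# The three rule-of-thumb features used for targeting
targeted_pool = delegates[
    (delegates['attended_same_show_more_than_once'] == 1)
    | (delegates['attended_multiple_shows'] == 1)
    | (delegates['attended_china_or_globalisation_seminar'] == 1)
]

target_list = targeted_pool.sample(n=1500, random_state=42)
# Random control group drawn from delegates not already in the targeted list
random_list = delegates.drop(target_list.index).sample(n=1500, random_state=42)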
Hi Both,
My concern is that it will confuse the US contacts - who are also being asked to buy tickets to the US show.
Can I see the survey so I know what copy we need to introduce it?
And back to the 160 delegates who already told us they are interested - can we build some lookalike audiences on that?
Hi Chris, Jack,
I had Tim run a quick pull on this test data and 68% are Europe – that’s too high right now.
I would like to send next week Thurs but can we please reduce Europe count?
Creative coming tomorrow.
Thanks,
Therese
These two e-mails imply that we cannot market to USA delegates because it will confuse them, and that we cannot market to Europe because the Europe count is too high. I did not understand why, and was not given an explanation. If anything, 68% seems quite low considering we are talking about 1500 of our most likely delegates, and that a considerable number of them attended both a US and a European show.
As for the '160 delegates who told us they are interested - can we build some lookalike audiences' question, I believe it to be an ad-hoc request. That data is meant to be incorporated into the wider model, but analysing those delegates specifically is a separate task. I explain in Section 2iii exactly how I wanted to use this data. The ad-hoc analysis can be found on this Tableau Dashboard.
At our next meeting, I was asked to exclude everyone who had ever attended a European event. The reason given was that their attention would be diverted, and we would risk cannibalising sales for the European show.
At this point, I attempted to divert attention back to the project brief, where it was clearly stated that the purpose of an intelligent model is to learn from all of our data, not just a hand-picked subset of it.
In a follow-up e-mail, I was given a 10-minute time-frame to add multiple 'flags' to the CSV (supplied the night before), indicating whether or not a delegate was in my target list.
I was a little frustrated at this point upon realising that this was no longer a predictive modelling project. Alongside refining and updating some of the existing dashboards, I have started a basic machine learning project based on the Python below. Since my GitHub repo isn't private, I can't use the real data within this report; a sample subset is used instead.
I decided to predict the probability (framed as a regression) of an existing delegate attending Money 20/20 USA 2017. I set out to use metrics from all of their historical data, across Salesforce and Marketo. To begin with, Linear and Logistic regression were used, but I hope to expand upon these. Unfortunately, Seminar data is only available for 2017, and so we can't use it in this model. It might be really interesting to re-run this model on 2018 data when the time comes, and include such data!
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# '/data.csv' and the 'attended_2017' column name are placeholders for the sample subset
data = pd.read_csv('/data.csv')

# Remove rows for which we don't have enough data
# Fill in any missing fields with the right values: either 0, the average, or something else
# Compute derived features
# Transform any features (e.g. logs, square roots, reciprocals)
# One-hot encode any categorical features
# Normalise the features

# y-variable: did the delegate attend Money 20/20 USA 2017?
features = data.drop(columns=['attended_2017'])
target = data['attended_2017']

num_epochs = 1000
mse_all = []
for i in range(num_epochs):
    # Randomise into a fresh 80/20 train/test split each run
    x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
    model = LinearRegression().fit(x_train, y_train)
    predictions = model.predict(x_test)
    mse_all.append(mean_squared_error(y_test, predictions))

avg_mse = np.mean(mse_all)
print(avg_mse)
Looking at the results, we see that X% of the variation can be explained by the metrics chosen. It seems that attending the previous year is a good predictor of attendance, as is having attended multiple shows in the past. Being a C-level person or a company Founder is also a strong predictor.
I compute a linear regression over 1000 training/test set combinations, split in an 80:20 proportion. The mean squared error over these runs is X, and an ROC curve is shown below demonstrating that the model is considerably better at predicting whether a given delegate will attend than selecting at random.
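Since the chart itself can't be reproduced here without the data, below is a hedged sketch of how the comparison against random selection can be generated from the logistic model, re-using the final train/test split from the cell above (the use of scikit-learn and matplotlib here is my own choice):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Re-using x_train, x_test, y_train, y_test from the final split above
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
probs = clf.predict_proba(x_test)[:, 1]

print('AUC:', roc_auc_score(y_test, probs))   # 0.5 would be no better than random

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='Logistic regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random selection')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()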
The only issue I have is that China is a completely different show. The underlying characteristics of Money 20/20 China might be fundamentally different to those of Money 20/20 USA.
I thought it was obvious from prior discussion that the only thing we want to gauge is interest in attending China. I did not anticipate having to actually create the survey in Qualtrics - I thought this came under the remit of the marketing team, as we supplied them with our questions. Either way, after some refinement/suggestions from Chris, it was ready to go that afternoon.
It was mentioned towards the end of the meeting that an e-mail was recently sent to 19,000 delegates in the Asia database. In response, 46 individuals proceeded to purchase tickets for China. I was unaware of such a campaign, but this could/should have been the focal point of the project:
"Of the 19,000 e-mailed, what was it about them that led to the purchasing of a ticket?".
If I had known an e-mail of this importance was going to be sent out to such a large audience, directly encouraging individuals to buy tickets for China, I could have run some predictive models on those specific delegates. If I had known it would only be sent to the Asia database, there would be no need for any complex data querying.
SELECT PersonId, AccountId, DateTime, Activity, Dim1, etc
FROM AscentialEventsDW.marketing.FactLeadActivity FLA
INNER JOIN Dimension1 DIM1
    ON FLA.Key = DIM1.Key
INNER JOIN etc
WHERE PrimaryAttributeValue = MessageId
    AND ActivityType IN ('Deliver', 'Open', 'Click')
I will connect directly to the results in Tableau. The first step will be to identify who opened and who clicked the email, and run a quick open/click rate analysis. I will then look at the Qualtrics responses (if any) and try to extract the data in a numerical format, starting with the responses of existing delegates, and then looking at delegates who attended multiple shows, multiple continents, and relevant seminars.
Overall though, we want to prove that the engagement/response of delegates in the target list is higher than that of the randomly selected list. If we can prove this (and that the difference is statistically significant), we can categorically say that by targeting delegates in this way, we have achieved a greater response/conversion rate. Obviously, we can only use this as a proxy, and the actual conversion rate will likely be lower once actual money is involved.
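Once the responses are in, a two-proportion z-test on open (or response) rates between the two groups would be a natural way to check for a statistically significant difference. A sketch, with placeholder counts standing in for the real campaign results:

from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts: opens and list sizes for the targeted and random groups
opens = [420, 310]        # hypothetical numbers, to be replaced with the actual results
sent = [1500, 1500]

# Test whether the targeted group's rate is larger than the random group's
stat, p_value = proportions_ztest(count=opens, nobs=sent, alternative='larger')
print(f'z = {stat:.2f}, p = {p_value:.4f}')
# A small p-value would support the claim that the targeted list responds at a higher rate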
I can then analyse the associated accounts, and this information can be used to market to them accordingly.