I would like to understand exactly why this project creates value. After discussion with Chris and various stakeholders, I see the value lying in three areas.
We discussed and defined a range of metrics thought to be potential predictive factors. These can be found on this Tableau Dashboard, in particular on the 1) Overview, 2) Categories, and 3) Companies tabs.
I came to this session equipped with technical questions, as well as general questions about logistics and ways of working. Some are below:
Q. What distinguishes China from the USA/Europe? Why would someone who typically attends Money 20/20 USA/Europe think about attending China?
A. It was clarified that the reasons for attending China concern anything related to doing business in China, such as networking with the local FinTech community, attending seminars and learning.
Q. Why would this approach help?
A. Firstly, it would help us be a more data-driven organisation. It has also been suggested that spamming delegates can have an adverse effect - people actually disengage, unsubscribe etc. (is this true?).
Q. What are some current approaches?
A. Generally, just try to target as many people as possible (mass-emailing), unless there are specific comms going out - e.g. European delegates might be sent something specific about Europe - but in general it's about hitting as many people as possible.
Q. Do you agree with the project proposal - i.e. the three points above?
A. Action: we will go away and write a structured proposal, laying out time-frames for data cleansing, modelling and results visualisation.
Q. What is the most effective way to work?
A. Weekly updates, and sharing of data/dashboards/ideas on a 'when it's relevant' basis.
In the points below, I outline key milestones that I wish to hit by the target date. I have excluded Visualisation and Testing as they are ongoing processes present at each stage.
I have started to source relevant data sets, and have identified with Collins/Muyideen where such data fields might come from. Certain fields are definitely available, some are available but need cleaning or mapping, some are available for the US but not Europe, and some were discarded entirely.
My initial impression of the data warehouse is positive: the data are dense and well used, all required data sets are keyed, and dimensions can be joined easily and efficiently. As I understand it, there are no issues with the functionality of the DW; rather, the issues lie with the data therein. There are, for example, Opportunities that are Closed Won but later become Closed Lost (and not simply cancellations), and Test Accounts that aren't easily identifiable.
I connected through Tableau and started exploring.
Initial concerns:
These concerns will likely lead to difficulties in joining data together - particularly as USA/Europe/China Salesforce data are stored in different DBs. Different companies were also responsible for collecting registration data (Lenos vs. Events1), which leads to different data being collected, different fields, different formats etc.

Lorna created an excellent table for USA delegates (2014 onwards), which links nicely to a delegate pass and an Opportunity ID. Collins has also worked on a number of tables powering various delegates/SPEX dashboards, but they are still being tested, do not contain all of the fields I need, only contain data for either Europe or USA, and cannot be linked to an Opportunity ID. I have tried to replicate Lorna's table for USA 2012-13 data, Europe 2016-18, and Asia, but this is proving a challenge.
At a later date, I became aware that all data will be migrated into a single Salesforce instance - which would make things easier. I am not currently aware of any existing DBs containing China data - need to speak to China team (Rannie, Alice) about this.
I have pieced together the USA 2014-18 data, along with USA 2012-13, Europe 2016-18 and Asia 2018. Despite its drawbacks, I opted to join a given Delegate (from FactEventMember) to an Opportunity based on the Event and Account IDs.
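To make the join concrete, here is a rough pandas sketch of it; the file names and column names (EventId, AccountId, etc.) are placeholders rather than the exact DW schema.

import pandas as pd

# Hypothetical extracts from the DW; real field names differ per Salesforce instance
fact_event_member = pd.read_csv('fact_event_member.csv')   # one row per delegate pass
opportunities = pd.read_csv('opportunities.csv')            # one row per Opportunity

# Join a delegate to an Opportunity on the Event and Account keys
delegate_opps = fact_event_member.merge(
    opportunities,
    on=['EventId', 'AccountId'],
    how='left',
    suffixes=('_delegate', '_opp'),
)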
While Collins has produced some excellent Views in the DW, they do not make it possible to link a given Delegate to a given Opportunity, and so I worked on this myself (cherry-picking from his and Lorna's prior work). They also only cover Europe, and so this process would have to be replicated on the US Salesforce instance regardless.
Ideally, the following processes would be projects in their own right. I am able to do them to some extent, but embedding them into automated processes requires more resources.
I have spoken with Collins and Diana, and we have run some checks. There are some awkward scenarios, and it is very difficult to be confident in all results. For example, if someone switches company, are they the same person or a different person? Intuition would suggest they are different people, but in some instances they are treated as the same. On the flip side, there are examples of the same person attending the same show under multiple email addresses, or two different people attending the same show as the same person. These all need to be considered, and take time to work through.

Because delegates can buy multiple passes (e.g. 10, for other people), it is not known in advance exactly who those 10 tickets will be given to. As such, it isn't possible to simply tie a given delegate back to a given ticket at a given price. Worse still, if the tickets are Sponsored there is no assigned price for them; they simply come within the overall package. There are also cases where an Opportunity has been won, the money is in the bank, but it is later Closed Lost (without being a cancellation). It would be wise to discard these individuals, but they cover a large number of important clients, so a price must be assigned to them instead. Either we could set the 'flat' average price, or we could assign the price of a ticket at the time of purchase. Either way, it will not be an accurate representation.
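As a minimal sketch of the duplicate-person checks described above (field names are illustrative, and the real logic also has to handle the company-switch cases), a first pass might look like:

import pandas as pd

delegates = pd.read_csv('delegates.csv')  # illustrative extract: one row per pass purchased

# Normalise the obvious fields before comparing
delegates['email_norm'] = delegates['Email'].str.strip().str.lower()
delegates['name_norm'] = delegates['FullName'].str.strip().str.lower()

# Same email attending the same show more than once: likely the same person, duplicated
delegates['possible_duplicate'] = delegates.duplicated(subset=['email_norm', 'EventId'], keep=False)

# Same name at the same show under different emails: needs manual review
multi_email = (delegates.groupby(['name_norm', 'EventId'])['email_norm']
                        .nunique()
                        .reset_index(name='n_emails'))
needs_review = multi_email[multi_email['n_emails'] > 1]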
Some of these issues can be found on this Tableau Dashboard, on the Data Issues tabs.
Thoughts about RapidMiner: if deployed correctly it could be incredibly useful, and complement all data analysis/science projects. It is difficult to use at present, however, since the trial version limits analysis to 10k rows.
Chris first suggested sending out an email campaign at this point. It could potentially be a powerful proxy for understanding delegates who would be interested in attending China. Obviously, the only way of knowing whether our results are correct is if we can successfully predict whether a given delegate actually buys a ticket - unlikely any time soon, since the show isn't until autumn.
Constructing this as a classic ML problem, we could send a campaign to our target audience with a survey attached asking 'Are you interested in attending China?'. Respondents who answer 'Yes' would have their y-variable labelled 1; those who answer 'No', 0. Starting with some simple regression techniques, we might get a feel for the most influential variables, and perhaps learn some more. In event analysis scenarios, attending the most recent show is a well-known predictor, for example.
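A minimal sketch of that framing, assuming a hypothetical extract of delegate features already joined to the survey responses (all column names below are placeholders):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('delegates_with_survey.csv')        # hypothetical joined extract

# Label: 1 if the delegate answered 'Yes' to the China survey, 0 if 'No'
df['y'] = (df['survey_response'] == 'Yes').astype(int)

# Placeholder feature names covering the kinds of metrics discussed above
feature_cols = ['attended_last_show', 'n_shows_attended', 'n_continents', 'china_seminar_flag']
X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['y'], test_size=0.2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Coefficients give a first feel for which variables carry the most weight
print(dict(zip(feature_cols, model.coef_[0])))
print('Hold-out accuracy:', model.score(X_test, y_test))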
However, we must note the inherent differences between the US/Europe/Asia shows, and China. They are different shows.
I have produced a dashboard that contains an overview of all Money 20/20 events that I was able to successfully patch together.
Dash: Link to Tableau Dashboard.
In two weeks, I have sourced, extracted, joined and processed (filtered, cleaned, deduplicated) all delegate data for all shows. It has also been pushed into Tableau, where visualisation has begun. I have created charts showing the respective shows attended by each delegate, the number of delegates attending multiple shows and multiple continents, how much money they spent, etc. I have not yet linked up Marketo and Seminar data, but those data are in one location, and so should be easy to incorporate.
I was then slightly concerned to receive the following emails:
Hi both,
Thank you for this. Can I just get clarification on when I’ll actually have a database to market to? We have our first milestone on March 23rd so I need data asap.
The plan makes sense in terms of a longer-term, robust set-up (although I’d like to talk through to make sure I fully understand), but it seems far too slow to get me to a place asap where I can actually market to the West. What can we do in the short-term to get data that we can at least test our way through?
Hi Jack,
I’m really concerned about the timings of the project – I need data today. I can’t wait until mid-May as I have a delegate milestone mid-March and we have to start marketing now. I need a workaround – a quick way to build a smaller database I can start testing with in the next week.
For me, this is where the project took a considerable change in direction - two weeks after submitting the proposal. I felt ahead of schedule, with a fairly solid data set of 30-40k historical delegates ready to be analysed. The next logical step would have been to test the data, after which I could begin compiling a list.
I was comforted by the fact that Therese agreed 'it makes sense long-term', but this seems to be the case with every project - something is needed now, but providing a short-term workaround comes at the cost of a robust, long-term strategy.
The main objective, stated clearly above, was to create a model that predicts the likelihood of attendance from key features in the historical data. From this point onwards, the intelligent layer was overlooked in favour of 'rule-of-thumb' metrics. I had spent a considerable amount of time bringing everything in our initial list into the mix. For example, the final list of delegates is based on just three data points: did Delegate A attend the same show more than once, did they attend multiple shows, and did they attend Seminars about China. This does not require data science skills, and could have been completed by any analyst.
I suppose this is where I began to find it difficult to deliver. On one hand I had pieced the data together and had only just started to analyse it (i.e. the interesting bit), but was essentially asked to do something else. For me, the value in the project lay with learning how heavily all of the metrics we have talked about influence the purchasing of tickets. If we can understand what would lead someone to buy a ticket, we can target our marketing far more effectively.
I have produced a dashboard of delegate attendance at Seminars at Money 20/20 Vegas 2017. This could help direct personalised marketing to delegates who attended specific seminars. It would be an interesting hypothesis to test whether the mean time-to-open, open rate and click rate were affected by emails that were tailored to user interests, inferred from the types of seminars they attended.
Link to Tableau Dashboard
The idea behind adding Seminar data as a feature is straightforward, but requires some scripting. This is where RapidMiner could have been really useful - rather than doing this ad hoc, I could have scheduled a job that connects to the view created in the DW and the Seminar data saved locally, processes and joins them, and outputs the results as a Tableau extract. Instead, I had to write my own Python script, output to CSV, and connect to that CSV in Tableau. This could potentially all be automated. There is a field in the data called 'Theme' which classifies each Seminar into a Theme (such as Blockchain, AI, Globalisation etc.). I appended a flag for each delegate if they attended a Globalisation seminar, and wrote another function that appends a flag if a delegate attended a Seminar with the word 'China' in it.
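The script itself amounts to little more than the sketch below. 'Theme' is a real field; the file and remaining column names are how I refer to them here rather than the exact source names.

import pandas as pd

seminars = pd.read_csv('seminar_attendance.csv')   # one row per delegate per seminar attended

# 'Theme' classifies each Seminar (Blockchain, AI, Globalisation, ...)
seminars['globalisation_flag'] = (seminars['Theme'] == 'Globalisation').astype(int)

# Flag seminars with 'China' anywhere in the title
seminars['china_flag'] = seminars['SeminarTitle'].str.contains('China', case=False, na=False).astype(int)

# Roll up to one row per delegate: did they attend *any* such seminar?
delegate_flags = (seminars.groupby('DelegateId')[['globalisation_flag', 'china_flag']]
                          .max()
                          .reset_index())

delegate_flags.to_csv('delegate_seminar_flags.csv', index=False)  # the CSV connected to in Tableau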
I sought to compute four metrics for each delegate across all Money 20/20 emails: Mean Time to Open, Mean Time to Click, Open Rate and Click Rate. I also appended four flags: one if the delegate opened the specific 'Are you Interested in China' campaign, another if they opened any China-related e-mail, and likewise for clicks on each.
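In rough pandas terms, assuming a Marketo activity extract with one row per delegate, per email, per activity (column names are illustrative; the 'Deliver'/'Open'/'Click' activity types mirror those used in the query later on), the metrics reduce to:

import pandas as pd

activity = pd.read_csv('marketo_activity.csv')   # DelegateId, EmailId, ActivityType, ActivityTime
activity['ActivityTime'] = pd.to_datetime(activity['ActivityTime'])

# One row per delegate per email, with the earliest time of each activity type
times = (activity.pivot_table(index=['DelegateId', 'EmailId'],
                              columns='ActivityType',
                              values='ActivityTime',
                              aggfunc='min')
                 .reset_index())

times['time_to_open'] = times['Open'] - times['Deliver']
times['time_to_click'] = times['Click'] - times['Deliver']

per_delegate = times.groupby('DelegateId').agg(
    mean_time_to_open=('time_to_open', 'mean'),
    mean_time_to_click=('time_to_click', 'mean'),
    open_rate=('Open', lambda s: s.notna().mean()),    # unique emails opened / unique emails delivered
    click_rate=('Click', lambda s: s.notna().mean()),
)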
Using these metrics in combination with Seminar Data could offer a rich picture of delegates - an 'Interest in China' score, perhaps.
An analysis of email correspondence can be found on this Tableau Dashboard.
It recently occurred to me that two definitions of Open Rate and Click Rate are being used across the business: one at the campaign level and one at the delegate level. For campaigns, I use the former; for delegates, the latter.
The former is calculated as follows: divide the total number of unique individuals who opened an e-mail by the total number of unique individuals who received that e-mail. Usually, everyone receives the campaign at the same time on the same day, and it is generally only sent once. The Open Rate depends on time - i.e. on the day of the campaign, 100 people might open the e-mail, 30 the day after, and 5 the day after that. This sort of decay in the number of unique opens is typical. Click Rate follows the same pattern, and can only be equal to or below the Open Rate at any time, since it is not possible to click an e-mail without first opening it.
The latter is calculated at the delegate level: the number of unique e-mails opened over a given period of time, divided by the number of unique e-mails received. It can be used as an indicator of customer engagement, but care clearly needs to be taken when analysing results. E.g. someone who has only ever opened and clicked a single email (having only received one) would have a 100% Open and Click Rate. Comparing this to someone with a 75% Open Rate and 30% Click Rate who has been delivered 100 e-mails would suggest, using these metrics alone, that the former is more engaged. The latter is, in fact, a far more consistent and reliable responder.
I confirmed to Chris that I would have a Final List by the end of the week (13th March). We concluded that sending out the campaign by the 19th/20th March would be a good target.
Update: it is now the 29th March (16 days).
Update: 5th April (23 days).
Update: 9th April (27 days)... still waiting.
Therese and Tim were asked for a creative pertaining to China, to which our survey would be appended. We discussed how many delegates should be in the target list. I didn't understand why we couldn't send the email to everyone - after all, we want to know about anyone and everyone we could target. I was made aware that the team didn't wish to cannibalise sales from Europe/Asia, but I did think it was implicit in the project that we wanted to understand the likelihood for all delegates. I am not a marketer, and did not foresee the significance of sending campaigns at particular times of year etc. Had we known this in advance, we could have been more strategic in our targeting, scheduling emails at different times, etc.
A list of 5000 was proposed, with 2500 sampled uniformly at random, and 2500 strategically targeted. This was later reduced to 3000, without explanation.
Hi both,
Right, there is a view from Tracey and co that we need to move a lot faster on data for both Europe and China. Conscious that the China project requires a fair bit more work.
Is the target group ready? And the survey?
Bearing in mind that we’ve already sent a really simplified version of a survey to our US audience to gauge interest, could we not move things along faster with a quick and basic approach to start with – could we not build a profile of the 160 or so people that said they were interested in China and extrapolate that to build data? And also look at customers who attended both the US and Asia shows?
As far as I am aware, the last paragraph is a different question. I also had to clarify what 'building data' meant - we can either analyse existing delegates, or we can use that knowledge to build a profile of an ideal delegate and buy data accordingly. By this point, my target list (1500 targeted, 1500 random) had been created, based on the three features discussed, i.e. whether someone attended an event more than once, attended different events, and attended seminars related to China and/or globalisation.
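For completeness, the list construction itself is simple enough to sketch in a few lines, assuming the delegate-level table built earlier with the three features as flag columns (the names, and the exact way the flags are combined, are illustrative):

import pandas as pd

delegates = pd.read_csv('delegate_features.csv')   # one row per unique delegate

# The three rule-of-thumb features used for targeting
targeted_pool = delegates[
    (delegates['attended_same_show_more_than_once'] == 1)
    | (delegates['attended_multiple_shows'] == 1)
    | (delegates['attended_china_or_globalisation_seminar'] == 1)
]

target_list = targeted_pool.sample(n=1500, random_state=42)
# Random control group drawn from delegates not already in the targeted list
random_list = delegates.drop(target_list.index).sample(n=1500, random_state=42)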
Hi Both,
My concern is that it will confuse the US contacts - who are also being asked to buy tickets to the US show.
Can I see the survey so I know what copy we need to introduce it?
And back to the 160 delegates who already told us they are interested - can we build some lookalike audiences on that?
Hi Chris, Jack,
I had Tim run a quick pull on this test data and 68% are Europe – that’s too high right now.
I would like to send next week Thurs but can we please reduce Europe count?
Creative coming tomorrow.
Thanks,
Therese
These two e-mails imply that we cannot market to USA delegates because it will confuse them, and that we cannot market to Europe because the Europe count is too high. I did not understand why, and was not given an explanation. If anything, 68% seems quite low considering we are talking about 1500 of our most likely delegates, and that a considerable number of them attended both a US and a European show.
As for the '160 delegates who told us they are interested - can we build some lookalike audiences' question, I believe it to be an ad-hoc request. That data is meant to be incorporated into the wider model, but analysing those delegates specifically is a separate task. I explain in Section 2iii exactly how I wanted to use this data. The ad-hoc analysis can be found on this Tableau Dashboard.
At our next meeting, I was asked to exclude everyone who had ever attended a European event. The reason given was that their attention would be diverted, and we would risk cannibalising sales for the European show.
At this point, I attempted to divert attention back to the project brief, where it was clearly stated that the purpose of an intelligent model is to learn from all of our data, not just a hand-picked subset of it.
In a follow-up e-mail, I was given a 10-minute time-frame to add multiple 'flags' to the CSV (supplied the night before), indicating whether or not a delegate was in my target list.
I was a little frustrated at this point upon realising that this was no longer a predictive modelling project. Alongside refining and updating some of the existing dashboards, I have started a basic machine learning project based on the Python below. Since my GitHub repo isn't private, I can't use the real data within this report; a sample subset is used instead.
I decided to predict the probability (framed as a regression) of an existing delegate attending Money 20/20 USA 2017. I set out to use metrics from all of their historical data, across Salesforce and Marketo. To begin with, Linear and Logistic regression were used, but I hope to expand upon these. Unfortunately, Seminar data is only available for 2017, and so we can't use it in this model. It might be really interesting to re-run this model on 2018 data when the time comes, and include such data!
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# '/data.csv' and the 'attended_2017' column name are placeholders for the sample subset
data = pd.read_csv('/data.csv')

# Remove rows for which we don't have enough data
# Fill in any missing fields with the right values: either 0, the average, or something else
# Compute derived features
# Transform any features (e.g. logs, square roots, reciprocals)
# One-hot encode any categorical features
# Normalise the features

# y-variable: did the delegate attend Money 20/20 USA 2017?
features = data.drop(columns=['attended_2017'])
target = data['attended_2017']

num_epochs = 1000
mse_all = []
for i in range(num_epochs):
    # Randomise into a fresh 80/20 train/test split each run
    x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
    model = LinearRegression().fit(x_train, y_train)
    predictions = model.predict(x_test)
    mse_all.append(mean_squared_error(y_test, predictions))

avg_mse = np.mean(mse_all)
print(avg_mse)
Looking at the results, we see that X% of the variation can be explained by the metrics chosen. It seems that attending the previous year is a good predictor of attendance, as is having attended multiple shows in the past. Being a C-level person or a company Founder is also a strong predictor.
I compute a linear regression over 1000 training/test set combinations, split in an 80:20 proportion. The mean squared error over these runs is X, and an ROC curve is shown below demonstrating that the model is considerably better at predicting whether a given delegate will attend than selecting at random.
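Since the chart itself can't be reproduced here without the data, below is a hedged sketch of how the comparison against random selection can be generated from the logistic model, re-using the final train/test split from the cell above (the use of scikit-learn and matplotlib here is my own choice):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Re-using x_train, x_test, y_train, y_test from the final split above
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
probs = clf.predict_proba(x_test)[:, 1]

print('AUC:', roc_auc_score(y_test, probs))   # 0.5 would be no better than random

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='Logistic regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random selection')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()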
The only issue I have is that China is a completely different show. The underlying characteristics of Money 20/20 China might be fundamentally different to those of Money 20/20 USA.
I thought it was obvious from prior discussion that the only thing we want to gauge is interest in attending China. I did not anticipate having to actually create the survey in Qualtrics - I thought this came under the remit of the marketing team, as we supplied them with our questions. Either way, after some refinement/suggestions from Chris, it was ready to go that afternoon.
It was mentioned towards the end of the meeting that an e-mail was recently sent to 19,000 delegates in the Asia database. In response, 46 individuals proceeded to purchase tickets for China. I was unaware of such a campaign, but this could/should have been the focal point of the project:
"Of the 19,000 e-mailed, what was it about them that led to the purchasing of a ticket?".
If I had known an e-mail of this importance was going to be sent out to such a large audience, directly encouraging individuals to buy tickets for China, I could have run some predictive models on those specific delegates. If I had known it would only be sent to the Asia database, there would be no need for any complex data querying.
SELECT PersonId, AccountId, DateTime, Activity, Dim1, etc
FROM AscentialEventsDW.marketing.FactLeadActivity FLA
INNER JOIN Dimension1 DIM1
    ON FLA.Key = DIM1.Key
INNER JOIN etc
WHERE PrimaryAttributeValue = MessageId
    AND ActivityType IN ('Deliver', 'Open', 'Click')
I will connect directly to the results in Tableau. The first step will be to identify who opened and who clicked the email, and run a quick open/click rate analysis. I will then look at the Qualtrics responses (if any) and try to extract the data in a numerical format, starting with the responses of existing delegates, and then looking at delegates who attended multiple shows, multiple continents, and relevant seminars.
Overall though, we want to prove that the engagement/response of delegates in the target list is higher than that of the randomly selected list. If we can prove this (and that the difference is statistically significant), we can categorically say that by targeting delegates in this way, we have achieved a greater response/conversion rate. Obviously, we can only use this as a proxy, and the actual conversion rate will likely be lower once actual money is involved.
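Once the responses are in, a two-proportion z-test on open (or response) rates between the two groups would be a natural way to check for a statistically significant difference. A sketch, with placeholder counts standing in for the real campaign results:

from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts: opens and list sizes for the targeted and random groups
opens = [420, 310]        # hypothetical numbers, to be replaced with the actual results
sent = [1500, 1500]

# Test whether the targeted group's rate is larger than the random group's
stat, p_value = proportions_ztest(count=opens, nobs=sent, alternative='larger')
print(f'z = {stat:.2f}, p = {p_value:.4f}')
# A small p-value would support the claim that the targeted list responds at a higher rate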
I can then analyse the associated accounts, and this information can be used to market to them accordingly.