Author: pascal@bayes.org
Date: 2017-08-13
In July 2017, the WorkUp team gave us programmatic access to their events. This notebook explores how we could use it to show examples of professional events to our users, to motivate them to attend some of them.
The whole dataset is directly downloadable at https://www.workuper.com/events/index_json.json, and it is also available via the command docker-compose run --rm data-analysis-prepare data/workuper.json.
First, let's load the JSON file:
In [1]:
import json
import os
from os import path
import pandas as pd
DATA_FOLDER = os.getenv('DATA_FOLDER')
events = pd.read_json(path.join(DATA_FOLDER, 'workup.json'))
events.head()
Out[1]:
Cool! Before exploring each individual field, let's see how many events there are, and whether those fields are always set:
In [2]:
events.describe(include='all').head(3)
Out[2]:
Hmm, there are not that many rows: only 13. However, all fields seem to be set, except for favorite, which is never set.
From a quick glance at the data above, we can classify the fields as useful, irrelevant, or to be explored further.

The ones that seem directly useful:

- title
- address, combined with latitude and longitude
- date and dateend
- organiser

And then, to a lesser extent (too detailed for what we want to do with them):

- description
- subscription_link
- website
- time

However, the following ones are irrelevant to us, as they seem only useful for the WorkUp database:

- favorite
- id
- status
- created_at
- updated_at
- user_id

So that leaves 3 fields that we should explore a bit more: category, price and slug.
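The classification above suggests a simple cleanup step: dropping the WorkUp-internal columns before going further. A minimal sketch, using a hypothetical sample row in place of the real dataset:

```python
import pandas as pd

# Hypothetical sample row standing in for the real dataset.
events = pd.DataFrame({
    'title': ['Salon des 10 000 emplois'],
    'favorite': [None],
    'id': [42],
    'status': ['published'],
    'created_at': ['2017-06-01'],
    'updated_at': ['2017-07-01'],
    'user_id': [7],
})

# Drop the fields that are only meaningful inside the WorkUp database.
INTERNAL_FIELDS = ['favorite', 'id', 'status', 'created_at', 'updated_at', 'user_id']
events = events.drop(INTERNAL_FIELDS, axis=1)
```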
Let's check quickly that the obvious fields have useful values. The titles:
In [3]:
pd.options.display.max_colwidth = 100
events.title.to_frame()
Out[3]:
Perfect, we can show it directly to our users as a title. Note that the title frequently includes the city, and sometimes the timing. We can also see that the capitalization is less than ideal.
The addresses:
In [4]:
events[['address', 'latitude', 'longitude']]
Out[4]:
Quickly comparing two addresses in Bordeaux, we can see that the latitude/longitude pair is probably exact (pretty cool). For our application we could filter out events that are too far from the user's target city. As the address is not always formatted the same way, we will rely mainly on the lat/lng.
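The "not too far from the user's target city" filter could be sketched with a haversine distance on the lat/lng fields. The coordinates below are made up for illustration, and the 50 km threshold is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical sample events (roughly Bordeaux and Paris).
events = pd.DataFrame({
    'title': ['Event in Bordeaux', 'Event in Paris'],
    'latitude': [44.8378, 48.8566],
    'longitude': [-0.5792, 2.3522],
})

def distance_km(lat1, lng1, lat2, lng2):
    """Approximate great-circle distance using the haversine formula."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Keep only events within 50 km of the user's target city (here: Bordeaux).
user_lat, user_lng = 44.84, -0.58
nearby = events[distance_km(events.latitude, events.longitude, user_lat, user_lng) < 50]
```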
The dates:
In [5]:
events[['date', 'dateend']]
Out[5]:
The dates are all in the future, which probably indicates that WorkUp already filters out past ones. Note that dateend is almost always the same as date, which probably indicates that those events last only one day. For our purpose we will ignore dateend for now.
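Even if WorkUp seems to filter past events already, we could keep a defensive filter on our side. A minimal sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical sample dates; in the real dataset they all seem to be upcoming.
events = pd.DataFrame({
    'title': ['Past event', 'Upcoming event'],
    'date': pd.to_datetime(['2017-07-01', '2017-09-01']),
})

# Defensive filter in case WorkUp ever stops removing past events itself.
today = pd.Timestamp('2017-08-13')
upcoming = events[events.date >= today]
```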
The organisers:
In [6]:
events.organiser.to_frame()
Out[6]:
As for the title, this is pretty clean and useful.
The description, exact time, website, and direct subscription_link are not required for what we want. Our goal is not to speed up our users' subscription to a specific event, but to have them realize that there are many such events and that they should dig deeper into this way of enlarging their network or improving their job search.
However, it would be useful for them to get all those details from a secondary page if they wanted to. On the WorkUp website, there are pages with full details, accessible at a URL like this one: https://www.workuper.com/events/salon-des-10-000-emplois-marseille. The good news is that the dataset contains the last part of the URL in the slug field, so we will keep this field to rebuild the full page URLs.
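Rebuilding the full page URL from the slug is then a one-liner; a sketch using the slug from the example URL above:

```python
import pandas as pd

# Hypothetical row containing the slug from the example URL above.
events = pd.DataFrame({'slug': ['salon-des-10-000-emplois-marseille']})

# Rebuild the full details-page URL from the slug field.
events['link'] = 'https://www.workuper.com/events/' + events.slug
```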
Let's check the price field: we could decide to hide events that are not free, or at least warn our users early.
In [7]:
events[['title', 'price']]
Out[7]:
Cool! Most of them are free. However, some are not, and one of them is really expensive. Here we would need a product decision on whether and how to show them.
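One possible product decision could be sketched like this, assuming the price field can be read as a numeric amount in euros (the values below are made up):

```python
import pandas as pd

# Hypothetical prices, assuming the field is a numeric amount in euros.
events = pd.DataFrame({
    'title': ['Free meetup', 'Cheap workshop', 'Expensive conference'],
    'price': [0, 10, 500],
})

# Show free events directly, and flag the paid ones for a warning.
events['is_free'] = events.price == 0
free_events = events[events.is_free]
```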
Finally, let's check the category field:
In [8]:
events.category.iloc[0]
Out[8]:
Whoops, this looks like a JSON-encoded string (which means there was double JSON encoding, as we already decoded once when creating the dataset). Let's decode it:
In [9]:
events['categories'] = events.category.apply(json.loads)
events.categories.iloc[0]
Out[9]:
Now let's list all the available categories:
In [10]:
all_categories = set(c for categories in events.categories.tolist() for c in categories)
all_categories
Out[10]:
Great! Some of them match exactly some questions that our users have answered, so we could directly filter the events that might interest them.
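Such a filter could keep the events sharing at least one category with the user's interests. A minimal sketch, with hypothetical category names standing in for the real ones listed above:

```python
import pandas as pd

# Hypothetical events with already-decoded category lists.
events = pd.DataFrame({
    'title': ['Job fair', 'Freelance Day'],
    'categories': [["Recherche d'emploi"], ['Freelance']],
})

# Keep events sharing at least one category with the user's interests.
user_interests = {"Recherche d'emploi"}
matching = events[events.categories.apply(
    lambda cats: bool(user_interests & set(cats)))]
```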
Despite the very small amount of data today, the API provided by WorkUp is perfect for us and could easily be integrated into Bob Emploi.
A few things that WorkUp could fix (apart from getting more events across the country):

- make sure the title field stays clean (the capitalization is currently inconsistent);
- remove the double JSON encoding of the category field.

After that, there are many fields that we can use out of the box to display the events, and a few others that can be used to filter them as appropriate for a given user:
- latitude and longitude, to narrow the list of events to the ones close to the user;
- price, to select only the free or cheap ones;
- category, to select the ones that match what the user is trying to do.

Ultimately we might also want to filter out some events that are linked to certain industries or certain kinds of jobs (e.g. "Freelance Day"), but as most events are actually fairly generic for now, this is not a priority.