At what time of day are events triggered, and how does this relate to user demographics? This notebook is a fork of Russ Williams's kernel, which investigated these questions earlier.
That kernel compared the activation times of events for different age and gender groups. In this notebook, we take an alternative approach and determine the age distribution for different times of the day.
In [88]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import seaborn as sns
%matplotlib inline
from scipy import sparse
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,FunctionTransformer
#path to data and features
DATA_PATH = "../../../input/"
As a first step, we load the events data, which records when each event was fired. We are only interested in the timestamp and device_id columns. The timestamp column is converted to datetime.
In [7]:
events = pd.read_csv('{0}events.csv'.format(DATA_PATH)).loc[:, ['timestamp', 'device_id']]
events['timestamp'] = pd.to_datetime(events['timestamp'])
We represent the time of day in fractional hours.
In [16]:
def fract_hour(time):
    # Convert a timestamp to fractional hours, e.g. 10:30:00 -> 10.5
    return time.hour + time.minute / 60.0 + time.second / 3600.0
events['hour'] = events['timestamp'].apply(fract_hour)
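The same quantity can also be computed without a row-wise `apply`, using the pandas `.dt` accessor. The following is a minimal sketch, assuming a small hypothetical `sample` series in place of the real `events['timestamp']` column; the timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps standing in for events['timestamp']
sample = pd.Series(pd.to_datetime([
    "2016-05-01 00:00:00",
    "2016-05-01 10:30:00",
    "2016-05-01 22:15:36",
]))

# Vectorized fractional hour: hours + minutes/60 + seconds/3600
sample_hour = (sample.dt.hour
               + sample.dt.minute / 60.0
               + sample.dt.second / 3600.0)
print(sample_hour.tolist())
```

On a table with millions of events, this form usually runs considerably faster than calling a Python function once per row.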
Before delving into any more refined analysis, we visualize the distribution of the time stamps.
In [194]:
ax = sns.distplot(events['hour'])
ax.set_title('Events by hour')
ax.set_xlim(xmin = 0, xmax = 24)
ax.set_xlabel('Hour of day')
Out[194]:
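We see event peaks in the morning at around 10 AM and in the evening at around 9 PM. The kernel density plot is misleading at the boundaries of the domain, because the estimate does not wrap around midnight and treats hours 0 and 24 as far apart. We provide a new perspective by treating 10 PM as hour -2, so that the axis runs from -2 to 22 and the late-night events are contiguous.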
In [222]:
events['hour_recentered'] = [((time + 2) % 24)-2 for time in events['hour']]
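To make the recentering concrete, here is a small sketch of the same modular arithmetic applied to a handful of hypothetical hour values (the `hours` array is illustrative and not taken from the data). It also uses NumPy's vectorized operators instead of a list comprehension.

```python
import numpy as np

# Hypothetical fractional hours on both sides of the 10 PM cut
hours = np.array([0.0, 10.0, 21.5, 22.0, 23.5])

# Shift by 2, wrap around at 24, then shift back by 2
recentered = ((hours + 2) % 24) - 2
print(recentered)
```

Hours before 10 PM are left unchanged, while 10 PM maps to -2 and 11:30 PM maps to -0.5.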
In [237]:
ax1 = sns.distplot(events['hour_recentered'])
ax1.set_xlim(xmin = -2, xmax = 22)
ax1.set_xlabel('Hour of day')
ax1.set_title('Events by hour -- recentered')
Out[237]:
Having developed a rough idea of the general distribution of event timing, we now investigate connections to the user demographics. For this, we subdivide the day into 3 periods and investigate the age distribution within each of these periods.
In order to use the demographics data, we need to join the gender_age_train dataset with the events dataset on the 'device_id' field.
In [239]:
age_sex = pd.read_csv('{0}gender_age_train.csv'.format(DATA_PATH)).drop('group', axis = 1)
age_sex_event = age_sex.merge(events, 'inner', on = 'device_id').drop_duplicates().drop('device_id', axis = 1)
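The effect of the inner join and the duplicate removal can be seen on a toy example. This is a sketch with hypothetical tables `demo` and `ev` that only mimic the column layout used above; the values are invented.

```python
import pandas as pd

# Hypothetical toy tables mimicking the columns used above
demo = pd.DataFrame({"device_id": [1, 2, 3],
                     "gender": ["M", "F", "M"],
                     "age": [25, 31, 40]})
ev = pd.DataFrame({"device_id": [1, 1, 2, 4],
                   "hour": [9.5, 9.5, 21.0, 3.0]})

# Inner join keeps only devices present in both tables
joined = demo.merge(ev, how="inner", on="device_id")

# Drop exact duplicate rows first, then drop the join key
joined = joined.drop_duplicates().drop("device_id", axis=1)
print(joined)
```

Device 3 has no events and device 4 has no demographics, so both disappear. The repeated event of device 1 collapses to one row because duplicates are removed while the device id is still present.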
Now, we subdivide the day into 3 bins (10 PM to 2 AM, 2 AM to 7 AM, and 7 AM to 10 PM) and generate violin plots for each time interval.
In [251]:
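# Left-closed bins [-2, 2), [2, 7), [7, 22), so events at exactly 10 PM (hour -2) are not dropped
age_sex_event['bin'] = pd.cut(age_sex_event['hour_recentered'], [-2, 2, 7, 22], right=False)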
ax = sns.violinplot(x="bin", y="age", data = age_sex_event)
ax.set_ylim(ymin = 18, ymax = 55)
ax.set_xlabel('Time of day')
ax.set_title('Age distribution by time of day')
Out[251]:
The violin plots are fairly similar, but we do observe that the late-night events (10 PM to 2 AM) skew toward younger users, whereas the early-morning events (2 AM to 7 AM) skew toward older users.
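A visual comparison of violins can be backed up by a numeric summary such as the median age per bin. The sketch below shows one possible way to do this on a hypothetical `toy` frame with invented values; on the real data, `age_sex_event` would take its place.

```python
import pandas as pd

# Hypothetical rows standing in for age_sex_event (values are made up)
toy = pd.DataFrame({
    "hour_recentered": [-1.0, 0.5, 3.0, 6.0, 12.0, 20.0],
    "age": [22, 24, 40, 44, 30, 32],
})

# Same bin edges as in the notebook
toy["bin"] = pd.cut(toy["hour_recentered"], [-2, 2, 7, 22])

# Median age per time-of-day bin
median_age = toy.groupby("bin", observed=False)["age"].median()
print(median_age)
```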
Finally, we take into account not only the age but also the gender. We see that for late-night activities, the female age median is smaller than the male age median, whereas the situation is reversed in the early morning.
In [250]:
ax_violin = sns.violinplot(x='bin', y='age', hue = 'gender', split = False, data = age_sex_event)
ax_violin.set_ylim(ymin = 18, ymax = 55)
ax_violin.set_xlabel('Time of day')
ax_violin.set_title('Age distribution by time of day and gender')
ax_violin.legend(bbox_to_anchor=(1.05, 1), loc=2)
Out[250]:
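The gender comparison can be tabulated in a similar way by grouping on both the bin and the gender. This is a sketch on a hypothetical `toy` frame with invented values, not a result from the data; the column name `F_minus_M` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for age_sex_event (values are made up)
toy = pd.DataFrame({
    "hour_recentered": [-1.0, 0.5, 3.0, 6.0, 12.0, 20.0],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age": [21, 26, 45, 41, 30, 32],
})
toy["bin"] = pd.cut(toy["hour_recentered"], [-2, 2, 7, 22])

# Median age for each (bin, gender) pair, with genders as columns
medians = (toy.groupby(["bin", "gender"], observed=False)["age"]
              .median()
              .unstack("gender"))

# Positive values mean the female median exceeds the male median
medians["F_minus_M"] = medians["F"] - medians["M"]
print(medians)
```

A sign change in the difference column between bins would correspond to the reversal described for the late-night and early-morning periods.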