This competition focuses on forecasting the future values of multiple time series, which remains one of the most challenging problems in the field. More specifically, it tests state-of-the-art methods designed by the participants on the task of forecasting future web traffic for approximately 145,000 Wikipedia articles.
For each time series, you are provided the name of the article as well as the type of traffic that the series represents (all, mobile, desktop, spider). You may use this metadata and any other publicly available data to make predictions.
Unfortunately, the data source for this dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
raw = pd.read_csv('../data/raw/train_1.csv.zip', compression='zip', encoding='iso-8859-1')
keys = pd.read_csv('../data/raw/key_1.csv.zip', compression='zip', encoding='iso-8859-1')
In [3]:
raw.head(15)
Out[3]:
In [4]:
keys.head(15)
Out[4]:
In [5]:
raw.describe()
Out[5]:
In [6]:
keys.describe()
Out[6]:
In [7]:
# Page views are integers, so downcast the float columns to save memory.
# Note: columns that contain NaNs cannot hold an integer dtype and will stay float.
for col in raw.columns[1:]:
    raw[col] = pd.to_numeric(raw[col], downcast='integer')
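Since the data source conflates missing values and genuine zero traffic (see the note above), it is worth knowing how much of each we are dealing with before modelling. A minimal sketch of one way to quantify this, assuming `raw` is loaded and downcast as above:
In [ ]:
# Fraction of cells that are NaN (no data reported) vs. exactly zero traffic.
date_cols = raw.columns[1:]
n_cells = raw[date_cols].size
nan_frac = raw[date_cols].isnull().sum().sum() / n_cells
zero_frac = (raw[date_cols] == 0).sum().sum() / n_cells
print('NaN fraction:  {:.3f}'.format(nan_frac))
print('Zero fraction: {:.3f}'.format(zero_frac))
The same counts computed per row can flag articles that were likely created partway through the training window, which deserve different treatment from pages with genuine zero traffic.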
Each row consists of a page identifier and one observation per date. The Page string packs in several useful pieces of information that we should extract.
Each Page has the format 'name_project_access_agent' (e.g. 'AKB48_zh.wikipedia.org_all-access_spider'). Before parsing it, we check how many underscore-separated tokens each Page splits into: we don't consistently get four, so the extra underscores must come from the article names themselves.
In [8]:
sns.distplot(raw['Page'].apply(lambda x: len(str(x).split('_'))), kde=False, bins=20).set_title("Underscore-separated tokens per Page")
Out[8]:
In [9]:
# Show we have only two kinds of agents
print(raw['Page'].apply(lambda x: str(x).split('_')[-1]).unique())
# Show we have only three kinds of accesses
print(raw['Page'].apply(lambda x: str(x).split('_')[-2]).unique())
# Show we have only a small set of Wikipedia projects
print(raw['Page'].apply(lambda x: str(x).split('_')[-3]).unique())
# General conclusion: the last three tokens are well-defined, so we can safely parse from the right
In [10]:
def parsePage(page):
    # Split on underscores and peel off the last three tokens; whatever
    # remains, rejoined, is the article name (which may contain underscores).
    parts = str(page).split('_')
    output = []
    output.append(parts.pop())      # agent
    output.append(parts.pop())      # access
    output.append(parts.pop())      # project
    output.append('_'.join(parts))  # pagename
    return pd.Series(output)

page_details = pd.DataFrame(raw['Page'].apply(parsePage))
page_details.columns = ["agent", "access", "project", "pagename"]
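Because the article name itself can contain underscores, a quick sanity check that the reverse parse is lossless is worthwhile. A minimal sketch, reassembling each Page from the parsed fields and comparing it with the original:
In [ ]:
# Rebuild 'name_project_access_agent' from the parsed fields; every row should match.
reassembled = (page_details['pagename'] + '_' + page_details['project'] + '_'
               + page_details['access'] + '_' + page_details['agent'])
print((reassembled == raw['Page']).all())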
In [11]:
page_details.describe()
Out[11]:
In [12]:
df = pd.concat([raw, page_details], axis=1)
In [13]:
df.to_csv('../data/processed/df.csv', encoding='utf-8', index=False)
In [14]:
test = pd.read_csv('../data/processed/df.csv')
In [15]:
test.head()
Out[15]:
In [16]:
# found a lovely function for graphing in https://www.kaggle.com/dextrousjinx/brief-insight-on-web-traffic-time-series
def graph_by(df, plot_hue, graph_columns):
    # Sum the daily traffic within each group, then average it per calendar month.
    train_project_df = df.groupby(plot_hue).sum(numeric_only=True).T
    train_project_df.index = pd.to_datetime(train_project_df.index)
    train_project_df = train_project_df.groupby(pd.Grouper(freq='M')).mean().dropna()
    train_project_df['month'] = 100 * train_project_df.index.year + train_project_df.index.month
    train_project_df = train_project_df.reset_index(drop=True)
    train_project_df = pd.melt(train_project_df, id_vars=['month'], value_vars=graph_columns, var_name=plot_hue)
    fig = plt.figure(1, figsize=[12, 10])
    ax = sns.pointplot(x="month", y="value", hue=plot_hue, data=train_project_df)
    ax.set(xlabel='Year-Month', ylabel='Mean Hits')
In [ ]:
project_columns = page_details['project'].unique()
access_columns = page_details['access'].unique()
agents_columns = page_details['agent'].unique()
In [ ]:
graph_by(test, "project", project_columns)
In [ ]:
graph_by(test, "access", access_columns)
In [ ]:
graph_by(test, "agent", agents_columns)