Evaluation

We make a first attempt at cross-validation for evaluating our model.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('../data/processed/df.csv', encoding='iso-8859-1')
keys = pd.read_csv('../data/raw/key_1.csv.zip', encoding='iso-8859-1', compression='zip')
sample_submission = pd.read_csv('../data/raw/sample_submission_1.csv.zip', encoding='iso-8859-1', compression='zip')

In [3]:
df.head()


Out[3]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31 agent access project pagename
0 2NE1_zh.wikipedia.org_all-access_spider 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 14.0 20.0 22.0 19.0 18.0 20.0 spider all-access zh.wikipedia.org 2NE1
1 2PM_zh.wikipedia.org_all-access_spider 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 9.0 30.0 52.0 45.0 26.0 20.0 spider all-access zh.wikipedia.org 2PM
2 3C_zh.wikipedia.org_all-access_spider 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 4.0 4.0 6.0 3.0 4.0 17.0 spider all-access zh.wikipedia.org 3C
3 4minute_zh.wikipedia.org_all-access_spider 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 16.0 11.0 17.0 19.0 10.0 11.0 spider all-access zh.wikipedia.org 4minute
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.0 11.0 27.0 13.0 36.0 10.0 spider all-access zh.wikipedia.org 52_Hz_I_Love_You

5 rows × 555 columns


In [4]:
keys.head()


Out[4]:
Page Id
0 !vote_en.wikipedia.org_all-access_all-agents_2... bf4edcf969af
1 !vote_en.wikipedia.org_all-access_all-agents_2... 929ed2bf52b9
2 !vote_en.wikipedia.org_all-access_all-agents_2... ff29d0f51d5c
3 !vote_en.wikipedia.org_all-access_all-agents_2... e98873359be6
4 !vote_en.wikipedia.org_all-access_all-agents_2... fa012434263a

In [5]:
keys['Page'].apply(lambda x: x.split('_')[-1]).unique()


Out[5]:
array(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
       '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
       '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
       '2017-01-13', '2017-01-14', '2017-01-15', '2017-01-16',
       '2017-01-17', '2017-01-18', '2017-01-19', '2017-01-20',
       '2017-01-21', '2017-01-22', '2017-01-23', '2017-01-24',
       '2017-01-25', '2017-01-26', '2017-01-27', '2017-01-28',
       '2017-01-29', '2017-01-30', '2017-01-31', '2017-02-01',
       '2017-02-02', '2017-02-03', '2017-02-04', '2017-02-05',
       '2017-02-06', '2017-02-07', '2017-02-08', '2017-02-09',
       '2017-02-10', '2017-02-11', '2017-02-12', '2017-02-13',
       '2017-02-14', '2017-02-15', '2017-02-16', '2017-02-17',
       '2017-02-18', '2017-02-19', '2017-02-20', '2017-02-21',
       '2017-02-22', '2017-02-23', '2017-02-24', '2017-02-25',
       '2017-02-26', '2017-02-27', '2017-02-28', '2017-03-01'], dtype=object)

In [6]:
sample_submission.head()


Out[6]:
Id Visits
0 bf4edcf969af 0
1 929ed2bf52b9 0
2 ff29d0f51d5c 0
3 e98873359be6 0
4 fa012434263a 0

Define the SMAPE evaluation metric: the mean of 200 * |y_true - y_pred| / (|y_true| + |y_pred|), with the 0/0 case defined as 0. It handles NaNs in the y_true array (they are ignored), but it assumes there are no NaNs in the y_pred array.


In [7]:
def smape(y_true, y_pred):
    # (|y_true| + |y_pred|) / 200, so diff below is already on the 0-200 SMAPE scale
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0  # define the 0/0 case as a perfect score
    return np.nanmean(diff)  # NaNs in y_true propagate into diff and are skipped here
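
As a quick sanity check (a sketch, not an original cell), this convention returns 0 when both values are 0 and skips positions where y_true is NaN:

y_true = np.array([10.0, 0.0, np.nan])
y_pred = np.array([10.0, 0.0, 5.0])
# first pair matches exactly -> 0; second pair is 0/0, set to 0 by convention
# (NumPy may emit an invalid-value warning); third pair has NaN in y_true and
# is dropped by np.nanmean
print(smape(y_true, y_pred))  # expected 0.0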

After making predictions, we must correctly build the submission file from the predicted dataframe.

First, let's verify that we can join page names between keys and the processed file.


In [8]:
# strip the trailing _date from each key to recover the page name
all_keys = keys['Page'].apply(lambda x: '_'.join(x.split('_')[:-1])).unique()
all_pages = df['Page'].values

In [9]:
print(len(all_keys))
print(len(all_pages))


145063
145063

In [10]:
print(len(np.intersect1d(all_keys, all_pages)) == len(all_keys))
print(len(np.intersect1d(all_keys, all_pages)) == len(all_pages))


True
True

So we can join on page names: each Page in keys is the corresponding Page in df with a _date suffix appended. We will therefore need a function that converts the test dataframe (one column per date) into a submission dataframe (one row per Page_date); see meltDataframe below.

SMAPE on the last 60 days of the dataset

Let's measure SMAPE on a test dataframe made of the last 60 days, using a prediction of 0 for every page name on those days.


In [11]:
# split into train and test at the last 60 days; the final 4 columns
# (agent, access, project, pagename) are metadata, not dates, so the test
# slice stops 4 columns before the end
train, test = df.iloc[:, 1:-64], df.iloc[:, -64:-4]
pagenames = df['Page'].values
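
A quick shape check (a sketch, not an original cell): train should cover the 490 days from 2015-07-01 through 2016-11-01, and test the 60 days from 2016-11-02 through 2016-12-31.

print(train.shape, test.shape)            # expected (145063, 490) (145063, 60)
print(test.columns[0], test.columns[-1])  # expected 2016-11-02 2016-12-31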

In [12]:
predictions = test.copy()
predictions[:] = 0

We need a function that takes a dataframe with one row per page name and one column per timestamp, and produces a dataframe with one row per Pagename_timestamp and a Visits column. This function maps our predictions dataframe to the submission format.


In [13]:
test.head()


Out[13]:
2016-11-02 2016-11-03 2016-11-04 2016-11-05 2016-11-06 2016-11-07 2016-11-08 2016-11-09 2016-11-10 2016-11-11 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 18.0 25.0 14.0 20.0 8.0 67.0 13.0 41.0 10.0 21.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 11.0 14.0 26.0 11.0 21.0 14.0 14.0 54.0 5.0 10.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 3.0 3.0 3.0 2.0 10.0 2.0 2.0 2.0 7.0 3.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 12.0 11.0 15.0 7.0 12.0 13.0 9.0 8.0 21.0 16.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 5.0 6.0 33.0 13.0 10.0 22.0 11.0 8.0 4.0 10.0 ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0

5 rows × 60 columns


In [14]:
predictions.head()


Out[14]:
2016-11-02 2016-11-03 2016-11-04 2016-11-05 2016-11-06 2016-11-07 2016-11-08 2016-11-09 2016-11-10 2016-11-11 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 60 columns


In [15]:
def meltDataframe(preds, pagenames):
    # unpivot from wide (one column per date) to long (one row per Page_date)
    t = preds.copy()
    t['Page'] = pagenames
    t = pd.melt(t, id_vars=['Page'], var_name='date', value_name='Visits')
    t['Page'] = t['Page'] + "_" + t['date']  # rebuild the Page_date key format used in keys
    return t[['Page', 'Visits']]

In [16]:
meltDataframe(test, pagenames).head()


Out[16]:
Page Visits
0 2NE1_zh.wikipedia.org_all-access_spider_2016-1... 18.0
1 2PM_zh.wikipedia.org_all-access_spider_2016-11-02 11.0
2 3C_zh.wikipedia.org_all-access_spider_2016-11-02 3.0
3 4minute_zh.wikipedia.org_all-access_spider_201... 12.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 5.0

In [17]:
meltDataframe(predictions, pagenames).head()


Out[17]:
Page Visits
0 2NE1_zh.wikipedia.org_all-access_spider_2016-1... 0.0
1 2PM_zh.wikipedia.org_all-access_spider_2016-11-02 0.0
2 3C_zh.wikipedia.org_all-access_spider_2016-11-02 0.0
3 4minute_zh.wikipedia.org_all-access_spider_201... 0.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 0.0

In [18]:
smape(meltDataframe(test, pagenames)['Visits'], meltDataframe(predictions, pagenames)['Visits'])


Out[18]:
197.31749392225692

Predictions with the per-page mean everywhere on the test dataset

A SMAPE of ~197 is close to the 200-point ceiling, which is expected when predicting 0 for mostly positive counts. How about, for each page name, we use its mean number of visits as a constant prediction instead? (Below, the mean is computed over the same 60-day test window, so this local score is somewhat optimistic.)


In [19]:
mean_per_row = test.mean(axis=1)
mean_per_row


Out[19]:
0          25.716667
1          31.433333
2           7.650000
3          15.433333
4          15.400000
5          24.866667
6          11.350000
7          57.716667
8          45.983333
9          28.016667
10         31.333333
11         15.700000
12          4.066667
13         14.350000
14         40.200000
15         26.883333
16         44.683333
17          9.133333
18         12.733333
19         11.050000
20         53.833333
21         19.566667
22         20.583333
23         35.816667
24        108.450000
25         16.366667
26         41.750000
27         11.200000
28         28.200000
29         10.283333
             ...    
145033     39.016667
145034     35.150000
145035     13.550000
145036      2.516667
145037      5.200000
145038     15.650000
145039     22.666667
145040      2.966667
145041      9.966667
145042      3.033333
145043     10.133333
145044     24.083333
145045      1.300000
145046      9.533333
145047      5.266667
145048      6.583333
145049     19.444444
145050      3.816667
145051     34.836364
145052     18.946429
145053     12.203704
145054     14.142857
145055      6.800000
145056     40.437500
145057      7.000000
145058      9.333333
145059           NaN
145060           NaN
145061           NaN
145062           NaN
Length: 145063, dtype: float64

There are NaNs in there; let's fill them with 0 and convert everything to integers.


In [20]:
mean_per_row = mean_per_row.fillna(0)
mean_per_row = mean_per_row.astype(int)  # plain int; np.int is deprecated in recent NumPy

In [21]:
predictions_mean = test.copy()
predictions_mean[:] = 1
predictions_mean = predictions_mean.mul(mean_per_row, axis=0)  # broadcast each page's mean across all 60 dates

In [22]:
smape(meltDataframe(test, pagenames)['Visits'], meltDataframe(predictions_mean, pagenames)['Visits'])


Out[22]:
43.174384194575651

That is a much better result; let's submit it to Kaggle to test the submission process.

Build a submission file that contains the per-page mean


In [23]:
submission_cols = pd.Series(pd.date_range('2017-01-01', '2017-03-01').strftime('%Y-%m-%d'))
submission = meltDataframe(pd.DataFrame({col: mean_per_row for col in submission_cols}), pagenames)
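
A quick size check (a sketch): the submission should contain one row per page per forecast date, i.e. 145063 × 60 = 8,703,780 rows, the same number of rows as keys.

print(len(submission), len(keys))  # expected 8703780 for both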

In [24]:
submission.head()


Out[24]:
Page Visits
0 2NE1_zh.wikipedia.org_all-access_spider_2017-0... 25
1 2PM_zh.wikipedia.org_all-access_spider_2017-01-01 31
2 3C_zh.wikipedia.org_all-access_spider_2017-01-01 7
3 4minute_zh.wikipedia.org_all-access_spider_201... 15
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 15

In [25]:
submission['Page'].head().values


Out[25]:
array(['2NE1_zh.wikipedia.org_all-access_spider_2017-01-01',
       '2PM_zh.wikipedia.org_all-access_spider_2017-01-01',
       '3C_zh.wikipedia.org_all-access_spider_2017-01-01',
       '4minute_zh.wikipedia.org_all-access_spider_2017-01-01',
       '52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider_2017-01-01'], dtype=object)

In [26]:
keys['Page'].head().values


Out[26]:
array(['!vote_en.wikipedia.org_all-access_all-agents_2017-01-01',
       '!vote_en.wikipedia.org_all-access_all-agents_2017-01-02',
       '!vote_en.wikipedia.org_all-access_all-agents_2017-01-03',
       '!vote_en.wikipedia.org_all-access_all-agents_2017-01-04',
       '!vote_en.wikipedia.org_all-access_all-agents_2017-01-05'], dtype=object)

In [27]:
submission[submission['Page'] == keys['Page'].head().values[0]]


Out[27]:
Page Visits
37206 !vote_en.wikipedia.org_all-access_all-agents_2... 3

In [28]:
keys[keys['Page'] == submission['Page'].head().values[0]]


Out[28]:
Page Id
111180 2NE1_zh.wikipedia.org_all-access_spider_2017-0... ff8c1aade3de

In [29]:
# index both frames on Page so the join below matches on page names, not row positions
submission = submission.set_index('Page')
keys = keys.set_index('Page')
keys


Out[29]:
Id
Page
!vote_en.wikipedia.org_all-access_all-agents_2017-01-01 bf4edcf969af
!vote_en.wikipedia.org_all-access_all-agents_2017-01-02 929ed2bf52b9
!vote_en.wikipedia.org_all-access_all-agents_2017-01-03 ff29d0f51d5c
!vote_en.wikipedia.org_all-access_all-agents_2017-01-04 e98873359be6
!vote_en.wikipedia.org_all-access_all-agents_2017-01-05 fa012434263a
!vote_en.wikipedia.org_all-access_all-agents_2017-01-06 48f1e93517a2
!vote_en.wikipedia.org_all-access_all-agents_2017-01-07 5def418fcb36
!vote_en.wikipedia.org_all-access_all-agents_2017-01-08 77bd08134351
!vote_en.wikipedia.org_all-access_all-agents_2017-01-09 5889e6dbb16f
!vote_en.wikipedia.org_all-access_all-agents_2017-01-10 5f21fef1d764
!vote_en.wikipedia.org_all-access_all-agents_2017-01-11 6f07e1b8815a
!vote_en.wikipedia.org_all-access_all-agents_2017-01-12 228e54b5dea0
!vote_en.wikipedia.org_all-access_all-agents_2017-01-13 da1b34963ed7
!vote_en.wikipedia.org_all-access_all-agents_2017-01-14 ab5ccefaa2db
!vote_en.wikipedia.org_all-access_all-agents_2017-01-15 cbf42873ebf1
!vote_en.wikipedia.org_all-access_all-agents_2017-01-16 ac67e35ed44e
!vote_en.wikipedia.org_all-access_all-agents_2017-01-17 88c098aa640d
!vote_en.wikipedia.org_all-access_all-agents_2017-01-18 7c72842a89d1
!vote_en.wikipedia.org_all-access_all-agents_2017-01-19 8ce002f2c329
!vote_en.wikipedia.org_all-access_all-agents_2017-01-20 5f72d9920560
!vote_en.wikipedia.org_all-access_all-agents_2017-01-21 f93afd7f5d9b
!vote_en.wikipedia.org_all-access_all-agents_2017-01-22 14011cb66f2d
!vote_en.wikipedia.org_all-access_all-agents_2017-01-23 0065551ac465
!vote_en.wikipedia.org_all-access_all-agents_2017-01-24 175f1872729e
!vote_en.wikipedia.org_all-access_all-agents_2017-01-25 31d756e83124
!vote_en.wikipedia.org_all-access_all-agents_2017-01-26 e186c2363c5e
!vote_en.wikipedia.org_all-access_all-agents_2017-01-27 3bce56c2b977
!vote_en.wikipedia.org_all-access_all-agents_2017-01-28 d497981dce77
!vote_en.wikipedia.org_all-access_all-agents_2017-01-29 c813cec10548
!vote_en.wikipedia.org_all-access_all-agents_2017-01-30 5123e0ed62c9
... ...
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-01-31 1fb8f902ad0f
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-01 0107f6d7cd82
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-02 30c402ed9e49
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-03 935fa0168d01
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-04 1140b428380e
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-05 cc5eadae0d7a
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-06 f923701cdb05
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-07 905679a20d39
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-08 642354a50690
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-09 7376c63bd4c1
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-10 1f0566b71f7e
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-11 938774bbb675
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-12 53c046bac8cb
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-13 ead2377353d3
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-14 efa87c7d5160
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-15 f239d6ceb17b
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-16 0fef0826b1bc
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-17 478d3c34b0c1
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-18 6a1b6e3028fc
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-19 3b5fb022accd
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-20 a4456a9d271d
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-21 d43a25cf4ef2
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-22 8f47d2e020cd
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-23 a78af728d84b
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-24 d1ba45c7ec08
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-25 f69747f5ee68
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-26 2489963dc503
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-27 b0624c909f4c
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-02-28 24a1dfb06c10
龙生九子_zh.wikipedia.org_mobile-web_all-agents_2017-03-01 add681d54216

8703780 rows × 1 columns


In [30]:
# left join on the Page index set above; keep only the columns needed for the submission
submission_file = submission.join(keys, how='left', lsuffix='_sub', rsuffix='_key')[["Id", "Visits"]]

In [31]:
submission_file.describe()


Out[31]:
Visits
count 8.703780e+06
mean 1.457983e+03
std 8.043674e+04
min 0.000000e+00
25% 2.300000e+01
50% 1.900000e+02
75% 8.170000e+02
max 2.322630e+07

In [32]:
# check there are no NaNs left after the join
submission_file.isnull().sum()


Out[32]:
Id        0
Visits    0
dtype: int64
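
As an optional spot check (a sketch, relying on both frames being indexed by Page above), the Id attached to a given page/date should match the one listed in keys:

page = '2NE1_zh.wikipedia.org_all-access_spider_2017-01-01'
assert submission_file.loc[page, 'Id'] == keys.loc[page, 'Id']  # 'ff8c1aade3de' per Out[28]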

In [33]:
submission_file.to_csv('../data/submissions/0.3_mean_row_submission.csv.gz', compression='gzip', index=False)

Next time, we can try decomposing each series into trend + seasonality and predicting with the trend value observed on the same date one year earlier.
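
A rough sketch of that idea (assuming statsmodels is available; seasonal_decompose and the weekly period=7 are assumptions, not part of this notebook):

from statsmodels.tsa.seasonal import seasonal_decompose

# one page's daily series, indexed by date (NaNs filled so the decomposition runs)
row = df.iloc[0]
series = pd.Series(row.iloc[1:-4].astype(float).values,
                   index=pd.to_datetime(df.columns[1:-4])).fillna(0)

decomposition = seasonal_decompose(series, model='additive', period=7)
trend = decomposition.trend        # smoothed long-term level
seasonal = decomposition.seasonal  # repeating weekly pattern

# a forecast along those lines would reuse the trend observed at the same date
# one year earlier, optionally adding the seasonal value for the matching weekday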

