Test forecast package

Let's take the opportunity to study the forecast package in R, and other classic R packages in my case.

Follow https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials


In [1]:
library('readr') # data input
library('tibble') # data wrangling
library('reshape2') # data manipulation
library('dplyr') # data manipulation

library('ggplot2') # visualization

library('lubridate') # date and time
library('tseries') # time series analysis
library('forecast') # time series analysis

options(repr.plot.width=8, repr.plot.height=4)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date

Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: timeDate
This is forecast 7.3 


In [2]:
df <- read_csv('../data/processed/df.csv')


Parsed with column specification:
cols(
  .default = col_double(),
  Page = col_character(),
  agent = col_character(),
  access = col_character(),
  project = col_character(),
  pagename = col_character()
)
See spec(...) for full column specifications.

In [3]:
head(df)


Page2015-07-012015-07-022015-07-032015-07-042015-07-052015-07-062015-07-072015-07-082015-07-09...2016-12-262016-12-272016-12-282016-12-292016-12-302016-12-31agentaccessprojectpagename
2NE1_zh.wikipedia.org_all-access_spider 18 11 5 13 14 9 9 22 26 ... 14 20 22 19 18 20 spider all-access zh.wikipedia.org 2NE1
2PM_zh.wikipedia.org_all-access_spider 11 14 15 18 11 13 22 11 10 ... 9 30 52 45 26 20 spider all-access zh.wikipedia.org 2PM
3C_zh.wikipedia.org_all-access_spider 1 0 1 1 0 4 0 3 4 ... 4 4 6 3 4 17 spider all-access zh.wikipedia.org 3C
4minute_zh.wikipedia.org_all-access_spider 35 13 10 94 4 26 14 9 11 ... 16 11 17 19 10 11 spider all-access zh.wikipedia.org 4minute
52_Hz_I_Love_You_zh.wikipedia.org_all-access_spiderNA NA NA NA NA NA NA NA NA ... 3 11 27 13 36 10 spider all-access zh.wikipedia.org 52_Hz_I_Love_You
5566_zh.wikipedia.org_all-access_spider 12 7 4 5 20 8 5 17 24 ... 32 19 23 17 17 50 spider all-access zh.wikipedia.org 5566

Let's train only on a row time series, like Pokemon ^^


In [4]:
input <- melt(df %>% select(-c(Page, agent, access, project, pagename)) %>% slice(1))


No id variables; using all as measure variables

In [5]:
head(input)


variablevalue
2015-07-0118
2015-07-0211
2015-07-03 5
2015-07-0413
2015-07-0514
2015-07-06 9

Plot some data


In [6]:
input$Date = as.Date(input$variable)

In [7]:
ggplot(input, aes(Date, value)) + geom_line() + scale_x_date('month')  + ylab("Visits") + xlab("Date")


Some big values here. R provides a convenient method for removing time series outliers: tsclean() as part of its forecast package. tsclean() identifies and replaces outliers using series smoothing and decomposition. This method is also capable of inputing missing values in the series if there are any.

Note that we are using the ts() command to create a time series object to pass to tsclean():


In [8]:
count_ts = ts(input[, c('value')])
input$visits_cleaned = tsclean(count_ts)

In [9]:
ggplot(input, aes(Date, visits_cleaned)) + geom_line() + scale_x_date('month')  + ylab("Visits") + xlab("Date")


Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

In [10]:
input$visits_ma = ma(input$visits_cleaned, order=7)
input$visits_ma30 = ma(input$visits_cleaned, order=30)

In [11]:
ggplot() +
  geom_line(data = input, aes(x = Date, y = visits_cleaned, colour = "Counts")) +
  geom_line(data = input, aes(x = Date, y = visits_ma,   colour = "Weekly Moving Average"))  +
  geom_line(data = input, aes(x = Date, y = visits_ma30, colour = "Monthly Moving Average"))  + 
  ylab('Visits Count')


Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
Warning message:
"Removed 6 rows containing missing values (geom_path)."Warning message:
"Removed 30 rows containing missing values (geom_path)."

Timeseries decomposition


In [ ]: