In [52]:
# Load the libraries we need: dplyr and ggplot2
library(dplyr)
library(ggplot2)
In [53]:
options(repr.plot.width=10, repr.plot.height=6)
In [54]:
# Read the CSV file of month-wise quantities and prices
df <- read.csv('MonthWiseMarketArrivals_Clean.csv')
In [55]:
str(df)
In [56]:
# Convert the date column from text to the Date class
df$date <- as.Date(as.character(df$date), "%Y-%m-%d")
In [57]:
head(df)
Out[57]:
In [58]:
# Get the values for Bangalore
dfBang <- df %>%
  filter(city == 'BANGALORE') %>%
  arrange(date) %>%
  select(quantity, priceMod, date)
In [59]:
head(dfBang)
Out[59]:
In [60]:
summary(dfBang$priceMod)
Out[60]:
In [61]:
summary(dfBang$quantity)
Out[61]:
In [62]:
ggplot(dfBang) + aes(quantity, priceMod) + geom_point()
In [63]:
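# Correlation between quantity and price for Bangalore
cor(dfBang$quantity, dfBang$priceMod)
Out[63]:
In [64]:
# Correlation on the log scale
cor(log(dfBang$quantity), log(dfBang$priceMod))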
Out[64]:
In [65]:
# Fit a straight line to the log-transformed data to see if there is a relationship
ggplot(dfBang) + aes(log(quantity), log(priceMod)) + geom_point() + stat_smooth(method = 'lm')
We will build a time-series model to forecast onion prices.
Most time series models assume that the series is stationary. Intuitively, if a series has behaved in a particular way over time, it is very likely to behave the same way in the future. The theory for stationary series is also more mature and easier to implement than that for non-stationary series.
Statistical stationarity: A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. Most statistical forecasting methods are based on the assumption that the time series can be rendered approximately stationary (i.e., "stationarized") through the use of mathematical transformations. A stationarized series is relatively easy to predict: you simply predict that its statistical properties will be the same in the future as they have been in the past!
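There are three basic criteria for a series to be classified as stationary: its mean is constant over time, its variance is constant over time, and the covariance between two observations depends only on the lag between them, not on the time at which they occur.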
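One rough way to see what constant mean and variance look like in practice is to compare those statistics across different stretches of a series. The sketch below does this on two simulated series (not the onion data); the helper name `half_stats` is made up for illustration, and this is an informal check rather than a formal stationarity test.

```r
# Illustrative only: simulated series, not the onion data
set.seed(42)
n <- 200
white_noise <- rnorm(n)          # stationary: constant mean and variance
random_walk <- cumsum(rnorm(n))  # non-stationary: its variance grows with time

# Hypothetical helper: mean and variance of the first and second half of a series
half_stats <- function(x) {
  first_half <- seq_along(x) <= length(x) / 2
  data.frame(
    half     = c("first", "second"),
    mean     = c(mean(x[first_half]), mean(x[!first_half])),
    variance = c(var(x[first_half]), var(x[!first_half]))
  )
}

stats_noise <- half_stats(white_noise)
stats_walk  <- half_stats(random_walk)
stats_noise
stats_walk
```

For a stationary series the two halves should give similar numbers; for a random walk they typically differ noticeably.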
In [66]:
head(dfBang)
Out[66]:
In [67]:
# Let us plot priceMod as a time series
ggplot(dfBang) + aes(date, priceMod) + geom_line()
Approaches to make the time series stationary
In data analysis, a transformation is the replacement of a variable by a function of that variable: for example, replacing a variable x by the square root of x or the logarithm of x. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship. Transformations are done for...
In [68]:
dfBang <- dfBang %>%
  mutate(priceModLog = log(priceMod))
In [69]:
head(dfBang)
Out[69]:
In [70]:
ggplot(dfBang) + aes(priceMod) + geom_histogram(bins = 30)
In [71]:
ggplot(dfBang) + aes(priceModLog) + geom_histogram(bins = 30)
In [72]:
# We take the log transform to reduce the impact of high values
ggplot(dfBang) + aes(date, priceModLog) + geom_line()
Computing the differences between consecutive observations is known as differencing.
Transformations such as logarithms can help to stabilize the variance of a time series. Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and so eliminating trend and seasonality.
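As a small worked example of combining the two steps, the sketch below takes the log of a made-up price vector (not from the data set) and then differences it with base R's `diff()`. The result is one element shorter than the input because the first observation has no predecessor.

```r
# Made-up prices for illustration: each one is 10% above the previous one
price <- c(100, 110, 121, 133.1)

log_price <- log(price)
log_diff  <- diff(log_price)  # log(price[t]) - log(price[t-1])

log_diff
# Constant 10% growth becomes a constant log difference of log(1.1)
all.equal(log_diff, rep(log(1.1), 3))
```

This is why the log difference is often read as an approximate percentage change: steady proportional growth, which is a rising trend in the original series, becomes a flat level after the transformation.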
A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and returns a single value, a window function returns n values. The output of a window function depends on all its input values. Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().
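A minimal sketch of the difference, using a small made-up vector: the aggregation returns one value, while each window function returns a vector as long as its input. The calls are written as `dplyr::lag()` and `dplyr::lead()` to make clear that these are the dplyr versions, not `stats::lag()`.

```r
library(dplyr)

x <- c(3, 1, 4, 1, 5)  # made-up values for illustration

total       <- sum(x)          # aggregation: 5 inputs, 1 output
running_sum <- cumsum(x)       # window: 5 inputs, 5 outputs
running_avg <- cummean(x)      # running mean (dplyr)
ranks       <- rank(x)         # ranking, ties get the average rank
previous    <- dplyr::lag(x)   # shifted forward by one, NA in first position
following   <- dplyr::lead(x)  # shifted back by one, NA in last position

# A first-order difference is the value minus its lag
first_diff <- x - previous

data.frame(x, running_sum, running_avg, ranks, previous, following, first_diff)
```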
In [73]:
# Let us take the first-order difference of the log price
dfBang <- dfBang %>%
  mutate(priceModLogLag = lag(priceModLog)) %>%
  mutate(priceModLogDiff = priceModLog - lag(priceModLog))
In [74]:
# The first value is NA, the rest are first order differences
head(dfBang)
Out[74]:
In [75]:
# We can see that they are highly correlated
cor(dfBang$priceModLog, dfBang$priceModLogLag, use = 'complete')
Out[75]:
In [76]:
ggplot(dfBang) + aes(priceModLog, priceModLogLag) + geom_point()
In [77]:
# Let us plot the priceModLogDiff
ggplot(dfBang) + aes(date, priceModLogDiff) + geom_line()
In [78]:
# Check that the mean of the differenced series is close to zero
mean(dfBang$priceModLogDiff, na.rm = TRUE)
Out[78]:
In [79]:
sd(dfBang$priceModLogDiff, na.rm = TRUE)
Out[79]:
Now it is cumbersome to calculate the correlation for t-1, t-2, t-3 and so on to check... So we can use the acf function to do the same.
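To make concrete what `acf` computes, the sketch below calculates the sample autocorrelation at a few lags by hand and compares it with the built-in result. It uses a simulated series rather than the onion data, and `manual_acf` is a made-up helper name. Note that this estimator uses the overall mean and the total sum of squares, so it is close to, but not identical to, calling `cor()` on the series and its lag.

```r
set.seed(1)
x <- cumsum(rnorm(100))  # simulated series, for illustration only

# Hypothetical helper: sample autocorrelation at lag k, computed from
# deviations from the overall mean, normalised by the total sum of squares
manual_acf <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  sum(d[1:(n - k)] * d[(1 + k):n]) / sum(d^2)
}

manual  <- sapply(1:3, function(k) manual_acf(x, k))
builtin <- acf(x, lag.max = 3, plot = FALSE)$acf[2:4]  # element 1 is lag 0

all.equal(manual, builtin)
```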
In [80]:
# Calculate the autocorrelation function (ACF) for priceModLog
acf(dfBang$priceModLog, na.action = na.omit)
In [81]:
# Calculate the autocorrelation function (ACF) for priceModLogDiff
acf(dfBang$priceModLogDiff, na.action = na.omit)
In [92]:
dim(dfBang)
Out[92]:
In [94]:
# Approximate 95% confidence bound for the autocorrelations: +/- 2 / sqrt(n), here with n = 147
2 / sqrt(147)
Out[94]:
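The value `2 / sqrt(147)` is a rule-of-thumb version of the dashed bounds drawn on the ACF plots. As far as I know, `plot.acf` draws them at `qnorm(0.975) / sqrt(n)` for its default 95% level; the sketch below compares the two, taking n = 147 from the cell above.

```r
n <- 147  # series length used in the cell above

exact_bound   <- qnorm(0.975) / sqrt(n)  # normal quantile for a 95% interval
rule_of_thumb <- 2 / sqrt(n)             # rounded version of the same bound

c(exact = exact_bound, rule_of_thumb = rule_of_thumb)
```

Autocorrelations that fall outside plus or minus this bound are unlikely to have come from a white-noise series, so they point to structure that a model could exploit.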
When faced with a time series that shows irregular growth, the best strategy may not be to try to directly predict the level of the series at each period (i.e., the quantity $Y_t$). Instead, it may be better to try to predict the change that occurs from one period to the next (i.e., the quantity $Y_t - Y_{t-1}$). That is, it may be better to look at the first difference of the series, to see if a predictable pattern can be found there. For purposes of one-period-ahead forecasting, it is just as good to predict the next change as to predict the next level of the series, since the predicted change can be added to the current level to yield a predicted level. The simplest case of such a model is one that always predicts that the next change will be zero, as if the series is equally likely to go up or down in the next period regardless of what it has done in the past.
Random Walk Model $$ Y_t = Y_{t-1} + \epsilon_t, \qquad \hat{Y}_t = Y_{t-1} $$
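As a sketch of how a random walk forecast can be built and scored, the example below uses a short made-up price vector (not from the data set). The forecast for each period is simply the previous observation, which is what `lag()` produces in the notebook cells; the root mean squared error (RMSE) is one common way to summarise the forecast errors, chosen here as an assumption rather than taken from the notebook.

```r
# Made-up monthly prices for illustration (not from the data set)
price <- c(500, 520, 480, 510, 530, 525)

# One-step-ahead random walk forecast: the previous observed value
# (the first period has no predecessor, so its forecast is NA)
rw_forecast <- c(NA, head(price, -1))

# Forecast errors and their root mean squared error
rw_error <- price - rw_forecast
rmse <- sqrt(mean(rw_error^2, na.rm = TRUE))

# Forecast for the next, not yet observed period: the last observed price
next_forecast <- tail(price, 1)

rmse
next_forecast
```

Because the random walk needs no fitted parameters, its RMSE makes a useful baseline: a more elaborate model is only worth keeping if it beats this number.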
In [82]:
head(dfBang)
Out[82]:
In [85]:
# Random walk forecast: each month's prediction is the previous month's price
dfBang <- dfBang %>%
  mutate(priceRandomWalk = lag(priceMod))
In [86]:
head(dfBang)
Out[86]:
In [99]:
tail(dfBang)
Out[99]:
In [97]:
# The forecast for the next month is the last observed price
predicted <- tail(dfBang$priceMod, 1)
In [98]:
predicted
Out[98]:
In [91]:
ggplot(dfBang) + aes(date, priceRandomWalk) + geom_line() + geom_point(aes(date, priceMod))