Required Packages

  1. poweRlaw
  2. ipoptr

Let's first start by making all functions availble for simmulating, modeling and predicting cascades


In [1]:
## Just source the rscript provided
source('rscripts/marked_hawkes.R')


Loading required package: poweRlaw
Loading required package: ipoptr

Simulation

We will start with visualizing how would kernel function for an event with mark 1000 evolve, where it's parameters are $\kappa = 0.8$, $\beta = 0.6$, $c = 10$ and $\theta = 0.8$. For any code where we want to use power law kernel we don't need to pass the kernel value as it's te defualt type implemented.


In [2]:
# initial event we want to use for simmulation
event <- c(1000, 0)
# series of time points we want to simmulate event for
t <- seq(from = 0, to = 100, length=1000)
# parameters for kernel
K <- 0.8
beta <- 0.6
c <- 10
theta <- 0.8

# Power Law Kernel
# calling the kernelfunction to get the values
values.PL <- kernelFct(event = event, t = t, K = K, beta = beta, c = c, theta = theta, kernel.type='PL')
# ploting the kernel function
plot(x = t, y = values.PL, type = 'l', col = 'blue', pch=16, bty = "n",
      lty = 1, las = 1, yaxt = "n", xaxt = "n", ylim = c(0, max(values.PL)+0.1),
        cex.main = 2, cex.lab = 1.8, cex.axis = 1.4,
        main = expression(paste("Power Law memory kernel ", phi[m] (tau), " over time")), ylab = "", xlab = "" )
axis(side = 2, las = 1, cex.axis = 1.4)
axis(side = 1, cex.axis = 1.4 )


We can also simulate a series of events generated by a hawkes process described by the parameters above. We have option to save it to a file and specify the time till we want events to b simulated. We can even simulate a series of events after providing an initial history.


In [3]:
# generating events till a time
events1 <- generate_Hawkes_event_series(K = K, beta = beta, c = c, theta = theta, Tmax = 50)
events2 <- generate_Hawkes_event_series(K = K, beta = beta, c = c, theta = theta, history_init = events1, Tmax = 60)


Current simulation time: 50.066 / 50.000 (43 events).
--> Simulation done!
Current simulation time: 61.200 / 60.000 (47 events).
--> Simulation done!

Now if we compare events2 and events1 we find events2 has all of events1 as prefix, with some extra generated events.

Real Example

In this section, we first read a real twitter cascade and then fit a hawkes model to it after observing first 1 hours of cacscade and then we will do the final prediction. A retweet cascade is provided in the file 'example.csv' with each row representing an event with values of the form

[magnitude(#user followers), time(secs)]


In [4]:
## read the real cascade provided in the file example.csv
real_cascade <- read.csv(file = 'example_book.csv', header = T)

In [5]:
## just looking at data
real_cascade[1:10, ]


Xmagnitudetime
1 40989 0
2 144521
3 56331
4 32933
5 55549
6 51354
7 2814558
8 10062
9 16482
10 49187

In [6]:
## reading the events uptill prediction time, 600 secs, to be used for fitting
predTime <- 600
history <- real_cascade[real_cascade$time <= predTime, ]
## removing the first column, index
history <- history[ , 2:3]

Using Power Law Kernel as Social Kernel


In [7]:
## just to supress warnings as optimization 
oldw <- getOption("warn")
options(warn = -1)
## now for fitting we just call the fitting function, which uses IPOPT which needs some starting values for parameters
startParams <- c(K= 1,beta=1, c=250, theta= 1)
result <- fitParameters(startParams, history)


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

In IPOPT the optimal values returned are in order they are passed to the objective function hence:

$\kappa = result\$solution[1]$

$\beta = result\$solution[2] $

$ c = result\$solution[3] $

$\theta = result\$solution[4] $


In [8]:
sprintf("Value of K =  %f", result$solution[1])
sprintf("Value of beta = %f", result$solution[2])
sprintf("Value of c =  %f", result$solution[3])
sprintf("Value of theta =  %f", result$solution[4])


'Value of K = 1.000000'
'Value of beta = 1.015493'
'Value of c = 250.657531'
'Value of theta = 1.338108'

Now we plot part of the observed event series and the corresponding intensity function with fitted parameters. We do not plot the whole observed parts as it's too big to be seen with useful information on a single small graph


In [9]:
## length of time to use for plottig
plotTime <- 600

pltData <- history[history$time <= plotTime,]
intensity <- lambda(t=seq(0,plotTime,1), history = history, params = result$solution, inclusive = T)
par(mfrow=c(2,1))
data <- data.frame(time= seq(0,plotTime,1), intensity=intensity)

plot(x = pltData$time, y = log(pltData$magnitude, base = 10), type = 'p', col = 'black', pch=16, bty = "n",
         xaxt = "n", yaxt = "n", xlab = "", ylab = "User Influence", main = "A retweet cascade as a sequence of events",
         cex.main = 1.8, cex.lab = 1.8)
axis(side = 1, cex.axis=1.4)
segments(x0 =pltData$time, y0 = 0, x1 = pltData$time, y1 = log(pltData$magnitude, base = 10),lty = 2)
points(x = pltData$time, y = log(pltData$magnitude, base = 10), type = 'p', col = 'black', pch=16)
axis(side = 2)


plot(x = data$time, y = data$intensity, type = 'l', col = 'black', pch=16, bty = "n", 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "Intensity",
     cex.main = 1.8, cex.lab = 1.8)
axis(side = 1, cex.axis=1.4)
axis(side = 2)



In [10]:
## To get predictions from fitted model we call the function provided in rscipts
prediction <- getTotalEvents(history = history, bigT = predTime, 
                             K = result$solution[1], alpha = 2.016, 
                             beta = result$solution[2], mmin = 1,                
                             c = result$solution[3], theta = result$solution[4])
# settings warnings back to deafult
options(warn = oldw)

In [11]:
prediction


total
216
nstar
0.922294642616219
a1
13.4134242589397

In [12]:
## Total length of real casacde
nReal = nrow(real_cascade)
nPredicted = prediction['total']
sprintf("The real size of cascade is %d and predicted values is %d", nReal, nPredicted)
sprintf("The relative error in percentage is %0.2f", 100*abs(nReal-nPredicted)/nReal)


'The real size of cascade is 219 and predicted values is 216'
'The relative error in percentage is 1.37'

Using Exponential Kernel as Social Kernel


In [13]:
## just to supress warnings as optimization 
oldw <- getOption("warn")
options(warn = -1)
## now for fitting we just call the fitting function, which uses IPOPT which needs some starting values for parameters
## for exp there is no c value
startParams <- c(K= 0.01,beta=0.01, c=0, theta= 0.01)
result <- fitParameters(startParams, history, kernel.type = 'EXP')

In IPOPT the optimal values returned are in order they are passed to the objective function hence:

$\kappa = result\$solution[1]$

$\beta = result\$solution[2] $

$ c = result\$solution[3] $

$\theta = result\$solution[4] $


In [14]:
sprintf("Value of K =  %f", result$solution[1])
sprintf("Value of beta = %f", result$solution[2])
## no significance of c, we have to use it just to have same call for all kernel types
sprintf("Value of c =  %f", result$solution[3])
sprintf("Value of theta =  %f", result$solution[4])


'Value of K = 0.000382'
'Value of beta = 1.015611'
'Value of c = 0.000000'
'Value of theta = 0.005451'

Now we plot part of the observed event series and the corresponding intensity function with fitted parameters. We do not plot the whole observed parts as it's too big to be seen with useful information on a single small graph


In [15]:
## length of time to use for plottig
plotTime <- 600

pltData <- history[history$time <= plotTime,]
intensity <- lambda(t=seq(0,plotTime,1), history = history, params = result$solution, inclusive = T,  kernel.type = 'EXP')
par(mfrow=c(2,1))
data <- data.frame(time= seq(0,plotTime,1), intensity=intensity)

plot(x = pltData$time, y = log(pltData$magnitude, base = 10), type = 'p', col = 'black', pch=16, bty = "n",
         xaxt = "n", yaxt = "n", xlab = "", ylab = "User Influence", main = "A retweet cascade as a sequence of events",
         cex.main = 1.8, cex.lab = 1.8)
axis(side = 1, cex.axis=1.4)
segments(x0 =pltData$time, y0 = 0, x1 = pltData$time, y1 = log(pltData$magnitude, base = 10),lty = 2)
points(x = pltData$time, y = log(pltData$magnitude, base = 10), type = 'p', col = 'black', pch=16)
axis(side = 2)


plot(x = data$time, y = data$intensity, type = 'l', col = 'black', pch=16, bty = "n", 
     xaxt = "n", yaxt = "n", xlab = "", ylab = "Intensity",
     cex.main = 1.8, cex.lab = 1.8)
axis(side = 1, cex.axis=1.4)
axis(side = 2)



In [16]:
## To get predictions from fitted model we call the function provided in rscipts
prediction <- getTotalEvents(history = history, bigT = predTime, 
                             K = result$solution[1], alpha = 2.016, 
                             beta = result$solution[2], mmin = 1,                
                             c = result$solution[3], theta = result$solution[4],
                             kernel.type = 'EXP')
# settings warnings back to deafult
options(warn = oldw)

In [17]:
prediction


total
1603
nstar
0.997453161223473
a1
3.97306402730592

In [18]:
## Total length of real casacde
nReal = nrow(real_cascade)
nPredicted = prediction['total']
sprintf("The real size of cascade is %d and predicted values is %d", nReal, nPredicted)
sprintf("The relative error in percentage is %0.2f", 100*abs(nReal-nPredicted)/nReal)


'The real size of cascade is 219 and predicted values is 1603'
'The relative error in percentage is 631.96'

We have used only a single starting point for optimization solver. To get best fittings it's generally recommended to use random different starting points and then select the parameters that best fit. We have not covered this trick here as it's more of a coding exercise.

To understand how each function is implemented please look at the files in rscripts folder