This is an example of a data analysis in Ruby using daru
to organize and manipulate data, mixed_models
to fit a statistical model, and gnuplotrb
for visualization.
We will analyze these data1 from the UCI machine learning repository2, which originate from blog posts from various sources in 2010-2012.
Here, we use the data to investigate the following question:
If a blog post has received a comment in the first 24 hours after publication, how many more comments will it receive before 24 hours after its publication have passed?
After doing the entire data cleaning and preprocessing with daru
, we use mixed_models
to fit a linear mixed model. The fitted model can be used to make predictions for future observations, and make inferences about the relationships between different variables in the data.
[1] For more information on the data set we refer to: Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing.
[2] Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Since daru
requires CSV files to have a header line, we add a header to the data file and save the new data frame.
In [1]:
without_header = '../examples/data/blogData_train.csv'
with_header = '../examples/data/blogData_train_with_header.csv'
colnames = (1..281).to_a.map { |x| "v#{x}" }
header = colnames.join(',')
File.open(with_header, 'w') do |fo|
fo.puts header
File.foreach(without_header) do |li|
fo.puts li
end
end
Now we can load the data with daru
.
In [2]:
require 'daru'
df = Daru::DataFrame.from_csv '../examples/data/blogData_train_with_header.csv'
puts "Number of rows: #{df.nrows}"
puts "Number of columns: #{df.ncols}"
We don't want to keep most of the variables for further analysis, as the great majority of the columns represent bag-of-words features and summary statistics for other attributes. So, we select the data columns which we want to keep, and assign them meaningful names.
In [3]:
# select a subset of columns of the data frame
keep = ['v16', 'v41', 'v54', 'v62', 'v270', 'v271', 'v272',
'v273', 'v274', 'v275', 'v276', 'v277', 'v280']
blog_data = df[*keep]
df = nil
# assign meaningful names for the selected columns
meaningful_names = [:host_comments_avg, :host_trackbacks_avg,
:comments, :length, :mo, :tu, :we, :th,
:fr, :sa, :su, :parents, :parents_comments]
blog_data.vectors = Daru::Index.new(meaningful_names)
# the resulting data set
blog_data.head
Out[3]:
As can be seen in the above output, the length of the text in a blog post is often given as zero. Those are probably missing values, and we get rid of the corresponding observations.
We also delete observation which have zero comments in the first 24 hours after publication, to comply with our research objective stated in the beginning.
In [4]:
nonzero_ind = blog_data[:length].each_index.select do |i|
blog_data[:length][i] > 0 && blog_data[:comments][i] > 0
end
blog_data = blog_data.row[*nonzero_ind]
puts "Remaining number of rows: #{blog_data.nrows}"
For a more clear representation of the data, and in order to use the day of the week as a grouping variable for the observations, we replace the respective seven 0-1-valued columns with one column of categorical data (with values 'mo', 'tu', 'we', 'th', 'fr', 'sa', 'su').
In [5]:
days = Array.new(blog_data.nrows) { :unknown }
[:mo, :tu, :we, :th, :fr, :sa, :su].each do |d|
ind = blog_data[d].to_a.each_index.select { |i| blog_data[d][i]==1 }
ind.each { |i| days[i] = d.to_s }
blog_data.delete_vector(d)
end
blog_data[:day] = days
blog_data.head 3
Out[5]:
The variable parents
denotes the number of parent blog posts, where we consider a blog post P as a
parent of blog post B if B is a reply (trackback) to blog post P.
Related to it, the variable parents_comments
denotes the number of comments that the parents of a blog post received on average.
Clearly, the two variables are highly correlated (as zero parents
implies zero parents_comments
). Therefore, we shouldn't include both these variables in the linear mixed model.
We combine the variables parents
and parents_comments
into one variable called has_parent_with_comments
, which is a categorical variable designating whether a blog post has at least one parent post with at least one comment.
In [6]:
# create a binary indicator vector specifying if a blog post has at least
# one parent post which has comments
hpwc = (blog_data[:parents] * blog_data[:parents_comments]).to_a
blog_data[:has_parent_with_comments] = hpwc.map { |t| t == 0 ? 'no' : 'yes'}
blog_data.delete_vector(:parents)
blog_data.delete_vector(:parents_comments)
blog_data.head 3
Out[6]:
In [7]:
log_comments = blog_data[:comments].to_a.map { |c| Math::log(c) }
log_host_comments_avg = blog_data[:host_comments_avg].to_a.map { |c| Math::log(c) }
blog_data[:log_comments] = log_comments
blog_data[:log_host_comments_avg] = log_host_comments_avg
blog_data.head 3
Out[7]:
The logarithm of the number of comments of a given blog post is modeled as a function of the blog post text length, the average number of comments and trackbacks per blog article at the hosting website of the blog, and the existence of commented parent blog posts. Additionally, we model random fluctuations of the number of comments due to the day of the week when the blog post was released.
That is, the number of comments of the $i$th blog post, which is published on weekday $d$, is estimated as
$$log(comments_i) \approx \beta_0 + length_i \cdot \beta_1 + log(host\_comments_i) \cdot \beta_2 + host\_trackbacks_i \cdot \beta_3 + parent\_with\_comments\_yes \cdot \beta_4 + b_d.$$Note: It is questionable whether a linear mixed model is a good choice here, because the response variable represents count data, which for obvious reasons cannot follow a normal distribution. However, we can still get a reasonble model fit, and certainly gain some insight into the data from the linear mixed model, even if it wont match the data very well. In fact, by taking the $log$-transform, we partially account for the fact that the response variable represents count data.
We fit the model with mixed_models
and display the estimated fixed effects coefficients (the $\beta$'s from the above equation).
In [8]:
require 'mixed_models'
model_fit = LMM.from_formula(formula: "log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments + (1 | day)", data: blog_data)
model_fit.fix_ef
Out[8]:
In [9]:
model_fit.fix_ef_summary
Out[9]:
We can assess the goodness of the model fit (to some extent) by plotting the residuals agains the fitted values. We definately see a pattern here -- the model seems to make somewhat better predictions for observations with a small number of comments. Additionally, a subset of the observations displays what appears to be a linear relationship between the fitted values and the residuals, which is a reason for concern. However, this is due to the fact the the response variable has a discrete range ($log$ transfrom of count data), with possible values at the low end being especially far apart.
In [10]:
require 'gnuplotrb'
include GnuplotRB
x, y = model_fit.fitted, model_fit.residuals
fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],
xlabel: 'Fitted', ylabel: 'Residuals')
fitted_vs_residuals.term('png')
Out[10]:
In [11]:
bin_width = (y.max - y.min)/10.0
bins = (y.min..y.max).step(bin_width).to_a
rel_freq = Array.new(bins.length-1){0.0}
y.each do |r|
0.upto(bins.length-2) do |i|
if r >= bins[i] && r < bins[i+1] then
rel_freq[i] += 1.0/y.length
end
end
end
bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }
residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],
style: 'fill solid 0.5')
residuals_hist.term('png')
Out[11]:
In [12]:
require 'distribution'
observed = model_fit.residuals.sort
n = observed.length
theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value(t.to_f/n.to_f) * model_fit.sigma}
qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],
['x', with: 'lines', notitle: true],
xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',
title: 'Q-Q plot of the residuals')
qq_plot.term('png')
Out[12]:
In general, we see that the model residuals satisfy the normality assumption to a reasonable extent.
In [13]:
puts "Obtained fixed effects coefficient estimates:"
puts model_fit.fix_ef
The directionality of the obtained estimates implies that blog posts hosted on websites with a high average of comments per blog post, also tend to have more comments. Moreover, blog posts which have parent blog posts, tend to have fewer comments. The effects of the average number of trackbacks per blog post on the hosting website and the blog post length seem rather small in magnitude.
Based on the Wald Z test statistics, we can carry out hypotheses tests for each fixed effects terms $\beta_{i}$, testing the null $H_{0} : \beta_{i} = 0$ against the alternative $H_{a} : \beta_{i} \neq 0$.
Note: The Wald methods for $p$-values and confidence intervals are not absolutely trustworthy and should be treated with caution, as pointed out in this blog post.
The corresponding (approximate) p-values are obtained with:
In [14]:
model_fit.fix_ef_p(method: :wald)
Out[14]:
Interestingly, all $p$-values are tiny, implying that each predictor has a very strong linear relationship with the response variable.
However, we should be careful with our conclusions. The very small $p$-values can also be explained by the large sample size (>20000 observations).
We can also look at Wald confidence intervals for the fixed effects coefficient estimates, which are in general more informative than $p$-values.
In [15]:
conf_int = model_fit.fix_ef_conf_int(level: 0.95, method: :wald)
ci = Daru::DataFrame.rows(conf_int.values, order: [:lower95, :upper95], index: model_fit.fix_ef_names)
ci[:coef] = model_fit.fix_ef.values
ci
Out[15]:
We observe that none of the 95% confidence intervals contains 0, which suggest high statistical significance of the linear predictors (equivalent to $p$-values). Also, all of the intervals seem rather narrow.
In [16]:
puts "Obtained random effects coefficient estimates:"
puts model_fit.ran_ef
We can also look at the estimated correlation structure of the random effects:
In [17]:
model_fit.ran_ef_summary
Out[17]:
Interestingly, the estimates of the random effects coefficients and standard deviation are all zero!
That is, we have a singular fit. Thus, our results imply that the variability between different days of the week is not large enough to justify non-zero random effects in this model. Practically, we can coclude that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive in the first 24 hours.
This turned out to be an example of a degenerate model fit (random effects variance estimated to be zero), and we saw that mixed_models
can handle degenerate fits very well.
Of course we learned alot about the blog post data by doing this analysis. Here is a list of some of the findings.
Blog posts hosted on websites with a high average of comments per blog post, also tend to have more comments (obviously).
Blog posts which have parent blog posts, seem to have fewer comments (wonder why that is...).
The effect of the average number of trackbacks per blog post on the hosting website seems to have a very small effect on the number of comments of a given blog post.
The blog post text length has an extremely small positive effect on the number of comments.
All considered fixed effects predictor variables seem to have a significant influence on the number of comments that a blog post receives in the first 24 hours after publication (according to Wald Z tests).
The day of the week on which a blog post is published has practically no influence on the number of comments that the blog post will receive in the first 24 hours after publication.