Data Analysis in Ruby

Introduction

This is an example of a data analysis in Ruby using daru to organize and manipulate data, mixed_models to fit a statistical model, and gnuplotrb for visualization.

We will analyze these data [1] from the UCI Machine Learning Repository [2], which originate from blog posts published on various websites in 2010-2012.

Here, we use the data to investigate the following question:

If a blog post has received a comment within the first 24 hours after publication, how many more comments will it receive before those first 24 hours are over?

After carrying out the data cleaning and preprocessing with daru, we use mixed_models to fit a linear mixed model. The fitted model can be used to make predictions for future observations, and to draw inferences about the relationships between the different variables in the data.


[1] For more information on the data set we refer to: Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing.

[2] Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Data preprocessing with daru

Since daru requires CSV files to have a header line, we add a header to the data file and save the new data frame.


In [1]:
without_header = '../examples/data/blogData_train.csv'
with_header = '../examples/data/blogData_train_with_header.csv'
colnames = (1..281).to_a.map { |x| "v#{x}" }
header = colnames.join(',')
File.open(with_header, 'w') do |fo|
  fo.puts header
  File.foreach(without_header) do |li|
    fo.puts li
  end
end

Now we can load the data with daru.


In [2]:
require 'daru'
df = Daru::DataFrame.from_csv '../examples/data/blogData_train_with_header.csv'
puts "Number of rows: #{df.nrows}"
puts "Number of columns: #{df.ncols}"


Number of rows: 52397
Number of columns: 281

Selecting and renaming data columns

We don't want to keep most of the variables for further analysis, as the great majority of the columns represent bag-of-words features and summary statistics for other attributes. So, we select the data columns which we want to keep, and assign them meaningful names.


In [3]:
# select a subset of columns of the data frame
keep = ['v16', 'v41', 'v54', 'v62', 'v270', 'v271', 'v272', 
        'v273', 'v274', 'v275', 'v276', 'v277', 'v280']
blog_data = df[*keep]
df = nil

# assign meaningful names for the selected columns
meaningful_names = [:host_comments_avg, :host_trackbacks_avg, 
                    :comments, :length, :mo, :tu, :we, :th, 
                    :fr, :sa, :su, :parents, :parents_comments]
blog_data.vectors = Daru::Index.new(meaningful_names)

# the resulting data set
blog_data.head


Out[3]:
Daru::DataFrame:47139036762220 rows: 10 cols: 13
   host_comments_avg  host_trackbacks_avg  comments  length   mo   tu   we   th   fr   sa   su  parents  parents_comments
0          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0
1          34.567566             0.972973       5.0     0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0      0.0               0.0
2          34.567566             0.972973       5.0     0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0      0.0               0.0
3          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0
4          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0
5          34.567566             0.972973       5.0     0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0      0.0               0.0
6          34.567566             0.972973       5.0     0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0      0.0               0.0
7          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0
8          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0
9          34.567566             0.972973       2.0     0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0      0.0               0.0

Selecting a subset of the data rows

As can be seen in the above output, the length of the text in a blog post is often given as zero. Those are probably missing values, and we get rid of the corresponding observations.

We also delete observations which have zero comments in the first 24 hours after publication, in line with our research question stated at the beginning.


In [4]:
nonzero_ind = blog_data[:length].each_index.select do |i| 
  blog_data[:length][i] > 0 && blog_data[:comments][i] > 0
end
blog_data = blog_data.row[*nonzero_ind]
puts "Remaining number of rows: #{blog_data.nrows}"


Remaining number of rows: 22435

Replacing and transforming variables

Creating a categorical "day" variable

For a clearer representation of the data, and in order to use the day of the week as a grouping variable for the observations, we replace the seven respective 0-1-valued columns with a single column of categorical data (with values 'mo', 'tu', 'we', 'th', 'fr', 'sa', 'su').


In [5]:
days = Array.new(blog_data.nrows) { :unknown }
[:mo, :tu, :we, :th, :fr, :sa, :su].each do |d|
  ind = blog_data[d].to_a.each_index.select { |i| blog_data[d][i]==1 }
  ind.each { |i| days[i] = d.to_s }
  blog_data.delete_vector(d)
end
blog_data[:day] = days
blog_data.head 3


Out[5]:
Daru::DataFrame:47139045191780 rows: 3 cols: 7
      host_comments_avg  host_trackbacks_avg  comments  length  parents  parents_comments  day
1221          110.30087                  0.0      74.0  3501.0      0.0               0.0   we
1222          110.30087                  0.0      74.0  3501.0      0.0               0.0   we
1223          110.30087                  0.0     218.0  4324.0      0.0               0.0   th

Replacing two highly correlated variables

The variable parents denotes the number of parent blog posts, where we consider a blog post P as a parent of blog post B if B is a reply (trackback) to blog post P. Related to it, the variable parents_comments denotes the number of comments that the parents of a blog post received on average. Clearly, the two variables are highly correlated (as zero parents implies zero parents_comments). Therefore, we shouldn't include both these variables in the linear mixed model.

We combine the variables parents and parents_comments into one variable called has_parent_with_comments, which is a categorical variable designating whether a blog post has at least one parent post with at least one comment.


In [6]:
# create a binary indicator vector specifying if a blog post has at least 
# one parent post which has comments
hpwc = (blog_data[:parents] * blog_data[:parents_comments]).to_a
blog_data[:has_parent_with_comments] = hpwc.map { |t| t == 0 ? 'no' : 'yes'} 
blog_data.delete_vector(:parents)
blog_data.delete_vector(:parents_comments)
blog_data.head 3


Out[6]:
Daru::DataFrame:47139031507220 rows: 3 cols: 6
      host_comments_avg  host_trackbacks_avg  comments  length  day  has_parent_with_comments
1221          110.30087                  0.0      74.0  3501.0   we                        no
1222          110.30087                  0.0      74.0  3501.0   we                        no
1223          110.30087                  0.0     218.0  4324.0   th                        no

Log transforms

Some prior experimentation with the data suggests that it is necessary to take $log$-transforms of the response variable comments and the predictor variable host_comments_avg. This results in a much better agreement with the normality assumption on the model residuals.


In [7]:
log_comments = blog_data[:comments].to_a.map { |c| Math::log(c) }
log_host_comments_avg = blog_data[:host_comments_avg].to_a.map { |c| Math::log(c) }
blog_data[:log_comments] = log_comments
blog_data[:log_host_comments_avg] = log_host_comments_avg
blog_data.head 3


Out[7]:
Daru::DataFrame:47139031510440 rows: 3 cols: 8
      host_comments_avg  host_trackbacks_avg  comments  length  day  has_parent_with_comments       log_comments  log_host_comments_avg
1221          110.30087                  0.0      74.0  3501.0   we                        no   4.30406509320417     4.7032118138076795
1222          110.30087                  0.0      74.0  3501.0   we                        no   4.30406509320417     4.7032118138076795
1223          110.30087                  0.0     218.0  4324.0   th                        no  5.384495062789089     4.7032118138076795

Linear mixed model

The logarithm of the number of comments of a given blog post is modeled as a function of the blog post text length, the average number of comments and trackbacks per blog article at the hosting website of the blog, and the existence of commented parent blog posts. Additionally, we model random fluctuations of the number of comments due to the day of the week when the blog post was released.

That is, the number of comments of the $i$th blog post, which is published on weekday $d$, is estimated as

$$log(comments_i) \approx \beta_0 + length_i \cdot \beta_1 + log(host\_comments_i) \cdot \beta_2 + host\_trackbacks_i \cdot \beta_3 + parent\_with\_comments\_yes \cdot \beta_4 + b_d.$$

Note: It is questionable whether a linear mixed model is a good choice here, because the response variable represents count data, which for obvious reasons cannot follow a normal distribution. However, we can still get a reasonable model fit, and certainly gain some insight into the data from the linear mixed model, even if it won't match the data very well. In fact, by taking the $log$-transform, we partially account for the fact that the response variable represents count data.

We fit the model with mixed_models and display the estimated fixed effects coefficients (the $\beta$'s from the above equation).


In [8]:
require 'mixed_models'
model_fit = LMM.from_formula(formula: "log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments + (1 | day)", data: blog_data)
model_fit.fix_ef


Out[8]:
{:intercept=>0.18944500002662124, :log_host_comments_avg=>0.7861228519075351, :host_trackbacks_avg=>0.05579925647682859, :length=>2.580329980620255e-05, :has_parent_with_comments_lvl_yes=>-0.48841082096511457}
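
As a quick sanity check on the model equation above, we can plug the printed fixed effects estimates into it by hand. The blog post below (500 words long, on a host averaging 10 comments and 0.5 trackbacks per post, with no commented parents) is purely hypothetical, and the random effect $b_d$ is omitted here, since it is estimated as zero for every weekday, as we will see later:

```ruby
# fixed effects estimates as printed in Out[8]
beta = {
  intercept:                        0.18944500002662124,
  log_host_comments_avg:            0.7861228519075351,
  host_trackbacks_avg:              0.05579925647682859,
  length:                           2.580329980620255e-05,
  has_parent_with_comments_lvl_yes: -0.48841082096511457
}

# a hypothetical blog post (not taken from the data set)
post = {
  log_host_comments_avg:            Math.log(10.0), # host averages 10 comments per post
  host_trackbacks_avg:              0.5,
  length:                           500.0,
  has_parent_with_comments_lvl_yes: 0.0             # 'no'
}

# linear predictor: beta_0 plus the sum of predictor * coefficient
log_comments = beta[:intercept] + post.sum { |k, x| x * beta[k] }
comments     = Math.exp(log_comments)
puts "predicted log(comments): #{log_comments.round(4)}"  # ~2.0404
puts "predicted comments:      #{comments.round(2)}"      # ~7.69
```

In practice one would let the fitted LMM object make predictions on a new data frame instead of doing this by hand; the point here is only that, on the log scale, the fitted model is an ordinary linear combination of the predictors.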

In [9]:
model_fit.fix_ef_summary


Out[9]:
Daru::DataFrame:47139041993940 rows: 5 cols: 4
                                                   coef                      sd             z_score          WaldZ_p_value
intercept                           0.18944500002662124     0.01930036695103986   9.815616485800254                    0.0
log_host_comments_avg                0.7861228519075351    0.005322732415276614  147.69159720509478                    0.0
host_trackbacks_avg                 0.05579925647682859    0.008102849965630654   6.886374141630261  5.723421736547607e-12
length                            2.580329980620255e-05  2.0833561682273268e-06  12.385448153187317                    0.0
has_parent_with_comments_lvl_yes   -0.48841082096511457     0.08884797074497962  -5.497152235102819  3.859734754030342e-08
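
The z_score column above is simply each coefficient estimate divided by its standard deviation. A minimal stdlib-only check, with values copied from the summary table:

```ruby
# estimates and standard deviations copied from the fixed effects summary
coef = { host_trackbacks_avg: 0.05579925647682859,
         length:              2.580329980620255e-05 }
sd   = { host_trackbacks_avg: 0.008102849965630654,
         length:              2.0833561682273268e-06 }

# Wald Z statistic: estimate divided by its standard error
z = coef.keys.map { |k| [k, coef[k] / sd[k]] }.to_h
puts z  # should match the z_score column (about 6.8864 and 12.3854)
```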

Assess the quality of the fit

Fitted vs. residuals plot

We can assess the goodness of the model fit (to some extent) by plotting the residuals against the fitted values. We definitely see a pattern here -- the model seems to make somewhat better predictions for observations with a small number of comments. Additionally, a subset of the observations displays what appears to be a linear relationship between the fitted values and the residuals, which is a reason for concern. However, this is due to the fact that the response variable has a discrete range (the $log$-transform of count data), with possible values at the low end being especially far apart.


In [10]:
require 'gnuplotrb'
include GnuplotRB

x, y = model_fit.fitted, model_fit.residuals
fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],
                               xlabel: 'Fitted', ylabel: 'Residuals')
fitted_vs_residuals.term('png')


Out[10]:
[plot: residuals vs. fitted values]

Histogram of the residuals

We can further assess the validity of the linear mixed model by looking at a histogram of the residuals, and checking whether they appear to be approximately normally distributed. Overall, the distribution does not look too different from a bell-shaped normal curve.


In [11]:
bin_width = (y.max - y.min)/10.0
bins = (y.min..y.max).step(bin_width).to_a
rel_freq = Array.new(bins.length-1) { 0.0 }
y.each do |r|
  0.upto(bins.length-2) do |i|
    # make the last bin closed on the right, so that the maximum
    # residual is not dropped from the histogram
    if r >= bins[i] && (r < bins[i+1] || i == bins.length-2)
      rel_freq[i] += 1.0/y.length
    end
  end
end
bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }
  
residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],
                           style: 'fill solid 0.5')
residuals_hist.term('png')


Out[11]:
[plot: histogram of the residuals]

Q-Q plot of the residuals

The Q-Q scatter plot deviates a little from the diagonal at both ends. However, it isn't horrible, considering that we are modeling count data with a normal distribution.


In [12]:
require 'distribution'

observed = model_fit.residuals.sort
n = observed.length
# theoretical normal quantiles, scaled by the estimated residual standard
# deviation; (t - 0.5)/n keeps the quantile function away from 0 and 1
theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value((t - 0.5)/n) * model_fit.sigma }
qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],
                   ['x', with: 'lines', notitle: true],
                   xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',
                   title: 'Q-Q plot of the residuals')
qq_plot.term('png')


Out[12]:
[plot: Q-Q plot of the residuals]

In general, we see that the model residuals satisfy the normality assumption to a reasonable extent.

Estimation results

Let's look at the estimated model parameters, and see what they reveal about the data.

Fixed effects

Looking at the fixed effects coefficients it is striking that the estimate corresponding to the effect of the blog post text length is almost zero. Possibly, the length of a blog post has practically no effect on how many comments the blog post will receive.


In [13]:
puts "Obtained fixed effects coefficient estimates:"
puts model_fit.fix_ef


Obtained fixed effects coefficient estimates:
{:intercept=>0.18944500002662124, :log_host_comments_avg=>0.7861228519075351, :host_trackbacks_avg=>0.05579925647682859, :length=>2.580329980620255e-05, :has_parent_with_comments_lvl_yes=>-0.48841082096511457}

The directionality of the obtained estimates implies that blog posts hosted on websites with a high average number of comments per blog post also tend to have more comments. Moreover, blog posts which have parent blog posts tend to have fewer comments. The effects of the average number of trackbacks per blog post on the hosting website, and of the blog post length, seem rather small in magnitude.

Based on the Wald Z test statistics, we can carry out a hypothesis test for each fixed effects term $\beta_{i}$, testing the null $H_{0} : \beta_{i} = 0$ against the alternative $H_{a} : \beta_{i} \neq 0$.

Note: The Wald methods for $p$-values and confidence intervals are not absolutely trustworthy and should be treated with caution, as pointed out in this blog post.

The corresponding (approximate) p-values are obtained with:


In [14]:
model_fit.fix_ef_p(method: :wald)


Out[14]:
{:intercept=>0.0, :log_host_comments_avg=>0.0, :host_trackbacks_avg=>5.723421736547607e-12, :length=>0.0, :has_parent_with_comments_lvl_yes=>3.859734754030342e-08}

Interestingly, all $p$-values are tiny, suggesting that each predictor has a statistically significant linear relationship with the response variable.

However, we should be careful with our conclusions. The very small $p$-values can also be explained by the large sample size (>20000 observations).
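
For reference, the Wald $p$-value is just the two-sided standard normal tail probability of the Z statistic, $2(1 - \Phi(|z|))$. A minimal stdlib-only check, using the z score of host_trackbacks_avg from the fixed effects summary above ($\Phi$ is expressed via Math.erfc):

```ruby
# two-sided Wald p-value from a Z statistic:
# p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)),
# since Phi(x) = 0.5 * erfc(-x / sqrt(2)) for the standard normal CDF
def wald_p_value(z)
  Math.erfc(z.abs / Math.sqrt(2.0))
end

# z score of host_trackbacks_avg from the fixed effects summary
p_val = wald_p_value(6.886374141630261)
puts p_val  # close to the reported 5.723421736547607e-12
```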

We can also look at Wald confidence intervals for the fixed effects coefficient estimates, which are in general more informative than $p$-values.


In [15]:
conf_int = model_fit.fix_ef_conf_int(level: 0.95, method: :wald)
ci = Daru::DataFrame.rows(conf_int.values, order: [:lower95, :upper95], index: model_fit.fix_ef_names)
ci[:coef] = model_fit.fix_ef.values
ci


Out[15]:
Daru::DataFrame:47139052686000 rows: 5 cols: 3
                                                 lower95                upper95                   coef
intercept                            0.15161697616422587     0.2272730238890166    0.18944500002662124
log_host_comments_avg                 0.7756904881432087     0.7965552156718614     0.7861228519075351
host_trackbacks_avg                 0.039917962477039035    0.07168055047661814    0.05579925647682859
length                            2.1719996776498967e-05  2.988660283590613e-05  2.580329980620255e-05
has_parent_with_comments_lvl_yes     -0.6625496425736548   -0.31427199935657435   -0.48841082096511457

We observe that none of the 95% confidence intervals contains 0, which suggests high statistical significance of the linear predictors (consistent with the $p$-values above). Also, all of the intervals are rather narrow.
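
These Wald intervals are the familiar estimate $\pm \; z_{0.975} \cdot sd$ construction. A minimal stdlib-only check for host_trackbacks_avg (the 0.975 normal quantile is hard-coded here rather than computed):

```ruby
z975 = 1.959963984540054     # 0.975 quantile of the standard normal
coef = 0.05579925647682859   # host_trackbacks_avg estimate (Out[8])
sd   = 0.008102849965630654  # its standard deviation (Out[9])

# 95% Wald confidence interval: coef +/- z_{0.975} * sd
lower = coef - z975 * sd
upper = coef + z975 * sd
puts format('95%% CI for host_trackbacks_avg: [%.6f, %.6f]', lower, upper)
# should reproduce the interval [0.039918, 0.071681] from Out[15]
```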

Random effects

We can look at the obtained random effects estimates (the values $b_d$ from the above equation, where $d\in\{mo, tu, we, th, fr, sa, su\}$).


In [16]:
puts "Obtained random effects coefficient estimates:"
puts model_fit.ran_ef


Obtained random effects coefficient estimates:
{:intercept_fr=>0.0, :intercept_mo=>0.0, :intercept_sa=>0.0, :intercept_su=>0.0, :intercept_th=>0.0, :intercept_tu=>0.0, :intercept_we=>0.0}

We can also look at the estimated correlation structure of the random effects:


In [17]:
model_fit.ran_ef_summary


Out[17]:
Daru::DataFrame:47139054735200 rows: 1 cols: 1
     day
day  0.0

Interestingly, the estimates of the random effects coefficients and standard deviation are all zero!

That is, we have a singular fit. Thus, our results imply that the variability between different days of the week is not large enough to justify non-zero random effects in this model. Practically, we can conclude that the day of the week on which a blog post is published has no noticeable effect on the number of comments that the blog post will receive in the first 24 hours.

Conclusions

This turned out to be an example of a degenerate model fit (random effects variance estimated to be zero), and we saw that mixed_models can handle degenerate fits very well.

Of course, we learned a lot about the blog post data by doing this analysis. Here is a list of some of the findings.

  • Blog posts hosted on websites with a high average number of comments per blog post also tend to have more comments (obviously).

  • Blog posts which have parent blog posts, seem to have fewer comments (wonder why that is...).

  • The average number of trackbacks per blog post on the hosting website seems to have a very small effect on the number of comments of a given blog post.

  • The blog post text length has an extremely small positive effect on the number of comments.

  • All considered fixed effects predictor variables seem to have a significant influence on the number of comments that a blog post receives in the first 24 hours after publication (according to Wald Z tests).

  • The day of the week on which a blog post is published has practically no influence on the number of comments that the blog post will receive in the first 24 hours after publication.