A look at trends in Hedonometer.org's happiness data using Julia, JuMP, and GLPK.
Hedonometer.org popped onto my radar a couple weeks ago. It's a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.
While I'm not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.
This post examines another approach. It follows the structure of Robert Vanderbei's excellent "Local Warming" project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We'll be using Julia v0.2 with JuMP and GLPK for linear optimization, Gadfly for graphing, and a few other libraries. If you haven't installed Julia and IPython, first do that. Then add all the extra requirements in the Julia shell as shown below.
In [2]:
Pkg.add("IJulia")
Pkg.add("DataFrames")
Pkg.add("Datetime")
Pkg.add("Gadfly")
Pkg.add("JuMP")
Pkg.add("GLPKMathProgInterface")
Pkg.add("HypothesisTests")
Hedonometer.org introduced an API recently, but as this post was started before that happened, we'll work with the CSV data. For the time being, you can still download it here. The ambitious reader is referred to the exercise at the end of the page regarding their API.
We use the readtable function from the DataFrames package to import the data. One oddity that this function doesn't seem to deal with yet is that the header has three columns while the rest of the CSV has two. So we skip the header and add column names in later. x1 is the date column here and x2 is the happiness value, in Hedonometer units.
A brief inspection shows that we are not missing any data, so we don't have to merge in a list of dates with NAs.
In [1]:
using DataFrames
tmpData = readtable("dailyFull.csv", header=false, skipstart=1)
head(tmpData)
Out[1]:
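As a quick sanity check on the claim above that nothing is missing, a minimal sketch (using the x1 and x2 column names assigned by readtable) might look like this:
# A complete series should have no NAs in either the date or value column.
println(nrow(tmpData))
println(any(isna(tmpData[:x1])) || any(isna(tmpData[:x2])))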
Note that the dates are strings. We vectorize the date constructor and use it to build a proper data frame.
In [2]:
import Datetime: date
@vectorize_1arg String date
data = DataFrame(date=date(tmpData[:, :x1]), value=tmpData[:, :x2])
head(data)
Out[2]:
Let's start by making a plot of the raw data. This should look similar to the graph on their website, only not as pretty.
In [4]:
using Gadfly
set_default_plot_size(24cm, 12cm)
colorGradient = Scale.lab_gradient(color("crimson"), color("blue"), color("greenyellow"))
plot(
data,
x = "date",
y = "value",
color = "value",
Guide.xlabel("Date"),
Guide.ylabel("Hedonomoeter.org Units"),
Guide.title("Average Happiness on Twitter"),
Scale.ContinuousColorScale(colorGradient)
)
Out[4]:
The data looks right, so we're off to a good start. Now we have to think about what sort of components we believe are important factors in this index. We'll start with the same ones as in the Vanderbei model: a linear day-over-day trend, sinusoidal seasonal variation over the course of each year, and variation tied to the roughly 10.66-year solar cycle.
To these we'll add a weekly trend, since zooming into the data shows we tend to be happier on weekends than on workdays. In the next section we'll build a model to separate out the effects of these trends on the Hedonometer.org index.
Vanderbei's model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\left\lVert y - \hat{y} \right\rVert_2^2$ is minimized.
The LAD model is similar in that it takes in the same data, but instead of minimizing the squared $L^2$ norm of the residual vector, it minimizes its $L^1$ norm. Thus we penalize our model using the absolute values of its errors instead of their squares, which makes the LAD model more robust, that is, less sensitive to outliers in our input data.
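Written out with the residuals $y - Ax$, the two models minimize different objectives:

$$
\text{least squares:} \quad \min_x \; \sum_{i=1}^m \left( y_i - (Ax)_i \right)^2,
\qquad
\text{LAD:} \quad \min_x \; \sum_{i=1}^m \left| y_i - (Ax)_i \right|.
$$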
Using a robust model with this data set makes sense because it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we're going to ignore that detail for now. After all, not every day is Christmas.
We formulate our model below using JuMP with GLPK as the solver. The code works by defining a set of variables, coefficientVars, that the solver sets to the optimal values of $x$. For each observation we compute a row of the $A$ matrix with the following components: a constant 1 multiplied by the trend line's offset, the day number for the linear daily trend, cosine and sine terms with a one-year period for seasonal variation, cosine and sine terms with a roughly 10.66-year period for the solar cycle, and cosine and sine terms with a seven-day period for day-of-week variation.
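In equation form, the fitted value for day $a_1$ is the dot product of this row with the coefficient vector $x$:

$$
\hat{y}(a_1) = x_1 + x_2 a_1
+ x_3 \cos\frac{2\pi a_1}{365.25} + x_4 \sin\frac{2\pi a_1}{365.25}
+ x_5 \cos\frac{2\pi a_1}{10.66 \times 365.25} + x_6 \sin\frac{2\pi a_1}{10.66 \times 365.25}
+ x_7 \cos\frac{2\pi a_1}{7} + x_8 \sin\frac{2\pi a_1}{7}
$$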
We then add a nonnegative variable representing the residual, or error, of the fitted model for each observation. A pair of linear constraints forces each of these variables to be at least the absolute value of its error, and minimizing their sum drives them down to equality.
Solving this model gives us a set of eight coefficients. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record appears to be from Wednesday, September 10, 2008.)
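Putting it all together, the linear program we hand to GLPK is the standard reformulation of least absolute deviations, with one nonnegative residual variable $r_i$ per observation:

$$
\begin{aligned}
\min_{x,\, r} \quad & \sum_{i=1}^m r_i \\
\text{subject to} \quad & r_i \ge (Ax)_i - y_i, \\
& r_i \ge y_i - (Ax)_i, \qquad i = 1, \dots, m.
\end{aligned}
$$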
In [7]:
using JuMP
using GLPKMathProgInterface
function buildPredictor(d)
m = Model(solver=GLPKSolverLP())
# Define a linear variable for each of our regression coefficients.
# Note that by default, JuMP variables are unrestricted in sign.
@defVar(m, coefficientVars[1:8])
# Residuals are the absolute values of the errors between our
# observed and fitted values; they are defined as variables further below.
# This builds rows for determining fitted values. The first value is
# 1 since it is multiplied by our trend line's offset. The other
# values correspond to the trends described above. Each sinusoidal
# element contributes a cosine and a sine term, each with its own coefficient.
function buildRowConstants(a1)
[
1, # Offset
a1, # Daily trend
cos(2pi*a1/365.25), # Seasonal variation
sin(2pi*a1/365.25), #
cos(2pi*a1/(10.66*365.25)), # Solar cycle variation
sin(2pi*a1/(10.66*365.25)), #
cos(2pi*a1/7), # Weekly variation
sin(2pi*a1/7) #
]
end
# This builds a linear expression as the dot product of a row's
# constants and the coefficient variables.
buildRowExpression(a1) = dot(buildRowConstants(a1), coefficientVars)
# Add a variable representing the absolute value of each error.
@defVar(m, residuals[1:nrow(d)] >= 0)
for a1 = 1:nrow(d)
fitted = buildRowExpression(a1)
# Linear way of representing residual >= |observed - fitted|
@addConstraint(m, residuals[a1] >= fitted - d[a1,:value])
@addConstraint(m, residuals[a1] >= d[a1,:value] - fitted)
end
# Minimize the total sum of these residuals.
@setObjective(m, Min, sum(residuals))
solve(m)
# Return the model coefficients and a function that predicts happiness
# for a given day, by index from the start of the data set.
coefficients = getValue(coefficientVars)[:]
# The prediction is simply the dot product of a day's row constants with
# the fitted coefficients.
predictHappiness(a1) = dot(buildRowConstants(a1), coefficients)
return coefficients, predictHappiness
end
coefficients, predictor = buildPredictor(data)
coefficients
Out[7]:
The optimal values for $x$ are output above. The second value is the change in happiness per day. We can see from this that there does seem to be a small positive trend.
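To put that trend on a more interpretable scale, we can multiply the daily coefficient by the length of a year. This is only a rough sketch; annualTrend is just an illustrative name, and it assumes the coefficients vector returned above.
# coefficients[2] is the estimated change in happiness per day. Scaling it by
# the average number of days in a year gives an approximate annual change in
# Hedonometer units.
annualTrend = coefficients[2] * 365.25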
We can call our predictor function to obtain the fitted happiness level for any day number starting from September 10, 2008.
In [8]:
predictor(1000)
Out[8]:
We now have a set of coefficients and a predictive model. That's nice, but we'd like to have some sense of a reasonable range on our model's coefficients. For instance, how certain are we that our daily trend is really even positive? To deal with these uncertainties, we use a method called bootstrapping.
Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.
First step: compute the errors between the observed and fitted data. We'll construct a new data frame that contains everything we need to construct fake data.
In [15]:
# Compute fitted data corresponding to our observations and their associated errors.
fitted = DataFrame(
date = data[:, :date],
observed = data[:, :value],
fitted = map(x -> predictor(x), 1:nrow(data)),
error = map(x -> predictor(x) - data[x, :value], 1:nrow(data))
)
describe(fitted)
Note that the median for our errors is exactly zero. This is a good sign.
Now we build a function that creates a fake input data set from the fitted values and randomly selected errors. That is, for each observation we add an error, sampled with replacement from the full set of errors, to its corresponding fitted value. Once we've done that for every observation, we have a complete fake data set.
In [17]:
function buildFakeData(fitted)
errorIndexes = rand(1:nrow(fitted), nrow(fitted))
DataFrame(
date = fitted[:, :date],
value = map(x -> fitted[x, :fitted] - fitted[errorIndexes[x], :error], 1:nrow(fitted))
)
end
# Plot some fake data to see if it looks similar.
plot(
buildFakeData(fitted),
x = "date",
y = "value",
color = "value",
Guide.xlabel("Date"),
Guide.ylabel("Hedonomoeter.org Units"),
Guide.title("Average Happiness on Twitter (Fake)"),
Scale.ContinuousColorScale(colorGradient)
)
Out[17]:
Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.
Now we generate 1,000 fake data sets and run them through our optimization function above. This generates 1,000 sets of model coefficients and then computes $2\sigma$ confidence intervals around them.
The following code took just under an hour on my machine. If you're intent on running it yourself, you may want to get some coffee or a beer in the meantime.
In [18]:
using HypothesisTests
coeffData = [buildPredictor(buildFakeData(fitted))[1] for i = 1:1000]
confidence2Sigma = Array((Float64, Float64), length(coefficients))
for i = 1:length(coefficients)
sample = float64([x[i] for x in coeffData])
confidence2Sigma[i] = ci(OneSampleTTest(sample))
end
confidence2Sigma
Out[18]:
From the above output we can tell that we do seem to be trending happier, with a daily trend of 9.88082e-5 Hedonometer units and a 95% confidence interval on that trend of (9.833341267642056e-5, 9.891872644950681e-5). Cool!
Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.
In [88]:
plot(
fitted,
x = "date",
y = "fitted",
Guide.xlabel("Date"),
Guide.ylabel("Hedonomoeter.org Units"),
Guide.title("Expected Happiness on Twitter"),
Geom.line
)
Out[88]:
Now we take a look at what a typical week looks like in terms of its effect on our happiness. As September 10, 2008 was a Wednesday, we index Sunday starting at 5. The resulting graph highlights what I think we all already know about the work week.
In [87]:
# Day-of-week effect: the weekly cosine and sine terms are the seventh and
# eighth coefficients in the model.
daily(a1) = coefficients[7]*cos(2pi*a1/7) + coefficients[8]*sin(2pi*a1/7)
plot(
x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"],
y = map(daily, [5, 6, 7, 1, 2, 3, 4]),
Guide.xlabel("Day of the Week"),
Guide.ylabel("Hedonomoeter.org Units"),
Guide.title("Variation in Happiness on Twitter Due to Day of the Week"),
Geom.line,
Geom.point
)
Out[87]:
That's it for this analysis. We've learned that, for the time being at least, we seem to be trending happier. Which is pretty cool. Of course, 6 years is not a huge amount of time for this sort of data. Let's try doing the analysis again in 10 or 20.
Will anyone still be using Twitter then?...
The particularly ambitious reader may find the following exercises interesting.
Hedonometer.org introduced an API while this post was already in progress. Change the code at the beginning to read from this instead of a CSV.
The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls buildPredictor so it runs concurrently.
Some of the happier days in the Hedonometer.org data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
Try the same analysis using a least-squares model. Which model is better for this data?