Just like p values, standardized effect sizes provide another of psychologist's rituals that helps to compare academic performance but provides little toward the goal of reverse-engineering human mind. Large effect size is good, note-worthy. The findings with large effect sizes are likely replicable and as such robust. At the same time effect sizes make it impossible to compare results across different experiments. They make it impossible to compare the results of the replications of the same experiment. And they even make it impossible to compare different conditions of the same study. No wonder that discussion sections of published reports just skip over effect sizes. Even if effect sizes are reported they are not discussed and they are not used to support interpretation and conclusions. Standardized effect sizes are confusing and people are confused by them. Up to the point, that some researcher give up on their original noble purpose of allowing comparisons across studies (as exemplified by their use in a meta-analysis). At least Uri Simonsohn seems to be giving up on this noble goal in his most recent blog post. Simonsohn concludes that there is no single effect size because effect sizes are not comparable across studies.

Why are effect sizes not comparable across studies? How can we make them comparable?

Simonsohn's issue with effect sizes is primarily a problem with standardized effect sizes. Consider an example. Researcher A wants to know whether Claire runs faster than David. Researcher A compares how long it takes for Claire and David to run 30 meters. Claire takes 8 seconds. In contrast it takes 5 seconds for David. Then, assuming standard deviation $s=1$, the effect size is $d=(8-5)/1=3$. Researcher B also measures Claire's and David's run time. In B's experiment they run 80 meters. Claire takes 16 seconds while David takes 10 seconds giving $d=(16-10)/1=6$. How should we compare these effect sizes? What is the true effect size? Should we average the effect sizes (to get 4.5) or take one of other Simonsohn's suggestions:

Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?

The answer is perfectly simple. Avoid standardizing and thus remove the distance confound. Instead, by regressing the time with respect to duration we obtain the quantity that we are interested in. It's called speed. David runs at 8 m/s. Claire runs at 5 m/s. The difference is 3 m/s - this is the number we are interested in. Amazingly, this number is identical across both studies (A and B).

(The argument above assumes that $s=1$ when computing standardized effect size in both A and B. This is unlikely and $s$ probably increases with distance. However, what matters for the argument is only that $s$ does not linearly increase with distance.)

The above procedure is perfectly general. To obtain effect size, we define the function $Y=f(X)$ that describes the relation between the manipulation $X$ and the measure $Y$. To define the function we need both its functional form (is it $Y=log(X)$, $Y=b+aX$ or $Y=b+aX^2$) and the estimates of the parameters (e.g. $a$,$b$). We usually use generalized linear models $Y=f(b+aX)$. Then $a$ characterizes the relationship between $X$ and $Y$ and can be considered as *the* effect size.

For the above procedure it is necessary that we know what $Y$ and $X$ are and what quantity in what units they describe. Social psychologists are especially prone to use unvalidated measures and unvalidated, unstandardized manipulations. If one condition uses pictures and another condition uses sentences and you don't have a theory that would quantify how the diverse stimuli relate to your $X$, then you are lost. If you don't know what you are measuring, if you don't know what your stimulus does, then no amount of statistical sophistry will make the data interpretable and comparable with other studies. There are two ways out of this misery. Either you do the hard work and you validate your measures and stimuli, or you use established measures that describe established constructs. There is no third way.

Confounds may change the interpretation of results and they may change the interpretation of effect sizes. If the confound set differs from study to study, the effect sizes from the different studies may be no longer comparable - even when the two rules mentioned so far are adhered to.

Confounds can be easily added to the model formulation $Y=f(X,C_1,C_2,...,C_n)$. What variables should we include as confounds? This can be determined based on researcher's causal assumptions and with the help of Pearl's calculus for causal inference (Pearl, 2009). If the causal assumptions differ across studies then of course the studies are not comparable.

If you adher to the above three rules, you will be able to always get the same effect size (sans estimation error) across all studies. There is no need for averaging or weighting. If you wish to pool data from different studies to get more precise estimate, you would merge the models from all studies into a single model and use the big model to estimate effect size based on all data.

If you forfeit all the rules you end up in a proto-theoretical limbo such as the one that characterizes the data analysis in social psychology. Simonsohn specifically asked in his blog post, what is the effect size of anchoring. Anchoring is when people's numerical estimates are biased towards arbitrary starting points. Anchoring is a good example of a theoretical zombie created by a science that doesn't know what it is measuring or what it's stimulus does. Here are few examples from wikipedia:

The anchoring and adjustment heuristic was first theorized by Amos Tversky and Daniel Kahneman. In one of their first studies, participants were asked to compute, within 5 seconds, the product of the numbers one through eight, either as $1 \times 2 \times 3 \times 4 \times 5 \times 6 \times 7 \times 8$ or reversed as $8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1$. Because participants did not have enough time to calculate the full answer, they had to make an estimate after their first few multiplications. When these first multiplications gave a small answer – because the sequence started with small numbers – the median estimate was 512; when the sequence started with the larger numbers, the median estimate was 2,250. (The correct answer was 40,320.) In another study by Tversky and Kahneman, participants observed a roulette wheel that was predetermined to stop on either 10 or 65. Participants were then asked to guess the percentage of the United Nations that were African nations. Participants whose wheel stopped on 10 guessed lower values (25% on average) than participants whose wheel stopped at 65 (45% on average). The pattern has held in other experiments for a wide variety of different subjects of estimation.

Can we apply our rules to anchoring? At first blush, it is not clear what the manipulations across these two studies have in common. What is the $X$? In both experiments there is a strong uncertainty and some decoy that exploits this uncertainty and biases the final decision. So we may take some indicator of the amount of uncertainty as $X$. We may also wish to factor in the plausibility of the decoy as $C$. Then, as $Y$, we take some measure of deviation of the subject's answer from the truth.

This is all fine, but how should we relate the experimental manipulations to $X$? How does the time pressure $P$ (in seconds) from the first experiment relate to subject's uncertainty $X$? Note, we can easily define $X$ formally, e.g. as the entropy of prior probability distribution over the space of all possible answers. But the problem remains. How does $P$ relate to $X$? From the above experiment, we don't know. However it would be easy to add experiments without decoy that quantify subject's prior uncertainty by letting subject bet on his prior answer. Then we can relate $P$ to $X$. And after similarly clearing up what $Y$ measures with additional experiments we can relate $X$ to $Y$.

So, we can get a clear definition of $X$ and $Y$ and get an effect size that describes their relationship. Unfortunately our above definition of X and Y does not describe anchoring but context effects in general. It is applicable to non-numerical context effects such as attractiveness, similarity or compromise. It is equally applicable to linguistic context effects such as the garden path effect "horse raced past the barn fell". There is no justification for the "numerical" qualifier in the definition of anchoring. As such there is no justification for the definition of anchoring as a separate context effect. Anchoring is not a theoretical construct that relates a mass of findings to a cognitive mechanism - a cognitive mechanisms that would explain the findings. Anchoring is a basket for sorting findings based on their superficial similarity. Just like biologists before Darwin sorted plants and organisms into groups without deeper understanding of the mechanisms that produce them, so is social psychology still in a proto-theoretical state of sorting phenomena into groups based on their superficial similarity without deeper understanding of the cognitive processes that create them. Experiments with numbers go in this group. Experiments with sentences go in that group. There is no effect size for anchoring, because effect sizes describe theorized mechanisms, not random heaps of findings.