In Defence of Liberty

Driven by data; ridden with liberty.

Statistics and Lampposts XIII: Effect Sizes

Effect sizes are used to measure the magnitude of a phenomenon. (Edited: Kevin Dooley)

One of the more misunderstood concepts in statistics is that of statistical significance. If the difference between two groups is statistically significant, it simply means that the observed difference is discernible from zero at a specified level of confidence. It does not confer importance or largeness. Whilst it is common to seek answers at 95% confidence, this threshold is arbitrary. Statistical significance also depends upon the sample size: even a minute dissimilarity between two populations will be statistically significant, as long as the samples drawn from them are large enough.
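To see how sample size drives significance, here is a minimal sketch in Python. It assumes two equal-sized groups and a normal (z) approximation; the function name `two_sided_p_value` and the effect of 0.02 standard deviations are illustrative choices, not drawn from any particular study.

```python
import math

def two_sided_p_value(d, n_per_group):
    """Approximate two-sided p-value for a standardised mean difference d
    between two equal-sized groups, using a normal (z) approximation."""
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))

tiny_effect = 0.02  # hypothetical: two hundredths of a standard deviation

# The effect stays the same; only the number of observations changes.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(tiny_effect, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>9,}: p = {p:.4g} ({verdict} at the 5% level)")
```

The same tiny difference is nowhere near significant with a hundred observations per group, yet it clears the conventional 5% bar comfortably with a million.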

Effect sizes are used to quantify the strength of a phenomenon, describing the magnitude of difference between two groups. A nicotine replacement therapy may claim that it will cut the number of cigarettes smoked per day by 10. A weight loss programme may claim that it will reduce the participant’s body weight by half a stone over the course. These are unstandardised effect sizes, expressed in the original units, and they say nothing about the population’s variability: it could be that half of the dieters lose a stone, whilst the other half lose no weight at all.
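The dieting example can be sketched with made-up numbers. Both lists below are hypothetical weight losses in pounds (half a stone is 7 lb); they share the same average, but the spread differs greatly.

```python
import statistics

# Hypothetical weight losses in pounds for two invented programmes.
uniform_losses = [6, 7, 8, 7, 6, 8, 7, 7]    # everyone loses about half a stone
split_losses = [14, 14, 14, 14, 0, 0, 0, 0]  # half lose a stone, half lose nothing

for name, losses in (("uniform", uniform_losses), ("split", split_losses)):
    mean_loss = statistics.mean(losses)
    spread = statistics.stdev(losses)  # sample standard deviation
    print(f"{name}: mean loss = {mean_loss} lb, standard deviation = {spread:.2f} lb")
```

Both programmes could honestly advertise an average loss of half a stone, which is why the average alone is an incomplete description.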

Standardised effect sizes are used when there is no inherent meaning to the values, or as a common measure to compare effect sizes, as in the meta-analysis of various studies. An example of a standardised effect size is Cohen’s d: the difference between two means divided by a pooled standard deviation. Unlike correlation coefficients, which must be between 1 and -1, representing perfect positive and perfect negative correlation respectively, Cohen’s d can take any value. Jacob Cohen was a major influence in this area of study, and established a common convention for interpreting effect sizes. If the absolute effect size is around 0.2, the effect is ‘small’; if the size is about 0.5, the effect is ‘medium’; the effect is ‘large’ if the size is near 0.8; and ‘very large’ if the standardised size is even bigger than 1.3.
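As a rough illustration, Cohen’s d can be computed in a few lines of Python. This sketch assumes the usual pooled standard deviation built from the two sample variances; the function name `cohens_d` and the cigarette counts are invented for the example.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Difference between two means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical cigarettes smoked per day in two small groups.
therapy = [12, 18, 9, 15, 11]
placebo = [16, 21, 13, 19, 16]

d = cohens_d(therapy, placebo)
print(f"Cohen's d = {d:.2f}")  # negative here: the therapy group smokes less
```

The sign only records which group has the larger mean; the conventional labels apply to the absolute value.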

Other scholars argued vigorously against these “t-shirt sizes”. The statistician Gene Glass wrote:

There is no wisdom whatsoever in attempting to associate regions of the effect size metric with descriptive adjectives such as ‘small’, ‘moderate’, ‘large’ and the like. Dissociated from a context of decision and comparative value, there is little inherent value to an effect size of 3.5 or 0.2. Depending on what benefits can be achieved at what cost, an effect size of 2.0 might be ‘poor’ and one of 0.1 might be ‘good’.

Cohen responded to these concerns:

The terms ‘small’, ‘medium’ and ‘large’ are relative, not only to each other, but to the area of behavioural science or even more particularly to the specific content and research method being employed in any given investigation.

When comparing published studies, it is important to recognise that effect sizes are often subject to publication bias. Although even a null result adds a little to human knowledge, researchers are usually reluctant to publish studies which show small or minute effect sizes: finding nothing might seem unimpressive. This means aggregate effect sizes may overstate the actual effect of the studied phenomenon.
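A small simulation gives a feel for the size of this distortion. Everything here is an assumption made for illustration: the true effect of 0.2, the 20 participants per group, the 2,000 studies, and the crude rule that only results significant at the 5% level (by a normal approximation) get published.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 0.2   # hypothetical true standardised effect
N_PER_GROUP = 20    # hypothetical sample size per group in each study
N_STUDIES = 2000    # hypothetical number of studies carried out

def observed_effect():
    """Simulate one two-group study and return its observed Cohen's d."""
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    # Equal group sizes, so the pooled variance is the average of the two.
    pooled_sd = math.sqrt((statistics.variance(control) + statistics.variance(treated)) / 2)
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

def is_significant(d):
    """Two-sided test at the 5% level, using a normal approximation."""
    return abs(d) * math.sqrt(N_PER_GROUP / 2) > 1.96

all_effects = [observed_effect() for _ in range(N_STUDIES)]
published = [d for d in all_effects if is_significant(d)]  # the assumed filter

print(f"true effect:              {TRUE_EFFECT}")
print(f"mean of all studies:      {statistics.mean(all_effects):.2f}")
print(f"mean of published only:   {statistics.mean(published):.2f}")
print(f"share of studies published: {len(published) / N_STUDIES:.0%}")
```

Averaging every study recovers something close to the true effect, whereas averaging only the published ones gives a much larger figure, because small studies can only reach significance when they happen to observe a large effect.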



This entry was posted on July 22, 2014 in Statistics.