

Depression drugs might work better than we think (or worse) because depression scales have a severe flaw
Scott Alexander has a post up arguing that we may be underestimating how big the effect size of anti-depression drugs is. But there’s another reason to think we’re misestimating the effect, one that cuts to the heart of how we measure depression and a lot of other things.
If you’ve ever filled out a psychometric questionnaire you’ve probably encountered questions like:
“I love to spend time with people”
With an answer scale of, say, 1 to 7, perhaps with 1 being “strongly disagree” and 7 being “strongly agree”.
There is a long-standing statistical and philosophical debate about whether we can assume that, say, the gap between 2 and 3 is the same size as the gap between 4 and 5. It matters because, if we can’t, a lot of the statistics we’d normally do - like looking at the average response - won’t produce meaningful results.
My personal view is that, on the balance of evidence, any difference in the size of the gaps between the possible responses on a Likert scale isn’t large enough to endanger the conclusions we draw from statistics - e.g. comparing the average happiness of countries. Thus averaging Likert scales is not too dangerous.
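To make the worry concrete, here’s a minimal sketch in Python. The mapping from 1-7 responses to underlying welfare is invented purely for illustration (as are the two groups of responses); it just shows how unequal gaps can distort a comparison of group means:

```python
# Minimal sketch: how unequal gaps between response options can distort a
# comparison of group means. The mapping from 1-7 responses to "true welfare"
# below is invented purely for illustration.
true_welfare = {1: 0, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13}  # gaps widen at the top

group_a = [3, 4, 4, 5, 5]  # hypothetical responses from two groups
group_b = [4, 5, 5, 6, 6]

def mean(xs):
    return sum(xs) / len(xs)

raw_gap = mean(group_b) - mean(group_a)                    # gap in raw scores
true_gap = (mean([true_welfare[x] for x in group_b])
            - mean([true_welfare[x] for x in group_a]))    # gap in "true" welfare

print(f"gap in mean scores:    {raw_gap:.2f}")   # 1.00
print(f"gap in 'true' welfare: {true_gap:.2f}")  # 2.20
# If the gaps widen toward the top of the scale, raw-score averages understate
# the underlying difference; if they narrow, they overstate it.
```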
But for some reason, depression scales are done differently. They’re not Likert scales with a number that represents an abstract degree of “agreement”; instead, they often look like these items from the HAM-D:
DEPRESSED MOOD (sadness, hopeless, helpless, worthless)
0|__|Absent.
1|__|These feeling states indicated only on questioning.
2|__|These feeling states spontaneously reported verbally.
3|__|Communicates feeling states non-verbally, i.e. through facial expression, posture, voice and tendency to weep.
4|__|Patient reports virtually only these feeling states in his/her spontaneous verbal and non-verbal communication.
INSOMNIA: EARLY IN THE NIGHT
0|__|No difficulty falling asleep.
1|__|Complains of occasional difficulty falling asleep, i.e. more than 1⁄2 hour.
2|__|Complains of nightly difficulty falling asleep.
FEELINGS OF GUILT
0|__|Absent.
1|__|Self reproach, feels he/she has let people down.
2|__|Ideas of guilt or rumination over past errors or sinful deeds.
3|__|Present illness is a punishment. Delusions of guilt.
4|__|Hears accusatory or denunciatory voices and/or experiences threatening visual hallucinations.
There is no way in hell that the gaps between these possible answers are equal in size. Each answer option is more severe than the previous, but how much more severe? The size of the gaps varies unsystematically between the different questions. Does anyone really think we can say that going from no depressed mood to “feelings indicated only on questioning” is one-quarter as much of a jump as going all the way to “patient reports virtually only these feelings”? That the jump between “present illness is a punishment” and “threatening visual hallucinations” is the same as the jump between “absent” and “self-reproach”? Despite this, we do statistics as if we could compare the average score between groups! What do such averages even mean?
My guess, looking at these questions, is that the scale is convex. The jumps get bigger and bigger as you go up the scale. This would suggest that differences between groups (e.g. control and an experimental group) are likely bigger than they appear.
If that’s right, then the effect sizes of drugs and therapies on depression are being systematically underestimated. Differences in effect sizes - e.g., between placebo and drug - are being underestimated too. Statistical power will also be reduced, relative to a hypothetical test that is truly linear in symptoms.
But I could be wrong; maybe the gaps instead get smaller in terms of the real increase in suffering they represent. If that’s the case, the drugs are less effective than they appear. It’s also possible the scale is linear after all: perhaps the gaps between individual responses aren’t equal, but things wash out to roughly linear at the aggregate level. Personally, I would be surprised if it were linear, but who knows?
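Here is a toy illustration of the convex case, with invented numbers (none of this comes from real trial data). As a stand-in for “the jumps get bigger as you go up”, suppose true suffering grows like score^2; then the drug-placebo gap looks larger in suffering units than in score units, and a concave mapping would flip the comparison:

```python
# Toy illustration of the convexity point. All numbers are invented; the
# mapping score -> suffering is a stand-in assumption, not an estimate.

def suffering(score: float, exponent: float = 2.0) -> float:
    """Hypothetical convex mapping from scale score to underlying suffering."""
    return score ** exponent

baseline = 24       # hypothetical mean score at study entry
placebo_post = 19   # hypothetical mean score after placebo
drug_post = 15      # hypothetical mean score after the drug

score_gap = placebo_post - drug_post
suffering_gap = suffering(placebo_post) - suffering(drug_post)

# Express each drug-placebo gap as a share of the baseline level on its own scale.
print(f"gap in score units:     {score_gap} "
      f"({score_gap / baseline:.0%} of baseline)")
print(f"gap in suffering units: {suffering_gap:.0f} "
      f"({suffering_gap / suffering(baseline):.0%} of baseline)")
# -> 4 points (17% of baseline) vs 136 units (24% of baseline).
# With a concave mapping (exponent < 1) the comparison flips: the drug would
# look less effective in suffering units than the scores suggest.
```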
Depression scales seem to be particularly bad for the reasons outlined- the jumps seem varied and arbitrary. However, despite its critical importance to almost everything in the social sciences, almost no one is investigating the relationship between scores and the quantities they measure. Is happiness, for example, a linear or non-linear function of a self-rated happiness score? The answer to this question is critical to anyone who cares about the general welfare, but with honorable exceptions, the topic has largely been left on the shelf.
A modest suggestion for a very partial solution: Key question distribution
Here’s one strategy that might help mitigate both my and Scott’s concerns about how we currently measure effects on depression. I don’t intend it as a criterion for accepting or rejecting treatments for depression, but it does seem like a sensible way to give patients, physicians, and the general public an intuitive overview of the clinical efficacy of a treatment.
Pick 1-3 “core questions” - preregistered before the study - that track the core symptomatology of depression especially well. For example, we could take the MADRS question about reported sadness, which has the following answer options:
Occasional sadness in keeping with circumstances
Sad or low but brightens up without difficulty
Pervasive feelings of sadness or gloominess; mood still influenced by external circumstances
Continuous or unvarying sadness, misery, or despondency
(I’ve left out the “worsening” sub-options).
Now determine, in each study condition, what percentage of people are best described by each possible answer pre- and post-treatment, and present it in a diagram. This information will mean a lot more to people than a score.
This would allow us to say, for example, that if you currently have pervasive feelings of sadness or gloominess, then based on the data, and if nothing else is known about you, you have a one-third chance of staying the same, a one-third chance of moving to “sad or low but brightens up without difficulty”, and a one-third chance of moving to “occasional sadness in keeping with circumstances” - which would be a pretty typical result from a depression trial. That gives an intuitive way to consider the possible range of effects a drug might have, and it is far easier to understand than a result like “the average score moved from 20 to 15”. It would help with both my concerns about ordinality versus cardinality and Scott’s concerns about how to decide what constitutes a meaningful effect size.
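As a sketch of how such a presentation could be tabulated, here is some illustrative Python. The category labels are the MADRS options quoted above; the paired pre/post responses are made-up placeholders, not real trial data:

```python
# Sketch of the "key question distribution" idea: for each starting category,
# report what share of patients ended up in each category post-treatment.
# The (pre, post) pairs below are invented placeholders.
from collections import Counter

CATEGORIES = [
    "Occasional sadness in keeping with circumstances",
    "Sad or low but brightens up without difficulty",
    "Pervasive feelings of sadness or gloominess",
    "Continuous or unvarying sadness, misery, or despondency",
]

drug_arm = [(2, 1), (2, 0), (2, 2), (3, 2), (3, 1), (2, 1), (1, 0), (3, 3)]

def transition_shares(pairs):
    """Map each starting category to the share of patients ending in each category."""
    by_start = {}
    for pre, post in pairs:
        by_start.setdefault(pre, []).append(post)
    return {
        pre: {post: count / len(posts) for post, count in Counter(posts).items()}
        for pre, posts in sorted(by_start.items())
    }

for pre, row in transition_shares(drug_arm).items():
    print(f"Started at: {CATEGORIES[pre]}")
    for post, share in sorted(row.items()):
        print(f"  {share:.0%} ended at: {CATEGORIES[post]}")
```

The same table for the placebo arm, shown alongside, would give readers the comparison directly, rather than asking them to interpret a difference in average scores.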
Appendix: Some other prominent scales
The HAM-D appears to be one of the most widely used measures, but there are others, and they seem to have the same problem with statistical interpretation.
It’s worth commenting on the PHQ-9, because superficially it looks like it might be somewhat linear. The response options are:
Not at all
Several days
More than half the days
Nearly every day
And it might well be more linear than the others, but trust me, the overall difference in mood between someone who reports:
“Feeling bad about yourself or that you are a failure or have let yourself or your family down”
for “Several days” and someone who reports it for “More than half the days” might be far from a proportional increase, especially as not just duration but intensity varies.
There are questionnaires that look more like a classical Likert scale, e.g. the DASS-21.
But a) it’s not a perfect Likert scale and b) it doesn’t seem to be used as much as the others for depression trials.
Aside: The longer I look at it, the more the HAM-D in particular seems like a bad scale - far too weakly loaded on the core symptoms of depression, the emotional states and absences of feeling that define the condition. The MADRS seems to focus much more closely on these features. This may be why the HAM-D seems to be statistically multidimensional, suggesting it perhaps shouldn’t be collapsed into a single score anyway.
I like the level of detail you're going into here but I think the pattern of effect sizes we see for many conditions - not just depression - indicates some sort of general, conceptual problem with how we're measuring certain kinds of treatment outcomes. I came across this when I was researching my post on naltrexone (https://notpeerreviewed.wordpress.com/2021/05/10/can-we-take-the-devil-out-of-the-bottle-evidence-and-personal-experience-with-naltrexone-for-alcohol-abuse/); most of those studies used seemingly cardinal rather than ordinal outcomes and still showed effect sizes that don't seem to reflect the experiences people report, and I suspect we would see this for many treatments and conditions. I'll have to admit I don't have a good sense of what the answer might be.
What matters is not whether moving from 2 to 3 is as big a change as moving from 3 to 4 on a sub-item. What matters is whether it contributes to the same extent to the sum (and to what the sum represents). This is a subtle but important distinction. Relatedly, you should think more about psychometrics, validity, and reliability, and how they apply to this situation.