Arnold Kling  

The Perils of Published Studies

Jonah Lehrer writes,


The therapeutic power of the drugs appeared to be steadily falling. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineties.

The article is interesting, but I object to much of its tone. For most of the article, Lehrer wants you to believe that there is something mysterious going on in the world, that scientists are baffled, and we no longer know how to arrive at truth.

In my view, what is going on is somewhat more mundane. There are a number of factors that cause unreliable results to sometimes achieve prominence. They all stem from the fact that non-effects get less attention than effects. If you are looking for an effect and you fail to get it, (a) you are more likely to tinker with your experiment*, (b) you are less likely to report your results, and (c) if you report your results you are less likely to get them published.
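To make mechanism (c) concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from Lehrer's article: a hypothetical true effect of 0.2, unit-variance noise, 30 subjects per study, and a journal that prints only estimates that clear a one-sided z-test cutoff of 1.96. Under those assumptions, the studies that get published overstate the true effect, so later replications would look like a "decline."

```python
import random
import statistics


def simulate_published_effects(true_effect=0.2, n_per_study=30,
                               n_studies=2000, z_cutoff=1.96, seed=0):
    """Simulate many small studies and 'publish' only the significant ones.

    All parameter values are hypothetical. Noise is assumed Gaussian with
    known unit variance, so the test below is a simple z-test.
    """
    rng = random.Random(seed)
    standard_error = 1.0 / n_per_study ** 0.5
    all_estimates = []
    published = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        estimate = statistics.fmean(sample)
        all_estimates.append(estimate)
        # The publication filter: only estimates that clear the cutoff appear.
        if estimate / standard_error > z_cutoff:
            published.append(estimate)
    return (statistics.fmean(all_estimates),
            statistics.fmean(published),
            len(published))


all_mean, published_mean, n_published = simulate_published_effects()
print(f"average estimate, all studies:       {all_mean:.3f}")
print(f"average estimate, published studies: {published_mean:.3f}")
print(f"studies published: {n_published} of 2000")
```

Averaged over every study, the estimates sit close to the assumed true effect. Averaged over only the published ones, they sit well above it, because an estimate has to be large to clear the cutoff at this sample size.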

The statistical measure of the significance of a result is known as the p-value. It is supposed to be a measure of the probability that, say, the drug has no effect, given the average effect you found in your sample. The lower the p-value, the more confident you are that the drug has an effect that is not due to chance. The standard rule of thumb is to treat a p-value below .05 as significant. We call .05 the significance level.

[UPDATE: Oh, gosh! Did I write this? I have a bad cold, so forgive me. The p-value measures the probability that you would observe your result, assuming that the drug has no effect. The lower the p-value, the more confident you are that your results were not due to chance.]
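A small worked example of the corrected definition, as a sketch under made-up numbers: suppose 60 of 100 patients improve on a drug, and suppose that with no drug effect each patient would have a 50 percent chance of improving. The one-sided p-value is then the probability of seeing 60 or more improvements if the drug does nothing. The function name and the trial figures are hypothetical.

```python
from math import comb


def one_sided_p_value(successes, trials, null_rate=0.5):
    """Probability of at least `successes` out of `trials` under the null.

    Assumes independent patients who each improve with probability
    `null_rate` when the drug has no effect (a binomial model).
    """
    return sum(
        comb(trials, k) * null_rate ** k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p = one_sided_p_value(60, 100)
print(f"p-value for 60 of 100 improving: {p:.4f}")
```

In this example the p-value comes out to about .028. That is a statement about how surprising the data would be if the drug had no effect. It is not the probability that the drug has no effect.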

But if you look at the actual research process, the true p-value is much higher than what is reported. An imperfect but workable fix would be to standardize on a lower significance level. I think that for most ordinary research, the significance level ought to be set at .001. For blind data-mining exercises, such as looking among lots of genes to find correlations with diseases, the level ought to be lower, perhaps .0001 or less.
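One way to see why the true p-value runs higher than the reported one is to count attempts. The sketch below assumes, purely for illustration, that a researcher tries several independent specifications on data with no real effect and reports whichever one clears the significance level. Real specification searches are correlated, so treat the numbers as a rough guide, not a measurement. The function name is hypothetical.

```python
def chance_of_spurious_finding(alpha, attempts):
    """Chance that at least one of `attempts` independent tests of a true
    null hypothesis comes out significant at level `alpha`."""
    return 1.0 - (1.0 - alpha) ** attempts


for alpha in (0.05, 0.001):
    for attempts in (1, 5, 20):
        rate = chance_of_spurious_finding(alpha, attempts)
        print(f"level {alpha}: {attempts:2d} attempts -> {rate:.3f}")
```

With twenty attempts at the .05 level, the chance of at least one spurious "finding" is about 64 percent. At the .001 level it is about 2 percent.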

The impact of a lower significance level would be to make many small-sample studies unpublishable. (I take it as given that publication will always be biased in favor of studies that find effects. We can't fix that problem.) My guess is that this simple fix would have prevented a number of the scientific embarrassments mentioned in Lehrer's article.

So, if there is a petition calling on researchers to set significance levels at .001 instead of .05, sign my name to it. I think that would be a better norm.

[UPDATE: Robin Hanson really jumps on this topic. He correctly points out that my approach would reduce excess gullibility but risks creating excess skepticism.]

(*Over 25 years ago, Ed Leamer blew the whistle on the phenomenon of specification searches in econometrics, which are an instance of this. It took a while, but his writing has had some influence. As recently as this Spring, the Journal of Economic Perspectives had a relevant symposium, to which Leamer contributed. Russ Roberts talked to Leamer about his work.)



CATEGORIES: Economic Methods



COMMENTS (8 to date)
gabriel rossman writes:

i was shocked recently when i peer-reviewed for a poli sci journal and the paper called p

also of possible interest, here are some older posts of mine directly related to this issue:

JonB writes:

Wow. Just goes to show how the command and control bug can infect any of us if we are not careful.

Most human disease is polygenetic, with risk factors distributed dimensionally although disease expression may appear in the time dimension with a sudden nonlinearity. Small studies, even with marginal categorical p-values, nevertheless play an important role in cueing/directing the larger scientific community down a noisy gradient descent of hypothesis testing.

Small p correlates with large sample e.g. reinforces the prevailing dogma funded by NIMH and me-too drugs in industry. This standard is a great way to get trapped in local minima. We scientists need more noise to shake us out of our established dogmas, not less.

Radford Neal writes:

The statistical measure of the significance of a result is known as the p-value. It is supposed to be a measure of the probability that, say, the drug has no effect, given the average effect you found in your sample.

This is a common misunderstanding of what a p-value is. Actually, the p-value is the probability of getting evidence of the drug having an effect that is as strong as, or stronger than, what you observed in the experiment, if in fact the drug has no effect. If that sounds a bit convoluted, that's because it is, hence the common tendency to interpret p-values as instead meaning something that they don't mean.

To get what you think a p-value is, you need to use Bayesian statistical methods, not the currently-conventional methods behind p-values. A Bayesian method could actually give a probability of the drug having no effect, though only if you input your prior belief that the drug has no effect (and your prior belief about how big the effect is likely to be if it isn't null).

If p-values really did mean what you think they mean, however, requiring a p-value of 0.001, so that a drug that has a 99.8 percent chance of working is abandoned, would make no sense.

MernaMoose writes:

Lehrer wants you to believe that there is something mysterious going on in the world, that scientists are baffled, and we no longer know how to arrive at truth.

If this is coming from the same "scientific" Mother Earth religion that believes the world is about to end -- and we should carbon tax Western Civilization into oblivion -- they're just reaching the natural end of their road.

There's nothing mysterious going on. They don't know how to arrive at truth.

James A. Donald writes:

"They all stem from the fact that non-effects get less attention than effects. If you are looking for an effect and you fail to get it, (a) you are more likely to tinker with your experiment*, (b) you are less likely to report your results, and (c) if you report your results you are less likely to get them published."


Not the problem: The primary factor, as revealed in the climategate files, is barefaced forgery and pal review. The narrower the field, the more centralized the dispensing of money, the more extravagant the forgery, and the more incestuous the peer review.

Demanding higher p values just means more concentrated and centralized funding, which makes it easier for fraudsters to produce any p value they want without risk that anyone might challenge them.

Bill N writes:

I have come across articles written by faculty from Ivy League economics departments that would never have been published had I been a peer reviewer. Part of the problem was naive use of a "p" value, worse was misuse of probability.

In the hard sciences we seldom see the "p" value. Even the confidence interval is uncommon. More likely would be an effect size with a standard deviation and a standard error.

SB7 writes:

I think that for most ordinary research, the significance level ought to be set at .001. For blind data-mining exercises, such as looking among lots of genes to find correlations with diseases, the level ought to be lower, perhaps .0001 or less. [...]

So, if there is a petition calling on researchers to set significance levels at .001 instead of .05, sign my name to it. I think that would be a better norm.

I don't know how things work in medicine or econometrics, but in my field (Computer Science) people tend to report the particular p-value. As such we don't need a centralized standard for what is significant, since readers of articles can tell that one result had a value of p=.008 and another had p=.049. Rather than advocating a new standard, I would say people should reinforce reporting the actual p-value, and discounting results with values too high, rather than lumping things into "statistically significant" and "not statistically significant" categories.

Mike Rulle writes:

Your solution of lowering the p-value to mitigate the various file drawer and multiple comparison problems is not a bad approximation for a new conventional standard. Implicit in your comments is that we cannot trust scientists to be either competent or honest. I don't necessarily disagree. But there are multiple comparison procedures in statistics that provide better ways, albeit also imprecise, to assess the likelihood that null hypothesis rejections occurred merely by chance.

But I agree with your sentiment.

Comments for this entry have been closed