W. Bond's "Reflections of an Amateur": Science and Statistics

Here is a link to a very well-done piece in a recent edition of the Economist: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble. It summarizes recent controversies in science, ranging from the well-publicized psychology “priming” literature debacle to the equally well-publicized biomedical/oncology bench research problems.

“Public-choice” theory in economics rigorously studies the incentives for people in government – not a new idea, admittedly, but one that was generally ignored by progressives who assumed experts could dispassionately administer a non-partisan, scientific, ever-expanding state, largely free from bias. Now, we are witnessing an increased realization that working scientists also respond to the very real incentives at play in the grant process – both directly financial and more generally in terms of career advancement. This is not a surprising finding, given human nature.

That said, I want to emphasize and think through the math/theory behind one specific part of the article, namely the part on the “positive predictive value” of studies. For the math here, bear in mind that I am assuming bias-free, perfectly conducted research. The section I am referring to is the one describing the Ioannidis paper and its accompanying figure.

Normally what we think of as statistical significance is a “p” value less than .05, loosely read as meaning that the odds of the “positive” results being due to chance are less than 5%. (Strictly, the p value is the probability of seeing a result at least this extreme if there were no real effect.) This is similar (but not exactly the same, evidently [see below at bottom]) to the false positive error rate. The false positive error rate (alpha, or type 1 error) is the analog of 1 - specificity for a test: false positives/(true negatives + false positives). Specificity itself is true negatives/(true negatives + false positives).

“Power” is the analog of sensitivity: true positives/(true positives + false negatives). Power is the probability that a study reaches a p value of less than .05 when a real difference of a given expected magnitude exists between two groups, and it is what drives the sample size needed. Much like a p value of less than .05 is considered standard, a power of .8 (or greater) is considered standard. That is a false negative rate of 20% against a false positive rate of 5%: we accept four times as many false negatives as false positives, goes the thinking.

But much like with sensitivity and specificity in diagnostic testing, what we are often most interested in is neither the sensitivity nor the specificity but the positive and negative predictive values. These depend on both the sensitivity/specificity and the ever-important prevalence.

Take a diagnostic test that is both 90% sensitive and specific. If the prevalence of the disease in a population is only 10%, and all are tested, then the positive predictive value is only 50% (negative predictive value is 99%). The math here is easy. If there are 100 patients, 10 have disease and 90 are healthy. Of the ten, there are 9 positive tests and one false negative test (sensitivity of 90%). Of the ninety healthy, there are 81 negative tests and nine false positives (specificity of 90%). Of the 18 positive tests, half are false positives.

Now, if the prevalence is 50%, the positive and negative predictive values are both 90%. Much better, without changing either sensitivity or specificity.
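The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `predictive_values` is my own, and it assumes everyone in the population is tested, as in the example.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a whole population.

    All three inputs are fractions between 0 and 1.
    """
    true_pos = sensitivity * prevalence                 # diseased, test positive
    false_neg = (1 - sensitivity) * prevalence          # diseased, test negative
    true_neg = specificity * (1 - prevalence)           # healthy, test negative
    false_pos = (1 - specificity) * (1 - prevalence)    # healthy, test positive

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# The two scenarios from the text: a 90%/90% test at 10% and 50% prevalence.
for prevalence in (0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 10% prevalence the negative predictive value works out to 81/82, which is where the rounded 99% comes from.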

Now to the studies (and not just in medicine and biomedical research), as opposed to diagnostic tests. In the link above, the “prevalence” is 10%. That is, the assumption is that only about 10% of hypotheses studied are likely to be true. The argument here is that scientists want to generate interesting, groundbreaking results, and such hypotheses are long shots. In this case, with a p value of less than .05 and a power of .8, the positive predictive value of all “statistically significant” results is 64%. It gets worse if the power is lower, as may often be the case. This also is prior to considering the possibility of sloppy research, fraud, and all of the unconscious bias that may accompany the scientists’ work.
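To see where the 64% comes from, here is the same kind of head count as in the diagnostic test example, applied to studies. This is a rough sketch: the round figure of 1,000 hypotheses, the function name `count_study_outcomes`, and the lower power values of .4 and .2 are my own choices for illustration, and alpha is treated as the false positive rate.

```python
def count_study_outcomes(n_hypotheses, prior, alpha, power):
    """Count true and false positives among tested hypotheses.

    prior -- fraction of tested hypotheses that are actually true
    alpha -- false positive rate (the .05 threshold)
    power -- chance a true hypothesis yields a significant result
    """
    n_true = n_hypotheses * prior
    n_false = n_hypotheses - n_true
    true_positives = power * n_true       # real effects that are detected
    false_positives = alpha * n_false     # null effects that look significant
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# Standard power of .8, then two lower (illustrative) values.
for power in (0.8, 0.4, 0.2):
    tp, fp, ppv = count_study_outcomes(1000, prior=0.10, alpha=0.05, power=power)
    print(f"power {power}: {tp:.0f} true positives, "
          f"{fp:.0f} false positives, PPV {ppv:.0%}")
```

With a power of .8 this gives 80 true positives against 45 false positives, so 80 of 125 significant results are real. The false positive count stays at 45 as the power drops, so the share of real findings falls with it.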

So, clearly, in assessing the validity of results, we should be interested not only in p values, but in the “prevalence” or “pre-test” probability that results are likely to be true. This is, in the end, an unquantifiable number. And, since most “negative” studies are not published (unless they are interesting for being negative), this problem is compounded. Which means, of course, that the positive predictive value of any body of research taken in total is unknown! The fact that it may be more like 50-60% (or less) rather than 95% should be kept in mind, however.
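Since the pre-test probability is the unknown here, it may help to see how much the answer moves with it. The sketch below holds alpha at .05 and power at .8 and tries a few priors; the function name `study_ppv` and the particular priors are assumptions of mine, not figures from the article.

```python
def study_ppv(prior, alpha=0.05, power=0.80):
    """Positive predictive value of a 'significant' result.

    prior is the assumed fraction of tested hypotheses that are true.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative priors only; the real value is not known.
for prior in (0.05, 0.10, 0.25, 0.50):
    print(f"prior {prior:.0%}: PPV {study_ppv(prior):.0%}")
```

Even in this idealized, bias-free setting, the positive predictive value only gets close to 95% when about half of the hypotheses tested are true to begin with.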

With inductive reasoning, there are varying degrees of certainty about what is knowledge. Engineering demands greater certainty than does paleontology. In my experience, engineers and physicists, when outside of their area, tend to look for a degree of certainty that is not possible. I don’t know any paleontologists, but I suspect, like economists and psychologists, they tend to draw firmer conclusions about what they know than the data justify.

Which gives me the excuse to again post a favorite quote:

“for it is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits”

Aristotle, Nicomachean Ethics, I.iii.1-4.

P.S.

http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124 Original Ioannidis piece.

One criticism is that he mis-identifies the false positive error rate (alpha) with the p-value. In practice, it seems, these are very similar values, however. Or so I say, as a non-statistician.


Saturday, November 16, 2013
