Saturday, January 14, 2017

Null Hypothesis Significance Testing Never Worked

Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals.  Rejection of straw-man null hypotheses leads researchers to believe that their theories are supported, and the unquestioning use of a threshold such as p<0.05 has resulted in hypothesis substitution, search for subgroups, and other gaming that has badly damaged science.  But we seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.

NHST is based on something akin to proof by contradiction.  The best non-mathematical definition of the p-value I've ever seen is due to Nicholas Maxwell: "the degree to which the data are embarrassed by the null hypothesis."  p-values provide evidence against something, never in favor of something, and are the basis for NHST.  But proof by contradiction is only fully valid in the context of rules of logic where assertions are true or false without any uncertainty.  The classic paper The Earth is Round (p<.05) by Jacob Cohen has a beautiful example pointing out the fallacy of combining probabilistic ideas with proof by contradiction in an attempt to make decisions about an effect.
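In symbols (a standard textbook formulation, not Maxwell's or Cohen's notation): if T is a test statistic chosen so that larger values mean data more discordant with the null, and t_obs is its observed value, then

```latex
% The p-value is a probability about hypothetical data, computed
% under H0; it is not a probability that H0 itself is true.
p = \Pr\left( T \ge t_{\mathrm{obs}} \mid H_0 \right)
```

Everything to the right of the conditioning bar is an assumption; the p-value says nothing directly about that assumption.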

The following is almost but not quite the reasoning of null hypothesis rejection:
If the null hypothesis is correct, then this datum (D) cannot occur.
It has, however, occurred.
Therefore the null hypothesis is false. 
If this were the reasoning of H0 testing, then it would be formally correct. … But this is not the reasoning of NHST.  Instead, it makes this reasoning probabilistic, as follows: 
If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely. 
By making it probabilistic, it becomes invalid.  … the syllogism becomes formally incorrect and leads to a conclusion that is not sensible: 
If a person is an American, then he is probably not a member of Congress.  (TRUE, RIGHT?)
This person is a member of Congress.
Therefore, he is probably not an American. (Pollard & Richardson, 1987) 
… The illusion of attaining improbability or the Bayesian Id's wishful thinking error …
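Rough numbers (mine, purely for illustration) show how badly the syllogism fails here. Congress has 535 voting members; take the U.S. population to be about 320 million. Then

```latex
\Pr(\text{not in Congress} \mid \text{American})
  \;\approx\; 1 - \frac{535}{3.2 \times 10^{8}} \;\approx\; 0.9999983,
\qquad \text{yet} \qquad
\Pr(\text{American} \mid \text{in Congress}) \;=\; 1 .
```

The forward conditional is as close to certainty as one could want, yet the conclusion reached by reversing it is guaranteed to be wrong: members of Congress are American by law. Reversing the direction of conditioning is precisely the unlicensed step in the probabilistic syllogism.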
  
Induction has long been a problem in the philosophy of science.  Meehl (1990) attributed to the distinguished philosopher Morris Raphael Cohen the saying "All logic texts are divided into two parts.  In the first part, on deductive logic, the fallacies are explained; in the second part, on inductive logic, they are committed."
Sometimes when an approach leads to numerous problems, the approach itself is OK and the problems can be repaired.  But besides all the other problems caused by NHST (including the need for arbitrary multiplicity adjustments, the need to consider the investigator's intentions and not just her actions, rejection of H0 for trivial effects, incentives for gaming, interpretation difficulties, etc.), it may be that the overall approach is defective and should never have been adopted.

With all of the amazing things Ronald Fisher gave us, and even though he recommended against unthinking rejection of H0, his frequentist approach and his dislike of the Bayesian approach did us all a disservice.  He called the Bayesian method invalid and was possibly intellectually dishonest when he labeled it "inverse probability."  In fact it is the p-value that is an indirect inverse probability, while Bayesian posterior probabilities are direct forward probabilities that do not condition on a hypothesis.  The Bayesian approach has not only been shown to be valid; it actually delivers on its promise.
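The contrast can be written out, using my own notation (θ for the unknown effect, X for the data):

```latex
% NHST: indirect and inverse -- a probability about hypothetical data,
% computed by assuming the hypothesis:
p \;=\; \Pr\left( X \text{ at least as extreme as } x_{\mathrm{obs}} \mid \theta = 0 \right)

% Bayes: direct and forward -- a probability about the effect,
% computed from the data actually observed:
\Pr\left( \theta > 0 \mid X = x_{\mathrm{obs}} \right)
```

The second quantity is the one that answers the question researchers actually ask.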

21 comments:

  1. Hi, Frank.

    Your example from Pollard & Richardson, 1987, is a good one, but I think it does not illustrate the lesson the way they say. In this case, H_0 should be "This person is selected (uniformly) at random from among all Americans." If I tell you that this person is American, do you believe that H_0 is an accurate and complete description of how I chose this person? I think not; you would deem H_0 unlikely, wouldn't you? I think the lesson here is that one needs to think carefully about what H_0 means.

    Best,
    Russ

  2. Thanks for the comment, Russ. I think that Pollard & Richardson were trying to put this in plainer language. And perhaps the fact that you need to think so hard about what H0 is has something to do with the difficulties of frequentist inference.

    Replies
    1. It seems to me that the example is good for showing that the logic does not work, but it does not represent how the p-value is used, because the alternative hypothesis would be "this person is not an American," and nobody would gather data on Congress membership to test such a hypothesis.

    2. The Pollard and Richardson example is very misleading as it implies that a procedure that will occasionally yield a mistaken inference is a bad procedure. By that standard ALL statistical procedures are flawed.

      The logic of P-values as evidence that Fisher explained includes the step that either a rare event has happened or the null is false. In the example the 'observation' is precisely that rare event. I would be concerned if the procedure yielded false inferences when any of the common observations were made, but it doesn't. The example is misleading and should not be used.

    3. I'd like to hear Cohen's take on that. I think that Pollard, Richardson, and Cohen are trying to make a general point: proof by contradiction is only completely well founded in logic when certainty holds. In the case of statistical inference, we have to talk about the probability of getting results more extreme than the one obtained if H0 were true, i.e., about the evidence we've assembled against H0, and ask why that would be of any interest to someone who wants to quantify evidence about a nonzero effect. And the notion of "getting results more extreme" is actually ill-defined and completely context-dependent.

    4. Surely we are mostly using statistical methods for inductive inference. In that case the "when certainty holds" is an irrelevancy. Deductive inference (where certainty holds) needs no statistics and yields non-probabilistic outcomes. The example seems to be mixing them together in an inappropriate manner.

      You should remember that all statistical analyses take place within a statistical model of some sort. That model is an important context. Any relevant context outside the model is important for the step between statistical analysis and scientific inference.

      I do not disagree with the notion that estimation of effect sizes is more generally useful than a P-value. However, I note that the two are not mutually exclusive and I assert that a P-value helps the investigator to know how strange the result would be if there was really nothing going on. That is sometimes useful.

    5. To me, how strange the results would be if nothing is going on is useful only as a last-ditch method, when Bayesian results are not available or we are in a big hurry. "How strange" is context-dependent, not usually well defined, and is not a real evidentiary measure. Also, formally speaking, the p-value is not a measure of how strange our results are but the chance of results stranger than ours if nothing is going on (and if 2-tailed, it is not even that simple). This is a fine distinction, but it helps to point out limitations in the overall approach.

  3. First, let me edit my comment after the fact: I should have written, "If I tell you that this person is a member of Congress." Second, I agree with your conclusion insofar as H_0 should involve an introduction of randomness, whereas in practice there is often no such randomness (as in observational studies). That does make it hard to apply frequentist theory.

    Replies
    1. Hi.

      I'm trying to follow your logic but I don't see how...

      "This person is selected (uniformly) at random from among all Americans."

      is the (null) hypothesis in this case.

      Could you please elaborate some more?

      Thanks.

    2. Hi, Martin.

      I'm not sure what your question is, but I'll respond to what I think it may be.

      The example started with "If a person is an American, then he is probably not a member of Congress." Note that it uses the word "probably". What does that mean? I am making it explicit by rephrasing this as "If a person is selected (uniformly) at random from among all Americans, then s/he is probably not a member of Congress." In fact, the probability would be the number of Americans who are not members of Congress divided by the total number of Americans.

      Does that help?

      --Russ

    3. It is very clear now. Thank you.

  4. Could you also post a comment on situations where NHST might be adequate? I have heard statisticians say the following: Bayesian methods are better when you have relatively little data but a lot of prior knowledge. When you have a lot of data, there isn't much gain in switching to a Bayesian approach.

    PS I only started using Bayesian approaches in my own work once I realized that p-values were not useful for anything I needed to do.

  5. This is a common misconception about Bayesian methods. When data are inadequate but prior information is available, Bayesian methods may be necessary. When there is no prior information, Bayesian approaches work just as well as frequentist methods in terms of precision and power, and they still have an interpretation advantage. The only situation I can think of where NHST might be OK is an "existence experiment," e.g., trying to gather evidence that extrasensory perception exists; this involves showing, for example, that one can guess which card I'm holding better than chance. But in the vast majority of situations I want to quantify evidence for effects, and Bayesian posterior probabilities are best at that. For example, what is the probability that a drug lowers blood pressure more than control therapy? What is the probability that it beats control by more than 3 mmHg?
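    A minimal sketch of the kind of calculation I mean, with made-up numbers, a flat prior, and a normal approximation to the posterior (so the posterior for the drug-minus-control difference is centered at the observed difference with the usual standard error):

    ```python
    # Sketch only: hypothetical numbers, a flat prior, and a normal
    # approximation to the posterior for delta = extra reduction in
    # systolic blood pressure (mmHg), drug minus control.
    from scipy.stats import norm

    diff = 4.2   # observed mean difference in SBP reduction, mmHg (made up)
    se   = 1.8   # standard error of that difference (made up)

    posterior = norm(loc=diff, scale=se)    # approximate posterior for delta

    p_any_benefit = 1 - posterior.cdf(0)    # P(delta > 0 | data)
    p_over_3mmHg  = 1 - posterior.cdf(3)    # P(delta > 3 | data)

    print(f"P(drug beats control | data)            = {p_any_benefit:.3f}")
    print(f"P(drug beats control by > 3 mmHg | data) = {p_over_3mmHg:.3f}")
    ```

    These are direct statements about the effect, in the units clinicians care about.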

  6. I basically share the viewpoint that hypothesis tests may not deliver what would be desirable. I wonder what can be done from a practical perspective to convince scientists to stop using them. You mention in your bio that you are also working with the office of biostatistics at FDA. I would be curious whether you have had a chance to address the NHST issue with them and, if so, what their thoughts were.

    Replies
    1. I'll have more to report about that in the next year. I am developing material to motivate the use of Bayesian methods and to show how the indirect evidence provided by NHST and p-values is not what clinicians think it is and does not provide the needed protection.

  7. Would be glad if I could help in this matter.

    Replies
    1. Coming from you, that's wonderful, Gerd. Lots of help is needed, especially educational strategies, workshops, papers, clinical examples, open-source clinical trial data that can be re-analyzed the Bayesian way, etc. My initial attack is showing that type I errors were a poor choice of what to emphasize all along.

  8. We know by Bayes rule that P(H|D) = P(D|H)*P(H)/P(D). So assuming that P(D) is reasonably high (so the data is "typical"), and given that P(H) <= 1 always, wouldn't knowing P(D|H) basically provide an upper bound on P(H|D)?
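    Spelling out the algebra in that question (the numbers below are my own, for illustration): since P(H) <= 1,

    ```latex
    \Pr(H \mid D) \;=\; \frac{\Pr(D \mid H)\,\Pr(H)}{\Pr(D)}
      \;\le\; \frac{\Pr(D \mid H)}{\Pr(D)} .
    ```

    So P(D|H) bounds P(H|D) only after dividing by P(D). With P(D|H) = 0.04 and P(D) = 0.5 the bound is 0.08; with P(D) = 0.05 it is 0.8 and says almost nothing. And note that a p-value is not P(D|H): it is the probability, under H, of data more extreme than those observed.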

  9. There are several issues with that, including my not believing that the problem is discrete; i.e., I want to put a smooth prior on a parameter associated with H, not a point mass of probability on H.
