Sunday, February 5, 2017

A Litany of Problems With p-values

In my opinion, null hypothesis testing and p-values have done significant harm to science.  The purpose of this note is to catalog the many problems caused by p-values.  As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values.  Now it's time to act.  To create the needed motivation to change, we need to fully describe the depth of the problem.

It is important to note that no statistical paradigm is perfect.  Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults.  This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.

Consider an assertion such as "the coin is fair", "treatment A yields the same blood pressure as treatment B", "B yields lower blood pressure than A", or "B lowers blood pressure by at least 5mmHg more than A."  Consider also a compound assertion such as "A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke."

A. Problems With Conditioning

  1. p-values condition on what is unknown (the assertion of interest; H0) and do not condition on what is known (the data).
  2. This conditioning does not respect the flow of time and information; p-values are backward probabilities.
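
To make A.1 concrete, here is a toy sketch (my own illustration, assuming a single observed z-statistic, a specific point alternative, and a 50/50 prior) contrasting the quantity a p-value conditions on with the quantity most consumers actually want:

```python
# Toy example: one observed z-statistic.  The p-value conditions on H0 and asks
# about the data; Bayes' rule conditions on the data and asks about H0.
# The alternative (effect = 2 z-units) and the 50/50 prior are assumptions.
from scipy.stats import norm

z_obs = 2.0
p_value = 2 * norm.sf(abs(z_obs))             # P(result at least this extreme | H0)

prior_h0 = 0.5
like_h0 = norm.pdf(z_obs, loc=0, scale=1)     # density of the observed data under H0
like_h1 = norm.pdf(z_obs, loc=2, scale=1)     # density under the assumed alternative
post_h0 = like_h0 * prior_h0 / (like_h0 * prior_h0 + like_h1 * (1 - prior_h0))

print(f"P(data this extreme | H0) = {p_value:.3f}")   # about 0.05
print(f"P(H0 | data)              = {post_h0:.3f}")   # about 0.12 under these assumptions
```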

B. Indirectness

  1. Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics.  They are sometimes monotonically related to the evidence we need (e.g., when the prior distribution is flat) but are not properly calibrated for decision making (see the sketch after this list).
  2. p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion.  
  3. As detailed here, the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground.
  4. Because of A, p-values are difficult to interpret and very few non-statisticians get it right.  The best article on misinterpretations I've found is here.
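
As a rough sketch of the calibration point in B.1 (all numbers are assumptions chosen for illustration: 90% of tested hypotheses are true nulls, true effects of 0.5 SD, 50 subjects per arm), one can simulate many two-arm studies and ask how often H0 is true among the results reaching p < 0.05:

```python
# Simulate many two-arm studies; in 90% of them H0 is true (no effect).
# Among studies reaching p < 0.05, count how often H0 was actually true.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_arm, effect, n_studies = 50, 0.5, 10_000
h0_true = rng.random(n_studies) < 0.9          # assumed 90% true nulls

n_sig = n_sig_null = 0
for null in h0_true:
    a = rng.normal(0.0, 1.0, n_per_arm)
    b = rng.normal(0.0 if null else effect, 1.0, n_per_arm)
    if ttest_ind(a, b).pvalue < 0.05:
        n_sig += 1
        n_sig_null += null

print(f"share of 'significant' results where H0 was true: {n_sig_null / n_sig:.2f}")
# Far above 0.05 under these assumptions; the 0.05 threshold is not P(H0 | "significant").
```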

C. Problem Defining the Event Whose Probability is Computed

  1. In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result more extreme than that observed.  Is this the correct point of reference?
  2. How does "more extreme" get defined if there are sequential analyses and multiple endpoints or subgroups?  For sequential analyses, do we consider only the planned analyses, or also analyses that were intended to be run even if they never were?

D. Problems Actually Computing p-values

  1. In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated.  In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0.  And many statisticians do not realize that Fisher's so-called "exact" test is not very accurate in many cases (see the sketch after this list).
  2. Outside of the binomial, exponential, and normal (with equal variances) cases and a few others, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed-effects models).  The more non-quadratic the log-likelihood function, the more problematic this becomes.
  3. One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible.  One example: one can control the false discovery probability (usually incorrectly referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value.
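
As an illustration of D.1, the following sketch uses a made-up 2x2 table (not the ECMO data) to show how readily standard tests disagree about the p-value for the same counts:

```python
# Three commonly used tests give noticeably different p-values for one 2x2 table.
# The counts below are hypothetical.
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

table = np.array([[7, 1],
                  [2, 6]])     # rows = treatment groups, columns = outcome

_, p_fisher = fisher_exact(table, alternative="two-sided")
_, p_yates, _, _ = chi2_contingency(table, correction=True)
_, p_pearson, _, _ = chi2_contingency(table, correction=False)

print(f"Fisher 'exact':               p = {p_fisher:.3f}")
print(f"Chi-square, Yates-corrected:  p = {p_yates:.3f}")
print(f"Chi-square, uncorrected:      p = {p_pearson:.3f}")
```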

E. The Multiplicity Mess

  1. Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments.  A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested.  By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect.
  2. There remains controversy over the choice of 1-tailed vs. 2-tailed tests.  The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment.  But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm.  So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value?
  3. Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments.
  4. Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex.  Scientific flexibility is discouraged.  The p-value for an early data look must be adjusted for future looks.  The p-value at the final data look must be adjusted for the earlier inconsequential looks.  Unblinded sample size re-estimation is another case in point.  If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects.  How can that make any scientific sense?
  5. Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present.
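
A minimal sketch of E.5 (assuming k independent endpoints, all nulls true): the chance that the data produce at least one p < 0.05 grows with k regardless of how many true effects exist, and Bonferroni pulls the family-wise rate back down:

```python
# With all nulls true, P(at least one of k independent tests has p < 0.05) = 1 - 0.95^k.
# The multiplicity comes entirely from the number of chances the data get to be extreme.
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    unadjusted = 1 - (1 - alpha) ** k            # family-wise error, no adjustment
    bonferroni = 1 - (1 - alpha / k) ** k        # family-wise error after Bonferroni
    print(f"k = {k:2d}:  unadjusted = {unadjusted:.3f},  Bonferroni = {bonferroni:.3f}")
```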

F. Problems With Non-Trivial Hypotheses

  1. It is difficult to test non-point hypotheses such as "drug A is similar to drug B".
  2. There is no straightforward way to test compound hypotheses coming from logical unions and intersections. 

G. Inability to Incorporate Context and Other Information

  1. Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value's inability to incorporate context or prior evidence.  A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a "significant" p-value is fairly low. Frequentist inference has a greater risk for getting the direction of an effect wrong (see here for more).
  2. p-values are unable to incorporate outside evidence.  As a converse to 1, strong prior beliefs cannot be handled by p-values, and in some cases this results in a lack of progress.  Nate Silver in The Signal and the Noise beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect.  Even a Bayesian prior set very strongly against the belief that smoking is causal would have been obliterated by the incredibly strong observational data.  Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate.
  3. p-values require subjective input from the producer of the data rather than from the consumer of the data.

H. Problems Interpreting and Acting on "Positive" Findings

  1. With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance).
  2. Statisticians and subject matter researchers (especially the latter) sought a "seal of approval" for their research by naming a cutoff on what should be considered "statistically significant", and a cutoff of p=0.05 is most commonly used.  Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant.  Hypotheses are exchanged if the original H0 is not rejected, subjects are excluded, and, because statistical analysis plans are often not pre-specified (as they are required to be in clinical trials and regulatory activities), researchers and their all-too-accommodating statisticians play with the analysis until something "significant" emerges.
  3. When the p-value is small, researchers act as though the point estimate of the effect is a population value.
  4. When the p-value is small, researchers believe that their conceptual framework has been validated.  

I. Problems Interpreting and Acting on "Negative" Findings

  1. Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means "get more data").

82 comments:

  1. Thanks for the list! I for one didn't realize that Fisher's exact test is not very accurate!

  2. Great post, Frank. Re problem I.1, I think this one has implications for publication bias and the reproducibility crisis: The fact that non-significant p values are regarded as uninformative (rather than evidence against an effect) is part of the reason why studies with non-significant findings aren't published. Some authors and journals might be willing to publish evidence against an hypothesis, but not interested in publishing findings that are uninformative. But when publication is conditional on p < 0.05, a biased literature results.

  3. For I1 I'm not sure B2 is the best argument. Perhaps you might add that because the p-value distribution is flat when the null hypothesis is true a failure to reject the null hypothesis means that every p-value was equally probable to have occurred and therefore the magnitude of the p-value is meaningless.

    Replies
    1. Good point. But when we are trying to bring evidence in favor of H0 we can't assume H0 which is what is required when computing the p-value.

    2. Why in the world would you stick with Fisherian tests when they were replaced by Neyman-Pearson style tests long ago?  I don't care what you call them; there's a statistical alternative.  There are a great many results in science that are null results, and setting upper bounds on discrepancies on the basis of "negative" results couldn't be more common.  Do we really have to keep coming back to pre-1933, before N-P tests were developed, or focus on some very distorted animals from the social sciences (which ironically use power, which is an N-P notion)?
      I don't interpret tests or intervals in a so-called behavioristic fashion, but that doesn't stop one from using proper statistical tests with alternatives.

  4. In phase III confirmatory or pivotal trials in drug development, the sample size is calculated under the Neyman-Pearson decision framework (for example, the effect size as a single alternative hypothesis, sigma as an assumption, and alpha and beta as conditions). The p-value does not matter, just the critical region. As long as we have enough patients, this makes sense to me. What do you think?
    But I'm confused because this approach is also used in trials aimed at adding evidence (the big majority), not at supporting a decision. From my view, and in accordance with reporting guidelines, what matters is measures of uncertainty. Within classical inference, precision trials (designed to achieve some pre-specified width of the 95% CI) are more appropriate. But they are very, very rare...
    What do you think?

    Replies
    1. I have a hard time enjoying either Neyman-Pearson or p-values. They are equally indirect. I think things would be drastically different had Rev. Bayes had a PR machine.

    2. Frank calls error probabilities of methods "indirect" because he assumes (without argument) that what's "directly" wanted is a posterior probability of a hypothesis or model.

      That has not been shown, or even argued for.

      On the other hand, a falsification of claim H in science, be it deductive or statistical, takes place by being able to produce results that could not have been brought about if H were correct (or approximately correct), in the respect tested. All testing turns on being able to characterize the capabilities of methods (e.g., their ability to have discerned a flaw if present). In the land of formal testing, this characterization is by means of probabilistic properties of the methods (they're given by the relevant sampling distribution). That's what you directly need for reliable inference. So you should not assume a posterior probability (of any sort you can obtain) is "direct", whereas an error probability is "indirect", unless you're prepared to justify this. I've never seen it done. Even where one might speak of assigning a probability to a hypothesis, I deny we want highly probable ones, as they would be the least informative.

    3. I’d agree but with a caveat. The scientific benchmark for what is discrepant should not change. We can use statistical tools to assess if that benchmark is achieved or not, but we should not be using statistical tools to set the benchmark, which is what hypothesis testing effectively does. The problem is that the scientific force of the discrepancy changes (it depends on standard errors), so we can end up with statistically significant results that are not actually scientifically discrepant. Personally, this is why I prefer other approaches (e.g., likelihood, Bayesian) that respect the original scale of the data upfront.

  5. This is utterly unfair to p-values. Yes, there are misinterpretations, but that applies to most measures. The p-value is one possible way to quantify a statistical statement, and it helps in decision making (and no decision should be taken on p-values alone). Name the other quantitative measures, and then evaluate the relative benefits!

    Replies
    1. This was not an attempt to be fair but rather was intended to catalog the many serious problems with p-values. Future articles will deal with Bayesian approaches and the fact that posterior probabilities are much less likely to be misinterpreted. And as I said at the start of the article, other paradigms don't have to be perfect to be much, much better than p-values.

  6. I like the catalog of problems. This will help teach my collaborators (and biostatistics students) about problems and misinterpretation of p-values. I can think of two additional issues.
    1. Many physicians and biologists mistakenly treat the 0.05 threshold as a "forced decision"; the null hypothesis is true or it is false, and they act accordingly. I believe Neyman-Pearson also advocated "to act as though" the null or alternative hypothesis were true or false. I think the forced dichotomy, while necessary for decision problems (and for misplaced psychological comfort), is profoundly unscientific in inference. We need space to accommodate uncertainty in inferential results.

    BTW - I think any threshold based inference procedure will suffer the same plight.

    2. p-values are a nonlinear transformation of test statistics, and are not reproducible in repeated studies (even when the null hypothesis is false).

    Replies
    1. p-values are often not reproducible because they confound the effect size and the level of precision. Also, two equal p-values don't imply the same amount of "evidence", so it's not clear why one would care or expect them to replicate. The thing to replicate is the effect size, not the p-value.

  7. No, N-P did not advocate that you "act as if a hypothesis is true/false" when you accept/reject it. This is another huge blunder in depicting N-P by people who have never read N-P, which is essentially almost everyone who writes about them. On my blog you will find several posts on Neyman and N-P which also link to articles, so you won't have to continue to repeat false claims about them. Under the "behavioristic" construal, where accept/reject are "acts", the acts can be anything, e.g., infer there's some evidence for a discrepancy from H0, check your assumptions, withdraw the paper, sell a bag of bolts. One could never identify these "acts" with taking a standpoint on the truth of a claim. When in behavioristic land, there's no such thing. It has its problems, but it can't have the one you mention.
    Please disregard almost everything you've read on N-P by critics. Neyman has some extremely accessible non-technical papers I can point you to on my blog; see also my discussion in Error and the Growth of Experimental Knowledge (Mayo 1996), which can be found in copy form on the publications page off my blog. Good luck.

  8. Interesting comments Dean and Deborah. I can see both sides because so many researchers fail to deeply understand the issues and the net result is all kinds of strange behavior. In my opinion an approach that creates endless debate and requires continual deep thinking has its problems. I am drawn to the Bayesian "what should I believe now".

    Replies
    1. Lacking time to complete my comment earlier, let me add this query: please show me where the Bayesian account tells you what "you should believe now"? That would presumably mean what you are warranted in believing, what has passed muster on grounds of evidence, yes? Else it's just saying: given that you subjectively believe so and so, and given that the data and likelihoods are assumed, you should believe such and such--assuming you update beliefs by Bayes theorem (which many Bayesians do not). How is that relevant for finding out what you ought to believe in the sense of what is the case? The belief priors are often elicited through betting assessments and what not. Not only is this a bad idea in all but "personal decision" settings, it's found so impossibly difficult to carry out that nearly all Bayesians appeal instead to default priors of some sort. But default priors may not even be beliefs--they may not even be probabilities (often being improper). So how do you get the desired "what you ought to believe"? It seems there's a gentleman's agreement not to really call out whether there's substance to promises to tell you what you ought to believe.

      The account that really requires extraordinarily deep thinking is the subjective Bayesian one. A person has to think very, very carefully about all possible outcomes in order to arrive at an appropriate prior that they can live with. By the time she gets started, we've got a new model, with various bounds on parameters that do not go away.

      But there are few true-blue subjective Bayesians now (Kadane is one). Ask the others what they mean by their posteriors, and which method they endorse. They won't agree. Papering over the very real difficulty in figuring out what's even being talked about in Bayesian country does not mean anyone really knows. (Empirical Bayes is different.)

    2. The mandatory hypocrisy in the N-P "accepting a hypothesis does not imply that you believe in it; it only means that you act as if it were true. [Gigerenzer]" certainly isn't very attractive.

    3. There's no hypocrisy, and N-P do not tell you to act as if H were true. So I don't get your criticism.
      The output is a specific inference, and declaring it is, or can be seen as, an "act". There are two reasons they ever spoke of "acts" or decisions rather than inductive inferences, even though in practice they speak of inferences (qualified of course by the error probabilistic warrant): (1) to contrast with the going tendency at the time (and still) to regard inductive inference as a form of probabilism. (Fisher's confusion about fiducial probability at around the time was also part of reason (1).) (2) There are many who hold that the only kind of logic there is is deductive, e.g., Neyman, Popperians. Now N-P methods deliver outputs that go beyond their premises. (This is unlike that form of Bayesian who simply reports the posterior given the prior, or a comparison of likelihoods.) That is, N-P methods are "ampliative" or inductive in the strict sense. N-P methods do not, however, speak of induction as probabilism or as updating degrees of belief. So they needed a new term. We have inductive (ampliative) outputs, and it's common to call them acts or decisions. So Neyman said one day, if there's anything we inductively change with evidence it's our behavior. So the behavioristic interpretation was born. E. Pearson wasn't keen on it.
      Fisher was quite right in telling Neyman that he (Neyman) was merely expressing a preferred way of talking (Fisher 1955). Indeed, and then Neyman showed Fisher he talked that way as well. So did Popper. An output might be "state theta is within [a,b], along with the confidence level". Stating is an act. This rigidity, Neyman made clear, was to distinguish what he was doing from the use of probability to represent psychological strength of some sort (which Fisher abhorred as well).
      Accepting a claim as warranted because it has passed stringent tests is very different from strongly believing it. N-P-F were right.

    4. Okay - and it's easy to see that there's no more hypocrisy in the N-P robot's behaviors than there is in the Jaynes robot's beliefs - but that doesn't make the emulation of its behavior any more attractive. And its makers' ideas about the proper interpretation and use of probability are profoundly ill-conceived in my view. As Popper discovered, abhorring subjectivity eventually leads to a wholly unnecessary* confrontation with common sense.

      * "The EPR experiment pin-points the need for subjectivity in quantum probability; the same need in classical probability has been known and used since Bayes." https://web.archive.org/web/20151117174141/http://www.mth.kcl.ac.uk/~streater/EPR.html

    5. Silly to refer to the EPR experiment as indicating anything about the role of probability in science, but no time to say more than: read my new book (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP) when it's out this year. It has a chapter on what it is about objectivity that we won't give up, and shouldn't. While I'm at it, here's a paper I wrote with David Cox that has objectivity in its title:
      Cox, D. R. and Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference", in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press: 276-304.
      http://www.phil.vt.edu/dmayo/personal_website/ch%207%20cox%20&%20mayo.pdf

    6. The EPR experiment illustrates the necessity of a subjective interpretation of probability in science. It, and the probability theory of noncommuting observables more generally, indicates that there is a hard limit to the role of other interpretations in science. What on earth is silly about that? Given its title it's hardly likely that there's anything in your book (and there isn't in that paper) which even addresses it, let alone shows it to be silly.

  9. It doesn't require deep thinking to simply do it right. It's exactly the reasoning you'd use every day if you didn't want to be fooled. The confusion has been caused by criticisms from those who have assumed we had to have a posterior (never mind what in the world it means, and no one has been able to say), and by questionable research practices, cherry-picking and the like (which don't go away but are masked under Bayesianism). If you're saying, "gee, with Bayesian inference I don't have to think", then you're better off not using statistics.
    That's the opposite attitude that one should have in science.

    Replies
    1. The first sentence is a little ironic, no? Many people have thought long and hard about these issues and we’ve been debating them for over a century. And there are plenty of examples that don’t assume a Bayesian solution that make hypothesis tests look downright insane (Pratt’s voltage detector example, in 1961 I think, is one).

  10. A bit harsh but I get some of your points. I remain unswayed about posterior probabilities masking the problem (or at least masking it as much as other methods do) but reproducible research, cherry picking, and the rest are extremely important. Posterior probabilities will be misleading primarily when the Bayesian got the model wrong, e.g., failed to take into account a source of uncertainty or used a prior that disagreed with the prior used by the judge of the research. If one's premise is that extraordinary statements require extraordinary evidence, especially where the claim comes from cherry picking, this would need to be reflected in a skeptical prior for the posterior to help.

    Replies
    1. It's odd to see science as trying to match the beliefs of the judge of research; science is about finding things out.

    2. I wish that were true more often. I sometimes work in a regulatory environment where tough decisions (e.g., about whether a drug should appear on the market) have to be made and the judge is not the researcher but regulatory authorities and the experts they call upon.

  11. Sorry, I guess I was going too fast because I'm rushing to finish a book (on these very issues). These issues cannot be discussed in a quick comment; I have 5.5 years of blogging and many years of published writings. Your applications may be those with clear priors (subjective? default? frequentist?). If you say you test your model assumptions, and if you ever falsify, I say you need statistical tests. Cherry picking and other biasing selection effects may not change your prior, because you believe the hypothesis in question--what they change, for the error statistician, is how well you've tested your claim. These are very different assessments; I'm interested in the latter. It doesn't follow that there's no place for subjective beliefs.

    Replies
    1. Nothing about this process would tempt me to use a statistical test (other than what usually happens: I run out of time to do a better analysis). And to me one of the best Bayesian approaches is to use a skeptical prior so that a skeptic can be convinced.

    2. Note, by the way, Senn remarking that even skeptical priors don't lead to the lump and smear priors used so often these days by Bayes Factor folk. https://errorstatistics.com/2015/05/09/stephen-senn-double-jeopardy-judge-jeffreys-upholds-the-law-guest-post/

      If you don't test your assumptions, then what you purport to be offering your "judge" is unlikely to be what you're really offering.

    3. I'm not fond of Bayes factors or lump and smear priors. I think you've hit the key issue - getting the model right. The frequentist solution is ultimately disappointing because most frequentists make it a hard choice and ignore model uncertainty, resulting in standard errors that are too small and confidence interval coverage below the nominal level. Bayesians' models fit no worse, and when they entertain non-fit, as Box did so eloquently, they allow for the uncertainty between, say, two model choices. E.g., there can be a prior for non-normality, resulting in somewhat elongated credible intervals that are more correct. Another avenue worth exploring is testing assumptions for 'comfort' but not changing the model, so as to avoid model uncertainty effects. On a separate issue, I've found that all statisticians are a bit arbitrary as to what constitutes goodness of fit; late in the day everything seems to fit OK.

    4. Tests of model assumptions are all frequentist, to my knowledge. That's why Box declared ecumenism. He said that if we had to rely on assigning priors to the different models we'd never discover anything new in life. Discovery, Box said, required statistical significance tests (I can send you the quote if you want). When Bayesians do model checking, they appear to all turn to Bayesian p-values (e.g., Gelman, J. Berger) of one sort or another. I'm not claiming to understand them, really. But there are deeper issues: what are you getting "wrong" when you find your Bayesian model wrong or inadequate? Lindley says he doesn't know what wrong means. The default Bayesian says the prior is just an undefined means for getting a posterior (and there are many different schools). Does it mean it gives an incorrect representation of some specific aspect of the data generation method or question being modelled? The answer would be yes for a frequentist error statistician, but what is it for a Bayesian? Is it getting your belief wrong, or is it wrong about your belief in the data generating mechanism? Or does it merely lead to a frequentist prediction that is rejectable by a significance test or graphical analysis? How do you distinguish an error due to the priors you assigned the parameters from an error in the model itself? You may say it's all part of the model, but the fact is, you can nip and tuck the prior when the real problem is some violation of the statistical model assumptions.
      Finally, you'd typically have only one (non-random) selection from the universe of prior distributions for theta. Of course, if you had a frequentist prior, that would be different.

    5. Great food for thought. The little I've seen of Bayesian posterior or predictive model checking is very impressive - more comprehensive model assessment, incorporating sources of uncertainty, than is available in the frequentist setting. One example is that in developing risk models we are usually content with showing good calibration of the point estimates of per-patient risks. Bayesian model assessment involves the entire posterior distribution of each prediction. I need to get more experience with that way of thinking.

  12. What then is your favored Bayes way, a default?
    What's the posterior distribution of each prediction? Gelman sees posterior checks as error-statistical http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf

  13. Thanks for the reference - I look forward to reading it. In terms of a favored Bayesian way, I am a disciple of David Spiegelhalter, heavily influenced by his problem-solving approach as nicely described in his 1994 JRSS A paper, which is focused on clinical trials and highly relevant to regulatory decision making.

    Replies
    1. Hi Frank,
      Are you referring to the "Bayesian Approaches to Randomized Trials" paper (JRSS A, 157, part 3, 357-416)? In section 6.3 Spiegelhalter recommends using Box's generalized p-value to check prior-data compatibility. If I have a point prior (null), would that not simplify to Fisher's p-value?

    2. That's the paper. I haven't looked into the Box approach. I would use a prior that is eventually overwhelmed by data, get agreement on it, and not often revisit the prior.

  14. Hi Frank,

    Interesting post. Do you see likelihood methods as subject to A, B and/or G?

    Meaning eg Edwards-style likelihood (as in his book 'Likelihood') or Fisherian-style likelihood (as in eg Pawitan's book 'In all likelihood').

    Replies
    1. So I'll jump in as a Likelihoodist. Likelihood methods for measuring statistical evidence have a very precise framework. The basic axiom is this: the hypothesis that does a better job predicting the observed data is better supported by the data.

      For (A), I don't see the "reverse" conditioning here as problematic; this is the natural way to compare predictions.

      For (B), no issues here for the likelihood approach. The predictions from each hypothesis are directly compared and we just report which hypotheses did the better job of predicting the observed events. No need to rely on proof by contradiction, which Frank correctly points out is problematic when the direct implication is replaced by a probabilistic tendency.

      For (G): By design, likelihood methods report which hypotheses are better supported than others given the observed data and model. You need a model under which to specify the predictions of each hypothesis, and that model is often prescribed by context. But I would guess this is not what Frank is referring to, if only because everyone generally agrees on the form of the likelihood function. Additional information, such as data from a previous study, would simply be combined in the likelihood (e.g., if the studies are independent, one could multiply the likelihoods). Information that represents personal belief or some other hunch would best be incorporated in the Bayesian framework. Likelihoodists want to know 'what the data themselves say', not 'what the data say after I add in prior information'.

    2. I'm glad we can agree on A. I don't think this is a satisfactory argument against p-values, and neither is it satisfactory against likelihood.

      We can leave the other argument for another day. For now though I'll note that while I have quite a lot of sympathy for the 'pure' likelihood approach and/or evidential approaches, I don't find your axiom satisfactory.

      There is of course a basic axiom behind p-values - Fisher's disjunction. But I don't find this fully satisfactory either.

  15. I believe this would be Edwards-Royall. I think that likelihood has a bit of a problem with G but not with A or B.

    Replies
    1. Would you then be OK with inference based on a p-value function defined analogously to a likelihood function? That is,

      PF(theta;y0) := prob(y>y0; theta)

      considered as a function of theta for y0 fixed. Is this still subject to A or B?

      (omaclaren)
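
      A minimal sketch of the p-value function defined above, for a single observation from a Normal(theta, 1) model (an illustration only; the grid and observed value are arbitrary):

```python
# p-value function PF(theta; y0) = P(Y > y0; theta) for Y ~ Normal(theta, 1),
# evaluated over a grid of theta values with the observed y0 held fixed.
import numpy as np
from scipy.stats import norm

def p_value_function(theta, y0, sigma=1.0):
    """P(Y > y0) when Y ~ Normal(theta, sigma), viewed as a function of theta."""
    return norm.sf(y0, loc=theta, scale=sigma)

y0 = 1.5
for theta in np.linspace(-2, 4, 7):
    print(f"theta = {theta:+.1f}:  PF = {p_value_function(theta, y0):.3f}")
# PF rises monotonically in theta; cutting it at 0.025 and 0.975 recovers the
# usual 95% confidence limits for theta.
```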

    2. That's not satisfying. High-level view: Aside from non-study information, the p-value is monotonically related to what I need, but it is not calibrated to be the metric I need.

    3. Fair enough.

      But would you agree it is no more or less subject to A and B than likelihood? Or do you also disagree with this?

    4. Not sure. Are you trying to get at Fisher's fiducial method? I prefer having a full model for inference, but I do like the likelihood school of inference because it respects the likelihood principle. For example, inference is independent of a stopping rule. How do you handle multiplicity/sequential stopping rules for a p-value function?

    5. Well, it is related to Fisher's fiducial approach - see eg the reminiscences of Fraser at the end here http://www.utstat.toronto.edu/dfraser/documents/ARST04-Fraser-copyedited.pdf - but more generally is just standard Fisherian-style confidence theory as used by eg Cox, Fraser etc and as opposed to Neymanian confidence and decisions.

      I assume the p-value function will depend on stopping rules etc as generally understood - ie they violate the strong likelihood principle.

      But, my more general point - rather than advocating for likelihood, Bayes or confidence theory - is that points A and B seem to either (logically) apply to both likelihood and confidence theory/pvalue functions or to neither.

    6. Not sure. I'm going to invite a likelihoodist to join the conversation.

    7. And...I would not be 'ok' with "inference based on the p-value function" largely because there is no axiom to support its use. The axiom would be something like "A set of data supports the hypotheses that do a better job at predicting the data and data more extreme". I don't find this compelling because of the inclusion of "more extreme", which means different things to different people. The NP hypothesis testing framework is clear that "more extreme" means large likelihood ratios. In contrast, significance testing often defines "more extreme" as further away in hypothesis space (tail areas). These two definitions are not the same, which confuses matters. Large likelihood ratios can lead to instances where the rejection regions are near the null hypothesis and not in the tails (e.g., comparing means of two normal models with different variances).

  16. Anyone who ignores the stopping rule can erroneously declare significance with probability 1, and with the corresponding Bayesian priors can erroneously leave the true parameter value out of the Bayesian HPD interval. See stopping rules on my blog.
    Anyone who obeys the LP cannot use error probabilities. I think only subjective Bayesians and staunch likelihoodists are still prepared to endorse that. Just what today's practice needs. For that matter, let replicationists try and try again until they can get a stat sig effect; then the replication rate will be 100%.

    I have, incidentally, disproved alleged proofs of the LP.

    Replies
    1. When the analyst uses the same prior as the judge, posterior probabilities are perfectly calibrated independent of the stopping rule.

    2. Overly broad generalizations can lead to confusion. So let's be specific and consider the points. The first clause is true for routine usage of classical hypothesis and significance tests when the sample size is allowed to grow forever. I have yet to see a study where this was possible, so the force of this point is diminished, I think. If we take the case where the sample size is finite, the claim is false. The probability might be high in some cases, but it will not be one, and certain cases can be constructed so the probability is not so high. Regardless, this is an excellent reason to avoid using p-values in my opinion. Why not use a tool that does not have this property (e.g., a likelihood ratio)?

    3. I think it is important to understand why this happens. When the null hypothesis is true, the likelihood masses right on top of the null value. Every now and then, the tails of the likelihood shift by an infinitesimal amount, and this tiny, tiny shift causes the classical hypothesis test to reject because the benchmark for rejecting is measured in standard errors (which are rapidly shrinking to zero) and not standard deviations. It leaves us in the odd position of claiming to reject the null hypothesis when the data support hypotheses that are arbitrarily close to the null. If instead we decided to count only the statistical rejections that also supported clinically meaningful differences, the resulting probability would not approach one or even be all that high.

    4. The third sentence is a nice example of a common misinterpretation of the LP. The LP only says that the stopping rule is irrelevant for the measurement of the strength of evidence in the data. It does not say the stopping rule is irrelevant for everything. Confusion reigns because we often fail to distinguish between the measure of the strength of evidence and the probability that the evidence will be misleading.

      Likelihoodists follow the LP and use error probabilities. When we design a study, we compute its frequency properties. If we plan to look at the data many times, the probability of observing misleading evidence gets inflated (lucky for us, however, that inflation is bounded above). These frequency properties help us choose between designs. Modern Bayesians do the same.

    5. The key is that once the data are observed, we compute the likelihood ratio to measure the evidence. This, of course, does not depend on the stopping rule. And if we are concerned that the data are likely to be misleading because we looked a lot during our study, we would compute the probability that our observed data are mistaken. The point is that the measure of the evidence (the LR) and the probability that the observed data are mistaken (written as P(H_0 | LR > k)) are not the same as the probability that the study design will generate misleading evidence (written as P(LR > k | H_0)). Error probabilities have an important role in statistics; they just don't represent the strength of the evidence in the data or the probability that the observed data are misleading.
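
      A small sketch of this distinction (a toy simple-versus-simple setup with assumed means 0 and 1, n = 10 observations, LR threshold k = 8, and an assumed 90% of hypotheses truly null), estimating both probabilities by simulation:

```python
# Estimate P(LR > k | H0), a design property, and P(H0 | LR > k), how often the
# observed strong evidence is misleading, in a toy two-hypothesis problem.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n, k, n_sims = 10, 8.0, 20_000
h0_true = rng.random(n_sims) < 0.9             # assumed 90% of hypotheses are true nulls

n_h0 = n_exceed = n_both = 0
for null in h0_true:
    x = rng.normal(0.0 if null else 1.0, 1.0, n)
    log_lr = norm.logpdf(x, 1, 1).sum() - norm.logpdf(x, 0, 1).sum()   # log LR of H1 vs H0
    exceed = log_lr > np.log(k)
    n_h0 += null
    n_exceed += exceed
    n_both += (null and exceed)

print(f"P(LR > k | H0) ~ {n_both / n_h0:.3f}")       # error probability of the design
print(f"P(H0 | LR > k) ~ {n_both / n_exceed:.3f}")   # how often the observed evidence misleads
```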

    6. No time but to register: "absolutely absurd", though grist for my mills if Blume is for real. See my blog for details: errorstatistics.com

    7. Those mills are going to be running overtime. I simulated a string of 100,000 standard normal deviates and then computed the running z-statistic for testing the null hypothesis that the mean is zero (assuming the variance is known). In 10,000 simulations, only 74% rejected at some point (not bad for 100,000 looks at the data). The 25th, 50th, and 75th percentiles of the stopping time were 10, 98, and 1533. The mean stopping time was ~5200. That means, for example, that 25% of the rejections occurred when the sample mean was around 0.05 (=1.96/sqrt(1533)) or smaller. That's 5% of a standard deviation. In practice, an observed difference that small is often a rounding error.

      The point is not that these are desirable properties, but rather that when these tests make a Type I Error, they often support hypotheses very close to the null. So close, in fact, that they may be practically indistinguishable from the null. If we only counted the rejections where the difference was at least 25% of a standard deviation, the rejection probability drops to about 20%. Not great, but not terrible for 100,000 looks at the data.
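
      A rough sketch of the kind of simulation described above (scaled down so it runs quickly; the exact percentages and quartiles will differ from those quoted):

```python
# Running z-test on accumulating N(0,1) data with the null actually true.
# Record whether the test ever rejects at the 0.05 level and, if so, when.
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_max, z_crit = 2_000, 10_000, 1.96     # smaller than the quoted 10,000 x 100,000
stop_times = []

for _ in range(n_sims):
    x = rng.standard_normal(n_max)
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)               # z-statistic after each observation
    hits = np.nonzero(np.abs(z) > z_crit)[0]
    if hits.size:
        stop_times.append(hits[0] + 1)

print(f"ever rejected: {len(stop_times) / n_sims:.2f}")
print("stopping-time quartiles:", np.percentile(stop_times, [25, 50, 75]).round(0))
```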

    8. As for details on the evidential framework I alluded to above, a reference is: Blume JD. Likelihood and its evidential framework. In: Dov M. Gabbay and John Woods, editors, Handbook of The Philosophy of Science: Philosophy of Statistics. San Diego: North Holland, 2011, pp. 493-511.

      Any system purporting to measure evidence for or against a hypothesis ought to be subjected to the same scrutiny. This entails identifying three things: (1) the metric that will be used for measuring evidence, (2) the probability of observing a misleading metric under certain experimental conditions (i.e., error probabilities), and (3) the probability that an observed metric is indeed misleading (e.g., this would be a false discovery rate). Systems for measuring evidence can then be compared on the basis of these three criteria and perhaps their axiomatic justification. These three concepts are distinct, so a single mathematical quantity like the tail area probability can't possibly represent all three. The above paper uses the likelihood paradigm to illustrate the point.

    9. For a disproof of the alleged disproof of the alleged proof of the likelihood principle see http://gandenberger.org/research/

      Gandenberger G. “A New Proof of the Likelihood Principle.” The British Journal for the Philosophy of Science 66, 3 (2015): 475-503.

  17. Anil Potti and Joseph Nevins used Bayesian methods recommended by Mike West during the Duke genomics fiasco a few years back. They overfitted models and committed many other analysis gaffes. Bayesian modelers are not immune from running afoul of proper interpretation of modeling findings to shed light on scientific phenomena.

    Proper handling of statistics and interpretation of findings is needed in any statistical exercise, Bayesian, frequentist or otherwise.

    Given the wealth of great advice in Regression Modeling Strategies and other writings, I am taken aback to see Frank Harrell blame a statistic when the problem is people misinterpreting a statistic.

    A p-value is just a statistic, with certain knowable distributional properties under this and that condition. A Bayesian Highest Posterior Density region can be improperly obtained and misinterpreted just as readily.



    Replies
    1. Ask yourself how many investigative journalism articles mentioned Bayes when writing about the Duke fiasco. I think the answer is zero, because Bayesian modeling had nothing at all to do with the problem. Your comments are most curious. I can't think of any analytical method that cannot be abused, even just using descriptive statistics. You might also look at why Mike West resigned from the collaboration early on.

    2. Following McKinney on this, it was their confidence that their prior freed them from having to have genuine hold-out data that they thought made it permissible. McKinney knows more about the specifics. The blatant ignoring of unwelcome results was brought out by the whistleblowers. Of course, if you're not in the business of ensuring error control, actions that alter error probabilities will be irrelevant.


    3. This is exactly my point. Whether you use Bayesian methods, as the Duke group did, or you use frequentist methods, as Baggerly and Coombes did in reviewing several of the Duke analyses, you need to exercise discipline, using many ideas some of which are very ably described in Regression Modeling Strategies.

      If you want to start using more Bayesian based methods, have at it. I look forward to seeing your examples.

      What I find disingenuous and concerning are your statements: "Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference." Where do you show this quantification, that Bayesian and likelihood paradigms solve the greatest number of real problems and have the fewest number of faults? That certainly wasn't apparent in the Duke fiasco.

      I found some course notes on line, “An Introduction to Bayesian Methods with Clinical Applications” from July 8, 1998, by Frank Harrell and Mario Peruggia. That’s nearly 20 years ago. Why does it take so long to bring Bayesian methods into practice? If you haven’t been able to do it in nearly 20 years, how are other mere mortals to do it?





  18. This comment has been removed by the author.

  19. The statistical paradigm had almost nothing to do with this. Ask the forensic biostatisticians Baggerly and Coombes, who uncovered the whole problem. In their wonderful paper (Annals of Applied Statistics, 2009) neither "Bayes" nor "prior" appears.

    Replies

    1. Precisely. This is why I do not understand this odd stance you have adopted. Bayesian methods require the same discipline in handling analyses and interpreting findings as any other paradigm.

      When you and Gelman and others lead the way in demonstrating Bayesian analytical methods, I predict other blog postings such as "The litany of problems with Bayes factors" after groups with no statistical discipline misinterpret and mangle findings from such analyses.

      But for those of us still striving to find truth in data, I do genuinely look forward to seeing more Bayesian-based approaches in future revisions of Regression Modeling Strategies, along with the attendant steps needed to interpret findings in a disciplined manner.

  20. I take offense at your choice of words. The fact that I have not been a Bayesian my whole career and hence do not have a plethora of Bayesian examples doesn't hold me back from seeking better approaches. You seem to be threatened by the amount of work we have ahead of us to improve the situation. I am not, which is one reason I'm working closely with FDA on this very problem. You can do unscientific, non-reproducible research using any paradigm. I want to do good science and have results that have sensible interpretations.

    Replies

    1. I apologize for my choice of words. I am not threatened by the amount of work we have ahead of us to improve the situation; my entire career has been spent working to improve the situation. That's why I own two copies of your RMS book and regularly use your software. I am threatened when an avalanche of pop-culture blog posts appears, inappropriately attributing fault to a statistic or paradigm when the fault lies with people mishandling and misinterpreting that statistic or paradigm.

      As you say, you can do unscientific, non-reproducible research using any paradigm and the statistics thereof. As you state in the introduction above, "As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress." I hope that the title and opening sentence, and further general discussion of this topic, progress and become

      “A Litany of Problems With Misinterpretation of p-values”

      “In my opinion, inappropriate handling and interpretation of null hypothesis testing and p-values have done significant harm to science.”

    2. I don't buy the argument that the user is the problem. These procedures have been in use for almost a century by many disciplines. Is the claim really that all the problems, counterexamples and unintuitiveness that arise are due to user error? Frank's list might seem daunting, and there are perhaps quibbles to be made, but the theme is on target: hypothesis and significance test procedures are flawed when it comes to measuring and communicating the strength of evidence in a given body of observations. I would not claim that they are useless (although a strong case can be made when only the p-value is reported), but rather that they have serious shortcomings that require attention.

  21. I appreciate that, Steven, and only take issue with your "find fault" sentence. True, more fault lies with unscientific work than with statistical paradigms, but there are major problems with p-values, and p-values lead to many downstream problems, as I've tried to catalog. The paradigm really matters. Not all of the fault lies with practitioners. This becomes more clear for those like me (a follower of David Spiegelhalter) who embrace Bayesian posterior probabilities and favor skeptical priors. Once you do the right simulations or grasp the theory (the former being easier for me) you'll see things such as the fact that multiplicity comes from the chances you give data to be more extreme, not the chances you give assertions to be true. And the fact that frequentist thinking usually leads to fixed sample size designs turns out to be a huge issue in experimental work.

    You're giving me the idea to post a separate article on my journey and how this relates to RMS, which I hope to complete in the next few days.

  22. Of course it comes from increasing the chances you give for the data to be more extreme, but the relevance of such outcomes that didn't occur is just what's denied by Bayesians who endorse the Likelihood Principle. Sequential trials were advocated by frequentists long ago (Armitage). Armitage also argued that optional stopping results in posteriors being wrong with high probability. But Savage switched to a simple point-against-point hypothesis to defend the LP. That is still going on today as the latest reforms champion the irrelevance of optional stopping--to them.

    Aside from the inability to control error probabilities, the key problem with all Bayesian accounts is that they never quite tell us what they're talking about (and they certainly don't agree with each other), except perhaps empirical Bayesians. Is the prior/posterior an expression of degree of belief in various values of parameters? Of how frequently they occur in universes of parameters? As the number of parameters increases, the assessments--generally default priors--move further and further from anything we can get a grip on. They are not representing background information. And if we're going to test them, we'll have to resort to something like significance tests.
    But given what you said earlier, about only caring to match the beliefs of "the judge", the whole business of outsiders critically appraising you may not matter.

    Replies
    1. You'll have a hard time convincing me of the relevance of things that might have happened that didn't. And simulation studies for the frameworks I envision demonstrate that the stopping rule is really irrelevant. The simplest example is a one-sample normal problem with known variance where one tests after each observation is acquired, resulting in n tests for an ultimate sample size n. If you are thinking of a particular issue you might sketch the flow of a simulation that would demonstrate it.
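
      One possible flow for such a simulation (my own sketch, assuming the true means are drawn from the same prior used in the analysis, which is what the calibration claim requires):

```python
# Sequential one-sample normal problem, known variance 1, prior mu ~ N(0, 1).
# After every observation compute P(mu > 0 | data); claim "mu > 0" and stop when
# it exceeds 0.95.  If the true mu values really come from the analysis prior,
# the claims should be correct about 95% of the time, however many looks are taken.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_trials, n_max, threshold = 2_000, 200, 0.95
n_claims = n_correct = 0

for _ in range(n_trials):
    mu = rng.normal(0.0, 1.0)                 # truth drawn from the analysis prior
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.normal(mu, 1.0)
        post_var = 1.0 / (1.0 + n)            # conjugate normal posterior, known data variance
        post_mean = post_var * total
        if norm.cdf(post_mean / np.sqrt(post_var)) > threshold:   # P(mu > 0 | data)
            n_claims += 1
            n_correct += (mu > 0)
            break

print(f"claims of mu > 0: {n_claims}, proportion correct: {n_correct / n_claims:.3f}")
# The proportion stays at (or slightly above) 0.95 despite the continuous monitoring.
```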

    2. Re the concern about error rates outside of formal NP theory: I think this is overblown. Just because NP theory is not used does not imply that all resulting inferences are suspect. Likelihood methods are an excellent example. In the likelihood paradigm, both the Type I and Type II error rates go to zero. In fact, if Neyman and Pearson had chosen to minimize the average error rate (instead of holding one constant), then they would have been likelihoodists, since that solution is given by the Law of Likelihood. From here, one could make a strong argument that likelihoodists have better frequency properties than those afforded by hypothesis testing, solely because it makes little sense to hold the type I error fixed over the sample size. In many cases, this is what causes hypothesis tests to go awry. Bayesian analyses benefit from this behavior as long as the prior does not change too quickly and is relatively smooth.

    3. Just a quick point on something alluded to above. Much of the discordance between the three schools of inference comes from how composite hypotheses are handled. For what it is worth, Neyman and Pearson acknowledged in their 1933 paper that there is no general solution to this problem. Their approach was to take the best supported alternative hypothesis in the alternative space and let that data-chosen alternative represent the alternative hypothesis when they applied their optimal solution for the simple-versus-simple case. It is a bit of a cheat, which they acknowledged, but a decent practical solution. The problem with this solution is that it can break down when the null hypothesis is true, because in that case the best supported alternative ends up being virtually identical to the null, but still regarded as an "alternative". The Bayesian solution, of course, was to average over the alternative space to come up with a new simple hypothesis. Then the two simple hypotheses are compared. This has pros and cons too. So Savage is right on the mark here.

  23. If I can't interest you in learning that someone tried and tried again to achieve a stat sig result (or an HPD interval excluding the true value), even though with high or maximal probability this can be achieved erroneously, then your view of "being cheated" and mine are very different. But I'm glad you stick with this; it's the Bayesians who try to wrangle out of the consequences of accepting the LP that bother me. Please see this blog post:
    https://errorstatistics.com/2014/04/05/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour-2/

    Replies
    1. The solution here is to be specific about what is communicated. If we were to report only the result of the hypothesis test (Reject or Accept), then the false discovery rates would indeed be direct functions of the Type I and Type II error rates. That is, if you only tell me that you rejected the null, that information is more likely misleading if you used a design with a large Type I Error rate. However, if you report the data, or some summary of it, then the above argument no longer holds. The probability that the null hypothesis is true given the data (this is now the false discovery rate if the test rejects) does not depend on the Type I and Type II errors. Why? Because here the likelihood function for the observations depends on the data and the model (and not the sample space), whereas in the first example the likelihood function is for the test result (not the data) and that likelihood depends on a binomial model where the error rates determine the likelihood function. So, really, it's all about the likelihood.

  24. Very well written article and interesting history. I think that consideration of what constitutes cheating is a very useful exercise. It is also useful to back-construct a prior that makes certain things possible or likely. For example, sampling to a foregone conclusion happens when a statistician uses a smooth prior but her critic uses a prior with an absorbing state (point mass) at the null. Bonferroni correction is equivalent to a prior that specifies that the probability that all null hypotheses are true is the same no matter how many hypotheses are tested (a very strange assumption). A Bayesian can cheat by changing the prior after observing data or by improper conditioning, e.g., acquiring more data, finding the cumulative result to be less impressive than it was before, and rolling back the data to analyze only the smaller sample. But choosing a smooth prior before looking at the data, and having the prior at least as skeptical as the critic's implied prior, will result in a stream of posterior probabilities that are well calibrated independent of how aggressive the 'data looks' were. Not only are the posterior probabilities calibrated, but the posterior mean is perfectly calibrated, discounted by the prior more when stopping is very early. The frequentist correction for bias in the sample mean upon early stopping is quite complex. Frequentists tend to be very good at correcting p-values for multiplicity but very bad at correcting point estimates for the same.

  25. First off, thanks.
    The frequentist doesn't assign priors to the hypotheses you mention; it is a fallacy to spoze that a match between numbers (error probabilities and posterior) means the frequentist makes those prior assignments.
    But on cheating, I don't see how you can say "a Bayesian can cheat by changing the prior after observing data". You need a notion of cheating. Error statisticians have one; what's the Bayesian's? Bayesians, by and large, aren't troubled by changing their prior post data, as you can see from this post:
    "Can you change your Bayesian prior?"
    https://errorstatistics.com/2015/06/18/can-you-change-your-bayesian-prior-i/
    I think only subjective Bayesians may say no, but even Dawid says yes. I'd like to know what you think.

  26. I didn't mean to imply that frequentists use priors. But sometimes you can solve for the prior that is consistent with how they operate. Regarding changing the prior, I'm more skeptical that that is OK, but I am influenced by working in a regulatory environment where pre-specification is all-important.

  27. No, Greg doesn't even purport to disprove me, nor can he. As a logic professor, I'm clear on the logical mistake that led people to think Birnbaum had proved the LP--though it's quite subtle, and took a long time to explain (not to spot). My disproof is even deeper than Evans's disproof because, as I explain in my paper, it requires more than a mere counterexample.

  28. I posted because I think it is good to see alternative viewpoints and because I think it illustrates an important issue: The class of evidence functions being considered must be large enough to include functions that depend on the sample space and those that do not depend on the sample space. Otherwise the argument is effectively tautological.

  29. One interesting point is that likelihoodists don't really care about the proof of the LP from the CP or SP. This is because the LP is implied by the Law of Likelihood. Likewise if one defines the measure of the strength of evidence to be a Bayes factor, a posterior probability, or some distance between the two hypotheses. However, if the measure of the strength of evidence is defined to be a probability or some other metric that depends on the sample space, then the LP will not apply. It all boils down to the fundamental building blocks: (1) what is the measure of the strength of evidence, (2) what is the probability that a study will generate misleading evidence, and (3) what is the probability that an observed measure is misleading. This is how systems for measuring evidence ought to be evaluated and compared.
