Statistical Thinking

This blog is devoted to statistical thinking and its impact on science and everyday life. Emphasis is given to maximizing the use of information, avoiding statistical pitfalls, describing problems caused by the frequentist approach to statistical inference, describing advantages of Bayesian and likelihood methods, and discussing intended and unintended differences between statistics and data science. I'll also cover regression modeling strategies, clinical trials, and drug evaluation.

Wednesday, October 4, 2017

Bayesian vs. Frequentist Statements About Treatment Efficacy

The following examples are intended to show the advantages of Bayesian reporting of treatment efficacy analysis, as well as to provide examples contrasting with frequentist reporting. As detailed here, there are many problems with p-values, and some of those problems will be apparent in the examples below. Many of the advantages of Bayes are summarized here. As seen below, Bayesian posterior probabilities prevent one from concluding equivalence of two treatments on an outcome when the data do not support that (i.e., the "absence of evidence is not evidence of absence" error).

Suppose that a parallel group randomized clinical trial is conducted to gather evidence about the relative efficacy of new treatment B to a control treatment A. Suppose there are two efficacy endpoints: systolic blood pressure (SBP) and time until cardiovascular/cerebrovascular event. Treatment effect on the first endpoint is assumed to be summarized by the B-A difference in true mean SBP. The second endpoint is assumed to be summarized as a true B:A hazard ratio (HR). For the Bayesian analysis, assume that pre-specified skeptical prior distributions were chosen as follows. For the unknown difference in mean SBP, the prior was normal with mean 0 with SD chosen so that the probability that the absolute difference in SBP between A and B exceeds 10mmHg was only 0.05. For the HR, the log HR was assumed to have a normal distribution with mean 0 and SD chosen so that the prior probability that the HR > 2 or HR < 1/2 was 0.05. Both priors specify that it is equally likely that treatment B is effective as it is detrimental. The two prior distributions will be referred to as p1 and p2.

Example 1: So-called "Negative" Trial (Considering only SBP)

Frequentist Statement

Incorrect Statement: Treatment B did not improve SBP when compared to A (p=0.4)
Confusing Statement: Treatment B was not significantly different from treatment A (p=0.4)
Accurate Statement: We were unable to find evidence against the hypothesis that A=B (p=0.4). More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the confidence interval below).
Supplemental Information: The observed B-A difference in means was 4mmHg with a 0.95 confidence interval of [-5, 13]. If this study could be indefinitely replicated and the same approach used to compute the confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means.

Bayesian Statement

Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67. Alternative statement: SBP is probably (0.67) reduced with treatment B. The probability that B is inferior to A is 0.33. Assuming a minimally clinically important difference in SBP of 3mmHg, the probability that the mean for A is within 3mmHg of the mean for B is 0.53, so the study is uninformative about the question of similarity of A and B.
Supplemental Information: The posterior mean difference in SBP was 3.3mmHg and the 0.95 credible interval is [-4.5, 10.5]. The probability is 0.95 that the true treatment effect is in the interval [-4.5, 10.5]. [could include the posterior density function here, with a shaded right tail with area 0.67.]

Example 2: So-called "Positive" Trial

Frequentist Statement

Incorrect Statement: The probability that there is no difference in mean SBP between A and B is 0.02
Confusing Statement: There was a statistically significant difference between A and B (p=0.02).
Correct Statement: There is evidence against the null hypothesis of no difference in mean SBP (p=0.02), and the observed difference favors B. Had the experiment been exactly replicated indefinitely, 0.02 of such repetitions would result in more impressive results if A=B.
Supplemental Information: Similar to above.
Second Outcome Variable, If the p-value is Small: Separate statement, of same form as for SBP.

Bayesian Statement

Assuming prior p1, the probability that B lowers SBP when compared to A is 0.985. Alternative statement: SBP is probably (0.985) reduced with treatment B. The probability that B is inferior to A is 0.015.
Supplemental Information: Similar to above, plus evidence about clinically meaningful effects, e.g.: The probability that B lowers SBP by more than 3mmHg is 0.81.
Second Outcome Variable: Bayesian approach allows one to make a separate statement about the clinical event HR and to state evidence about the joint effect of treatment on SBP and HR. Examples: Assuming prior p2, HR is probably (0.79) lower with treatment B. Assuming priors p1 and p2, the probability that treatment B both decreased SBP and decreased event hazard was 0.77. The probability that B improved either of the two endpoints was 0.991.

One would also report basic results. For SBP, frequentist results might be chosen as the mean difference and its standard error. Basic Bayesian results could be said to be the entire posterior distribution of the SBP mean difference.

Note that if multiple looks were made as the trial progressed, the frequentist estimates (including the observed mean difference) would have to undergo complex adjustments. Bayesian results require no modification whatsoever, but just involve reporting the latest available cumulative evidence.

17 comments:

UnknownOctober 4, 2017 at 12:49 PM
If I understand the last sentence correctly, then I do not agree with it.

I remember several years ago hearing something similar (but I think I misunderstood at that point, did not hear enough) and tried a test. I simulated a set of 100 values from a binomial (actually Bernoulli) with proportion 0.5 and starting with the 10th observation computed a posterior distribution given a uniform prior and binomial likelihood. From the posterior I calculated the probability that the true proportion was less than 0.5 and had the simulation stop and report the posterior if that probability was less than 0.05. Then I ran this whole process a bunch of times (at least 1,000 but I don't remember exactly). When I looked at all 100 draws from each simulation then the proportion of posterior probabilities less than 0.05 was about 5%, but if I let the simulation stop early, then the proportion was about 14%. Researcher degrees of freedom and the garden of forking paths can affect Bayesian analysis as well.

Later I saw a presentation by Scott Berry (and later read his book on Bayesian adaptive clinical trials) where he recommended using simulations to choose an appropriate prior and probability cut-off on the posterior to give desired properties of the trial.

The best option for multiple looks at the data with possible early stopping is to use the simulations to choose the prior and stopping rule. At least the Bayesian analysis should honestly report the number of actual and potential looks at the data.
ReplyDelete
Replies
Frank HarrellOctober 4, 2017 at 4:36 PM
Thanks for your comments. Scott Berry's statement was in the context of computing frequentist type I error, which I do not care about. I think that the simulation you described did not calculate the needed probability. You don't want the proportion of time the posterior probabilities crossed a threshold. That would be related to Bayesian power. What we want is to determine whether the posterior probability at the moment of stopping is well calibrated. I ran one simulation of 10,000 clinical trials with 400 looks at the data (one look after each new subject is added) with a rule to stop when the posterior prob. exceeds 0.95. The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.

There are no multiplicity problems with Bayes. Multiplicity comes from the chances you give data to be more extreme (relevant in the frequentist world) not the chances you give assertions to be true.
ReplyDelete
Replies
MeganOctober 4, 2017 at 8:32 PM
A useful post, thank you. Can you suggest some readings/references for someone new to Bayesian methodology that is specific to the clinical trials context?
ReplyDelete
Replies
Georg HeinzeOctober 5, 2017 at 2:29 AM
'Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67.'
Clinical researchers will appreciate this as they now think that their study proves that '2 out of 3 patients will benefit from treatment B' (despite the insignificant result). Correct me if I am wrong, but I guess the statement should deal with mean SBP.

ReplyDelete
Replies
Daniel LakensOctober 5, 2017 at 8:31 AM
In a Frequentist paradigm, and using the idea that anything that lowers SPB by more than 3mmHg is 0.8 is meaningful, the appropriate test to discuss is an equivalence test (such as TOST). Using TOST *and* Bayes, you can easily perform power analyses to design an informative study, control the Type 1 error rate (which you don't care about, but if I were a patient receiving a drug you worked on, I would care about), and then you can still add the posterior probability. I don't know why you wouldn't at least raise the bar in your Frequentist criticism a little bit. The p > 0.05 so no effect fallacy is easy to criticize, but also trivial. I'd much rather read your criticism on equivalence testing, and learn something less trivial.

Your report of the posterior probabilities are also not very attractice. You can easily calculate your poaterior, but you can't compute mine. And for me, my posterior is what matters - I don't care about what you believe. So a better report would contain a snesitivity analysis, plotting posteriors across a range of priors. Do you agree, or not? That obviously changes the conclusion a bit.

Overal, it is my strogn conviction that you love more by ignoring Frequentist stats, than you gain. As long as you make correct inferences (from a Neyman-Pearson perspective) you complement your research, especially when designing studies, at almost no cost (because you will end with your posterior anyway). Bayesian stats has limitations, and Frequentist stats has limitations, but there is nothing preventing you from embracing the relative strengths of both approaches. Saying 'I don't care about error rates' is your right, but you should expect a decent proportion of readers to care about it. Alternatively, you can discuss how you would in practice deal with situations where error control matters - e.g., exploring 100 DV's, and reporting the one with the highest posterior is perfectly fine in Bayesian stats, but I see no guidelines on how to prevent massive amounts of misleading information if people work like this.
ReplyDelete
Replies

Add comment