Monday, October 9, 2017

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

(In a Bayesian analysis) It is entirely appropriate to collect data
until a point has been proven or disproven, or until the data collector
runs out of time, money, or patience. - Edwards, Lindman, Savage (1963)

Bayesian inference, which follows the likelihood principle, is not affected by the experimental design or intentions of the investigator. P-values can only be computed if both of these are known, and as has been described by Berry (1987) and others, it is almost never the case that the computation of the p-value at the end of a study takes into account all the changes in design that were necessitated when pure experimental designs encounter the real world.

When performing multiple data looks as a study progresses, one can accelerate learning by more quickly abandoning treatments that do not work, by sometimes stopping early for efficacy, and frequently by arguing to extend a promising but as-yet-inconclusive study by adding subjects over the originally intended sample size. Indeed, the whole exercise of computing a single sample size is thought to be voodoo by most practicing statisticians. It has become almost comical to listen to rationalizations for choosing larger detectable effect sizes so that smaller sample sizes will yield adequate power.

Multiplicity and the resulting inflation of type I error when using frequentist methods are real. While Bayesians concern themselves with "what did happen?", frequentists must consider "what might have happened?" because of the backwards time and information flow used in their calculations. Frequentist inference must envision an indefinitely long string of identical experiments and must consider extremes of data over potential studies and over multiple looks within each study if multiple looks were intended. Multiplicity comes from the chances (over study repetitions and data looks) you give data to be more extreme (if the null hypothesis holds), not from the chances you give an effect to be real. It is only the latter that is of concern to a Bayesian. Bayesians entertain only one dataset at a time, and if one computes posterior probabilities of efficacy multiple times, it is only the last value calculated that matters.

To better understand the last point, consider a probabilistic pattern recognition system for identifying enemy targets in combat. Suppose the initial assessment when the target is distant is a probability of 0.3 of being an enemy vehicle. Upon coming closer the probability rises to 0.8. Finally the target is close enough (or the air clears) so that the pattern analyzer estimates a probability of 0.98. The fact that the probability was < 0.98 earlier is of no consequence as the gunner prepares to fire a cannon. Even though the probability may actually decrease while the shell is in the air due to new information, the probability at the time of firing was completely valid based on then-available information.

This is very much how an experimenter would work in a Bayesian clinical trial. The stopping rule is unimportant when interpreting the final evidence. Earlier data looks are irrelevant. The only ways a Bayesian would cheat would be to ignore a later look if it is less favorable than an earlier look, or to try to pull the wool over reviewers' eyes by changing the prior distribution once data patterns emerge.

The meaning and accuracy of posterior probabilities of efficacy in a clinical trial are mathematical necessities that follow from Bayes' rule, if the data model is correctly specified (this model is needed just as much by frequentist methods). So no simulations are needed to demonstrate these points. But for the non-mathematically minded, simulations can be comforting. For everyone, simulation code exposes the logic flow in the Bayesian analysis paradigm.

One other thing: when the frequentist does a sequential trial with possible early termination, the sampling distribution of the statistics becomes extremely complicated, but must be derived to allow one to obtain proper point estimates and confidence limits. It is almost never the case that the statistician actually performs these complex adjustments in a clinical trial with multiple looks. One example of the harm of ignoring this problem is that if the trial stops fairly early for efficacy, efficacy will be overestimated. On the other hand, the Bayesian posterior mean/median/mode of the efficacy parameter will be perfectly calibrated by the prior distribution you assume. If the prior is skeptical and one stops early, the posterior mean will be "pulled back" by a perfect amount, as shown in the simulation below.

We consider the simplest clinical trial design for illustration. The efficacy measure is assumed to be normally distributed with mean μ and variance 1.0, μ=0 indicates no efficacy, and μ<0 indicates a detrimental effect. Our inferential jobs are to see whether evidence may be had for a positive effect and, further, whether there is evidence for a clinically meaningful effect (except for the futility analysis, we will ignore the latter in what follows). Our business task is to not spend resources on treatments that have a low chance of having a meaningful benefit to patients. The latter can also be an ethical issue: we'd like not to expose too many patients to an ineffective treatment. In the simulation, we stop for futility when the probability that μ < 0.05 exceeds 0.9, considering μ=0.05 to be a minimal clinically important effect.

The logic flow in the simulation exposes what is assumed by the Bayesian analysis.

  1. The prior distribution for the unknown effect μ is taken as a mixture of two normal distributions, each with mean zero. This is a skeptical prior that gives an equal chance for detriment as for benefit from the treatment. Any prior would have done.
  2. In the next step it is seen that the Bayesian does not consider a stream of identical trials but instead (and only when studying performance of Bayesian operating characteristics) considers a stream of trials with different efficacies of treatment, by drawing a single value of μ from the prior distribution. This is done independently for 50,000 simulated studies. Posterior probabilities are not informed by this value of μ. Bayesians operate in a predictive mode, trying for example to estimate Prob(μ>0) no matter what the value of μ.
  3. For the current value of μ, simulate an observation from a normal distribution with mean μ and SD=1.0. [In the code below all n=500 subjects' data are simulated at once then revealed one-at-a-time.]
  4. Compute the posterior probability of efficacy (μ > 0) and of futility (μ < 0.05) using the original prior and latest data.
  5. Stop the study if the probability of efficacy ≥0.95 or the probability of futility ≥0.9.
  6. Repeat the last 3 steps, sampling one more subject each time and performing analyses on the accumulated set of subjects to date.
  7. Stop the study when 500 subjects have entered.
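
Before the full code and results given below, here is a minimal base-R sketch of the above steps, just to make the logic flow concrete. The prior component SDs and mixture weights are placeholders rather than the ones actually used, and the posterior is computed in closed form from the conjugacy of a mean-zero normal-mixture prior with a normal likelihood.

```r
## Minimal sketch of the simulation loop (base R); prior component SDs and
## weights are placeholders, not the ones used in the actual simulations
w   <- c(0.5, 0.5)    # mixture weights (assumed)
tau <- c(0.5, 2.5)    # prior SDs of the two mean-zero components (assumed)

## P(mu > cut) after n observations with mean ybar, data SD = 1 (conjugate update)
postprob <- function(ybar, n, cut = 0) {
  v  <- 1 / (1 / tau ^ 2 + n)                      # posterior variance, each component
  m  <- v * n * ybar                               # posterior mean, each component
  wp <- w * dnorm(ybar, 0, sqrt(tau ^ 2 + 1 / n))  # updated mixture weights
  wp <- wp / sum(wp)
  sum(wp * pnorm((m - cut) / sqrt(v)))
}

set.seed(1)
nsim <- 1000     # increase to 50000 to match the scale described above
N    <- 500
res  <- matrix(NA, nsim, 3, dimnames = list(NULL, c('mu', 'n', 'postprob')))
for (i in 1:nsim) {
  comp <- sample(1:2, 1, prob = w)      # step 2: draw the true mu from the prior
  mu   <- rnorm(1, 0, tau[comp])
  y    <- rnorm(N, mu, 1)               # step 3: simulate all subjects up front
  for (n in 1:N) {                      # steps 4-7: one analysis per added subject
    pe <- postprob(mean(y[1:n]), n, 0)            # P(efficacy): mu > 0
    pf <- 1 - postprob(mean(y[1:n]), n, 0.05)     # P(futility): mu < 0.05
    if (pe >= 0.95 || pf >= 0.9) break
  }
  res[i, ] <- c(mu, n, pe)
}
```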

What is it that the Bayesian must demonstrate to the frequentist and reviewers? She must demonstrate that the posterior probabilities computed as stated above are accurate, i.e., they are well calibrated. From our simulation design, the final posterior probability will either be the posterior probability computed after the last (500th) subject has entered, the probability of futility at the time of stopping for futility, or the probability of efficacy at the time of stopping for efficacy. How do we tell if the posterior probability is accurate? By comparing it to the value of μ (unknown to the posterior probability calculation) that generated the sequence of data points that were analyzed. We can compute a smooth nonparametric calibration curve for each of (efficacy, futility), where the binary events are μ > 0 and μ < 0.05, respectively. For the subset of the 50,000 studies that were stopped early, the range of probabilities is limited, so we can just compare the mean posterior probability at the moment of stopping with the proportion of such stopped studies for which efficacy (futility) was the truth. The mathematics of Bayes dictates that the mean probability and the proportion must be the same (if enough trials are run so that simulation error approaches zero). This is what happened in the simulations.

For the smaller set of studies not stopping early, the posterior probability of efficacy is uncertain and will have a much wider range. The calibration accuracy of these probabilities is checked using a nonparametric calibration curve estimator just as we do in validating risk models, by fitting the relationship between the posterior probability and the binary event μ>0.
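
Both calibration checks take only a few lines. In this sketch, which continues from the simulation sketch above, plain lowess stands in for the nonparametric calibration curve estimator.

```r
## Calibration checks, using the res matrix from the sketch above
d       <- as.data.frame(res)
stopped <- d$n < 500 & d$postprob >= 0.95           # stopped early for efficacy
## Early-stopped studies: mean posterior probability vs. proportion with true mu > 0
c(mean(d$postprob[stopped]), mean(d$mu[stopped] > 0))
## Studies that ran to n = 500: smooth calibration curve (lowess as a simple stand-in)
full <- d[d$n == 500, ]
plot(lowess(full$postprob, as.numeric(full$mu > 0), iter = 0), type = 'l',
     xlab = 'Posterior probability of efficacy', ylab = 'Proportion with true mu > 0')
abline(0, 1, lty = 2)                                # reference line: perfect calibration
```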

The simulations also demonstrated that the posterior mean efficacy at the moment of stopping is perfectly calibrated as an estimator of the true unknown μ.

Simulations were run in R and used functions in the R Hmisc and rms packages. The results are below. Feel free to take the code and alter it to run any simulations you'd like.

Wednesday, October 4, 2017

Bayesian vs. Frequentist Statements About Treatment Efficacy

The following examples are intended to show the advantages of Bayesian reporting of treatment efficacy analysis, as well as to provide examples contrasting with frequentist reporting. As detailed here, there are many problems with p-values, and some of those problems will be apparent in the examples below. Many of the advantages of Bayes are summarized here. As seen below, Bayesian posterior probabilities prevent one from concluding equivalence of two treatments on an outcome when the data do not support that (i.e., the "absence of evidence is not evidence of absence" error).

Suppose that a parallel group randomized clinical trial is conducted to gather evidence about the relative efficacy of new treatment B to a control treatment A. Suppose there are two efficacy endpoints: systolic blood pressure (SBP) and time until cardiovascular/cerebrovascular event. Treatment effect on the first endpoint is assumed to be summarized by the B-A difference in true mean SBP. The second endpoint is assumed to be summarized as a true B:A hazard ratio (HR). For the Bayesian analysis, assume that pre-specified skeptical prior distributions were chosen as follows. For the unknown difference in mean SBP, the prior was normal with mean 0 and SD chosen so that the probability that the absolute difference in SBP between A and B exceeds 10mmHg was only 0.05. For the HR, the log HR was assumed to have a normal distribution with mean 0 and SD chosen so that the prior probability that the HR > 2 or HR < 1/2 was 0.05. Both priors specify that treatment B is as likely to be effective as detrimental. The two prior distributions will be referred to as p1 and p2.
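
For concreteness, the prior SDs implied by those tail-area specifications follow directly from the normal quantile function (values rounded):

```r
## Prior SDs implied by the stated tail probabilities
sd1 <- 10 / qnorm(0.975)       # SBP difference prior: P(|difference| > 10) = 0.05 -> SD ~ 5.1
sd2 <- log(2) / qnorm(0.975)   # log HR prior: P(HR > 2 or HR < 1/2) = 0.05 -> SD ~ 0.354
c(sd1 = sd1, sd2 = sd2)
```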

Example 1: So-called "Negative" Trial (Considering only SBP)

  • Frequentist Statement
    • Incorrect Statement: Treatment B did not improve SBP when compared to A (p=0.4)
    • Confusing Statement: Treatment B was not significantly different from treatment A (p=0.4)
    • Accurate Statement: We were unable to find evidence against the hypothesis that A=B (p=0.4). More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the confidence interval below).
    • Supplemental Information: The observed B-A difference in means was 4mmHg with a 0.95 confidence interval of [-5, 13]. If this study could be indefinitely replicated and the same approach used to compute the confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means.
  • Bayesian Statement
    • Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67. Alternative statement: SBP is probably (0.67) reduced with treatment B. The probability that B is inferior to A is 0.33. Assuming a minimally clinically important difference in SBP of 3mmHg, the probability that the mean for A is within 3mmHg of the mean for B is 0.53, so the study is uninformative about the question of similarity of A and B.
    • Supplemental Information: The posterior mean difference in SBP was 3.3mmHg and the 0.95 credible interval is [-4.5, 10.5]. The probability is 0.95 that the true treatment effect is in the interval [-4.5, 10.5]. [could include the posterior density function here, with a shaded right tail with area 0.67.]
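
As a sketch of where such probabilities come from, suppose the likelihood for the SBP effect is approximated as normal. With prior p1 and hypothetical summary statistics (an observed reduction of 4mmHg with standard error 4.6, chosen only for illustration), the posterior is again normal and the probabilities are simple tail areas. The numbers will not exactly reproduce those quoted above, which came from the full analysis.

```r
## Sketch: posterior probabilities for the SBP effect under prior p1, using a normal
## approximation to the likelihood; delta = reduction in mean SBP with B relative to A
## (positive values favor B).  The summary statistics below are hypothetical.
prior.sd  <- 10 / qnorm(0.975)   # prior p1: delta ~ N(0, prior.sd^2)
est       <- 4                   # hypothetical observed reduction (mmHg)
se        <- 4.6                 # hypothetical standard error of that estimate
post.var  <- 1 / (1 / prior.sd ^ 2 + 1 / se ^ 2)
post.mean <- post.var * est / se ^ 2
post.sd   <- sqrt(post.var)
c(P.B.lowers.SBP = 1 - pnorm(0, post.mean, post.sd),
  P.within.3mmHg = pnorm(3, post.mean, post.sd) - pnorm(-3, post.mean, post.sd))
```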

Example 2: So-called "Positive" Trial

  • Frequentist Statement
    • Incorrect Statement: The probability that there is no difference in mean SBP between A and B is 0.02
    • Confusing Statement: There was a statistically significant difference between A and B (p=0.02).
    • Correct Statement: There is evidence against the null hypothesis of no difference in mean SBP (p=0.02), and the observed difference favors B. Had the experiment been exactly replicated indefinitely, 0.02 of such repetitions would result in more impressive results if A=B.
    • Supplemental Information: Similar to above.
    • Second Outcome Variable, If the p-value is Small: Separate statement, of same form as for SBP.
  • Bayesian Statement
    • Assuming prior p1, the probability that B lowers SBP when compared to A is 0.985. Alternative statement: SBP is probably (0.985) reduced with treatment B. The probability that B is inferior to A is 0.015.
    • Supplemental Information: Similar to above, plus evidence about clinically meaningful effects, e.g.: The probability that B lowers SBP by more than 3mmHg is 0.81.
    • Second Outcome Variable: Bayesian approach allows one to make a separate statement about the clinical event HR and to state evidence about the joint effect of treatment on SBP and HR. Examples: Assuming prior p2, HR is probably (0.79) lower with treatment B. Assuming priors p1 and p2, the probability that treatment B both decreased SBP and decreased event hazard was 0.77. The probability that B improved either of the two endpoints was 0.991.
One would also report basic results. For SBP, frequentist results might be chosen as the mean difference and its standard error. Basic Bayesian results could be said to be the entire posterior distribution of the SBP mean difference.
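
Joint statements such as the 0.77 above are easiest to compute from posterior draws. The sketch below uses made-up normal posteriors for the SBP reduction and the log hazard ratio, treated as independent purely for illustration; in a real analysis the draws would come from the joint posterior distribution.

```r
## Sketch: joint posterior statements from posterior draws.  The two posteriors below
## are hypothetical and treated as independent purely for illustration.
set.seed(2)
ns    <- 100000
delta <- rnorm(ns, 6, 2.8)        # hypothetical draws: SBP reduction with B (mmHg)
loghr <- rnorm(ns, -0.15, 0.18)   # hypothetical draws: log hazard ratio (B:A)
c(P.improved.both   = mean(delta > 0 & loghr < 0),
  P.improved.either = mean(delta > 0 | loghr < 0))
```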

Note that if multiple looks were made as the trial progressed, the frequentist estimates (including the observed mean difference) would have to undergo complex adjustments. Bayesian results require no modification whatsoever, but just involve reporting the latest available cumulative evidence.