Sunday, February 19, 2017

My Journey From Frequentist to Bayesian Statistics

Type I error for a smoke detector: probability of alarm given no fire = 0.05
Bayesian: probability of fire given current air data

Frequentist smoke alarm designed as most research is done:
Set the alarm trigger so as to have a 0.8 chance of detecting an inferno

Advantage of actionable evidence quantification:
Set the alarm to trigger when the posterior probability of a fire exceeds 0.02 while at home and at 0.01 while away
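
To make the contrast concrete, here is a minimal R sketch with purely hypothetical numbers (the prior probability of fire and the detector's sensitivity are assumptions for illustration): the Bayesian quantity is the probability of fire given a positive sensor reading, not the probability of an alarm given no fire.

```r
# Hypothetical numbers for illustration only
p_fire         <- 0.001  # assumed prior probability that a fire is burning
p_alarm_fire   <- 0.80   # assumed sensitivity: P(alarm | fire)
p_alarm_nofire <- 0.05   # type I error: P(alarm | no fire)

# Bayes' rule: P(fire | alarm)
p_fire_alarm <- p_alarm_fire * p_fire /
  (p_alarm_fire * p_fire + p_alarm_nofire * (1 - p_fire))
p_fire_alarm
```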


If I had been taught Bayesian modeling before being taught the frequentist paradigm, I'm sure I would have always been a Bayesian.  I started becoming a Bayesian around 1994 because of an influential paper by David Spiegelhalter and because I worked in the same building at Duke University as Don Berry.  Two other things strongly contributed to my thinking: difficulties explaining p-values and confidence intervals (especially the latter) to clinical researchers, and the difficulty of learning group sequential methods in clinical trials.  When I talked with Don and learned about the flexibility of the Bayesian approach to clinical trials, and saw Spiegelhalter's embrace of Bayesian methods because of their problem-solving power, I was hooked.  [Note: I've heard Don say that he became a Bayesian after multiple attempts to teach statistics students the exact definition of a confidence interval.  He decided the concept was defective.]

At the time I was working on clinical trials at Duke and started to see that multiplicity adjustments were arbitrary.  This started with a clinical trial coordinated by Duke in which a low dose and a high dose of a new drug were to be compared to placebo, using an alpha cutoff of 0.03 for each comparison to adjust for multiplicity.  The comparison of the high dose with placebo resulted in a p-value of 0.04, and the trial was labeled completely "negative", which seemed problematic to me.  [Note: the p-value was two-sided and thus didn't give any special "credit" for the treatment effect coming out in the right direction.]

I began to see that the hypothesis testing framework wasn't always the best approach to science, and that in biomedical research the typical hypothesis was an artificial construct designed to placate a reviewer who believed that an NIH grant's specific aims must include null hypotheses.  I saw the contortions that investigators went through to achieve this, came to see that questions are more relevant than hypotheses, and that estimation is even more important than questions.  With Bayes, estimation is emphasized.  I much prefer Bayesian modeling to hypothesis testing.  I saw that a large number of clinical trials were incorrectly interpreted when p>0.05 because the investigators involved failed to realize that a p-value can only provide evidence against a hypothesis.  Investigators are motivated by "we spent a lot of time and money and must have gained something from this experiment."  The classic "absence of evidence is not evidence of absence" error results, whereas with Bayes it is easy to estimate the probability of similarity of two treatments.  Investigators will be surprised to know how little we have learned from clinical trials that are not huge when p>0.05.
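
For instance, given posterior draws of the two treatment means, the probability that the treatments are similar is just the proportion of draws in which they differ by less than a clinically trivial margin.  A minimal R sketch, with simulated draws standing in for real MCMC output and an assumed similarity margin:

```r
set.seed(1)
# Hypothetical posterior draws of the two treatment means (stand-ins for MCMC output)
draws_A <- rnorm(10000, mean = 1.0, sd = 0.4)
draws_B <- rnorm(10000, mean = 1.1, sd = 0.4)
margin  <- 0.3   # assumed clinically trivial difference

mean(abs(draws_B - draws_A) < margin)   # P(treatments similar | data)
```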

I listened to many discussions of famous clinical trialists debating what should be the primary endpoint in a trial, the co-primary endpoint, the secondary endpoints, co-secondary endpoints, etc.  This was all because of their paying attention to alpha-spending.  I realized this was all a game.

I came to not believe in the possibility of infinitely many repetitions of identical experiments, as the frequentist paradigm requires one to envision.  When I looked more thoroughly into the multiplicity problem and sequential testing, and examined Bayesian solutions, I became more of a believer in the approach.  I learned that posterior probabilities have a simple interpretation independent of the stopping rule and frequency of data looks.  I got involved in working with the FDA and then consulting with pharmaceutical companies, and started observing how multiple clinical endpoints were handled.  I saw a closed testing procedure in which a company was seeking a superiority claim for a new drug and, if there was insufficient evidence for such a claim, wanted to seek a non-inferiority claim on another endpoint.  The closed testing procedure they developed, when diagrammed, truly looked like a train wreck.  I felt there had to be a better approach, so I sought to see how far posterior probabilities could be pushed.  I found that with MCMC simulation of Bayesian posterior draws I could quite simply compute probabilities such as P(any efficacy), P(efficacy more than trivial), P(non-inferiority), P(efficacy on endpoint A and on either endpoint B or endpoint C), and P(benefit on more than 2 of 5 endpoints).  I realized that frequentist multiplicity problems came from the chances you give data to be more extreme, not from the chances you give assertions to be true.
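
Computationally, each of these is a one-line proportion once the posterior draws are in hand.  A minimal R sketch, with a simulated draw matrix standing in for real MCMC output and assumed thresholds:

```r
set.seed(2)
# Hypothetical posterior draws of treatment effects on 5 endpoints (positive = benefit)
draws <- matrix(rnorm(10000 * 5, mean = 0.2, sd = 0.3), ncol = 5,
                dimnames = list(NULL, c("A", "B", "C", "D", "E")))
trivial   <- 0.1     # assumed threshold for a more-than-trivial effect
ni_margin <- -0.05   # assumed non-inferiority margin

mean(draws[, "A"] > 0)                                          # P(any efficacy on A)
mean(draws[, "A"] > trivial)                                    # P(efficacy more than trivial on A)
mean(draws[, "B"] > ni_margin)                                  # P(non-inferiority on B)
mean(draws[, "A"] > 0 & (draws[, "B"] > 0 | draws[, "C"] > 0))  # P(A and (B or C))
mean(rowSums(draws > 0) > 2)                                    # P(benefit on more than 2 of 5 endpoints)
```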

I enjoy the fact that posterior probabilities define their own error probabilities, and that they count not only inefficacy but also harm.  If P(efficacy)=0.97, then P(no effect or harm)=0.03.  This is the "regulator's regret", and type I error is not the error of major interest (is it really even an 'error'?).  One minus a p-value is P(data in general are less extreme than those observed if H0 is true), which is the probability of an event I'm not that interested in.

The extreme amount of time I spent analyzing data led me to understand other problems with the frequentist approach.  Parameters are either in a model or not in a model.  We test for interactions with treatment and hope that the p-value is not between 0.02 and 0.2.  We either include the interactions or exclude them, and the power for the interaction test is modest.  Bayesians have a prior for the differential treatment effect and can easily have interactions "half in" the model.  Dichotomous irrevocable decisions are at the heart of many of the statistical modeling problems we have today.  I really like penalized maximum likelihood estimation (which is really empirical Bayes), but once we have a penalized model all of our frequentist inferential framework fails us.  No one can interpret a confidence interval for a biased (shrunken; penalized) estimate.  On the other hand, the Bayesian posterior probability density function, after shrinkage is accomplished using skeptical priors, is just as easy to interpret as if the prior had been flat.  For another example, consider a categorical predictor variable that we hope is predicting in an ordinal (monotonic) fashion.  We tend to model it either as ordinal or as completely unordered (using k-1 indicator variables for k categories).  A Bayesian would say "let's use a prior that favors monotonicity but allows larger sample sizes to override this belief."
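
To see how easy the interpretation stays after shrinkage, here is a minimal conjugate normal-normal sketch in R with hypothetical numbers: a skeptical prior pulls the estimate toward zero, and the resulting posterior probabilities are read exactly as they would be under a flat prior.

```r
est <- 0.5    # hypothetical estimated treatment effect
se  <- 0.25   # its standard error
prior_mean <- 0      # skeptical prior centered on no effect
prior_sd   <- 0.35   # assumed prior SD expressing skepticism about large effects

# Conjugate normal-normal update
post_var  <- 1 / (1 / se^2 + 1 / prior_sd^2)
post_mean <- post_var * (est / se^2 + prior_mean / prior_sd^2)

c(post_mean = post_mean, post_sd = sqrt(post_var))        # shrunken estimate
pnorm(0, post_mean, sqrt(post_var), lower.tail = FALSE)   # P(effect > 0 | data, prior)
```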

Now that adaptive and sequential experiments are becoming more popular, and a formal mechanism is needed to use data from one experiment to inform a later experiment (a good example being the use of adult clinical trial data to inform clinical trials on children when it is difficult to enroll a sufficient number of children for the child data to stand on their own), Bayes is needed more than ever.  It took me a while to realize something that is quite profound: A Bayesian solution to a simple problem (e.g., 2-group comparison of means) can be embedded into a complex design (e.g., adaptive clinical trial) without modification.  Frequentist solutions require highly complex modifications to work in the adaptive trial setting.
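
A minimal sketch of what "without modification" means, using a hypothetical beta-binomial single-arm design: the posterior-probability function written for a fixed-sample analysis is simply called again at each interim look, with a stopping threshold layered on top.

```r
set.seed(3)
p_true <- 0.35                    # true response probability (known only to the simulation)
looks  <- seq(20, 200, by = 20)   # interim analyses after every 20 patients
y      <- rbinom(max(looks), 1, p_true)

# The same function whether the design is fixed-sample or adaptive
post_prob <- function(successes, n, cutoff = 0.30, a = 1, b = 1)
  pbeta(cutoff, a + successes, b + n - successes, lower.tail = FALSE)

for (n in looks) {
  pp <- post_prob(sum(y[1:n]), n)   # P(response probability > 0.30 | data so far)
  cat(sprintf("n = %3d  P(p > 0.30) = %.3f\n", n, pp))
  if (pp > 0.95) { cat("Stop early for efficacy\n"); break }
}
```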

I met likelihoodist Jeffrey Blume in 2008 and started to like the likelihood approach.  It is more Bayesian than frequentist.  I plan to learn more about this paradigm. 

Several readers have asked me how I could believe all this and publish a frequentist-based book such as Regression Modeling Strategies.  There are two primary reasons.  First, I started writing the book before I knew much about Bayes.  Second, I performed a lot of simulation studies that showed that purely empirical model-building had a low chance of capturing clinical phenomena correctly and of validating on new datasets.  I worked extensively with cardiologists such as Rob Califf, Dan Mark, Mark Hlatky, David Pryor, and Phil Harris, who gave me the ideas for injecting clinical knowledge into model specification.  From that experience I wrote Regression Modeling Strategies in the most Bayesian way I could without actually using specific Bayesian methods.  I did this by emphasizing subject-matter-guided model specification.  The section in the book about specification of interaction terms is perhaps the best example.  When I teach the full-semester version of my course I interject Bayesian counterparts to many of the techniques covered.

There are challenges in moving more to a Bayesian approach.  The ones I encounter most frequently are:
  1. Teaching clinical trialists to embrace Bayes when they already do in spirit but not operationally.  Unlearning things is much more difficult than learning things.
  2. How to work with sponsors, regulators, and NIH principal investigators to specify the (usually skeptical) prior up front, and to specify the amount of applicability assumed for previous data.
  3. What is a Bayesian version of the multiple degree of freedom "chunk test"?  Partitioning sums of squares or the log likelihood into components, e.g., combined test of interaction and combined test of nonlinearities, is very easy and natural in the frequentist setting.
  4. How do we specify priors for complex entities such as the degree of monotonicity of the effect of a continuous predictor in a regression model?  The Bayesian approach to this will ultimately be more satisfying, but operationalizing this is not easy.
With new tools such as Stan and well-written, accessible books such as Kruschke's, it's getting easier to be a Bayesian every day.  The R brms package, which uses Stan, makes a large class of regression models even more accessible.
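
As a flavor of how little code this requires, here is a minimal brms sketch; the data frame dat, the 0/1 treatment variable treat, and the normal(0, 1) skeptical prior are all assumptions for illustration, not recommendations.

```r
library(brms)

fit <- brm(y ~ treat, data = dat, family = gaussian(),
           prior = set_prior("normal(0, 1)", class = "b"),
           seed = 123)

summary(fit)
hypothesis(fit, "treat > 0")   # posterior probability that the treatment effect is positive
```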

Update 2017-12-29

Another reason for moving from frequentism to Bayes is that frequentist ideas are so confusing that even expert statisticians frequently misunderstand them, and are tricked into dichotomous thinking because of the adoption of null hypothesis significance testing (NHST).  The paper by BB McShane and D Gal in JASA demonstrates alarming errors in interpretation by many authors of JASA papers.  If those with a high level of statistical training make frequent interpretation errors, could frequentist statistics be fundamentally flawed?  Yes!  In their paper, McShane and Gal described two surveys sent to authors of JASA, as well as to authors of articles not appearing in the statistical literature (luckily for statisticians, the non-statisticians fared a bit worse).  Some of their key findings are as follows.

  1. When a p-value is present, (primarily frequentist) statisticians confuse population vs. sample, especially if the p-value is large.  Even when directly asked whether patients in this sample fared better on one treatment than the other, the respondents often answered according to whether or not p < 0.05.  Dichotomous thinking crept in.
  2. When asked whether the evidence from the data made it more or less likely that a drug is beneficial in the population, many statisticians were again swayed by the p-value and not by the tendencies indicated by the raw data.  They failed to understand that your chances are improved by "playing the odds", and gave different answers depending on whether one was playing the odds for an unknown person or selecting treatment for themselves.
  3. In previous studies by the authors, they found that "applied researchers presented with not only a p-value but also with a posterior probability based on a noninformative prior were less likely to make dichotomization errors."
The authors also echoed Wasserstein, Lazar, and Cobb's concern that we are setting researchers up for failure: "we teach NHST because that's what the scientific community and journal editors use but they use NHST because that's what we teach them.  Indeed, statistics at the undergraduate level as well as at the graduate level in applied fields is often taught in a rote and recipe-like manner that typically focuses exclusively on the NHST paradigm."

Some of the problems with frequentist statistics lie in the way its methods are misused, especially with regard to dichotomization.  But an approach that is so easy to misuse, and that sacrifices direct inference in a futile attempt at objectivity, still has fundamental problems.






See the following for discussions about this article that are not on this blog.

  • https://news.ycombinator.com/item?id=13684429

39 comments:

  1. Another great book is Rethinking by Richard McElreath.

    1. I just started reading it and like it very much. It's highly recommended by Andrew Gelman. Full name is Statistical Rethinking: A Bayesian Course with Examples in R and Stan.

    2. Ditto on "Statistical Rethinking." One of the clearest explanations of the Bayesian approach I've seen.

  2. I'm just an amateur. But, with regards to "defective concepts", I'm troubled by the strong likelihood principle. It looks like Birnbaum proved it a long time ago from two reasonable principles, and it is inconsistent with commonly used significance tests.

    When I google for it, all I find is that Deborah Mayo disputes it, and there's a new proof of it by Gandenberger. I'm a bit surprised that it doesn't get more mention or attention since there are a lot of statisticians like you that express concerns.

    1. Yes it's worthy of much more discussion. I try to come at this from a simpler perspective: do posterior probabilities remain well calibrated when used to trigger stopping a study while taking extremely frequent looks at the data? (They do.) In the future I'll blog about how to do simple simulations that demonstrate such mathematical necessities. A key issue underneath this is getting investigators and reviewers to agree on a choice of prior up front, or at least not having investigators use priors that are very inconsistent with those of reviewers.
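
A minimal sketch of such a simulation (assuming a simple beta-binomial model and a flat prior): draw the true response probability from the prior, look after every single patient, stop when the posterior probability first exceeds 0.95, and check how often the stopped-for assertion is actually true.

```r
set.seed(4)
nsim <- 2000; nmax <- 300; cutoff <- 0.30
stopped <- logical(nsim); truth <- logical(nsim)

for (i in 1:nsim) {
  p  <- rbeta(1, 1, 1)                 # true value drawn from the (flat) prior
  s  <- cumsum(rbinom(nmax, 1, p))     # running number of responses
  pp <- pbeta(cutoff, 1 + s, 1 + seq_len(nmax) - s, lower.tail = FALSE)  # look after every patient
  stopped[i] <- any(pp > 0.95)         # stopped early for "P(p > 0.30 | data) > 0.95"
  truth[i]   <- p > cutoff
}

mean(truth[stopped])   # about 0.95 or higher, despite the extremely frequent looks
```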

  3. Is there a tutorial or beginner level material that compares and contrasts frequentist and Bayesian methods. It will be good to take the same problem and show how the two disciplines approach it. More importantly, as you seem to have done in your own journey, get to the core of the matter and see where they are different. Sometimes use of a name in common parlance ( like probability ) by the two approaches may be different which may not be apparent unless looked at with a lot of rigor by a researcher like you

    1. I found these series of courses quite useful as an introduction to Bayesian statistics! There's a couple videos that show the difference between a Frequentist and a Bayesian approach.

      https://www.youtube.com/watch?v=U1HbB0ATZ_A&list=PLFDbGp5YzjqXQ4oE4w9GVWdiokWB9gEpm

    2. Check out this YouTube lecture series by Richard McElreath. It parallels his outstanding book, "Statistical Rethinking."

      https://www.youtube.com/watch?v=WFv2vS8ESkk&list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z

    3. Here's an introductory article focused exactly on putting frequentist and Bayesian side by side, for both hypothesis testing and parameter estimation. Links at this blog post: http://doingbayesiandataanalysis.blogspot.com/2017/02/the-bayesian-new-statistics-finally.html

    4. This comment has been removed by the author.

    5. This comment has been removed by the author.

    6. This comment has been removed by the author.

    7. I have a set of blog posts that are intended to provide an accessible introduction:

      - http://gandenberger.org/2014/07/21/intro-to-statistical-methods/
      - http://gandenberger.org/2014/07/28/intro-to-statistical-methods-2/
      - http://gandenberger.org/2014/08/26/intro-to-statistical-methods-3/

    8. Thanks to all 4 of you for pointing us to some excellent resources.

  4. Start with https://cran.r-project.org/web/packages/BEST/vignettes/BEST.pdf

    But I hope that a reader will tell us of a place where there are multiple side-by-side analyses. I've been looking for that.

    1. Zoltan Dienes has a number of examples like that in an upcoming issue of Psychonomic Bulletin and Review. Stay tuned! :)

    2. For a few simple side-by-side comparisons of Bayesian and frequentist, for hypothesis testing and parameter estimation, see the article linked in this blog post: http://doingbayesiandataanalysis.blogspot.com/2017/02/the-bayesian-new-statistics-finally.html

  5. Indeed, kids should learn from Bayes pre-school.

    Let them elevate their degrees of belief as they grow older :)

  6. Nicely written. A long time ago, I heard Don Berry say that frequentist concepts like Type 1 error, coverage probability, power, randomization, were very valuable at the design stage of experimentation. But that Bayesian summaries of the data at the end of the study were superior for analysis and interpretation of the data. That made sense to me. I think it is regrettable that there was so much vitriol and attacks made against both camps in the formative years of statistics.

    1. Nicely put. A big selling point of Spiegelhalter's work is that he doesn't preach but shows the problem-solving power of Bayes with real examples. One problem with considering type I error even only pre-study is that it can be really hard to define and the sample space can be complex. I tend to come at this from the standpoint of planning around how much information will result, quantified e.g. by the width of the 0.95 credible interval, or by the probability that the posterior probability will exceed some high threshold, over a distribution of true unknown effects.
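
A minimal sketch of that kind of planning (hypothetical binary-outcome setting with a flat prior): choose the sample size by the expected width of the 0.95 credible interval rather than by power against a single assumed effect.

```r
set.seed(5)
expected_ci_width <- function(n, p = 0.3, a = 1, b = 1, nsim = 2000) {
  y <- rbinom(nsim, n, p)   # hypothetical datasets of size n at an assumed true probability
  mean(qbeta(0.975, a + y, b + n - y) - qbeta(0.025, a + y, b + n - y))
}
sapply(c(50, 100, 200, 400), expected_ci_width)   # expected 0.95 interval widths
```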

    2. Agreed. Moreover, a Bayesian perspective at the design stage goes even further: It incorporates uncertainty into the research hypothesis, instead of assuming a specific effect size (or small set of candidate effect sizes) as is traditionally done in a frequentist approach to design.

    3. Hi John: Differences in interpretation aside, I see the greatest benefit of using Bayesian methods as flexibility. For example, fitting a skewed normal model in which sigma and the skew are estimated for each group. This can also easily be extended to a multilevel framework. As far as I know, this is not currently possible in a frequentist framework.

      Finally, I see the use of so-called noninformative priors as not very Bayesian. We generally can rule out effects greater than d of 1, if not less.

      In sum, the benefits of Bayesian methods are only fully realized, IMO, when one sees the benefits of informative priors (especially in MLM) and the great flexibility offered in RStan and brms, etc.
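
For readers wanting to see what such a model might look like, here is a minimal brms sketch (the data frame dat with outcome y and factor group is hypothetical): a skew-normal likelihood in which sigma and the skew parameter alpha are modeled separately by group.

```r
library(brms)

fit <- brm(bf(y ~ group, sigma ~ group, alpha ~ group),
           data = dat, family = skew_normal(),
           seed = 123)

summary(fit)   # group-specific location, scale (sigma), and skew (alpha) effects
```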

    4. This comment has been removed by the author.

    5. Nicely put Donald. Another example of flexibility that is worked out in detail in the Box and Tiao book is having a parameter specifying the degree of non-normality of the data, and having a prior for that. They show how this leads to something that is almost the trimmed mean but which is much more interpretable.

    6. Hi Frank (I hope that is OK). I (with a collaborator) am currently working on a paper "introducing" a Bayesian heteroscedastic skew-normal model. We are basically characterizing parameter bias, error rates, and power, all while estimating the degree of skew as well as sigma for each group. (Yes, error rates. I don't really agree with type one error, but I still find that it is useful so long as we understand its limitations.)

      Interestingly, for those not wanting to learn Stan or Bayesian methods, I would loosely advocate the trimmed means approach as long as the degree of trim was not used to get a significant p-value. Counter to my original thoughts, the trimmed means approach actually does perform rather well.

      As you said, the Bayesian framework is much more interpretable (IMO) and does not entail directly altering the data. Furthermore, the Bayesian approach actually provides estimates for skew, sigma, etc, which are important for prospective power analyses. Here, it also becomes clear how important priors can be! Not that they influence the final estimates all that much (although they can), but that the model needs them to converge.

      Finally, if one does learn a general Bayesian approach, it will likely fulfill all their needs. This means no jumping around packages or functions to get the "correct" estimate of the standard error. Generalized estimating equations are a good example: there are so many bias corrections (small N, etc.) for the SE that it can make one's head spin (all resulting in slightly to drastically different p-values).

    7. Very interesting Donald, and do call me Frank. I would emphasize mean squared and mean absolute estimation errors, probably. I don't find trimmed means satisfactory because I'm unable to define to a collaborator what they mean prospectively. With Bayes you can estimate the mean or quantiles of the raw data distribution, or better, estimate the whole distribution.

  7. Hi Frank:
    I basically agree with everything you have said here. The trimmed mean is far from intuitive. I did some simulations with trimmed means to see how a determined researcher can "find" significance. Basically, it reduces to a multiple comparisons problem. Even with the exact same data, using different thresholds to trim inflates the error rate to almost 0.05 * the number of tests (assuming, on average, no difference between groups). Lots of researcher degrees of freedom with this approach and others (winsorizing).


  8. Thank you for a very interesting and informative post. One oft-repeated argument against the frequentist perspective that has never resonated with me is the fact that CI are hard to explain. In most (but not all) cases I'm interested in selecting the technique that will maximize my chance to deliver correct conclusions, regardless of how hard it would be for a collaborator to understand the statistical methods I used. Should I avoid using MCMC because my collaborators can't understand the technique?

    I may be at risk of attacking a straw-man here, because of course there are deep philosophical/statistical problems with CI. But that's kind of my point -- isn't it best to focus on these fundamentally problematic aspects rather than didactic issues?

    1. I am not really sure there are deep problems with confidence intervals.

      I use both Bayesian and frequentist in simulation studies, but only Bayesian for analyzing "real" data. That said, just because people misinterpret something does not mean it is bad. With this logic, almost all things in life have deep issues. I find confidence intervals useful, although I do not think that they necessarily generalize exactly to actual research situations. Exploring long-run outcomes, CI coverage, bias,..etc provide useful information, IMO. The problem as I see it, is individuals not realizing the limitations of models, frequentist, Bayesian, or agent based.

  9. I was speaking more of the problem with statisticians and stat grad students understanding the concept. If after multiple attempts at understanding a primary concept in a paradigm one has to give up, there is a problem with the paradigm.

  10. Error statistics (neither Fisherian nor N-P style) never required "the possibility of infinitely many repetitions of identical experiments". That's absurd. When people complain about cherry-picking, p-hacking, optional stopping, data-dependent endpoints, etc. it's because they prevent a stringent test in the case at hand. The appeal to the "ease" (frequency) of producing impressive-looking results, even under Ho, only alludes to hypothetical possibilities (nor need they be identical). Such appeals are at the heart of criticisms of bad statistics and bad science. Unless your Bayesianism takes account of "what could have occurred but didn't" I fail to see your grounds for caring about preregistration, RCTs, etc. You seem to have boxed yourself into an inconsistent position--and I don't know what kind of priors you favor-- based on a mickey-mouse caricature of hypothesis tests.
    On the other hand, if your Bayesian does consider what could have occurred--counterfactual reasoning that we can simulate on our computers today--then you can't say such considerations are irrelevant.

    Frequentists also estimate, and any statistical inference can be appraised according to how well or stringently tested claims are (that's just semantics).

    1. This comment has been removed by the author.

    2. # realized previous post misspelled your name (apologies)

      Hello Deborah: I am not sure what is meant by "...required 'the possibility of infinitely many repetitions of identical experiments'." The usefulness of frequentist statistics, as I see it, is the emphasis on how one's model performs over the long run given a set of assumptions (that may or may not be realistic). As such, I do not see this as much of a problem, so long as we understand that the estimates (e.g., errors) are derived from many assumptions that probably do not generalize one-to-one to "real" world settings. That said, they can be very useful, and at minimum we should ensure our model has optimal calibration (whether this is obtained from error rates, coverage, parameter bias, etc.).

      The estimates, however, are indeed computed from sampling a population with a set of characteristics many (many) times. Infinite would be nice, but really we need about 5,000 to 10,000 repetitions to get an estimate that is stable. Of course, we can then vary population values to see what could happen over the long run under different assumptions. So, I really do not think the argument against repetition is very useful, as that is just how error (type one, or bias) and power are estimated. This is just the way it is, and may or may not make sense, but it is useful (IMO) so long as we do not take our models too seriously (they are all wrong, after all!).

      I prefer Bayesian statistics, but I too study the long-run properties of my models. I do not really see any other way to see how a given model is performing. Posterior predictive checks are also useful, and often provide similar inferences as simulations. For example, exploring the variances of each group can show misfit when assuming equal variances, and over the long run this misfit results in an inflated error rate.

      Furthermore, if confidence and credible intervals are basically equivalent with a so-called "uninformative" prior, it follows that Bayesians have, at minimum, expected error rates. One cannot have equivalent intervals without this being the case! Once said, the notion probably "feels" like common sense. With a prior that is centered at zero and informative, this will just make power lower than in a frequentist analysis, so I do not see how errors are not controlled. If anything, the Bayesian model can be considered sub-optimal (in an NHST framework), since it can provide conservative estimates. This is the kind of Bayesian statistics that I use.

      Now, often Bayesians do not focus on error rates, and would prefer to have a model that best describes the data generating process. This approach often leads to controlling error rates, optimal power, among other things. This occurs as a by-product, so to speak, from focusing on modeling the data.

      Where Bayesian methods really shine is that we do not need to come up with different ways to estimate the standard error, or ways to approximate the degrees of freedom, to obtain reasonable inferences. For example, even to accommodate unequal variances, Welch's t-test resorts to approximating the degrees of freedom for the sampling distribution of the t-statistic. In a multilevel framework, the sampling distribution is entirely unknown, but people have figured out approximations that ensure optimal error rates. Indeed, some would even say that exact p-values do not exist (so much for exact error rates :-)). In contrast, whether comparing two groups or fitting a multilevel model with varying slopes and intercepts, Bayesian methods do not depend on a known sampling distribution and everything is estimated in much the same way (yes, even NHST error rates are obtained!). It is actually quite elegant!

      D

  11. I resonate with Donald's comments on these points and don't see justification for some of Deborah's. Writing simulation pseudo-code will expose many of the issues properly. I don't need to show long-run operating characteristics to show that Bayesian methods optimize the probability of making the right inference for a given set of data. True, I need a large number of simulated clinical trials to demonstrate perfect calibration of Bayesian posterior probabilities, but these simulations are made under an entire array of treatment effects, not for one single effect as with frequentist methods.

  12. One disadvantage of a Bayesian approach is that it doesn't give you an estimate of error. Can this be accomplished simply by applying bootstrap methods to obtain a confidence level to the posterior probabilities?

    1. No, posterior probabilities factor in all uncertainties and are self-contained. But if you want to discuss uncertainties in a parameter, that is represented by the entire posterior distribution.

  13. Could you please give an example of a clinical study analysis in a frequentist manner vs. a Bayesian? Also, could you show examples of "I found that with MCMC simulation of Bayesian posterior draws I could quite simply compute probabilities such as P(any efficacy), P(efficacy more than trivial), P(non-inferiority), P(efficacy on endpoint A and on either endpoint B or endpoint C), and P(benefit on more than 2 of 5 endpoints). " ? Thanks again!

  14. In the coming weeks I'll be able to share a comprehensive set of notes. For now look at http://www.fharrell.com/2017/10/bayesian-vs-frequentist-statements.html
