Statistical Thinking

This blog is devoted to statistical thinking and its impact on science and everyday life. Emphasis is given to maximizing the use of information, avoiding statistical pitfalls, describing problems caused by the frequentist approach to statistical inference, describing advantages of Bayesian and likelihood methods, and discussing intended and unintended differences between statistics and data science. I'll also cover regression modeling strategies, clinical trials, and drug evaluation.

Friday, January 27, 2017

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

Randomized clinical trials (RCT) have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason. But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:

Patients in clinical practice are different from those enrolled in RCTs
Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.

Point 2 is hard to debate because RCTs are run under protocol and research personnel are watching and asking about patients' adherence (but more about this below). But point 1 is a misplaced worry in the majority of trials. The explanation requires getting to the heart of what RCTs are really intended to do: provide evidence for relative treatment effectiveness. There are some trials that provide evidence for both relative and absolute effectiveness. This is especially true when the efficacy measure employed is absolute as in measuring blood pressure reduction due to a new treatment. But many trials use binary or time-to-event endpoints and the resulting efficacy measure is on a relative scale such as the odds ratio or hazard ratio.

RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable. This is most readily seen in subgroup analyses provided by the trials themselves - so called forest plots that demonstrate remarkable constancy of relative treatment benefit. When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply. It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and the absolute baseline risk for the subject. This is covered in detail in Biostatistics for Biomedical Research, Section 13.6. See also Stephen Senn's excellent presentation.

Clinical practice provides anecdotal evidence that biases clinicians. What a clinician sees in her practice is patient i on treatment A and patient j on treatment B. She may remember how patient i fared in comparison to patient j, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness in treatment A vs. B. But the real therapeutic question is how does the outcome of a patient were she given treatment A compare to her outcome were she given treatment B. The gold standard design is thus the randomized crossover design, when the treatment is short acting. Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients.

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are in the order of strongest evidence requiring the least assumptions to the weakest evidence. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.

Let P_i denote patient i and the treatments be denoted by A and B. Thus P₂^B represents patient 2 on treatment B. P₁ represents the average outcome over a sample of patients from which patient 1 was selected. HTE is heterogeneity of treatment effect.

Design	Patients Compared

6-period crossover	P₁^A vs P₁^B (directly measure HTE)
2-period crossover	P₁^A vs P₁^B
RCT in idential twins	P₁^A vs P₁^B
∥ group RCT	P₁^A vs P₂^B, P₁=P₂ on avg
Observational, good artificial control	P₁^A vs P₂^B, P₁=P₂ hopefully on avg
Observational, poor artificial control	P₁^A vs P₂^B, P₁≠ P₂ on avg
Real-world physician practice	P₁^A vs P₂^B

The best experimental designs yield the best evidence a clinician needs to answer the "what if" therapeutic question for the one patient in front of her.

Regarding adherence, proponents of "real world" evidence advocate for estimating treatment effects in the context of making treatment adherence low as in clinical practice. This would result in lower efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available for a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is the best hope for estimating efficacy as a function of adherence, through for example an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice. We are entering a new era of pragmatic clinical trials. Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.

Updated 2017-06-25 (last paragraph regarding adherence)

Wednesday, January 25, 2017

Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error

Imaging watching a baseball game, seeing the batter get a hit, and hearing the announcer say "The chance that the batter is left handed is now 0.2!" No one would care. Baseball fans are interested in the chance that a batter will get a hit conditional on his being right handed (handedness being already known to the fan), the handedness of the pitcher, etc. Unless one is an archaeologist or medical examiner, the interest is in forward probabilities conditional on current and past states. We are interested in the probability of the unknown given the known and the probability of a future event given past and present conditions and events.

Clinicians are people trained in the science and practice of medicine, and most of them are very good at it. They are also very good at many aspects of research. But they are generally not taught probability, and this can limit their research skills. Many excellent clinicians even let their limitations in understanding probability make them believe that their clinical decision making is worse than it actually is. I have taught many clinicians who say "I need a hard and fast rule so I know how to diagnosis or treat patients. I need a hard cutoff on blood pressure, HbA1c, etc. so that I know what to do, and the fact that I either treat or not treat the patient means that I don't want to consider a probability of disease but desire a simple classification rule." This makes the clinician try to influence the statistician to use inefficient, arbitrary methods such as categorization, stratification, and matching.

In reality, clinicians do not act that way when treating patients. They are smart enough to know that if a patient has cholesterol just over someone's arbitrary threshold they may not start statin therapy right away if the patient has no other risk factors (e.g., smoking) going against him. They know that sometimes you start a patient on a lower dose and see how she responds, or start one drug and try it for a while and then switch drugs if the efficacy is unacceptable or there is a significant side effect.

So I emphasize the need to understand probabilities when I'm teaching clinicians. A probability is a self-contained summary of the current information, except for the patient's risk aversion and other utilities. Clinicians need to be comfortable with a probability of 0.5 meaning "we don't know much" and not requesting a classification of disease/normal that does nothing but cover up the problem. A classification does not account for gray zones or patient and physician utility functions.

Even physicians who understand the meaning of a probability are often not understanding conditioning. Conditioning is all important, and conditioning on different things massively changes the meaning of the probabilities being computed. Every physician I've known has been taught probabilistic medical diagnosis by first learning about sensitivity (sens) and specificity (spec). These are probabilities that are in backwards time- and information flow order. How did this happen? Sensitivity, specificity, and receiver operating characteristic curves were developed for radar and radio research in the military. It was a important to receive radio signals from distant aircraft, and to detect an incoming aircraft on radar. The ability to detect something that is really there is definitely important. In the 1950s, virologists appropriated these concepts to measure the performance of viral cultures. Virus needs to be detected when it's present, and not detected when it's not. Sensitivity is the probability of detecting a condition when it is truly present, and specificity is the probability of not detecting it when it is truly absent. One can see how these probabilities would be useful outside of virology and bacteriology when the samples are retrospective, as in a case-control studies. But I believe that clinicians and researchers would be better off if backward probabilities were not taught or were mentioned only to illustrate how not to think about a problem.

But the way medical students are educated, they assume that sens and spec are what you first consider in a prospective cohort of patients! This gives the professor the opportunity of teaching Bayes' rule and requires the use of a supposedly unconditional probability known as prevalence which is actually not very well defined. The students plugs everything into Bayes' rule and fails to notice that several quantities cancel out. The result is the following: the proportion of patients with a positive test who have disease, and the proportion with a negative test who have disease. These are trivially calculated from the cohort data without knowing anything about sens, spec, and Bayes. This way of thinking harms the student's understanding for years to come and influences those who later engage in clinical and pharmaceutical research to believe that type I error and p-values are directly useful.

The situation in medical diagnosis gets worse when referral bias (also called workup bias) is present. When certain types of patients do not get a final diagnosis, sens and spec are biased. For example, younger women with a negative test may not get the painful procedure that yields the final diagnosis. There are formulas that must be used to correct sens and spec. But wait! When Bayes' rule is used to obtain the probability of disease we needed in the first place, these corrections completely cancel out when the usual correction methods are used! Using forward probabilities in the first place means that one just conditions on age, sex, and result of the initial diagnostic test and no special methods other than (sometimes) logistic regression are required.

There is an analogy to statistical testing. p-values and type I error are affected by sequential testing and a host of other factors, but forward-time probabilities (Bayesian posterior probabilities) are not. Posterior probabilities condition on what is known and does not have to imagine alternate paths to getting to what is known (as do sens and spec when workup bias exists). p-values and type I errors are backwards-information-flow measures, and clinical researchers and regulators come to believe that type I error is the error of interest. They also very frequently misinterpret p-values. The p-value is one minus spec, and power is sens. The posterior probability is exactly analogous to the probability of disease.

Sens and spec are so pervasive in medicine, bioinformatics, and biomarker research that we don't question how silly they would be in other contexts. Do we dichotomize a response variable so that we can compute the probability that a patient is on treatment B given a "positive" response? On the contrary we want to know the full continuous distribution of the response given the assigned treatment. Again this represents forward probabilities.

Monday, January 23, 2017

Split-Sample Model Validation

Methods used to obtain unbiased estimates of future performance of statistical prediction models and classifiers include data splitting and resampling. The two most commonly used resampling methods are cross-validation and bootstrapping. To be as good as the bootstrap, about 100 repeats of 10-fold cross-validation are required.

As discussed in more detail in Section 5.3 of Regression Modeling Strategies Course Notes and the same section of the RMS book, data splitting is an unstable method for validating models or classifiers, especially when the number of subjects is less than about 20,000 (fewer if signal:noise ratio is high). This is because were you to split the data again, develop a new model on the training sample, and test it on the holdout sample, the results are likely to vary significantly. Data splitting requires a significantly larger sample size than resampling to work acceptably well. See also Section 10.11 of BBR.

There are also very subtle problems:

When feature selection is done, data splitting validates just one of a myriad of potential models. In effect it validates an example model. Resampling (repeated cross-validation or the bootstrap) validate the process that was used to develop the model. Resampling is honest in reporting the results because it depicts the uncertainty in feature selection, e.g., the disagreements in which variables are selected from one resample to the next.
It is not uncommon for researchers to be disappointed in the test sample validation and to ask for a "re-do" whereby another split is made or the modeling starts over, or both. When reporting the final result they sometimes neglect to mention that the result was the third attempt at validation.
Users of split-sample validation are wise to recombine the two samples to get a better model once the first model is validated. But then they have no validation of the new combined data model.

There is a less subtle problem but one that is ordinarily not addressed by investigators: unless both the training and test samples are huge, split-sample validation is not nearly as accurate as the bootstrap. See for example the section Studies of Methods Used in the Text here. As shown in a simulation appearing there, bootstrapping is typically more accurate than data splitting and cross-validation that does not use a large number of repeats. This is shown by estimating the "true" performance, e.g., the R-squared or c-index on an infinitely large dataset (infinite here means 50,000 subjects for practical purposes). The performance of an accuracy estimate is taken as the mean squared error of the estimate against the model's performance in the 50,000 subjects.

Data are too precious to not be used in model development/parameter estimation. Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model. Data splitting only has an advantage when the test sample is held by another researcher to ensure that the validation is unbiased.

Update 2017-01-25

Many investigators have been told that they must do an "external" validation, and they split the data by time or geographical location. They are sometimes surprised that the model developed in one country or time does not validate in another. They should not be; this is an indirect way of saying there are time or country effects. Far better would be to learn about and estimate time and location effects by including them in a unified model. Then rigorous internal validation using the bootstrap, accounting for time and location all along the way. The end result is a model that is useful for prediction at times and locations that were at least somewhat represented in the original dataset, but without assuming that time and location effects are nil.

Wednesday, January 18, 2017

Fundamental Principles of Statistics

There are many principles involved in the theory and practice of statistics, but here are the ones that guide my practice the most.

Use methods grounded in theory or extensive simulation
Understand uncertainty
Design experiments to maximize information
Understand the measurements you are analyzing and don't hesitate to question how the underlying information was captured
Be more interested in questions than in null hypotheses, and be more interested in estimation than in answering narrow questions
Use all information in data during analysis
Use discovery and estimation procedures not likely to claim that noise is signal
Strive for optimal quantification of evidence about effects
Give decision makers the inputs (other than the utility function) that optimize decisions
Present information in ways that are intuitive, maximize information content, and are correctly perceived
Give the client what she needs, not what she wants
Teach the client to want what she needs

... the statistician must be instinctively and primarily a logician and a scientist in the broader sense, and only secondarily a user of the specialized statistical techniques.

In considering the refinements and modifications of the scientific method which particularly apply to the work of the statistician, the first point to be emphasized is that the statistician is always dealing with probabilities and degrees of uncertainty. He is, in effect, a Sherlock Holmes of figures, who must work mainly, or wholly, from circumstantial evidence.

Malcolm C Rorty: Statistics and the Scientific Method. JASA 26:1-10, 1931.

Monday, January 16, 2017

Ideas for Future Articles

Suggestions for future articles are welcomed as comments to this entry. Some topics I intend to write about are listed below.

The litany of problems with p-values - catalog of all the problems I can think of
Matching vs. covariate adjustment (see below from Arne Warnke)
Statistical strategy for propensity score modeling and usage
Analysis of change: why so many things go wrong
What exactly is a type I error and should we care? (analogy: worrying about the chance of a false positive diagnostic test vs. computing current probability of disease given whatever the test result was). Alternate title: Why Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error.
Forward vs. backwards probabilities and why forward probabilities serve as their own error probabilities (we have been fed backwards probabilities such as p-values, sensitivity, and specificity for so long it's hard to look forward)
What is the full meaning of a posterior probability?
Posterior probabilities can be computed as often as desired
Statistical critiques of published articles in the biomedical literature
New dynamic graphics capabilities using R plotly in the R Hmisc package: Showing more by initially showing less
Moving from pdf to html for statistical reporting
Is machine learning statistics or computer science?
Sample size calculation: Is it voodoo?
Difference between Bayesian modeling and frequentist inference
Proper accuracy scoring rules and why improper scores such as proportion "classified" "correctly" give misleading results.

Classification vs. Prediction

Classification combines prediction and decision making and usurps the decision maker in specifying costs of wrong decisions. The classification rule must be reformulated if costs/utilities change. Predictions are separate from decisions and can be used by any decision maker.

The field of machine learning arose somewhat independently of the field of statistics. As a result, machine learning experts tend not to emphasize probabilistic thinking. Probabilistic thinking and understanding uncertainty and variation are hallmarks of statistics. By the way, one of the best books about probabilistic thinking is Nate Silver's The Signal and The Noise: Why So Many Predictions Fail But Some Don't. In the medical field, a classic paper is David Spiegelhalter's Probabilistic Prediction in Patient Management and Clinical Trials.

By not thinking probabilistically, machine learning advocates frequently utilize classifiers instead of using risk prediction models. The situation has gotten acute: many machine learning experts actually label logistic regression as a classification method (it is not). It is important to think about what classification really implies. Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is that utility function.

Classification is a forced choice. In marketing where the advertising budget is fixed, analysts generally know better than to try to classify a potential customer as someone to ignore or someone to spend resources on. They do this by modeling probabilities and creating a lift curve, whereby potential customers are sorted in decreasing order of estimated probability of purchasing a product. To get the "biggest bang for the buck", the marketer who can afford to advertise to n persons picks the n highest-probability customers as targets. This is rational, and classification is not needed here.

A frequent argument from data users, e.g., physicians, is that ultimately they need to make a binary decision, so binary classification is needed. This is simply not true. First of all, it is often the case that the best decision is "no decision; get more data" when the probability of disease is in the middle. In many other cases, the decision is revocable, e.g., the physician starts the patient on a drug at a lower dose and decides later whether to change the dose or the medication. In surgical therapy the decision to operate is irrevocable, but the choice of when to operate is up to the surgeon and the patient and depends on severity of disease and symptoms. At any rate, if binary classification is needed, it must be done at the point of care when all utilities are known, not in a data analysis.

When are forced choices appropriate? I think that one needs to consider whether the problem is mechanistic or stochastic/probabilistic. Machine learning advocates often want to apply methods made for the former to problems where biologic variation, sampling variability, and measurement errors exist. It may be best to apply classification techniques instead just to high signal:noise ratio situations such as those in which there there is a known gold standard and one can replicate the experiment and get almost the same result each time. An example is pattern recognition - visual, sound, chemical composition, etc. If one creates an optical character recognition algorithm, the algorithm can be trained by exposing it to any number of replicates of attempts to classify an image as the letters A, B, ... The user of such a classifier may not have time to consider whether any of the classifications were "close calls." And the signal:noise ratio is extremely high. In addition, there is a single "right" answer for each character.

When close calls are possible, probability estimates are called for. One beauty of probabilities is that they are their own error measures. If the probability of disease is 0.1 and the current decision is not to treat the patient, the probability of this being an error is by definition 0.1. A probability of 0.4 may lead the physician to run another lab test or do a biopsy. When the signal:noise ratio is small, classification is usually not a good goal; there one must model tendencies, i.e., probabilities.

The U.S. Weather Service has always phrased rain forecasts as probabilities. I do not want a classification of "it will rain today." There is a slight loss/disutility of carrying an umbrella, and I want to be the one to make the tradeoff.

Whether engaging in credit risk scoring, weather forecasting, climate forecasting, marketing, diagnosis a patient's disease, or estimating a patient's prognosis, I do not want to use a classification method. I want risk estimates with credible intervals or confidence intervals. My opinion is that machine learning classifiers are best used in mechanistic high signal:noise ratio situations, and that probability models should be used in most other situations.

This is related to a subtle point that has been lost on many analysts. Complex machine learning algorithms, which allow for complexities such as high-order interactions, require an enormous amount of data unless the signal:noise ratio is high, another reason for reserving some machine learning techniques for such situations. Regression models which capitalize on additivity assumptions (when they are true, and this is approximately true is much of the time) can yield accurate probability models without having massive datasets. And when the outcome variable being predicted has more than two levels, a single regression model fit can be used to obtain all kinds of interesting quantities, e.g., predicted mean, quantiles, exceedance probabilities, and instantaneous hazard rates.

A special problem with classifiers illustrates an important issue. Users of machine classifiers know that a highly imbalanced sample with regard to a binary outcome variable Y results in a strange classifier. For example, if the sample has 1000 diseased patients and 1,000,000 non-diseased patients, the best classifier may classify everyone as non-diseased; you will be correct 0.999 of the time. For this reason the odd practice of subsampling the controls is used in an attempt to balance the frequencies and get some variation that will lead to sensible looking classifiers (users of regression models would never exclude good data to get an answer). Then they have to, in some ill-defined way, construct the classifier to make up for biasing the sample. It is simply the case that a classifier trained to a 1/1000 prevalence situation will not be applicable to a population with a vastly different prevalence. The classifier would have to be re-trained on the new sample, and the patterns detected may change greatly. Logistic regression on the other hand elegantly handles this situation by either (1) having as predictors the variables that made the prevalence so low, or (2) recalibrating the intercept (only) for another dataset with much higher prevalence. Classifiers' extreme dependence on prevalence may be enough to make some researchers always use probability estimators instead. One could go so far as to say that classifiers should not be used at all when there is little variation in the outcome variable, and that only tendencies should be modeled.

One of the key elements in choosing a method is having a sensitive accuracy scoring rule with the correct statistical properties. Experts in machine classification seldom have the background to understand this enormously important issue, and choosing an improper accuracy score such as proportion classified correctly will result in a bogus model. This will be discussed in a future blog.

Saturday, January 14, 2017

p-values and Type I Errors are Not the Probabilities We Need

In trying to guard against false conclusions, researchers often attempt to minimize the risk of a "false positive" conclusion. In the field of assessing the efficacy of medical and behavioral treatments for improving subjects' outcomes, falsely concluding that a treatment is effective when it is not is an important consideration. Nowhere is this more important than in the drug and medical device regulatory environments, because a treatment thought not to work can be given a second chance as better data arrive, but a treatment judged to be effective may be approved for marketing, and if later data show that the treatment was actually not effective (or was only trivially effective) it is difficult to remove the treatment from the market if it is safe. The probability of a treatment not being effective is the probability of "regulator's regret." One must be very clear on what is conditioned upon (assumed) in computing this probability. Does one condition on the true effectiveness or does one condition on the available data? Type I error conditions on the treatment having no effect and does not entertain the possibility that the treatment actually worsens the patients' outcomes. Can one quantify evidence for making a wrong decision if one assumes that all conclusions of non-zero effect are wrong up front because H₀ was assumed to be true? Aren't useful error probabilities the ones that are not based on assumptions about what we are assessing but rather just on the data available to us?

Statisticians have convinced regulators that long-run operating characteristics of a testing procedure should rule the day, e.g., if we did 1000 clinical trials where efficacy was always zero, we want no more than 50 of these trials to be judged as "positive." Never mind that this type I error operating characteristic does not refer to making a correct judgment for the clinical trial at hand. Still, there is a belief that type I error is the probability of regulator's regret (a false positive), i.e., that the treatment is not effective when the data indicate it is. In fact, clinical trialists have been sold a bill of goods by statisticians. No probability derived from an assumption that the treatment has zero effect can provide evidence about that effect. Nor does it measure the chance of the error actually in question. All probabilities are conditional on something, and to be useful they must condition on the right thing. This usually means that what is conditioned upon must be knowable.

The probability of regulator's regret is the probability that a treatment doesn't work given the data. So the probability we really seek is the probability that the treatment has no effect or that it has a backwards effect. This is precisely one minus the Bayesian posterior probability of efficacy.

In reality, there is unlikely to exist a treatment that has exactly zero effect. As Tukey argued in 1991, the effects of treatments A and B are always different, to some decimal place. So the null hypothesis is always false and the type I error could be said to be always zero.

The best paper I've read about the many ways in which p-values are misinterpreted is Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations written by a group of renowned statisticians. One of my favorite quotes from this paper is

Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions.

In 2016 the American Statistical Association took a stand against over-reliance on p-values. This would have made a massive impact on all branches of science had it been issued 50 years ago but better late than never.

Update 2017-01-19

Though believed to be true by many non-statisticians, p-values are not the probability that H₀ is true, and to turn them into such probabilities requires Bayes' rule. If you are going to use Bayes' rule you might as well formulate the problem as a full Bayesian model. This has many benefits, not the least of them being that you can select an appropriate prior distribution and you will get exact inference. Attempts by several authors to convert p-values to probabilities of interest (just as sensitivity and specificity are converted to probability of disease once one knows the prevalence of disease) have taken the prior to be discontinuous, putting a high probability on H₀ being exactly true. In my view it is much more sensible to believe that there is no discontinuity in the prior at the point represented by H₀, encapsulating prior knowledge instead by saying that values near H₀ are more likely if no relevant prior information is available.

Returning to the non-relevance of type I error as discussed above, and ignoring for the moment that long-run operating characteristics do not directly assist us in making judgments about the current experiment, there is a subtle problem that leads researchers to believe that by controlling type I "error" they think they have quantified the probability of misleading evidence. As discussed at length by my colleague Jeffrey Blume, once an experiment is done the probability that positive evidence is misleading is not type I error. And what exactly does "error" mean in "type I error?" It is the probability of rejecting H₀ when H₀ is exactly true, just as the p-value is the probability of obtaining data more impressive than that observed given H₀ is true. Are these really error probabilities? Perhaps ... if you have been misled earlier into believing that we should base conclusions on how unlikely the observed data would have been observed under H₀. Part of the problem is in the loaded word "reject." Rejecting H₀ by seeing data that are unlikely if H₀ is true is perhaps the real error.

The "error quantification" truly needed is the probability that a treatment doesn't work given all the current evidence, which as stated above is simply one minus the Bayesian posterior probability of positive efficacy.

Update 2017-01-20

Type I error control is an indirect way to being careful about claims of effects. It should never have been the preferred method for achieving that goal. Seen another way, we would choose type I error as the quantity to be controlled if we wanted to:

require the experimenter to visualize an infinite number of experiments that might have been run, and assume that the current experiment could be exactly replicated
be interested in long-run operating characteristics vs. judgments needing to be made for the one experiment at hand
be interested in the probability that other replications result in data more extreme than mine if there is no treatment effect
require early looks at the data to be discounted for future looks
require past looks at the data to be discounted for earlier inconsequential looks
create other multiplicity considerations, all of them arising from the chances you give data to be extreme as opposed to the chances that you give effects to be positive
data can be more extreme for a variety of reasons such as trying to learn faster by looking more often or trying to learn more by comparing more doses or more drugs

The Bayesian approach focuses on the chances you give effects to be positive and does not have multiplicity issues (potential issues such as examining treatment effects in multiple subgroups are handled by the shrinkage that automatically results when you use the 'right' Bayesian hierarchical model).

The p-value is the chance that someone else would observe data more extreme than mine if the effect is truly zero (if they could exactly replicate my experiment) and not the probability of no (or a negative) effect of treatment given my data.

Update 2017-05-10

As discussed in Gamalo-Siebers at al DOI: 10.1002/pst.1807 the type I error is the probability of making an assertion of an effect when no such effect exists. It is not the probability of regret for a decision maker, e.g., it is not the probability of a drug regulator's regret. The probability of regret is the probability that the drug doesn't work or is harmful when the decision maker had decided it was helpful. It is the probability of harm or no benefit when an assertion of benefit is made. This is best thought of as the probability of harm or no benefit given the data which is one minus the probability of efficacy. Prob(assertion|no benefit) is not equal to 1-Prob(benefit|data).

Update 2017-11-28

Type I ("false positive") error probability would be a useful concept while a study is being designed. Frequentists speak of type I error control, but after a study is completed, the only way to commit a type I error is to know with certainty that an effect is exactly zero. But then the study would not have been necessary. So type I error remains a long-run operating characteristic for a sequence of hypothetical studies.

Thinking of p-values that a sequence of hypothetical studies might provide, when the type I error is α this means P(p-value < α | zero effect) = α. Neither a single p-value nor α is the probability of a decision error. They are "what if" probabilities, if the effect is zero. The p-value for a single study is merely the probability that data more extreme than ours would have been observed had the effect been exactly zero and the experiment was capable of being re-run infinitely often. It is nothing more than this. It is not a false positive probability for the experiment at hand. To compute the false positive probability one would need a prior distribution for the effect (and for the p-value to be perfectly accurate which is rare), and one might as well be fully Bayesian and enjoy all the Bayesian benefits.

Null Hypothesis Significance Testing Never Worked

Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. Rejection of straw-man null hypotheses leads researchers to believe that their theories are supported, and the unquestioning use of a threshold such as p<0.05 has resulted in hypothesis substitution, search for subgroups, and other gaming that has badly damaged science. But we seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.

NHST is based on something akin to proof by contradiction. The best non-mathematical definition of the p-value I've ever seen is due to Nicholas Maxwell: "the degree to which the data are embarrassed by the null hypothesis." p-values provide evidence against something, never in favor of something, and are the basis for NHST. But proof by contradiction is only fully valid in the context of rules of logic where assertions are true or false without any uncertainty. The classic paper The Earth is Round (p<.05) by Jacob Cohen has a beautiful example pointing out the fallacy of combining probabilistic ideas with proof by contradiction in an attempt to make decisions about an effect.

The following is almost but not quite the reasononing of null hypothesis rejection:

If the null hypothesis is correct, then this datum (D) can not occur.
It has, however, occurred.
Therefore the null hypothesis is false.

If this were the reasoning of H₀ testing, then it would be formally correct. … But this is not the reasoning of NHST. Instead, it makes this reasoning probabilistic, as follows:

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

By making it probabilistic, it becomes invalid. … the syllogism becomes formally incorrect and leads to a conclusion that is not sensible:

If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)
This person is a member of Congress.
Therefore, he is probably not an American. (Pollard & Richardson, 1987)

… The illusion of attaining improbability or the Bayesian Id's wishful thinking error …

Induction has long been a problem in the philosophy of science. Meehl (1990) attributed to the distinguished philosopher Morris Raphael Cohen the saying "All logic texts are divided into two parts. In the first part, on deductive logic, the fallacies are explained; in the second part, on inductive logic, they are committed."

Sometimes when an approach leads to numerous problems, the approach itself is OK and the problems can be repaired. But besides all the other problems caused by NHST (including need for arbitrary multiplicity adjustments, need for consideration of investigator intentions and not just her actions, rejecting H₀ for trivial effects, incentivizing gaming, interpretation difficulties, etc.) it may be the case that the overall approach is defective and should not have been adopted.

With all of the amazing things Ronald Fisher gave us, and even though he recommended against the unthinking rejection of H₀, his frequentist approach and dislike of the Bayesian approach did us all a disservice. He called the Bayesian method invalid and was possibly intellectually dishonest when he labeled it as "inverse probability." In fact the p-value is an indirect inverse probability and Bayesian posterior probabilities are direct forwards probabilities that do not condition on a hypothesis, and the Bayesian approach has not only been shown to be valid, but it actually delivers on its promise.

Friday, January 13, 2017

Introduction

Statistics is a field that is a science unto itself and that benefits all other fields and everyday life. What is unique about statistics is its proven tools for decision making in the face of uncertainty, understanding sources of variation and bias, and most importantly, statistical thinking. Statistical thinking is a different way of thinking that is part detective, skeptical, and involves alternate takes on a problem. An excellent example of statistical thinking is statistician Abraham Wald's analysis of British bombers surviving to return to their base in World War II: his conclusion was to reinforce bombers in areas in which no damage was observed. For other great examples watch my colleague Chris Fonnesbeck's Statistical Thinking for Data Science.

Some of my personal philosophy of statistics can be summed up in the list below:

Statistics needs to be fully integrated into research; experimental design is all important
Don't be afraid of using modern methods
Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs
Don't assume that anything operates linearly
Account for model uncertainty and avoid it when possible by using subject matter knowledge
Use the bootstrap routinely
Make the sample size a random variable when possible
Use Bayesian methods whenever possible
Use excellent graphics, liberally
To be trustworthy research must be reproducible
All data manipulation and statistical analysis must be reproducible (one ramification being that I advise against the use of point and click software in most cases)

Statistics has multiple challenges today, which I break down into three major sources:

Statistics has been and continues to be taught in a traditional way, leading to statisticians believing that our historical approach to estimation, prediction, and inference was good enough.
Statisticians do not receive sufficient training in computer science and computational methods, too often leaving those areas to others who get so good at dealing with vast quantities of data that they assume they can be self-sufficient in statistical analysis and not seek involvement of statisticians. Many persons who analyze data do not have sufficient training in statistics.
Subject matter experts (e.g., clinical researchers and epidemiologists) try to avoid statistical complexity by "dumbing down" the problem using dichotomization, and statisticians, always trying to be helpful, fail to argue the case that dichotomization of continuous or ordinal variables is almost never an appropriate way to view or analyze data. Statisticians in general do not sufficiently involve themselves in measurement issues.

I will be discussing several of the issues in future blogs, especially item 1 above and items 2 and 4 below.

Complacency in the field of statistics and in statistical education has resulted in

reliance on large-sample theory so that inaccurate normal distribution-based tools can be used, as opposed to tailoring the analyses to data characteristics using the bootstrap and semiparametric models
belief that null hypothesis significance testing ever answered the scientific question and the p-values are useful
avoidance of the likelihood school of inference (relative likelihood, likelihood support intervals, likelihood ratios, etc.)
avoidance of Bayesian methods (posterior distributions, credible intervals, predictive distributions, etc.)

I was interviewed by Kevin Gray in July, 2017 where more of my opinions about statistics may be found.