Friday, January 27, 2017

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

Randomized clinical trials (RCTs) have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason.  But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:
  1. Patients in clinical practice are different from those enrolled in RCTs
  2. Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.
Point 2 is hard to debate, because RCTs are run under protocol and research personnel monitor and ask about patients' adherence (more about this below).  But point 1 is a misplaced worry in the majority of trials.  The explanation requires getting to the heart of what RCTs are really intended to do: provide evidence for relative treatment effectiveness.  Some trials provide evidence for both relative and absolute effectiveness.  This is especially true when the efficacy measure is on an absolute scale, as when measuring the reduction in blood pressure due to a new treatment.  But many trials use binary or time-to-event endpoints, and the resulting efficacy measure is on a relative scale such as an odds ratio or hazard ratio.

RCTs enrolling even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable.  This is most readily seen in subgroup analyses provided by the trials themselves - so-called forest plots that demonstrate remarkable constancy of the relative treatment benefit.  When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply.  It is only the absolute treatment benefit that is likely to change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and that patient's absolute baseline risk.  This is covered in detail in Biostatistics for Biomedical Research, Section 13.6.  See also Stephen Senn's excellent presentation.
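To make the last point concrete, here is a minimal sketch (not from the original text; the function names and parameter values are purely illustrative) of how a fixed relative effect translates into quite different absolute benefits for patients with different baseline risks:

```python
def arr_from_odds_ratio(baseline_risk, odds_ratio):
    """Absolute risk reduction implied by a fixed odds ratio."""
    odds0 = baseline_risk / (1 - baseline_risk)
    risk_treated = (odds_ratio * odds0) / (1 + odds_ratio * odds0)
    return baseline_risk - risk_treated

def arr_from_hazard_ratio(baseline_risk, hazard_ratio):
    """Absolute risk reduction at a fixed time point, assuming proportional
    hazards so that S_treated(t) = S_control(t) ** HR."""
    surv0 = 1 - baseline_risk
    risk_treated = 1 - surv0 ** hazard_ratio
    return baseline_risk - risk_treated

# A constant relative benefit implies very different absolute benefits
# for low-risk and high-risk patients:
for p0 in (0.02, 0.10, 0.30):
    print(p0,
          round(arr_from_odds_ratio(p0, 0.75), 4),
          round(arr_from_hazard_ratio(p0, 0.75), 4))
```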

Clinical practice provides anecdotal evidence that biases clinicians.  What a clinician sees in her practice is patient i on treatment A and patient j on treatment B.  She may remember how patient i fared in comparison to patient j, fail to appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness of treatment A vs. B.  But the real therapeutic question is how the outcome of a patient, were she given treatment A, would compare to her outcome were she given treatment B.  The gold standard design is thus the randomized crossover design, when the treatment is short acting.  Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients.
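As a toy illustration of that last claim (my own sketch, not from Senn's presentation; all parameter values are made up), a repeated crossover lets each patient's own A-vs-B difference be estimated, because she receives both treatments several times:

```python
# Toy simulation: in a 6-period (3 A/B pairs) crossover, each patient's own
# treatment effect is observable above within-patient noise; a parallel-group
# trial could only estimate the average of these effects.
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_pairs = 200, 3
true_effect = rng.normal(5.0, 4.0, n_patients)   # patient-specific A-vs-B effect (HTE)
noise_sd = 3.0                                   # within-patient measurement noise

# Observed A-minus-B difference for each patient in each of the 3 pairs
obs_diff = true_effect[:, None] + rng.normal(0, noise_sd * np.sqrt(2),
                                             (n_patients, n_pairs))
per_patient_estimate = obs_diff.mean(axis=1)

print("corr(true effect, per-patient estimate):",
      round(float(np.corrcoef(true_effect, per_patient_estimate)[0, 1]), 2))
```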

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below.  Entries are ordered from the strongest evidence, requiring the fewest assumptions, to the weakest.  Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of the information they provide.

Let Pi denote patient i and let the treatments be denoted by A and B, so that P2B represents patient 2 on treatment B.  Let P̄1 represent the average outcome over a sample of patients from which patient 1 was selected.  HTE is heterogeneity of treatment effect.


Design                                    Patients Compared
6-period crossover                        P1A vs P1B (directly measures HTE)
2-period crossover                        P1A vs P1B
RCT in identical twins                    P1A vs P1B
Parallel-group RCT                        P̄1A vs P̄2B, P1 = P2 on average
Observational, good artificial control    P̄1A vs P̄2B, P1 = P2 hopefully on average
Observational, poor artificial control    P̄1A vs P̄2B, P1 ≠ P2 on average
Real-world physician practice             P1A vs P2B

The best experimental designs yield the best evidence a clinician needs to answer the "what if" therapeutic question for the one patient in front of her.

Regarding adherence, proponents of "real world" evidence advocate for estimating treatment effects under adherence as low as that seen in clinical practice. This would result in lower estimated efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available to a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is the best hope for estimating efficacy as a function of adherence, through, for example, an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what the target adherence in an RCT should be, but overall it is a good thing that RCTs do not mimic clinical practice.  We are entering a new era of pragmatic clinical trials.  Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that their chief advantage is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.
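As a rough sketch of what such an instrumental variable analysis can look like (my own illustration, not from the post; the simulated data, column names, and parameter values are all hypothetical), the randomization assignment serves as the instrument for treatment actually received, and a simple Wald estimator recovers the efficacy of the treatment even when adherers differ prognostically from non-adherers:

```python
# Minimal sketch of an instrumental-variable (Wald) estimate using the
# randomization assignment as the instrument for adherence.  Everything
# here is simulated for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
z = rng.integers(0, 2, n)                        # randomized assignment (instrument)
health = rng.normal(0, 1, n)                     # unmeasured prognostic factor
# Healthier patients are more likely to adhere (the "healthy adherer" effect)
adhere_prob = np.where(z == 1, 0.5 + 0.3 * (health > 0), 0.05)
adhered = (rng.random(n) < adhere_prob).astype(int)
outcome = 1.0 * adhered + 2.0 * health + rng.normal(0, 1, n)   # true effect = 1.0

df = pd.DataFrame({"assigned": z, "adhered": adhered, "outcome": outcome})

# Naive per-protocol contrast within the active arm is confounded by 'health'
active = df[df.assigned == 1]
naive = (active.loc[active.adhered == 1, "outcome"].mean()
         - active.loc[active.adhered == 0, "outcome"].mean())

# Wald/IV estimate: intent-to-treat effect divided by the effect of
# assignment on adherence
itt = df.loc[df.assigned == 1, "outcome"].mean() - df.loc[df.assigned == 0, "outcome"].mean()
uptake = df.loc[df.assigned == 1, "adhered"].mean() - df.loc[df.assigned == 0, "adhered"].mean()

print("naive per-protocol estimate:", round(naive, 2))   # biased away from 1.0
print("IV (Wald) estimate:", round(itt / uptake, 2))     # close to 1.0
```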


Updated 2017-06-25 (last paragraph regarding adherence)

3 comments:

  1. Thanks for this thought-provoking material. As an early-career statistician, this is a great format for learning. In relation to the last paragraph, is it your view that intention-to-treat effects are not important? While I do think the question of efficacy as a function of adherence is important, I'm not sure that we should dismiss the question of whether or not a treatment works in practice, as it is currently quite fashionable to do. Non-adherence could be due to adverse side effects of the treatment, for example. The analogy I think of is a world-class football player who is great on the pitch but is injured most of the time. Her efficacy is great but her practical effectiveness is low.
    Moreover, if we recommend treatments on the basis of average treatment effects, I wonder if it is really less reasonable to incorporate typical patterns of adherence into our decision making. Would be great to hear your thoughts.
    Thanks again.

    1. What great questions Jack! I'm not sure I can do them justice but here is a start. Intent-to-treat is extremely important and should not be de-emphasized. Randomized trials are probably the only reliable platform for estimation of efficacy as a function of adherence, using for example the only perfect instrument we have (randomization assignment) in an instrumental variable analysis. If one can estimate this relationship reliably, then a great way to use an RCT is to show the average treatment effect as a function of adherence, before getting into interactions with treatment (differential treatment effect; heterogeneity of treatment effect). If you bypass an RCT and try to see if a drug "works in practice" the result will generally be uninterpretable, and the adherence achieved in practice may be below what can be achieved once word gets out that the treatment has been objectively demonstrated in an RCT to benefit patients.

      Take a look at the studies where the investigators showed that the patients who adhered to placebo had low cardiovascular mortality. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2988138 .

      I look forward to continuing this discussion.

    2. Thanks for the reply Frank. I'm sure a lot of people are reading these discussions and benefitting a great deal from them (I certainly am). Thanks for clarifying re: ITT - I hear a lot of (quite senior) folk saying that it is irrelevant. I thought you were saying the same thing with your post - but see now that this was a misreading.

      Thanks again!
