Tuesday, November 21, 2017

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction

This post will grow to cover questions about data reduction methods, also known as unsupervised learning methods. These are intended primarily for two purposes:

  • collapsing correlated variables into an overall score so that one does not have to disentangle correlated effects, which is a difficult statistical task
  • reducing the effective number of variables to use in a regression or other predictive model, so that fewer parameters need to be estimated
The latter is the "too many variables, too few subjects" problem.  Data reduction methods are covered in Chapter 4 of my book Regression Modeling Strategies, and in some of the book's case studies.

Sacha Varin writes 2017-11-19:

I want to add/sum some variables having different units. I decided to standardize the values (Z-scores) and then, once transformed to Z-scores, sum them.  The problem is that my variables' distributions are non-Gaussian: they are not symmetrical (they are skewed), they are long-tailed, and I have all types of weird distributions; I guess we can say the distributions are intractable.  I know that the distributions don't need to be Gaussian to calculate Z-scores.  However, if the distributions are not close to Gaussian, or at least symmetrical enough, I guess the classical Z-score transformation (Value - Mean)/SD is not valid.  That's why, because my distributions are skewed and long-tailed, I decided to use Gini's mean difference (a robust and efficient estimator).
  1. If the distributions are skewed and long-tailed, can I standardize the values using the formula (Value - Mean)/GiniMd?  Or is the mean not a good estimator in the presence of skewed and long-tailed distributions?  What about (Value - Median)/GiniMd, or some other standardization formula based on GiniMd?
  2. In the presence of outliers and skewed, long-tailed distributions, which standardization formula is better: (Value - Median)/MAD (MAD = median absolute deviation) or (Value - Mean)/GiniMd?  And why?
My situation is not the predictive modeling case, but I want to sum the variables.

These are excellent questions and touch on an interesting side issue.  My opinion is that standard deviations (SDs) are not very applicable to asymmetric (skewed) distributions, and that they are not very robust measures of dispersion.  I'm glad you mentioned Gini's mean difference, which is the mean of all absolute differences of pairs of observations.  It is highly robust and is surprisingly efficient as a measure of dispersion when compared to the SD, even when normality holds. 
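
To make this concrete, here is a minimal R sketch, using GiniMd from the Hmisc package, of the kind of robust standardization Sacha describes.  The data frame and variable names are hypothetical and the data are simulated purely for illustration.

  library(Hmisc)   # provides GiniMd()

  # Robust alternatives to the classical z-score (Value - Mean)/SD
  scaleGini <- function(x) (x - median(x, na.rm=TRUE)) / GiniMd(x, na.rm=TRUE)
  scaleMAD  <- function(x) (x - median(x, na.rm=TRUE)) / mad(x,    na.rm=TRUE)

  # Hypothetical skewed, long-tailed variables on different scales
  set.seed(1)
  d <- data.frame(x1 = rexp(200),              # long right tail
                  x2 = 100 * rlnorm(200),      # log-normal, different units
                  x3 = rgamma(200, shape=0.5)) # highly skewed

  # Standardize each variable, then sum the unitless values into one score
  score <- rowSums(sapply(d, scaleGini))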

The questions also touch on the fact that when normalizing more than one variable so that the variables may be combined, there is no magic normalization method in statistics.  I believe that Gini's mean difference is as good as any and better than the SD.  It is also more precise than the mean absolute difference from the mean or median, and the mean may not be robust enough in some instances.  But we have a rich history of methods, such as principal components (PCs), that use SDs.

What I'm about to suggest is a bit more applicable to the case where you ultimately want to form a predictive model, but it can also apply when the goal is just to combine several variables.  When the variables are continuous and are on different scales, scaling them by SD or Gini's mean difference will allow one to create unitless quantities that may possibly be added.  But the fact that they are on different scales raises the question of whether they are already "linear" or whether they need separate nonlinear transformations to be "combinable".

I think that nonlinear PCs may be a better choice than just adding scaled variables.  When the predictor variables are correlated, nonlinear PCs learn from the interrelationships, even occasionally learning how to optimally transform each predictor to ultimately better predict Y.  The transformations (e.g., fitted spline functions) are solved for so as to maximize the predictability of each predictor from the other predictors, or from PCs of them.  Sometimes the way the predictors move together is the same way they relate to some ultimate outcome variable that this unsupervised learning method does not have access to.  An example of this is in Section 4.7.3 of my book.

With a little bit of luck, the transformed predictors have more symmetric distributions, so ordinary PCs computed on these transformed variables, with their implied SD normalization, work pretty well.  PCs take into account that some of the component variables are highly correlated with each other, and so are partially redundant and should not receive the same weights ("loadings") as other variables.

The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R homals package.
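
As a hedged sketch of how this might look in R (the data frame d and the variable names are again hypothetical, and transcan is used with its default spline settings):

  library(Hmisc)

  set.seed(2)
  d <- data.frame(x1 = rexp(200),
                  x2 = 100 * rlnorm(200),
                  x3 = rgamma(200, shape=0.5))

  # Estimate a nonlinear transformation of each variable so that it is
  # maximally predictable from the other variables
  trans <- transcan(~ x1 + x2 + x3, data=d, transformed=TRUE, pl=FALSE, pr=FALSE)

  Z  <- trans$transformed        # transformed, hopefully more symmetric, variables
  pc <- prcomp(Z, scale.=TRUE)   # ordinary PCs on the transformed variables
  summary(pc)                    # proportion of variance explained per component
  PC1 <- pc$x[, 1]               # first nonlinear principal component (summary score)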

How do we handle the case where the number of candidate predictors p is large in comparison to the effective sample size n?  Penalized maximum likelihood estimation (e.g., ridge regression) and Bayesian regression typically have the best performance, but data reduction methods are competitive and sometimes more interpretable.  For example, one can use variable clustering and redundancy analysis as detailed in the RMS book and course notes.  Principal components (linear or nonlinear) can also be an excellent approach to lowering the number of variables that need to be related to the outcome variable Y.  Two example approaches, sketched in R code after the list, are:

  1. Use the 15:1 rule of thumb to estimate how many predictors can reliably be related to Y.  Suppose that number is k.  Use the first k principal components to predict Y.
  2. Enter PCs in decreasing order of variation (of the system of Xs) explained, and choose the number of PCs to retain using AIC.  This is far from stepwise regression, which enters variables according to their p-values with Y.  We are effectively entering variables in a pre-specified order; this is incomplete principal component regression.
Once the PC model is formed, one may attempt to interpret the model by studying how raw predictors relate to the principal components or to the overall predicted values.
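
Here is a minimal R sketch of both approaches, together with the variable clustering and redundancy analysis mentioned above.  The data are simulated and the variable names are hypothetical; a real analysis would start from the actual predictors.

  library(Hmisc)

  set.seed(3)
  n <- 150; p <- 30
  X <- matrix(rnorm(n * p), n, p) + rnorm(n)   # shared noise makes the Xs correlated
  colnames(X) <- paste0("x", 1:p)
  y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)        # hypothetical outcome
  dat <- as.data.frame(X)

  # Variable clustering and redundancy analysis (descriptive data reduction)
  fm <- as.formula(paste("~", paste(colnames(X), collapse=" + ")))
  plot(varclus(fm, data=dat))          # cluster predictors by their correlations
  print(redun(fm, data=dat, r2=0.9))   # which predictors are predictable from the rest

  # Principal component scores of the predictors
  pcs <- prcomp(X, scale.=TRUE)$x

  # Approach 1: 15:1 rule of thumb -> use the first k PCs
  k <- floor(n / 15)
  fit15 <- lm(y ~ pcs[, 1:k])

  # Approach 2: enter PCs in order of variance explained; choose how many by AIC
  aics <- sapply(1:p, function(m) AIC(lm(y ~ pcs[, 1:m, drop=FALSE])))
  best <- which.min(aics)
  fitAIC <- lm(y ~ pcs[, 1:best, drop=FALSE])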

Returning to Sacha's original setting, if linearity is assumed for all variables, then scaling by Gini's mean difference is reasonable.  But psychometric properties should be considered, and often the scale factors need to be derived from subject matter rather than statistical considerations.






Sunday, November 5, 2017

Statistical Criticism is Easy; I Need to Remember That Real People are Involved

I have been critical of a number of articles, authors, and journals in this growing blog article.  Linking the blog with Twitter is a way to expose the blog to more readers, but it is far too easy to slip into hyperbole on the blog, and even easier on Twitter with its space limitations.  Importantly, many of the statistical problems pointed out in my article are very, very common, and I dwell on recent publications to get the point across that inadequate statistical review at medical journals remains a serious problem.  Equally important, many of the issues I discuss, from p-values and null hypothesis testing to problems with change scores, are not well covered in medical education (of authors and referees), and p-values have caused a phenomenal amount of damage to the research enterprise.  Still, journals insist on emphasizing p-values.  I spend a lot of time educating biomedical researchers about statistical issues and serving as a reviewer for many medical journals, but I am still on a quest to reach journal editors.

Besides statistical issues, there are very real human issues, and challenges in keeping clinicians interested in academic clinical research when there are so many pitfalls, complexities, and compliance issues. In the many clinical trials with which I have been involved, I've always been glad to be the statistician and not the clinician responsible for protocol logistics, informed consent, recruiting, compliance, etc.

A recent case discussed here has brought the human issues home, after I came to know of the extraordinary efforts made by the ORBITA study's first author, Rasha Al-Lamee, to make this study a reality.  Placebo-controlled device trials are very difficult to conduct and to recruit patients into, and this was Rasha's first effort to launch and conduct a randomized clinical trial.  I very much admire Rasha's bravery and perseverance in conducting this trial of PCI, when it is possible that many past trials of PCI vs. medical therapy were affected by placebo effects.

Darrel Francis, Professor of Cardiology at Imperial College London, a co-author on the ORBITA paper, and Rasha's mentor, elegantly pointed out to me that there is a real person on the receiving end of my criticism, and I heartily agree with him that none of us would ever want to discourage a clinical researcher from ever conducting her second randomized trial.  This is especially true when the investigator has a burning interest to tackle difficult unanswered clinical questions.  I don't mind criticizing statistical designs and analyses, but I can do a better job of respecting the sincere efforts and hard work of biomedical researchers.

I note in passing that I had the honor of being a co-author with Darrel on this paper of which I am extremely proud.

Dr Francis gave me permission to include his thoughts, which are below. After that I list some ideas for making the path to presenting clinical research findings a more pleasant journey.


As the PI for ORBITA, I apologise for this trial being 40 years late, due to a staffing issue. I had to wait for the lead investigator, Rasha Al-Lamee, to be born, go to school, study Medicine at Oxford University, train in interventional cardiology, and start as a consultant in my hospital, before she could begin the trial.

Rasha had just finished her fellowship. She had experience in clinical research, but this was her first leadership role in a trial. She was brave to choose for her PhD a rigorous placebo-controlled trial in this controversial but important area.

Funding was difficult: grant reviewers, presumably interventional cardiologists, said the trial was (a) unethical and (b) unnecessary. This trial only happened because Rasha was able to convince our colleagues that the question was important and the patients would not be without stenting for long. Recruitment was challenging because it required interventionists to not apply the oculostenotic reflex. In the end the key was Rasha keeping the message at the front of all our colleagues' minds with her boundless energy and enthusiasm. Interestingly, when the concept was explained to patients, they agreed to participate more easily than we thought, and dropped out less frequently than we feared. This means we should indeed acquire placebo-controlled data on interventional procedures.

Incidentally, I advocate the term "placebo" over "sham" for these trials, for two reasons. First, placebo control is well recognised as essential for assessing drug efficacy, and this helps people understand the need for it with devices. Second, "sham" is a pejorative word, implying deception. There is no deception in a placebo controlled trial, only pre-agreed withholding of information.


There are several ways to improve the system that I believe would foster clinical research and make peer review more objective and productive.

  • Have journals conduct reviews of background and methods without knowledge of results.
  • Abandon journals and use researcher-led online systems that invite open post-"publication" peer review and give researchers the opportunities to improve their "paper" in an ongoing fashion.
  • If not publishing the entire paper online, deposit the background and methods sections for open pre-journal submission review.
  • Abandon null hypothesis testing and p-values. Before that, always keep in mind that a large p-value means nothing more than "we don't yet have evidence against the null hypothesis", and emphasize confidence limits.
  • Embrace Bayesian methods that provide safer and more actionable evidence, including measures that quantify clinical significance.  And if one is trying to amass evidence that the effects of two treatments are similar, compute the direct probability of similarity using a Bayesian model (see the short sketch after this list).
  • Improve statistical education of researchers, referees, and journal editors, and strengthen statistical review for journals.
  • Until everyone understands the most important statistical concepts, better educate researchers and peer reviewers on statistical problems to avoid.
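
To illustrate the point above about similarity, here is a small hedged R sketch.  Given posterior draws of the difference between two treatments (simulated here from an assumed normal posterior rather than taken from a real model fit), the probability of similarity is simply the fraction of draws falling inside a prespecified similarity margin.

  # In practice the draws would come from an actual Bayesian model fit (MCMC);
  # here they are simulated from an assumed normal posterior for illustration.
  set.seed(4)
  post.diff <- rnorm(10000, mean = 0.2, sd = 0.5)  # posterior draws of the treatment difference
  margin    <- 1                                    # prespecified similarity margin
  mean(abs(post.diff) < margin)                     # Pr(|difference| < margin | data)
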
On a final note, I regularly review clinical trial design papers for medical journals. I am often shocked at design flaws that authors state are "too late to fix" in their response to the reviews. This includes problems caused by improper endpoint variables that necessitated the randomization of triple the number of patients actually needed to establish efficacy. Such papers have often been through statistical review before the journal submission. This points out two challenges: (1) there is a lot of between-statistician variation that statisticians need to address, and (2) there are many fundamental statistical concepts that are not known to many statisticians (witness the widespread use of change scores and dichotomization of variables even when senior statisticians are among a paper's authors).