Statistical Thinking

This blog is devoted to statistical thinking and its impact on science and everyday life. Emphasis is given to maximizing the use of information, avoiding statistical pitfalls, describing problems caused by the frequentist approach to statistical inference, describing advantages of Bayesian and likelihood methods, and discussing intended and unintended differences between statistics and data science. I'll also cover regression modeling strategies, clinical trials, and drug evaluation.

Tuesday, November 21, 2017

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction

This post will grow to cover questions about data reduction methods, also known as unsupervised learning methods. These are intended primarily for two purposes:

collapsing correlated variables into an overall score so that one does not have to disentangle correlated effects, which is a difficult statistical task
reducing the effective number of variables to use in a regression or other predictive model, so that fewer parameters need to be estimated

The latter example is the "too many variables too few subjects" problem. Data reduction methods are covered in Chapter 4 of my book Regression Modeling Strategies, and in some of the book's case studies.

Sacha Varin writes 2017-11-19:

I want to add/sum some variables having different units. I decide to standardize (Z-scores) the values and then, once transformed in Z-scores, I can sum them. The problem is that my variables distributions are non Gaussian (my distributions are not symmetrical (skewed), they are long-tailed, I have all types of weird distributions, I guess we can say the distributions are intractable. I know that my distributions don't need to be gaussian to calculate Z-scores, however, if the distributions are not close to gaussian or at least symmetrical enough, I guess the classical Z-score transformation: (Value - Mean)/SD, is not valid, that's why I decide, because my distributions are skewed and long-tailed to use the Gini's mean difference (robust and efficient estimator).

If the distributions are skewed and long-tailed, can I standardize the values using that formula :(Value - Mean)/GiniMd? Or the mean is not a good estimator in presence of skewed and long-tailed distributions? What about (Value - Median)/GiniMd? Or what else with GiniMd for a formula to standardize?
In presence of outliers, skewed and long-tailed distributions, for standardization, what formula is better to use between (Value - Median)/MAD (=median absolute deviation) or (Value - Mean)/GiniMd? And why?

My situation is not the predictive modeling case, but I want to sum the variables.

These are excellent questions and touch on an interesting side issue. My opinion is that standard deviations (SDs) are not very applicable to asymmetric (skewed) distributions, and that they are not very robust measures of dispersion. I'm glad you mentioned Gini's mean difference, which is the mean of all absolute differences of pairs of observations. It is highly robust and is surprisingly efficient as a measure of dispersion when compared to the SD, even when normality holds.

The questions also touch on the fact that when normalizing more than one variable so that the variables may be combined, there is no magic normalization method in statistics. I believe that Gini's mean difference is as good as any and better than the SD. It is also more precise than the mean absolute difference from the mean or median, and the mean may not be robust enough in some instances. But we have a rich history of methods, such as principal components (PCs), that use SDs.

What I'm about to suggest is a bit more applicable to the case where you ultimately want to form a predictive model, but it can also apply when the goal is to just combine several variables. When the variables are continuous and are on different scales, scaling them by SD or Gini's mean difference will allow one to create unitless quantities that may possibly be added. But the fact that they are on different scales begs the question of whether they are already "linear" or do they need separate nonlinear transformations to be "combinable".

I think that nonlinear PCs may be a better choice than just adding scaled variables. When the predictor variables are correlated, nonlinear PCs learn from the interrelationships, even occasionally learning how to optimally transform each predictor to ultimately better predict Y. The transformations (e.g., fitted spline functions) are solved for to maximize predictability of a predictor, from the other predictors or PCs of them. Sometimes the way the predictors move together is the same way they relate to some ultimate outcome variable that this undersupervised learning method does not have access to. An example of this is in Section 4.7.3 of my book.

With a little bit of luck, the transformed predictors have more symmetric distributions, so ordinary PCs computed on these transformed variables, with their implied SD normalization, work pretty well. PCs take into account that some of the component variables are highly correlated with each other, and so are partially redundant and should not receive the same weights ("loadings") as other variables.

The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R homals package.

How do we handle the case where the number of candidate predictors p is large in comparison to the effective sample size n? Penalized maximum likelihood estimation (e.g., ridge regression) and Bayesian regression typically have the best performance, but data reduction methods are competitive and sometimes more interpretable. For example, one can use variable clustering and redundancy analysis as detailed in the RMS book and course notes. Principal components (linear or nonlinear) can also be an excellent approach to lowering the number of variables than need to be related to the outcome variable Y. Two example approaches are:

Use the 15:1 rule of thumb to estimate how many predictors can reliably be related to Y. Suppose that number is k. Use the first k principal components to predict Y.
Enter PCs in decreasing order of variation (of the system of Xs) explained and chose the number of PCs to retain using AIC. This is far from stepwise regression which enters variables according to their p-values with Y. We are effectively entering variables in a pre-specified order with incomplete principal component regression.

Once the PC model is formed, one may attempt to interpret the model by studying how raw predictors relate to the principal components or to the overall predicted values.

Returning to Sacha's original setting, if linearity is assumed for all variables, then scaling by Gini's mean difference is reasonable. But psychometric properties should be considered, and often the scale factors need to be derived from subject matter rather than statistical considerations.

23 comments:

Thomas BNovember 22, 2017 at 8:05 AM
This discussion on CrossValidated mentions GiniMD as being limited by having a '0 breakdown point' (see the first answer to the query... https://stats.stackexchange.com/questions/200595/comparing-spread-dispersion-between-samples). The concept of 'breakdown points' are discussed in this paper by Davies and Gather, The Breakdown Point-examples and counterexamples, (here ... https://www.ine.pt/revstat/pdf/rs070101.pdf).
ReplyDelete
Replies
Frank HarrellNovember 22, 2017 at 10:22 AM
I like that paper. Can be a powerful approach though hard to interpret the results.
ReplyDelete
Replies
Frank HarrellNovember 22, 2017 at 3:54 PM
I'm not convinced that Gini's mean difference has a zero breakdown point. It's not like a quantile. But it has so many other good properties I might not be concerned anyway. Can you describe what QnQn and SnSn are?
ReplyDelete
Replies
Frank HarrellNovember 23, 2017 at 6:16 AM
I read the entire CrossValidated page for a second time and still do not see any useful information about Gini's mean difference there.
ReplyDelete
Replies
Frank HarrellNovember 23, 2017 at 6:40 AM
I just read http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf which was indirectly referenced in stackexchange.stats.com (CrossValidated) but oddly this paper did not address Gini's mean difference at all other than acknowledging its existence.
ReplyDelete
Replies
Frank HarrellNovember 23, 2017 at 7:51 AM
Thanks Thomas. An expert, HA David, has written much about Gini's mean difference and its high efficiency with regard to the ordinary SD even if normality holds. So I'm doubtful about the 0 breakdown comment. Gini's index is completely continuous. It does not involve any sample quantiles, just taking the mean of all absolute differences. It may not be as robust as the median of all absolute differences from the median, but the efficiency gain it has over that probably offsets the little bit of non-robustness. I would rely on Gini's mean difference until some reference shows it is inefficient or meaningless in a situation that occurs in practice.
ReplyDelete
Replies
Thomas BDecember 1, 2017 at 9:40 PM
One thing that I don't see discussed in the GMD literature is it's utility in hypothesis testing. This is very different from the SD (or SE) and suggests that, as a measure of dispersion for non-normally distributed information, it's closer to the coefficient of variation, a scale invariant measure of dispersion for more normally distributed information.
ReplyDelete
Replies
Thomas BDecember 2, 2017 at 7:26 AM
Good points and agreement on Neyman-Pearson approaches to hypothesis testing. I was using that as a 'straw man' example in an effort to clarify my ignorance and distinguish GMDs from SDs. Anther example could be ANOVA-type contrasts, and so on. But your point about the GMD being 'in data units' was both helpful and interesting. The CV does have value in a comparison, e.g., of systolic and diastolic blood pressure. Since these two metrics are in differing units, direct comparison of their SDs is not meaningful, whereas comparing their CVs would tell you which metric has more variability. To me, your comment suggests that the GMD would not be useful in determining which blood pressure metric has greater dispersion.
ReplyDelete
Replies
Thomas BDecember 2, 2017 at 11:52 AM
So, just to be clear, the GMDs for DBP and SBP are directly comparable?
ReplyDelete
Replies
Thomas BDecember 3, 2017 at 2:19 PM
I don't mean to be pedantic but there is literature that, in this specific instance wrt DBP and SBP, supports the CV as being 'fully comparable' and, therefore, enabling a direct comparison of variability. E.g., p. 14 of Levy and Lemeshow's book Sampling for Health Professionals (1980 edition). It doesn't sound like the GMD is 'fully comparable' in the same sense.
ReplyDelete
Replies
Thomas BDecember 5, 2017 at 4:59 AM
This has been an interesting, useful exchange, at least for me. Thank you.
ReplyDelete
Replies

Add comment