On the transformation of a response in regression modelling
and hypothesis testing
Adrian Olszewski
Originally posted at: ResearchGate, March 21st, 2020, URL: https://tinyurl.com/yyv2ryus
Updated and enhanced: November 13th, 2020
Find the most updated version here:
https://www.dropbox.com/s/62bh8cvbkjuu21n/data%20transformations.pdf?dl=0
I suggest avoiding variable transformations as much as possible, except in cases where you
can thoroughly and convincingly justify the reason and explain the outcome. This applies
especially to Box-Cox.
1. It completely changes the formulation and affects the interpretation. Only in "clean"
cases will you get an interpretable outcome, such as log-transformed data generated by a
multiplicative process (not *any right-skewed data*!). The log, exp, reciprocal, square/cube
root, and power of 2 or 3 transformations may be meaningful in *special scenarios*, e.g.
velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't
mean anything. Most of your audience will have no idea how your response
changes with the predictor unless you draw the curve. That is easy for a single response, but
what if you have more? You will need marginal effects to give some idea.
Sometimes you can decide to approximate the estimated Box-Cox exponent with a well-known
one, e.g. 0.45 is close to 0.5 (square root), but it's not easy in general. So why do
that?
By transforming, you *force your variables to follow a certain distribution* and to *tell
your story*. For example, fitting a normal-theory model to log(y) assumes your data come
from a log-normal distribution. Just look at what it does to the equation. As a consequence...
2. … It changes the model along with the errors! In our case, from additive to multiplicative,
including the errors. Maybe that's good, maybe not; it depends on your case.
3. It also affects the variance along with the means. Many people blindly use
transformations, completely forgetting about that! It can be useful if we want to
stabilize the variance, BUT it changes more than that! For example, in the normal distribution
the mean and variance are independent parameters; in the log-normal they are not! Of course
this is an idealized case. Box-Cox may return any odd exponent; guess how the model and the
mean-variance relationship will change then?
4. Jensen's inequality says it clearly: in general log(E(y)) ≠ E(log(y)) (the two sides agree
only when no non-linear transformation is involved, as with the identity link). When running a
regression you are interested in modelling the conditional expectation of the
response, rather than the expectation of the transformed response. And remember that no
transformation can properly handle certain response distributions, such as counts (transforming
them makes no sense, by the way).
5. In testing, it changes the null hypothesis, which is then likely not the one you wanted
to assess! In our case it switches from testing a shift in arithmetic means to testing
the ratio of geometric means.
I can hear you: “I was told it leads to valid inference!”
Yes, it leads to valid inference... about a hypothesis you did not start with (unless
you can justify that transition).
You obtain a valid answer to an *unasked question*. And yes, results from a model of log(y) may
differ from those returned by a model with a log link (e.g. gamma regression). You will
have to decide which one to choose.
Sometimes there are industry guidelines, like those given by the FDA for clinical
biostatistics, which advise using the log on PK data (for a good reason), but *even those
guidelines* warn you against unconditional and *unjustified* transformations!
6. Your back-transformed confidence intervals will be biased when read as intervals for the
mean on the original scale. Another disease for the collection.
7. Box-Cox, like any other transformation, does NOT guarantee the properties you need.
And then what are you going to do? Transform the data again, until satisfied? You know
this will only complicate an already complicated situation? Not to mention that you may
turn your right-skewed data into... left-skewed data and run into more trouble.
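Points 1 to 5 can be written out for the log transformation. This is only a sketch under the usual textbook assumptions of a single predictor x and normal errors with variance σ² on the log scale; the symbols β0, β1, ε, σ², the group labels A and B, and GM (geometric mean) are introduced here for illustration and do not come from the original text.

```latex
\begin{gather*}
% Additive model on the original scale
y = \beta_0 + \beta_1 x + \varepsilon, \qquad
\mathrm{E}(y \mid x) = \beta_0 + \beta_1 x \\
% Model fitted to log(y): multiplicative on the original scale, errors included
% (the coefficients here are on the log scale and differ from those above)
\log y = \beta_0 + \beta_1 x + \varepsilon
\;\Longrightarrow\;
y = e^{\beta_0} \cdot e^{\beta_1 x} \cdot e^{\varepsilon} \\
% With \varepsilon \sim N(0, \sigma^2) the response is log-normal
\mathrm{E}(y \mid x) = e^{\beta_0 + \beta_1 x + \sigma^2 / 2}, \qquad
\mathrm{Var}(y \mid x) = \left( e^{\sigma^2} - 1 \right) \mathrm{E}(y \mid x)^2 \\
% Two-group test: the null hypothesis moves from arithmetic to geometric means
H_0 : \mu_A - \mu_B = 0
\quad \text{becomes} \quad
H_0 : \mathrm{E}(\log y_A) - \mathrm{E}(\log y_B) = 0
\iff \frac{\mathrm{GM}_A}{\mathrm{GM}_B} = 1
\end{gather*}
```

Under these assumptions the variance grows with the square of the mean (point 3), exp(β0 + β1 x) alone is the conditional median rather than the conditional mean (points 4 and 6), and the tested quantity is a ratio of geometric means (point 5).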
I know there are many proponents of unconditional data transformation ("skewed data? go
transform it!") on ResearchGate. They were taught this for decades. Moreover, some of
them were told by authorities to keep using it.
But in light of the arguments collected above, I strongly suggest considering the (practically
always better) alternatives.
Apart from the few scenarios mentioned, transformations can cause more harm than good in
confirmatory and exploratory modelling.
Can they be useful? Yes, they may be OK in predictive modelling, especially if you agree to use a
“black box” approach, where you care mostly about the predicted outcomes and not the rest of the
story. If the predicted outcome agrees with the expectations, you are fine with that, and then
it's OK.
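Even in the predictive setting it helps to know what a back-transformed value actually estimates. Below is a minimal sketch, assuming a simulated log-normal response; the distribution, its parameters, the seed and all variable names are assumptions made for this illustration only.

```python
import math
import random
import statistics

random.seed(1)

# Simulated right-skewed response: log-normal with log-scale mean 0 and SD 1.
# (An assumption for illustration; real data rarely follow it exactly.)
y = [random.lognormvariate(0.0, 1.0) for _ in range(20000)]

arithmetic_mean = statistics.fmean(y)                      # estimates E(y)
mean_of_logs = statistics.fmean(math.log(v) for v in y)    # estimates E(log(y))
back_transformed = math.exp(mean_of_logs)                  # geometric mean, not E(y)

print(f"arithmetic mean      : {arithmetic_mean:.3f}")
print(f"log(arithmetic mean) : {math.log(arithmetic_mean):.3f}")
print(f"mean of log(y)       : {mean_of_logs:.3f}")
print(f"exp(mean of log(y))  : {back_transformed:.3f}")
```

For this distribution the theoretical mean is exp(0.5), about 1.65, while the geometric mean is exp(0) = 1, so a naively back-transformed value sits well below the arithmetic mean, as Jensen's inequality predicts.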
“OK, so what techniques and methods do you recommend instead?”
In the 21st century we have plenty of models, estimation methods and other techniques (around
for ca. 50 years) that allow us to deal with certain violations of the assumptions (normality,
homoscedasticity, independence of observations and so on), including:
1) generalized models (GLM and GAM), such as gamma, beta, logistic (and probit),
fractional logit, Poisson, negative binomial, etc. regressions. Truncated (most real
variables have a truncated domain, keep that in mind!) and censored regression (e.g. the tobit
model). I'm sure you will be able to find a suitable tool. Remember that this
generalizes nicely to mixed-effect models.
2) non-linear models
3) robust and non-parametric methods and tests (there are over 280 statistical tests! Lots
of them do not require, or relax, certain parametric assumptions, like Yuen, Brunner-Munzel,
ATS, WTS, ART ANOVA, Welch, Mann-Whitney/Wilcoxon, and many,
many more). At the end of this document I added a longer list of the literature that I
have read and can wholeheartedly recommend.
a. If you need ANOVA on non-normal or heterogeneous data, remember that you
can run a more advanced model (e.g. robust regression, quantile regression,
mixed models) or use GLS or GEE estimation, and follow the modelling with a
set of LR (likelihood ratio) tests (or Wald tests where no full likelihood is
available, as with GEE) to mimic the type-3 ANOVA and get the main
and interaction effects!
Yes, you read that right: that's what the anova() (or car::Anova(…, type=2/3))
function does in R for many kinds of models, assessing
the reduction in the residual deviance. Which, in the case of the simple
linear model, reduces to the analysis of certain contrasts, which is nothing but
comparing group means. See? The dots connect!
Yes, the outcome will be approximate (Chi2 rather than F), but hey, it's still
a worthwhile and flexible solution!
4) quantile regression (which also handles mixed effects): it's one of the most powerful
methods, requiring no distributional assumptions yet still offering good
interpretability!
5) Generalized Least Squares and Generalized Estimating Equations estimation
6) Passing-Bablok and Deming regression
7) resampling (permutation/exact tests, approximate permutation tests, bootstrapped
interval estimation). Just remember that those methods aren't accepted by the regulators in
the Clinical Research industry when used to analyse the primary outcomes.
8) In case of serious skewness you can also try adding categorical covariate(s) to your
model, which may split your dataset into more homogeneous subgroups. Why? Because
skewness often comes from mixing data from 2+ populations with different
variability.
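As an illustration of the resampling route in point 7, here is a minimal sketch of an approximate (Monte Carlo) permutation test for a difference in arithmetic means, run directly on skewed data with no transformation. The function name, the simulated exponential data, the shift of 1.5, the number of permutations and the seeds are all assumptions made for this example.

```python
import random
import statistics


def permutation_test_mean_diff(a, b, n_perm=5000, seed=0):
    """Two-sided approximate permutation test for a difference in arithmetic means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled observations to the two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical right-skewed data: exponential responses, one group shifted by 1.5.
rng = random.Random(42)
control = [rng.expovariate(1.0) for _ in range(40)]
treated = [rng.expovariate(1.0) + 1.5 for _ in range(40)]

p_value = permutation_test_mean_diff(control, treated)
print(f"p-value for the shift in means: {p_value:.4f}")
```

The null hypothesis stays on the original scale: the test still asks about a shift in arithmetic means, which is the question a log transformation would have replaced.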
Afterword
As always in statistics, there's no easy solution for all cases. There are justified cases where
transformations are not only applicable but also demanded by the regulations; see an
example here: FDA: Guidance for Industry - Statistical Approaches to Establishing
Bioequivalence
Or here: EMA - ICH Topic E 9 Statistical Principles for Clinical Trials, step 5
Also: “THE LOG TRANSFORMATION IS SPECIAL”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.530.9640&rep=rep1&type=pdf
Also, find my diagram (on Dropbox) showing a few of the families of models (along with the
relationships) and estimation methods that may be more useful to you than data
transformation:
https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf
A note on “what we call a regression” may also interest you, especially if:
1) you advise people to transform their DV (response) with Box-Cox (or log) without a
thought for the consequences, to "achieve normality" in the DV or residuals. It MAY be
OK when predicting with the "black-box" approach, but it is NOT OK when you use a
model to explain / confirm relationships between variables.
2) you "chase" normality of the raw DV (response) in linear regression.
3) you believe that strongly skewed data cannot be modelled with linear regression
4) …and vice versa: you overuse it for everything (counts, %, categorical data from
questionnaires, concentrations)
5) you say that the linear model is so named because it "produces" a straight line
6) you say that logistic regression is "not a regression, because it models a binary response"
7) you believe that "stepwise regression" is a regression
https://www.linkedin.com/posts/adrianolszewski_rockyourr-datascience-dataanalysis-activity-6691521288101531648-jkuW
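Point 5 in the list above deserves a short illustration: "linear" refers to linearity in the coefficients, not to the shape of the fitted curve. Below is a minimal sketch, assuming hypothetical noise-free data on a parabola and a hand-written least-squares solver; the function names and the data are made up for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_model(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data lying exactly on a parabola: y = 1 + 2x + 3x^2.
xs = [x / 2 for x in range(-6, 7)]
ys = [1 + 2 * x + 3 * x**2 for x in xs]

# The design matrix has columns 1, x, x^2; the model is linear in b0, b1, b2.
design = [[1.0, x, x**2] for x in xs]
coefficients = fit_linear_model(design, ys)
print([round(b, 6) for b in coefficients])
```

The fitted curve is a parabola, yet the model is an ordinary linear model because the response is a linear combination of the coefficients.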
A few links to discussions and resources worth reading:
1. Log-transformation and its implications for data analysis
2. GLM with a Gamma-distributed Dependent Variable (PDF)
3. CrossValidated: When to use gamma GLMs?
4. To transform or not to transform: using generalized linear mixed models to analyse
reaction time data
5. Stat 504 - Introduction to Generalized Linear Models
6. Do Not Log-Transform Count Data, Bitches!
7. Generalized linear models - An introduction by Christoph Scherber
8. https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data-transformations/
9. Notes on Transformations and Generalized Linear Models
10. Handling Skewed Data: A Comparison of Two Popular Methods
11. CrossValidated: Linear model with log-transformed response vs. generalized linear
model with log link
12. CrossValidated: How to decide which glm family to use?
13. CrossValidated: Family of GLM represents the distribution of the response variable or
residuals?
14. CrossValidated: Why is GLM different than an LM with transformed variable
15. CrossValidated: GLM vs square root data transformation
16. https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/
17. CrossValidated: How to interpret regression coefficients when response was
transformed by the 4th root?
18. CrossValidated: Express answers in terms of original units, in Box-Cox transformed
data
Books worth reading (yes, I have read them or become familiar enough with them, and use(d) them
at work, to recommend them):
1. Alan Agresti, Foundations of Linear and Generalized Linear Models
2. John Fox, Applied Regression Analysis and Generalized Linear Models
3. Roger Koenker, Victor Chernozhukov, Xuming He, Limin Peng, Handbook of Quantile
Regression
4. Derek S. Young, Handbook of Regression Methods
5. Andreas Ziegler, Generalized Estimating Equations
6. Daryl S. Paulson, Handbook of Regression and Modeling Applications for the Clinical
and Pharmaceutical Industries
7. Myles Hollander, Douglas A. Wolfe, Eric Chicken, Nonparametric Statistical Methods
8. Jason C. Hsu, Multiple Comparisons: Theory and Methods
9. Alex Dmitrienko, Ajit C. Tamhane, Frank Bretz, Multiple Testing Problems in
Pharmaceutical Statistics
10. Michael G. Akritas and Dimitris N. Politis, Recent Advances and Trends in
Nonparametric Statistics
11. W. J. Conover, Practical Nonparametric Statistics
+ some more literature on modern and flexible non-parametric methods (there's a lot
more beyond the Mann-Whitney-Wilcoxon, Friedman, Kruskal-Wallis!), so you don't have to
transform your data ;]
1. Erceg-Hurn, David & Mirosevich, Vikki. (2008). Modern Robust Statistical Methods:
An Easy Way to Maximize the Accuracy and Power of Your Research. American
Psychologist, 63(7), 591-601. doi:10.1037/0003-066X.63.7.591
https://www.researchgate.net/publication/23319441_Modern_Robust_Statistical_Methods_An_Easy_Way_to_Maximize_the_Accuracy_and_Power_of_Your_Research
https://pdfs.semanticscholar.org/88cb/15520b2f84fd2a5a09e0341e791f40ab4118.pdf
2. Edgar Brunner, Madan L. Puri, Nonparametric Methods in Factorial Designs
https://www.researchgate.net/profile/Jos_Feys/post/What_statistical_tests_can_I_use_to_compare_mean_values_for_my_study/attachment/59d6558b79197b80779acad7/AS:526088510111744@1502440683536/download/Brunner.pdf
3. Brunner, E., & Puri, M. L. (1996). Nonparametric methods in design and analysis of
experiments. In Design and Analysis of Experiments (Vol. 13, pp. 631–703). Elsevier.
https://doi.org/10.1016/S0169-7161(96)13021-2
4. Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011). The Aligned Rank
Transform for nonparametric factorial analyses using only ANOVA procedures.
Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI
'11). Vancouver, British Columbia (May 7-12, 2011). New York: ACM Press, pp.
143-146. http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf
5. Christophe Leys, Sandy Schumann, A nonparametric method to analyze interactions:
The adjusted rank transform test
http://cescup.ulb.be/wp-content/uploads/2015/04/Leys_and_Schumann_nonparametric_interactions.pdf
6. Haiko Lüpsen, The Aligned Rank Transform and discrete Variables -a Warning
https://kups.ub.uni-koeln.de/7554/1/ART-discrete.pdf
7. Friedrich, S., Konietschke, F., & Pauly, M. (2017). GFD: An R Package for the
Analysis of General Factorial Designs. Journal of Statistical Software, 79(Code
Snippet 1), 1-18. doi:10.18637/jss.v079.c01
8. Kimihiro Noguchi, Yulia R. Gel, Edgar Brunner, Frank Konietschke,“nparLD: An R
Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial
Experiments”
9. Edgar Brunner, Arne C. Bathke, Frank Konietschke, Rank and Pseudo-Rank
Procedures for Independent Observations in Factorial Designs: Using R and SAS,
Springer, 2019, ISBN: 303002914X, 9783030029142, page 134
https://books.google.pl/books?id=t9KiDwAAQBAJ&lpg=PA134&ots=_Jgi9Rt0Kz&hl=pl&pg=PA134#v=onepage&q&f=false
10. Feys, Jos. "New Nonparametric Rank Tests for Interactions in Factorial Designs with
Repeated Measures." Journal of Modern Applied Statistical Methods 15.1 (2016): 78-
99. Web.
https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1924&context=jmasm
11. Pauly, M., Brunner, E., Konietschke, F. (2015). Asymptotic Permutation Tests in
General Factorial Designs. Journal of the Royal Statistical Society - Series B 77,
461-473
12. Akritas, M. G., & Politis, D. N. (2003). Recent Advances and Trends in
Nonparametric Statistics. Elsevier B.V.
https://doi.org/10.1016/B978-0-444-51378-6.X5000-5
13. Peterson, K.M. (2002). Six Modifications Of The Aligned Rank Transform Test For
Interaction.
https://pdfs.semanticscholar.org/ad4b/54e104acf7356b53c075e959ba8c24e23fea.pdf
14. Schacht, A., Bogaerts, K., Bluhmki, E., & Lesaffre, E. (2008). A new nonparametric
approach for baseline covariate adjustment for two-group comparative studies.
Biometrics, 64(4), 1110-1116
15. Shah DA, Madden LV. Nonparametric analysis of ordinal data in designed factorial
experiments. Phytopathology. 2004;94(1):33-43. doi:10.1094/PHYTO.2004.94.1.33,
https://apsjournals.apsnet.org/doi/pdf/10.1094/PHYTO.2004.94.1.33
16. Versace, V., Schwenker, K., Langthaler, P. B., Golaszewski, S., Sebastianelli, L.,
Brigo, F., Pucks-Faes, E., Saltuari, L., & Nardone, R. (2020). Facilitation of Auditory
Comprehension After Theta Burst Stimulation of Wernicke's Area in Stroke Patients:
A Pilot Study. Frontiers in neurology, 10, 1319.
https://doi.org/10.3389/fneur.2019.01319,
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960103/
17. Prossegger, J., Huber, D., Grafetstätter, C., Pichler, C., Braunschmid, H., Weisböck-
Erdheim, R., & Hartl, A. (2019). Winter Exercise Reduces Allergic Airway
Inflammation: A Randomized Controlled Study. International journal of
environmental research and public health, 16(11), 2040.
https://doi.org/10.3390/ijerph16112040
18. Akritas, M.G. and E. Brunner. 1997. A unified approach to rank tests for mixed
models. Journal of Statistical Planning and Inference. 61:249–277.
19. Haiko Lüpsen, Anova with binary variables - Alternatives for a dangerous F-test (a
better citation is needed)
20. Haiko Lüpsen, Comparison of nonparametric analysis of variance methods: a Monte
Carlo study - Part A: Between subjects designs - A Vote for van der Waerden
+ my list of various non-parametric and robust alternatives to the classic n-way ANOVA:
https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz
2020-11-13 (Friday), LinkedIn, Adrian Olszewski

Más contenido relacionado

La actualidad más candente

Ch4 Boolean Algebra And Logic Simplication1
Ch4 Boolean Algebra And Logic Simplication1Ch4 Boolean Algebra And Logic Simplication1
Ch4 Boolean Algebra And Logic Simplication1
Qundeel
 
Repeatability and Reproducibility in science
Repeatability and Reproducibility in scienceRepeatability and Reproducibility in science
Repeatability and Reproducibility in science
pramod41kumar
 
Least square method
Least square methodLeast square method
Least square method
Somya Bagai
 

La actualidad más candente (20)

Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Introduction to simple linear regression and correlation in spss
Introduction to  simple linear regression and correlation in spssIntroduction to  simple linear regression and correlation in spss
Introduction to simple linear regression and correlation in spss
 
Ch4 Boolean Algebra And Logic Simplication1
Ch4 Boolean Algebra And Logic Simplication1Ch4 Boolean Algebra And Logic Simplication1
Ch4 Boolean Algebra And Logic Simplication1
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Scientific Inquiry
Scientific InquiryScientific Inquiry
Scientific Inquiry
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Poisson distribution
Poisson distributionPoisson distribution
Poisson distribution
 
Hypothesis testing, error and bias
Hypothesis testing, error and biasHypothesis testing, error and bias
Hypothesis testing, error and bias
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Data collection and analysis
Data collection and analysisData collection and analysis
Data collection and analysis
 
Preparing a manuscript
Preparing a manuscriptPreparing a manuscript
Preparing a manuscript
 
Repeatability and Reproducibility in science
Repeatability and Reproducibility in scienceRepeatability and Reproducibility in science
Repeatability and Reproducibility in science
 
Least square method
Least square methodLeast square method
Least square method
 
Qualtrics les UAntwerpen - Simone
Qualtrics les UAntwerpen - SimoneQualtrics les UAntwerpen - Simone
Qualtrics les UAntwerpen - Simone
 
Graphical representation of data
Graphical representation of dataGraphical representation of data
Graphical representation of data
 

Similar a Why are data transformations a bad choice in statistics

© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
LynellBull52
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.
Teng Xiaolu
 
A researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docxA researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docx
evonnehoggarth79783
 

Similar a Why are data transformations a bad choice in statistics (20)

Data analysis01 singlevariable
Data analysis01 singlevariableData analysis01 singlevariable
Data analysis01 singlevariable
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
StatsModelling
StatsModellingStatsModelling
StatsModelling
 
deep larning
deep larningdeep larning
deep larning
 
Analysing The Data
Analysing The DataAnalysing The Data
Analysing The Data
 
Are Evolutionary Algorithms Required to Solve Sudoku Problems
Are Evolutionary Algorithms Required to Solve Sudoku ProblemsAre Evolutionary Algorithms Required to Solve Sudoku Problems
Are Evolutionary Algorithms Required to Solve Sudoku Problems
 
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx© Charles T. Diebold, Ph.D., 73013. All Rights Reserved.  Pa.docx
© Charles T. Diebold, Ph.D., 73013. All Rights Reserved. Pa.docx
 
Analyzing Performance Test Data
Analyzing Performance Test DataAnalyzing Performance Test Data
Analyzing Performance Test Data
 
Algo sobre cladista to read
Algo sobre cladista to readAlgo sobre cladista to read
Algo sobre cladista to read
 
Poor man's missing value imputation
Poor man's missing value imputationPoor man's missing value imputation
Poor man's missing value imputation
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.
 
BASIC MATH PROBLEMS IN STATISCTICSS.pptx
BASIC MATH PROBLEMS IN STATISCTICSS.pptxBASIC MATH PROBLEMS IN STATISCTICSS.pptx
BASIC MATH PROBLEMS IN STATISCTICSS.pptx
 
Project Analytics
Project AnalyticsProject Analytics
Project Analytics
 
Real Estate Data Set
Real Estate Data SetReal Estate Data Set
Real Estate Data Set
 
A researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docxA researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docx
 
Factor analysis using spss 2005
Factor analysis using spss 2005Factor analysis using spss 2005
Factor analysis using spss 2005
 
Pentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BIPentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BI
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
Logistic regression analysis
Logistic regression analysisLogistic regression analysis
Logistic regression analysis
 

Más de Adrian Olszewski

Más de Adrian Olszewski (10)

Logistic regression vs. logistic classifier. History of the confusion and the...
Logistic regression vs. logistic classifier. History of the confusion and the...Logistic regression vs. logistic classifier. History of the confusion and the...
Logistic regression vs. logistic classifier. History of the confusion and the...
 
Logistic regression - one of the key regression tools in experimental research
Logistic regression - one of the key regression tools in experimental researchLogistic regression - one of the key regression tools in experimental research
Logistic regression - one of the key regression tools in experimental research
 
Meet a 100% R-based CRO - The summary of a 5-year journey
Meet a 100% R-based CRO - The summary of a 5-year journeyMeet a 100% R-based CRO - The summary of a 5-year journey
Meet a 100% R-based CRO - The summary of a 5-year journey
 
Meet a 100% R-based CRO. The summary of a 5-year journey
Meet a 100% R-based CRO. The summary of a 5-year journeyMeet a 100% R-based CRO. The summary of a 5-year journey
Meet a 100% R-based CRO. The summary of a 5-year journey
 
Flextable and Officer
Flextable and OfficerFlextable and Officer
Flextable and Officer
 
Modern statistical techniques
Modern statistical techniquesModern statistical techniques
Modern statistical techniques
 
Dealing with outliers in Clinical Research
Dealing with outliers in Clinical ResearchDealing with outliers in Clinical Research
Dealing with outliers in Clinical Research
 
The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...
 
Rcommander - a menu-driven GUI for R
Rcommander - a menu-driven GUI for RRcommander - a menu-driven GUI for R
Rcommander - a menu-driven GUI for R
 
GNU R in Clinical Research and Evidence-Based Medicine
GNU R in Clinical Research and Evidence-Based MedicineGNU R in Clinical Research and Evidence-Based Medicine
GNU R in Clinical Research and Evidence-Based Medicine
 

Último

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Último (20)

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 

Why are data transformations a bad choice in statistics

  • 1. On the transformation of a response in regression modelling and hypothesis testing Adrian Olszewski Originally posted at: Research Gate, March 21st 2020, URL: https://tinyurl.com/yyv2ryus Updated and enhanced: November, 13th 2020 Find the most updated version here: https://www.dropbox.com/s/62bh8cvbkjuu21n/data%20transformations.pdf?dl=0 I do suggest avoiding any variable transformations as much as possible, except the cases you can thoroughly and convincingly justify the reason and explain the outcome. It applies especially to Box-Cox. 1. It completely changes the formulation, and affects the interpretation. Only in "clean" cases you will get interpretable outcome, like log-transformed data generated by multiplicative process (not *any right skewed data*!). log, exp, reciprocal, square/cube root, power of 2, 3 transformations may be meaningful in *special scenarios*, e.g. velocity, area, volume, concentration, length (square root of area), but y^-0.67 doesn't mean anything. And most of your audience will have no idea how your response changes with the predictor unless you draw the curve. Easy for singe response, but if you have more? You will need marginal effects to give some idea. Sometimes you can decide to approximate the obtained coefficient with well-known ones, e.g. 0.45 is close to 0.5 (square root), but it’s not easy in general. So why doing that? By transforming, you *force your variables to follow certain distribution* and to *tell it your story*. For example, log-transformation assumes your data comes from log- normal distribution. Just look what it does with the equation. As a consequence... 2. … It changes the model along with the errors! In our case - from additive to multiple including errors. Maybe it's good maybe not, depends on your case. 3. It will also affect the variance along with means - many people blindly use transformations completely forgetting about that! 
Well, it can be useful if we want to stabilize the variance, BUT it changes more! For example, in the normal distribution the mean and variance are independent; in the log-normal they are not! Of course, this is an idealized case. Box-Cox may return any weird coefficient - guess how the model and the mean-variance relationship will change?

4. Jensen's inequality says it clearly: log(E(y)) ≠ E(log(y)) (except for the identity link). When running a regression you are interested in modelling the conditional expectation of the response, not the (transformed) response itself. And remember that no transformation can properly handle certain response distributions, such as counts (it makes no sense, by the way).
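Points 2 to 4 can be checked numerically. The following is a minimal sketch, not part of the original note: it assumes a log-normal response with illustrative parameters `MU` and `SIGMA` (both made up for this example) and uses only the Python standard library. It compares log(E(y)) with E(log(y)), shows that naively back-transforming the mean of the logs gives the geometric rather than the arithmetic mean, and shows that on the original scale the variance moves together with the mean.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed, purely illustrative log-scale parameters of a log-normal response.
MU, SIGMA = 1.0, 0.8
N = 200_000

y = [random.lognormvariate(MU, SIGMA) for _ in range(N)]
log_y = [math.log(v) for v in y]

arith_mean = statistics.fmean(y)        # estimates E(y)
mean_of_logs = statistics.fmean(log_y)  # estimates E(log(y))
log_of_mean = math.log(arith_mean)      # estimates log(E(y))
geo_mean = math.exp(mean_of_logs)       # naive back-transform = geometric mean


def lognormal_mean_var(mu, sigma):
    """Theoretical mean and variance of a log-normal variable."""
    mean = math.exp(mu + sigma**2 / 2)
    var = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)
    return mean, var


# Same SIGMA, larger MU: the variance on the original scale grows with the mean.
mean_lo, var_lo = lognormal_mean_var(1.0, SIGMA)
mean_hi, var_hi = lognormal_mean_var(2.0, SIGMA)

print(f"E(log(y)) ~ {mean_of_logs:.3f}   log(E(y)) ~ {log_of_mean:.3f}")
print(f"geometric mean ~ {geo_mean:.3f}   arithmetic mean ~ {arith_mean:.3f}")
print(f"variance at MU=1: {var_lo:.2f}   variance at MU=2: {var_hi:.2f}")
```

For a log-normal variable the gap between log(E(y)) and E(log(y)) is SIGMA^2 / 2 in theory, so the more skewed the data, the further the back-transformed mean of the logs falls below the arithmetic mean.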
  • 2. 5. In testing, it changes the null hypothesis, which is then likely no longer the one you wanted to assess! In our case, it switches from testing a shift in arithmetic means to testing the ratio of geometric means. I can hear you: “I was told it leads to valid inference!”. Yes, it leads to valid inference... about a hypothesis you did not start with (unless you can justify that transition). You obtain a valid answer to an *unasked question*. And yes, results from a model of log(y) may differ from those returned by a model with a log link (e.g. gamma regression). You will have to decide which one to choose. Sometimes there are industry guidelines, like those given by the FDA for clinical biostatistics, which advise using the log on PK data (for a good reason), but *even those guidelines* warn you against unconditional and *unjustified* transformations!

6. Your back-transformed confidence intervals will be biased. Another disease for the collection.

7. Box-Cox, like any other transformation, does NOT guarantee the properties you need. And then what are you going to do? Transform the data again, until satisfied? You know this will only complicate an already complicated situation? Not to mention that you may turn your right-skewed data into... left-skewed data and get into more trouble.

I know there are many proponents of unconventional data transformation ("skewed data? go transform it!") on ResearchGate. They were taught this for decades. Moreover, some of them were told by authorities to continue using it. But in light of the arguments collected above, I strongly suggest considering the (practically always better) alternatives. Except for the few scenarios mentioned, transformations can cause more harm than good in confirmatory and exploratory modelling.

Can it be useful? Yes, it may be OK in predictive modelling, especially if you agree to a “black box” approach, where you care mostly about the predicted outcomes and not the rest of the story.
If the predicted outcome agrees with the expectations, you are fine with that, and then it’s OK.

“OK, so what techniques and methods do you recommend instead?”

In the 21st century we have plenty of models, estimation methods and other techniques (around for ca. 50 years) that allow us to deal with certain violations of the assumptions (normality, homoscedasticity, independence of observations and so on), including:

1) generalized models (GLM and GAM), such as gamma, beta, logistic (and probit), fractional logit, Poisson, negative binomial, etc. regressions; truncated regression (most real variables have a truncated domain, keep it in mind!) and censored regression (e.g. the tobit model). I’m sure you will be able to find a tool suitable for you. Remember that this generalizes nicely to mixed-effect models.
  • 3. 2) non-linear models

3) robust and non-parametric methods and tests (there are over 280 statistical tests! Many of them do not require, or relax, certain parametric assumptions, like Yuen, Brunner-Munzel, ATS, WTS, ART ANOVA, Welch, Mann-Whitney/Wilcoxon, and many, many more). At the end of this document I added a longer list of literature that I have read and can wholeheartedly recommend.
   a. If you need ANOVA on non-normal or heterogeneous data, remember that you can run a more advanced model (e.g. robust regression, quantile regression, mixed models) or use GLS or GEE estimation, and follow the modelling with a set of LR (likelihood ratio) tests to mimic the type-3 ANOVA and get the main and interaction effects! Yes, you read that right: that’s what the anova() (or car::Anova(… type=2/3)) function does in R when dealing with so many kinds of models, assessing the reduction of the residual deviance. In the case of the simple linear model this reduces to the analysis of certain contrasts, which is nothing but comparing group means. See? The dots connect! Yes, the outcome will be approximate (Chi2 rather than F), but hey, it’s still a worthwhile and flexible solution!

4) quantile regression (which also handles mixed effects): one of the most powerful methods, requiring no distributional assumptions yet still offering good interpretability!

5) Generalized Least Squares and Generalized Estimating Equations estimation

6) Passing-Bablok and Deming regression

7) resampling (permutation/exact tests, approximate permutation tests, bootstrapped interval estimation). Just remember that those methods aren’t accepted by the regulators in the clinical research industry when used to analyse the primary outcomes.

8) In case of serious skewness you can also try adding categorical covariate(s) to your model, which may split your dataset into more homogeneous subgroups. Why?
Because skewness often results from mixing data from 2+ populations with different variability.

Afterword

As always in statistics, there’s no easy solution for all cases. There are justified cases where transformations are not only applicable but also demanded by the regulations; see an example here: FDA: Guidance for Industry - Statistical Approaches to Establishing Bioequivalence
Or here: EMA - ICH Topic E 9 Statistical Principles for Clinical Trials, step 5
  • 4. Also: “THE LOG TRANSFORMATION IS SPECIAL” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.530.9640&rep=rep1&type=pdf

Also, find my diagram (on Dropbox) showing a few of the families of models (along with their relationships) and estimation methods that may be more useful to you than data transformation: https://www.dropbox.com/s/5a8w8kckyfeaix0/statistical%20models%20-%20diagram.pdf
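To make concrete what a log-scale comparison actually targets, both in the justified cases above and in the earlier point about a changed null hypothesis, here is a minimal sketch using only the Python standard library. It is an illustration under made-up assumptions, not part of the original note: two simulated log-normal groups (`group_a`, `group_b`) are constructed to have equal arithmetic means but different geometric means, so a comparison on the original scale and a comparison on the log scale answer different questions.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible
N = 200_000

# Assumed, illustrative parameters: two log-normal groups whose arithmetic
# means are equal by construction, since E(y) = exp(mu + sigma^2 / 2).
SIGMA_A, SIGMA_B = 1.0, 0.5
MU_A = 0.0
MU_B = MU_A + (SIGMA_A**2 - SIGMA_B**2) / 2  # = 0.375

group_a = [random.lognormvariate(MU_A, SIGMA_A) for _ in range(N)]
group_b = [random.lognormvariate(MU_B, SIGMA_B) for _ in range(N)]

# Original scale: difference of arithmetic means (close to 0 by construction).
diff_arith = statistics.fmean(group_a) - statistics.fmean(group_b)

# Log scale: difference of mean logs, i.e. the log of the geometric-mean ratio.
logs_a = [math.log(v) for v in group_a]
logs_b = [math.log(v) for v in group_b]
diff_logs = statistics.fmean(logs_a) - statistics.fmean(logs_b)

# Back-transforming the log-scale difference gives a RATIO of geometric means.
# Theoretical value: exp(MU_A - MU_B) = exp(-0.375).
gm_ratio = math.exp(diff_logs)

print(f"difference of arithmetic means: {diff_arith:.3f}")
print(f"ratio of geometric means:       {gm_ratio:.3f}")
```

A test on the raw values would be assessing a difference that is essentially zero here, while a test on the logs would be assessing a geometric-mean ratio well below 1. Neither is wrong; they are answers to different questions, which is why the transition needs to be justified.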
  • 5. A note on “what we call a regression” may also interest you, especially if you:
1) advise people to transform their DV (response) with Box-Cox (or log) without a thought for the consequences, to "achieve normality" in the DV or residuals. It MAY be OK when predicting with the "black-box" approach, but it is NOT OK when you use a model to explain / confirm relationships between variables.
2) "chase" normality of the raw DV (response) in linear regression
3) believe that strongly skewed data cannot be modelled with linear regression
4) ...and vice versa, overuse it for everything (counts, %, categorical data from questionnaires, concentrations)
5) say that the linear model is so named because it "produces" a straight line
6) say that logistic regression is "not a regression, because it models a binary response"
7) believe that "stepwise regression" is a regression
https://www.linkedin.com/posts/adrianolszewski_rockyourr-datascience-dataanalysis-activity-6691521288101531648-jkuW

A few URLs linking to discussions and resources worth reading:
1. Log-transformation and its implications for data analysis
2. GLM with a Gamma-distributed Dependent Variable (PDF)
3. CrossValidated: When to use gamma GLMs?
4. To transform or not to transform: using generalized linear mixed models to analyse reaction time data
5. Stat 504 - Introduction to Generalized Linear Models
6. Do Not Log-Transform Count Data, Bitches!
7. Generalized linear models - An introduction by Christoph Scherber
8. https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data-transformations/
9. Notes on Transformations and Generalized Linear Models
10. Handling Skewed Data: A Comparison of Two Popular Methods
  • 6. 11. CrossValidated: Linear model with log-transformed response vs. generalized linear model with log link
12. CrossValidated: How to decide which glm family to use?
13. CrossValidated: Family of GLM represents the distribution of the response variable or residuals?
14. CrossValidated: Why is GLM different than an LM with transformed variable
15. CrossValidated: GLM vs square root data transformation
16. https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/
17. CrossValidated: How to interpret regression coefficients when response was transformed by the 4th root?
18. CrossValidated: Express answers in terms of original units, in Box-Cox transformed data

Books worth reading (yes, I read or “familiarized enough” with and use(d) at work to recommend them):
1. Alan Agresti, Foundations of Linear and Generalized Linear Models
2. John Fox, Applied Regression Analysis and Generalized Linear Models
3. Roger Koenker, Victor Chernozhukov, Xuming He, Limin Peng, Handbook of Quantile Regression
4. Derek S. Young, Handbook of Regression Methods
5. Andreas Ziegler, Generalized Estimating Equations
6. Daryl S. Paulson, Handbook of Regression and Modeling Applications for the Clinical and Pharmaceutical Industries
7. Myles Hollander, Douglas A. Wolfe, Eric Chicken, Nonparametric Statistical Methods
8. Jason C. Hsu, Multiple Comparisons: Theory and Methods
9. Alex Dmitrienko, Ajit C. Tamhane, Frank Bretz, Multiple Testing Problems in Pharmaceutical Statistics
10. Michael G. Akritas and Dimitris N. Politis, Recent Advances and Trends in Nonparametric Statistics
11. W. J. Conover, Practical Nonparametric Statistics

+ some more literature about the modern and flexible non-parametric methods (there is a lot more beyond Mann-Whitney-Wilcoxon, Friedman, Kruskal-Wallis!), so you don’t have to transform your data ;]
1. Erceg-Hurn, David & Mirosevich, Vikki. (2008). 
Modern Robust Statistical Methods: An Easy Way to Maximize the Accuracy and Power of Your Research. The American Psychologist. 63. 591-601. 10.1037/0003-066X.63.7.591. https://www.researchgate.net/publication/23319441_Modern_Robust_Statistical_Methods_An_Easy_Way_to_Maximize_the_Accuracy_and_Power_of_Your_Research https://pdfs.semanticscholar.org/88cb/15520b2f84fd2a5a09e0341e791f40ab4118.pdf
2. Edgar Brunner, Madan L. Puri, Nonparametric Methods in Factorial Designs https://www.researchgate.net/profile/Jos_Feys/post/What_statistical_tests_can_I_use_to_compare_mean_values_for_my_study/attachment/59d6558b79197b80779acad7/AS:526088510111744@1502440683536/download/Brunner.pdf
  • 7. 3. Brunner, E., & Puri, M. L. (1996). Nonparametric methods in design and analysis of experiments. In Design and Analysis of Experiments (Vol. 13, pp. 631–703). Elsevier. https://doi.org/10.1016/S0169-7161(96)13021-2
4. Wobbrock, J.O., Findlater, L., Gergle, D. and Higgins, J.J. (2011). The Aligned Rank Transform for nonparametric factorial analyses using only ANOVA procedures. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '11). Vancouver, British Columbia (May 7-12, 2011). New York: ACM Press, pp. 143-146. http://faculty.washington.edu/wobbrock/pubs/chi-11.06.pdf
5. Christophe Leys, Sandy Schumann, A nonparametric method to analyze interactions: The adjusted rank transform test http://cescup.ulb.be/wp-content/uploads/2015/04/Leys_and_Schumann_nonparametric_interactions.pdf
6. Haiko Lüpsen, The Aligned Rank Transform and discrete Variables - a Warning https://kups.ub.uni-koeln.de/7554/1/ART-discrete.pdf
7. Friedrich, S., Konietschke, F., & Pauly, M. (2017). GFD: An R Package for the Analysis of General Factorial Designs. Journal of Statistical Software, 79(Code Snippet 1), 1-18. doi:10.18637/jss.v079.c01
8. Kimihiro Noguchi, Yulia R. Gel, Edgar Brunner, Frank Konietschke, “nparLD: An R Software Package for the Nonparametric Analysis of Longitudinal Data in Factorial Experiments”
9. Edgar Brunner, Arne C. Bathke, Frank Konietschke, Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs: Using R and SAS, Springer, 2019, ISBN: 303002914X, 9783030029142, page 134 https://books.google.pl/books?id=t9KiDwAAQBAJ&lpg=PA134&ots=_Jgi9Rt0Kz&hl=pl&pg=PA134#v=onepage&q&f=false
10. Feys, Jos. "New Nonparametric Rank Tests for Interactions in Factorial Designs with Repeated Measures." Journal of Modern Applied Statistical Methods 15.1 (2016): 78-99. Web. 
https://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1924&context=jmasm
11. Friedrich, S., Konietschke, F., Pauly, M. (2017). GFD - An R-package for the Analysis of General Factorial Designs. Journal of Statistical Software, Code Snippets 79(1), 1–18, doi:10.18637/jss.v079.c01. Pauly, M., Brunner, E., Konietschke, F. (2015). Asymptotic Permutation Tests in General Factorial Designs. Journal of the Royal Statistical Society - Series B 77, 461-473
12. Akritas, M. G., & Politis, D. N. (2003). Recent Advances and Trends in Nonparametric Statistics. Elsevier B.V. https://doi.org/10.1016/B978-0-444-51378-6.X5000-5
  • 8. 13. Peterson, K.M. (2002). Six Modifications Of The Aligned Rank Transform Test For Interaction. https://pdfs.semanticscholar.org/ad4b/54e104acf7356b53c075e959ba8c24e23fea.pdf
14. Schacht, A., Bogaerts, K., Bluhmki, E., & Lesaffre, E. (2008). A new nonparametric approach for baseline covariate adjustment for two-group comparative studies. Biometrics, 64(4), 1110-6
15. Shah DA, Madden LV. Nonparametric analysis of ordinal data in designed factorial experiments. Phytopathology. 2004;94(1):33-43. doi:10.1094/PHYTO.2004.94.1.33, https://apsjournals.apsnet.org/doi/pdf/10.1094/PHYTO.2004.94.1.33
16. Versace, V., Schwenker, K., Langthaler, P. B., Golaszewski, S., Sebastianelli, L., Brigo, F., Pucks-Faes, E., Saltuari, L., & Nardone, R. (2020). Facilitation of Auditory Comprehension After Theta Burst Stimulation of Wernicke's Area in Stroke Patients: A Pilot Study. Frontiers in Neurology, 10, 1319. https://doi.org/10.3389/fneur.2019.01319, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6960103/
17. Prossegger, J., Huber, D., Grafetstätter, C., Pichler, C., Braunschmid, H., Weisböck-Erdheim, R., & Hartl, A. (2019). Winter Exercise Reduces Allergic Airway Inflammation: A Randomized Controlled Study. International Journal of Environmental Research and Public Health, 16(11), 2040. https://doi.org/10.3390/ijerph16112040
18. Akritas, M.G. and E. Brunner. 1997. A unified approach to rank tests for mixed models. Journal of Statistical Planning and Inference. 61:249–277.
19. Haiko Lüpsen, Anova with binary variables - Alternatives for a dangerous F-test (a better citation to be provided)
20. 
Haiko Lüpsen, Comparison of nonparametric analysis of variance methods: a Monte Carlo study - Part A: Between subjects designs - A Vote for van der Waerden

+ my list of various non-parametric and robust alternatives to the classic n-way ANOVA: https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz

2020-11-13 (Friday), LinkedIn, Adrian Olszewski