PuSH - Publication Server of Helmholtz Zentrum München

Berrar, D.*; Dubitzky, W.

Should significance testing be abandoned in machine learning?

Int. J. Data Sci. Anal. 7, 247-257 (2019)
Postprint DOI
Open Access Green
Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
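The abstract's central claim, that a p value near zero and a posterior probability of the null near 1 can coexist, can be illustrated numerically. The following sketch (not from the paper; all parameter choices are illustrative assumptions) compares a point null H0: mu = 0 against H1 with a normal prior mu ~ N(0, tau^2), known sigma, equal prior odds, and a z-statistic fixed at 3 while the sample size n grows:

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def two_sided_p(z):
    """Two-sided p value for a standard-normal test statistic."""
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Illustrative assumptions: sigma = tau = 1, equal prior odds on H0 and H1,
# and data that always yield a z-statistic of exactly 3.
sigma, tau, z = 1.0, 1.0, 3.0

for n in (100, 10_000, 1_000_000):
    xbar = z * sigma / math.sqrt(n)               # sample mean giving z = 3
    m0 = normal_pdf(xbar, sigma**2 / n)           # marginal likelihood under H0
    m1 = normal_pdf(xbar, tau**2 + sigma**2 / n)  # marginal likelihood under H1
    bf01 = m0 / m1                                # Bayes factor in favor of H0
    post_h0 = bf01 / (1 + bf01)                   # posterior P(H0), equal priors
    print(f"n={n:>9}  p={two_sided_p(z):.4f}  P(H0 | data)={post_h0:.3f}")
```

The p value stays fixed at about 0.0027 for every n, yet the posterior probability of H0 rises with n (exceeding 0.9 at n = 1,000,000 under these assumptions), reproducing the conflict the paradox describes.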
Publication type Article: Journal article
Document type Scientific Article
Keywords Bayesian Test ; Classification ; Jeffreys–Lindley Paradox ; P Value ; Significance Test
ISSN (print) / ISBN 2364-415X
e-ISSN 2364-4168
Citation Volume: 7, Issue: 4, Pages: 247-257
Publisher Springer
Publishing Place Cham (ZG)
Non-patent literature Publications
Reviewing status Peer reviewed