No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets.
In: (42nd International Conference on Machine Learning, ICML 2025, 13-19 July 2025, Vancouver). 2025. 11405-11434 (Proceedings of Machine Learning Research ; 267)
Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices, revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes, graph structure and node features, we introduce RINGS, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations, i.e., quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures, performance separability and mode complementarity, as evaluation tools, each assessing the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods from a distinct angle. We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a step toward the systematic evaluation of evaluations.
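The idea of a mode perturbation can be illustrated with a minimal sketch. It is not the RINGS implementation: the `Graph` container, the four perturbation functions, and their names are hypothetical, chosen only to show what it means to alter one mode (structure or features) while leaving the other untouched. The abstract does not define performance separability or mode complementarity, so neither measure is computed here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Toy attributed graph (hypothetical container, not from the paper)."""
    n: int              # number of nodes
    edges: frozenset    # undirected edges stored as (u, v) pairs with u < v
    features: tuple     # one feature tuple per node


def empty_structure(g):
    """Structure-mode perturbation: remove every edge, keep the features."""
    return Graph(g.n, frozenset(), g.features)


def complete_structure(g):
    """Structure-mode perturbation: connect every pair of nodes, keep the features."""
    all_pairs = frozenset((u, v) for u in range(g.n) for v in range(u + 1, g.n))
    return Graph(g.n, all_pairs, g.features)


def constant_features(g):
    """Feature-mode perturbation: zero out every feature, keep the edges."""
    zeros = tuple(tuple(0.0 for _ in row) for row in g.features)
    return Graph(g.n, g.edges, zeros)


def random_features(g, seed=0):
    """Feature-mode perturbation: replace features with Gaussian noise, keep the edges."""
    rng = random.Random(seed)
    noise = tuple(tuple(rng.gauss(0.0, 1.0) for _ in row) for row in g.features)
    return Graph(g.n, g.edges, noise)


# Registry of perturbations (names are assumptions made for this sketch).
PERTURBATIONS = {
    "empty_structure": empty_structure,
    "complete_structure": complete_structure,
    "constant_features": constant_features,
    "random_features": random_features,
}


def perturb_dataset(dataset, name):
    """Apply one named perturbation to every graph in a dataset (a list of Graph)."""
    return [PERTURBATIONS[name](g) for g in dataset]


if __name__ == "__main__":
    # A path graph on four nodes with two features per node.
    g = Graph(4, frozenset({(0, 1), (1, 2), (2, 3)}),
              ((1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)))
    for name in PERTURBATIONS:
        p = perturb_dataset([g], name)[0]
        print(name, "edges:", len(p.edges), "features changed:", p.features != g.features)
```

In a dataset-ablation study of the kind the abstract describes, one would presumably train and evaluate the same model on the original dataset and on each perturbed version, then compare the results. That comparison step is left out because the abstract gives no details about it.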