Predicting the remaining time of business processes is a major task in predictive process monitoring (PPM). In recent years, various machine learning methods have been introduced that have steadily reduced error levels. However, the commonly applied metric for optimization and evaluation, the Mean Absolute Error (MAE), has limited interpretability. In this work we introduce and evaluate the normalized Mean Absolute Error (nMAE) as an interpretable metric for model evaluation. It accounts for different kinds of label shifts, a special type of concept drift that can distort remaining time results. We investigate these concepts in a thorough benchmark study and use them to assess the current state of remaining time prediction for business processes. This includes the evaluation of four different baseline models, identifying the most accurate one. Furthermore, our study compares three state-of-the-art methods, namely XGBoost, DA-LSTM, and PGT-Net. In contrast to prior studies, we find no significant difference in performance between these models. Additionally, using the nMAE as an evaluation metric, we find that these models do not perform reasonably well on a range of event logs. Possible explanations for this behaviour are discussed and consolidated, along with other findings from the benchmark study, into a comprehensive list motivating future research directions.
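To make the metric discussed above concrete, the following is a minimal sketch of how an nMAE could be computed, assuming the MAE is normalized by the mean true remaining time of the evaluation set; this normalizer, the function names, and the example values are illustrative assumptions and not necessarily the definition used in the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error in the original time unit (e.g. days)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def nmae(y_true, y_pred):
    """Normalized MAE: MAE divided by the mean true remaining time.

    The choice of normalizer (mean of the true labels) is an assumption
    made for illustration; the paper may normalize differently.
    """
    return mae(y_true, y_pred) / np.mean(np.asarray(y_true, dtype=float))

# Hypothetical remaining times (in days) for a few running cases
y_true = [10.0, 4.0, 25.0, 1.5]
y_pred = [12.0, 3.0, 20.0, 2.0]

print(f"MAE : {mae(y_true, y_pred):.2f} days")
print(f"nMAE: {nmae(y_true, y_pred):.2f}")  # unitless, comparable across event logs
```

Because the result is unitless, a value of this kind can be compared across event logs with very different case durations, which is one plausible reading of the interpretability argument made in the abstract.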