Radar-based machine learning pipelines require extensive annotated datasets. However, producing large volumes of precise labels remains prohibitively laborious and prone to inconsistency, as radar signals lack a direct visual correspondence. To address this limitation, we introduce a fully automated, multi-modal annotation pipeline built around our custom RadarBox, which co-registers an FMCW MIMO radar with an Azure Kinect RGB-D camera. Precise spatial calibration and hardware-level synchronization yield accurate pixel-to-radar alignment. RGB images undergo panoptic segmentation to generate per-pixel human masks, which are fused with depth measurements to reconstruct a voxelized surface mesh. We extract 3D joint positions using the Azure Kinect Body Tracking SDK and apply a bidirectional Kalman filter to derive precise per-joint positions and velocity vectors, free from sudden, non-physiological fluctuations. These enhanced labels are projected into 5D radar cube slices and target lists through robust spatio-temporal association. As a demonstration, we train a deep neural network on annotated radar target lists for indoor people localization, achieving a mean positional error of 0.31 m and 91.8% occupancy accuracy, even under occlusion. Unlike prior semi-automatic or heuristic-based methods, our approach delivers consistent 5D labels at scale, bridging spatial, temporal, and Doppler dimensions, and thus paves the way for large-scale, learning-based radar sensing in human-centered applications.
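
To make the label-smoothing step concrete, the sketch below shows one way the bidirectional Kalman filtering of joint trajectories could be realized: a forward Kalman pass under a constant-velocity motion model, followed by a Rauch-Tung-Striebel backward pass that refines each state with future evidence. This is a minimal illustrative interpretation, not the paper's implementation; the function name, frame rate, and noise parameters are assumptions.

```python
# Minimal sketch of forward-backward (bidirectional) Kalman smoothing of one
# joint's 3D track, assuming a constant-velocity motion model. The frame rate
# dt and the noise scales q, r are illustrative, not values from the paper.
import numpy as np

def smooth_joint_track(z, dt=1 / 30, q=1.0, r=5e-3):
    """Smooth a (T, 3) position track z into (T, 6) states [px,py,pz,vx,vy,vz]."""
    T = len(z)
    F = np.eye(6); F[:3, 3:] = dt * np.eye(3)       # constant-velocity transition
    H = np.zeros((3, 6)); H[:, :3] = np.eye(3)      # only positions are observed
    Q = q * dt * np.eye(6)                          # process noise (illustrative)
    R = r * np.eye(3)                               # measurement noise (illustrative)

    # Forward Kalman pass: keep predicted and filtered moments for the smoother.
    xf = np.zeros((T, 6)); Pf = np.zeros((T, 6, 6))
    xp = np.zeros((T, 6)); Pp = np.zeros((T, 6, 6))
    x = np.concatenate([z[0], np.zeros(3)]); P = np.eye(6)
    for t in range(T):
        xp[t], Pp[t] = F @ x, F @ P @ F.T + Q       # predict
        S = H @ Pp[t] @ H.T + R                     # innovation covariance
        K = Pp[t] @ H.T @ np.linalg.inv(S)          # Kalman gain
        x = xp[t] + K @ (z[t] - H @ xp[t])          # update with measurement
        P = (np.eye(6) - K @ H) @ Pp[t]
        xf[t], Pf[t] = x, P

    # Backward (Rauch-Tung-Striebel) pass: propagate future evidence backwards,
    # which removes the sudden, non-physiological jumps of the causal filter.
    xs = xf.copy()
    Ps = Pf.copy()
    for t in range(T - 2, -1, -1):
        G = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])  # smoother gain
        xs[t] = xf[t] + G @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + G @ (Ps[t + 1] - Pp[t + 1]) @ G.T
    return xs                                       # columns 3:6 are velocities
```

Applied per joint to the body-tracking output (e.g., `smooth_joint_track(joint_xyz)` for a (T, 3) array of one joint's positions), the backward pass yields the smoothed positions together with the per-joint velocity vectors referenced in the abstract.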