In this work, we propose a deep learning-based method for iterative registration of fetal brain images acquired by ultrasound and magnetic resonance, inspired by “Spatial Transformer Networks”. Images are co-aligned to a dual modality spatio-temporal atlas, where computational image analysis may be performed in the future. Our results show better alignment accuracy compared to “Self-Similarity Context descriptors”, a state-of-the-art method developed for multi-modal image registration. Furthermore, our method is robust and able to register highly misaligned images, with any initial orientation, where similarity-based methods typically fail.