Access to large, annotated datasets remains a considerable challenge for training accurate deep-learning models in medical imaging. Although transfer learning from pre-trained models can help when data are scarce, it limits design choices and generally results in the use of unnecessarily large models. Here we propose a self-supervised training scheme for obtaining high-quality, pre-trained networks from unlabelled, cross-modal medical imaging data, which allows the creation of accurate and efficient models. We demonstrate the utility of the scheme by accurately predicting retinal thickness measurements, derived from optical coherence tomography, from simple infrared fundus images. The learned representations subsequently outperformed advanced classifiers on a separate diabetic retinopathy classification task in a scenario of scarce training data. Our cross-modal, three-stage scheme effectively replaced 26,343 diabetic retinopathy annotations with 1,009 semantic segmentations on optical coherence tomography and reached the same classification accuracy using only 25% of the fundus images, without any drawbacks, since optical coherence tomography is not required for predictions. We expect this concept to apply to other multimodal clinical imaging, health records and genomics data, and to corresponding sample-starved learning problems.