Pixel-level fusion of satellite images acquired by multiple sensors improves the quality of the data both spatially and spectrally. In particular, multispectral and hyperspectral images have been fused to generate images with high spatial and spectral resolution. Several approaches to this task exist in the literature; nonetheless, these techniques still lose relevant spatial information during the fusion process. This work presents a multi-scale deep learning model to fuse multispectral data, which have high-spatial-and-low-spectral resolution (HSaLS), with hyperspectral data, which have low-spatial-and-high-spectral resolution (LSaHS). The fusion scheme yields a high-spatial-and-spectral resolution image (HSaHS). To accomplish this, we developed a new scalable high-spatial-resolution process in which the model learns to transition from the low spatial resolution to an intermediate spatial resolution level, and finally to the high spatial-spectral resolution image. This step-by-step process significantly reduces the loss of spatial information. Our approach shows better performance in terms of both the structural similarity index and the signal-to-noise ratio.
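The step-by-step idea behind the fusion scheme can be illustrated with a minimal NumPy sketch. This is not the learned deep model described above; it is a hypothetical hand-crafted stand-in that upsamples the LSaHS hyperspectral cube through intermediate spatial scales and, at each scale, injects high-frequency spatial detail derived from the HSaLS multispectral image. The function names (`upsample2x`, `fuse_stepwise`) and the detail-injection rule are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def upsample2x(cube):
    # Nearest-neighbor 2x spatial upsampling of an (H, W, bands) array.
    return cube.repeat(2, axis=0).repeat(2, axis=1)

def fuse_stepwise(hs_lr, ms_hr, stages=2):
    """Illustrative multi-scale fusion (NOT the learned model).

    hs_lr : (h, w, B) low-spatial / high-spectral cube (LSaHS).
    ms_hr : (h*2**stages, w*2**stages, b) high-spatial / low-spectral
            image (HSaLS).
    Returns an (h*2**stages, w*2**stages, B) fused cube (HSaHS).
    """
    cube = hs_lr.astype(float)
    for s in range(stages):
        # Move to the next intermediate spatial resolution.
        cube = upsample2x(cube)
        # Spatial-detail injection: high-pass of the multispectral mean
        # intensity at the current scale, added to every spectral band.
        factor = 2 ** (stages - s - 1)
        ms_at_scale = ms_hr[::factor, ::factor].mean(axis=2)
        ms_coarse = upsample2x(ms_hr[::factor * 2, ::factor * 2].mean(axis=2))
        cube = cube + (ms_at_scale - ms_coarse)[:, :, None]
    return cube
```

In the deep learning model, the hand-crafted detail-injection step would be replaced by learned convolutional layers at each scale, which is what allows the network to decide which spatial structures to transfer rather than copying all high-frequency content uniformly.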