

Title: A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images
Abstract

Real-time magnetic resonance imaging (RT-MRI) of human speech production is enabling significant advances in speech science, linguistics, bio-inspired speech technology development, and clinical applications. However, easy access to RT-MRI is limited, and comprehensive datasets with broad access are needed to catalyze research across numerous domains. The imaging of the rapidly moving articulators and dynamic airway shaping during speech demands high spatio-temporal resolution and robust reconstruction methods. Further, while reconstructed images have been published, to date there is no open dataset providing raw multi-coil RT-MRI data from an optimized speech production experimental setup. Such datasets could enable new and improved methods for dynamic image reconstruction, artifact correction, feature extraction, and direct extraction of linguistically relevant biomarkers. The present dataset offers a unique corpus of 2D sagittal-view RT-MRI videos along with synchronized audio for 75 participants performing linguistically motivated speech tasks, alongside the corresponding public domain raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each participant.

NSF-PAR ID:
10277177
Publisher / Repository:
Nature Publishing Group
Journal Name:
Scientific Data
Volume:
8
Issue:
1
ISSN:
2052-4463
Sponsoring Org:
National Science Foundation
More Like this
  1. Purpose

    To develop and evaluate a technique for 3D dynamic MRI of the full vocal tract at high temporal resolution during natural speech.

    Methods

    We demonstrate 2.4 × 2.4 × 5.8 mm³ spatial resolution, 61‐ms temporal resolution, and a 200 × 200 × 70 mm³ FOV. The proposed method uses 3D gradient‐echo imaging with a custom upper‐airway coil, a minimum‐phase slab excitation, stack‐of‐spirals readout, pseudo golden‐angle view order in kx-ky, linear Cartesian order along kz, and spatiotemporal finite difference constrained reconstruction, with 13‐fold acceleration. This technique is evaluated using in vivo vocal tract airway data from 2 healthy subjects acquired on a 1.5 T scanner, 1 with synchronized audio, with 2 tasks during production of natural speech, and via comparison with interleaved multislice 2D dynamic MRI.

    Results

    This technique captured known dynamics of vocal tract articulators during natural speech tasks including tongue gestures during the production of consonants “s” and “l” and of consonant–vowel syllables, and was additionally consistent with 2D dynamic MRI. Coordination of lingual (tongue) movements for consonants is demonstrated via volume‐of‐interest analysis. Vocal tract area function dynamics revealed critical lingual constriction events along the length of the vocal tract for consonants and vowels.

    Conclusion

    We demonstrate feasibility of 3D dynamic MRI of the full vocal tract, with spatiotemporal resolution adequate to visualize lingual movements for consonants and vocal tract shaping during natural productions of consonant–vowel syllables, without requiring multiple repetitions.

  2. Purpose

    To demonstrate a tagging method compatible with RT‐MRI for the study of speech production.

    Methods

    Tagging is applied as a brief interruption to a continuous real‐time spiral acquisition. Tagging can be initiated manually by the operator, cued to the speech stimulus, or be automatically applied with a fixed frequency. We use a standard 2D 1‐3‐3‐1 binomial SPAtial Modulation of Magnetization (SPAMM) sequence with 1 cm spacing in both in‐plane directions. Tag persistence in tongue muscle is simulated and validated in vivo. The ability to capture internal tongue deformations is tested during speech production of American English diphthongs in native speakers.

    Results

    We achieved an imaging window of 650‐800 ms at 1.5T, with imaging signal to noise ratio ≥ 17 and tag contrast to noise ratio ≥ 5 in human tongue, providing 36 frames/s temporal resolution and 2 mm in‐plane spatial resolution with real‐time interactive acquisition and view‐sharing reconstruction. The proposed method was able to capture tongue motion patterns and their relative timing with adequate spatiotemporal resolution during the production of American English diphthongs and consonants.

    Conclusion

    Intermittent tagging during real‐time MRI of speech production is able to reveal the internal deformations of the tongue. This capability will allow new investigations of valuable spatiotemporal information on the biomechanics of the lingual subsystems during speech without reliance on binning speech utterance repetition.
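
The 1‐3‐3‐1 binomial SPAMM tagging described in the Methods can be illustrated with a small one-dimensional rotation model. This is only a minimal sketch under stated assumptions: the 90° total flip angle, the neglect of relaxation and off-resonance, and the function name `spamm_mz` are illustrative choices, not details taken from the abstract. Four RF pulses with relative amplitudes 1:3:3:1 are separated by gradient lobes that each add a position-dependent phase of 2π·x/spacing.

```python
import numpy as np

def rot_x(angle):
    """Rotation about x: models an RF pulse with the given flip angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(angle):
    """Rotation about z: models precession caused by a tagging gradient lobe."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def spamm_mz(x_cm, spacing_cm=1.0, total_flip_deg=90.0, weights=(1, 3, 3, 1)):
    """Longitudinal magnetization after a binomial SPAMM train at position x.

    Assumptions (not from the source): 90 degree total flip angle,
    no relaxation, no off-resonance, a single tagging direction.
    """
    weights = np.asarray(weights, dtype=float)
    flips = np.deg2rad(total_flip_deg) * weights / weights.sum()
    phase = 2.0 * np.pi * x_cm / spacing_cm  # phase added by each gradient lobe
    m = np.array([0.0, 0.0, 1.0])            # start at thermal equilibrium
    for i, flip in enumerate(flips):
        m = rot_x(flip) @ m
        if i < len(flips) - 1:               # gradient lobe between RF pulses
            m = rot_z(phase) @ m
    return m[2]

# Sample the tag profile over two tag periods (1 cm spacing, as in the abstract).
positions = np.linspace(0.0, 2.0, 9)
profile = [spamm_mz(x) for x in positions]
```

In this model the longitudinal magnetization is nulled at multiples of the tag spacing, where all four pulses add coherently, and is left untouched halfway between tag lines, where the alternating contributions 1 − 3 + 3 − 1 cancel.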

  3. Purpose

    To improve the depiction and tracking of vocal tract articulators in spiral real‐time MRI (RT‐MRI) of speech production by estimating and correcting for dynamic changes in off‐resonance.

    Methods

    The proposed method computes a dynamic field map from the phase of single‐TE dynamic images after a coil phase compensation where complex coil sensitivity maps are estimated from the single‐TE dynamic scan itself. This method is tested using simulations and in vivo data. The depiction of air–tissue boundaries is evaluated quantitatively using a sharpness metric and visual inspection.

    Results

    Simulations demonstrate that the proposed method provides robust off‐resonance correction for spiral readout durations up to 5 ms at 1.5T. In vivo experiments during human speech production demonstrate that image sharpness is improved in a majority of data sets at air–tissue boundaries including the upper lip, hard palate, soft palate, and tongue boundaries, whereas the lower lip shows little improvement in the edge sharpness after correction.

    Conclusion

    Dynamic off‐resonance correction is feasible from single‐TE spiral RT‐MRI data, and provides a practical performance improvement in articulator sharpness when applied to speech production imaging.
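
The core idea in the Methods, reading a field map off the phase of a single‐TE image once the coil phase has been removed, can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: the function name `field_map_hz`, the sign convention (image phase = 2π·f·TE + coil phase), and the synthetic values are assumptions, and the estimation of the coil sensitivity maps from the dynamic scan itself is not shown.

```python
import numpy as np

def field_map_hz(image, coil_phase, te_s):
    """Estimate off-resonance (Hz) from one complex image frame.

    image      : complex array, single-TE image frame
    coil_phase : real array (radians), phase attributed to coil sensitivity
    te_s       : echo time in seconds

    Assumed convention (not from the source): the residual phase after
    coil phase compensation equals 2*pi*f*TE. The estimate is only
    unambiguous for |f| < 1 / (2 * TE) because the phase wraps.
    """
    compensated = image * np.exp(-1j * coil_phase)
    return np.angle(compensated) / (2.0 * np.pi * te_s)

# Synthetic check with hypothetical values: three pixels, TE = 1 ms.
te = 1e-3
true_f = np.array([-100.0, 0.0, 150.0])   # off-resonance in Hz
coil = np.array([0.3, -1.2, 2.0])         # arbitrary coil phase in radians
frame = 2.0 * np.exp(1j * (2.0 * np.pi * true_f * te + coil))
estimate = field_map_hz(frame, coil, te)
```

Applying this frame by frame yields a time series of field maps, which is what makes the correction dynamic.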

  4. Purpose

    To develop and evaluate a fast and effective method for deblurring spiral real‐time MRI (RT‐MRI) using convolutional neural networks.

    Methods

    We demonstrate a 3‐layer residual convolutional neural network to correct image domain off‐resonance artifacts in speech production spiral RT‐MRI without knowledge of field maps. The architecture is motivated by traditional deblurring approaches. Spatially varying off‐resonance blur is synthetically generated by using discrete object approximation and field maps with data augmentation from a large database of 2D human speech production RT‐MRI. The effects of off‐resonance range, shift‐invariance of blur, and readout duration on deblurring performance are investigated. The proposed method is validated using synthetic and real data with longer readouts, quantitatively using image quality metrics and qualitatively via visual inspection, and with a comparison to conventional deblurring methods.

    Results

    Deblurring performance was found superior to a current autocalibrated method for in vivo data and only slightly worse than an ideal reconstruction with perfect knowledge of the field map for synthetic test data. Convolutional neural networks deblurring made it possible to visualize articulator boundaries with readouts up to 8 ms at 1.5 T, which is 3‐fold longer than the current standard practice. The computation time was 12.3 ± 2.2 ms per frame, enabling low‐latency processing for RT‐MRI applications.

    Conclusion

    Convolutional neural networks deblurring is a practical, efficient, and field map‐free approach for the deblurring of spiral RT‐MRI. In the context of speech production imaging, this can enable 1.7‐fold improvement in scan efficiency and the use of spiral readouts at higher field strengths such as 3 T.

  5. Purpose

    To provide 3D real‐time MRI of speech production with improved spatio‐temporal sharpness using randomized, variable‐density, stack‐of‐spiral sampling combined with a 3D spatio‐temporally constrained reconstruction.

    Methods

    We evaluated five candidate (k,t) sampling strategies using a previously proposed gradient‐echo stack‐of‐spiral sequence and a 3D constrained reconstruction with spatial and temporal penalties. Regularization parameters were chosen by expert readers based on qualitative assessment. We experimentally determined the effect of spiral angle increment and kz temporal order. The strategy yielding highest image quality was chosen as the proposed method. We evaluated the proposed and original 3D real‐time MRI methods in 2 healthy subjects performing speech production tasks that invoke rapid movements of articulators seen in multiple planes, using interleaved 2D real‐time MRI as the reference. We quantitatively evaluated tongue boundary sharpness in three locations at two speech rates.

    Results

    The proposed data‐sampling scheme uses a golden‐angle spiral increment in the kx-ky plane and variable‐density, randomized encoding along kz. It provided a statistically significant improvement in tongue boundary sharpness score (P < .001) in the blade, body, and root of the tongue during normal and 1.5‐times speeded speech. Qualitative improvements were substantial during natural speech tasks of alternating high, low tongue postures during vowels. The proposed method was also able to capture complex tongue shapes during fast alveolar consonant segments. Furthermore, the proposed scheme allows flexible retrospective selection of temporal resolution.

    Conclusion

    We have demonstrated improved 3D real‐time MRI of speech production using randomized, variable‐density, stack‐of‐spiral sampling with a 3D spatio‐temporally constrained reconstruction.
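
A sampling schedule of the kind described in the Results (golden‐angle rotation of spiral arms in the kx-ky plane, with randomized, variable‐density selection of kz partitions) might be sketched as below. The abstract does not give the exact angle value or density function, so the 137.51° increment, the Gaussian weighting, and the names `sampling_schedule` and `sigma_frac` are all assumptions made for illustration.

```python
import numpy as np

# Golden angle over a full circle, about 137.51 degrees (assumed variant).
GOLDEN_ANGLE_DEG = 360.0 * (1.0 - 2.0 / (1.0 + np.sqrt(5.0)))

def sampling_schedule(n_arms, n_kz, sigma_frac=0.25, seed=0):
    """Return (spiral rotation angles in degrees, kz index per arm).

    Angles advance by the golden angle from one arm to the next.
    kz indices are drawn at random with a Gaussian density that
    favors the center of k-space; the Gaussian is a hypothetical
    choice, not the density used in the cited work.
    """
    rng = np.random.default_rng(seed)
    angles = (np.arange(n_arms) * GOLDEN_ANGLE_DEG) % 360.0
    kz = np.arange(n_kz) - n_kz // 2              # centered partition indices
    weights = np.exp(-0.5 * (kz / (sigma_frac * n_kz)) ** 2)
    weights /= weights.sum()
    kz_order = rng.choice(kz, size=n_arms, p=weights)
    return angles, kz_order

# Hypothetical sizes: 200 spiral arms distributed over 12 kz partitions.
angles, kz_order = sampling_schedule(n_arms=200, n_kz=12)
```

Because consecutive golden‐angle arms cover the kx-ky plane nearly uniformly for any window length, frames can be formed retrospectively from different numbers of consecutive arms, which is one way to read the abstract's remark about flexible selection of temporal resolution.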
