When the brain processes language, it mobilizes neurons across multiple brain regions to work together in real time. Constructing neuroimaging data with both high temporal and high spatial resolution is therefore crucial for studying the brain's language-processing mechanisms. Existing open-source datasets are mainly collected for English, include only a single neuroimaging modality, such as high-spatial-resolution functional MRI (fMRI) or high-temporal-resolution magnetoencephalography (MEG), and mostly use less than one hour of experimental material, which is too little to support comprehensive brain research with computational models that require large amounts of data. To this end, we collected and processed the largest and most informative simultaneous multimodal neuroimaging dataset in the world to date. The paper introducing this dataset has been accepted and published by Scientific Data, a Nature Portfolio journal (https://rdcu.be/cWDSx).
Figure 1 Schematic overview of the study procedure. a. Participants followed the instructions on the screen and listened to stories while their brain activity was recorded by fMRI and MEG. b. Participants lay in the MRI scanner while structural and resting-state MRI data were recorded.
The dataset contains fMRI and MEG recordings collected while 12 subjects each listened to about 6 hours of stories, together with each subject's T1/T2-weighted structural images, diffusion MRI, and resting-state MRI. The collection procedure is shown in Figure 1. To facilitate the study of the brain's language-processing mechanisms with computational models, all story materials were manually annotated with syntactic structure trees, and for each word in the text we computed its audio time points, word frequency, and the vectors of the characters and words corresponding to it, as shown in Figure 2. All quality metrics meet or exceed those of existing comparable datasets, providing sufficient quality assurance. This is by far the largest multimodal neuroimaging dataset for brain language-processing research in the world, and the first large-scale Chinese multimodal neuroimaging dataset. Its public release provides important support for comprehensive research on scientific questions such as how the brain mobilizes different brain regions, and how those regions work together, when understanding words, phrases, and sentences in real scenarios. Especially important, the data cover nearly 10,000 Chinese words. The dataset is therefore also of great value for exploring the relationship between computational language models and the brain's language-processing mechanisms, and for investigating how neuroimaging data can improve existing language models, toward building a new generation of brain-inspired neural language models.
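As a minimal sketch of how such annotations might be organized, the record below combines the kinds of per-word information described above (audio onset time, word frequency, part-of-speech tag, embedding vector). The field names and values are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordAnnotation:
    """One word-level annotation. Hypothetical schema illustrating the
    information types described in the text, not the dataset's real format."""
    word: str            # the word as it appears in the transcript
    onset_s: float       # audio onset time of the word, in seconds
    offset_s: float      # audio offset time, in seconds
    frequency: float     # corpus word frequency (e.g., per million tokens)
    pos_tag: str         # part-of-speech tag
    vector: List[float]  # word-embedding vector (truncated here for brevity)


def words_in_window(annotations, start_s, end_s):
    """Return annotations whose onset falls in [start_s, end_s): the kind of
    lookup needed to align stimulus words with fMRI/MEG time courses."""
    return [a for a in annotations if start_s <= a.onset_s < end_s]


# Toy example: two annotated words from a story.
anns = [
    WordAnnotation("北京", 0.50, 0.92, 1200.0, "NR", [0.11, -0.32, 0.05]),
    WordAnnotation("秋天", 1.10, 1.55, 850.0, "NN", [0.27, 0.08, -0.19]),
]
print([a.word for a in words_in_window(anns, 0.0, 1.0)])  # ['北京']
```

Keyed by onset time, annotations like these can be aligned to any neuroimaging sampling grid, which is what makes such joint text-and-timing metadata useful for model-based analyses.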
Figure 2 An example of annotation information for the stimuli. a. Speech-to-text alignment. b. Linguistic annotations of characters. c. Linguistic annotations of words. d. Part-of-speech tag annotations. e. Constituency tree annotations. f. Dependency tree annotations.
Wang, S., Zhang, X., Zhang, J. et al. A synchronized multimodal neuroimaging dataset for studying brain language processing. Sci Data 9, 590 (2022). https://doi.org/10.1038/s41597-022-01708-5