The study uses a multimodal dataset of synchronized audio, video, and radar recordings, collected to evaluate audio–visual–radar (AVR) fusion for speech recognition. The dataset contains 800 labeled samples (4 speakers × 10 words × 20 repetitions), produced by four fluent, non-native English speakers aged 28–40, each repeating ten phonetically challenging English words twenty times under controlled recording conditions.
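The sample count follows directly from the recording protocol. As a minimal sketch, assuming integer placeholder indices for speakers, words, and repetitions (the description does not give the actual identifiers or file layout), the full set of samples can be enumerated as:

```python
from itertools import product

# Counts taken from the dataset description.
N_SPEAKERS = 4
N_WORDS = 10
N_REPETITIONS = 20


def build_manifest():
    """Return one record per (speaker, word, repetition) combination.

    The integer indices are hypothetical placeholders, not the dataset's
    real speaker IDs or word labels.
    """
    return [
        {"speaker": s, "word": w, "repetition": r}
        for s, w, r in product(
            range(N_SPEAKERS), range(N_WORDS), range(N_REPETITIONS)
        )
    ]


manifest = build_manifest()
print(len(manifest))  # 4 * 10 * 20 = 800
```

Each speaker contributes 200 samples and each word appears 80 times, so the dataset is balanced across both factors.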

Audio was recorded at 44.1 kHz with 16-bit depth and video at 1080p and 30 fps. Radar data were acquired with a XeThru X4M03 IR-UWB sensor positioned approximately 2 m from the speaker to capture the micro-Doppler signatures of speech articulation. The three modalities were automatically segmented and aligned using pretrained machine-learning models so that the streams remain synchronized.
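The stated audio and video rates imply a fixed ratio between the two streams. The sketch below illustrates that ratio only. It is not the model-based alignment used for the dataset, and it assumes constant rates and a common start time for all streams. The function names are hypothetical, and the radar frame rate is not given in the description, so it is left as a parameter.

```python
AUDIO_RATE_HZ = 44_100  # from the dataset description
VIDEO_FPS = 30          # from the dataset description

# 44100 / 30 divides exactly, so each video frame spans 1470 audio samples.
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // VIDEO_FPS


def audio_span_for_frame(frame_index):
    """Return the [start, stop) audio sample range covering one video frame."""
    start = frame_index * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME


def frame_for_time(t_seconds, rate_hz):
    """Map a non-negative timestamp to a frame index at the given frame rate.

    Usable for any constant-rate stream, including radar once its frame
    rate is known (that rate is an assumption the caller must supply).
    """
    return int(t_seconds * rate_hz)


print(SAMPLES_PER_FRAME)
print(audio_span_for_frame(30))
```

Because the division is exact, audio and video frame boundaries coincide with no accumulated drift under these assumptions.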

The dataset is intended as a resource for studying how radar sensing can make speech recognition more robust to noise, occlusion, and other adverse conditions.

All audio–visual recordings contain identifiable human data and will therefore be shared only under controlled access to protect participant privacy.