Predicting where a person will look in a dynamic social interaction requires understanding not only visual cues, but also who is speaking, what is being said, and how conversational context unfolds over time. We present a multimodal framework for real-time egocentric head-gaze forecasting that integrates visual, audio, and language signals within a unified architecture. Our key insight is that naturally occurring egocentric human interaction videos—when combined with spatially grounded speaker cues—provide a rich supervisory signal for anticipating socially meaningful gaze behavior. To support large-scale training, we build a new 40+ hour conversation-centric egocentric benchmark drawn from Aria, Ego4D, and EgoCom, and introduce a novel method for deriving frame-level yaw–pitch gaze labels from point trajectories produced by video point tracking (i.e., CoTracker). This proxy supervision closely matches head-mounted IMU measurements (sub-degree mean error), enabling scalable annotation without specialized hardware. Our model fuses multimodal cues through a speaker-conditioned cross-attention mechanism that injects audio–language features into localized visual regions, distinguishing egocentric speech (global attention) from non-egocentric speakers (spatially localized attention) to predict short-horizon head-gaze trajectories suitable for low-latency embodied applications. Across all datasets, our approach outperforms prior state-of-the-art baselines and yields fine-grained improvements on socially relevant behaviors such as joint attention, mutual gaze, and gaze shifts. Together, these results demonstrate a scalable, multimodal pathway toward socially grounded real-time gaze anticipation for future embodied agents.
We curate a 40+ hour conversation-centric egocentric benchmark from three prominent datasets: Aria (1.25h), Ego4D (5.9h), and EgoCom (35h). We filter for low-egomotion clips and process them into 5-second segments for training.
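As a rough sketch of the windowing step (the egomotion score and threshold below are illustrative placeholders, not the paper's exact criterion):

```python
import numpy as np

def segment_clip(frame_motion, fps=30, seg_sec=5, motion_thresh=2.0):
    """Cut a clip into fixed-length training segments, keeping low-egomotion windows.

    frame_motion: per-frame egomotion score (e.g., median pixel displacement);
    motion_thresh is a hypothetical cutoff. Returns (start_frame, end_frame) pairs.
    """
    seg_len = int(fps * seg_sec)
    segments = []
    for start in range(0, len(frame_motion) - seg_len + 1, seg_len):
        if np.mean(frame_motion[start:start + seg_len]) < motion_thresh:
            segments.append((start, start + seg_len))
    return segments
```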
Most in-the-wild egocentric videos lack gaze annotations. We derive proxy head-gaze labels (per-frame yaw and pitch) from CoTracker point trajectories. Validated against Aria's head-mounted IMU, this proxy achieves an MAE of 0.47° in yaw and 0.22° in pitch, enabling scalable supervision without specialized hardware.
Proxy head-gaze closely tracks ground-truth IMU measurements.
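To make the label derivation concrete, the sketch below shows one way to turn point trajectories into per-frame yaw/pitch under a pinhole-camera, small-rotation assumption; the use of the median displacement, the sign convention, and the function name are our own illustrative choices rather than the paper's exact procedure.

```python
import numpy as np

def tracks_to_yaw_pitch(tracks, fx, fy):
    """Derive proxy head-gaze angles from point trajectories.

    tracks: (T, N, 2) array of tracked point positions in pixels over T frames
            (e.g., CoTracker output on mostly static background points).
    fx, fy: camera focal lengths in pixels.
    Returns a (T, 2) array of cumulative [yaw, pitch] in degrees, relative to frame 0.

    Under a pinhole model with approximately pure head rotation, a horizontal image
    shift dx of static points corresponds to a yaw change of roughly atan(dx / fx);
    likewise for pitch with dy and fy.
    """
    d = np.diff(tracks, axis=0)                   # (T-1, N, 2) frame-to-frame shifts
    dx = np.median(d[..., 0], axis=1)             # robust horizontal shift per step
    dy = np.median(d[..., 1], axis=1)             # robust vertical shift per step
    dyaw = np.degrees(np.arctan2(-dx, fx))        # scene moves left => head turned right
    dpitch = np.degrees(np.arctan2(-dy, fy))
    angles = np.cumsum(np.stack([dyaw, dpitch], axis=1), axis=0)
    return np.vstack([np.zeros((1, 2)), angles])  # zero angles at the first frame
```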
Vision: Face detection (InsightFace) + body detection (YOLOv11x) with stable ID tracking across frames (see the preprocessing sketch below).
Audio: Speaker diarization via WhisperX + pyannote to identify who is speaking when.
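The sketch below illustrates both preprocessing steps with off-the-shelf tools; checkpoint names, thresholds, and file paths are illustrative, and the WhisperX transcription/alignment half is omitted for brevity.

```python
from ultralytics import YOLO
from insightface.app import FaceAnalysis
from pyannote.audio import Pipeline

# --- Vision: body detection + ID tracking (Ultralytics), face detection (InsightFace) ---
body_tracker = YOLO("yolo11x.pt")
face_app = FaceAnalysis(name="buffalo_l")
face_app.prepare(ctx_id=0, det_size=(640, 640))

people = []  # per-frame (track IDs, body boxes, face boxes)
for result in body_tracker.track("clip.mp4", classes=[0], persist=True, stream=True):
    frame = result.orig_img                      # BGR frame
    ids = result.boxes.id                        # stable track IDs (may be None)
    bodies = result.boxes.xyxy.cpu().numpy()
    faces = [f.bbox for f in face_app.get(frame)]
    people.append((ids, bodies, faces))

# --- Audio: who is speaking when (pyannote speaker diarization) ---
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")
for turn, _, speaker in diarizer("clip.wav").itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```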
Our multimodal fusion architecture.
We fuse audio and language embeddings into the visual representation of egocentric video. Speaker-aware features are projected into the spatial regions of active speakers, creating multimodal representations that are then downsampled and decoded into gaze predictions (yaw, pitch).
Visual: Multi-Scale Vision Transformer (MViT) extracts spatiotemporal tokens from video.
Audio: Log-spectrogram windows processed by a transformer encoder, aligned to visual patches.
Language: Frozen LLM encodes diarized transcripts, broadcast across spatial locations.
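As a rough sketch of the audio and language streams (the mel settings, the text-encoder checkpoint standing in here for the frozen LLM, and the token-grid size are illustrative placeholders; the MViT visual backbone is omitted):

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

# Audio: log-mel spectrogram windows, later fed to a transformer encoder.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=80)
waveform, sr = torchaudio.load("segment.wav")            # (channels, samples)
log_spec = torch.log(mel(waveform.mean(0)) + 1e-6)        # (n_mels, frames)

# Language: a frozen text encoder over the diarized transcript.
name = "sentence-transformers/all-MiniLM-L6-v2"
tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name).eval()
with torch.no_grad():
    out = enc(**tok("Speaker A: shall we get started?", return_tensors="pt"))
text_feat = out.last_hidden_state.mean(dim=1)             # (1, D) utterance embedding

# Broadcast the utterance embedding across the visual token grid (T', H', W').
T_, H_, W_ = 8, 7, 7                                       # illustrative grid size
text_grid = text_feat[:, None, None, None, :].expand(1, T_, H_, W_, text_feat.shape[-1])
```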
We apply masked cross-attention to inject audio–language cues into the visual tokens: egocentric (wearer) speech attends globally, while cues from non-egocentric speakers are restricted to those speakers' spatial regions.
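A minimal PyTorch sketch of this speaker-conditioned masking is shown below; the module name, dimensions, and mask construction are our own illustrative choices, not the exact implementation.

```python
import torch
import torch.nn as nn

class SpeakerConditionedCrossAttention(nn.Module):
    """Illustrative masked cross-attention: visual tokens query audio-language tokens.

    Wearer (egocentric) speech is visible to every visual token; a non-egocentric
    speaker's tokens are visible only to visual tokens inside that speaker's region.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, ctx, region_mask):
        # vis:  (B, Nv, D) visual tokens (flattened T'*H'*W' grid)
        # ctx:  (B, Nc, D) audio-language tokens (one per speaker turn)
        # region_mask: (B, Nv, Nc) bool, True where a visual token may attend to a
        #   context token. Wearer-speech columns are all True (global attention);
        #   other speakers' columns are True only inside their detected boxes.
        attn_mask = ~region_mask                               # True = block attention
        attn_mask = attn_mask.repeat_interleave(self.attn.num_heads, dim=0)
        fused, _ = self.attn(query=vis, key=ctx, value=ctx, attn_mask=attn_mask)
        return vis + fused                                     # residual fusion

# Usage sketch: context token 0 = wearer speech (global), token 1 = another speaker
# whose cues reach only visual tokens 1-2 (the wearer column keeps every row valid).
B, Nv, Nc, D = 1, 4, 2, 256
vis, ctx = torch.randn(B, Nv, D), torch.randn(B, Nc, D)
mask = torch.zeros(B, Nv, Nc, dtype=torch.bool)
mask[:, :, 0] = True
mask[:, 1:3, 1] = True
out = SpeakerConditionedCrossAttention(D)(vis, ctx, mask)      # (B, Nv, D)
```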
Fused representations are passed through 3D convolutions and linear layers to predict short-horizon gaze trajectories.
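For completeness, a small sketch of such a prediction head (kernel sizes, horizon length, and layer widths are placeholders):

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Illustrative decoder: 3D convs over the fused feature grid, then linear
    layers that regress a short-horizon trajectory of (yaw, pitch) values."""
    def __init__(self, dim=256, horizon=15):
        super().__init__()
        self.horizon = horizon
        self.conv = nn.Sequential(
            nn.Conv3d(dim, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv3d(128, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.GELU(),
            nn.AdaptiveAvgPool3d(1),                    # global spatiotemporal pooling
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64, 256), nn.GELU(),
                                nn.Linear(256, horizon * 2))

    def forward(self, fused):                           # fused: (B, D, T', H', W')
        return self.fc(self.conv(fused)).view(-1, self.horizon, 2)

pred = GazeHead()(torch.randn(2, 256, 8, 7, 7))         # (2, 15, 2): future (yaw, pitch)
```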