
Meta’s V‑JEPA AI Model Demonstrates Human‑Like Physical Intuition

Wired AI

Design Philosophy and Core Architecture

V‑JEPA (Video Joint Embedding Predictive Architecture) was created by Meta to move beyond traditional pixel‑space video models. Instead of predicting individual pixel values, V‑JEPA masks the same spatial regions across the frames of a clip and passes the masked video through an encoder that produces compact latent representations. A second encoder processes the full, unmasked video to produce a parallel set of latent codes. A predictor network then learns to map the latents of the masked input to the latents of the full input, so the model effectively learns to reconstruct the essential content of a scene in an abstract space rather than in extraneous pixel detail.
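
As a rough illustration of this objective, the PyTorch sketch below uses toy stand‑in modules. The names `PatchEncoder`, `LatentPredictor`, and `jepa_training_step` are placeholders invented for this example, and the real context and target encoders are large video transformers rather than single linear layers; the point is only to show that the loss compares predicted and target latent codes at masked positions, never raw pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in modules so the sketch runs end to end; the real encoders are
# large video transformers, not single linear layers.
class PatchEncoder(nn.Module):
    def __init__(self, patch_dim, latent_dim):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)

    def forward(self, patches):                 # patches: (B, N, patch_dim)
        return self.proj(patches)               # latents: (B, N, latent_dim)

class LatentPredictor(nn.Module):
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, latent_dim),
                                 nn.GELU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, context_latents):
        return self.net(context_latents)

def jepa_training_step(patches, mask, context_encoder, target_encoder, predictor):
    # patches: (B, N, patch_dim) spatio-temporal video patches
    # mask:    (B, N) boolean, True where a patch is hidden from the context encoder
    visible = patches * (~mask).unsqueeze(-1)   # zero out the masked patches
    context_latents = context_encoder(visible)

    # The target encoder sees the full, unmasked clip; its outputs carry no
    # gradient (in practice it is a moving-average copy of the context encoder).
    with torch.no_grad():
        target_latents = target_encoder(patches)

    # Predict a latent code for every patch from the visible context, then
    # regress onto the targets only at masked positions: the loss is computed
    # entirely in latent space, never on raw pixels.
    predicted_latents = predictor(context_latents)
    return F.l1_loss(predicted_latents[mask], target_latents[mask])
```

Feeding random tensors for `patches` and `mask` returns a scalar loss that would be back‑propagated through the context encoder and predictor, while the target encoder is typically updated as a slow moving average rather than by gradients.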

This approach enables the model to disregard irrelevant information—such as the flutter of leaves—while concentrating on critical aspects like object positions, colors, and motions. The latent‑space training reduces the amount of labeled data needed for downstream tasks, because the model already captures high‑level visual concepts during pre‑training.

Demonstrating Physical Intuition

Researchers evaluated V‑JEPA on the IntPhys benchmark, which measures an AI’s ability to judge whether video events are physically plausible. V‑JEPA achieved nearly 98% accuracy, a dramatic improvement over a well‑known pixel‑space model that performed only slightly better than chance. The model also quantified “surprise” by calculating prediction error when future frames deviated from learned expectations. Errors spiked when videos presented impossible events, such as an object disappearing behind an occluder and failing to reappear, mirroring the intuitive response observed in infants.
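
The “surprise” measurement can be pictured as a simple loop over a clip’s latent codes. The sketch below is a minimal illustration, assuming access to per‑frame latents from a frozen encoder and some next‑frame predictor; the function name and interface are hypothetical, not the benchmark’s actual evaluation code.

```python
import torch
import torch.nn.functional as F

def surprise_curve(frame_latents, predictor):
    """Score each frame by how badly the model predicted it.

    frame_latents: (T, D) latent codes for successive frames, e.g. from a
                   frozen V-JEPA-style encoder.
    predictor:     any callable mapping the latents of frames 0..t-1 to a
                   predicted latent for frame t (an illustrative interface).
    """
    scores = []
    for t in range(1, frame_latents.shape[0]):
        with torch.no_grad():
            predicted = predictor(frame_latents[:t])
        # Larger prediction error = more "surprise" at this frame.
        scores.append(F.mse_loss(predicted, frame_latents[t]).item())
    return scores
```

A spike in the returned list flags the moment a video violates expectations, for example when an object hidden behind an occluder fails to reappear.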

These results suggest that V‑JEPA can develop a rudimentary sense of object permanence, constancy of shape and color, and basic gravitational effects solely from video exposure, without hand‑crafted physics priors.

Application to Robotics and Limitations

Building on its video‑understanding capabilities, the V‑JEPA team fine‑tuned a predictor network using roughly 60 hours of robot data, enabling the model to plan simple manipulation actions. This demonstrates the potential for V‑JEPA to support autonomous robots that need an intuitive grasp of physical interactions.
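
Conceptually, planning with such a predictor amounts to searching over candidate actions in latent space. The sketch below illustrates one naive step of that idea under stated assumptions: `action_predictor` stands in for a hypothetical action‑conditioned predictor fine‑tuned on robot data, and real systems sample and refine whole action sequences rather than single actions.

```python
import torch

def plan_one_step(current_latent, goal_latent, action_predictor, candidate_actions):
    """Choose the candidate action whose predicted outcome lands closest to the goal.

    A rough sketch of latent-space planning; the names and interface here are
    illustrative assumptions, not Meta's released planning code.
    """
    best_action, best_distance = None, float("inf")
    for action in candidate_actions:
        with torch.no_grad():
            predicted_latent = action_predictor(current_latent, action)
        distance = torch.norm(predicted_latent - goal_latent).item()
        if distance < best_distance:
            best_action, best_distance = action, distance
    return best_action
```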

However, the model’s memory window spans only a few seconds of video, limiting its ability to predict longer‑term dynamics. When tested on a more demanding benchmark, IntPhys 2, V‑JEPA and comparable models performed only marginally better than chance. Researchers liken the model’s short‑term memory to that of a goldfish, indicating a need for broader temporal context in future versions.

Outlook and Expert Perspectives

Experts praised V‑JEPA’s ability to learn intuitive physics from raw video, noting its alignment with developmental findings that infants acquire such knowledge with minimal exposure. Nonetheless, critics highlighted the absence of uncertainty quantification, a factor that could improve decision‑making in ambiguous scenarios.

Meta’s release of a 1.2‑billion‑parameter V‑JEPA 2 model, trained on 22 million videos, marks a significant scale‑up, yet the core challenges of temporal memory and uncertainty remain. Ongoing research aims to extend the model’s horizon and embed probabilistic reasoning, potentially bringing AI closer to human‑like perception of the physical world.


Source: Wired AI
