World models require robust relational understanding to support prediction, reasoning, and control. While
object-centric representations provide a useful abstraction, they alone are not sufficient to capture
interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world
model that extends masked joint embedding prediction from image patches to object-centric representations.
By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA
induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making
interaction reasoning essential.
Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement
of about 20% in counterfactual reasoning compared to the same architecture without object-level masking.
On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the
total latent input features required by patch-based world models, while achieving comparable
performance.
Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive
bias via latent interventions.
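
To make the object-level masking idea concrete, the following is a minimal sketch under stated assumptions, not the C-JEPA implementation. It assumes each frame is already encoded as a list of per-object latent vectors ("slots"), that masked objects are replaced by a shared mask token, and that the loss is computed only on masked objects. The names `mask_objects`, `toy_predictor`, and `masked_loss` are hypothetical, and the mean-of-visible-objects predictor merely stands in for a learned network.

```python
import random


def mask_objects(slots, num_masked, mask_token, rng):
    """Replace whole object slots with a mask token (assumed masking scheme).

    slots: list of N object vectors (each a list of D floats) for one frame.
    Returns (masked_slots, masked_indices).
    """
    masked_idx = sorted(rng.sample(range(len(slots)), num_masked))
    masked = [list(mask_token) if i in masked_idx else list(s)
              for i, s in enumerate(slots)]
    return masked, masked_idx


def toy_predictor(masked_slots, masked_idx):
    """Hypothetical stand-in for a learned predictor.

    Fills every masked slot with the mean of the visible slots, so the
    prediction for an object depends only on the *other* objects.
    """
    visible = [s for i, s in enumerate(masked_slots) if i not in masked_idx]
    dim = len(masked_slots[0])
    mean = [sum(s[d] for s in visible) / len(visible) for d in range(dim)]
    return [list(mean) if i in masked_idx else list(s)
            for i, s in enumerate(masked_slots)]


def masked_loss(pred, target, masked_idx):
    """Mean squared error restricted to the masked object slots."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
slots = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 objects, 4-dim latents
mask_token = [0.0] * 4  # assumed fixed token; a real model would likely learn it
masked, idx = mask_objects(slots, num_masked=2, mask_token=mask_token, rng=rng)
pred = toy_predictor(masked, idx)
loss = masked_loss(pred, slots, idx)
print("masked objects:", idx, "loss:", loss)
```

Because a masked slot carries no information about its own object, any reduction in this loss has to come from the states of the remaining objects, which is the sense in which object-level masking rules out shortcut solutions.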