[Videos are entanglements of space and time.]
Around one year ago, we released VSI-Bench, in which we studied visual spatial intelligence: a fundamental but missing pillar of current MLLMs.
Today, we are excited to introduce Cambrian-S, our next step, which goes beyond visual spatial intelligence to spatial supersensing.
The core idea behind our work is this: we believe real supersensing intelligence requires the ability not only to see, but also to actively anticipate, select, and organize sensory input by constructing an internal world model.
👇 Scroll down to delve deeper into our position, analysis, explorations, and findings along this supersensing journey.
🧵[1/n]
Introducing Cambrian-S
it’s a position, a dataset, a benchmark, and a model
but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶