The Intrinsic Vision Model

The Intrinsic Vision Model (IVM) is an industrial-grade foundation model built to handle complex perception tasks such as pose estimation, detection, and tracking. It reasons directly with CAD models, delivering zero-shot, sub-millimeter accuracy for new parts without any part-specific training.

Made for industrial use

The Intrinsic Vision Model leverages a growing set of specialized perception transformers to achieve best-in-class results on a wide variety of perception tasks across object types, materials, environments, and lighting conditions. On the BOP industrial challenges, we achieved a 12% improvement in mm-level pose accuracy and a 16% improvement in detection. This makes it easy to deploy state-of-the-art perception for tasks like object detection and pose estimation.

Pixel-perfect precision

Trained for multi-view, pixel-perfect reasoning, IVM achieves sub-mm accuracy using only RGB cameras, reducing hardware costs by roughly 5X to 20X compared with industry-standard depth-sensing systems. The ability to leverage cheaper hardware while delivering world-class perception could substantially shift the economics of AI-enabled robotic systems. The model is highly accurate and precise, making tough tasks like tight-fit insertions possible without expensive custom fixtures.
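To give a sense of why multiple calibrated RGB views are enough to recover metric 3D geometry without a depth sensor, here is a minimal sketch of classical two-view triangulation. This is generic textbook math, not IVM's learned multi-view reasoning, and the camera parameters and points below are purely hypothetical.

```python
# Minimal sketch: linear (DLT) triangulation from two calibrated RGB views.
# Illustrates that pixel correspondences across views pin down metric 3D points;
# it is NOT IVM's method, and all matrices/points here are made up.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Triangulate one point seen in two views.

    P1, P2: 3x4 camera projection matrices (intrinsics @ [R|t]).
    x1, x2: pixel coordinates (u, v) of the same physical point in each view.
    Returns the 3D point in the world frame.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                      # de-homogenize

def project(P, X):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical calibrated setup: two cameras 10 cm apart, both looking down +z.
K = np.array([[1200.0, 0, 640], [0, 1200.0, 480], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.10], [0.0], [0.0]])])

# Project a known point, then recover it from its pixels alone.
X_true = np.array([0.02, -0.01, 0.50])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.abs(X_est - X_true))                # noise-free case: errors are ~0
```

In practice the hard part is finding pixel-accurate correspondences on textureless, shiny industrial parts, which is where learned multi-view reasoning carries the load rather than the geometry itself.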

Zero-training breakthrough

Because the model is trained to reason directly with CAD models (3D files), it achieves state-of-the-art performance on new parts with zero additional training time. Compared to previous zero-training approaches, we achieve a massive 29% improvement in industrial pose estimation, beating all prior approaches, including fully trained ones. We also achieve a 14% improvement in augmented-reality pose estimation.
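As a rough illustration of what "reasoning directly with a CAD model" can look like, the sketch below samples points from a CAD-style triangle mesh and scores candidate poses by projecting them into the image. This is a generic render-and-compare scheme for illustration only, not IVM's architecture; the mesh, camera, and scoring function are all invented for this example.

```python
# Minimal sketch of one zero-training idea: sample a CAD mesh's surface and
# score candidate poses by projecting the samples against an observed silhouette.
# Generic render-and-compare for illustration; not IVM's architecture.
import numpy as np

def sample_surface(vertices, faces, n=500, rng=np.random.default_rng(0)):
    """Area-weighted uniform sampling of points on a triangle mesh."""
    tri = vertices[faces]                                        # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = rng.choice(len(faces), n, p=areas / areas.sum())
    u, v = rng.random(n), rng.random(n)
    u, v = np.where(u + v > 1, 1 - u, u), np.where(u + v > 1, 1 - v, v)
    t = tri[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])

def project(points, K, R, t):
    """Pinhole projection of 3D points under pose (R, t)."""
    cam = points @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def yaw(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def score(candidate_uv, observed_uv):
    """Mean distance from each projected template pixel to the nearest observed pixel."""
    d = np.linalg.norm(candidate_uv[:, None, :] - observed_uv[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Hypothetical part: a unit cube written out as CAD-style mesh data.
verts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
faces = np.array([[0, 1, 3], [0, 3, 2], [4, 6, 7], [4, 7, 5], [0, 4, 5], [0, 5, 1],
                  [2, 3, 7], [2, 7, 6], [0, 2, 6], [0, 6, 4], [1, 5, 7], [1, 7, 3]])
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
pts = sample_surface(verts, faces)

# Synthetic "observation": the part seen at an unknown yaw of 0.6 rad.
trans = np.array([0.0, 0.0, 4.0])
observed = project(pts, K, yaw(0.6), trans)

# Pick the candidate rotation whose projection best matches the observation.
candidates = np.linspace(0, np.pi, 64)
best = min(candidates, key=lambda a: score(project(pts, K, yaw(a), trans), observed))
print(f"recovered yaw ~ {best:.2f} rad (true 0.60)")
```

The point of the sketch is only that the CAD file itself supplies everything a pose hypothesis needs to be checked against an image, which is why no per-part training data has to be collected.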

Versatile applications

Using a growing set of function-specific transformers for perception tasks, including segmentation, tracking, and point cloud generation, the model generalizes efficiently to a wide variety of objects. Whether for industrial, household, or AR/VR use cases, the model adapts quickly and robustly.
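One way to picture this modular design is as a set of task-specific modules behind shared interfaces, composed into a single perception loop. The sketch below is hypothetical: the Segmenter, PoseEstimator, and Tracker interfaces and the perceive function are invented for illustration and are not IVM's API.

```python
# Hypothetical composition of function-specific perception modules.
# Interfaces and names are invented for illustration; this is not IVM's API.
from __future__ import annotations
from dataclasses import dataclass
from typing import Protocol
import numpy as np

@dataclass
class Detection:
    mask: np.ndarray          # per-pixel instance mask for one object
    pose: np.ndarray | None   # 4x4 object-to-camera transform, once estimated

class Segmenter(Protocol):
    def __call__(self, image: np.ndarray) -> list[Detection]: ...

class PoseEstimator(Protocol):
    def __call__(self, image: np.ndarray, det: Detection, cad_path: str) -> np.ndarray: ...

class Tracker(Protocol):
    def __call__(self, prev_pose: np.ndarray, image: np.ndarray) -> np.ndarray: ...

def perceive(frames, cad_path, segment: Segmenter, estimate: PoseEstimator, track: Tracker):
    """Segment and estimate pose on the first frame, then track on later frames."""
    poses = []
    for i, frame in enumerate(frames):
        if i == 0:
            det = segment(frame)[0]                     # assume one target instance
            det.pose = estimate(frame, det, cad_path)
        else:
            det.pose = track(det.pose, frame)
        poses.append(det.pose)
    return poses
```

Keeping each capability behind its own narrow interface is what lets new task-specific modules (for example, point cloud generation) be added without reworking the rest of the pipeline.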