CVPR 2026

3D-Object Perception Transformer (3PT)

Architecture of the Intrinsic Vision Model

Pushing the boundary of 3D-Object Perception Tasks. Awarded a Highlight at CVPR 2026 - top 3% of submissions.

Achieving state-of-the-art results across all 3D-Perception Tasks

Current 3D perception pipelines often lack a fundamental understanding of 3D geometry, making them brittle in high-stakes industrial environments. 3PT (3D-Object Perception Transformer) is a unified foundation model trained directly on 3D CAD prompts. It replaces complex multi-model heuristics with a streamlined, two-transformer pipeline that achieves sub-millimeter precision using only off-the-shelf RGB cameras.

+17.5AP vs prior best method

2D-Detection

Unlike methods that match images to templates after the fact, 3PT-D is directly conditioned on 3D models during the encoding process. This allows it to jointly learn object representations and localize them with superior confidence.

+12.3AP-mm vs prior best method

6D-Pose Estimation

The pose estimator performs an iterative render-and-compare loop, matching the live camera feed against internal 3D hypotheses. It scales natively to multi-view setups, fusing information from different angles to resolve visual ambiguities.
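
The render-and-compare idea can be illustrated with a deliberately small toy. The sketch below is not the 3PT implementation: the page does not describe the renderer, the comparison step, or the update rule, so all of them are assumptions here. It works in 2D, uses a hypothetical `render` that simply places model points under a pose `(tx, ty, theta)`, and uses a hand-written `compare` that returns a pose correction from point residuals. What it shares with the description above is only the loop structure: render a hypothesis, compare it with the observation, update, repeat.

```python
import math

def render(model_pts, pose):
    """Toy stand-in for a renderer: place 2D model points under a rigid pose."""
    tx, ty, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in model_pts]

def centroid(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def compare(rendered, observed):
    """Return a pose correction (dx, dy, dtheta) from rendered vs. observed points."""
    cr, co = centroid(rendered), centroid(observed)
    angles = []
    for (rx, ry), (ox, oy) in zip(rendered, observed):
        ax, ay = rx - cr[0], ry - cr[1]
        bx, by = ox - co[0], oy - co[1]
        # Signed angle that rotates the rendered offset onto the observed one.
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return (co[0] - cr[0], co[1] - cr[1], sum(angles) / len(angles))

def refine(model_pts, observed, pose, iters=50, step=0.5):
    """Repeatedly render the hypothesis, compare it to the observation, and update."""
    for _ in range(iters):
        dx, dy, dtheta = compare(render(model_pts, pose), observed)
        pose = (pose[0] + step * dx, pose[1] + step * dy, pose[2] + step * dtheta)
    return pose

# Hypothetical usage: recover a known pose from a simulated observation.
model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
observed = render(model, (0.3, -0.2, 0.4))
estimate = refine(model, observed, (0.0, 0.0, 0.0))
```

The damped update (`step=0.5`) is a design choice for the toy: each pass only moves part of the way toward the observation, which mirrors how iterative refiners take several small corrections instead of one large jump.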

5x Faster than prior best method

Cycle Time

3PT uses epipolar matching and Kernel Density Estimation (KDE) to fuse observations from multiple angles into one spatial truth. It triangulates 2D detections into 3D space, filters out ambiguities, and reaches consensus across all available data and views.
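
As a rough illustration of the triangulate-then-vote idea, the sketch below makes several assumptions that are not stated on this page: each 2D detection has already been matched across views and converted into a 3D viewing ray (camera origin plus direction) using known calibration, triangulation is done with the simple midpoint method, and consensus is taken as the candidate point with the highest Gaussian kernel density. The function names `triangulate_midpoint` and `kde_mode` and the `bandwidth` parameter are hypothetical. The epipolar matching step itself is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2."""
    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # rays are (nearly) parallel, no stable intersection
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    return tuple((o1[i] + s * d1[i] + o2[i] + t * d2[i]) / 2.0 for i in range(3))

def kde_mode(candidates, bandwidth):
    """Return the candidate 3D point with the highest Gaussian kernel density."""
    def density(p):
        return sum(
            math.exp(-sum((x - y) ** 2 for x, y in zip(p, q)) / (2.0 * bandwidth ** 2))
            for q in candidates
        )
    return max(candidates, key=density)

# Hypothetical usage: two camera rays that meet at one point, then a vote
# among several triangulated candidates, one of which is an outlier.
point = triangulate_midpoint((0, 0, 0), (1, 2, 5), (1, 0, 0), (0, 2, 5))
candidates = [(1.0, 2.0, 5.0), (1.01, 2.0, 5.0), (0.99, 2.0, 5.0), (4.0, 4.0, 4.0)]
consensus = kde_mode(candidates, bandwidth=0.05)
```

Picking the density mode instead of the mean is what lets a mismatched detection (the outlier above) be ignored instead of dragging the fused estimate away from the cluster.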

Depth-free precision

Traditionally, high-accuracy pose estimation required high-end depth sensors. 3PT breaks this dependency. By utilizing a novel multi-view RGB refinement loop, it achieves state-of-the-art accuracy that surpasses many RGB-D methods. It eliminates depth artifacts caused by reflective surfaces or harsh lighting, delivering "industrial-grade" results with standard camera hardware.

One-shot, highly versatile

No retraining. No per-object fine-tuning. 3PT is truly "one-shot". With one 3D CAD model (the "prompt"), it can detect, segment, and estimate the 6D pose of that object in any scene. It generalizes across diverse environments, from cluttered household bins to complex industrial "house-of-cards" structures.

Industrially robust

3PT is built not for benchmarks but for real-world industrial automation on the shop and factory floor. We demonstrate its reliability through over 100 successful and repeatable real-world attempts at robotics tasks representative of manufacturing, including:

- DIMM-Insertion: Vision-only insertion with a 0.57mm clearance.
- Sheet Metal Bin Picking: Clearing 100+ densely cluttered, thin parts.
