
Unlike methods that match images to templates after the fact, 3PT-D is directly conditioned on 3D models during the encoding process. This allows it to jointly learn object representations and localize them with superior confidence.

It performs an iterative render-and-compare loop, matching the live camera feed against internal 3D hypotheses. It natively scales to multi-view setups, fusing information from different angles to resolve visual ambiguities.
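
The render-and-compare idea can be illustrated with a deliberately small sketch. It is not 3PT's implementation: here the "render" step is only a pinhole projection of a few model points, the pose is restricted to translation, and the names `project` and `refine_translation`, the focal length, and the cube model are all assumptions made for illustration.

```python
import numpy as np

def project(points, t, focal=500.0):
    """Toy 'render': pinhole projection of model points translated by t."""
    p = points + t
    return focal * p[:, :2] / p[:, 2:3]

def refine_translation(points, observed, t0, iters=20, eps=1e-6):
    """Gauss-Newton loop: render the hypothesis, compare, update, repeat."""
    t = np.asarray(t0, dtype=float)
    for _ in range(iters):
        residual = (project(points, t) - observed).ravel()  # compare step
        # Forward-difference Jacobian of the residual w.r.t. translation.
        jac = np.empty((residual.size, 3))
        for k in range(3):
            dt = np.zeros(3)
            dt[k] = eps
            shifted = (project(points, t + dt) - observed).ravel()
            jac[:, k] = (shifted - residual) / eps
        step = np.linalg.lstsq(jac, -residual, rcond=None)[0]
        t = t + step
        if np.linalg.norm(step) < 1e-10:
            break
    return t

# Hypothetical model: corners of a 10 cm cube centred at the origin.
model = 0.05 * np.array(
    [[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float
)
true_t = np.array([0.05, -0.02, 1.0])
observed = project(model, true_t)  # stands in for the live camera image
estimate = refine_translation(model, observed, t0=[0.0, 0.0, 0.8])
print(estimate)
```

A real system would compare rendered and observed images (or learned features) and update a full 6D pose. The loop structure of hypothesize, render, compare, and update is the part this sketch is meant to show.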

3PT uses epipolar matching and Kernel Density Estimation (KDE) to fuse observations from multiple angles into one spatial truth. It triangulates 2D detections into 3D space, filters out ambiguities, and reaches consensus across all available data and views.
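
As a rough illustration of the triangulate-then-agree step, the sketch below uses standard linear (DLT) triangulation followed by a simple Gaussian kernel density vote over candidate 3D points. It assumes that detections have already been matched across views (the epipolar matching itself is not shown). The function names `triangulate` and `kde_mode`, the camera matrices, and the bandwidth are hypothetical choices, not details of 3PT.

```python
import numpy as np

def triangulate(proj_mats, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector belonging to
    # the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def kde_mode(candidates, bandwidth=0.01):
    """Return the candidate with the highest Gaussian kernel density."""
    c = np.asarray(candidates, dtype=float)
    d2 = ((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-0.5 * d2 / bandwidth**2).sum(axis=1)
    return c[np.argmax(density)]

def pixel(P, X):
    """Project a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras sharing intrinsics K, separated by a 20 cm baseline.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
point = np.array([0.1, 0.05, 1.0])

recovered = triangulate([P1, P2], [pixel(P1, point), pixel(P2, point)])

# Candidates from different view pairs: three agree, one is an outlier.
candidates = [
    point,
    point + [0.001, 0.0, 0.0],
    point + [-0.001, 0.001, 0.0],
    [0.5, 0.5, 2.0],
]
consensus = kde_mode(candidates)
print(recovered, consensus)
```

The density vote favors the cluster of mutually consistent candidates, so a single bad triangulation has little influence on the fused estimate.
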

Traditionally, high-accuracy pose estimation required high-end depth sensors. 3PT breaks this dependency. Using a novel multi-view RGB refinement loop, it achieves state-of-the-art accuracy that surpasses many RGB-D methods. It avoids the depth artifacts caused by reflective surfaces or harsh lighting, delivering "industrial-grade" results with standard camera hardware.

No retraining. No per-object fine-tuning. 3PT is truly "one-shot." With one 3D CAD model (the "prompt"), it can detect, segment, and estimate the 6D pose of that object in any scene. It generalizes across diverse environments, from cluttered household bins to complex industrial "house-of-cards" structures.

3PT is built not for benchmarks but for industrial automation on the shop and factory floor. We demonstrate its reliability through more than 100 successful, repeatable real-world attempts at realistic manufacturing robotics tasks, including:
- DIMM-Insertion: Vision-only insertion with a 0.57 mm clearance.
- Sheet Metal Bin Picking: Clearing 100+ densely cluttered, thin parts.

