GECO: Geometrically Consistent Embedding with Lightspeed Inference

ICCV, 2025
TUM, Munich Center for Machine Learning

Abstract

Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs).
We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, cutting computation time by 98.2% relative to prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning.

Lightspeed



Our formulation yields geometry-aware features that enable state-of-the-art correspondence estimation at a fraction of the cost: GECO surpasses competitors on multiple datasets while reducing computation time by 98.2%.

Figure panels: Source Image, Target Image, 🦎 GECO vs. Geo [1].



After marking a query location in the source image, we can observe the correspondence in the target image.

Figure panels: Source Image, 🦎 GECO.

Segmentation



We conduct object part segmentation evaluation on PascalParts. Our method effectively separates parts, indicating that it learns meaningful, dense feature representations.

Figure panels: Source Image, GT, 🦎 GECO vs. DINOv2 [4], 🦎 GECO vs. Geo [1].

Tracking



From left to right: source image with query keypoint, argmax correspondence of feature embedding in each consecutive frame.
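The argmax correspondence used in this visualization can be sketched as follows. This is a minimal illustration, not our released code: it assumes dense feature maps stored as `(H, W, C)` NumPy arrays and cosine similarity as the matching score, and the function name `argmax_correspondence` is hypothetical.

```python
import numpy as np

def argmax_correspondence(src_feats, tgt_feats, query_yx):
    """Return the (row, col) in the target whose feature best matches the query.

    src_feats, tgt_feats: dense feature maps of shape (H, W, C) (assumed layout).
    query_yx: (row, col) of the query location in the source feature map.
    """
    h, w, c = tgt_feats.shape
    # Feature vector at the query location, L2-normalized.
    q = src_feats[query_yx[0], query_yx[1]]
    q = q / (np.linalg.norm(q) + 1e-8)
    # Flatten the target map to (H*W, C) and normalize each feature.
    t = tgt_feats.reshape(-1, c)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity of the query to every target location.
    sim = t @ q
    # Convert the flat argmax index back to (row, col).
    return divmod(int(np.argmax(sim)), w)
```

For tracking, the same query feature would be matched against the feature map of each consecutive frame.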

Figure panels: Query Point, 🦎 GECO vs. DINOv2 [4].

Pixel Warping



🦎 GECO warps pixels with geometric consistency.

Method



We propose a novel loss function and a lightweight architecture for image representation learning, leveraging optimal transport and KL-regularized soft assignment.
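To give a feel for how optimal transport turns a cost matrix into a soft assignment, here is a generic entropy-regularized Sinkhorn iteration. It is a stand-in for illustration only and is not our OT layer: it assumes uniform marginals and a single regularization strength `reg`, and it does not model the $\lambda$, $\alpha$, $\beta$ parameters of our formulation. The function name `sinkhorn` is hypothetical.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Generic entropic OT (illustrative stand-in, not the GECO layer).

    cost: (n, m) matrix of pairwise costs, e.g. 1 - cosine similarity.
    Returns an (n, m) transport plan with approximately uniform marginals.
    """
    n, m = cost.shape
    K = np.exp(-cost / reg)          # Gibbs kernel
    a = np.full(n, 1.0 / n)          # assumed uniform source marginal
    b = np.full(m, 1.0 / m)          # assumed uniform target marginal
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows to match a
        v = b / (K.T @ u)            # rescale columns to match b
    return u[:, None] * K * v[None, :]
```

Smaller values of `reg` give sharper, closer to one-to-one assignments, at the cost of slower convergence and possible numerical underflow in this naive form.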

Given a dataset with keypoint annotations, we train the LoRA adapter by comparing the estimated assignment $\widehat{\mathbf{P}}^{\lambda,\alpha,\beta}$, obtained from the KL-regularized optimal transport layer, to the sparse ground-truth assignment. This gives the following loss function: $$ \mathcal{L} = - \sum_{(i,j)\in \mathcal{M}^+ \cup \mathcal{M}^0} \log \widehat{\mathbf{P}}^{\lambda,\alpha,\beta}_{i,j} \, - \sum_{(i,j)\in \mathcal{M}^-} \log \left(1 - \widehat{\mathbf{P}}^{\lambda,\alpha,\beta}_{i,j}\right).$$
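The loss itself is a binary cross-entropy evaluated only on the annotated entries of the assignment matrix. The sketch below computes it for an already estimated matrix `P`. It is a minimal illustration under assumptions: `positives` stands for the index pairs in $\mathcal{M}^+ \cup \mathcal{M}^0$ and `negatives` for those in $\mathcal{M}^-$, how these sets are built from the annotations is not shown, and the function name `assignment_loss` and the clipping constant `eps` are hypothetical.

```python
import numpy as np

def assignment_loss(P, positives, negatives, eps=1e-8):
    """Sparse cross-entropy between an estimated assignment and annotations.

    P: (n, m) estimated soft assignment with entries in [0, 1].
    positives: list of (i, j) pairs that should be matched (M+ and M0).
    negatives: list of (i, j) pairs that should not be matched (M-).
    """
    P = np.asarray(P, dtype=float)
    pos = np.asarray(positives, dtype=int).reshape(-1, 2)
    neg = np.asarray(negatives, dtype=int).reshape(-1, 2)
    # -sum log P_ij over annotated matches (clipped for numerical safety).
    pos_term = -np.log(np.clip(P[pos[:, 0], pos[:, 1]], eps, 1.0)).sum()
    # -sum log (1 - P_ij) over annotated non-matches.
    neg_term = -np.log(np.clip(1.0 - P[neg[:, 0], neg[:, 1]], eps, 1.0)).sum()
    return float(pos_term + neg_term)
```

Entries of `P` that appear in neither set contribute nothing, which is what allows training from sparse keypoint annotations.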

Citation