Lightspeed Computation of Geometry-aware Semantic Embeddings

TBD, 2025
¹TUM  ²Munich Center for Machine Learning

Qualitative Results. From left to right: source image with the query keypoint, target image with the ground-truth correspondence, the baseline method, and our method. Our method predicts accurate correspondences in a fraction of the time required by the baseline.

Abstract

Recent advances in feature computation have revealed that self-supervised feature extractors can recognize semantic correspondences. However, these features often lack an understanding of objects' underlying geometry and 3D structure. In this paper, we focus on object categories with well-defined shapes and address the challenge of matching semantically similar parts that are distinguished only by their geometric properties, e.g., left/right eyes or front/back legs. We propose a novel, optimal-transport-based learning method that is faster than previous supervised methods while outperforming them in both semantic matching and geometric understanding.

Overview



a) We develop a new method for image keypoint prediction based on optimal transport and a KL-regularized soft assignment; it enables efficient and robust keypoint matching and shows an improved understanding of geometric features (a minimal sketch of such a soft-assignment layer follows this list).
b) We demonstrate that our novel formulation achieves state-of-the-art performance while being lightning fast; our method outperforms competitors on multiple datasets while taking 98% less time.
c) We provide an extensive analysis of the common PCK metric and complement it with a new evaluation that offers better insights into the geometric understanding of the methods as well as the accuracy of their similarity predictions.
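
To make the soft-assignment idea in (a) concrete, here is a minimal PyTorch sketch of a plain entropy-regularized optimal transport (Sinkhorn) layer. Note this is an illustrative baseline only: our KL-regularized layer with parameters $\lambda, \alpha, \beta$ differs in its regularization, and the function name, uniform marginals, and hyperparameters below are assumptions for the sketch.

```python
import math
import torch

def sinkhorn_assignment(sim, lam=0.1, n_iters=50):
    """Soft assignment via entropy-regularized optimal transport (illustrative sketch).

    sim     : (N, M) similarity matrix between source/target keypoint features.
    lam     : regularization strength; larger values yield softer assignments.
    Returns : (N, M) transport plan matching uniform row/column marginals.
    """
    log_K = sim / lam                    # log of the Gibbs kernel exp(sim / lam)
    log_a = -math.log(sim.shape[0])      # uniform source marginal, 1/N
    log_b = -math.log(sim.shape[1])      # uniform target marginal, 1/M
    log_u = torch.zeros(sim.shape[0], device=sim.device)
    log_v = torch.zeros(sim.shape[1], device=sim.device)
    for _ in range(n_iters):             # log-domain Sinkhorn updates for stability
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
    return torch.exp(log_u[:, None] + log_K + log_v[None, :])
```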

Method



Training Scheme Overview.

Given a dataset with keypoint annotations, we can train the attention head by comparing the estimated assignment $\widehat{P}^{\lambda,\alpha,\beta}$, obtained from the KL-regularized optimal transport layer, to the sparse ground-truth assignment, which gives us the following loss function: $$ \mathcal{L} = - \sum_{(i,j)\in \mathcal{M}^+ \cup \mathcal{M}^0} \log \widehat{P}^{\lambda,\alpha,\beta}_{i,j} \; - \sum_{(i,j)\in \mathcal{M}^-} \log \big(1- \widehat{P}^{\lambda,\alpha,\beta}_{i,j}\big).$$
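
For concreteness, the following is a minimal PyTorch sketch of this loss, assuming the OT layer returns a dense soft-assignment matrix (here called `P_hat`) and that the index sets $\mathcal{M}^+ \cup \mathcal{M}^0$ and $\mathcal{M}^-$ are given as tensors of $(i, j)$ pairs; the names and shapes are illustrative, not the paper's actual implementation.

```python
import torch

def matching_loss(P_hat, pos_pairs, neg_pairs, eps=1e-8):
    """Sparse supervision loss over an estimated soft assignment.

    P_hat     : (N, M) soft-assignment matrix from the OT layer.
    pos_pairs : (K, 2) long tensor of (i, j) indices in M+ u M0.
    neg_pairs : (L, 2) long tensor of (i, j) indices in M-.
    """
    p_pos = P_hat[pos_pairs[:, 0], pos_pairs[:, 1]]
    p_neg = P_hat[neg_pairs[:, 0], neg_pairs[:, 1]]
    # -sum log P over matches, -sum log(1 - P) over non-matches
    return -torch.log(p_pos + eps).sum() - torch.log(1.0 - p_neg + eps).sum()
```

The small `eps` guards against taking the log of zero when the assignment saturates; it is a numerical-stability convention, not part of the loss definition above.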