Qualitative Results. From left to right: source image with the query keypoint, target image with the ground-truth correspondence, the baseline method, and our method. Our method predicts accurate correspondences in a fraction of the time required by the baseline.
Recent advances in feature computation have shown that self-supervised feature extractors can recognize semantic correspondences. However, these features often lack an understanding of objects' underlying geometry and 3D structure. In this paper, we focus on object categories with well-defined shapes and address the challenge of matching semantically similar parts that are distinguished only by their geometric properties, e.g., left/right eyes or front/back legs. We propose a novel optimal-transport-based learning method that is faster than previous supervised methods and outperforms them in both semantic matching and geometric understanding.
Training Scheme Overview.
Given a dataset with keypoint annotations, we can train the attention head by comparing the estimated assignment obtained from the KL-regularized Optimal Transport layer, $\widehat{P}^{\lambda,\alpha,\beta}$, to the sparse ground-truth assignment, yielding the following loss function: $$ \mathcal{L} = - \sum_{(i,j)\in \mathcal{M}^+ \cup \mathcal{M}^0} \log \widehat{P}^{\lambda,\alpha,\beta}_{i,j} \; - \sum_{(i,j)\in \mathcal{M}^-} \log \bigl(1 - \widehat{P}^{\lambda,\alpha,\beta}_{i,j}\bigr).$$
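For concreteness, below is a minimal PyTorch sketch of this loss. It assumes the output of the Optimal Transport layer, $\widehat{P}^{\lambda,\alpha,\beta}$, is available as a dense assignment matrix and that the sets $\mathcal{M}^+$, $\mathcal{M}^0$, and $\mathcal{M}^-$ are given as tensors of index pairs; the function name `matching_loss` and the `eps` stabilizer are illustrative choices, not part of the paper.

```python
import torch

def matching_loss(P_hat, pos_matches, neutral_matches, neg_matches, eps=1e-8):
    """Negative log-likelihood loss on an estimated assignment matrix.

    P_hat           : (N, M) tensor of assignment probabilities from the
                      KL-regularized OT layer (rows: source keypoints,
                      columns: target keypoints).
    pos_matches     : (K+, 2) LongTensor of index pairs in M^+.
    neutral_matches : (K0, 2) LongTensor of index pairs in M^0.
    neg_matches     : (K-, 2) LongTensor of index pairs in M^-.
    """
    loss = torch.zeros((), device=P_hat.device)

    # - sum over M^+ and M^0 of log P_ij
    for matches in (pos_matches, neutral_matches):
        if matches.numel() > 0:
            p = P_hat[matches[:, 0], matches[:, 1]]
            loss = loss - torch.log(p + eps).sum()

    # - sum over M^- of log (1 - P_ij)
    if neg_matches.numel() > 0:
        p = P_hat[neg_matches[:, 0], neg_matches[:, 1]]
        loss = loss - torch.log(1.0 - p + eps).sum()

    return loss
```

The small `eps` guards against taking the log of zero when the OT layer assigns (near-)zero mass to an annotated pair; in practice the loss would be backpropagated through the OT layer to train the attention head.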