Recent advancements in feature computation have revealed that self-supervised feature extractors can recognize semantic correspondences. However, these features often lack an understanding of objects' underlying 3D geometry. In this paper, we focus on learning features capable of semantically characterizing parts distinguished by their geometric properties, e.g., left/right eyes or front/back legs.
We propose GECO, a novel optimal-transport-based learning method that produces geometrically coherent features which characterize symmetric points well. GECO uses a lightweight model architecture that enables fast inference, processing images at 30 fps. Our method is interpretable and generalizes across datasets, achieving state-of-the-art performance on the PF-Pascal, AP-10K, and CUB datasets, improving by 6.0%, 6.2%, and 4.1%, respectively. We achieve a 98.2% speed-up over previous methods by using a smaller backbone and a more efficient training scheme. Finally, we find PCK insufficient for analyzing the geometric properties of the features. Hence, we expand our analysis, proposing novel metrics and insights that will be instrumental in developing more geometrically aware methods.
Our formulation yields geometrically aware features that enable state-of-the-art performance on correspondence estimation while being significantly more efficient: it surpasses competitors on multiple datasets while reducing computation time by 98.2%.
After marking a query location in the source image, we can observe the correspondence in the target image.
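The lookup described in this caption amounts to a nearest-neighbour search in feature space: take the descriptor at the query pixel and find the argmax of its cosine similarity over the dense target feature map. Below is a minimal sketch assuming `(H, W, C)` feature maps from some extractor; the function name `argmax_correspondence` is a hypothetical helper, not part of any released code.

```python
import numpy as np

def argmax_correspondence(src_feats, tgt_feats, query_yx):
    """Return the target pixel whose feature best matches the source query.

    src_feats, tgt_feats: (H, W, C) dense feature maps.
    query_yx: (y, x) query location in the source image.
    """
    q = src_feats[query_yx]                       # (C,) query descriptor
    q = q / (np.linalg.norm(q) + 1e-8)            # L2-normalise the query
    t = tgt_feats / (np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8)
    sim = t @ q                                   # (H, W) cosine similarity
    return np.unravel_index(np.argmax(sim), sim.shape)
```

Repeating this per frame gives the tracked correspondences shown in the video figures.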
🦎 GECO
We conduct object part segmentation evaluation on PascalParts. Our method effectively separates parts, indicating that it learns meaningful, dense feature representations.
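A simple way to probe whether dense features separate object parts is to cluster them and inspect the resulting label map. The sketch below runs a small numpy k-means over an `(H, W, C)` feature map; it is an illustrative baseline under that assumption, not the paper's PascalParts evaluation protocol.

```python
import numpy as np

def cluster_parts(feats, k, iters=20, seed=0):
    """Unsupervised part map via k-means over dense features.

    feats: (H, W, C) feature map; returns an (H, W) integer label map.
    """
    H, W, C = feats.shape
    x = feats.reshape(-1, C).astype(float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances from every pixel feature to every centre: (HW, k)
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = x[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)   # update non-empty clusters
    return labels.reshape(H, W)
```

If the features are geometrically coherent, pixels of the same part should fall into the same cluster.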
From left to right: the source image with the query keypoint, followed by the argmax correspondence of the feature embedding in each consecutive frame.
We propose a novel loss function and a lightweight architecture for image representation learning, leveraging optimal transport and KL-regularized soft assignment;
Given a dataset with keypoint annotations, we can train the LoRA adapter by comparing the estimated assignment obtained from the KL-regularized optimal transport layer $ \widehat{\mathbf{P}}^{\lambda,\alpha,\beta}$ to the sparse ground-truth assignment, giving us the following loss function: $$ \mathcal{L} = - \sum_{(i,j)\in \mathcal{M}^+ \cup \mathcal{M}^0} \log \widehat{\mathbf{P}}^{\lambda,\alpha,\beta}_{i,j} \, -\sum_{(i,j)\in \mathcal{M}^-} \log \bigl(1- \widehat{\mathbf{P}}^{\lambda,\alpha,\beta}_{i,j}\bigr).$$
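To make the loss concrete, the sketch below pairs a plain entropy-regularized Sinkhorn assignment (a stand-in for the KL-regularized OT layer above; the $\alpha$/$\beta$ marginal-relaxation terms are omitted) with the negative log-likelihood over annotated positive and negative matches. All function names here are hypothetical.

```python
import numpy as np

def soft_assignment(f_src, f_tgt, lam=0.1, iters=50):
    """Entropy-regularised OT between two descriptor sets (plain Sinkhorn).

    f_src: (N, C), f_tgt: (M, C). Returns P of shape (N, M) with uniform
    marginals 1/N and 1/M; `lam` is the regularisation temperature.
    """
    a = f_src / np.linalg.norm(f_src, axis=1, keepdims=True)
    b = f_tgt / np.linalg.norm(f_tgt, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                  # cosine cost matrix (N, M)
    K = np.exp(-cost / lam)               # Gibbs kernel
    u, v = np.ones(len(f_src)), np.ones(len(f_tgt))
    for _ in range(iters):                # alternate marginal scaling
        u = (1.0 / len(f_src)) / (K @ v)
        v = (1.0 / len(f_tgt)) / (K.T @ u)
    return u[:, None] * K * v[None, :]

def matching_loss(P, pos, neg, eps=1e-9):
    """NLL over annotated pairs: -log P on matches, -log(1-P) on non-matches."""
    loss = -sum(np.log(P[i, j] + eps) for i, j in pos)
    loss += -sum(np.log(1.0 - P[i, j] + eps) for i, j in neg)
    return loss
```

With annotated keypoints, `pos` plays the role of $\mathcal{M}^+ \cup \mathcal{M}^0$ and `neg` that of $\mathcal{M}^-$; gradients of this loss would flow back through the assignment into the adapter weights in a differentiable implementation.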