Image matching establishes pixel correspondences between two images. It is usually achieved in two steps: high-dimensional local features are first extracted from each image, and correspondences are then established based on feature similarity. Image matching can be categorized into two sub-domains, geometric and semantic matching, where the former matches pixels describing the same 3D point while the latter finds pixels having the same semantic meaning. This thesis focuses on advancing both types of matching from various perspectives and explores another application of image features, namely learning 3D attributes from 2D images. For geometric matching, we propose improvements in accuracy through refined network architectures. NCNet by Rocco et al. demonstrates that 4D convolution can be used to filter incorrect matches and improve matching accuracy. However, the quadratic complexity of this module makes it impractical for matching high-resolution images, a potentially crucial factor in achieving accurate geometric matching. To address this limitation, we propose a dual-resolution network architecture. Our method applies 4D convolution to filter matches at a coarse scale and uses the result to guide matching at a finer scale. This approach significantly enhances accuracy by increasing the matching resolution while maintaining low computational overhead. For semantic matching, we first propose a simple and effective training framework that significantly reduces model complexity. Previous works typically connect a pre-trained feature backbone to a complex matching module to handle appearance differences between semantically similar points. Inspired by metric learning, we show that the feature backbone itself can achieve comparable performance without any additional module, hence greatly simplifying the overall design. Building on this framework, we next explore adapting large vision foundation models for semantic matching.
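The dual-resolution idea described above can be illustrated with a minimal NumPy sketch: matches are first found globally on a coarse 4D correlation volume, and each fine-scale pixel then searches only a small window around its upsampled coarse match. This is an illustrative toy, not the thesis's actual architecture; the real method uses learned 4D convolutions to filter the coarse correlation, and the function names, window radius, and hard argmax here are assumptions for demonstration.

```python
import numpy as np

def correlation_4d(fa, fb):
    """4D correlation volume between two (H, W, C) L2-normalised feature maps."""
    return np.einsum('ijc,klc->ijkl', fa, fb)

def coarse_to_fine_match(fa_c, fb_c, fa_f, fb_f, radius=1):
    """Match fine-scale pixels, guided by coarse-scale correspondences.

    fa_c, fb_c: coarse (Hc, Wc, C) feature maps of images A and B.
    fa_f, fb_f: fine (Hf, Wf, C) feature maps; Hf/Hc and Wf/Wc are integers.
    Returns an (Hf, Wf, 2) array of matched (row, col) positions in image B.
    """
    corr_c = correlation_4d(fa_c, fb_c)          # global, but on small maps only
    Hc, Wc = fa_c.shape[:2]
    Hf, Wf = fa_f.shape[:2]
    sy, sx = Hf // Hc, Wf // Wc
    matches = np.zeros((Hf, Wf, 2), dtype=int)
    for i in range(Hf):
        for j in range(Wf):
            ci, cj = i // sy, j // sx            # coarse cell containing (i, j)
            k, l = np.unravel_index(np.argmax(corr_c[ci, cj]), (Hc, Wc))
            # Restrict the fine search to a window around the coarse match,
            # avoiding a full (Hf*Wf)^2 correlation at high resolution.
            y0, y1 = max(0, k * sy - radius), min(Hf, (k + 1) * sy + radius)
            x0, x1 = max(0, l * sx - radius), min(Wf, (l + 1) * sx + radius)
            win = np.einsum('c,yxc->yx', fa_f[i, j], fb_f[y0:y1, x0:x1])
            wy, wx = np.unravel_index(np.argmax(win), win.shape)
            matches[i, j] = (y0 + wy, x0 + wx)
    return matches
```

Because the exhaustive 4D volume is only ever built at the coarse scale, the quadratic cost is paid on small maps, while the fine scale contributes high-resolution precision at a cost linear in the number of fine pixels.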
Earlier works found that Stable Diffusion (SD), known for its text-to-image generation capabilities, can also produce highly discriminative features for image matching. We demonstrate that the matching capability of SD can be enhanced through prompt tuning, and propose a conditional prompting module that infuses prior knowledge of the image pair into SD's prompt to further improve matching accuracy. Beyond feature matching, we investigate inferring 3D information from 2D images, a different use case of image features. Previous studies have shown that a 3D model of a dressed person can be derived from a 2D image feature map. We find that a person's pose, shape, and garment attributes can be disentangled from this feature map. These attributes can be swapped or replaced, allowing the creation of a new 3D model from the modified feature map. Unlike typical disentanglement tasks such as face swapping, which operate in the same data domain, our work transitions from 2D to 3D and is able to generate 3D models from 2D image attributes. To summarise, this thesis enhances both geometric and semantic matching, and also advances the inference of 3D information from 2D image features.
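The attribute-swapping idea above can be sketched in a few lines. This is a hypothetical simplification: the thesis operates on a disentangled 2D feature map, whereas here pose, shape, and garment are simply assumed to occupy fixed slices of a flat latent vector, and the slice boundaries are invented for illustration.

```python
import numpy as np

# Hypothetical layout: a 96-dim latent split into three attribute slices.
# The real feature map and its disentanglement are learned, not fixed slices.
POSE, SHAPE, GARMENT = slice(0, 32), slice(32, 64), slice(64, 96)

def swap_garment(feat_a, feat_b):
    """Return a copy of subject A's latent with B's garment attributes.

    A downstream decoder could then reconstruct a 3D model of person A
    wearing person B's garment from the modified latent.
    """
    out = np.asarray(feat_a, dtype=float).copy()
    out[GARMENT] = feat_b[GARMENT]
    return out
```

The point of the sketch is the workflow, not the representation: once attributes are disentangled, editing the feature map and re-decoding yields a new 3D model, crossing from the 2D image domain into 3D.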