JOURNAL ARTICLE

Joint Inference of Objects and Scenes With Efficient Learning of Text-Object-Scene Relations

Botao Wang, Dahua Lin, Hongkai Xiong, Yuan F. Zheng

Year: 2016   Journal: IEEE Transactions on Multimedia   Vol: 18 (3)   Pages: 507-520   Publisher: Institute of Electrical and Electronics Engineers

Abstract

The rapid growth of web images presents new challenges as well as opportunities for the task of image understanding. Conventional approaches rely heavily on fine-grained annotations, such as bounding boxes and semantic segmentations, which are not available for web-scale images. In general, images on the Internet are accompanied by descriptive texts that are relevant to their contents. To bridge the gap between textual and visual analysis for image understanding, this paper presents an algorithm to learn the relations between scenes, objects, and texts with the help of image-level annotations. In particular, the relation between the texts and objects is modeled as the matching probability between the nouns and the object classes, which is obtained by solving a constrained bipartite matching problem. The relations between the scenes and objects/texts, on the other hand, are modeled as the conditional distributions of their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings together scenes, objects, and texts for joint image understanding, including scene classification, object classification and localization, and the prediction of object cardinalities. The proposed cross-domain learning algorithm and the integrated model improve the performance of image understanding for web images in the context of textual descriptions. Experimental results show that the proposed algorithm significantly outperforms conventional methods in various computer vision tasks.
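
The two kinds of relation named in the abstract can be illustrated with a toy sketch. This is not the authors' formulation, which the abstract does not spell out: the function names, the add-alpha smoothing, the toy data, and the exhaustive search over one-to-one assignments (standing in for whatever constrained solver the paper uses) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_conditional(pairs, vocabulary, alpha=1.0):
    """Estimate P(item | scene) from (scene, item) co-occurrence pairs.

    Add-alpha smoothing is an assumption of this sketch, not of the paper.
    """
    counts = defaultdict(Counter)
    for scene, item in pairs:
        counts[scene][item] += 1
    probs = {}
    for scene, c in counts.items():
        total = sum(c.values()) + alpha * len(vocabulary)
        probs[scene] = {v: (c[v] + alpha) / total for v in vocabulary}
    return probs


def best_matching(nouns, classes, score):
    """Exhaustive one-to-one matching of nouns to object classes.

    Maximises the summed score; requires len(nouns) <= len(classes) and is
    only practical for tiny inputs.
    """
    best, best_total = None, float("-inf")
    for perm in permutations(classes, len(nouns)):
        total = sum(score[n][c] for n, c in zip(nouns, perm))
        if total > best_total:
            best, best_total = dict(zip(nouns, perm)), total
    return best, best_total


# Hypothetical image-level annotations: (scene label, object label) pairs.
pairs = [("kitchen", "cup"), ("kitchen", "cup"),
         ("kitchen", "table"), ("street", "car")]
vocabulary = ["cup", "table", "car"]
probs = cooccurrence_conditional(pairs, vocabulary)

# Hypothetical noun-to-class affinity scores.
nouns = ["puppy", "sedan"]
classes = ["dog", "car", "cup"]
score = {
    "puppy": {"dog": 0.9, "car": 0.1, "cup": 0.2},
    "sedan": {"dog": 0.05, "car": 0.8, "cup": 0.1},
}
matching, total = best_matching(nouns, classes, score)
```

For larger vocabularies, a polynomial-time assignment solver would replace the exhaustive search; the brute-force version is used here only to keep the sketch dependency-free.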

Keywords:
Computer science; Artificial intelligence; Object (grammar); Inference; Matching (statistics); Domain (mathematical analysis); Context (archaeology); Relation (database); Scene graph; Bipartite graph; Bounding overwatch; Conditional random field; Natural language processing; Computer vision; Pattern recognition (psychology); Data mining; Theoretical computer science; Graph

Metrics

Cited By: 7
FWCI (Field Weighted Citation Impact): 0.67
Refs: 57
Citation Normalized Percentile: 0.77

Topics

Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition