Joint Inference of Objects and Scenes With Efficient Learning of Text-Object-Scene Relations

Botao Wang; Dahua Lin; Hongkai Xiong; Yuan F. Zheng

doi:10.1109/tmm.2016.2520087

JOURNAL ARTICLE

Year: 2016; Journal: IEEE Transactions on Multimedia; Vol: 18 (3); Pages: 507-520; Publisher: Institute of Electrical and Electronics Engineers

Abstract

The rapid growth of web images presents new challenges as well as opportunities for the task of image understanding. Conventional approaches rely heavily on fine-grained annotations, such as bounding boxes and semantic segmentations, which are not available for web-scale images. In general, images on the Internet are accompanied by descriptive texts that are relevant to their contents. To bridge the gap between textual and visual analysis for image understanding, this paper presents an algorithm to learn the relations between scenes, objects, and texts with the help of image-level annotations. In particular, the relation between the texts and objects is modeled as the matching probability between the nouns and the object classes, which can be estimated by solving a constrained bipartite matching problem. The relations between the scenes and objects/texts, on the other hand, are modeled as the conditional distributions of their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings together scenes, objects, and texts for joint image understanding, including scene classification, object classification and localization, and the prediction of object cardinalities. The proposed cross-domain learning algorithm and the integrated model improve the performance of image understanding for web images that come with textual descriptions. Experimental results show that the proposed algorithm significantly outperforms conventional methods in various computer vision tasks.
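The two relation types named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the smoothing constant `alpha`, the one-to-one constraint, and the brute-force search are all assumptions chosen only to make the ideas concrete. The first function estimates a conditional co-occurrence distribution P(object | scene) from image-level labels; the second picks the highest-scoring assignment of nouns to distinct object classes.

```python
from collections import Counter, defaultdict
from itertools import permutations


def conditional_cooccurrence(samples, alpha=1.0):
    """Estimate P(object | scene) from (scene, [objects]) image-level labels.

    alpha is an assumed Laplace smoothing constant, not a value from the paper.
    """
    counts = defaultdict(Counter)
    vocab = set()
    for scene, objects in samples:
        counts[scene].update(objects)
        vocab.update(objects)
    probs = {}
    for scene, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        probs[scene] = {obj: (counter[obj] + alpha) / total for obj in sorted(vocab)}
    return probs


def best_one_to_one_matching(nouns, classes, score):
    """Brute-force maximum-score matching of each noun to a distinct class.

    Assumes len(nouns) <= len(classes); only practical for tiny inputs.
    The one-to-one constraint is an illustrative assumption.
    """
    best, best_total = None, float("-inf")
    for perm in permutations(classes, len(nouns)):
        total = sum(score[(n, c)] for n, c in zip(nouns, perm))
        if total > best_total:
            best, best_total = dict(zip(nouns, perm)), total
    return best, best_total


# Hypothetical toy data: scene labels with the object classes present.
samples = [
    ("street", ["car", "person"]),
    ("street", ["car"]),
    ("kitchen", ["person"]),
]
probs = conditional_cooccurrence(samples)

# Hypothetical noun-to-class affinity scores.
score = {
    ("automobile", "car"): 0.9,
    ("automobile", "person"): 0.1,
    ("man", "car"): 0.2,
    ("man", "person"): 0.8,
}
matching, total = best_one_to_one_matching(
    ["automobile", "man"], ["car", "person"], score
)
```

For realistic vocabulary sizes, the brute-force search would need to be replaced by a polynomial-time assignment solver; the exhaustive version is used here only to keep the sketch dependency-free.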

Keywords:

Computer science; Artificial intelligence; Object (grammar); Inference; Matching (statistics); Domain (mathematical analysis); Context (archaeology); Relation (database); Scene graph; Bipartite graph; Bounding overwatch; Conditional random field; Natural language processing; Computer vision; Pattern recognition (psychology); Data mining; Theoretical computer science; Graph

Metrics

FWCI (Field Weighted Citation Impact): 0.67

Citation Normalized Percentile: 0.77

Topics

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Image Retrieval and Classification Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

ReplaceAnything3D: Text-Guided Object Replacement in 3D Scenes with Compositional Scene Representations

Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes

Learning spatial relations between objects from 3D scenes

Imaging object-scene relations processing in visible and invisible natural scenes

Saliency, objects and scenes: global scene factors in attention and object detection