TableVLM: Multi-modal Pre-training for Table Structure Recognition

L. Chen; Chengsong Huang; Xiaoqing Zheng; Jinshu Lin; Xuanjing Huang

doi:10.18653/v1/2023.acl-long.137

JOURNAL ARTICLE

Year: 2023 Pages: 2437-2449

Abstract

Tables are widely used in research and business. They are well suited to human consumption but not easily machine-processable, particularly when they appear in images. One of the main challenges in extracting data from images of tables is accurately recognizing table structures, especially for complex tables with cells that cross rows and columns. In this study, we propose a novel multi-modal pre-training model for table structure recognition, named TableVLM. With a two-stream multi-modal transformer-based encoder-decoder architecture, TableVLM learns to capture rich table structure-related features through multiple carefully designed unsupervised objectives inspired by the notion of masked visual-language modeling. To pre-train this model, we also created a dataset, called ComplexTable, which consists of 1,000K samples and is to be released publicly. Experimental results show that the model built on pre-trained TableVLM can improve performance by up to 1.97% in tree-editing-distance-score on ComplexTable.
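
The abstract reports its gain as a tree-editing-distance-score. As an illustration only, the sketch below assumes this refers to the tree-edit-distance-based similarity commonly used in table recognition, 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), computed over the HTML trees of two tables. It is not the authors' evaluation code: it uses unit costs on tag labels only and ignores cell text and span attributes, and the names `size`, `forest_dist`, and `teds` are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost ordered edit distance between two forests."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every remaining node
    if not f2:
        return size(f1)  # delete every remaining node
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_dist(f1[:-1] + c1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_dist(f1, f2[:-1] + c2) + 1
    # Map the two rightmost roots onto each other (relabel if they differ).
    match = forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1]) + (l1 != l2)
    return min(delete, insert, match)

def teds(t1, t2):
    """Similarity in [0, 1]; 1.0 means the two trees are identical."""
    dist = forest_dist((t1,), (t2,))
    return 1 - dist / max(size((t1,)), size((t2,)))

# Toy example: one row with two cells; the prediction mislabels one cell.
truth = ("table", (("tr", (("td", ()), ("td", ()))),))
pred = ("table", (("tr", (("td", ()), ("th", ()))),))
print(teds(truth, pred))  # one relabel over four nodes -> 0.75
```

The plain recursion is adequate for small trees like this one; real table HTML would call for a more efficient tree edit distance algorithm.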

Keywords: Computer science; Table (database); Modal; Artificial intelligence; Row; Encoder; Pattern recognition (psychology); Machine learning; Natural language processing; Data mining; Speech recognition; Database

Metrics

FWCI (Field Weighted Citation Impact): 1.27

Citation Normalized Percentile: 0.76

Topics

Handwritten Text Recognition Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Currency Recognition and Detection

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Digital Media Forensic Detection

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Multi-Modal Pre-Training for Automated Speech Recognition

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for speech recognition

VLTSR: Visual-Layout Multi-Modal Fusion for Table Structure Recognition

Real-time Emotion Pre-Recognition in Conversations with Contrastive Multi-modal Dialogue Pre-training