JOURNAL ARTICLE

Semi-supervised multi-task learning of structured prediction models for web information extraction

Abstract

Extracting information from web pages is an important problem with applications such as improving search results and constructing databases to serve user queries. In this paper we propose a novel structured prediction method to address two important aspects of the extraction problem: (1) labeled data is available only for a small number of sites and (2) a machine-learned global model does not generalize adequately across many websites. For this purpose, we propose a weight-space-based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the limited labeled data problem, and it falls in the class of graph-regularization-based semi-supervised learning approaches. Second, to address the generalization inadequacy of a global model, this method builds a local model for each website. Viewing the problem of building a local model for each website as a task, we learn the models for a collection of sites jointly; thus our method can also be seen as a graph-regularization-based multi-task learning approach. Learning the models jointly with the proposed method is useful in two ways: (1) learning a local model for a website can be effectively influenced by labeled and unlabeled data from other websites; and (2) even for a website with only unlabeled examples it is possible to learn a decent local model. We demonstrate the efficacy of our method on several real-life datasets; experimental results show that significant performance improvements can be obtained by combining semi-supervised and multi-task learning in a single framework.
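
The idea of coupling per-site models through a graph penalty in weight space can be illustrated with a minimal sketch. This is not the paper's method: it assumes plain linear models with squared loss instead of structured prediction models, and it assumes a site-similarity matrix `A` is already given (for instance, one estimated from unlabeled pages). The function name `fit_site_models` and the parameters `lam`, `mu`, `lr`, and `steps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_site_models(X, y, A, lam=1.0, mu=1e-3, lr=0.05, steps=2000):
    """Jointly fit one linear model per site by gradient descent.

    X, y : dicts mapping site index -> labeled features (n_s, d) and targets (n_s,).
           Sites with no labeled examples are simply absent from the dicts.
    A    : (S, S) symmetric, nonnegative site-similarity matrix (assumed given).
    Returns W with shape (S, d); row s is the weight vector of site s.
    """
    S = A.shape[0]
    d = next(iter(X.values())).shape[1]
    W = np.zeros((S, d))
    # Graph Laplacian: trace(W.T @ L @ W) = 0.5 * sum_{s,t} A[s,t] * ||w_s - w_t||^2
    L = np.diag(A.sum(axis=1)) - A
    for _ in range(steps):
        # Gradient of the weight-space graph penalty plus a small ridge term.
        G = 2.0 * lam * (L @ W) + 2.0 * mu * W
        # Gradient of the mean squared error on each labeled site.
        for s, Xs in X.items():
            G[s] += 2.0 * Xs.T @ (Xs @ W[s] - y[s]) / len(y[s])
        W -= lr * G
    return W

# Toy data (an assumption for illustration): three sites share one true
# weight vector, and site 2 has no labeled examples at all.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = {s: rng.normal(size=(50, 3)) for s in (0, 1)}
y = {s: X[s] @ w_true for s in X}
A = np.ones((3, 3)) - np.eye(3)  # fully connected site graph
W = fit_site_models(X, y, A)
print(np.round(W, 2))
```

In this toy setting the unlabeled site receives no data gradient, so its weights are determined entirely by the graph penalty that pulls them toward the models of its neighbors. That mirrors, in a much simplified form, the abstract's claim that a site with only unlabeled examples can still obtain a usable local model.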

Keywords:
Computer science, Regularization (linguistics), Graph, Machine learning, Artificial intelligence, Semi-supervised learning, Labeled data, Generalization, Web page, Supervised learning, Task (project management), Data mining, Artificial neural network, Theoretical computer science, Mathematics, World Wide Web

Metrics

Cited By: 9
FWCI (Field Weighted Citation Impact): 0.75
Refs: 41
Citation Normalized Percentile: 0.81

Topics

Web Data Mining and Analysis
Physical Sciences → Computer Science → Information Systems
Text and Document Classification Technologies
Physical Sciences → Computer Science → Artificial Intelligence
Caching and Content Delivery
Physical Sciences → Computer Science → Computer Networks and Communications

Related Documents

DISSERTATION

Semi-supervised structured prediction models

Ulf Brefeld

University: edoc Publication server (Humboldt University of Berlin) Year: 2008 Pages: 1-168

JOURNAL ARTICLE

Semi-supervised learning for structured output prediction

Jurica Levatić

Journal: Informatica Year: 2022 Vol: 46 (4)