Abstract

One of the major difficulties in German large-vocabulary continuous speech recognition (LVCSR) is the rich morphology of German, which leads to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Compound words make up a substantial fraction of the German vocabulary, and most compound OOVs are composed of frequent in-vocabulary words. Here, we investigate the use of sub-lexical LMs based on different approaches to word decomposition, namely supervised and unsupervised decomposition, as well as decomposition derived from grapheme-to-phoneme (G2P) conversion. In the latter approach, we augment a normal word model with a set of grapheme-phoneme pairs, called graphones, which are used to model the OOV words. A novel approach is proposed to select representative graphone sequences for OOVs based on unsupervised decomposition and word-pronunciation alignment. We obtain relative reductions in word error rate (WER) ranging from 4.2% to 6.5% with respect to a comparable full-word system.
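
The observation that most compound OOVs consist of frequent in-vocabulary words can be illustrated with a minimal sketch. It is not the decomposition method evaluated in this work: it is a toy dictionary-based splitter, and the function names (`split_compound`, `oov_rate`, `decompose_tokens`), the `min_len` parameter, and the six-word vocabulary are all assumptions made for illustration. It also ignores German linking elements such as the Fugen-s, which a real system would have to handle.

```python
from __future__ import annotations

from typing import Optional


def split_compound(word: str, vocab: set[str], min_len: int = 3) -> Optional[list[str]]:
    """Split `word` into in-vocabulary parts, or return None if no split exists.

    Hypothetical helper: dynamic programming over prefixes that prefers
    the split with the fewest parts.
    """
    word = word.lower()
    n = len(word)
    # best[i] holds the shortest list of parts covering word[:i], if any.
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(0, end - min_len + 1):
            prev = best[start]
            part = word[start:end]
            if prev is None or part not in vocab:
                continue
            candidate = prev + [part]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


def oov_rate(tokens: list[str], vocab: set[str]) -> float:
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t.lower() not in vocab for t in tokens) / len(tokens)


def decompose_tokens(tokens: list[str], vocab: set[str]) -> list[str]:
    """Replace each OOV token by its in-vocabulary parts when a split exists."""
    out: list[str] = []
    for t in tokens:
        parts = None if t.lower() in vocab else split_compound(t, vocab)
        out.extend(parts if parts else [t.lower()])
    return out


if __name__ == "__main__":
    # Toy vocabulary and sentence, assumed for illustration only.
    vocab = {"haus", "tür", "schlüssel", "der", "ist", "weg"}
    tokens = ["Der", "Haustürschlüssel", "ist", "weg"]
    print(oov_rate(tokens, vocab))                           # 0.25
    print(decompose_tokens(tokens, vocab))
    print(oov_rate(decompose_tokens(tokens, vocab), vocab))  # 0.0
```

In this toy case the single OOV compound is covered entirely by in-vocabulary parts, so the OOV rate at the sub-lexical level drops to zero. A sub-lexical LM would then be trained on the decomposed token stream, with some marking scheme (not shown here) to rejoin the parts after recognition.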

Keywords:
Computer science, Pronunciation, German, Artificial intelligence, Natural language processing, Vocabulary, Grapheme, Word (group theory), Word error rate, Speech recognition, Language model, Set (abstract data type), Decomposition, Compound, Linguistics

Metrics

Cited By: 27
FWCI (Field Weighted Citation Impact): 4.81
Refs: 26
Citation Normalized Percentile: 0.95

Topics

Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Speech Recognition and Synthesis (Physical Sciences → Computer Science → Artificial Intelligence)