JOURNAL ARTICLE

Pretrained multilingual Party model

Benjamin Kiessling

Year: 2025 · Journal: Zenodo · Publisher: European Organization for Nuclear Research (CERN)

Abstract

Party ("page-wise recognition of text-y") is a replacement for conventional text recognizers in automatic text recognition (ATR) pipelines that use either bounding-box or baseline+bounding-polygon segmentation methods for layout analysis. It is a full-page generative text recognizer that has been pretrained on a large corpus of multilingual historical, contemporary, and born-digital document page images, both handwritten and machine-printed.

Architecture

The recognizer is a deep-fusion multimodal model consisting of a Swin vision encoder and a tiny Llama decoder (100M parameters) trained with octet tokenization. The network is prompted with the line positions through positional embeddings added to the encoder hidden state. During training, the encoder weights were initialized with an ImageNet-22k-pretrained Swin-Base from pytorch-image-models; the decoder weights came from a custom Llama 3.2 pretrained on a subset of OSCAR 2301 tokenized with a ByT5-style octet tokenizer. The pre-initialized model was then pretrained on a collection of public and private historical document page training datasets augmented with born-digital data crafted from PubLayNet.

Uses

Party is a recognition foundation model primarily targeted at automatic text recognition for the humanities. While it produces fairly accurate output on an impressive range of material, it is intended to be fine-tuned on a target dataset to ensure compliance with the desired transcription guidelines.

Transcription Guidelines, Normalization, and Transformations

No attempt has been made to normalize the datasets or to use only data adhering to common transcription guidelines. While some subsets of the corpus are internally consistent, only a very small proportion of the languages in the training data contain datasets from a single source.
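The ByT5-style octet tokenization used for the decoder can be illustrated with a short sketch (an illustration only, not party's actual implementation; the special-token layout is assumed): every UTF-8 byte becomes one token, with the lowest IDs reserved for special tokens. The same snippet also shows why Unicode normalization matters for a byte-level model.

```python
import unicodedata

# ByT5-style octet tokenizer sketch (illustration only, not party's actual
# implementation). Every UTF-8 byte is one token; the first few IDs are
# reserved for special tokens (hypothetically PAD=0, EOS=1, UNK=2).
SPECIAL_TOKENS = 3

def encode(text: str) -> list[int]:
    """Map a string to byte-level token IDs (vocab = 256 + specials)."""
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Map byte-level token IDs back to a string."""
    payload = bytes(i - SPECIAL_TOKENS for i in ids if i >= SPECIAL_TOKENS)
    return payload.decode("utf-8", errors="replace")

# Why normalization matters: NFC "é" is a single code point (two UTF-8
# bytes), while NFD "é" is "e" plus a combining accent (three bytes), so
# visually identical text yields different token streams.
nfc = unicodedata.normalize("NFC", "é")
nfd = unicodedata.normalize("NFD", "é")
print(len(encode(nfc)), len(encode(nfd)))  # 2 3
```

This is also why the model card warns that inconsistent normalization in the training data can surface as slightly different code point streams at prediction time.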
Bias, Risks, and Limitations

The training corpus is heavily skewed towards a handful of languages (Chinese, English, French, German, and Portuguese) and frequently incorporates datasets of esoteric material transcribed for specific purposes. Machine-printed and born-digital material in particular lack diversity, so error rates will most likely vary considerably across languages and document types. Some additional limitations are to be expected:

- Arabic, Hebrew, and South Indian script recognition is likely to require fine-tuning.
- Some transcriptions resolved abbreviations while others did not. Inconsistent output is to be expected, in particular for European manuscripts in Latin script.
- As the model predicts 8-bit UTF-8 code units directly, the lack of consistent Unicode normalization can cause slightly different code point streams during prediction.

How to Get Started with the Model

Install the party package from GitHub and follow the instructions.

Training Details

Training Data

The model has been pretrained on the vast majority of publicly available ATR datasets, in addition to a decent number of restricted datasets. For English exclusively, we converted the PubLayNet layout analysis dataset for born-digital documents into an ATR dataset with PDFMiner and a basic baseline heuristic based on the line bounding box.
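The bounding-box-to-baseline conversion for the PubLayNet-derived data can be sketched roughly as follows (a hypothetical heuristic for illustration; the exact rule used to build the dataset is not specified here): for a horizontal printed line, place the baseline a small fraction above the bottom edge of the line's bounding box, leaving room for descenders.

```python
# Hypothetical bbox -> baseline heuristic (illustration only; the exact
# rule used for the PubLayNet-derived ATR data is not documented here).
# For a horizontal printed line the baseline sits near the bottom of the
# line bounding box, just above the descender zone.

def baseline_from_bbox(x0, y0, x1, y1, descender_ratio=0.15):
    """Return a two-point baseline for an axis-aligned line bbox.

    (x0, y0) is the top-left corner, (x1, y1) the bottom-right corner,
    in image coordinates (y grows downward). descender_ratio is the
    assumed fraction of the line height occupied by descenders.
    """
    height = y1 - y0
    y_base = y1 - descender_ratio * height
    return [(x0, y_base), (x1, y_base)]

print(baseline_from_bbox(10, 100, 210, 140))  # [(10, 134.0), (210, 134.0)]
```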
|Language|Pages|Lines|Datasets|
|:-------|:----|:----|:-------|
|Arabic| | |RASAM 1, TariMa, OpenITI Arabic MS Data, OpenITI Arabic Print Data|
|Catalan| | |FONDUE-CA-PRINT-20|
|Chinese| | |1 large private dataset|
|Corsican| | |HN2021-OCR-Poesie-Corse|
|Czech| | |Padeřov-Bible-handwriting-ground-truth|
|Dutch| | |4 private manuscript datasets, VOC dataset|
|English| | |FONDUE-EN-PRINT-20, PubLayNet, University of Denver Collections, Joseph Hooker HTR, CCCC MS 41|
|Finnish| | |NewsEye/READ OCR Finnish Newspapers|
|French| | |NewsEye READ AS French Newspapers, Boccace, Fabliaux, Liber, Cremma Medieval, DecameronFR, FONDUE-FR-MSS-18, FONDUE-FR-MSS-19, FONDUE-FR-PRINT-16, FONDUE-FR-PRINT-17, FONDUE-FR-PRINT-20, Données imprimés gothiques du 16e siècle, Données HTR incunables du 15e siècle, Données HTR manuscrits du 15e siècle, "Tables Décennales" French Civil Registry, Données imprimés du 16e siècle, Données imprimés du 17e siècle, Données imprimés du 18e siècle, Incunable français du 15e siècle, HTRomance, HTR-SETAF-Jean-Michel, HTR-SETAF-LesFaictzJCH, HTR-SETAF-Pierre-de-Vingle, La Correspondance Jacques Doucet - René Jean, OCR17+, Tapus Corpus, TIMEUS Corpus, Recensement Valaisan, 3 private handwritten and print datasets|
|Georgian| | |1 private dataset|
|German| | |Charlottenburger Amtsschrifttum, DACH GT, DigiTue GT, Fibeln, FONDUE-DE-MSS-18, FoNDUE_Wolfflin_Fotosammlung, HKB GT, Ground truth for Neue Zürcher Zeitung black letter, Reichsanzeiger GT, StABS Ratsbücher O10, NewsEye / READ OCR Austrian Newspapers, Weisthuemer, 3 private manuscript datasets|
|Greek| | |EPARCHOS, HTR CPgr23, Handwritten Paleographic Greek Text Recognition, ΧΦ114, XΦ79, ΧΦ53, 10 small private manuscript datasets|
|Hebrew| | |Tikkoun Sofrim, BiblIA|
|Italian| | |episearch-htr, FONDUE-IT-PRINT-20, HTRomance Italian, 1 private print dataset|
|Japanese| | |mm-ocr-dataset-v1|
|Latin| | |Caroline Minuscule, CREMMA-Medieval-LAT, HTRomance Latin, DIVA-HisDB, Eutyches, FONDUE-LA-MSS-MA, FONDUE-LA-PRINT-16, Lateinische Gedichte, Wien ÖNB Cod 2160, 2 private manuscript datasets|
|Multilingual| | |FONDUE-MLT-ART, [FONDUE-MLT-CAT](https://github.com/FoNDUE-HTR/FONDUE-MLT-CAT), [FONDUE-MLT-PRINT-TEST](https://github.com/FoNDUE-HTR/FONDUE-MLT-PRINT-TEST), gt_structure_text|
|Ottoman Turkish| | |OpenITI Arabic MS Data, OpenITI Arabic Print Data|
|Farsi| | |OpenITI Arabic MS Data, OpenITI Arabic Print Data|
|Portuguese| | |Portuguese Handwriting 16th-19th c.|
|Russian| | |1 private manuscript dataset|
|Spanish| | |FONDUE-ES-PRINT-19, FoNDUE-Spanish-chapbooks-Dataset, HTR Araucania, HTRomance Spa, 3 private manuscript datasets|
|Swedish| | |ATR_TrainingSet_NLF_Newseye_GT_SV_M2+, Kat -57|
|Syriac| | |2 private print and manuscript datasets|
|Urdu| | |OpenITI Arabic MS Data, OpenITI Arabic Print Data|
|Yiddish| | |1 private print dataset|

Training Procedure and Hyperparameters

Training regime: 6 × A40 GPUs, BF16 precision (AMP); Mars-AdamW optimizer with caution; batch size 32 with gradient accumulation 4 (effective batch size 768); 5+8 epochs (5 on synthetic+real data, 8 on real data only) with a 5000-iteration warmup and cosine decay; max LR 5e-4, min LR 5e-6 at the end of epoch 5; weight decay 1e-5; gradient clipping 1.0; augmentation; random sampling of bounding-box and curve batches.

Evaluation

The current base model's character accuracies on the validation set of 1000 randomly sampled pages, with curve and bounding-box prompts (sorted by ascending curve error rate):

| Script | Code Points | Accuracy (curves) | Accuracy (boxes) |
| :----- | :---------- | :---------------- | :--------------- |
| Han | 107416 | 98.90% | 98.88% |
| Hiragana | 1868 | 97.11% | 97.11% |
| Cyrillic | 22239 | 92.70% | 92.34% |
| Greek | 1036 | 92.28% | 91.31% |
| Katakana | 390 | 90.00% | 90.00% |
| Latin | 199703 | 88.02% | 86.98% |
| Common | 85863 | 80.24% | 79.28% |
| Arabic | 18061 | 79.22% | 79.64% |
| Hebrew | 40182 | 73.98% | 73.97% |
| Inherited | 2886 | 61.61% | 60.95% |
| Unknown | 202 | 58.42% | 57.43% |

The script types are determined from the Unicode script property of each individual code point.
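The per-script breakdown relies on the Unicode script property of each code point. Python's standard library does not expose that property, so the sketch below approximates it with a few hand-picked ranges (an illustration; a real evaluation would use the full Scripts.txt data, e.g. via the third-party `regex` package):

```python
from collections import Counter

# Approximate per-code-point script bucketing (illustration only).
# These ranges are a small hand-picked subset of the Unicode Script
# property; unlisted code points fall through to "Common".
SCRIPT_RANGES = [
    (0x0041, 0x007A, "Latin"),     # basic Latin letters (approximate)
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]

def script_of(ch: str) -> str:
    """Return the (approximate) script bucket for a single character."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"  # digits, punctuation, and everything unlisted

print(Counter(script_of(c) for c in "Faust 浮士德"))
# Counter({'Latin': 5, 'Han': 3, 'Common': 1})
```

Bucketing accuracy per script this way explains the "Common", "Inherited", and "Unknown" rows in the table: they collect punctuation, digits, combining marks, and unassigned code points rather than any single writing system.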
The base model has been trained on Georgian, Syriac, Newa, Malayalam, and Devanagari, albeit with fairly small datasets. No pages with these scripts are contained in the validation sample.
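Character accuracies of the kind reported above are commonly derived from edit distance. A minimal sketch, assuming accuracy = 1 − edits/reference length (the exact metric used for the evaluation may differ in detail):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - character error rate, clamped at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)

print(char_accuracy("kitten", "sitting"))  # 0.5 (3 edits over 6 chars)
```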

Keywords:
Transcription (linguistics), Encoder, Segmentation, Deep learning, Named-entity recognition, Text recognition, Historical document, Bounding box

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.54

Topics

Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image Processing and 3D Reconstruction
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
