CONFERENCE PAPER

Developing Pretrained Language Models for Turkish Biomedical Domain

Hazal Türkmen, Oğuz Dikenelli, Cenk Eraslan, Mehmet Cem Çallı, Süha Süreyya Özbek

Year: 2022 | Published in: 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI) | Pages: 597-598

Abstract

Pretrained language models augmented with in-domain corpora show impressive results in biomedical and clinical NLP tasks in English. However, there is minimal work in low-resource languages. This work introduces the BioBERTurk family, three pretrained models for the Turkish biomedical domain. To evaluate the models, we also introduce a labeled dataset for classifying radiology reports of CT exams. Our first model was initialized from BERTurk and further pretrained on a biomedical corpus. The second model likewise continues pretraining the general-domain BERTurk model, but on a corpus of Ph.D. theses in radiology, to test the effect of task-related text. The final model combines the radiology and biomedical corpora with the original BERTurk corpus and pretrains a BERT model from scratch. The F-scores of our models on radiology report classification are 92.99, 92.75, and 89.49, respectively. To the best of our knowledge, this is the first work to evaluate the effect of a small in-domain corpus in pretraining from scratch.
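The abstract compares the three models by F-score on report classification. As a quick illustration of that metric (not the paper's evaluation code; the label names below are hypothetical), per-class and macro-averaged F1 can be computed in plain Python as:

```python
def f1_scores(y_true, y_pred):
    """Per-class F1 from gold and predicted label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    s = f1_scores(y_true, y_pred)
    return sum(s.values()) / len(s)

# Hypothetical report labels, for illustration only.
gold = ["normal", "abnormal", "normal", "abnormal"]
pred = ["normal", "abnormal", "abnormal", "abnormal"]
print(f1_scores(gold, pred))   # per-class F1
print(macro_f1(gold, pred))    # macro-averaged F1
```

The paper does not state which averaging (macro, micro, or weighted) its reported F-scores use; the macro average above is one common choice for imbalanced clinical label sets.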

Keywords:
Biomedicine; Computer science; Natural language processing; Turkish; Artificial intelligence; Language model; Bioinformatics

Metrics

Cited By: 9
FWCI (Field-Weighted Citation Impact): 1.06
References: 10
Citation Normalized Percentile: 0.77

Topics

Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Text Readability and Simplification (Physical Sciences → Computer Science → Artificial Intelligence)