JOURNAL ARTICLE

Harnessing Large Datasets for Large Language Models

D'Souza, Jennifer

Year: 2024
Journal: Zenodo (CERN European Organization for Nuclear Research)
Publisher: European Organization for Nuclear Research

Abstract

In this presentation, I survey a series of LLMs, specifically those whose pre-training datasets have moved the needle in obtaining efficient downstream LLMs. The talk thus sheds light on LLM technology with respect to the notable pre-training datasets introduced alongside these models. Models discussed are the encoder-decoder T5 model, the encoder-only BERT model, and the decoder-only models GPT-1, GPT-2, GPT-3, GPT-J, LLaMA, and Falcon. In the context of these models, the notable pre-training datasets discussed include C4, BooksCorpus, WebText, the Pile, and RefinedWeb.
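For readers who want to inspect one of the corpora named above, the sketch below streams a few records from C4 using the Hugging Face datasets library; the dataset identifier allenai/c4 and the English configuration "en" refer to the public mirror and are assumptions for illustration, not part of the original presentation.

    # Minimal sketch: stream a handful of C4 documents for inspection.
    # Assumes the `datasets` library is installed and the public
    # Hugging Face mirror `allenai/c4` (config "en") is reachable.
    from datasets import load_dataset

    # Streaming avoids downloading the full English split up front.
    c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for i, record in enumerate(c4_stream):
        # Each record carries the raw text plus its source URL and timestamp.
        print(record["url"])
        print(record["text"][:200], "...")
        if i >= 2:  # look at the first three documents only
            break

The same pattern applies to the other web-scale corpora mentioned in the abstract when they are available on the hub; only the dataset identifier and column names change.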

Keywords:
Language model

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 0
Citation Normalized Percentile: 0.42

