In this presentation, I survey a series of LLMs, specifically those whose pre-training datasets have moved the needle in obtaining efficient downstream models. The talk thus sheds light on LLM technology with respect to the notable pre-training datasets these models introduced. The models discussed are the encoder-decoder T5 model, the encoder-only BERT model, and several decoder-only models: GPT-1, GPT-2, GPT-3, GPT-J, LLaMA, and Falcon. In the context of these models, the notable pre-training datasets discussed include C4, BooksCorpus, WebText, the Pile, and RefinedWeb.