In this presentation, I survey a series of LLMs, specifically those whose pre-training datasets have moved the needle in obtaining efficient downstream models. The talk thus sheds light on LLM technology with respect to the notable pre-training datasets these models introduced. The models discussed are the encoder-decoder T5 model, the encoder-only BERT model, and several decoder-only models: GPT-1, GPT-2, GPT-3, GPT-J, LLaMA, and Falcon. In the context of these models, the notable pre-training datasets discussed include C4, BooksCorpus, WebText, the Pile, and RefinedWeb.