Text

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

uonlp/CulturaX · Datasets at Hugging Face

tiiuae/falcon-refinedweb · Datasets at Hugging Face

The Pile

gaia-benchmark/GAIA · Datasets at Hugging Face

C4 Search by AI2

French AI startup Pleias shakes up OpenAI with Common Corpus: toward open, ethical, and multilingual innovation?

HuggingFaceFW/fineweb · Datasets at Hugging Face

Code

bigcode/the-stack · Datasets at Hugging Face

Images

LAION-400-MILLION OPEN DATASET | LAION

Best practices

Yam Peleg on Twitter / X

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models