uonlp/CulturaX · Datasets at Hugging Face
tiiuae/falcon-refinedweb · Datasets at Hugging Face
gaia-benchmark/GAIA · Datasets at Hugging Face
HuggingFaceFW/fineweb · Datasets at Hugging Face
bigcode/the-stack · Datasets at Hugging Face
LAION-400-MILLION OPEN DATASET | LAION
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models