AI Training Datasets
An overview of prominent datasets used to train large-scale AI models. Each entry summarizes the dataset's contents and carries a concern rating (out of 10) reflecting its copyright and provenance risk.
Books3
Concern: 10/10. Copyrighted books from shadow libraries.
Updated: 12/1/2024
LAION-5B
Concern: 10/10. 5.85 billion image-text pairs from the web.
Updated: 12/1/2024
LibGen
Concern: 10/10. Library Genesis, a shadow library of pirated books and articles.
Updated: 10/30/2024
The Pile
Concern: 9/10. An 825 GiB corpus of diverse English text assembled from 22 smaller datasets.
Updated: 10/30/2024
BookCorpus
Concern: 8/10. Unpublished books from Smashwords.
Updated: 10/30/2024
Built by Google
Spotify Podcast Dataset
Concern: 8/10. Over 100,000 podcast episodes and transcripts.
Updated: 10/30/2024
Built by Spotify
YouTube
Concern: 8/10. Billions of videos and their transcripts.
Updated: 10/30/2024
Built by Google
Common Crawl
Concern: 7/10. Billions of crawled web pages.
Updated: 12/1/2024
GitHub Code
Concern: 7/10. Billions of lines of public code from GitHub.
Updated: 10/30/2024
Built by Microsoft
OSCAR
Concern: 7/10. Open Super-large Crawled Aggregated coRpus, a multilingual dataset derived from Common Crawl.
Updated: 10/30/2024
ROOTS
Concern: 7/10. The BigScience ROOTS Corpus, a large, documented, multilingual dataset.
Updated: 10/30/2024
WebText / OpenWebText
Concern: 7/10. High-quality text scraped from outbound Reddit links.
Updated: 10/30/2024
Built by OpenAI
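WebText's sourcing strategy was to treat Reddit upvotes as a lightweight quality signal: OpenAI collected outbound URLs from submissions that received at least 3 karma. A minimal sketch of that link-selection step, assuming submissions arrive as dicts with hypothetical "url" and "score" fields (these names are illustrative, not OpenAI's actual schema):

```python
# Hypothetical sketch of WebText-style link selection: keep outbound URLs
# from Reddit submissions with at least 3 karma, as a proxy for human
# curation. The "url"/"score" field names are assumptions for illustration.
KARMA_THRESHOLD = 3

def select_urls(submissions):
    """Yield unique outbound URLs from sufficiently upvoted submissions."""
    seen = set()
    for post in submissions:
        url = post.get("url", "")
        # Require the karma threshold described in the WebText write-up.
        if post.get("score", 0) < KARMA_THRESHOLD:
            continue
        # Skip empty URLs, internal Reddit links, and duplicates.
        if not url or "reddit.com" in url or url in seen:
            continue
        seen.add(url)
        yield url
```

The selected URLs would then be fetched and their page text extracted; OpenWebText reproduces roughly this recipe with open tooling.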
C4
Concern: 6/10. A colossal, cleaned version of Common Crawl.
Updated: 10/30/2024
Built by Google
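The "cleaning" in C4 refers to heuristic filters described in the T5 paper: keep only lines ending in terminal punctuation with at least five words, and drop pages that contain source-code markers or boilerplate text, or that have fewer than three sentences. A simplified sketch of those filters (the real pipeline also deduplicates repeated three-sentence spans and applies a blocklist, both omitted here):

```python
import re
from typing import Optional

# Simplified C4-style cleaning heuristics, per the T5 paper's description.
TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text: str) -> Optional[str]:
    """Apply line- and page-level filters; return cleaned text or None."""
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that end in terminal punctuation...
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # ...and contain at least five words.
        if len(line.split()) < 5:
            continue
        kept_lines.append(line)
    cleaned = "\n".join(kept_lines)
    # Drop pages with boilerplate markers or likely source code.
    if "lorem ipsum" in cleaned.lower() or "{" in cleaned:
        return None
    # Drop pages with fewer than three sentences (rough punctuation count).
    if len(re.findall(r"[.!?]", cleaned)) < 3:
        return None
    return cleaned
```

Filters this blunt discard most of Common Crawl, which is why C4 is far smaller than the raw crawl it starts from.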
DataComp
Concern: 6/10. 12.8 billion image-text pairs released for research.
Updated: 10/30/2024
FFHQ
Concern: 6/10. Flickr-Faces-HQ: 70,000 high-quality face images.
Updated: 10/30/2024
Built by NVIDIA
PiLiMi
Concern: 6/10. Pirate Library Mirror, a major mirror of shadow libraries.
Updated: 10/30/2024
RefinedWeb
Concern: 5/10. Filtered web data used to train the Falcon models.
Updated: 10/30/2024
arXiv
Concern: 3/10. Preprint scientific papers.
Updated: 10/30/2024
Wikipedia
Concern: 2/10. A corpus of all articles from Wikipedia.
Updated: 10/30/2024