In a significant step toward promoting transparency and innovation in AI research, EleutherAI, a nonprofit AI research organization, has unveiled The Common Pile v0.1: an 8-terabyte dataset of openly licensed and public-domain text. The collection was curated over two years in collaboration with AI startups Poolside and Hugging Face, along with several academic institutions.
The Common Pile v0.1 gives researchers a resource for training AI models without resorting to unlicensed, copyrighted data. It arrives as AI companies such as OpenAI face ongoing lawsuits over training datasets built by scraping the web, which often sweep in copyrighted material.
EleutherAI's executive director, Stella Biderman, highlighted the impact of these lawsuits on the AI research field, stating that they have "drastically decreased" transparency among AI companies. This lack of transparency, she argues, hinders the broader AI research community's ability to understand how models function and identify potential flaws.
The Common Pile v0.1, available for download from Hugging Face's AI dev platform and GitHub, was crafted in consultation with legal experts. It incorporates a diverse range of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also leveraged Whisper, OpenAI's open-source speech-to-text model, to transcribe audio content.
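For illustration, a minimal sketch of the kind of transcription step described above, using the openai-whisper package, might look like this (the audio filename is an assumption, not taken from EleutherAI's actual pipeline):

```python
# Minimal sketch: transcribing an audio file with OpenAI's open-source
# Whisper model (pip install openai-whisper). The filename below is
# illustrative, not from EleutherAI's actual pipeline.
import whisper

model = whisper.load_model("base")  # smaller checkpoints trade accuracy for speed
result = model.transcribe("lecture_audio.mp3")
print(result["text"])
```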
EleutherAI's own models, Comma v0.1-1T and Comma v0.1-2T, demonstrate the potential of The Common Pile v0.1. Both are 7-billion-parameter models trained on a fraction of the dataset, and they rival Meta's first Llama model on benchmarks for coding, image understanding, and math, showing that openly licensed data can produce high-performing AI models.
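For readers who want to experiment, a minimal sketch of loading a Comma checkpoint with Hugging Face's transformers library might look like the following; the model identifier and prompt are assumptions, so check EleutherAI's Hugging Face page for the published names.

```python
# Minimal sketch: generating text with a Comma model via transformers.
# The model identifier is hypothetical; consult EleutherAI's Hugging Face
# organization page for the actual checkpoint names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "common-pile/comma-v0.1-2t"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```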
Biderman challenges the prevailing notion that unlicensed text is essential for driving AI performance. As the pool of accessible openly licensed and public domain data expands, she predicts that the quality of models trained on this content will improve.
The Common Pile v0.1 is also an attempt to make amends for EleutherAI's past. The organization previously released The Pile, an open collection of training text that included copyrighted material, drawing criticism and legal scrutiny. EleutherAI is now committed to releasing open datasets more frequently, in collaboration with its research and infrastructure partners, including the University of Toronto, which helped lead the research.
This initiative not only promotes transparency and innovation in AI research but also paves the way for AI models that can compete with proprietary alternatives while respecting licensing and copyright.