EleutherAI Releases Common Pile v0.1, an 8TB Dataset for AI Training
EleutherAI has announced the release of the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text designed for training large language models. Curated in collaboration with several academic and research institutions, the dataset aims to provide a comprehensive resource for AI research and development, as detailed in an EleutherAI blog post.
The Common Pile v0.1 includes content from 30 diverse sources, such as research papers, code, books, and educational materials. It serves as a successor to the Pile, EleutherAI's previous dataset, and is intended to address the need for large-scale, openly licensed data in AI training. Alongside the dataset, EleutherAI has released two models, Comma v0.1-1T and Comma v0.1-2T, trained on 1 trillion and 2 trillion tokens respectively. According to EleutherAI, both perform competitively with models trained on unlicensed data.
This release underscores EleutherAI's commitment to open science and transparency in AI research, giving researchers tools and resources to build and evaluate AI models under consistent data conditions. The dataset and associated models are available through arXiv, Hugging Face, and GitHub, making them widely accessible to the AI community.