EleutherAI Releases Common Pile v0.1, an 8TB Dataset for AI Training

June 09, 2025

EleutherAI has announced the release of the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text, designed for training large language models.

The dataset, curated in collaboration with several academic and research institutions, aims to provide a comprehensive resource for AI research and development, as detailed in a company blog post.

The Common Pile v0.1 includes content from 30 diverse sources, such as research papers, code, books, and educational materials. It serves as a successor to the Pile, EleutherAI's previous dataset, and is intended to address the need for large-scale, openly licensed data in AI training. Alongside the dataset, EleutherAI has released two models, Comma v0.1-1T and Comma v0.1-2T, trained on 1 trillion and 2 trillion tokens respectively, demonstrating competitive performance with models trained on unlicensed data.

This release underscores EleutherAI's commitment to open science and transparency in AI research, giving researchers tools and resources to build and evaluate AI models under consistent data conditions. The accompanying paper, dataset, and models are available through arXiv, Hugging Face, and GitHub, facilitating broad access and collaboration in the AI community.
