Anthropic Releases Bloom, an Open-Source Framework for AI Behavior Evaluation

December 22, 2025
Anthropic has introduced Bloom, an open-source tool designed to automate behavioral evaluations of large AI models. The system generates and scores scenarios to measure behaviors like bias and self-preservation across multiple models.

Anthropic has released Bloom, an open-source framework for automated behavioral evaluations of frontier AI models, according to an announcement on the company's website. The tool allows researchers to specify a target behavior and automatically generate scenarios that test how often and how severely that behavior appears.

Bloom operates through four stages (understanding, ideation, rollout, and judgment) to produce evaluation suites that quantify the presence of specific behaviors. It integrates with research tools such as Weights & Biases for large-scale experiments and exports results in Inspect-compatible formats. Each run can generate fresh scenarios, while a seed configuration file keeps results reproducible.
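
The announcement does not publish Bloom's configuration schema, so the sketch below is purely illustrative: every field name is a hypothetical stand-in, meant only to show how a single seed file could pin down a run (target behavior, models under test, judge, random seed) while still letting each run generate new scenarios.

```python
# Hypothetical sketch only: Bloom's real configuration schema is not
# detailed in the announcement, and none of these field names are
# confirmed. This illustrates the idea, not the actual API.
seed_config = {
    "target_behavior": "self-preferential bias",  # behavior to elicit and score
    "pipeline_stages": [                          # the four stages named above
        "understanding", "ideation", "rollout", "judgment",
    ],
    "num_scenarios": 100,              # scenarios the ideation stage should produce
    "target_models": ["model-a", "model-b"],  # placeholder model identifiers
    "judge_model": "judge-model",      # placeholder judge for the judgment stage
    "random_seed": 1234,               # fixing this makes a run reproducible
    "export_format": "inspect",        # Inspect-compatible output, per the announcement
}
```

Under this framing, re-running with the same seed would reproduce a prior suite, while changing or omitting the seed yields fresh scenarios for the same target behavior.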

The company reported benchmark results for behaviors including delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias across 16 models. Anthropic said these evaluations were built with Bloom within days, and the scores aligned closely with human-labeled judgments. In validation tests, Claude Opus 4.1 correlated most strongly with human scoring among the models evaluated as judges, achieving a Spearman correlation of 0.86.
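
For readers unfamiliar with the validation metric: Spearman correlation compares two rankings rather than raw values, which suits checking whether an automated judge orders transcripts by severity the way human labelers do. The snippet below uses made-up scores purely to show the computation; only the metric itself comes from the announcement.

```python
# Illustrative only: toy scores, not Anthropic's validation data.
from scipy.stats import spearmanr

human_scores = [1, 3, 2, 5, 4, 4, 2]  # hypothetical human severity labels
judge_scores = [1, 2, 3, 5, 4, 5, 2]  # hypothetical model-judge scores

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {rho:.2f}")  # 1.0 would mean identical rankings
```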

Bloom complements Anthropic’s earlier open-source tool, Petri, which explores AI models' behavioral profiles through simulated interactions. Researchers can access Bloom and its documentation via the official GitHub repository.
