actAVA.ai Releases CHI-Bench to Test AI Agents on Healthcare Workflows

May 21, 2026
actAVA.ai has introduced CHI-Bench, a benchmark evaluating AI agents from companies like Anthropic, OpenAI, and Google across 75 healthcare workflows. The top-performing agent succeeded in only 28% of cases.

actAVA.ai has released CHI-Bench, a new benchmark designed to evaluate AI agents across real U.S. healthcare workflows, according to a press release. The benchmark tested 30 advanced agents from Anthropic, OpenAI, Google, x.AI, DeepSeek, and Z.ai across 75 workflows and found that even the best system failed in about 72% of cases.

Each CHI-Bench test runs an agent through 60 to 80 steps across multiple clinical stages, covering processes like intake, review, and authorization. The system evaluates every step and artifact using deterministic tests and an LLM-based judge to check evidence grounding, consent, and consistency.
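As a rough illustration of how deterministic tests and a judge might be combined at the step level, here is a minimal Python sketch. It is not CHI-Bench's actual code: the `Step` type, the field names, the `has_required_fields` check, the stubbed judge, and the rule that a case passes only if every step passes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One agent step and the artifact it produced (hypothetical shape)."""
    stage: str                 # e.g. "intake", "review", "authorization"
    artifact: Dict[str, str]   # fields the agent wrote at this step


def has_required_fields(step: Step) -> bool:
    """Hypothetical deterministic check: required fields are present and non-empty."""
    return all(step.artifact.get(key) for key in ("patient_id", "evidence"))


def stub_judge(step: Step) -> Dict[str, bool]:
    """Stand-in for an LLM-based judge; a real judge would call a model.

    Returns one verdict per criterion named in the article.
    """
    return {
        "evidence_grounding": bool(step.artifact.get("evidence")),
        "consent": step.artifact.get("consent") == "yes",
        "consistency": True,  # assumed always consistent in this stub
    }


def case_passes(
    steps: List[Step],
    checks: List[Callable[[Step], bool]],
    judge: Callable[[Step], Dict[str, bool]],
) -> bool:
    """Assumed rule: a case passes only if every step clears every check and verdict."""
    for step in steps:
        if not all(check(step) for check in checks):
            return False
        if not all(judge(step).values()):
            return False
    return True


good_case = [
    Step("intake", {"patient_id": "p1", "evidence": "note 3", "consent": "yes"}),
    Step("review", {"patient_id": "p1", "evidence": "lab 7", "consent": "yes"}),
]
bad_case = good_case + [
    Step("authorization", {"patient_id": "p1", "evidence": "form 2", "consent": "no"}),
]

print(case_passes(good_case, [has_required_fields], stub_judge))
print(case_passes(bad_case, [has_required_fields], stub_judge))
```

Under an all-steps-must-pass rule like this one, a single failed step in a 60 to 80 step run fails the whole case, which is one plausible reason pass rates on long workflows could be low.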

Anthropic’s Claude Code with Opus 4.6 achieved the highest score with a 28% pass rate, while OpenAI’s Codex with GPT-5.5 followed at 21%. Performance varied by domain, with utilization review reaching 41% and care management 32%. Reliability remained low, as no agent passed more than 20% of repeated cases.

CHI-Bench was developed in collaboration with over 20 institutions, including Johns Hopkins, Stanford, and Oxford. The benchmark is open under the Apache 2.0 license on GitHub, and a public leaderboard is now accepting community submissions.
