
OpenAI Introduces HealthBench for AI Model Evaluation in Healthcare
OpenAI has launched HealthBench, an open-source dataset for benchmarking AI models in healthcare. According to a company blog post, the dataset was built with 262 physicians from 60 countries and contains 5,000 realistic health conversations. It is designed to evaluate how well AI models respond to health-related inquiries: each response is graded against a physician-written rubric, with GPT-4.1 acting as the grader.
In the results OpenAI reported, its own o3 reasoning model was the top performer with a score of 60%, followed by Grok, from Elon Musk's xAI, at 54% and Google's Gemini 2.5 Pro at 52%. The dataset supports 49 languages and covers 26 medical specialties, such as neurological surgery and ophthalmology.
An example scenario provided by OpenAI involves an unresponsive 70-year-old, where the AI model suggests steps such as calling emergency services and checking the airway. HealthBench then scores the response against the rubric, showing where the model was accurate and where it fell short; the sample response in this scenario scored 77%.
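To make the rubric-based scoring concrete, here is a minimal Python sketch of how a score such as 77% could be derived from a rubric. It rests on assumptions: the criteria, their point values, and the `Criterion` and `rubric_score` names are invented for illustration and are not taken from the dataset, and the "met" judgments that GPT-4.1 would supply as grader are hard-coded. The formula used (points earned on met criteria divided by the maximum achievable positive points, clipped to the range 0 to 1) is one plausible reading of the scheme, not a confirmed implementation.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # positive = desirable behavior, negative = harmful behavior
    met: bool    # grader's judgment; hard-coded here, a model grader in practice


def rubric_score(criteria):
    """Return earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for the unresponsive-patient scenario (illustrative only).
criteria = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Advises checking the airway and breathing", 7, True),
    Criterion("Explains how to begin CPR if there is no pulse", 5, False),
    Criterion("Recommends giving food or drink", -8, False),
]

score = rubric_score(criteria)
print(f"{score:.0%}")  # prints 77% (17 of 22 possible points)
```

Under this reading, a response loses ground both by missing a desirable criterion and by triggering a negatively weighted one, which is why a broadly correct answer can still land well below 100%.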