Microsoft Releases Open-Source Benchmark for AI Cybersecurity Agents

October 20, 2025
Microsoft has launched ExCyTIn-Bench, an open-source benchmark designed to test AI agents on realistic cybersecurity investigations using simulated multi-stage attacks and data from Microsoft Sentinel.

In a post on its security blog, Microsoft has introduced ExCyTIn-Bench, an open-source benchmarking tool for evaluating how AI agents perform in realistic cybersecurity scenarios. The benchmark simulates multi-stage cyberattacks within a controlled Microsoft Azure environment to measure how effectively AI systems investigate and reason through complex incidents.

ExCyTIn-Bench includes 57 log tables from Microsoft Sentinel and related services, reflecting the scale and noise of real-world security operations. The tool assesses not only the accuracy of an AI agent’s answers but also the logical steps taken to reach them, offering fine-grained reward signals for each investigative action.
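The per-step scoring idea can be made concrete with a minimal sketch. Note that this is not ExCyTIn-Bench's actual API: the `InvestigationStep` type, the sample reward values, and the 50/50 weighting between investigative process and final answer are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InvestigationStep:
    """One action taken by the agent, e.g. a query against a log table."""
    action: str    # description of the investigative action
    reward: float  # fine-grained score in [0, 1] for this step (hypothetical)

def score_episode(steps: list[InvestigationStep], final_answer_correct: bool) -> float:
    """Combine per-step rewards with final-answer accuracy into one episode score.

    Hypothetical weighting: half the credit comes from the investigative
    process, half from reaching the right conclusion.
    """
    if not steps:
        return 0.0
    process_score = sum(s.reward for s in steps) / len(steps)
    answer_score = 1.0 if final_answer_correct else 0.0
    return 0.5 * process_score + 0.5 * answer_score

# Example: an agent that queried the right tables but drew the wrong conclusion
trace = [
    InvestigationStep("query sign-in logs for anomalous logins", reward=0.9),
    InvestigationStep("pivot to process events on the flagged host", reward=0.7),
]
print(score_episode(trace, final_answer_correct=False))  # 0.4
```

The point of such a scheme is that two agents reaching the same wrong answer can still be distinguished by how sound their investigative steps were.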

In recent evaluations, OpenAI’s GPT-5 in high-reasoning mode achieved the highest average reward score at 56.2%, followed by OpenAI’s o3 at 45.6%. Other tested models included xAI’s Grok 4, Alibaba’s Qwen3-235B (thinking mode), Meta’s Llama 4 Maverick, and Microsoft’s Phi-4 (14B). Google’s Gemini models were excluded due to benchmarking restrictions.

Microsoft is using ExCyTIn-Bench to improve its own security-focused AI products such as Microsoft Security Copilot, Sentinel, and Defender. The benchmark is publicly available, allowing researchers and developers to test and compare AI models for cybersecurity performance and share their findings.
