Microsoft Releases Open-Source Benchmark for AI Cybersecurity Agents

October 20, 2025
Microsoft has launched ExCyTIn-Bench, an open-source benchmark designed to test AI agents on realistic cybersecurity investigations using simulated multi-stage attacks and data from Microsoft Sentinel.

In a post on its security blog, Microsoft has introduced ExCyTIn-Bench, an open-source benchmarking tool for evaluating how AI agents perform in realistic cybersecurity scenarios. The benchmark simulates multi-stage cyberattacks within a controlled Microsoft Azure environment to measure how effectively AI systems investigate and reason through complex incidents.

ExCyTIn-Bench includes 57 log tables from Microsoft Sentinel and related services, reflecting the scale and noise of real-world security operations. The tool assesses not only the accuracy of an AI agent’s answers but also the logical steps taken to reach them, offering fine-grained reward signals for each investigative action.
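The per-step scoring idea can be made concrete with a minimal sketch. Note that this is not ExCyTIn-Bench's actual API: the `InvestigationStep` type, the sample reward values, and the 50/50 weighting between investigative process and final answer are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InvestigationStep:
    """One action taken by the agent, e.g. a query against a log table."""
    action: str    # description of the investigative action
    reward: float  # fine-grained score in [0, 1] for this step (hypothetical)

def score_episode(steps: list[InvestigationStep], final_answer_correct: bool) -> float:
    """Combine per-step rewards with final-answer accuracy into one episode score.

    Hypothetical weighting: half the credit comes from the investigative
    process, half from reaching the right conclusion.
    """
    if not steps:
        return 0.0
    process_score = sum(s.reward for s in steps) / len(steps)
    answer_score = 1.0 if final_answer_correct else 0.0
    return 0.5 * process_score + 0.5 * answer_score

# Example: an agent that queried the right tables but drew the wrong conclusion
trace = [
    InvestigationStep("query sign-in logs for anomalous logins", reward=0.9),
    InvestigationStep("pivot to process events on the flagged host", reward=0.7),
]
print(score_episode(trace, final_answer_correct=False))  # 0.4
```

The point of such a scheme is that two agents reaching the same wrong answer can still be distinguished by how sound their investigative steps were.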

In recent evaluations, OpenAI’s GPT-5 in high-reasoning mode achieved the highest average reward score at 56.2%, followed by OpenAI’s o3 at 45.6%. Other tested models included xAI’s Grok 4, Alibaba’s Qwen3-235B (thinking mode), Meta’s Llama 4 Maverick, and Microsoft’s Phi-4 (14B). Google’s Gemini models were excluded due to benchmarking restrictions.

Microsoft is using ExCyTIn-Bench to improve its own security-focused AI products such as Microsoft Security Copilot, Sentinel, and Defender. The benchmark is publicly available, allowing researchers and developers to test and compare AI models for cybersecurity performance and share their findings.
