CyberGym AI Agent Cybersecurity Evaluation
UC Berkeley Research

CyberGym

Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.

1,507
Benchmark Instances
188
Software Projects
15
Zero-Day Vulnerabilities
540
PoCs Generated
What is CyberGym?
A comprehensive framework for testing AI agents on real-world cybersecurity challenges
CyberGym Benchmark Framework

CyberGym tests AI agents' ability to handle real-world cybersecurity tasks by systematically gathering real-world vulnerabilities discovered and patched across 188 large software projects. Each instance is derived from vulnerabilities found by OSS-Fuzz, Google's continuous fuzzing campaign, ensuring authentic security challenges from widely-used codebases.

Evaluation Methodology

  • Agents receive vulnerability descriptions and unpatched codebases spanning thousands of files
  • Must generate proof-of-concept (PoC) tests that reproduce described vulnerabilities
  • Iterative refinement based on execution feedback from test environments
  • Success determined by triggering pre-patch and not triggering post-patch versions

Task Difficulty Levels

  • Level 1: Vulnerability reproduction with description and unpatched codebase
  • Level 2: Vulnerability discovery given only the codebase
  • Level 3: One-day analysis using patch information to simulate real-world conditions
Zero-Day Vulnerability Discoveries
AI agents successfully identified new vulnerabilities in production software
Zero-Day Vulnerability Discoveries

Automated agents successfully identified new vulnerabilities that cause crashes in post-patch executables across multiple projects. Initial testing generated 540 PoCs across 54 projects, of which 32 still triggered crashes on the latest versions. This yielded 9 unique vulnerabilities affecting 6 projects.

A subsequent experiment using OpenHands with GPT-4.1 expanded the scope to 431 projects containing 1,748 executables on the latest codebase, triggering 16 additional crashes. Manual inspection confirmed 8 of these as unique vulnerabilities.

Total Discoveries

Zero-Day Vulnerabilities15
Unpatched (Previously Disclosed)2
Total Vulnerabilities17

Vulnerability Types

Out-of-Bounds Reads4
Out-of-Bounds Writes1
Null Pointer Dereferences6
Stack Overflows4

Responsible Disclosure: All confirmed vulnerabilities have been responsibly disclosed to the respective project maintainers.

CyberGym Leaderboard
Agent performance on CyberGym Level 1 - vulnerability reproduction with description and unpatched codebase
CyberGym AI Agent Performance Leaderboard

% Target Vuln. Reproduced

Percentage of instances where the agent successfully reproduces the target vulnerabilities by generating working proof-of-concept (PoC) code.

% New Vuln. Found

Percentage of instances where the agent triggers crashes in the post-patch executable, indicating the discovery of new vulnerabilities different from the vulnerability description.

Live Leaderboard

The official CyberGym leaderboard ranks AI agents based on their performance across 1,507 benchmark instances. Visit the CyberGym website to see the latest rankings and submit your own agent for evaluation.

View Live Leaderboard
Key Research Findings
Critical insights into AI agents' cybersecurity capabilities
CyberGym Research Methodology

Successful Reproduction

AI agents demonstrated the ability to successfully reproduce known vulnerabilities by reasoning across entire codebases with thousands of files and millions of lines of code.

New Vulnerability Discovery

Agents discovered 15 zero-day vulnerabilities and 2 unpatched vulnerabilities across multiple production software projects, demonstrating real-world security research capabilities.

Iterative Refinement

Agents effectively used execution feedback to iteratively refine their PoCs, improving success rates through multiple attempts and learning from failures.

Research Team
UC Berkeley researchers behind CyberGym
Jingxuan He
Lead Researcher
Zhun Wang
Co-Lead Researcher
Equal Contribution
Tianneng Shi
Co-Lead Researcher
Equal Contribution
Matthew Cai
Researcher
Jialin Zhang
Researcher
Dawn Song
Principal Investigator

Institution: University of California, Berkeley

Explore CyberGym

Visit the official CyberGym website to access the full benchmark dataset, view the live leaderboard, and learn how to evaluate your own AI agents on real-world cybersecurity challenges.