UC Berkeley Research

CyberGym

Name: CyberGym
Author: UC Berkeley

Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.

Visit CyberGym View Leaderboard

1,507

Benchmark Instances

188

Software Projects

Zero-Day Vulnerabilities

540

PoCs Generated

What is CyberGym?

A comprehensive framework for testing AI agents on real-world cybersecurity challenges

CyberGym tests AI agents' ability to handle real-world cybersecurity tasks by systematically gathering real-world vulnerabilities discovered and patched across 188 large software projects. Each instance is derived from vulnerabilities found by OSS-Fuzz, Google's continuous fuzzing campaign, ensuring authentic security challenges from widely-used codebases.

Evaluation Methodology

Agents receive vulnerability descriptions and unpatched codebases spanning thousands of files
Must generate proof-of-concept (PoC) tests that reproduce described vulnerabilities
Iterative refinement based on execution feedback from test environments
Success determined by triggering pre-patch and not triggering post-patch versions

Task Difficulty Levels

Level 1: Vulnerability reproduction with description and unpatched codebase
Level 2: Vulnerability discovery given only the codebase
Level 3: One-day analysis using patch information to simulate real-world conditions

Zero-Day Vulnerability Discoveries

AI agents successfully identified new vulnerabilities in production software

Automated agents successfully identified new vulnerabilities that cause crashes in post-patch executables across multiple projects. Initial testing generated 540 PoCs across 54 projects, of which 32 still triggered crashes on the latest versions. This yielded 9 unique vulnerabilities affecting 6 projects.

A subsequent experiment using OpenHands with GPT-4.1 expanded the scope to 431 projects containing 1,748 executables on the latest codebase, triggering 16 additional crashes. Manual inspection confirmed 8 of these as unique vulnerabilities.

Total Discoveries

Zero-Day Vulnerabilities15

Unpatched (Previously Disclosed)2

Total Vulnerabilities17

Vulnerability Types

Out-of-Bounds Reads4

Out-of-Bounds Writes1

Null Pointer Dereferences6

Stack Overflows4

Responsible Disclosure: All confirmed vulnerabilities have been responsibly disclosed to the respective project maintainers.

CyberGym Leaderboard

Agent performance on CyberGym Level 1 - vulnerability reproduction with description and unpatched codebase

CyberGym AI Agent Performance Leaderboard

% Target Vuln. Reproduced

Percentage of instances where the agent successfully reproduces the target vulnerabilities by generating working proof-of-concept (PoC) code.

% New Vuln. Found

Percentage of instances where the agent triggers crashes in the post-patch executable, indicating the discovery of new vulnerabilities different from the vulnerability description.

Live Leaderboard

The official CyberGym leaderboard ranks AI agents based on their performance across 1,507 benchmark instances. Visit the CyberGym website to see the latest rankings and submit your own agent for evaluation.

View Live Leaderboard

Key Research Findings

Critical insights into AI agents' cybersecurity capabilities

Successful Reproduction

AI agents demonstrated the ability to successfully reproduce known vulnerabilities by reasoning across entire codebases with thousands of files and millions of lines of code.

New Vulnerability Discovery

Agents discovered 15 zero-day vulnerabilities and 2 unpatched vulnerabilities across multiple production software projects, demonstrating real-world security research capabilities.

Iterative Refinement

Agents effectively used execution feedback to iteratively refine their PoCs, improving success rates through multiple attempts and learning from failures.

Research Team

UC Berkeley researchers behind CyberGym

Jingxuan He

Lead Researcher

Zhun Wang

Co-Lead Researcher

Equal Contribution

Tianneng Shi

Co-Lead Researcher

Equal Contribution

Matthew Cai

Researcher

Jialin Zhang

Researcher

Dawn Song

Principal Investigator

Institution: University of California, Berkeley

Explore CyberGym

Visit the official CyberGym website to access the full benchmark dataset, view the live leaderboard, and learn how to evaluate your own AI agents on real-world cybersecurity challenges.

Visit CyberGym Website Learn About Agentic Security