PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Published as an arXiv preprint, 2026
A benchmark that evaluates AI agents on property-based testing, a skill distinct from general code generation. It comprises 100 curated problems across 40 Python libraries, with 365 injected bugs that are deliberately difficult to trigger with random inputs. To expose them, agents must read documentation, identify invariants, and specify targeted Hypothesis strategies.
Recommended citation: Xinqi Wang*, Lucas Jing*, Liao Zhang, Simon S. Du (*equal contribution). (2026). "PBT-Bench: Benchmarking AI Agents on Property-Based Testing." arXiv:2605.15229. https://arxiv.org/abs/2605.15229
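To illustrate the abstract's point that some bugs are hard to trigger with random inputs, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration and is not taken from the benchmark: `buggy_clamp`, its injected bug, the invariant, and both input generators are hypothetical. It compares how often a uniform random generator and a generator biased toward a narrow edge case violate the same invariant.

```python
import random

BOUND = 10**6  # hypothetical input range, chosen only for illustration


def buggy_clamp(x: int, lo: int, hi: int) -> int:
    """Hypothetical function with an injected bug (not from the benchmark).

    It clamps x into [lo, hi] correctly, except when the interval is
    degenerate (lo == hi), where it returns an out-of-range value.
    """
    if lo == hi:
        return lo + 1  # injected bug: the correct result would be lo
    return max(lo, min(x, hi))


def clamp_in_range(x: int, lo: int, hi: int) -> bool:
    """Invariant under test: the clamped value lies within [lo, hi]."""
    return lo <= buggy_clamp(x, lo, hi) <= hi


def naive_inputs(rng: random.Random) -> tuple[int, int, int]:
    """Uniform random inputs: lo == hi is extremely unlikely."""
    lo, hi = sorted((rng.randint(-BOUND, BOUND), rng.randint(-BOUND, BOUND)))
    return rng.randint(-BOUND, BOUND), lo, hi


def targeted_inputs(rng: random.Random) -> tuple[int, int, int]:
    """Inputs biased toward narrow and degenerate intervals."""
    lo = rng.randint(-BOUND, BOUND)
    hi = lo + rng.choice([0, 0, 1, 2])  # width 0 is chosen about half the time
    return rng.randint(lo - 2, hi + 2), lo, hi


def count_violations(gen, trials: int = 1000, seed: int = 0) -> int:
    """Count how many generated inputs break the invariant."""
    rng = random.Random(seed)
    return sum(not clamp_in_range(*gen(rng)) for _ in range(trials))


if __name__ == "__main__":
    print("naive violations:   ", count_violations(naive_inputs))
    print("targeted violations:", count_violations(targeted_inputs))
```

With Hypothesis, the targeted generator would typically be written as a custom strategy and the invariant as a property test. The sketch only shows why the choice of input distribution matters.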
