OpenAI and Paradigm launch EVMbench to benchmark AI on EVM smart-contract security

OpenAI and crypto investor Paradigm have released EVMbench, an open-source benchmark suite for evaluating AI agents and automated tools on Ethereum Virtual Machine (EVM) smart-contract security. Building on an earlier announcement, the full release bundles 120 high-severity vulnerabilities drawn from 40 audits, including public audit competitions and Tempo's security work. It also provides labeled test cases, a taxonomy of flaw types (reentrancy, access control, integer bugs, logic errors and others) and reproducible evaluation tooling.

EVMbench runs agents in three modes: detect (find known flaws), patch (propose fixes without breaking existing functionality) and exploit (attempt controlled fund drains in an isolated sandbox). Teams can use these modes to measure detection rates, false positives, false negatives and coverage gaps across both AI models and conventional static-analysis scanners. Early results vary widely by task. Newer models (notably OpenAI's GPT-5.3-Codex in initial tests) outperformed earlier models on exploit tasks, while detection and patching remain imperfect.

OpenAI and Paradigm emphasize transparency: the datasets, evaluation scripts and documentation are public to enable reproducible comparisons and community contributions. The project is framed as both a measurement tool and a warning. As AI capabilities improve, they can help defenders and attackers alike, which underlines the need for stronger defenses and more rigorous auditing. For crypto traders, EVMbench could affect market risk indirectly over time: better automated detection and patching of DeFi vulnerabilities could reduce exploit frequency and protocol risk, though any immediate price effect is uncertain.
Neutral
EVMbench is primarily a research and measurement tool aimed at improving detection, patching and exploit simulation for EVM-based smart contracts. For the traded assets mentioned or implied (Ethereum and DeFi tokens), the news is neutral in the short term, because publishing a benchmark and early model results does not immediately change exploit rates or token fundamentals. Over the medium to long term, wider adoption of better automated auditing could reduce protocol-level risk and exploit frequency. That would be mildly bullish for Ethereum-based DeFi tokens, since it lowers systemic security risk and raises confidence in protocols. Conversely, the benchmark also shows that AI can be used to craft exploits, which could sustain or increase attack sophistication until defenses catch up. Traders should therefore treat this as a gradual, structural shift: a limited short-term price reaction, a potential modest reduction in risk premia for well-audited protocols over time, and continued caution while AI-assisted offensive capabilities evolve.