AGI benchmark ARC-AGI-3 shows models far from AGI; scores <1%
ARC-AGI-3, a new “AGI benchmark” released by the ARC Prize Foundation, challenges recent “AGI achieved” claims. Every frontier model scored below 1%, while humans reached 100% across 135 novel game-like environments.
Gemini 3.1 Pro led at 0.37%, OpenAI’s GPT-5.4 scored 0.26%, Anthropic’s Claude Opus 4.6 scored 0.25%, and xAI’s Grok-4.20 scored 0.00%. Humans solved all environments (100%) on the first run with no instructions.
ARC-AGI-3 is designed to test true generalization: agents must explore, plan, and learn from scratch in unknown settings, with no memorization dataset available (110 of 135 environments are kept private/locked). Scoring uses Relative Human Action Efficiency (RHAE), heavily penalizing inefficient wandering, backtracking and guessing.
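To make the efficiency penalty concrete, here is a minimal sketch of how an RHAE-style score could be computed. The article does not give the official formula, so everything in it is an assumption for illustration: the function names, the rule that an unsolved environment scores zero, and the choice to take the ratio of human actions to agent actions capped at 1.0.

```python
from typing import Iterable, Tuple


def rhae_score(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical per-environment score (not the official ARC-AGI-3 formula).

    Assumption: an unsolved environment scores 0, and a solved one scores the
    ratio of human actions to agent actions, capped at 1.0.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)


def mean_rhae(results: Iterable[Tuple[int, int, bool]]) -> float:
    """Average the hypothetical per-environment scores over all environments."""
    scores = [rhae_score(h, a, s) for h, a, s in results]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up numbers: (human_actions, agent_actions, solved)
example_runs = [
    (20, 20, True),    # as efficient as the human baseline
    (20, 200, True),   # solved, but with ten times as many actions
    (20, 50, False),   # never solved
]
overall = mean_rhae(example_runs)
```

Under these assumptions, an agent that wanders or backtracks inflates its action count and shrinks its score even when it eventually solves the environment, which is the behavior the benchmark is described as penalizing.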
The article notes a methodological debate: a Duke-built harness reportedly pushed Claude Opus 4.6 far higher on a single variant, but the official ARC-AGI-3 overall score remained 0.25%. The ARC Prize 2026 competition will award $2 million across tracks on Kaggle, but the current AGI benchmark results suggest today’s systems still fall well short of “general intelligence.”
Neutral
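This is news about an assessment of the upper limits of AI capability, not a direct change to crypto protocols, regulation, or on-chain liquidity. Its verifiable impact on crypto assets is therefore limited, and the overall read is neutral.
Short term: market sentiment may be affected by a cooling of the AGI narrative. The article describes a typical narrative check: after figures such as Jensen Huang signaled that AGI was close or already achieved, ARC-AGI-3 produced scores below 1%. Similar situations have been common in past phases where an overstated technical milestone was later contradicted by a stricter benchmark. They tend to bring a slight cooling in risk appetite but usually do not immediately change on-chain fundamentals.
Medium to long term: ARC-AGI-3 emphasizes generalizable reasoning and learning from scratch, which may affect capital's expectations for “truly scalable capability” and, in turn, shift the flow of funds into the tech sector and AI-related narratives. However, the more direct drivers of the crypto market remain interest rates, liquidity, regulation, and on-chain demand. This result is more of a repricing signal for the AI industry's roadmap and valuation logic, so it gives little directional push to BTC or major altcoins, which are likely to stay range-bound or see only sentiment-driven noise.
Conclusion: absent a direct crypto catalyst or a negative on-chain shock, the market impact is more likely to be a slight, sentiment-level shift than a change in trend.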