AGI benchmark ARC-AGI-3 shows models far from AGI; scores <1%

Published: 2026-03-26 19:40:41

ARC-AGI-3, a new “AGI benchmark” released by the ARC Prize Foundation, challenges recent “AGI achieved” claims. Every frontier model scored below 1% across 135 novel game-like environments, while humans solved all of them (100%) on the first run with no instructions. Gemini 3.1 Pro led at 0.37%, followed by OpenAI’s GPT-5.4 at 0.26%, Anthropic’s Claude Opus 4.6 at 0.25%, and xAI’s Grok-4.20 at 0.00%. ARC-AGI-3 is designed to test true generalization: agents must explore, plan, and learn from scratch in unknown settings, and because 110 of the 135 environments are kept private, there is no dataset to memorize. Scoring uses Relative Human Action Efficiency (RHAE), which heavily penalizes inefficient wandering, backtracking, and guessing. The article notes a methodological debate: a Duke-built harness reportedly pushed Claude Opus 4.6 far higher on a single variant, but the official ARC-AGI-3 overall score remained 0.25%. The ARC Prize 2026 competition will award $2 million across tracks on Kaggle, but the current results suggest today’s systems still fall well short of “general intelligence.”
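To make the idea of an action-efficiency score concrete, here is a minimal sketch. It is not the official RHAE formula, which the article does not give. It assumes, purely for illustration, that each environment is scored as the ratio of human actions to agent actions, capped at 1, with unsolved environments scoring 0. The function names and the averaging step are hypothetical.

```python
def level_efficiency(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical per-environment score in [0, 1] (assumed, not the official RHAE)."""
    # An unsolved environment earns nothing, however many actions were spent.
    if not solved or agent_actions <= 0:
        return 0.0
    # Using more actions than the human baseline lowers the score proportionally;
    # using fewer is capped at 1.0.
    return min(1.0, human_actions / agent_actions)


def overall_score(results: list[tuple[int, int, bool]]) -> float:
    """Mean efficiency over (human_actions, agent_actions, solved) tuples, as a percentage."""
    if not results:
        return 0.0
    total = sum(level_efficiency(h, a, s) for h, a, s in results)
    return 100.0 * total / len(results)
```

Under these assumptions, an agent that needs 80 actions where a human needs 20 earns 0.25 on that environment, and an agent that never solves it earns 0. This is how a metric of this kind can hold scores near zero even when an agent makes some progress.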

Neutral

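This is news about an assessment of the ceiling of AI capability, not a direct change to crypto protocols, regulation, or on-chain liquidity, so its verifiable impact on crypto assets is limited and the overall read is neutral. Short term: market sentiment may be affected by a cooling of the AGI narrative. The article describes ARC-AGI-3 returning scores below 1% after figures such as Jensen Huang signaled that AGI is close or already achieved, which makes this a typical narrative check. Similar episodes have been common in past phases where an overstated technical milestone was followed by a stricter benchmark that contradicted it; they tend to cool risk appetite slightly but usually do not immediately change on-chain fundamentals. Medium to long term: ARC-AGI-3's emphasis on generalizable reasoning and learning from scratch may shape investors' expectations of "truly scalable capability" and so shift capital flows within the tech sector and AI-related narratives. However, the crypto market's more direct drivers remain interest rates, liquidity, regulation, and on-chain demand. This event is more a repricing signal for the AI industry's roadmap and valuation logic, so it gives little directional push to BTC or major altcoins, which are likely to stay range-bound or see only sentiment-driven noise. Conclusion: absent a direct crypto catalyst or a negative on-chain shock, the market effect is more likely a slight shift in sentiment than a change in trend.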