Harvard Mathematicians Judge AI Performance on Unpublished Research Math

Published: 2026-06-14 23:09:44

Harvard’s “First Proof, Second Batch” evaluates AI performance on research-level mathematics under strict conditions. Thirty experts blind-graded solutions submitted by four leading AI systems (models from OpenAI and Google) to 10 original, unpublished problems drawn from active research; none were available in textbooks or on arXiv. Key result: the expert panel awarded passing grades on 7 of the 10 problems across the four systems tested. Earlier trial runs reportedly solved only 2 of the 10, which suggests the improvement came from multiple attempts or different prompting strategies; the grading remained blind to the submissions’ provenance. The organizers emphasize why unpublished problems matter: standard benchmarks often include problems with known solution paths, whereas in research math it may be unknown whether a solution exists at all. This second batch follows an initial evaluation conducted in February 2026, and together they form an ongoing framework for tracking whether AI is truly advancing at the frontier of mathematical research or merely plateauing after early benchmark gains. Overall, the exercise gives a nuanced picture: AI can solve meaningful research-level problems, but its reliability is still far from uniform across problems.

Neutral

This news is not directly about crypto protocols, tokens, or regulation; it is a technology evaluation of AI performance on unpublished math problems. For crypto traders, the near-term market impact is likely limited because there is no immediate link to BTC/ETH liquidity, stablecoins, exchange flows, or specific Web3 catalysts. It can be indirectly relevant through the broader “AI narrative” that sometimes boosts AI-adjacent assets. Still, the study’s result is nuanced (passing on 7/10, versus 2/10 in early trials), which makes it less likely to trigger a single, strong speculative impulse than a clear breakthrough announcement would. Short-term: likely neutral, with no direct trading trigger. Long-term: neutral to slightly constructive for sentiment around AI capabilities, but any effect would be gradual and sector-wide rather than coin-specific. In similar past cases, improved AI benchmark results often created short-lived hype, but sustained price impact typically required a follow-on connection to deployable products or clear token ecosystem demand; neither is present here.