
Epoch AI, a California-based research institute, launched a new artificial intelligence (AI) benchmark last week. Dubbed FrontierMath, the benchmark tests large language models (LLMs) on their reasoning and mathematical problem-solving capabilities. The company argues that existing math benchmarks are not very useful due to factors such as data contamination and the very high scores AI models achieve on them. Epoch AI claims that even the leading LLMs have scored less than two percent on the new benchmark.
Epoch AI Launches FrontierMath Benchmark
In a post on X (formerly known as Twitter), the company explained that it collaborated with more than 60 mathematicians to create hundreds of original and unpublished math problems. Epoch AI claims these questions would take even mathematicians hours to solve. The company cited the limitations of existing benchmarks such as GSM8K and MATH, on which AI models often score highly, as the reason for creating the new benchmark.
The company claimed that the high scores achieved by LLMs are largely due to data contamination, meaning the questions had somehow already been fed into the AI models during training, allowing them to solve the questions easily.
FrontierMath addresses this problem by including new problems that are unique and have not been published anywhere, mitigating the risks associated with data contamination. Further, the benchmark covers a wide range of questions, including computationally intensive problems in number theory, real analysis, and algebraic geometry, as well as topics such as Zermelo–Fraenkel set theory. The company says all the questions are “guessproof”, meaning they cannot be solved accidentally without strong reasoning.
Epoch AI highlighted that to measure AI's aptitude, benchmarks should be built around creative problem-solving in which the AI has to sustain reasoning over multiple steps. Notably, many industry veterans believe that current benchmarks are not sufficient to correctly measure how advanced an AI model is.
Responding to the new benchmark in a post, Noam Brown, the OpenAI researcher behind the company's o1 model, welcomed it, saying, “I love seeing a new eval with such low pass rates for frontier models.”