
Each Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.
That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.
In a new study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovers surprising insights, such as that so-called reasoning models (OpenAI's o1, among others) sometimes "give up" and provide answers they know aren't correct.
"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," Arjun Guha, a computer science professor at Northeastern and one of the co-authors on the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased such that models can't draw on "rote memory" to solve them, explained Guha.
"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them can "cheat" in a sense, although Guha says he hasn't seen evidence of this.
"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim "I give up," followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck "thinking" forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren't, capable of."