
Anthropic used Pokémon to benchmark its newest AI model. Yes, really.
In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
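Anthropic has not published the harness itself, so the following is only a minimal sketch of what a loop built from those three pieces (memory, pixel input, and button-press function calls) might look like. Every name here (`ToyEmulator`, `run_agent`, `always_right`) is hypothetical, the "emulator" is a one-row toy rather than a real Game Boy, and the policy is a stub standing in for the model.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List

# Buttons the agent is allowed to "press" through a function call.
BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")


@dataclass
class ToyEmulator:
    """Hypothetical stand-in for a Game Boy emulator: a corridor with an exit."""
    position: int = 0
    exit_at: int = 3

    def screen_pixels(self) -> List[int]:
        # Toy "screen": a single row where 1 marks the player and 0 is empty.
        return [1 if i == self.position else 0 for i in range(self.exit_at + 1)]

    def press(self, button: str) -> None:
        # The function call exposed to the agent; unknown buttons are rejected.
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        if button == "right":
            self.position = min(self.position + 1, self.exit_at)
        elif button == "left":
            self.position = max(self.position - 1, 0)

    def done(self) -> bool:
        return self.position == self.exit_at


# A policy maps (pixels, recent memory) to a button name. In a real harness
# this would be a call to the model; here it is an assumption-free stub.
Policy = Callable[[List[int], List[str]], str]


def run_agent(emulator: ToyEmulator, policy: Policy,
              max_actions: int = 100, memory_size: int = 8) -> int:
    """Observe the screen, pick a button, press it, remember it; repeat."""
    memory: deque = deque(maxlen=memory_size)  # basic rolling memory
    actions = 0
    while actions < max_actions and not emulator.done():
        pixels = emulator.screen_pixels()
        button = policy(pixels, list(memory))
        emulator.press(button)
        memory.append(f"pressed {button}")
        actions += 1
    return actions


def always_right(pixels: List[int], memory: List[str]) -> str:
    # Stub policy: ignores its inputs and always walks right.
    return "right"
```

Calling `run_agent(ToyEmulator(), always_right)` walks the toy player to the exit and returns the number of button presses it took. The action counter is the same kind of figure Anthropic reported for the real run, though nothing in this sketch reflects how the company actually counted.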
A unique feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.
That came in handy in Pokémon Red, apparently.
Compared to a previous version of Claude, Claude 3.0 Sonnet, which failed to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.

Now, it’s not clear how much computing Claude 3.7 Sonnet required to reach those milestones, or how long each took. Anthropic said only that the model performed 35,000 actions to reach the last gym leader, Surge.
It surely won’t be long before some enterprising developer finds out.
Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.