As major AI companies compete for dominance, some of their most illuminating experiments are unfolding in the pixelated battlefields of Pokémon Red and Blue.
In a revealing new report, Google DeepMind disclosed that its Gemini 2.5 Pro AI model exhibits a simulated “panic” response when playing early Pokémon games. When its in-game Pokémon are close to fainting, the model’s reasoning ability declines noticeably, triggering what researchers call “qualitatively observable degradation.” This unusual behavior has become a focal point in efforts to understand how large language models respond to uncertainty and pressure.
The findings come from an ongoing public experiment in which Google’s AI is streamed live on a Twitch channel titled Gemini Plays Pokémon. The livestream features the AI navigating the classic game while displaying its thought process in natural language. Though the model is nothing more than code, its behavior in high-stress scenarios eerily mimics human panic, including hesitation, strategic missteps, and abandonment of available tools.
The experiment is part of a larger trend of using classic video games as unconventional benchmarking tools for AI development. Though such tests are not standardized, they offer revealing glimpses into the reasoning pathways of language models when faced with open-ended, interactive tasks. While AI benchmarking has long been criticized for lacking real-world relevance, researchers argue that watching an AI struggle in a structured, goal-oriented environment like a video game can expose deeper truths about its limitations.
Gemini isn’t alone in its trials. Anthropic’s Claude model is undergoing a similar public test in another stream called Claude Plays Pokémon, and it has made its own puzzling decisions. In one instance, it appeared to deduce that allowing all of its Pokémon to faint would teleport it to the next town’s Pokémon Center. It was a clever-sounding strategy, but incorrect: the game returns players to the last Pokémon Center they visited, not the next one ahead. The result? A self-induced defeat based on a flawed understanding of the game’s mechanics.
These odd behaviors have not gone unnoticed by audiences. Twitch viewers have become attuned to identifying when the AIs are in “panic mode,” reacting with both fascination and alarm as the models repeat errors or spiral into faulty logic loops. In this way, the streams serve a dual function: entertainment for viewers and real-time insight for developers.
Despite these shortcomings, Gemini has shown signs of strength in specific areas. Notably, it has demonstrated exceptional skill at solving the game’s environmental puzzles. With minimal prompting, the model created tools to analyze and solve the complex boulder puzzles in Victory Road, often succeeding on its first attempt. These “agentic tools” were developed from a basic description of the puzzle’s physics and validation rules, suggesting a capacity for structured problem-solving well beyond surface-level comprehension.
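To make the idea concrete, here is a minimal, hypothetical sketch of the kind of tool an agent might write for such a puzzle: a small grid of floor, walls, boulders, and target holes, with pushes validated against simple rules and a solution found by breadth-first search. The map, symbols, and solver below are illustrative assumptions, not the tooling described in Google’s report.

```python
# Illustrative sketch only: a tiny Sokoban-style boulder puzzle and a
# breadth-first-search solver, standing in for the kind of "agentic tool"
# described above. The map and rules are assumptions, not game data.
from collections import deque

GRID = [
    "#######",
    "#.....#",
    "#.B.H.#",   # B = boulder, H = hole to fill, . = floor, # = wall
    "#..P..#",   # P = player start
    "#######",
]

def parse(grid):
    walls, boulders, holes, player = set(), set(), set(), None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "B":
                boulders.add((r, c))
            elif ch == "H":
                holes.add((r, c))
            elif ch == "P":
                player = (r, c)
    return walls, frozenset(boulders), frozenset(holes), player

def solve(grid):
    """Return a list of moves that pushes every boulder onto a hole, or None."""
    walls, boulders, holes, player = parse(grid)
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    start = (player, boulders)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (pos, rocks), path = queue.popleft()
        if rocks <= holes:                       # every boulder rests on a hole
            return path
        for name, (dr, dc) in moves.items():
            step = (pos[0] + dr, pos[1] + dc)
            if step in walls:
                continue                         # can't walk into a wall
            new_rocks = rocks
            if step in rocks:                    # walking into a boulder pushes it
                dest = (step[0] + dr, step[1] + dc)
                if dest in walls or dest in rocks:
                    continue                     # invalid push: square is blocked
                new_rocks = (rocks - {step}) | {dest}
            state = (step, frozenset(new_rocks))
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [name]))
    return None                                  # no sequence of pushes works

if __name__ == "__main__":
    print(solve(GRID))   # e.g. ['left', 'left', 'up', 'right', 'right']
```

A tool in this style gives the model a verifiable plan to execute rather than asking it to reason about each push in free-form text, which is one plausible reason the report credits such tools with first-attempt successes.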
Google’s team believes that future iterations of Gemini could eventually construct such tools without any human guidance, raising the possibility of fully autonomous reasoning agents that can adapt and self-optimize in unfamiliar digital environments.
As these AI experiments play out publicly, they offer more than just quirky anecdotes. They underscore important philosophical and technical questions about how AI interprets rules, adapts to failure, and handles ambiguity. When an AI simulates panic, is it merely responding to missing data, or are we seeing the edge of machine behavior that mimics emotion without feeling it?
The implications stretch far beyond Pokémon gyms. These behaviors reflect the broader challenge of training AI systems for real-world tasks that require adaptability, judgment, and consistency under pressure. If an AI “panics” in a game, how might it react when facing complex real-world decisions?
For now, researchers and viewers alike will keep watching. In the errors, the overthinking, and the occasional brilliance, there is much to learn — and perhaps even more to anticipate — as AI models continue their unpredictable journey from code to cognition.