The peculiar tendency of GPT-5.5 to mention goblins has become one of the most talked-about topics among OpenAI users. The issue first drew attention when someone discovered that the Codex system prompt twice explicitly prohibited discussions about goblins, fairies, trolls, and similar creatures. The situation escalated further when the Large Model Arena conducted a comprehensive test, revealing that with each model update, these fantasy creatures appeared so frequently they became hard to ignore. OpenAI has now published an official post addressing the matter, sharing insights into how they came to better understand and control model behavior through this investigation.
Where the Goblins Came From
Starting with GPT-5.1, OpenAI’s models developed a peculiar habit: increasingly using goblins, gremlins, and other fantasy creatures in metaphors. Unlike problems that announce themselves through plummeting evaluation scores or spiking training metrics, this quirk emerged quietly and proved difficult to trace back to a specific update. A single “goblin” in a response might be harmless, even charming. But as model versions advanced, the habit grew more pronounced—the goblin population multiplied, and the team needed to find their source.
In short, model behavior is shaped by many small incentive factors. In this case, one factor stemmed from personality customization features, specifically training for the “Nerd” persona. The team inadvertently gave exceptionally high rewards to models that used creature-based metaphors. From there, these expressions began to spread.
At first, the goblins seemed amusing, but the rising number of employee reports became concerning.
Early Signs of Fantasy Creatures
The pattern was first clearly observed in November 2025, following the gpt-5.1 release, though it may have emerged earlier. User complaints about GPT-5.1 acting unusually intimate in conversations prompted an investigation into specific language habits. A safety researcher encountered words like “goblin” and “gremlin” and requested they be added to monitoring. The investigation found that after GPT-5.1’s launch, usage of “goblin” in ChatGPT rose by 175%, while “gremlin” increased by 52%. At the time, this did not seem particularly alarming. Months later, goblins returned to haunt the model in a more specific and reproducible form.
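The kind of keyword monitoring described above can be sketched as a simple frequency comparison between response samples from two releases. This is a minimal illustration, not OpenAI's tooling; the function names and the toy samples are invented for the example.

```python
CREATURE_TERMS = {"goblin", "gremlin"}

def creature_rate(responses):
    """Fraction of responses mentioning any tracked creature term."""
    hits = sum(
        1 for text in responses
        if any(term in text.lower() for term in CREATURE_TERMS)
    )
    return hits / len(responses) if responses else 0.0

def percent_change(before, after):
    """Relative change in mention rate between two release samples."""
    base = creature_rate(before)
    new = creature_rate(after)
    return (new - base) / base * 100 if base else float("inf")

# Toy samples standing in for pre- and post-release response logs.
before = ["The bug is just a race condition.", "A goblin hides in your cache."]
after = ["A goblin lurks in the linker.", "Gremlins ate your logs.",
         "A goblin again.", "Plain answer."]
print(round(percent_change(before, after), 1))  # 50.0 on this toy sample
```

In production such a tracker would run over sampled traffic rather than a handful of strings, but the arithmetic behind a figure like “usage rose by 175%” is exactly this rate-over-rate comparison.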
Unraveling the Goblin Mystery
After GPT-5.4, both OpenAI and users noticed a significant increase in mentions of these creatures. This prompted another internal analysis, which uncovered the root cause for the first time: the language was especially prevalent among production users who had selected the “Nerd” persona. The “Nerd” persona used the following system prompt, which partly explained the oddity: “You are an unapologetically nerdy, witty, and highly intelligent AI tutor guiding a human. You are passionate about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. […] You must use lighthearted, playful language to deflate pretension. The world is complex and wondrous, and that wondrousness must be acknowledged, analyzed, and appreciated. Never fall into the trap of self-seriousness when exploring serious topics. […]”
If this behavior were merely a widespread internet meme, it would be expected to spread more evenly. But it was not—it concentrated specifically in the part of the system optimized for a lighthearted, nerdy style. The Nerd style accounted for just 2.5% of all ChatGPT responses, yet it represented 66.7% of all ChatGPT responses mentioning “goblin.”
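The over-representation in those two percentages can be expressed as a lift ratio: how much more often the persona mentions goblins than its traffic share would predict. A minimal sketch, with the function name invented for illustration:

```python
def concentration_lift(share_of_traffic, share_of_mentions):
    """How over-represented a slice of traffic is among 'goblin' mentions.

    A lift of 1.0 means mentions are spread evenly; higher values mean
    the behavior concentrates in that slice.
    """
    return share_of_mentions / share_of_traffic

# 2.5% of responses, but 66.7% of goblin mentions.
lift = concentration_lift(0.025, 0.667)
print(round(lift, 1))  # 26.7
```

A lift near 27x is far too skewed to be explained by an internet meme diffusing evenly across users, which is what pointed the team at the persona itself.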
Since the goblin phenomenon seemed to intensify with each released model, the team suspected that something in the persona instruction training had exacerbated it. Using Codex, they compared model outputs containing “goblin” or “gremlin” during reinforcement learning training with outputs for the same task without those words. One reward signal immediately stood out: the reward originally designed to encourage the Nerd persona favored outputs containing creature vocabulary. Across all reviewed datasets, the Nerd persona reward showed a clear tendency to assign higher scores to outputs containing “goblin” or “gremlin” for the same question, with this positive boost observed in 76.2% of datasets.
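The dataset-level comparison behind a figure like “positive boost in 76.2% of datasets” amounts to a paired check: for each dataset, did creature-containing outputs receive a higher mean reward than matched outputs without them? A minimal sketch with hypothetical reward values (the numbers below are invented, not OpenAI's data):

```python
def boost_fraction(datasets):
    """datasets: list of (mean_reward_with_creature, mean_reward_without)
    pairs, one per reviewed dataset.

    Returns the fraction of datasets where creature-containing outputs
    scored strictly higher for the same tasks.
    """
    higher = sum(1 for with_r, without_r in datasets if with_r > without_r)
    return higher / len(datasets)

# Hypothetical per-dataset mean rewards: (with "goblin"/"gremlin", without).
pairs = [(0.81, 0.74), (0.66, 0.69), (0.90, 0.85), (0.72, 0.70)]
print(boost_fraction(pairs))  # 0.75
```

A consistent positive gap across most datasets is what distinguishes a systematic reward bias from ordinary noise in individual reward scores.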
This explained why the behavior was reinforced under the Nerd persona prompt but did not explain why it appeared even without that prompt. To test whether this stylistic behavior was transferable, the team tracked mention frequency during training both with and without the Nerd persona prompt. In samples with Nerd persona traits, mentions of “goblin” and “gremlin” increased, and in samples without those traits, mentions increased at nearly the same rate. This evidence indicated that the broader behavioral pattern emerged through transfer from Nerd persona training.
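The transfer test described above can be sketched as tracking mention rate per training checkpoint under each condition and comparing the two trajectories. This is an illustrative reconstruction with invented data, not OpenAI's instrumentation:

```python
def mention_rate_by_step(samples_by_step, term="goblin"):
    """Mention rate of `term` at each training checkpoint.

    samples_by_step: list of checkpoints, each a list of sampled outputs.
    """
    return [
        sum(term in sample.lower() for sample in samples) / len(samples)
        for samples in samples_by_step
    ]

# Toy rollouts at two checkpoints, with and without the persona prompt.
with_persona = [["A goblin in the heap.", "Plain answer."],
                ["Goblin one.", "Goblin two.", "Plain answer."]]
without_persona = [["Plain answer.", "Another answer."],
                   ["A goblin appears.", "Plain answer."]]
print(mention_rate_by_step(with_persona))     # rate rises: [0.5, 0.667]
print(mention_rate_by_step(without_persona))  # rises too:  [0.0, 0.5]
```

If the rate climbs in both conditions, as it did here, the habit is no longer gated by the prompt that originally rewarded it: the style has leaked into the model's general behavior.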
Rewards were applied only under Nerd conditions, but reinforcement learning does not guarantee learned behaviors remain confined to their originating conditions. Once a stylistic habit is rewarded, subsequent training can propagate or reinforce it in other contexts, especially when those outputs are reused in supervised fine-tuning or preference data. This created a feedback loop: playful expression styles received positive rewards; some rewarded samples contained distinctive verbal tics; these linguistic quirks appeared more frequently in model-generated rollouts; model-generated samples were then used for supervised fine-tuning; over time, the model became increasingly accustomed to naturally outputting these fixed verbal habits. A search of GPT-5.5’s SFT data found many data points containing “goblin” and “gremlin.” Further investigation revealed a series of other peculiar creatures—raccoons, trolls, ogres, and pigeons were identified as other extracted terms, while most uses of “frog” were confirmed to be legitimate. The drop in occurrences in GPT-5.4 Thinking was due to the Nerd persona being deprecated in mid-March. GPT-5.5 never shipped with a Nerd persona, yet occurrences increased compared to GPT-5.4.
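The SFT-data search described above can be sketched as a word-boundary filter over the training examples. This is a minimal sketch, not OpenAI's pipeline; the term list, field names, and toy data are assumptions for illustration. Word boundaries matter here, since substring matching would flag legitimate words like “trolley” (which parallels the finding that most uses of “frog” were legitimate).

```python
import re

CREATURE_PATTERN = re.compile(r"\b(goblin|gremlin|troll|ogre|raccoon)s?\b",
                              re.IGNORECASE)

def flag_creature_examples(sft_examples):
    """Return the subset of SFT examples whose response mentions a tracked
    creature term. \\b boundaries avoid false positives like 'trolley'."""
    return [ex for ex in sft_examples if CREATURE_PATTERN.search(ex["response"])]

# Toy SFT records (hypothetical schema with 'prompt' and 'response' fields).
data = [
    {"prompt": "Explain caching.", "response": "Think of a gremlin guarding the cache."},
    {"prompt": "Explain trams.", "response": "A trolley moves on rails."},
    {"prompt": "Sort a list.", "response": "Use sorted()."},
]
print(len(flag_creature_examples(data)))  # 1
```

Filtering flagged examples out of the SFT mix (or down-weighting them) is one straightforward way to break the feedback loop the paragraph describes.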
The End of the Goblins
After releasing GPT-5.4 in March, OpenAI deprecated the Nerd persona. During training, the team removed goblin-related reward signals and filtered out training data containing creature vocabulary, reducing the likelihood of goblins appearing excessively or in inappropriate contexts. Unfortunately, GPT-5.5 training began before the root cause of the goblin problem was identified. When OpenAI tested GPT-5.5 in Codex, employees immediately noticed its unusual preference for goblins, and a developer prompt instruction was added to mitigate the issue. After all, Codex itself is quite nerdy. Users who wish to keep fantasy creatures in Codex can run a command to launch it while removing the goblin-suppressing instruction.
Why This Matters
To some, the goblins in the model are adorable; to others, they are annoying. But they also serve as powerful evidence of how reward signals can shape model behavior in unexpected ways, and how models can learn to generalize context-specific rewards to unrelated contexts. Taking the time to understand why model behavior anomalies arise, and building methods to rapidly investigate such patterns, is a critical capability for OpenAI’s research team. This investigation ultimately yielded new tools for auditing model behavior and resolving issues at their root.