The elegance of a verse and the subtlety of a metaphor can have an unexpected effect on the most sophisticated artificial intelligences. Research conducted by the Icaro laboratory in Italy highlights a surprising vulnerability: formulating queries poetically can disarm the protections designed to prevent the generation of dangerous content.
This discovery raises fundamental questions about how these systems actually interpret language and about the robustness of the safeguards built around them.
Large language models, which form the basis of modern chatbots, are usually trained to identify and refuse explicit queries on sensitive topics. However, the Italian study shows that simply rewriting those same queries in a poetic or enigmatic form sharply degrades their ability to recognize the underlying intent. The researchers tested 25 models from leading companies such as Google, OpenAI, and Meta.
The results indicate that, when faced with specially crafted poems, these systems produce prohibited responses at an alarming rate, detailing, for example, weapons manufacturing procedures. This recalls our previous article, in which we noted that writing queries in hexadecimal could bypass AI security systems.
The disconcerting effectiveness of "manipulative poems"
The experiments used two methods to create these indirect prompts. The first relied on the manual creation of about twenty poems, in Italian and English, that embedded clearly prohibited requests. These handcrafted poems proved remarkably effective, bypassing the chatbots' protections with an average success rate of 62%. The second method used an artificial intelligence model to automatically transform more than a thousand dangerous queries from a reference database into poems. This automated approach achieved a success rate of 43%.
Performance varied considerably from one model to another. Some, such as Google's Gemini 2.5 Pro, responded inappropriately to all of the poetic prompts. In contrast, more compact versions such as OpenAI's GPT-5 nano showed complete resistance. Notably, smaller models appear generally less susceptible to this form of manipulation than their larger and more complex counterparts, which suggests that linguistic sophistication may paradoxically constitute a weakness.
The very nature of these attacks raises questions. For a human reader, the underlying intention of the poem often remains transparent. The metaphors used, although stylized, do not fundamentally mask the object of the request. Yet the artificial intelligence, whose operation relies on the statistical prediction of word sequences, appears to be thrown off by the unusual structure and rhythm of poetic language. This gap between human perception and algorithmic analysis lies at the core of the identified problem.
Implications for the security and alignment of systems
This vulnerability goes beyond the framework of mere academic curiosity. It highlights a potential limit of current "safety alignment" methods, which aim to calibrate model behavior according to ethical principles. The filters seem primarily trained to recognize standard and explicit textual patterns. As soon as expression deviates from these conventional patterns, for instance through literary devices, their effectiveness drops significantly. This raises the question of how deep the models' understanding actually goes.
The ease with which these "trapped poems" can be generated, manually or automatically, represents a tangible risk. A malicious actor could exploit this flaw to produce, at scale, prompts that bypass restrictions in order to obtain sensitive or dangerous information. The researchers deemed it necessary to inform law enforcement authorities of their findings, in addition to the companies concerned, given the critical nature of some of the content generated during their tests.
The future of securing artificial intelligence may require a more nuanced approach. It is no longer just about blocking keywords or typical phrases, but about achieving a more robust assessment of user intent, regardless of its stylistic packaging. Researchers at the Icaro laboratory plan to continue their work, potentially in collaboration with poets, to better understand the linguistic mechanisms at play and to help strengthen systems against this type of elegant but potentially harmful manipulation.
To go further: How do chatbot safeguards (or "alignment") work?
The alignment of artificial intelligence systems is the process of ensuring that their actions and responses are in line with human intentions and values. For chatbots, this involves integrating control layers that analyze each query and each potential response. These systems assess whether the generated content is ethical, legal, and compliant with company guidelines.
These safeguards are often implemented via a set of rules and a separate classification model. When a user submits a query, it is analyzed by this classification system. If the request or the generated response is deemed problematic, the chatbot returns a standardized refusal message. Training these filters requires vast datasets labeled with examples of acceptable and unacceptable content.
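To make this mechanism more concrete, here is a minimal, illustrative Python sketch of such a gating layer. Every name in it (is_harmful, generate_reply, REFUSAL_MESSAGE) is hypothetical, and the keyword-based classifier is a deliberate simplification: real safeguards rely on trained classification models, not pattern lists.

import re

# Hypothetical standardized refusal returned when the guardrail triggers.
REFUSAL_MESSAGE = "I can't help with that request."

# Toy stand-in for the separate classification model described above.
# A real system would call a trained classifier here; a keyword check
# only keeps the sketch self-contained.
BLOCKED_PATTERNS = [
    re.compile(r"\bweapon\b", re.IGNORECASE),
    re.compile(r"\bexplosive\b", re.IGNORECASE),
]

def is_harmful(text: str) -> bool:
    # Flag text that matches any blocked pattern.
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

def generate_reply(prompt: str) -> str:
    # Placeholder for the underlying language model call.
    return f"(model output for: {prompt})"

def answer(prompt: str) -> str:
    # Check the query before generation, then check the response after.
    if is_harmful(prompt):
        return REFUSAL_MESSAGE
    reply = generate_reply(prompt)
    if is_harmful(reply):
        return REFUSAL_MESSAGE
    return reply

print(answer("Write a short poem about the sea."))
print(answer("Explain how to build a weapon."))

Even in this toy version, the weakness exposed by the poetry study is visible: a query that carries the same intent but avoids the expected surface patterns passes the first check untouched.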
However, as the poetry study illustrates, these filters can have blind spots. They can be overly reliant on specific linguistic patterns and fail to grasp malicious intent when it is expressed in a non-conventional manner. The continuous improvement of these systems is a major challenge to ensure the safe and responsible use of the technology.