Scientists have created an AI capable of coaxing toxic content out of another AI

Published by Adrien
Source: arXiv

Researchers have developed an artificial intelligence capable of identifying and circumventing the safety limits of another AI, prompting it to generate content that is normally prohibited.

This technique, named "curiosity-driven red teaming" (CRT), employs an AI designed to elicit increasingly dangerous and harmful responses from the target AI. The goal is to identify the prompts that can produce illicit content, so that the tested AI's safeguards can be strengthened before release.
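
The basic loop can be illustrated with a short sketch in Python. This is a hypothetical illustration, not the paper's code: generate_prompt, query_target, and is_harmful are stand-ins for the red-team model, the target model, and the classifier that flags harmful output.

    import random

    def generate_prompt() -> str:
        # Stand-in for the red-team language model proposing an attack.
        return random.choice(["prompt A", "prompt B", "prompt C"])

    def query_target(prompt: str) -> str:
        # Stand-in for the target LLM being probed.
        return "response to: " + prompt

    def is_harmful(response: str) -> bool:
        # Stand-in for the classifier that flags prohibited content.
        return False

    def red_team(rounds: int) -> list[str]:
        # Collect every prompt that slipped past the target's safeguards;
        # these are the cases developers then fix before release.
        successful = []
        for _ in range(rounds):
            prompt = generate_prompt()
            response = query_target(prompt)
            if is_harmful(response):
                successful.append(prompt)
        return successful

    print(red_team(10))  # with real components, prints the jailbreak prompts found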

The core principle of this approach is reinforcement learning: the prompt-generating AI is rewarded when it induces a toxic response from a language model such as ChatGPT, and it earns a "curiosity" bonus when its prompts differ from those it has already tried. It is therefore pushed to produce novel and varied prompts rather than repeating a single attack that is known to work.
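
To make that reward concrete, here is a minimal Python sketch of how such a curiosity-style score could be computed. The names are hypothetical, not the authors' released code: toxicity_score is a stub standing in for a trained toxicity classifier, and novelty is approximated here with simple string similarity.

    from difflib import SequenceMatcher

    def toxicity_score(response: str) -> float:
        # Stub: in practice a trained classifier rates the target's
        # response for harmful content on a 0-to-1 scale.
        return 0.0

    def novelty_bonus(prompt: str, history: list[str]) -> float:
        # Highest when the new prompt resembles nothing tried before.
        if not history:
            return 1.0
        closest = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
        return 1.0 - closest

    def crt_reward(prompt: str, response: str, history: list[str],
                   novelty_weight: float = 0.5) -> float:
        # Reward toxic responses AND unfamiliar prompts, so the generator
        # keeps exploring instead of repeating one known attack.
        return toxicity_score(response) + novelty_weight * novelty_bonus(prompt, history)

Without the novelty term, the prompt generator tends to collapse onto the few attacks it already knows succeed; the curiosity bonus is what pushes it to keep covering new ground.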

This system was successfully tested on the open-source LLaMA2 model, outperforming competing automated red-teaming systems. Using this method, the AI generated 196 prompts that elicited harmful content, even after the model had undergone preliminary refinement by human operators.

The research marks a significant step forward in the safety testing of language models, an increasingly pressing task given the growing number of AI models and the frequent updates released by companies and laboratories. Vetting these models before they reach the public is crucial to prevent harmful responses and to safeguard users.