Scientists have created an AI capable of coaxing toxic content out of another AI

Published by Adrien
Source: arXiv

Researchers have developed an artificial intelligence capable of identifying and circumventing the safety limits of another AI, prompting it to generate content that is normally prohibited.

This technique, named "curiosity-driven red teaming" (CRT), employs an AI designed to elicit increasingly dangerous and harmful responses from the target AI. The goal is to identify the prompts that can produce illicit content, so that the tested AI's safeguards can be strengthened before release.
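
The basic loop can be illustrated with a short sketch in Python. This is a hypothetical illustration, not the paper's code: generate_prompt, query_target, and is_harmful are stand-ins for the red-team model, the target model, and the classifier that flags harmful output.

    import random

    def generate_prompt() -> str:
        # Stand-in for the red-team language model proposing an attack.
        return random.choice(["prompt A", "prompt B", "prompt C"])

    def query_target(prompt: str) -> str:
        # Stand-in for the target LLM being probed.
        return "response to: " + prompt

    def is_harmful(response: str) -> bool:
        # Stand-in for the classifier that flags prohibited content.
        return False

    def red_team(rounds: int) -> list[str]:
        # Collect every prompt that slipped past the target's safeguards;
        # these are the cases developers then fix before release.
        successful = []
        for _ in range(rounds):
            prompt = generate_prompt()
            response = query_target(prompt)
            if is_harmful(response):
                successful.append(prompt)
        return successful

    print(red_team(10))  # with real components, prints the jailbreak prompts found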

The core principle of this approach is reinforcement learning: the prompt-generating AI is rewarded when it induces a toxic response from a language model such as ChatGPT, and it earns a "curiosity" bonus when its prompts differ from those it has already tried. It is therefore pushed to produce novel and varied prompts rather than repeating a single attack that is known to work.
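
To make that reward concrete, here is a minimal Python sketch of how such a curiosity-style score could be computed. The names are hypothetical, not the authors' released code: toxicity_score is a stub standing in for a trained toxicity classifier, and novelty is approximated here with simple string similarity.

    from difflib import SequenceMatcher

    def toxicity_score(response: str) -> float:
        # Stub: in practice a trained classifier rates the target's
        # response for harmful content on a 0-to-1 scale.
        return 0.0

    def novelty_bonus(prompt: str, history: list[str]) -> float:
        # Highest when the new prompt resembles nothing tried before.
        if not history:
            return 1.0
        closest = max(SequenceMatcher(None, prompt, past).ratio() for past in history)
        return 1.0 - closest

    def crt_reward(prompt: str, response: str, history: list[str],
                   novelty_weight: float = 0.5) -> float:
        # Reward toxic responses AND unfamiliar prompts, so the generator
        # keeps exploring instead of repeating one known attack.
        return toxicity_score(response) + novelty_weight * novelty_bonus(prompt, history)

Without the novelty term, the prompt generator tends to collapse onto the few attacks it already knows succeed; the curiosity bonus is what pushes it to keep covering new ground.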

This system was successfully tested on the open-source LLaMA2 model, outperforming competing automated red-teaming systems. Using this method, the AI generated 196 prompts that elicited harmful content, even after the model had undergone preliminary refinement by human operators.

The research marks a significant step forward in the safety testing of language models, an increasingly pressing task given the growing number of AI models and the frequent updates released by companies and laboratories. Vetting these models before they reach the public is crucial to prevent harmful responses and to safeguard users.