Artificial intelligence (AI) systems have reached a critical threshold: they have exhausted almost all of the human knowledge available for training. Elon Musk, among others, is sounding the alarm on this technological dead end.
This situation is pushing researchers and companies to explore alternatives, particularly synthetic data generated by AI itself. While this solution seems promising, it raises major questions about the quality and reliability of future models.
The end of human data: a turning point for AI
Modern AI models, such as ChatGPT or Bard, require astronomical amounts of data for their training. This data comes from books, scientific articles, online conversations, and other sources. However, the exponential growth in data needs has led to a shortage of quality resources.
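Elon Musk recently stated that all human knowledge had been exploited to train AI, a milestone he said was reached last year. With no fresh human data left, models are increasingly fed AI-generated content, which raises the risk of "model collapse": a progressive degradation of models trained on the output of other models. This limitation forces researchers to rethink the learning methods of artificial intelligence systems.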
Synthetic data: a risky solution
Synthetic data, generated by AI, appears to be a viable alternative. It reduces costs and avoids privacy-related issues. For example, the startup Writer cut the training cost of its Palmyra X 004 model by a factor of six using this method.
However, this approach carries risks. AI trained on synthetic data can produce erroneous results, a phenomenon called "hallucination." Moreover, this data can amplify biases present in the initial models, compromising their reliability.
Consequences for the future of AI
The increasing use of synthetic data could lead to a degradation in the quality of AI models. Researchers at Stanford University have shown that models trained on more than 50% artificial data make more factual errors.
Furthermore, this reliance on synthetic data could limit the creativity of AI. Models risk going in circles, reproducing the same patterns without innovation. This situation could force companies to rethink their development strategies.
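The idea of models "going in circles" can be illustrated with a toy simulation. The sketch below is not how any real AI model is trained: it is a deliberately simplified stand-in in which the "model" is just a bell curve (a mean and a standard deviation), and each new generation is fitted only to a small sample produced by the previous one. The function name simulate_generations and its parameters are hypothetical, chosen for this illustration.

```python
import random
import statistics


def simulate_generations(n_samples=10, n_generations=200, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Toy illustration only (an assumption of this sketch, not a real
    training pipeline). Returns the fitted standard deviation after each
    generation, starting with the original "human" distribution (std = 1.0).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for real, human data
    history = [std]
    for _ in range(n_generations):
        # "Synthetic data": samples produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # "Training": the next model sees only that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_generations()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.3e}")
```

In this toy setting, the spread of the data tends to shrink from one generation to the next, because each small sample slightly underestimates the diversity of its source and the errors compound. Real language models are far more complex, so this should be read as an intuition for the mechanism rather than as a prediction.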
Towards enhanced collaboration and regulation
Faced with these challenges, companies might turn to more compact and specialized models. Collaboration between organizations to share real data could also become essential.
At the same time, stricter regulatory frameworks will need to be established to govern the use of synthetic data. These measures will aim to limit the ethical and technical risks associated with this practice.