Artificial Intelligence Could Face Challenges Due to Limited Training Data
Artificial intelligence (AI) systems, such as ChatGPT, have grown more capable by training on the vast volume of text people have written and shared online. However, a new study by Epoch AI suggests that the supply of publicly available training data for AI language models could soon run out. Tech companies may exhaust this resource between 2026 and 2032, creating a potential bottleneck for AI development.
The study compares this situation to a “literal gold rush” that depletes finite natural resources. Tamay Besiroglu, one of the study’s authors, explains that once the reserves of human-generated writing are drained, the AI field may face challenges in maintaining its current pace of progress. In response, companies like OpenAI and Google are racing to secure high-quality data sources to train their AI models, often by paying for access to data from platforms like Reddit forums and news media outlets.
However, in the long term, there won't be enough new blogs, news articles, and social media commentary to sustain the current trajectory of AI development. That shortfall puts pressure on companies to tap into sensitive data, such as emails or text messages, or to rely on less reliable "synthetic data" generated by the AI systems themselves.
The researchers first projected a shortage of high-quality text data two years ago, but they have since revised their timeframe in light of new techniques that allow AI developers to make better use of existing data. Even with these advances, Epoch now expects publicly available text data to be exhausted within the next two to eight years.
The study, which is peer-reviewed and due to be presented at an upcoming machine learning conference, highlights a potential bottleneck in the growth of AI systems. Scaling up models has been crucial to expanding their capabilities and improving the quality of their output, but with a limited supply of data, that scaling may stall.
While some experts argue that ever-larger models are not necessary and that more specialized models can be built for specific tasks, others worry about training AI systems on the same data they produce. Training on AI-generated output can degrade performance and perpetuate the mistakes, biases, and unfairness already present in the information ecosystem.
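To make that risk concrete, here is a minimal toy sketch (our illustration, not part of the Epoch study): a "model" is reduced to a Gaussian that is repeatedly refitted to its own samples. Over successive generations the fitted distribution drifts and narrows, loosely mirroring how a language model trained on its own output can lose the diversity of the original human-written data.

```python
# Toy illustration (assumption: a Gaussian stands in for a "model").
# Each generation fits parameters to data, then the next generation is
# trained only on samples the fitted model itself produced.

import random
import statistics

random.seed(0)

# Generation 0: "human" data drawn from a known distribution.
data = [random.gauss(0.0, 1.0) for _ in range(500)]

for generation in range(1, 11):
    # "Train" the model: estimate mean and spread from the current data.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

    # The next generation sees only model-generated ("synthetic") samples,
    # and fewer of them, as if fresh human text were unavailable.
    data = [random.gauss(mu, sigma) for _ in range(100)]
```

Running the sketch shows the estimated parameters wandering away from the originals as sampling noise compounds across generations, which is the intuition behind the concern about over-reliance on synthetic data.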
Because AI relies heavily on human-generated content, companies are weighing how to use websites like Reddit and Wikipedia, as well as news and book publishers, which hold sought-after troves of data. While some platforms have restricted AI access to their content, Wikipedia has allowed AI companies to use its volunteer-written entries. Even so, it is essential to preserve incentives for human contributions and to prevent the internet from being flooded with cheap, automatically generated "garbage content."
Epoch's study suggests that paying millions of people to write the text AI models would need is unlikely to be an economical way to improve technical performance. Consequently, AI companies such as OpenAI are exploring synthetic data for training, although concerns remain about relying too heavily on this approach.
In conclusion, publicly accessible training data for AI language models is becoming increasingly scarce. As companies strive to maintain AI progress, they will have to navigate this data shortage and consider alternative approaches to training models effectively.
————–
Questions:
1. What is the main concern highlighted in the study by Epoch AI?
Answer: The study raises concerns about the limited supply of publicly available training data for AI language models.
2. How do tech companies currently address the shortage of training data?
Answer: Tech companies are racing to secure high-quality data sources by signing deals to tap into data from platforms like Reddit forums and news media outlets.
3. What are the potential consequences of running out of new text data for AI development?
Answer: Running out of new text data could hinder the progress of AI development and make it challenging to maintain the current pace of improvement.
4. What alternative solutions are mentioned in the article for obtaining data for AI models?
Answer: The article mentions the possibility of tapping into sensitive data, such as emails or text messages, and relying on synthetic data generated by AI systems.
5. Why is training AI systems on the same data they produce a cause for concern?
Answer: Training on AI-generated data can lead to degraded performance and further encode existing mistakes, biases, and unfairness in the information ecosystem.