According to a report by Wired magazine, several tech giants, including Apple, Nvidia, Anthropic, and Salesforce, used captions from thousands of YouTube videos without permission to train their artificial intelligence models, sparking serious copyright and ethical controversies.
The report reveals that these companies incorporated captions from a wide range of YouTube videos into their AI training datasets. The affected creators include well-known YouTubers such as MKBHD, MrBeast, and Jacksepticeye; talk show hosts Stephen Colbert, John Oliver, and Jimmy Kimmel; educational channels from institutions such as MIT, Khan Academy, and Harvard University; and mainstream media outlets such as The Wall Street Journal and NPR.
The data was downloaded and organized by a non-profit organization called EleutherAI, which included it in a large dataset called "The Pile." The dataset was originally intended to provide training material for small developers and academics, but it was subsequently used by major tech companies as well.
Notably, companies like Apple did not download the data from YouTube themselves; they used the dataset compiled by EleutherAI. Technically, then, it was EleutherAI, not these tech companies, that directly violated YouTube's terms of use.
The incident has sparked debate about the legality and ethics of how AI training data is sourced. It highlights the importance of copyright and usage permissions in the rapidly evolving AI field, and it exposes how poorly existing laws and regulations address these emerging technologies. It also raises new questions about how to balance the rights of creators, platforms, and AI companies.