David Millette, a YouTube creator from Massachusetts, has filed a class-action lawsuit against OpenAI, alleging that the company used transcriptions of millions of YouTube videos to train its generative AI models without permission. According to the complaint, filed by Millette's attorneys in the U.S. District Court for the Northern District of California, OpenAI secretly transcribed his videos and those of other creators to train the models behind ChatGPT and its other generative AI products.
The complaint alleges that OpenAI profited from creators' work by collecting this data, in violation of copyright law and of YouTube's terms of service, which prohibit the use of videos for applications independent of the service. Millette's attorneys write that OpenAI's AI products have become more valuable because they were trained on data whose creators neither consented to its use nor received credit or compensation.
The law firm representing Millette seeks a jury trial and demands over $5 million in damages on behalf of all potentially affected YouTube users and creators.
Generative AI models do not possess true intelligence. They learn patterns, and how likely particular data is to occur, by processing large numbers of examples such as movies, recordings, and papers. Much of this training data is sourced from public websites and online datasets. Although companies claim their data scraping is protected by the doctrine of "fair use," many copyright holders disagree and have turned to litigation to halt the practice.
Video transcriptions have become an important source of training data, especially as other sources have dried up. According to Originality.AI, over 35% of the top websites worldwide now block OpenAI's web crawlers. Additionally, research from the MIT-led Data Provenance Initiative shows that about 25% of high-quality data sources have been restricted, making training data for AI models scarcer.
Notably, OpenAI reportedly built its Whisper speech-recognition model to transcribe audio from videos and so obtain more training data. According to The New York Times, OpenAI transcribed over a million hours of YouTube videos and used the transcriptions to train its GPT-4 model, a move that sparked internal discussions about whether it might violate YouTube's rules.
Key Points:
🔍 YouTuber David Millette has filed a class-action lawsuit against OpenAI, accusing it of using video transcriptions for AI training without permission.
💰 Millette seeks over $5 million in damages on behalf of all potentially affected YouTube users and creators.
🚫 The data sources for generative AI models face increasingly stringent restrictions, with many top websites having blocked OpenAI's crawlers.