Behind the rapid development of artificial intelligence, some tech giants have quietly resorted to controversial means. They have not only "drained" books, websites, photos, and social media posts, but have also made extensive use of YouTube videos, without the creators' knowledge, to train their AI models.

Who's Touching My Videos?

According to an investigation by Proof News, Silicon Valley giants including Anthropic, NVIDIA, Apple, and Salesforce used subtitle data from 173,536 YouTube videos to train AI. These videos come from over 48,000 channels, despite YouTube's explicit ban on scraping material from its platform without permission.


This dataset, known as "YouTube Subtitles," includes subtitles from educational and online learning channels such as Khan Academy, MIT, and Harvard. Videos from the Wall Street Journal, NPR, and the BBC were also used to train AI, as were segments from "The Late Show with Stephen Colbert," "Last Week Tonight with John Oliver," and "Jimmy Kimmel Live."

Proof News also discovered that videos from YouTube superstars such as MrBeast (with 289 million subscribers, 2 videos used for training), Marques Brownlee (19 million subscribers, 7 videos used for training), Jacksepticeye (nearly 31 million subscribers, 377 videos used for training), and PewDiePie (111 million subscribers, 337 videos used for training) were used to train AI. Some of the material used for training AI even promoted conspiracy theories such as "the earth is flat."

Creators' Anger

"Nobody came to me and said, 'We want to use this,'" said David Pakman, host of "The David Pakman Show." His channel has over 2 million subscribers and over 2 billion views, but nearly 160 videos were included in the YouTube Subtitles training dataset.

Pakman's team works full-time, releasing multiple videos daily as well as producing podcasts, TikTok videos, and content for other platforms. If AI companies profit from this material, Pakman said, he should be compensated for the use of his data. He pointed out that some media companies have recently signed agreements to be paid when their works are used to train AI.

Dave Wiskus, CEO of Nebula, was even more blunt, calling it "theft." Nebula is a streaming service partly owned by creators, some of whose works were taken from YouTube for AI training.

The "Goldmine" of Datasets

AI companies compete with one another by obtaining higher-quality data, which is one reason they keep their data sources secret. The New York Times reported earlier this year that Google (which owns YouTube) also used video text from the platform to train its models. In response, a spokesperson said the use was covered by agreements with YouTube creators.

Proof News' investigation also found that OpenAI used YouTube videos without authorization. Company representatives neither confirmed nor denied this finding.

Legal and Ethical Challenges

YouTube Subtitles and other types of speech-to-text data are potential "goldmines" because they can help train models to mimic the way people talk and converse. However, this also raises copyright and ethical controversies. Many creators are concerned that their work is being used to train AI systems that could eventually replace them.

Proof News attempted to contact the owners of all the channels mentioned in this article. Many did not respond to requests for comment. None of the creators we interviewed were aware that their information had been taken, let alone how it was used.

Uncertainty About the Future

Many creators are uncertain about the path forward. Full-time YouTubers already patrol for unauthorized use of their work and routinely submit takedown notices. Some worry that AI will eventually be able to generate content similar to what they produce, or even copy it outright.

Pakman, the creator of "The David Pakman Show," recently saw the power of AI on TikTok. He found a video labeled as a Tucker Carlson clip, but when he watched it, he was shocked: it sounded like Carlson, yet every word was something Pakman had said on his own YouTube show, delivered in the same tone. He was equally shocked that only one commenter seemed to realize it was fake, a clone of Carlson's voice reading Pakman's script.

"This is going to be a problem," Pakman said in a YouTube video he made about the fake video. "You could almost do this with anyone."

EleutherAI co-founder Sid Black wrote on GitHub that he created YouTube Subtitles using a script. The script downloads YouTube subtitles in the same way that a viewer's browser would when watching a video. According to the documentation on GitHub, Black used 495 search terms to collect videos, including "funny vloggers," "Einstein," "black protestants," "protective social services," "information warfare," "quantum chromodynamics," "Ben Shapiro," "Uighurs," "fruitarians," "cake recipes," "Nazca Lines," and "the earth is flat."
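The full pipeline behind the dataset has not been published, but one step it would plausibly need is stripping downloaded caption files down to plain spoken text. Below is a minimal, hypothetical sketch of that cleaning step, assuming captions arrive in the WebVTT format YouTube serves to browsers; the function name and the exact rules are illustrative, not taken from Black's script.

```python
import re

def vtt_to_text(vtt: str) -> str:
    """Strip WebVTT headers, cue numbers, timestamps, and inline tags,
    keeping only the spoken text. A hypothetical cleaning step, not the
    actual dataset code."""
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        # Skip the file header, blank lines, and timestamp cue lines.
        if not line or line == "WEBVTT" or "-->" in line:
            continue
        # Skip numeric cue identifiers.
        if line.isdigit():
            continue
        # Remove inline markup such as <c> styling or <00:00:01.000> tags.
        line = re.sub(r"<[^>]+>", "", line)
        kept.append(line)
    return " ".join(kept)

sample = """WEBVTT

1
00:00:00.000 --> 00:00:02.000
Hello and welcome

2
00:00:02.000 --> 00:00:04.000
to the show."""

print(vtt_to_text(sample))  # Hello and welcome to the show.
```

A corpus built this way contains only the words, not the timing or speaker information, which matches the description of YouTube Subtitles as a text dataset.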

Although YouTube's terms of service prohibit accessing its videos through "automated means," more than 2,000 GitHub users have starred the code.

"If YouTube wanted to prevent this module from working, there are many ways to do that," machine learning engineer Jonas Depoix wrote in a discussion on GitHub, where he posted the code Black used to access YouTube subtitles. "So far, this hasn't happened."

In an email to Proof News, Depoix said he hadn't used the code since he wrote it as a college student for a project several years ago and was surprised that people found it useful. He declined to answer questions about YouTube's rules.

Google spokesperson Jack Malon responded to a request for comment by email, saying the company has taken "action over the years to prevent abusive, unauthorized scraping." He did not respond to questions about other companies using the material as training data.

Among the videos used by AI companies were 146 videos from "Einstein Parrot," a channel with nearly 150,000 subscribers. Marcia, the caretaker of the African grey parrot (who did not want to give her last name for fear of endangering the famous parrot's safety), initially thought it was amusing that AI models absorbed the parrot's speech.

"Who would want to use a parrot's voice?" Marcia said. "But then, I know he talks very well. He speaks in my voice. So he's mimicking me, and then AI is mimicking the parrot."

Once the data is absorbed by an AI model, it cannot be "forgotten." Marcia is troubled that the parrot's recordings could be used in unknown ways, including the creation of a digital copy of the parrot, and worries that such a copy could be made to swear.

"We are entering uncharted territory," Marcia said.

References:

https://www.wired.com/story/youtube-training-data-apple-nvidia-anthropic/

https://arstechnica.com/ai/2024/07/apple-was-among-the-companies-that-trained-its-ai-on-youtube-videos/