In a revelation that’s stirring up the tech world, several major companies, including Apple, Nvidia, and Salesforce, have been caught using YouTube content to train their AI models without authorization. This comes hot on the heels of similar allegations against OpenAI and points to a worrying trend in the AI industry’s data practices.

An investigation by Proof News has uncovered that subtitles from over 170,000 YouTube videos, spanning more than 48,000 channels, were used to train AI models without creators’ consent. This practice directly violates YouTube’s Terms of Service, which prohibit using materials from the platform without permission. But it appears the AI giants have pushed ethics to the sidelines.

The scale of this unauthorized data usage is significant. The dataset in question, called “YouTube Subtitles,” contains transcripts from a wide range of sources, including educational channels like Harvard and MIT, major publications such as the BBC and The Wall Street Journal, and popular YouTubers like PewDiePie, MKBHD, and MrBeast. Alarmingly, it even includes content from over 12,000 deleted videos, raising serious privacy concerns. MKBHD took to 𝕏 to express his frustration after the recent revelations.

This YouTube Subtitles dataset is part of a larger compilation known as “the Pile,” created by the non-profit organization EleutherAI. While EleutherAI’s stated aim is to broaden access to cutting-edge AI techniques, its methods have now come under scrutiny: the group reportedly used a script to automatically download subtitles from YouTube’s API, likely violating the platform’s Terms of Service in the process.
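To make the mechanism concrete, here’s a minimal sketch of what a subtitle scraper of this kind might look like. It’s an illustration only, not EleutherAI’s actual script: it assumes the third-party youtube-transcript-api Python package (and its classic get_transcript() interface), and the video IDs are placeholders.

```python
# Sketch of bulk subtitle collection from YouTube.
# Assumes: pip install youtube-transcript-api (classic interface).
# Video IDs below are placeholders, not entries from the actual dataset.
from youtube_transcript_api import YouTubeTranscriptApi

video_ids = ["dQw4w9WgXcQ", "9bZkp7q19f0"]  # hypothetical targets

for video_id in video_ids:
    try:
        # Returns a list of caption segments:
        # [{"text": ..., "start": ..., "duration": ...}, ...]
        segments = YouTubeTranscriptApi.get_transcript(video_id)
    except Exception as exc:  # captions disabled, video deleted, etc.
        print(f"{video_id}: no transcript available ({exc})")
        continue

    # Flatten the segments into one plain-text transcript, the form
    # in which subtitles would end up in a text training corpus.
    transcript = " ".join(segment["text"] for segment in segments)
    with open(f"{video_id}.txt", "w", encoding="utf-8") as fh:
        fh.write(transcript)
```

Run against tens of thousands of channels, a loop like this is all it takes to assemble a corpus the size of YouTube Subtitles, which is exactly the kind of automated harvesting the platform’s Terms of Service forbid.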

The list of tech giants implicated in this controversy is surprising. Apple, often praised for its stance on privacy, reportedly used the Pile to train its OpenELM model, released shortly before the company announced new AI capabilities for iPhones and MacBooks. Nvidia, Salesforce, Anthropic (which recently received a $4 billion investment from Amazon), Bloomberg, and Databricks have also been named as users of the dataset.

This unauthorized use of data raises serious ethical and legal questions. Dave Farina, host of the YouTube channel Professor Dave Explains, had 140 of his videos included in the dataset without his knowledge or consent. He argues that these companies are profiting from creators’ work while building AI models that could eventually replace those very creators. Farina and others are calling for regulation or compensation to address the issue.

This controversy follows reports that OpenAI transcribed over a million hours of YouTube videos to train its GPT-4 model. According to The New York Times, OpenAI knew the practice was legally questionable but believed it fell under fair use. Similarly, Google, which owns YouTube, has admitted to using videos from the platform to train its own AI models, though it claims this use was permitted under its agreements with creators.
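For a sense of how transcription at that scale works, the sketch below uses OpenAI’s open-source Whisper model, reportedly the tool the company relied on. The file path is a placeholder, and this is an illustration of the technique, not OpenAI’s actual pipeline.

```python
# Sketch: turning video audio into training text with the open-source
# Whisper speech-to-text model (pip install openai-whisper; requires
# ffmpeg on PATH). The input file path is a placeholder.
import whisper

# "base" is a small checkpoint, chosen here for illustration; larger
# checkpoints ("medium", "large") are more accurate but slower.
model = whisper.load_model("base")

# transcribe() extracts the audio, runs speech recognition, and
# returns a dict whose "text" field holds the full transcript.
result = model.transcribe("downloaded_video.mp4")
print(result["text"])
```

Repeated across a million hours of footage, a loop around this one call turns speech that was never written down anywhere into fresh text for a training set.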

The race for AI supremacy is clearly pushing companies to take risks with data acquisition. As high-quality training data becomes scarce, AI developers are turning to unconventional and potentially unethical sources. The Wall Street Journal reports that, at the current rate of consumption, the industry’s demand for training data may outpace the creation of new content by 2028, further intensifying the scramble for training material.

This situation presents a complex challenge for the tech industry, content creators, and regulators. While advanced AI models have the potential to revolutionize numerous fields, the methods used to train them raise serious concerns about copyright infringement, privacy violations, and exploitation of content creators. The legal landscape surrounding AI training data is still evolving, with companies arguing “fair use” while facing lawsuits from creators alleging copyright violations. For YouTube creators, the unauthorized use of their content is particularly troubling, and it remains unclear what recourse they have.

As a writer myself, I wouldn’t want massive companies with billions in their pockets profiting from my work for free, and I’m confident creators around the world share that sentiment. As this story unfolds, it’s clear we’re at a critical juncture in AI development. The actions of these tech giants will likely shape the future of AI and our understanding of data rights, content ownership, and ethical responsibility in tech. I, for one, hope companies adopt clear guidelines and ethical standards for using public data, ensure proper compensation for content creators, and develop robust oversight mechanisms. But what are your thoughts? Do you think information, once published publicly on the web, is fair game for AI training? Feel free to share your thoughts in the comments section below.

Dwayne Cubbins

For nearly a decade, I've been deciphering the complexities of the tech world, with a particular passion for helping users navigate the ever-changing tech landscape. From crafting in-depth guides that unlock your phone's hidden potential to uncovering and explaining the latest bugs and glitches, I make sure you get the most out of your devices. And yes, you might occasionally find me ranting about some truly frustrating tech mishaps.
