Tech Giants Accused of Using YouTube Videos to Train AI Without Permission

A recent investigation by Proof News has revealed that major tech companies, including Apple, Nvidia, and Anthropic, have allegedly used subtitles from thousands of YouTube videos to train their artificial intelligence models without the creators’ consent. This practice, which bypasses YouTube’s rules against unauthorized data harvesting, has sparked significant concern among content creators and privacy advocates.

The investigation found that subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, were used by these companies. The dataset, known as YouTube Subtitles, includes transcripts from educational channels like Khan Academy and prestigious institutions such as MIT and Harvard. It also features content from popular media outlets and late-night talk shows, as well as videos from YouTube stars like MrBeast and PewDiePie.

Content creators, including David Pakman of The David Pakman Show, expressed their frustration upon discovering their videos were used without permission. Pakman emphasized the significant effort and resources invested in creating his content and argued that he should be compensated if AI companies profit from his work.

Dave Wiskus, CEO of the creator-owned streaming service Nebula, described the practice as theft and called it disrespectful, noting the potential for AI to replace human artists and creators.

The dataset, published by EleutherAI, consists of plain text subtitles and includes translations in multiple languages. Although the dataset is publicly available, its unauthorized use raises ethical and legal questions, particularly concerning YouTube’s terms of service.

Prominent companies such as Apple, Nvidia, and Salesforce confirmed using the dataset for training AI models, often for research and academic purposes. However, the risks posed by biases and inaccuracies in the data have drawn attention. Salesforce developers highlighted concerns about profanity and biases in the dataset, which could lead to safety and ethical issues.

The unauthorized use of YouTube data is part of a broader trend where AI companies seek high-quality data to improve their models. This practice has also extended to other datasets, such as Books3, which included works from renowned authors and led to multiple lawsuits against AI companies for copyright violations.

As the AI industry continues to evolve, the debate over the ethical use of data remains unresolved. Content creators and advocates are calling for greater transparency and regulation to ensure fair compensation and protect against unauthorized data usage. The controversy surrounding YouTube Subtitles underscores the need for a balanced approach that respects creators’ rights while fostering technological innovation.

Source: Wired