OpenAI reportedly transcribed more than a million hours of YouTube videos to collect training data for its advanced GPT-4 model, in defiance of the Google-owned platform’s copyright rules. According to a report from The New York Times, Microsoft-backed OpenAI used an in-house speech recognition tool called Whisper to transcribe audio from YouTube videos into conversational text, which was then used to train the AI model that powers ChatGPT.
According to the report, employees at the ChatGPT maker discussed internally that using YouTube data for training purposes might violate the platform’s policies. The company reportedly chose to use data from YouTube videos because it had exhausted the pool of publicly available data. The report states that OpenAI President Greg Brockman personally helped select the videos to be transcribed.
Google prohibits the use of videos posted on YouTube for applications that are “independent” of the video platform.
In a statement to The Verge, OpenAI spokesperson Lindsay Held said the company uses “unique” data sets for each of its models to “help them understand the world.” She added that the company uses “many sources, including publicly available data and partnerships for non-public data.”
Commenting on the matter, Google spokesperson Matt Bryant told The Verge that Google has “seen unconfirmed reports” related to OpenAI using YouTube videos to train AI models. He added that the streaming platform’s “terms of service and robots.txt files prohibit unauthorized scraping or downloading of YouTube content.”
Earlier this week, YouTube CEO Neal Mohan said in an interview with Bloomberg that he had “seen reports” of OpenAI using YouTube videos to train its Sora text-to-video generator. He said he had no information on whether this had happened, but that it would be a “blatant violation” of the platform’s policies if it had.
According to The New York Times report, Google also used transcribed text from YouTube videos to train its Gemini AI model. If true, this would violate the copyright of the videos, which belongs to the creators who post them on the platform. The report states that Google has expanded its terms of service to allow the company to use publicly available Google Docs files, restaurant reviews on Google Maps, and more to train AI models.
First published: April 8, 2024 | 12:07 p.m. IST