A recent investigation revealed that several tech giants, including Apple, used YouTube video subtitles to train AI models. The data covered more than 170,000 videos, including content from well-known creators such as MKBHD and MrBeast. Apple used the data to train its open-source model, OpenELM, which was released in April of this year.
In response, Apple has publicly clarified that OpenELM is not used in any of its AI or machine learning features, including Apple Intelligence. Apple emphasized that it developed OpenELM to contribute to the research community and to advance open-source large language models. Apple researchers had previously described OpenELM as a "state-of-the-art open language model."
Apple stated that OpenELM is used only for research and does not power any Apple Intelligence features. The model is released as open source and is available from Apple's machine learning research website. This means that the "YouTube subtitles" dataset was not used to build Apple Intelligence. Apple previously stated that the Apple Intelligence model was "trained on licensed data, including data selected for specific functions and publicly available data collected through web crawlers."
Apple also has no current plans to develop a new version of OpenELM. Wired magazine reported that, in addition to Apple, companies such as Anthropic and NVIDIA also used the "YouTube subtitles" dataset to train their AI models. The dataset is part of "The Pile," a large data collection compiled by the nonprofit organization EleutherAI.
The incident has sparked discussion about where AI training data comes from and what its use means for privacy and copyright. Although Apple has clarified how OpenELM is used, the practice of tech companies training AI models on public data remains a concern.