Apple Inc. has recently released a technical paper detailing the models developed for the generative artificial intelligence capabilities of the "Apple Intelligence" series. These features are set to roll out to iOS, macOS, and iPadOS platforms in the coming months. In the paper, Apple addresses external concerns about ethical issues in their model training process, reiterating that they have not used any private user data, instead relying on publicly available and licensed data for training.
Apple states that their pre-training dataset includes licensed data from publishers, carefully selected public datasets, and publicly available information scraped by their web crawler, Applebot. Emphasizing the importance of user privacy, Apple highlights that these datasets do not contain any private user information.
In July, media reports surfaced that Apple had trained models on a dataset called "The Pile," which contained subtitles from hundreds of thousands of YouTube videos, many of them included without the knowledge or authorization of their creators. Apple later clarified that the models trained on that data were never intended to power any AI features in its products.
The technical paper sheds light on Apple's "Apple Foundation Model" (AFM), announced at WWDC 2024, and emphasizes that the training data for these models was acquired "responsibly." The AFM training data comes from public web data and undisclosed licensed data from publishers. Reports indicated that Apple approached multiple publishers, including NBC and Condé Nast, at the end of 2023, seeking long-term agreements worth at least $50 million to use their news archives for model training. The AFM models also trained on open-source code hosted on GitHub, including code in languages such as Swift, Python, and C.
However, training on open-source code has sparked controversy among developers: some repositories lack a proper license or carry licenses that do not permit use for AI training. Apple asserts that it applies a "license filter," selecting only repositories with minimal usage restrictions.
To enhance the mathematical capabilities of the AFM models, Apple specifically included mathematical problems and answers from web pages, math forums, blogs, tutorials, and seminars in their training dataset. They also used "high-quality, publicly available" datasets for fine-tuning to minimize the likelihood of the models exhibiting inappropriate behavior.
The combined dataset contains approximately 6.3 trillion tokens, less than half the 15 trillion tokens Meta used to train its flagship text-generation model, Llama 3.1 405B. Apple further refined the AFM models with human feedback and synthetic data to better align them with user needs.
The paper contains no groundbreaking revelations, and that appears deliberate: papers of this kind typically avoid detail that could create legal exposure. Apple does note that web administrators can block its crawler from scraping their sites, but this offers little recourse to individual creators whose work appears on sites they do not control, leaving the protection of that work an unresolved issue.
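The opt-out mechanism the paper refers to works through the standard robots.txt protocol. As a sketch (the exact directives are defined in Apple's own crawler documentation, so treat the agent names below as assumptions to verify there), a site administrator who wants to remain in Apple's search index but opt out of model training could publish something like:

```
# Hedged robots.txt sketch: Apple's crawler guidance describes a separate
# "Applebot-Extended" agent for AI-training opt-out, while plain "Applebot"
# governs search indexing. Disallowing only the extended agent is intended
# to exclude a site from model training without affecting search.
User-agent: Applebot-Extended
Disallow: /

User-agent: Applebot
Allow: /
```

As the article notes, this only helps whoever controls the server's robots.txt; an individual author whose captions or code are hosted on someone else's site cannot set these directives.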
Key Points:
🌟 Apple emphasizes that they did not use private user data in training models, relying instead on public and licensed data.
📊 The training data includes authorized content from multiple publishers and open-source code repositories.
🔍 Apple strives to enhance AI model performance and accountability while protecting user privacy.