In an interesting turn of events, Apple has acknowledged using a controversial dataset for AI training while simultaneously distancing itself from the ethical concerns surrounding it. The tech giant confirmed it had utilized “The Pile,” a dataset compiled by AI research lab EleutherAI, which includes subtitles from YouTube videos obtained without explicit creator permission.
The dataset, which also incorporates content from Wikipedia, European Parliament proceedings, and even Enron staff emails, was originally created to democratize AI development. However, its use by major tech companies like Apple, Nvidia, and Salesforce has raised eyebrows in the AI ethics community.
Apple, known for its stance on privacy and ethical data use, was quick to clarify its position. In statements to multiple tech publications, the company confirmed that while it had indeed used The Pile, it was not for its flagship Apple Intelligence project. Instead, Apple used the dataset to train its open-source OpenELM models, released in April.
“OpenELM was created purely for research purposes,” an Apple spokesperson told 9to5Mac. “It doesn’t power any of our AI or machine learning features, including Apple Intelligence.”
This revelation comes at a time when the AI industry is under increasing scrutiny for its data collection and training practices. Apple’s response seems calculated to maintain its image as a privacy-focused company while acknowledging its participation in broader AI research.
The company reiterated its commitment to ethical AI development, pointing out that its Apple Intelligence models are trained on licensed data and publicly available information collected by its web crawler. This stands in contrast to the YouTube subtitles and other potentially problematic data sources found in The Pile.
Furthermore, 9to5Mac reports that Apple “has no plans to build any new versions of the OpenELM model.” Let’s just hope Apple and co. take steps to ensure the data they’re using isn’t scraped from the web unethically.