Meta is facing a copyright infringement lawsuit in which the plaintiff's attorney claims that Meta CEO Mark Zuckerberg approved the use of a dataset containing pirated eBooks and articles to train the company's Llama AI models. The case is one of many copyright lawsuits against tech giants accused of using copyrighted works to train AI models without authorization.
In documents submitted to the U.S. District Court for the Northern District of California on Wednesday evening, the plaintiff cited testimony that Meta gave at the end of last year, which revealed that Zuckerberg approved the use of a dataset known as LibGen to train Llama. LibGen is viewed as a "link aggregator" that provides access to a vast array of copyrighted academic publications. Despite facing multiple lawsuits and injunctions for copyright infringement, the site continues to offer works from major publishers such as Cengage Learning and McGraw Hill.
The documents state that some Meta employees internally acknowledged that LibGen is a "dataset we know is pirated" and warned that using it could weaken the company's negotiating position with regulators. Particularly concerning is the allegation that Meta engineer Nikolay Bashlykov wrote scripts to strip copyright information from LibGen eBooks, including the words "copyright" and "acknowledgments." Meta is also alleged to have removed copyright marks and source metadata from scientific journal articles to conceal the infringement.
More controversially, Meta is accused of downloading LibGen content via torrenting and thereby helping to distribute the pirated files. Torrenting is a method of distributing files online in which users who download a file simultaneously upload parts of it to other users. The plaintiff's attorney stated that by participating in torrenting, Meta effectively committed another form of copyright infringement. Although Meta engineers expressed reservations about the legality of this approach, Meta continued the practice with the support of Ahmad Al-Dahle, its head of generative AI.
These allegations align with a report from The New York Times last April, which suggested that Meta cut corners in collecting AI training data. According to the report, Meta hired contractors in Africa to compile book summaries and considered acquiring the publisher Simon & Schuster. However, Meta executives believed that negotiating copyright licenses would take too long, and fair use became their main defense argument.
The case has not yet reached a conclusion, and it involves only Meta's early Llama models. Although the court dismissed several AI-related copyright lawsuits in 2023, ruling that the plaintiffs had failed to prove infringement, the allegations in this case could still harm Meta. Judge Vince Chhabria noted in an order on Wednesday that he had rejected Meta's request to redact large portions of the documents, stating that the redactions were clearly intended to avoid negative publicity rather than to protect sensitive business information.
This case will continue to spark widespread discussion about how tech companies use copyrighted works to train AI models, especially regarding the boundaries between fair use and copyright protection.