Documents recently unsealed in a class-action copyright lawsuit against Meta have drawn significant attention to the company's use of Library Genesis (LibGen), a library of pirated e-books, to train its latest AI model, Llama 3. The documents show that Meta's engineers discussed the risks of using LibGen, a "shadow library," especially in light of growing concerns about copyright and data ownership. Despite the potential negative impact and public relations risks, Meta CEO Mark Zuckerberg approved the decision.
At the court's request, confidential internal conversations about using the LibGen dataset were unsealed. They reveal that Meta executives, in discussions with the AI research team, explicitly acknowledged that the LibGen data was "known to be pirated" and nonetheless agreed to use it to improve Llama 3's performance. In an email, Meta product management director Sony Theakanath noted that while using LibGen posed public relations risks, other AI companies were using similar data, which led Meta's team to feel that it was not alone in taking this path.
More concerning, Meta employees discussed how to process and filter the LibGen text to remove copyright identifiers such as ISBNs and copyright notices. An internal memo described the LibGen materials as "high quality and lengthy, making them very suitable for learning particularly specialized knowledge." The discussion of stripping identifiers suggests that Meta was attempting to obscure its use of unauthorized content.
Additionally, Meta employees noted in emails that using the company's IP address for torrent downloads might be inappropriate, and they expressed concerns about the practice. However, Zuckerberg's "top-down push" to use the LibGen dataset made Meta's competitive drive in the AI race apparent. The incident has renewed outside scrutiny of how major tech companies handle copyright issues.
The outcome of this copyright lawsuit could significantly affect similar ongoing cases, especially those involving creative works such as images, music, and literature. As tech companies' demand for original content continues to rise, the rights of the creators of that content will come under increasing focus.