Recently, LAION announced the release of Re-LAION-5B, a new version of its AI training dataset that has undergone a safety review. The dataset builds on the widely used LAION-5B, and its main change is the removal of links associated with Child Sexual Abuse Material (CSAM). LAION states that Re-LAION-5B is the first web-scale text-image pair dataset to be comprehensively cleaned of known CSAM links.
LAION's spokesperson mentioned that Re-LAION-5B is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 links were removed from the new dataset, all identified by checking against lists provided in collaboration with child protection organizations. Of these, 1,008 links had been identified in a report released by the Stanford Internet Observatory in December 2023.
LAION pointed out that many known CSAM links may no longer be active, as efforts to remove such content from the public internet have been ongoing. The figure of 2,236 is therefore an upper bound, and the number of links that were still accessible may be lower. Re-LAION-5B contains 5.5 billion text-image pairs. Third parties can use its metadata to clean existing derivatives of LAION-5B: they can compute the difference between the old and new metadata and remove all matching content from their own copies.
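The diff-and-remove step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LAION's own tooling: the rows are toy in-memory dictionaries, the `url` field name and the helper names `removed_keys` and `filter_derivative` are hypothetical, and real metadata would be loaded from the published files rather than written inline.

```python
from typing import Iterable

Row = dict[str, str]


def removed_keys(original: Iterable[Row], cleaned: Iterable[Row], key: str = "url") -> set[str]:
    """Return the keys present in the original metadata but absent from the cleaned one."""
    cleaned_keys = {row[key] for row in cleaned}
    return {row[key] for row in original if row[key] not in cleaned_keys}


def filter_derivative(derivative: Iterable[Row], removed: set[str], key: str = "url") -> list[Row]:
    """Drop every row of a derivative dataset whose key appears in the removed set."""
    return [row for row in derivative if row[key] not in removed]


# Toy data standing in for metadata rows (assumed layout, example.org URLs only).
original_meta = [
    {"url": "https://example.org/a.jpg", "text": "a cat"},
    {"url": "https://example.org/b.jpg", "text": "flagged entry"},
    {"url": "https://example.org/c.jpg", "text": "a dog"},
]
cleaned_meta = [
    {"url": "https://example.org/a.jpg", "text": "a cat"},
    {"url": "https://example.org/c.jpg", "text": "a dog"},
]
derivative_meta = [
    {"url": "https://example.org/b.jpg", "text": "flagged entry"},
    {"url": "https://example.org/c.jpg", "text": "a dog"},
]

removed = removed_keys(original_meta, cleaned_meta)
kept = filter_derivative(derivative_meta, removed)
print(len(removed), "entries removed upstream;", len(kept), "rows kept in the derivative")
```

Computing the difference from metadata alone means a derivative can be cleaned without anyone having to download or inspect the underlying images. At the scale of billions of rows, the same set logic would need to run over files in chunks rather than in memory.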
LAION hopes that by releasing Re-LAION-5B, it can set a new standard for cleaning web-scale datasets. This update follows criticism of the original LAION-5B dataset for including patient images. At the same time, LAION also mentioned that the presence of CSAM in AI training datasets is a serious issue, especially since some trained systems have been used to generate CSAM content.
According to a report by the Internet Watch Foundation (IWF), there has been a significant increase in AI-generated child sexual abuse material since the fall of 2023. This rise in AI-generated content not only complicates the investigation of real child abuse cases but also leads to a surge in automated reports of CSAM on social media platforms, further intensifying the complexity of the issue.
Key Points:
🌟 LAION says Re-LAION-5B is the first web-scale text-image pair dataset to be comprehensively cleaned of known CSAM links.
🔗 2,236 links were removed after checks against lists from child protection organizations, including 1,008 identified in the Stanford Internet Observatory's December 2023 report.
🛡️ LAION hopes the new dataset will set a new standard for cleaning web-scale datasets.