LAION, the German research organization behind the datasets used to train Stable Diffusion and other generative AI models, has released a new dataset that it claims "has been thoroughly purged of known links to suspected child sexual abuse material (CSAM)."
The new dataset, Re-LAION-5B, is essentially a re-release of the old LAION-5B dataset with "fixes" implemented on the recommendations of the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-defunct Stanford Internet Observatory. It is available for download in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content). LAION says thousands of links to known and suspected CSAM have been filtered out of both versions.
LAION wrote in a blog post: "From the outset, LAION has been committed to removing illegal content from its datasets and has taken appropriate measures to achieve this goal. LAION strictly adheres to the principle of removing illegal content as soon as it is discovered."
It is important to note that LAION's datasets do not contain images, and never have. Rather, they are indexes of image links and image alt text that LAION compiled from another dataset, Common Crawl, a corpus of scraped websites and web pages.
Image credit: AI-generated image via Midjourney.
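To make that concrete, here is a minimal sketch of what a single row of such an index might look like, and why no imagery ever sits in the dataset itself. The field names are illustrative assumptions, not LAION's actual schema:

```python
import urllib.request

# Minimal sketch of a LAION-style metadata row: a link plus text, no image
# bytes. Field names are illustrative assumptions, not the dataset's schema.
record = {
    "url": "https://example.com/photos/1234.jpg",              # where the image lives on the web
    "alt_text": "a red bicycle leaning against a brick wall",  # caption scraped from the page
    "width": 1024,
    "height": 768,
}

# A training pipeline must fetch each image itself; the dataset only points
# at it, which is why removing a link removes the material without any image
# ever having been stored by LAION.
def fetch(row: dict) -> bytes:
    with urllib.request.urlopen(row["url"]) as resp:
        return resp.read()
```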
The release of Re-LAION-5B follows a December 2023 investigation by the Stanford Internet Observatory, which found that LAION-5B (specifically a subset called LAION-5B 400M) contained at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, 400M also contained links to "various inappropriate content," including pornographic imagery, racist slurs, and harmful social stereotypes.
Although the report's Stanford co-authors noted that removing the offending content would be difficult and that the presence of CSAM does not necessarily affect the output of models trained on the dataset, LAION decided to temporarily take LAION-5B offline.
The Stanford report recommended that models trained on LAION-5B "should be deprecated and discontinued where possible." Perhaps relatedly, the AI startup Runway recently removed its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we have reached out to the company for more information. (Runway partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)
The new Re-LAION-5B dataset contains approximately 5.5 billion text-image pairs and is released under the Apache 2.0 license. LAION states that third parties can use the metadata to clean existing copies of LAION-5B by removing matching illegal content.
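As a rough sketch of that cleaning step, a holder of an old LAION-5B copy could drop every metadata row whose URL or image hash appears on a removal list. The file names, column names, and removal-list format below are assumptions for illustration, not LAION's published schema:

```python
import pandas as pd

# Hypothetical file and column names; LAION's actual metadata layout and the
# removal-list format may differ.
metadata = pd.read_parquet("laion5b_shard_0000.parquet")  # one shard of an old LAION-5B copy
removal = pd.read_csv("removal_list.csv")                 # flagged URLs and image hashes

bad_urls = set(removal["url"].dropna())
bad_hashes = set(removal["image_hash"].dropna())

# Keep only rows that match neither a flagged URL nor a flagged image hash.
cleaned = metadata[
    ~metadata["url"].isin(bad_urls) & ~metadata["image_hash"].isin(bad_hashes)
]
cleaned.to_parquet("laion5b_shard_0000_cleaned.parquet")
print(f"Dropped {len(metadata) - len(cleaned):,} rows")
```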
LAION emphasizes that its datasets are for research purposes, not commercial use. But if history is any indication, this won't stop some organizations. In addition to Stability AI, Google has also used LAION datasets to train its image generation models.
LAION continues in its post: "A total of 2,236 [links to suspected CSAM] were removed after matching against the lists of link and image hashes provided by our partners. These links also include the 1,008 links found in the December 2023 Stanford Internet Observatory report... We strongly urge all research labs and organizations still using the old LAION-5B to migrate to the Re-LAION-5B dataset as soon as possible."