Recently, the social media platform Bluesky faced a significant data scraping incident. A machine learning librarian, Daniel van Strien, scraped over one million public user posts from Bluesky's API and uploaded this data to the AI company Hugging Face.


The dataset includes users' decentralized identifiers (DIDs) and a range of features that allow for searching specific user content. Van Strien stated that the dataset was intended primarily for developing language models and natural language processing, as well as for research on social media trend analysis, content moderation, and posting patterns.

The scraping has attracted widespread attention because Bluesky users did not consent to having their content used for such purposes. The platform, however, does not explicitly prohibit this behavior: its API provides an "aggregated, chronological public data stream" that includes posts, likes, follows, account changes, and more. Bluesky's content is therefore, in principle, open to third-party developers.

In response, a representative from Bluesky stated, "Bluesky is an open and public social network, just like other websites on the internet. Just as robots.txt files do not always prevent external companies from scraping those websites, the same applies here. We hope to find a way for Bluesky users to communicate to external organizations/developers whether they consent to the use of their data, and we expect external organizations to respect users' consent. We are actively discussing how to achieve this goal."

This incident has raised concerns among users, especially since many switched to Bluesky because of the new AI training policies of the competing platform X. Notably, shortly after the scraping was reported, Van Strien removed the dataset from Hugging Face.


He stated on Bluesky, "I have removed the Bluesky data from the repository. While I wanted to support the development of tools for the platform, I realize that this practice violates the principles of transparency and consent in data collection. I sincerely apologize for this."

Key Points:

🌐1. A machine learning librarian scraped over one million public posts from Bluesky and uploaded them to the AI company Hugging Face for machine learning research.  

🔍2. Bluesky users did not consent to the use of their data, although the platform does not explicitly prohibit such data scraping.  

🚫3. The data scraping incident raised user concerns, and Van Strien has removed the related data from Hugging Face and expressed his apologies.