Recent reports have highlighted a data breach at OpenAI, but rest assured: your ChatGPT conversations do not appear to be at risk. The hack seems to have been superficial, but it is a reminder that AI companies have quickly become some of the most attractive targets for hackers.
According to The New York Times, former OpenAI employee Leopold Aschenbrenner alluded to the hack on a podcast, calling it a "major security incident." But anonymous company sources told the Times that the hackers only gained access to an employee discussion forum.
Security vulnerabilities should never be treated as trivial, and eavesdropping on internal development discussions at OpenAI certainly has value. But it is a far cry from the hackers gaining access to internal systems, models in development, or secret roadmaps.
Even so, the incident should worry us, and not necessarily because of the threat of China or other adversaries overtaking us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to an enormous amount of extremely valuable data.
Let's talk about the three types of data created or accessed by OpenAI and, to some extent, other AI companies: high-quality training data, extensive user interactions, and customer data.
What training data they actually possess is unclear, since these companies are extremely secretive about their hoards. But it would be a mistake to assume those datasets are just giant piles of scraped web data. Yes, they use web crawlers and datasets like the Pile, but shaping that raw data into something that can train a model like GPT-4o is an enormous task, one that requires a great deal of human labor and can only be partially automated.
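To make "partially automated" concrete, here is a minimal sketch, in Python, of the kind of mechanical cleaning step such a pipeline might include: exact deduplication plus a few crude quality heuristics. The thresholds and heuristics here are hypothetical illustrations of the general technique, not anything OpenAI has disclosed.

```python
# A minimal sketch of the mechanical part of dataset curation: exact
# deduplication plus crude quality heuristics. Every threshold below is a
# hypothetical illustration, not anything OpenAI has disclosed; real pipelines
# are vastly larger, and the hard judgment calls still fall to human reviewers.
import hashlib


def passes_quality_heuristics(doc: str) -> bool:
    """Cheap automated checks; borderline cases still need human review."""
    words = doc.split()
    if len(words) < 50:  # too short to teach a model much
        return False
    if len(set(words)) / len(words) < 0.3:  # repetitive spam or boilerplate
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    if alpha_ratio < 0.6:  # mostly markup, tables, or encoding debris
        return False
    return True


def dedupe_and_filter(raw_docs):
    """Drop exact duplicates, then apply the heuristics above."""
    seen = set()
    for doc in raw_docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_quality_heuristics(doc):
            yield doc


if __name__ == "__main__":
    sample = (
        "The committee reviewed the proposal in detail, weighing projected "
        "costs against expected benefits before recommending a phased rollout "
        "across three regional offices, with quarterly checkpoints to reassess "
        "staffing, vendor contracts, and the training schedule for newly hired "
        "analysts, while also noting several open questions about data "
        "retention, security reviews, and long-term maintenance burdens."
    )
    corpus = [sample, sample, "buy now " * 40]  # stand-in for crawled pages
    kept = list(dedupe_and_filter(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")  # kept 1 of 3
```

Filters like these catch the obvious junk; deciding what counts as high-quality text in the long tail is where the human labor comes in.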
Some machine learning engineers have speculated that dataset quality is among the most influential factors in building a large language model (or perhaps any transformer-based system). That is why a model trained on Twitter and Reddit will never be as eloquent as one trained on every published work of the last century. (It may also explain why OpenAI reportedly used legally questionable sources such as copyrighted books in its training data, a practice the company says it has abandoned.)
So the training datasets OpenAI has built are of immense value to competitors, to other companies, to adversary nations, and to U.S. regulators. Wouldn't the FTC or the courts want to know exactly what data was used, and whether OpenAI has been truthful about it?
But perhaps even more valuable is OpenAI's enormous trove of user data: potentially billions of ChatGPT conversations across millions of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT holds the pulse of a population that may be smaller than Google's user base but offers far more depth. (In case you were unaware: unless you opt out, your conversations are being used as training data.)
Hundreds of large companies and countless smaller ones use APIs from the likes of OpenAI and Anthropic for a wide variety of tasks. And to make a language model useful, it usually has to be fine-tuned on, or otherwise granted access to, a company's internal databases.
That might be something as mundane as old budget sheets or personnel records (to make them more easily searchable, say), or as sensitive as the code of unreleased software. How companies use these AI capabilities (and whether they are actually useful) is their business, but the simple fact is that the AI provider ends up with privileged access, just like any other SaaS vendor.
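As an illustration of what that privileged access looks like in practice, here is a hedged sketch using the OpenAI Python client (v1.x). The confidential memo and the use case are invented, but the pattern, putting private records into the prompt so the model can answer questions about them, is typical.

```python
# A hypothetical sketch of how a customer's internal data ends up passing
# through an AI provider's infrastructure. The memo below is invented, but
# the pattern (stuff private records into the prompt so the model can answer
# questions about them) is common. Uses the real OpenAI Python client (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Imagine this was pulled from an internal finance database.
internal_record = """
Q3 budget memo (CONFIDENTIAL): engineering headcount frozen;
marketing spend reduced 12%; Project Falcon ship date moved to November.
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer questions using only the provided document."},
        {"role": "user",
         "content": f"Document:\n{internal_record}\n\n"
                    "Question: What changed for marketing?"},
    ],
)
print(response.choices[0].message.content)
# The point: that confidential memo just traversed the provider's systems,
# which is exactly the privileged access described above.
```

Providers generally say that API traffic like this is not used for training by default, but it still transits (and may be logged by) their infrastructure, and that is the access that matters here.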
These are industrial secrets, and AI companies have suddenly found themselves at the heart of a great many of them. The novelty of the industry carries a special risk: AI processes are not yet standardized or well understood.
Key Points:
- The data owned by AI companies includes high-quality training data, user interaction data, and customer data, which are of great value to competitors, regulators, and market analysts.
- Conversations between users and AI models are valuable information, a goldmine for developers, marketing teams, and consulting analysts.
- The emerging pattern of AI companies becoming hacking targets underscores the importance of security measures; even in the absence of a severe data breach, it should raise concerns.