Last week, on December 11, OpenAI's ChatGPT and related services such as Sora suffered an outage lasting 4 hours and 10 minutes, affecting a large number of users. OpenAI has now published a detailed post-incident report on the ChatGPT outage.
In short, the root cause was a small change with outsized consequences: at the critical moment, engineers were locked out of the very systems they needed to fix, which kept them from addressing the issue promptly. Once the problem was identified, OpenAI's engineers launched several repair efforts in parallel, including scaling down cluster size, blocking network access to the Kubernetes admin APIs, and scaling up the Kubernetes API servers. After several rounds of work, they regained access to part of the Kubernetes control plane, shifted traffic to healthy clusters, and ultimately achieved a full recovery.
The incident began at 3:12 PM Pacific Standard Time, when engineers deployed a new telemetry service to collect metrics from the Kubernetes (K8s) control plane. Because the service was configured with an unintentionally broad scope, every node in every cluster simultaneously issued resource-intensive K8s API operations. The API servers were quickly overwhelmed, leaving the K8s control planes of most clusters unable to serve requests.
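To make the failure mode concrete, here is a minimal sketch in Python using the `kubernetes` client, contrasting an unscoped, cluster-wide query (the kind that becomes expensive when issued from every node at once) with a node-scoped one. The function names and structure are illustrative assumptions, not OpenAI's actual telemetry code.

```python
from kubernetes import client, config

def collect_overly_broad():
    # Runs inside an agent pod on every node.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    # An unscoped, cluster-wide list is expensive for the API server;
    # issued from thousands of nodes at the same time, it can overwhelm it.
    pods = v1.list_pod_for_all_namespaces(watch=False)
    return len(pods.items)

def collect_node_scoped(node_name: str):
    # Same collector, but restricted to the local node via a field selector,
    # so the per-node cost on the API server stays roughly constant.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        watch=False,
        field_selector=f"spec.nodeName={node_name}",
    )
    return len(pods.items)
```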
It is worth noting that while the K8s data plane can in principle operate independently of the control plane, DNS depends on the control plane, so services could no longer find and talk to each other. Once the API servers were overloaded, service discovery broke, leading to a complete service failure. Although the problem was pinpointed within three minutes, engineers could not access the control plane to roll back the change, creating a deadlock: the crashed control plane prevented them from removing the problematic service, which in turn blocked recovery.
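A minimal sketch of why workloads on the data plane still broke: in-cluster service discovery normally goes through cluster DNS (typically CoreDNS), which is populated from the Kubernetes control plane. The service name below is hypothetical.

```python
import socket

# Hypothetical in-cluster service name; real names follow the
# <service>.<namespace>.svc.cluster.local pattern.
BACKEND = "inference-backend.prod.svc.cluster.local"

def resolve_backend() -> str:
    try:
        # Cluster DNS answers this lookup using data it watches
        # from the Kubernetes control plane.
        return socket.gethostbyname(BACKEND)
    except socket.gaierror as exc:
        # With the control plane down, lookups eventually fail, so the
        # caller cannot reach the backend even if its pods are still running.
        raise RuntimeError(f"service discovery failed for {BACKEND}") from exc
```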
OpenAI's engineers then explored several ways to recover the clusters: scaling down cluster size to reduce load on the K8s API, blocking network access to the K8s admin APIs so the servers could recover, and scaling up the K8s API servers to give them more headroom for requests. After these efforts, they regained access to the K8s control plane, removed the faulty service, and gradually restored the clusters.
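As a hedged illustration of the "remove the faulty service" step, assuming the telemetry collector ran as a per-node DaemonSet (the name and namespace below are hypothetical; the report does not specify how it was deployed), the removal could look like this with the Python `kubernetes` client:

```python
from kubernetes import client, config

# Hypothetical identifiers, used only for illustration.
TELEMETRY_NAME = "telemetry-agent"
TELEMETRY_NAMESPACE = "monitoring"

def remove_faulty_telemetry() -> None:
    # Operator access from outside the cluster, once the control plane
    # is reachable again.
    config.load_kube_config()
    apps = client.AppsV1Api()
    # Deleting the DaemonSet stops the per-node collectors that were
    # flooding the Kubernetes API servers.
    apps.delete_namespaced_daemon_set(
        name=TELEMETRY_NAME,
        namespace=TELEMETRY_NAMESPACE,
    )
```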
Meanwhile, the engineers also shifted traffic to already-recovered or newly provisioned healthy clusters to relieve the load on the others. However, because many services tried to recover at the same time, shared resources became saturated, additional manual intervention was needed, and some clusters took longer to restore. OpenAI says it will learn from the incident so that it is not "locked out" again in similar situations.
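One standard way to soften this thundering-herd effect, shown purely as an illustrative pattern rather than anything taken from the report, is to restart services with jittered exponential backoff so they do not all hit shared dependencies at the same instant:

```python
import random
import time
from typing import Callable

def restart_with_jitter(restart: Callable[[], None],
                        max_attempts: int = 5,
                        base_delay: float = 2.0) -> bool:
    """Retry a restart with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        # Sleep a random amount within an exponentially growing window so
        # thousands of services do not retry in lockstep.
        time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
        try:
            restart()
            return True
        except Exception:
            # Keep backing off; shared dependencies may still be saturated.
            continue
    return False
```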
Report details: https://status.openai.com/incidents/ctrsv3lwd797
Key points:
🔧 Cause of the outage: a newly deployed telemetry service with an overly broad configuration flooded the K8s API with expensive operations, triggering the failure.
🚪 Engineer dilemma: with the control plane down, engineers were locked out of the very tools they needed to roll back the change.
⏳ Recovery process: services were ultimately restored by scaling down clusters, blocking admin API access, scaling up the API servers, and shifting traffic to healthy clusters.