Last week (December 11), OpenAI's ChatGPT and related services such as Sora went down for 4 hours and 10 minutes, affecting a large number of users. OpenAI has now published a detailed post-incident report on the ChatGPT outage.


In short, the root cause was a small change with outsized consequences: at the critical moment, engineers found themselves locked out of their own control plane and unable to fix the problem promptly. Once the problem was identified, OpenAI's engineers launched several recovery efforts in parallel, including scaling down cluster sizes, blocking network access to the Kubernetes admin API, and scaling up the Kubernetes API servers. After several rounds of work, they regained access to part of the Kubernetes control plane, redirected traffic to healthy clusters, and eventually brought the system back to full recovery.

The incident began at 3:12 PM Pacific Standard Time, when engineers deployed a new telemetry service to collect metrics from the Kubernetes (K8S) control plane. Because the service's configuration was unintentionally broad, every node in every cluster began executing resource-intensive K8S API operations at the same time. This quickly overwhelmed the API servers, taking down the K8S control plane in most of the large clusters.
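To make the failure mode concrete, here is a minimal sketch (not OpenAI's actual telemetry code) of a per-node agent written with the official Python Kubernetes client. The unscoped variant issues a cluster-wide LIST on every scrape, so its cost is multiplied by the number of nodes; the scoped variant lets the API server filter to a single node. The agent layout and the NODE_NAME environment variable are assumptions for illustration:

```python
# Hypothetical per-node telemetry agent, running as a pod on every node.
# Illustrates why an overly broad query pattern scales badly with cluster size.
import os
from kubernetes import client, config

config.load_incluster_config()          # use the pod's service account
v1 = client.CoreV1Api()
node_name = os.environ["NODE_NAME"]     # assumed to be injected via the downward API


def collect_broad():
    # Anti-pattern: every node issues a cluster-wide LIST on each scrape interval.
    # With thousands of nodes, the API servers receive thousands of these at once.
    pods = v1.list_pod_for_all_namespaces(watch=False)
    return len(pods.items)


def collect_scoped():
    # Safer: restrict the query to this node so the API server filters server-side
    # and each agent's cost no longer grows with the size of the whole cluster.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}", watch=False
    )
    return len(pods.items)
```

At thousands of nodes per cluster, the gap between these two query shapes is roughly what the report describes: a fleet of agents each asking for everything, all at the same moment.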

Notably, while the K8S data plane can in principle keep running without the control plane, DNS-based service discovery depends on it, so services lose the ability to find one another. Once the API servers were overloaded, service discovery broke down and the outage became total. Although the problem was pinpointed within three minutes, engineers could not reach the control plane to roll the change back, producing a "deadlock": the crashed control plane was the very thing they needed in order to remove the service that had crashed it.
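The deadlock can be pictured with a short sketch, again using the Python Kubernetes client; the service and DaemonSet names are hypothetical, and OpenAI has not said what its internal tooling looks like. Both failures hinge on the same unavailable control plane:

```python
# Sketch of the two compounding failures once the control plane is down.
# Names (my-backend, telemetry-agent, monitoring) are illustrative only.
import socket
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# 1. Service discovery breaks: cluster DNS learns Service/Endpoint records from the
#    control plane, so once cached entries expire, lookups like this start failing.
try:
    socket.getaddrinfo("my-backend.prod.svc.cluster.local", 443)
except socket.gaierror as err:
    print("service discovery broken:", err)

# 2. The rollback is blocked: deleting the faulty workload needs a working API
#    server, which is exactly what the faulty workload has taken down.
try:
    apps.delete_namespaced_daemon_set(name="telemetry-agent", namespace="monitoring")
except Exception as err:   # ApiException, timeouts, connection errors
    print("cannot roll back, control plane unreachable:", err)
```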

OpenAI's engineers then explored several ways to recover the clusters. They scaled the clusters down to reduce the load on the K8S API, blocked network access to the K8S admin API so the API servers could recover, and scaled up the K8S API servers' resources so they could handle the pending requests. After a series of such efforts, the engineers finally regained control of the K8S control plane, removed the faulty service, and gradually restored the clusters.
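Assuming the faulty workload was deployed as a DaemonSet (a guess; the report only calls it a telemetry service), that final step might look like the sketch below: keep retrying against the resized API servers until one responds, then delete the offending object so the per-node agents stop being scheduled. The names and retry intervals are placeholders:

```python
# Retry the rollback until the (scaled-up, shielded) API servers can answer again.
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()


def remove_faulty_agent(retries=30, delay=30):
    for attempt in range(1, retries + 1):
        try:
            apps.delete_namespaced_daemon_set(
                name="telemetry-agent", namespace="monitoring"
            )
            print("faulty telemetry DaemonSet removed")
            return True
        except Exception as err:   # API server still overloaded or unreachable
            print(f"attempt {attempt} failed: {err}")
            time.sleep(delay)
    return False


if __name__ == "__main__":
    remove_faulty_agent()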

During this period, the engineers also redirected traffic to recovered or newly added healthy clusters to relieve the load on the others. However, because many services tried to recover at once, resources quickly became saturated, additional manual intervention was required, and some clusters took longer to restore. OpenAI says it intends to learn from this incident so that it is never "locked out" of its own systems in a similar situation again.
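The gradual traffic shift can be illustrated with a simple weighted-routing sketch (purely illustrative; OpenAI has not described its routing layer). The point is the slow ramp: adding weight to a recovering cluster in small steps keeps every restarting service from landing on it at once:

```python
# Weighted routing across clusters, with a slow ramp for recovered clusters.
# Cluster names, step size, and interval are made-up values for illustration.
import random
import time

weights = {"cluster-a": 0.0, "cluster-b": 0.0, "cluster-c": 1.0}  # c = healthy spare


def pick_cluster():
    """Route one request, proportionally to the current cluster weights."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]


def ramp_up(cluster, step=0.1, interval=60):
    """Add weight back to a recovered cluster in small increments, not all at once,
    so its caches, DNS, and autoscalers have time to warm up between steps."""
    while weights[cluster] < 1.0:
        weights[cluster] = min(1.0, weights[cluster] + step)
        time.sleep(interval)
```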

Report details: https://status.openai.com/incidents/ctrsv3lwd797

Key points:

🔧 Cause of the outage: A newly deployed telemetry service with an overly broad configuration overloaded the K8S API servers, bringing services down.

🚪 Engineer dilemma: The crash of the control plane prevented engineers from accessing it, hindering issue resolution.

⏳ Recovery process: Services were ultimately restored by scaling down clusters, blocking access to the admin API, increasing API server resources, and shifting traffic to healthy clusters.