In large-scale cloud infrastructure, even minor performance declines can waste significant resources. At a company like Meta, a 0.05% slowdown in an application may seem negligible, but across millions of servers running simultaneously it adds up to the capacity of thousands of servers. Promptly identifying and fixing these subtle performance regressions is therefore a substantial challenge for Meta.

To tackle this issue, Meta AI has introduced FBDetect, a performance regression detection system for production environments that can catch regressions as small as 0.005%. FBDetect monitors approximately 800,000 time series, covering metrics such as throughput, latency, CPU, and memory usage across hundreds of services and millions of servers. Using techniques such as stack trace sampling across entire server clusters, it can detect subtle performance differences at the subroutine level.

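FBDetect primarily focuses on subroutine-level performance analysis. A regression confined to one subroutine is a much larger fraction of that subroutine's cost than of the whole application, so the detection problem shrinks from spotting a 0.05% application-level regression to spotting a more manageable 5% subroutine-level change (for example, when the affected subroutine accounts for about 1% of the application's cost). This approach significantly reduces noise, making changes more practical to track.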
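The arithmetic behind these two figures can be checked in a few lines. This is only an illustrative sketch: the 1% cost share and the function name `app_level_regression` are assumptions chosen to connect the article's 5% and 0.05% figures, not values taken from FBDetect itself.

```python
def app_level_regression(subroutine_share: float, subroutine_regression: float) -> float:
    """Application-level slowdown caused by a slowdown in a single subroutine.

    subroutine_share: fraction of total application cost spent in the subroutine.
    subroutine_regression: relative slowdown of that subroutine.
    """
    return subroutine_share * subroutine_regression


# Assumed for illustration: the subroutine accounts for 1% of application cost.
share = 0.01
# The 5% subroutine-level change mentioned in the article.
sub_regression = 0.05

app_regression = app_level_regression(share, sub_regression)
print(f"application-level regression: {app_regression:.4%}")
```

Seen from the application, the change is a 0.05% shift buried in noise; seen from the subroutine, the same change is a 5% shift, which is far easier to separate from normal fluctuation.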

FBDetect's core technology has three main parts. First, it detects regressions at the subroutine level, which reduces variance in the performance data and lets even minute regressions be identified promptly. Second, it samples stack traces across the entire server fleet to measure the performance of each subroutine accurately, much like running a profiler at fleet scale. Finally, for each detected regression, FBDetect performs root cause analysis to determine whether the regression stems from transient issues, cost changes, or actual code modifications.
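The first two ideas can be sketched together as follows. This is a heavily simplified illustration, not FBDetect's actual algorithm: the sample data, the function names `subroutine_shares` and `flag_regressions`, and the fixed 5% threshold are all assumptions, and a production system would apply statistical tests to time series rather than compare two snapshots.

```python
from collections import Counter


def subroutine_shares(stack_samples):
    """Fraction of sampled stacks in which each subroutine appears."""
    counts = Counter()
    for stack in stack_samples:
        # Count each subroutine at most once per sample, even if it recurses.
        for frame in set(stack):
            counts[frame] += 1
    total = len(stack_samples)
    return {name: n / total for name, n in counts.items()}


def flag_regressions(before, after, min_relative_increase=0.05):
    """Return subroutines whose sample share grew by more than the threshold."""
    old = subroutine_shares(before)
    new = subroutine_shares(after)
    flagged = {}
    for name, new_share in new.items():
        old_share = old.get(name)
        if old_share is None:
            continue  # no baseline for a newly seen subroutine
        relative_change = (new_share - old_share) / old_share
        if relative_change > min_relative_increase:
            flagged[name] = relative_change
    return flagged


# Hypothetical samples: each stack lists frames from the entry point inward.
before = [["main", "handle", "parse"]] * 100 + [["main", "handle", "render"]] * 900
after = [["main", "handle", "parse"]] * 110 + [["main", "handle", "render"]] * 900

flagged = flag_regressions(before, after)
print(flagged)
```

In this made-up data only `parse` is flagged: its share of samples rises by roughly 9%, while the shares of `main` and `handle`, which appear in every sample, do not move at all. That is the sense in which subroutine-level measurement makes a small regression stand out.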

After seven years of use in production, FBDetect has proven robust to noise, effectively filtering out false regression signals. The system not only significantly reduces the number of incidents developers need to investigate but also improves the efficiency of Meta's infrastructure. By detecting minor regressions, FBDetect helps Meta avoid wasting approximately 4,000 servers annually.

In large enterprises like Meta, with millions of servers, detecting performance regressions is especially important. With its advanced monitoring capabilities, FBDetect not only improves the detection rate for minor regressions but also gives developers effective root cause analysis tools, helping them resolve potential issues promptly and keeping the entire infrastructure running efficiently.

Paper link: https://tangchq74.github.io/FBDetect-SOSP24.pdf

Key Points:

🔍 FBDetect can monitor subtle performance regressions, even as low as 0.005%, greatly enhancing detection precision.

💻 The system covers approximately 800,000 time series, involving multiple performance metrics, and can perform precise analysis in large-scale environments.

🚀 FBDetect, after seven years of practical application, helps Meta avoid the waste of approximately 4,000 servers annually, improving the overall efficiency of the infrastructure.