ByteDance, in collaboration with a research team at Peking University, has published a paper on arXiv introducing MegaScale, its production system for training large language models. MegaScale runs on a single cluster of more than 10,000 GPUs and achieves a model FLOPs utilization (MFU) of 55.2% when training a 175B-parameter LLM on 12,288 GPUs. The system also includes a set of diagnostic tools that monitor system components and events deep in the stack, identify root causes, provide fault tolerance, and mitigate straggler (lagging-machine) issues.
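For readers unfamiliar with the metric, MFU is the ratio of the model FLOPs a training run actually performs per second to the aggregate theoretical peak of the hardware. The sketch below illustrates the calculation using the widely used approximation of roughly 6N FLOPs per token for a dense transformer with N parameters (forward plus backward pass) and an assumed A100-class peak of 312 TFLOP/s in bf16; the function, defaults, and example numbers are illustrative, not taken from the paper.

```python
def mfu(n_params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float = 312e12) -> float:
    """Approximate model FLOPs utilization (MFU).

    Uses the common ~6 * N FLOPs-per-token estimate for a dense
    transformer (forward + backward), ignoring attention FLOPs.
    peak_flops_per_gpu defaults to A100 bf16 peak throughput,
    an assumption made here for illustration.
    """
    achieved = 6 * n_params * tokens_per_sec    # model FLOPs computed per second
    available = num_gpus * peak_flops_per_gpu   # aggregate hardware peak
    return achieved / available

# Illustrative back-of-envelope check: what token throughput would
# correspond to 55.2% MFU for a 175B-parameter model on 12,288 GPUs
# under the assumptions above?
if __name__ == "__main__":
    n_params, num_gpus, target = 175e9, 12_288, 0.552
    tokens_per_sec = target * num_gpus * 312e12 / (6 * n_params)
    print(f"{tokens_per_sec:,.0f} tokens/s -> "
          f"MFU {mfu(n_params, tokens_per_sec, num_gpus):.1%}")
```

Under these assumptions, a 55.2% MFU corresponds to roughly two million tokens processed per second across the cluster; the exact figure shifts with the FLOPs-per-token estimate and the hardware peak assumed.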