With the surge in popularity of ChatGPT, numerous leaderboards for evaluating large models have emerged both in China and abroad. However, models of comparable parameter scale often rank very differently from one leaderboard to another. Industry and academia attribute this primarily to the use of different evaluation sets, and also to the growing proportion of subjective questions, which casts doubt on the fairness of the results. Third-party evaluation platforms such as OpenCompass and FlagEval have consequently begun to attract attention. Still, a widely held view in the industry is that truly comprehensive and effective evaluation must also cover dimensions such as model robustness and security, an area that remains under exploration.
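To make the robustness dimension concrete, below is a minimal sketch of one possible robustness probe: perturb a prompt with small typos and check whether the model's answer stays stable. The `perturb` function, the `dummy_model` stand-in, and the perturbation rate are illustrative assumptions for this sketch, not the protocol of OpenCompass, FlagEval, or any actual benchmark.

```python
import random
import string


def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Inject random character-level typos into a prompt (assumed perturbation scheme)."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


def robustness_score(model, prompt: str, n_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the clean-prompt answer."""
    baseline = model(prompt)
    matches = sum(
        model(perturb(prompt, seed=s)) == baseline for s in range(n_variants)
    )
    return matches / n_variants


if __name__ == "__main__":
    # Hypothetical stand-in "model": answers a fixed multiple-choice question
    # based on a keyword, so typos in the keyword flip its answer.
    def dummy_model(p: str) -> str:
        return "B" if "capital" in p else "A"

    print(robustness_score(dummy_model, "Which city is the capital of France?"))
```

A real harness would of course call an actual model API and use a broader family of perturbations (paraphrases, distractor insertions, adversarial suffixes); the point here is only that robustness can be scored as answer consistency under controlled input noise.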