Poe, in collaboration with SurgeAI, conducted a systematic evaluation of mainstream large language models across four dimensions: reasoning, writing, creativity, and non-English language capability. According to the results, GPT-4 performs best across these dimensions overall, particularly on English tasks, while Google's PaLM stands out in non-English language capability. Claude 2 ranks second in reasoning, and Llama 2 70B ranks third in writing and creativity. The evaluation combined industry benchmarks, expert assessments, and Elo ratings to highlight each model's strengths.