OpenAI's new models have achieved outstanding results in the latest evaluation, claiming the top spot in the chatbot rankings. However, the low number of ratings could skew the results.


According to the published overview, the new models performed exceptionally well across all evaluation categories, including overall performance, safety, and technical capabilities. One model designed specifically for STEM tasks briefly ranked second, alongside the GPT-4o version released in early September, and has taken the lead in the technical categories.

Chatbot Arena, a platform for head-to-head comparison of language models, evaluated the new models using over 6,000 community votes. The results indicate that the new models excel at mathematical tasks, complex prompts, and programming.
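To give a rough sense of how pairwise community votes turn into a leaderboard score, here is a minimal sketch of an online Elo update. This is an illustration only: LMSYS's actual methodology (a Bradley-Terry fit with bootstrapped confidence intervals) differs in detail, and the model names and vote counts below are hypothetical.

```python
# Minimal sketch: online Elo updates from head-to-head votes.
# Not LMSYS's exact method; the names and votes are made up for illustration.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 4.0) -> tuple[float, float]:
    """Update both ratings after a single head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two hypothetical models starting from the same baseline rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", True)] * 60 + [("model_x", "model_y", False)] * 40

for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)

print(ratings)  # model_x ends slightly above model_y after a 60/40 vote split
```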


However, the new models have received significantly fewer ratings than established models such as GPT-4o or Anthropic's Claude 3.5, with each model collecting fewer than 3,000 votes. Such a small sample could distort the evaluation and limits the significance of the results.
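Why the sample size matters can be shown with a back-of-the-envelope calculation, assuming a hypothetical win rate rather than real arena data: with only a few hundred votes, the uncertainty on a model's win rate, and therefore on the implied rating gap, spans tens of Elo-style points.

```python
# Hedged illustration (hypothetical win rate, not real arena data): how the
# margin of error on a win rate, and the rating gap it implies, shrinks as
# the number of votes grows.
import math

def winrate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% margin of error for a win rate p over n votes."""
    return z * math.sqrt(p * (1 - p) / n)

def rating_gap(p: float) -> float:
    """Elo-style rating gap implied by a win rate p against a baseline model."""
    return 400 * math.log10(p / (1 - p))

p = 0.55  # hypothetical observed win rate of a new model over a baseline
for n in (300, 3_000, 30_000):
    lo, hi = p - winrate_margin(p, n), p + winrate_margin(p, n)
    print(f"n={n:>6}: win rate {lo:.3f}-{hi:.3f}, "
          f"implied gap {rating_gap(lo):+.0f} to {rating_gap(hi):+.0f} points")
```

With 300 votes the implied gap ranges from roughly -4 to +75 points; with 30,000 votes it narrows to a band of a few points, which is why low vote counts limit how much the current ranking can be trusted.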

OpenAI's new models excel at mathematics and coding, the main objectives of their design. By "thinking" longer before responding, they aim to set a new standard for AI reasoning. However, they do not outperform other models in every area: many tasks do not require complex logical reasoning, and a quicker response from another model is often sufficient.

The LMSYS chart for mathematical ability clearly shows the new models scoring above 1360, far ahead of all other models.