According to Artificial Analysis, a third-party AI benchmarking firm, evaluating OpenAI's o1 reasoning model across seven popular benchmarks cost $2,767.05, while evaluating its non-reasoning model, GPT-4o, cost only $108.85. This stark difference has sparked discussion about the sustainability and transparency of AI evaluation.
Reasoning models, AI systems that work through problems with step-by-step "thinking," excel at certain tasks but are significantly more expensive to benchmark than traditional models. Artificial Analysis estimated that evaluating roughly a dozen reasoning models cost around $5,200 in total, nearly double the $2,400 it spent analyzing more than 80 non-reasoning models.
The cost disparity stems primarily from the sheer volume of tokens reasoning models generate. For instance, o1 produced over 44 million tokens during testing, roughly eight times as many as GPT-4o. As benchmarks grow more complex in order to assess real-world task capabilities, and as per-token prices for top models climb (OpenAI's o1-pro, for example, charges $600 per million output tokens), independently verifying these models' performance becomes prohibitively expensive.
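To make the arithmetic concrete, the sketch below multiplies token volume by per-token price. Only the ~44 million token figure and o1-pro's $600 per million output tokens come from the report; the $60 per million rate assumed for o1 is an estimate of its list price, not a reported number.

```python
# Back-of-the-envelope sketch: how output-token volume drives evaluation cost.
# The token count and the o1-pro rate are from the article; the o1 rate is assumed.

def output_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of generating `output_tokens` at a given output-token price."""
    return output_tokens / 1_000_000 * usd_per_million_tokens

O1_TOKENS = 44_000_000  # tokens o1 reportedly generated across the benchmarks

# At an assumed $60 per million output tokens, the output alone lands near the
# reported $2,767.05 bill (the remainder would come from input tokens).
print(f"${output_cost_usd(O1_TOKENS, 60):,.2f}")    # $2,640.00

# The same output volume billed at o1-pro's $600 per million output tokens
# would be an order of magnitude more expensive.
print(f"${output_cost_usd(O1_TOKENS, 600):,.2f}")   # $26,400.00
```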
While some AI labs give benchmarking organizations free or subsidized access to their models for testing, experts worry this could compromise the objectivity of the evaluations. Ross Taylor, CEO of General Reasoning, put the question bluntly: "From a scientific perspective, if you publish a result that nobody can replicate using the same model, is it really science?"