Unexpectedly, AI's prowess extends beyond the chessboard and into the treacherous world of social deduction games like "Werewolf"! A recent benchmark, codenamed "Elimination Game," put AI social intelligence to the test, and the results were striking: GPT-4.5 emerged as the champion, clearly outperforming other AI heavyweights like Claude 3.7 Sonnet and DeepSeek R1. This raises the question: has AI's social intelligence really advanced this far?

The rules of the "Elimination Game" are simple but ruthless: up to eight players (AI models or humans) compete, voting to eliminate one player each round until only two remain. The eliminated players then form a jury that decides the ultimate winner. It is a true AI power struggle, filled with betrayal, deception, and strategy!


Players engage in lively debates in a public chat room, presenting arguments, building alliances, and misleading opponents, while private chats allow for secret pacts and hidden agendas. The three rounds of private messaging in each game round are dense with information and strategic maneuvering: players must carefully balance trust and deception, since a single misstep can lead to elimination!

In the final showdown, the two remaining players deliver closing statements to sway the jury of eliminated players, whose vote determines the winner.
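The game loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical simulation of the format only: the votes here are random stand-ins, whereas in the real benchmark each vote comes from a model's debate and private-messaging strategy.

```python
import random

def run_elimination_game(players, final_size=2, seed=0):
    """Simulate the Elimination Game's structure: each round, every active
    player votes to eliminate another; the top vote-getter is removed and
    joins the jury. When only `final_size` players remain, the jury votes
    to pick the winner. Votes are random placeholders, not model decisions."""
    rng = random.Random(seed)
    active = list(players)
    jury = []
    while len(active) > final_size:
        # Each active player casts one vote against another active player.
        tally = {p: 0 for p in active}
        for voter in active:
            tally[rng.choice([p for p in active if p != voter])] += 1
        # The player with the most votes is eliminated; ties break randomly.
        most = max(tally.values())
        eliminated = rng.choice([p for p, v in tally.items() if v == most])
        active.remove(eliminated)
        jury.append(eliminated)
    # Each jury member votes for one finalist; most jury votes wins.
    jury_votes = [rng.choice(active) for _ in jury]
    winner = max(active, key=jury_votes.count)
    return winner, active, jury

winner, finalists, jury = run_elimination_game(
    ["GPT-4.5", "Claude 3.7 Sonnet", "DeepSeek R1", "Model D",
     "Model E", "Model F", "Model G", "Model H"])
print(f"Finalists: {finalists}, jury size: {len(jury)}, winner: {winner}")
```

With eight players and two finalists, six rounds of voting build a six-member jury before the closing statements and jury vote decide the outcome.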


The results of this AI "Werewolf" battle were eye-opening:

GPT-4.5: Social Deduction Master + Top-Tier Strategist = Unstoppable Champion! GPT-4.5 demonstrated exceptional strategic thinking and social deduction skills. With a remarkably low betrayal rate, it focused on alliances and cooperation, yet showed formidable persuasive power in the final round, repeatedly convincing juries to vote in its favor. GPT-4.5 achieved a stunning 62.6% win rate, far surpassing its competitors.

Claude 3.7 Sonnet: A Flexible and Balanced Player, but Slightly Outmatched. Claude 3.7 Sonnet showed somewhat less strategic flexibility than GPT-4.5, but its social deduction and deception skills were still strong. With a moderate betrayal rate, it skillfully balanced cooperation against betrayal, achieving a strong 59.3% win rate.

DeepSeek R1: An Aggressive Player with a High Betrayal Rate, but Lacking Endgame Strength. DeepSeek R1 adopted a highly aggressive, confrontational strategy with a high betrayal rate. However, its weaker social strategy and communication skills made it hard for it to sway the jury, and it finished with a 53.8% win rate.

The "Elimination Game" benchmark test provides valuable insights into AI's social intelligence. GPT-4.5's victory highlights the rapid advancement of AI capabilities. As AI's social intelligence continues to evolve, we may see AI deeply integrated into human society, potentially surpassing human capabilities in certain areas. This AI "Werewolf" competition is just the beginning; the boundaries of AI intelligence continue to expand, promising future surprises and breakthroughs.