Large language models (LLMs) such as GPT-4, the model behind platforms like ChatGPT, have demonstrated remarkable capabilities in understanding written prompts and generating appropriate responses in multiple languages. These capabilities have led many to wonder: are the texts and answers generated by these models so realistic that they can be mistaken for human-written content?
Figure: Pass rates for each witness type (left) and interrogator confidence (right).
Researchers at the University of California, San Diego recently conducted a study based on the Turing Test to evaluate the extent to which machines can exhibit human-like intelligence. Their findings revealed that people struggle to distinguish GPT-4 from human agents during paired conversations.
The research paper, posted as a preprint on the arXiv server, reports that GPT-4 was mistaken for a human in about 50% of interactions. Because the initial experiment did not fully control for some variables that could affect the results, the team conducted a second experiment to obtain more robust outcomes.
Figure: Example conversations from the study; one of these four conversations was with a human witness, while the others were with AI.
In the study, participants found it difficult to determine whether GPT-4 was human. While they could usually identify the GPT-3.5 and ELIZA models as machines, their ability to discern whether GPT-4 was human or machine was no better than random guessing.
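As a rough illustration of what "no better than random guessing" means here, the sketch below runs a two-sided binomial test of an observed pass rate against the 50% chance baseline. The counts are hypothetical and the use of SciPy's binomtest is an assumption for illustration, not part of the study's actual analysis.

```python
from scipy.stats import binomtest

# Hypothetical counts for illustration only; these are NOT figures from the paper.
# Suppose interrogators judged the AI witness to be "human" in 250 of 500 conversations.
n_conversations = 500
judged_human = 250

# Two-sided binomial test of the observed pass rate against the 50% chance level.
result = binomtest(judged_human, n=n_conversations, p=0.5, alternative="two-sided")

print(f"Observed pass rate: {judged_human / n_conversations:.0%}")
print(f"p-value vs. chance: {result.pvalue:.3f}")
# A large p-value means the observed pass rate is statistically
# indistinguishable from random guessing at the 50% level.
```

With numbers like these, the test cannot reject the chance-level hypothesis, which is the sense in which interrogators' judgments about GPT-4 were no better than guessing.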
The research team designed an online two-player game called "Human or Not," where participants interacted with either another person or an AI model. In each game, a human interrogator conversed with a "witness" to determine if the other party was human.
Although actual humans were more successful, convincing interrogators they were human about two-thirds of the time, the results suggest that in real-world scenarios, people may not be reliably able to determine whether they are interacting with a human or an AI system.