Researchers from Princeton University and Yale University have released a study of the reasoning capabilities of large language models (LLMs) that use "Chain of Thought" (CoT) prompting, shedding light on how CoT reasoning actually works: it is not pure symbolic reasoning over logical rules, but a blend of memorization, probability, and noisy reasoning.

The researchers analyzed the performance of three LLMs (GPT-4, Claude 3, and Llama 3.1) on the task of decoding shift ciphers. A shift cipher is a simple encoding in which each letter is replaced by the letter a fixed number of positions later in the alphabet. For example, shifting the alphabet forward by 3 positions turns "CAT" into "FDW".
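For readers who want to see the mechanics concretely, here is a minimal Python sketch of a shift cipher (assuming uppercase A-Z only; this code is illustrative and not taken from the paper):

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def shift_encode(text: str, shift: int) -> str:
    """Replace each letter with the one `shift` positions later, wrapping around Z."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) + shift) % 26] if c in ALPHABET else c
        for c in text
    )

def shift_decode(text: str, shift: int) -> str:
    """Undo the encoding by shifting backwards."""
    return shift_encode(text, -shift)

print(shift_encode("CAT", 3))   # -> "FDW"
print(shift_decode("FDW", 3))   # -> "CAT"
```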


The study identifies three key factors that shape CoT reasoning:

Probability: LLMs tend to produce higher-probability outputs, even when the reasoning steps point to a less probable answer. For example, if the steps point to "STAZ" but "STAY" is a far more common word, the LLM may "self-correct" and output "STAY".
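As a toy illustration of this probability effect (the scoring rule and the word frequencies below are assumptions for illustration, not the paper's model), a decoder that blends step-by-step evidence with a prior over common words can flip a correct but rare decode into a frequent neighbour:

```python
# Made-up relative word frequencies; "STAZ" is essentially never seen in text.
TOY_WORD_PRIOR = {"STAY": 0.9, "STAZ": 0.0001}

def pick_output(decoded: str, reasoning_confidence: float) -> str:
    """Score each candidate by (evidence from the reasoning chain) x (word prior)."""
    candidates = {decoded, "STAY"}  # the decode itself plus a common neighbour
    scores = {
        w: (reasoning_confidence if w == decoded else 1 - reasoning_confidence)
        * TOY_WORD_PRIOR.get(w, 1e-6)
        for w in candidates
    }
    return max(scores, key=scores.get)

print(pick_output("STAZ", reasoning_confidence=0.7))  # -> "STAY", despite correct steps
```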

Memorization: During pre-training, LLMs memorize vast amounts of text, and this memorization affects the accuracy of their CoT reasoning. For instance, rot-13 (a shift of 13) is by far the most common shift cipher in internet text, and LLM accuracy on rot-13 is markedly higher than on other shift values.
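rot-13 is common enough online that Python ships a codec for it, which is a convenient way to play with the cipher the models have most likely memorized (illustrative only; this code is not from the paper):

```python
import codecs

# rot-13 shifts every letter by 13, so applying it twice returns the original text.
print(codecs.encode("STAY", "rot13"))   # -> "FGNL"
print(codecs.encode("FGNL", "rot13"))   # -> "STAY"
```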

Noisy reasoning: The reasoning process of LLMs is approximate; each intermediate step carries some chance of error. As the shift amount grows, decoding requires more intermediate steps, the errors compound, and accuracy drops.
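A simple way to see why this matters is a toy error model (an assumption for illustration, not the paper's fitted model): if each intermediate step succeeds independently with probability p, the chance that the whole chain comes out right shrinks geometrically with the number of steps.

```python
def expected_accuracy(per_step_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent noisy steps is correct."""
    return per_step_success ** steps

for shift in (1, 3, 13, 25):
    print(shift, round(expected_accuracy(0.95, shift), 3))
# 1 0.95, 3 0.857, 13 0.513, 25 0.277
# In practice, shift 13 breaks this downward trend because of the memorization effect above.
```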

The researchers also found that CoT reasoning relies on self-conditioning: LLMs need to explicitly generate text that serves as context for subsequent reasoning steps. When instructed to "think silently" without writing anything out, their reasoning ability declines sharply. In contrast, the correctness of the demonstration steps matters little; even when the demonstrations contain errors, CoT performance remains largely stable.
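The self-conditioning finding can be pictured as two prompt styles (the wording below is hypothetical and not the paper's exact prompts):

```python
cot_prompt = (
    "Decode 'FDW' (shift cipher, shift 3). "
    "Write out the decoding of each letter step by step, then give the answer."
)
silent_prompt = (
    "Decode 'FDW' (shift cipher, shift 3). "
    "Think through the steps silently and output only the final answer."
)
# The study reports that the second style, which leaves no generated intermediate
# text for the model to condition on, yields markedly lower accuracy.
```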

This study shows that CoT reasoning in LLMs is not perfect symbolic reasoning but a mixture of memorization, probability, and noisy reasoning; during CoT, LLMs behave partly like memorizers and partly like probabilistic reasoners. These findings help us better understand the reasoning capabilities of LLMs and offer useful guidance for building more capable AI systems.

Paper link: https://arxiv.org/pdf/2407.01687