Meta recently published a noteworthy piece of research: a new memory layer technique that significantly improves the factual accuracy of large language models (LLMs) while scaling to an unprecedented number of parameters. The work not only challenges conventional approaches to scaling neural networks but also points to new directions for future AI architecture design.

The core of the research is a trainable key-value lookup mechanism that adds extra parameters to the model without increasing its computational cost (FLOPs). The idea is to supplement the compute-heavy feedforward layers with sparsely activated memory layers that provide dedicated capacity for storing and retrieving information.
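To make the mechanism concrete, here is a minimal PyTorch sketch of a sparsely activated key-value memory layer. The class name, sizes, and scoring details are illustrative assumptions for exposition, not Meta's implementation.

```python
# Minimal sketch of a sparsely activated key-value memory layer
# (illustrative assumptions only, not Meta's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, dim, num_keys=1024, topk=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)  # trainable keys
        self.values = nn.Embedding(num_keys, dim)                    # trainable values
        self.topk = topk

    def forward(self, x):                        # x: (batch, seq, dim)
        scores = x @ self.keys.t()               # similarity to every key
        w, idx = scores.topk(self.topk, dim=-1)  # keep only the top-k keys per token
        w = F.softmax(w, dim=-1)                 # normalize the selected scores
        v = self.values(idx)                     # (batch, seq, topk, dim)
        return (w.unsqueeze(-1) * v).sum(dim=-2) # weighted sum of selected values
```

Because only the top-k rows of the value table are read per token, the parameter count grows with num_keys while per-token FLOPs stay roughly flat, which is exactly the trade-off the memory layer exploits.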


Compared to traditional dense networks, memory layers are more efficient in handling information storage. For example, language models need to learn simple associative information such as birthdays of people and capitals of countries; memory layers can achieve this through a simple key-value lookup mechanism, which is more efficient than using feedforward networks.

The main contribution of this work is scaling the memory layer to an unprecedented size of 128 billion parameters. Experimental results show that language models equipped with the improved memory layer outperform dense models given twice the compute budget, and also surpass mixture-of-experts models when matched for both compute and parameter count. The gains are especially pronounced on factual tasks.


Meta's researchers achieved this by replacing one or more feedforward networks (FFNs) in the Transformer with memory layers. The replacement shows consistent advantages across base model sizes (from 134 million to 8 billion parameters) and memory capacities (up to 128 billion parameters). In the experiments, memory layers improve the factual accuracy of language models by more than 100%, with significant gains in coding and general knowledge as well. In many cases, models equipped with memory layers even match the performance of dense models that require four times the compute.
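As a rough illustration of that replacement, the block below drops the SimpleMemoryLayer sketch from above into the slot normally occupied by a Transformer block's feedforward sublayer; the normalization and placement choices here are assumptions, not the paper's exact configuration.

```python
# Sketch: a Transformer block whose feedforward sublayer is replaced by a
# memory layer (reuses SimpleMemoryLayer from the earlier sketch).
import torch.nn as nn

class MemoryTransformerBlock(nn.Module):
    def __init__(self, dim, num_heads=8, num_keys=1024, topk=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.memory = SimpleMemoryLayer(dim, num_keys, topk)  # replaces the FFN

    def forward(self, x):                      # x: (batch, seq, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                       # residual around attention
        x = x + self.memory(self.norm2(x))     # residual around the memory layer
        return x
```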

Researchers also made several enhancements to the memory layer to overcome challenges in scaling:

Product-key lookup: to remove the bottleneck of scoring every key in a very large memory, the study adopts trainable product-quantized keys, so a query never needs to be compared against every individual key (see the sketch after this list).

Parallelized memory layers: to run memory layers in a multi-GPU environment, the researchers distribute the embedding lookup and aggregation operations across multiple GPUs (a rough sharding sketch also follows the list).

Shared memory mechanism: to make the most of the added parameters, all memory layers draw from a single shared pool of memory parameters.

Optimized performance and stability: the EmbeddingBag operation is optimized with custom CUDA kernels, significantly improving memory bandwidth utilization, and an input-dependent gating mechanism with a SiLU non-linearity is introduced to improve training performance and stability (illustrated in the sketches below).
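The first and last enhancements can be sketched together. The fragment below shows a product-key lookup, where two small sub-key tables form the full key space as their Cartesian product, combined with an input-dependent SiLU gate; the exact scoring, gate placement, and value sharing are assumptions based on the paper's description rather than its released code.

```python
# Sketch of a product-key lookup with input-dependent SiLU gating
# (details are assumptions, not Meta's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim, n_sub_keys=512, topk=8, shared_values=None):
        super().__init__()
        half = dim // 2
        # Two small sub-key tables; the full key space is their Cartesian
        # product (n_sub_keys ** 2 virtual keys), so a query is never scored
        # against every individual key.
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, half) * 0.02)
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, half) * 0.02)
        # Values can come from a pool shared across all memory layers.
        self.values = (shared_values if shared_values is not None
                       else nn.Embedding(n_sub_keys ** 2, dim))
        self.gate_proj = nn.Linear(dim, dim)  # input-dependent gate
        self.n_sub_keys = n_sub_keys
        self.topk = topk

    def forward(self, x):                                  # x: (tokens, dim)
        q1, q2 = x.chunk(2, dim=-1)
        s1, i1 = (q1 @ self.sub_keys1.t()).topk(self.topk, dim=-1)
        s2, i2 = (q2 @ self.sub_keys2.t()).topk(self.topk, dim=-1)
        # Combine the two candidate sets: only k*k scores, not n_sub_keys**2.
        scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(-2)
        idx = (i1.unsqueeze(-1) * self.n_sub_keys + i2.unsqueeze(-2)).flatten(-2)
        w, best = scores.topk(self.topk, dim=-1)
        w = F.softmax(w, dim=-1)
        v = self.values(idx.gather(-1, best))              # (tokens, k, dim)
        y = (w.unsqueeze(-1) * v).sum(dim=-2)              # aggregated memory output
        return y * F.silu(self.gate_proj(x))               # SiLU gating on the input
```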
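For the parallelization point, one simple way to shard a memory layer's value table across GPUs is shown below: each rank stores a slice of the rows, serves only the indices it owns, and an all-reduce sums the partial results. This only illustrates the general idea and is not Meta's actual scheme; it assumes torch.distributed has already been initialized (for example via torchrun).

```python
# Illustrative row-wise sharding of the memory value table across GPUs
# (not Meta's parallelization scheme).
import torch
import torch.distributed as dist

def sharded_memory_lookup(local_values, indices, weights):
    """local_values: this rank's slice of the value table, holding rows
    [rank * shard, (rank + 1) * shard). indices, weights: top-k ids and
    softmax scores per token, both shaped (tokens, k)."""
    rank, shard = dist.get_rank(), local_values.shape[0]
    lo = rank * shard
    mask = (indices >= lo) & (indices < lo + shard)  # rows owned by this rank
    local_idx = (indices - lo).clamp(0, shard - 1)   # safe index into the slice
    picked = local_values[local_idx]                 # (tokens, k, dim)
    partial = (picked * (weights * mask).unsqueeze(-1)).sum(dim=1)
    dist.all_reduce(partial)                         # sum partial results from all ranks
    return partial                                   # (tokens, dim)
```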


Experimental results also revealed the following key findings:

The size of the memory layer significantly affects performance: As the size of the memory layer increases, the performance of factual question answering continuously improves.

Multiple memory layers outperform a single memory layer: Using multiple memory layers with shared parameters can enhance performance, but too many memory layers can degrade it. The optimal number of memory layers is three.

Memory layers learn facts more quickly: In the early stages of training, models equipped with memory layers show faster performance improvements, indicating that memory layers help the model learn facts more quickly.

Memory layers complement dense layers: Experiments show that both sparse memory layers and dense feedforward layers are essential.

To validate the effectiveness of the memory layer technology, researchers evaluated it on multiple benchmarks, including:

Fact-based question answering (NaturalQuestions, TriviaQA)

Multi-hop question answering (HotpotQA)

Scientific and commonsense knowledge (MMLU, HellaSwag, OBQA, PIQA)

Code writing (HumanEval, MBPP)

Results indicate that models equipped with memory layers outperform baseline models in all these tests, with the most notable performance improvements in factual question answering.

Meta's research not only offers new insights into scaling AI models but also opens new avenues for improving factuality and overall model performance. The researchers believe memory layer technology is highly scalable and expect it to be widely applied across AI systems. They also note that memory layers still pose challenges for hardware acceleration, but they are confident that continued research and optimization can bring their speed to parity with, or beyond, traditional feedforward networks.

Furthermore, Meta's research team hopes to push memory layer performance further with new learning methods, reducing forgetting and hallucination and enabling continual learning.

This research release undoubtedly injects new vitality into the AI field and fills us with anticipation for the future development of AI.

Paper: https://arxiv.org/pdf/2412.09764