In the era of AI, training a large language model (LLM) is like a martial arts master spending years in seclusion honing their skills: it demands enormous computational power and data. Releasing a model as open source is like the master publicly sharing their manual, but under a license (such as Apache 2.0 or the Llama 2 Community License) that protects the owner's intellectual property (IP).
However, the world is fraught with dangers, and incidents involving "shell models" keep appearing. Some developers claim to have trained new LLMs when, in reality, they have merely repackaged or fine-tuned existing foundation models (such as Llama 2 and MiniCPM-V). This is like secretly learning another master's techniques and then claiming them as one's own. To prevent this, model owners and third parties urgently need a way to identify "shell" models.
Current model fingerprinting methods mainly fall into two categories:
Injected Fingerprints: This is like secretly marking a manual, for example through watermarking. Such methods artificially insert "triggers" during training or fine-tuning so that the model generates specific content under certain conditions, thereby revealing its origin. However, this increases training costs, can degrade model performance, and the triggers can be removed; moreover, it cannot be applied to models that have already been released.
Intrinsic Fingerprints: This is like determining the source from the content and style of the manual itself. These methods use the model's inherent properties, including its weights and feature representations, for identification. Weight-based fingerprinting identifies models by computing the similarity of their weights, but it is easily defeated by weight changes such as permutation, pruning, and fine-tuning. Semantic analysis methods instead identify models through statistical analysis of the text they generate. Both approaches suffer from insufficient robustness.
So, is there a method that can effectively identify "shell" models without impacting model performance, while resisting all these "fancy" modifications?
Researchers from the Shanghai Artificial Intelligence Laboratory and other institutions have proposed a new model fingerprinting method—REEF.
The working principle of REEF is as follows:
REEF is a fingerprinting method based on feature representations. It does not depend on the representations of any single layer; instead, it exploits the strong representation-modeling capability of LLMs and can extract identifying features from various layers.
It compares the centered kernel alignment (CKA) similarity of the two models' feature representations on the same samples. CKA is a similarity metric based on the Hilbert-Schmidt independence criterion (HSIC), which measures the statistical dependence between two sets of random variables.
If the similarity is high, the suspect model is likely derived from the victim model; if it is low, such derivation is unlikely.
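To make this concrete, here is a minimal sketch of the linear form of CKA applied to toy feature matrices. The `linear_cka` helper and the stand-in data are illustrative assumptions, not REEF's actual implementation; in practice the two matrices would be hidden states extracted from a chosen layer of the victim and suspect models on the same input samples, and the paper may use a kernelized variant of CKA.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Columns are mean-centered first; the score lies in [0, 1], and values
    close to 1 indicate highly similar representation geometry.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear-kernel CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Toy stand-ins (small sizes) for hidden states of one layer,
# computed on the SAME samples for both models.
rng = np.random.default_rng(0)
victim = rng.standard_normal((1024, 128))                     # n_samples x hidden_dim
suspect = victim + 0.1 * rng.standard_normal((1024, 128))     # lightly perturbed copy
unrelated = rng.standard_normal((1024, 128))                  # an independent model

print("derived model CKA:  ", linear_cka(victim, suspect))    # close to 1
print("unrelated model CKA:", linear_cka(victim, unrelated))  # noticeably lower
```

In a real comparison, the decision would be based on how the suspect's CKA score compares with the scores of known independent models, not on a fixed universal threshold.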
What are the advantages of REEF?
No training required: REEF does not modify the model, so it neither affects model performance nor incurs additional training costs.
Strong robustness: REEF is robust to various downstream modifications such as model pruning, fine-tuning, merging, permutation, and scaling transformations. Even if the suspect model has undergone extensive fine-tuning (on up to 700B tokens), REEF can still reliably determine whether it originated from the victim model.
Theoretical guarantees: The researchers prove theoretically that CKA is invariant to weight permutations and scaling transformations (the short numerical check below illustrates this).
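As a quick sanity check of these invariances at the representation level, the snippet below (reusing the same hypothetical `linear_cka` helper) shows that permuting the feature dimensions or rescaling the features leaves the linear CKA score unchanged; this only mirrors, and does not reproduce, the paper's formal proof.

```python
import numpy as np

def linear_cka(X, Y):
    # Same illustrative helper as in the sketch above.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 128))   # features of the original model

perm = rng.permutation(128)            # permute hidden dimensions
print(linear_cka(X, X[:, perm]))       # -> 1.0 (permutation-invariant)

print(linear_cka(X, 3.7 * X))          # -> 1.0 (scale-invariant)
```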
Experimental results show that REEF performs exceptionally well in identifying "shell" models, outperforming existing weight-based and semantic analysis methods.
The emergence of REEF provides a new tool for protecting the intellectual property of LLMs and helps combat unethical or illegal activities such as unauthorized use or replication of models.
Paper link: https://arxiv.org/pdf/2410.14273