Have you ever considered that your research paper might have been used to train AI? Many academic publishers are bundling their catalogues and selling access to tech companies that develop AI models. The practice has stirred considerable controversy in the academic community, especially because authors are often unaware of it. Experts suggest that if your work has not already been used to train a large language model (LLM), it probably will be soon.

Recently, the British academic publisher Taylor & Francis struck a $10 million deal with Microsoft, allowing the tech giant to use its research content to enhance its AI systems. Earlier, in June, the American publisher Wiley disclosed that it had earned $23 million from an unnamed company in return for allowing its content to be used to train generative AI models.

If a paper is available online, whether open access or behind a paywall, it has likely already been fed into a large language model. Lucy Lu Wang, an AI researcher at the University of Washington, notes: "Once a paper has been used to train a model, it cannot be removed after training."


Large language models require vast amounts of training data, typically scraped from the internet. By analyzing billions of language fragments, these models learn to generate fluent text. Academic papers, being long and information-dense, are especially valuable to LLM developers: such data helps models make better inferences about scientific topics.
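To make the idea of "language fragments" concrete, here is a minimal sketch, not any publisher's or AI company's actual pipeline, of how a scraped document might be split into fixed-length training examples. The whitespace tokenizer, window size, and placeholder text are illustrative stand-ins for the subword tokenizers and much longer contexts real LLMs use.

```python
# Minimal sketch: turning a scraped document into fixed-length training windows.
# Real pipelines use subword tokenizers (e.g. BPE) and far larger windows;
# the whitespace split and sizes here are illustrative only.

def make_training_windows(document: str, window_size: int = 64, stride: int = 32):
    """Split a document into overlapping token windows, as LLM pretraining data is prepared."""
    tokens = document.split()  # stand-in for a real subword tokenizer
    windows = []
    for start in range(0, max(len(tokens) - window_size, 0) + 1, stride):
        window = tokens[start:start + window_size]
        if window:
            windows.append(" ".join(window))
    return windows

if __name__ == "__main__":
    # Placeholder text standing in for a scraped research paper.
    paper_text = "Abstract. We study the properties of a hypothetical system. " * 40
    for i, w in enumerate(make_training_windows(paper_text)[:3]):
        print(f"window {i}: {w[:60]}...")
```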

Purchasing high-quality datasets is a growing trend, with many well-known media outlets and platforms partnering with AI developers to sell their content. Given that these works might otherwise be scraped silently, without any agreement, such deals are likely to become more common.

Some AI developers, such as the Large-scale Artificial Intelligence Open Network (LAION), keep their datasets open, but many generative AI companies remain secretive about their training data. Experts believe that open-access platforms such as arXiv and databases such as PubMed are prime targets for AI companies to scrape.

Proving that a specific paper is in an LLM's training set is not straightforward. Researchers can prompt a model with unusual sentences from the paper and check whether its output reproduces them verbatim: a match suggests the paper was used, but a mismatch does not prove it was not, because developers can tune models to avoid emitting training data word for word.
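As an illustration of this kind of probe, here is a minimal sketch using the open GPT-2 model from Hugging Face `transformers` as a stand-in for a commercial LLM. It prompts the model with the opening words of an unusual sentence and checks whether the continuation matches, and it also computes the sentence's perplexity, a signal used in membership-inference research (see the second reference). A verbatim match or unusually low perplexity is suggestive, not proof, and a mismatch does not show the paper was excluded.

```python
# Minimal sketch of a membership probe: does a model "know" an unusual sentence?
# GPT-2 is a stand-in for a commercial LLM; results are suggestive, not proof.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical distinctive sentence taken from the paper under test.
sentence = "An unusual, distinctive sentence copied verbatim from the paper under test."
prefix = " ".join(sentence.split()[:6])

# 1) Completion probe: prompt with the start of the sentence and compare the continuation.
inputs = tok(prefix, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                               pad_token_id=tok.eos_token_id)
continuation = tok.decode(generated[0], skip_special_tokens=True)
print("verbatim match:", continuation.startswith(sentence[:len(continuation)]))

# 2) Perplexity probe: memorized text tends to receive unusually low loss.
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```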


Even if it can be shown that an LLM used a specific text, what then? Publishers argue that unauthorized use of copyrighted text constitutes infringement; others counter that LLMs do not copy text but rather learn from it to generate new text.

In the U.S., a landmark copyright lawsuit is under way: The New York Times is suing OpenAI, the developer of ChatGPT, and its partner Microsoft for using the newspaper's content to train models without permission.

Many scholars welcome their work being included in LLM training data, particularly when it makes these models more accurate for research. But not everyone is comfortable with this: some researchers feel that such models threaten their own work.

Overall, individual authors currently have little say when publishers decide to sell their work, and there are no clear mechanisms for credit or attribution when already-published articles are used. Some researchers are frustrated: "We want the help of AI models, but we also want a fair mechanism, and we haven't found a solution yet."

References:

https://www.nature.com/articles/d41586-024-02599-9

https://arxiv.org/pdf/2112.03570