Researchers from Midjourney and New York University have developed a method that makes the creative text generated by language models significantly more diverse while losing little in quality.
The technique, detailed in a recent research paper, adds a "deviation metric" to the model's training process: for each generated text, it quantifies how much that text differs from the other texts produced for the same prompt. The researchers compute text embeddings and measure the pairwise cosine distance between them, giving the training process a mathematical handle on textual variation.
The training method evaluates the differences between LLM responses to enhance output diversity. | Image: Chung et al.
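To make the deviation metric concrete, here is a minimal sketch of how such a score could be computed with NumPy. It assumes each response has already been converted to an embedding vector; the function name and the choice to average pairwise distances per response are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def deviation_scores(embeddings: np.ndarray) -> np.ndarray:
    """For each response, the average cosine distance to the other
    responses generated for the same prompt (higher = more distinct).

    embeddings: (n_responses, dim) array of text embeddings, n >= 2.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T           # pairwise cosine similarity
    dists = 1.0 - sims                 # cosine distance
    np.fill_diagonal(dists, 0.0)       # ignore each response's self-distance
    n = len(embeddings)
    return dists.sum(axis=1) / (n - 1) # mean distance to the other responses

# Example: dummy embeddings for four responses to one prompt
rng = np.random.default_rng(0)
fake = rng.normal(size=(4, 384))
print(deviation_scores(fake))  # one deviation score per response
```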
Preliminary results are encouraging: models trained with the new method produced 23% more diverse text, with only a 5% drop in quality as scored by a reward model trained on Reddit upvote data.
A specific test case illustrates the effect. When the researchers prompted a standard GPT-4o model with "My dear, why are you trembling? You are the king now," the model mostly generated stories about a nervous new ruler. The improved Llama-3.1-8B model, despite being much smaller, produced markedly different stories, ranging from dark fantasy about a bear king to supernatural tales set underwater. Human evaluators corroborated this, rating the texts as more diverse while quality held up. Notably, the comparison was against the older GPT-4o rather than the newer, more natural-sounding but computationally more expensive GPT-4.5.
The research team targeted two types of diversity: semantic variation (different story content and plots) and stylistic variation (texts that read as if written by different authors). They built a variant of the method for each type, but experiments showed that combining both yielded the best results.
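One plausible way to combine the two signals, reusing `deviation_scores` from the sketch above: compute deviation once under a semantic embedding and once under a style embedding, then blend the scores. The `Embedder` type, the equal weights, and the linear blend are assumptions for illustration; the paper's exact combination may differ.

```python
from typing import Callable, List
import numpy as np

# An embedder maps a list of texts to an (n, dim) array of vectors.
Embedder = Callable[[List[str]], np.ndarray]

def combined_deviation(responses: List[str],
                       semantic_embed: Embedder,
                       style_embed: Embedder,
                       w_sem: float = 0.5,
                       w_style: float = 0.5) -> np.ndarray:
    """Blend semantic and stylistic deviation for one prompt's responses.

    Uses deviation_scores() from the earlier sketch. The 50/50 weighting
    is an illustrative choice, not a value reported in the paper.
    """
    sem = deviation_scores(semantic_embed(responses))   # plot/content spread
    sty = deviation_scores(style_embed(responses))      # authorial-voice spread
    return w_sem * sem + w_style * sty
```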
Data shows the modified model outperforms others in both story quality and diversity. | Image: Chung et al.
The team trained on more than 100,000 prompt-response pairs from the subreddit r/WritingPrompts and found that as few as four different responses per prompt were enough to noticeably improve diversity. Output quality can be preserved by selecting training samples carefully or by enforcing a minimum quality threshold on the candidate responses, which makes the approach more flexible than other diversity-boosting methods.
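As a sketch of how a minimum quality bar could interact with deviation when assembling preference pairs for training: only responses that clear the bar are eligible, the most distinct eligible response serves as the preferred example, and the least distinct as the rejected one. The `Response` class, `pick_preference_pair`, and this exact pairing rule are hypothetical; the paper's selection procedure may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Response:
    text: str
    quality: float    # e.g., score from a reward model trained on upvotes
    deviation: float  # mean cosine distance to sibling responses

def pick_preference_pair(responses: List[Response],
                         min_quality: float) -> Optional[Tuple[Response, Response]]:
    """Among responses above the quality bar, prefer the most distinct
    one and reject the least distinct one (illustrative rule)."""
    eligible = [r for r in responses if r.quality >= min_quality]
    if len(eligible) < 2:
        return None  # not enough usable samples for this prompt
    chosen = max(eligible, key=lambda r: r.deviation)
    rejected = min(eligible, key=lambda r: r.deviation)
    return chosen, rejected

pair = pick_preference_pair(
    [Response("story A", quality=0.8, deviation=0.6),
     Response("story B", quality=0.7, deviation=0.2),
     Response("story C", quality=0.3, deviation=0.9)],  # below the quality bar
    min_quality=0.5,
)
# -> story A chosen (most distinct), story B rejected (least distinct)
```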
Despite the promising outlook, open questions remain. The researchers have not yet verified whether the method carries over to areas beyond creative writing, such as technical documentation or summarization, which may require different trade-offs. Its effectiveness in the online training setups used for many large models also remains untested.
Moreover, the Reddit upvote signal used to measure quality has limitations. Upvotes offer some indication of how well a text lands, but they neglect factors such as technical accuracy, internal consistency, and professional writing standards, so a more comprehensive evaluation method may be needed.
Despite these open issues, the technique could substantially change how large language models handle creative writing tasks, where current models often fall into repetitive patterns. The researchers say they will release their code on GitHub for other researchers and developers to use.