Microsoft's research team recently introduced an AI technology called the "Large Action Model" (LAM), marking a new phase in AI development. Unlike conventional language models such as GPT-4o, LAM can autonomously operate Windows programs: it does not just converse or offer suggestions but actually carries out tasks.

LAM's strength lies in its ability to understand user input in several forms, including text, voice, and images, and to convert these requests into detailed action plans. It not only creates plans but also adjusts its strategy as the situation changes in real time. Building LAM involves four main steps: first, the model learns to break tasks down into logical steps; next, it learns to translate these plans into specific actions with the help of more advanced AI systems such as GPT-4o; then, it independently explores new solutions, even tackling problems that other AI systems cannot solve; finally, it is fine-tuned through a reward mechanism.
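To make the plan-then-act idea concrete, here is a minimal toy sketch of a loop that follows a plan step by step and adjusts when a step cannot be carried out. It is not Microsoft's implementation: the `Step` class, the `plan_task`, `execute`, and `run` functions, and the control names are all hypothetical, and the "planner" simply returns a hard-coded plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    description: str                   # plan step in natural language
    action: str                        # hypothetical action name, e.g. "click"
    target: str                        # hypothetical UI control or hotkey
    fallback: Optional["Step"] = None  # alternative to try if this step fails


def plan_task(request: str) -> list[Step]:
    """Toy stand-in for the planning stage: always returns the same plan."""
    return [
        Step("Select the title text", "select", "title"),
        Step(
            "Apply bold formatting", "click", "bold_button",
            fallback=Step("Apply bold formatting", "hotkey", "ctrl+b"),
        ),
    ]


def execute(step: Step, available: set[str]) -> bool:
    """Toy stand-in for acting on the UI: succeeds if the target is available."""
    return step.target in available


def run(request: str, available: set[str]) -> list[str]:
    """Follow the plan, switching to a fallback when a step cannot be executed."""
    log = []
    for step in plan_task(request):
        current = step
        # Walk the fallback chain until something can be executed.
        while current is not None and not execute(current, available):
            current = current.fallback
        if current is None:
            log.append(f"failed: {step.description}")
            break
        tag = "done" if current is step else "adjusted"
        log.append(f"{tag}: {current.action} {current.target}")
    return log


print(run("Make the title bold", {"title", "ctrl+b"}))
# ['done: select title', 'adjusted: hotkey ctrl+b']
```

In this toy run the bold button is not available, so the loop falls back to the keyboard shortcut instead of giving up, which is the kind of real-time adjustment the paragraph describes.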

In experiments, the research team built a LAM based on Mistral-7B and tested it in a Word environment. The model completed 71% of tasks successfully, whereas GPT-4o achieved a success rate of 63% without visual information.

LAM was also faster, completing each task in just 30 seconds, while GPT-4o took 86 seconds. Although GPT-4o's success rate rose to 75.5% when it was given visual information, LAM remained considerably faster and outperformed the text-only GPT-4o on success rate.
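For a sense of scale, the short calculation below compares the figures quoted above. The variable names are illustrative, and the numbers are taken directly from the article rather than from any additional measurement.

```python
# Figures as quoted in the article (Word environment).
lam_success_pct = 71.0
gpt4o_text_success_pct = 63.0     # GPT-4o without visual information
gpt4o_vision_success_pct = 75.5   # GPT-4o with visual information
lam_seconds = 30
gpt4o_seconds = 86

speedup = gpt4o_seconds / lam_seconds
lead_over_text_only = lam_success_pct - gpt4o_text_success_pct
gap_to_vision = gpt4o_vision_success_pct - lam_success_pct

print(f"LAM is about {speedup:.1f}x faster per task")                       # about 2.9x
print(f"Lead over text-only GPT-4o: {lead_over_text_only:.1f} points")      # 8.0 points
print(f"Gap to GPT-4o with vision: {gap_to_vision:.1f} points")             # 4.5 points
```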

To construct the training data, the research team first collected 29,000 task-plan pairs from Microsoft documentation, wikiHow articles, and Bing searches. They then used GPT-4o to turn simple tasks into more complex ones, expanding the dataset to 76,000 pairs, roughly 2.6 times its original size. Ultimately, about 2,000 successful action sequences were included in the final training set.
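As a quick sanity check on the dataset growth, the snippet below recomputes it from the rounded counts quoted above. The variable names are illustrative, and because the inputs are rounded, the result is only approximate.

```python
initial_pairs = 29_000    # task-plan pairs collected at first (rounded)
expanded_pairs = 76_000   # pairs after GPT-4o expansion (rounded)

growth_factor = expanded_pairs / initial_pairs
percent_increase = (expanded_pairs - initial_pairs) / initial_pairs * 100

print(f"Growth factor: {growth_factor:.2f}x")        # 2.62x
print(f"Percent increase: {percent_increase:.0f}%")  # 162%
```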

Despite LAM's potential, the research team still faces challenges, including the risk of AI actions going wrong, regulatory issues, and technical limitations in scaling and adapting to different applications. Even so, the researchers believe that LAM represents a significant shift in AI development, pointing to intelligent assistants that take a more active role in helping people complete real tasks.

Key Points:

🌟 LAM can autonomously operate Windows programs, going beyond traditional AI that can only converse.

⏱️ In the Word test, LAM achieved a task success rate of 71%, higher than the 63% of GPT-4o without visual information, and executed tasks faster.

📈 The research team enhanced the model's training effectiveness by expanding the number of task-plan pairs to 76,000.