Pruna AI, a European startup that develops compression algorithms for AI models, has announced that it is open-sourcing its optimization framework to help developers compress AI models more efficiently.
Pruna AI's framework combines several efficiency methods, including caching, pruning, quantization, and distillation, to make AI models smaller and faster. It also standardizes the saving and loading of compressed models, evaluates whether compression causes significant quality degradation, and measures the performance gains it delivers.
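For a sense of the workflow, here is a minimal sketch of how the open-source framework is used, following the smash/SmashConfig pattern from the project's README; the base model, the specific algorithm names ("deepcache", "hqq"), and the save call are illustrative assumptions and may not match the current release exactly.

```python
# Minimal sketch: pick efficiency methods in a config, "smash" the
# model, then save the compressed result in a standardized format.
# Algorithm names below are assumptions; check the project's README
# for the methods available in the current release.
import torch
from diffusers import StableDiffusionPipeline
from pruna import SmashConfig, smash

# Load a base model to compress (a small diffusion model here).
pipe = StableDiffusionPipeline.from_pretrained(
    "segmind/tiny-sd", torch_dtype=torch.float16
)

# Combine several efficiency methods in a single configuration.
config = SmashConfig()
config["cacher"] = "deepcache"  # reuse intermediate computations
config["quantizer"] = "hqq"     # store weights at lower precision

# Apply the selected methods in one call.
smashed = smash(model=pipe, smash_config=config)

# Standardized saving of the compressed model.
smashed.save_pretrained("tiny-sd-smashed")
```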
John Rachwan, co-founder and CTO of Pruna AI, stated: "Our framework is similar to Hugging Face's standardization of transformers and diffusers; we provide a unified way to call and use various efficiency methods." Large companies like OpenAI already employ multiple compression methods in their models, such as distillation to create faster versions of their flagship models.
Distillation is a knowledge transfer technique built on a "teacher-student" setup: developers send requests to a large teacher model and record its outputs, then use those outputs to train a smaller student model to approximate the teacher's behavior. Rachwan noted that while many large companies tend to build such compression tooling in-house, the open-source community typically offers only single-method solutions; Pruna AI's tool integrates multiple methods, which significantly simplifies the process.
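As an illustration of the teacher-student idea (not Pruna AI's specific implementation), the following PyTorch sketch trains a small student network to match a frozen teacher's output distribution; the models, data, and hyperparameters are placeholders.

```python
# A bare-bones teacher-student distillation loop: query the frozen
# teacher, record its outputs, and train the student to match them.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
teacher.eval()  # the teacher is frozen; only the student is trained

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's output distribution

for step in range(100):
    x = torch.randn(64, 32)  # stand-in for real request inputs
    with torch.no_grad():
        teacher_logits = teacher(x)  # "record the teacher's output"
    student_logits = student(x)
    # KL divergence between the softened distributions is the
    # standard distillation loss (scaled by temperature squared).
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```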
Currently, Pruna AI's framework supports various model types, including large language models, diffusion models, speech recognition models, and computer vision models. However, the company primarily focuses on optimizing image and video generation models. Companies like Scenario and PhotoRoom already use Pruna AI's services.
Besides the open-source version, Pruna AI offers an enterprise version with advanced optimization features and an optimization agent. Rachwan revealed: "The most exciting feature we're about to release is the compression agent. Users only need to provide the model and specify speed and accuracy requirements; the agent will automatically find the best compression combination."
Pruna AI charges by the hour, similar to renting GPUs from a cloud provider, and optimized models can save enterprises significant inference costs. For example, Pruna AI reduced a Llama model to one-eighth of its original size with almost no loss of accuracy. The company hopes clients will view its compression framework as an investment that pays for itself.
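For a rough sense of what an eight-fold reduction means in practice (the parameter count and 16-bit baseline below are assumptions for illustration, not figures from Pruna AI):

```python
# Back-of-the-envelope storage math for an 8x size reduction.
params = 8e9                      # assumed 8B-parameter Llama model
fp16_bytes = params * 2           # 2 bytes per weight at 16-bit
compressed_bytes = fp16_bytes / 8 # eight-fold reduction
print(f"{fp16_bytes / 1e9:.0f} GB -> {compressed_bytes / 1e9:.0f} GB")
# 16 GB -> 2 GB
```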
Recently, Pruna AI secured $6.5 million in seed funding from investors including EQT Ventures, Daphni, Motier Ventures, and Kima Ventures.
Project: https://github.com/PrunaAI/pruna
Key Highlights:
🌟 Pruna AI has open-sourced an optimization framework that combines multiple compression methods to make AI models smaller and faster.
🤖 Large companies often use techniques like distillation; Pruna AI offers a tool integrating multiple methods, simplifying the process.
💰 The enterprise version adds advanced features, including an optimization agent, helping users compress models and improve performance while maintaining accuracy.