Recently, Zhejiang University and Alibaba Damo Academy jointly released a noteworthy study on building high-quality multimodal textbooks from instructional videos. This research not only provides new insights for training vision-language models (VLMs) but may also change how educational resources are utilized.

With the rapid development of artificial intelligence, VLM pre-training has relied primarily on image-text pairs and interleaved image-text corpora. However, most of this data is scraped from the web, where the correlation between text and images is weak and the knowledge density is low, making it a poor foundation for complex visual reasoning.


To address this challenge, the research team set out to extract a high-quality knowledge corpus from the vast pool of instructional videos available online. They collected over 159,000 instructional videos and, after meticulous filtering and processing, retained 75,000 high-quality videos covering subjects such as mathematics, physics, and chemistry, with a total duration exceeding 22,000 hours.

The researchers designed a multi-stage "video-to-textbook" processing pipeline. First, they used automatic speech recognition (ASR) to transcribe the spoken content of the videos into text. Then, through image analysis and text matching, they retained only the segments highly relevant to the knowledge points being taught. Finally, the resulting keyframes, OCR text, and transcribed text were interwoven into a content-rich, well-structured multimodal textbook.
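Based on the article's description, the sketch below illustrates in Python how such an interleaving step might look. The `VideoSegment` fields, the relevance threshold, and the `<image:...>` placeholder convention are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class VideoSegment:
    """One time-aligned slice of a video (hypothetical structure)."""
    start: float                                    # segment start time (seconds)
    end: float                                      # segment end time (seconds)
    asr_text: str                                   # ASR transcript for this span
    keyframes: list = field(default_factory=list)   # paths to extracted keyframes
    ocr_text: str = ""                              # text recognized inside the frames
    relevance: float = 0.0                          # image-text relevance score in [0, 1]

def build_textbook_page(segments, min_relevance=0.6):
    """Interleave keyframes, OCR text, and ASR text from segments whose
    visuals are sufficiently relevant to the spoken content.
    The 0.6 threshold is an assumed value for illustration."""
    page = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.relevance < min_relevance:
            continue  # drop weakly grounded segments (e.g., filler or ads)
        # Image placeholders first, then on-screen text, then the transcript.
        page.extend(f"<image:{frame}>" for frame in seg.keyframes)
        if seg.ocr_text:
            page.append(seg.ocr_text)
        page.append(seg.asr_text)
    return "\n".join(page)

# Toy usage with made-up data: the second segment is filtered out
# because its visuals are irrelevant to the lesson.
segments = [
    VideoSegment(0.0, 14.5, "The derivative measures instantaneous rate of change.",
                 keyframes=["frame_0001.jpg"], ocr_text="f'(x) = lim h->0 ...",
                 relevance=0.82),
    VideoSegment(14.5, 30.0, "Please like and subscribe!", relevance=0.05),
]
print(build_textbook_page(segments))
```

In this toy run, only the first segment survives the relevance filter, so the output page interleaves its keyframe placeholder, OCR text, and transcript in temporal order, mirroring the "interwoven" structure the article describes.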


Preliminary results indicate that, compared with previous web-centric datasets, the newly generated textbook dataset shows significant improvements in knowledge density and image relevance, providing a more solid foundation for VLM training. The study has also drawn widespread attention in the academic community: the dataset quickly climbed the popularity charts on Hugging Face, accumulating over 7,000 downloads in just two weeks.

Through this innovative effort, the researchers hope not only to advance the development of VLMs but also to open up new possibilities for integrating and applying educational resources.