VSP-LLM is a framework designed to understand and translate spoken content by observing a speaker's lip movements in video. Its primary use is lip-reading: it converts lip movements into text (visual speech recognition) and can further render that text in a target language (visual speech translation). By coupling a visual speech encoder with a large language model, VSP-LLM processes both tasks within a single model. Several techniques make the framework both accurate and efficient: self-supervised pre-training of the visual encoder, removal of redundant visual information by merging repeated frames (see the first sketch below), multi-task training across recognition and translation, and low-rank adapters for parameter-efficient tuning of the LLM (see the second sketch). Looking ahead, VSP-LLM holds broad application prospects in visual speech processing and translation.
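The redundancy-removal idea rests on the observation that adjacent video frames often carry near-identical lip shapes, so consecutive frames assigned to the same discrete visual speech unit can be merged before reaching the LLM, shortening the input sequence. The following is a minimal PyTorch sketch of that idea; the function name, the `(T, D)` feature layout, and the choice to average merged frames are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def deduplicate_units(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames that share the same visual speech unit.

    features: (T, D) frame-level visual features
    units:    (T,)   discrete unit ID assigned to each frame
    Returns:  (T', D) with T' <= T, one averaged vector per run of
              identical consecutive units.
    """
    # Mark positions where the unit ID differs from the previous frame
    change = torch.ones_like(units, dtype=torch.bool)
    change[1:] = units[1:] != units[:-1]

    # Map each frame to its run index (0 .. T'-1)
    group = torch.cumsum(change.long(), dim=0) - 1
    num_groups = int(group[-1]) + 1

    # Sum features within each run, then divide by the run length
    merged = torch.zeros(num_groups, features.size(1))
    merged.index_add_(0, group, features)
    counts = torch.zeros(num_groups).index_add_(0, group, torch.ones(len(units)))
    return merged / counts.unsqueeze(1)
```

For example, a unit sequence `[5, 5, 3, 3, 3, 7]` collapses to three runs, so six frame vectors become three averaged vectors, and the LLM sees a sequence half as long.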
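Low-rank adaptation makes tuning the LLM affordable: the pretrained weights stay frozen, and only a small low-rank correction is trained. Below is a generic sketch of a LoRA-style linear layer in PyTorch, shown for intuition only; the class name, rank, and scaling hyperparameters are assumptions, not VSP-LLM's specific configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Only the rank-r matrices A and B receive gradients, so adapting a
    large model touches a tiny fraction of its parameters.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pretrained weights frozen
        # A is small-random, B is zero, so training starts from the base model
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Because the update is initialized to zero, the adapted model behaves exactly like the frozen one at the start of training, and the low-rank path gradually learns the task-specific adjustment.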