Video ReCap is an open-source model for video captioning that can process videos ranging from 1 second to 2 hours and produce hierarchical captions at multiple temporal levels. It employs a recursive video-language architecture consisting of a video encoder, a video-language alignment module, and a recursive text decoder, which allows it to understand video at different time scales and abstraction levels and to generate accurate, richly layered descriptions. Experiments show that the recursive architecture is important for generating segment descriptions and video summaries. In addition, the hierarchical captions produced by the model substantially improve long-form video question answering on the EgoSchema benchmark.
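The recursive idea can be sketched as follows: captions for short clips are generated first, then each higher level is produced by re-captioning the text of the level below. This is a minimal, hypothetical illustration in plain Python; the function names (`caption_clip`, `summarize`, `recap`) and the fixed `group_size` are assumptions for illustration, not the actual Video ReCap API, which uses learned video features and a language model at every level.

```python
def caption_clip(clip):
    # Stand-in for the video encoder + text decoder run on one short clip.
    return f"caption({clip})"

def summarize(texts):
    # Stand-in for the recursive text decoder: condenses the captions
    # from the level below into one higher-level description.
    return "summary[" + " | ".join(texts) + "]"

def recap(clips, group_size=3):
    """Build a caption hierarchy: clip captions -> segment descriptions
    -> video summary, each level derived recursively from the previous one."""
    levels = [[caption_clip(c) for c in clips]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        levels.append([summarize(g) for g in groups])
    return levels

hierarchy = recap([f"clip{i}" for i in range(9)])
# hierarchy[0] holds 9 clip captions, hierarchy[1] holds 3 segment
# descriptions, and hierarchy[2] holds the single video summary.
```

The key design point mirrored here is that higher levels consume the *outputs* of lower levels rather than raw video, which keeps the cost of summarizing an hours-long video manageable.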