INFP is an audio-driven interactive head generation framework designed for two-person (dyadic) dialogues. Given the dual-track audio of a conversation and a single portrait image of an arbitrary avatar, it dynamically synthesizes verbal, non-verbal, and interactive avatar videos with realistic facial expressions and rhythmic head movements. The framework is lightweight yet powerful, which makes it suitable for instant communication scenarios such as video conferencing. The name INFP stands for Interactive, Natural, Fast, and Person-generic.