Media2Face is a tool for co-speech facial animation generation guided by multi-modal inputs: audio, text, and images. It first uses a Generalized Neural Parametric Facial Asset (GNPFA) to map facial geometry and images into a highly generalized expression latent space. It then applies GNPFA to a large collection of videos to extract high-quality expressions and accurate head poses, building the M2F-D dataset. Finally, it trains a diffusion model in the GNPFA latent space for co-speech facial animation generation. The tool not only achieves high fidelity in facial animation synthesis but also broadens expressiveness and style adaptability.
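
The following is a minimal sketch of how such a pipeline could be wired together at inference time: a diffusion-style denoiser runs in an expression latent space conditioned on fused audio/text/image features, and a decoder maps the resulting latents back to facial geometry. All module names, dimensions, and the simplified denoising loop are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a GNPFA-latent-space diffusion pipeline (assumptions, not the paper's code).
import torch
import torch.nn as nn

LATENT_DIM = 256   # assumed size of the GNPFA expression latent
COND_DIM = 512     # assumed size of the fused audio/text/image condition
NUM_FRAMES = 100   # animation length in frames
NUM_STEPS = 50     # diffusion denoising steps


class GNPFADecoder(nn.Module):
    """Stand-in for a GNPFA decoder: expression latent -> per-frame facial geometry."""
    def __init__(self, num_vertices=5023):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, num_vertices * 3)

    def forward(self, z):                                # z: (frames, LATENT_DIM)
        return self.net(z).view(z.shape[0], -1, 3)       # (frames, vertices, 3)


class LatentDenoiser(nn.Module):
    """Stand-in denoiser operating in the expression latent space,
    conditioned on fused audio/text/image features and the timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM + 1, 512),
            nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, z_t, cond, t):
        t_emb = t.expand(z_t.shape[0], 1)                # broadcast timestep per frame
        return self.net(torch.cat([z_t, cond, t_emb], dim=-1))


@torch.no_grad()
def generate(denoiser, decoder, cond):
    """Simplified reverse diffusion in the expression latent space."""
    z = torch.randn(NUM_FRAMES, LATENT_DIM)              # start from Gaussian noise
    for step in reversed(range(NUM_STEPS)):
        t = torch.full((1,), step / NUM_STEPS)
        eps = denoiser(z, cond, t)                       # predicted noise
        z = z - eps / NUM_STEPS                          # crude update rule, for illustration only
    return decoder(z)                                    # decode latents to facial geometry


# Usage: fuse per-frame multi-modal conditions (audio features, text/image style
# embeddings) into `cond`, then sample an animation sequence.
cond = torch.randn(NUM_FRAMES, COND_DIM)                 # placeholder fused conditions
animation = generate(LatentDenoiser(), GNPFADecoder(), cond)
print(animation.shape)                                   # (frames, vertices, 3)
```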