In biotechnology, artificial intelligence is accelerating protein discovery and design. A research team from UC Berkeley and Caltech recently developed ProteinDT, a multimodal framework that uses text descriptions to guide protein design. The approach combines protein sequence and structural information with the large body of biological knowledge that exists in text form, opening a new route to protein design.

The ProteinDT workflow consists of three main steps. First, the team used Contrastive Language and Protein Pre-training (ProteinCLAP) to align text descriptions with protein sequences. This step trains on 441,000 text-protein pairs drawn from the UniProt database and uses contrastive learning so that the representation of a text description lands close to the representation of its matching protein and far from those of unrelated proteins.
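To make the alignment step concrete, here is a minimal sketch of a contrastive objective over a batch of paired embeddings. It uses a symmetric InfoNCE-style loss, which is one common form of contrastive learning and is an assumption here, since the article does not say which loss ProteinCLAP uses. The function names, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(text_embs, protein_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i in one list matches pair i in the other."""
    texts = [l2_normalize(v) for v in text_embs]
    proteins = [l2_normalize(v) for v in protein_embs]
    n = len(texts)
    # sim[i][j]: temperature-scaled cosine similarity of text i and protein j.
    sim = [
        [sum(a * b for a, b in zip(texts[i], proteins[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Treat each row as logits whose correct class is the diagonal entry.
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the text-to-protein and protein-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(sim)
                  + cross_entropy_on_diagonal(sim_transposed))


# Toy batch: correctly paired embeddings versus the same embeddings shuffled.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(text_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(text_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the shuffled ones, which is the signal that training would use to pull matching text and protein representations together.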

Next, ProteinDT's Facilitator model maps the text representation to a protein sequence representation, using a Gaussian distribution to estimate the conditional distribution. In the final step, a decoder acts as a conditional generative model and produces the protein sequence from the representation generated in the previous step.
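The Gaussian assumption in the Facilitator step has a practical consequence: if the variance is held fixed, maximizing the likelihood of the protein representation is the same as minimizing squared error between the predicted and the true representation. The sketch below illustrates this with a tiny linear map trained by stochastic gradient descent. The linear form, the fixed unit variance, the function names, and the toy data are all assumptions made for illustration, not details of the actual model.

```python
import math


def gaussian_nll(target, mean, sigma=1.0):
    """Negative log-likelihood of `target` under N(mean, sigma^2 * I)."""
    dim = len(target)
    squared_error = sum((t - m) ** 2 for t, m in zip(target, mean))
    return (0.5 * squared_error / sigma ** 2
            + 0.5 * dim * math.log(2 * math.pi * sigma ** 2))


def predict(weights, bias, z_text):
    """Predicted mean of the protein representation given a text representation."""
    return [sum(w * x for w, x in zip(row, z_text)) + b
            for row, b in zip(weights, bias)]


def fit_facilitator(pairs, dim_in, dim_out, lr=0.1, epochs=500):
    """Fit a linear mean function by SGD on squared error.

    With a fixed variance this is equivalent to maximizing the Gaussian
    likelihood, because the log-variance term does not depend on the weights.
    """
    weights = [[0.0] * dim_in for _ in range(dim_out)]
    bias = [0.0] * dim_out
    for _ in range(epochs):
        for z_text, z_protein in pairs:
            mean = predict(weights, bias, z_text)
            for k in range(dim_out):
                error = mean[k] - z_protein[k]  # gradient of 0.5 * error^2
                for j in range(dim_in):
                    weights[k][j] -= lr * error * z_text[j]
                bias[k] -= lr * error
    return weights, bias


# Toy (text representation, protein representation) pairs from a known linear map.
toy_pairs = [
    ([0.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [3.0, 0.0]),
    ([0.0, 1.0], [1.0, -1.0]),
    ([1.0, 1.0], [3.0, -1.0]),
]
weights, bias = fit_facilitator(toy_pairs, dim_in=2, dim_out=2)
predicted = predict(weights, bias, [1.0, 1.0])
```

After fitting, the predicted mean for a text representation would be handed to the decoder as its conditioning input; the decoder itself is not sketched here.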

To validate the framework's effectiveness, the research team designed three downstream tasks. First, in text-to-protein generation, ProteinDT produced relevant protein sequences from textual descriptions of target protein properties, achieving over 90% accuracy. Second, in zero-shot text-guided protein editing, the team used two methods, latent space interpolation and latent optimization, to steer an existing protein toward a property described in text. Finally, the team evaluated the robustness and generalization of the representation learned by ProteinCLAP on protein property prediction, where it outperformed other state-of-the-art methods on four of six benchmarks.
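Of the two editing methods, latent space interpolation is the easier one to picture: the representation of the starting protein is blended with the representation of the text prompt, and the blend is then decoded into a new sequence. The sketch below shows only the blending step, using plain linear interpolation. The interpolation scheme, the function name `interpolate_latent`, and the coefficient `theta` are assumptions, since the article does not specify them.

```python
def interpolate_latent(z_protein, z_text, theta):
    """Blend two latent vectors; theta=0 keeps the protein, theta=1 follows the text."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be in [0, 1]")
    return [(1.0 - theta) * p + theta * t for p, t in zip(z_protein, z_text)]


# Hypothetical latent codes for a starting protein and a text prompt.
z_start = [0.0, 2.0]
z_prompt = [4.0, 6.0]
z_edited = interpolate_latent(z_start, z_prompt, theta=0.5)
```

A small `theta` keeps the edited representation close to the original protein, while a larger value pushes it further toward the property described in the prompt.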

This research not only opens new avenues for protein design but also showcases the immense potential of combining textual data with biomolecular design, promising to further advance biomedical research and drug development.