Large language models (LLMs) have seen widespread adoption across various fields in recent years, demonstrating powerful capabilities in content creation, programming assistance, and search engine optimization. However, their application in biomedical research faces challenges related to transparency, reproducibility, and customization.
To address these challenges, Heidelberg University and the European Bioinformatics Institute (EMBL-EBI) have jointly developed BioChatter, an open-source Python framework designed to simplify the use of LLMs for biomedical researchers.
BioChatter is designed to reduce technical complexity so that researchers can focus on their research without needing expertise in programming or machine learning. The framework lets researchers extract relevant data from biomedical databases and literature and call external bioinformatics tools in real time. It does this through integration with BioCypher, a framework for building biomedical knowledge graphs that link data such as gene mutations and drug-disease associations, which helps in analyzing complex datasets.
BioChatter's core functionalities include basic question-answering interactions with various LLMs, reproducible prompt engineering, knowledge graph querying, retrieval-augmented generation, and chained model calls. For enhanced usability, BioChatter provides an intuitive API, allowing researchers to easily integrate its functionality into web applications, command-line interfaces, or Jupyter notebooks.
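To make retrieval-augmented generation and reproducible prompting more concrete, here is a minimal Python sketch of the general pattern. It does not use BioChatter's actual API: the names `Document`, `retrieve`, `build_prompt`, and `answer` are hypothetical, the retrieval step is a simple word-overlap ranking rather than a vector search, and the model is any callable that takes a prompt string and returns text.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set

# A fixed template keeps prompts reproducible: the same question and
# context always produce the same prompt text.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
)


@dataclass
class Document:
    """Hypothetical container for one retrievable text snippet."""
    doc_id: str
    text: str


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: List[Document], k: int = 2) -> List[Document]:
    """Rank documents by word overlap with the question and keep the top k.

    Assumption: a real system would use embeddings and a vector store;
    word overlap keeps this sketch dependency-free.
    """
    q_tokens = tokenize(question)
    ranked = sorted(
        corpus,
        key=lambda d: len(q_tokens & tokenize(d.text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, docs: List[Document]) -> str:
    """Insert the retrieved snippets into the fixed template."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(
    question: str,
    corpus: List[Document],
    llm: Callable[[str], str],
    k: int = 2,
) -> str:
    """Chain the two steps: retrieve context, then call the model."""
    return llm(build_prompt(question, retrieve(question, corpus, k)))


CORPUS = [
    Document("d1", "BRAF V600E mutation is targeted by vemurafenib in melanoma."),
    Document("d2", "Metformin is a first-line drug for type 2 diabetes."),
    Document("d3", "EGFR mutations predict response to gefitinib in lung cancer."),
]
```

Calling `answer("Which drug targets the BRAF V600E mutation?", CORPUS, llm=my_model, k=1)` would send `my_model` a prompt containing only the BRAF snippet, where `my_model` stands in for whichever LLM client is in use.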
In experimental evaluations, the research team built custom benchmarks to assess BioChatter's performance. Models that used BioChatter's prompt engine generated correct queries significantly more often than models without one, which supports the framework's practical use.
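A benchmark of this kind can be pictured as a small scoring harness. The Python sketch below is an illustration under stated assumptions, not the team's benchmark: the schema, the scoring rule (the fraction of expected schema elements found in a generated query), and the names `schema_prompt`, `score_query`, and `run_benchmark` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical knowledge graph schema, used only for illustration.
SCHEMA: Dict[str, List[str]] = {
    "nodes": ["Gene", "Disease", "Drug"],
    "relationships": ["GENE_ASSOCIATED_WITH_DISEASE", "DRUG_TREATS_DISEASE"],
}


def schema_prompt(question: str, schema: Dict[str, List[str]]) -> str:
    """Build a prompt that tells the model which labels it may use."""
    return (
        "Write a Cypher query for the question.\n"
        f"Node labels: {', '.join(schema['nodes'])}\n"
        f"Relationship types: {', '.join(schema['relationships'])}\n"
        f"Question: {question}\n"
    )


def score_query(query: str, expected: List[str]) -> float:
    """Return the fraction of expected schema elements present in the query.

    Assumption: substring matching is a crude proxy for correctness; a
    real benchmark could instead execute the query and compare results.
    """
    if not expected:
        return 0.0
    hits = sum(1 for term in expected if term in query)
    return hits / len(expected)


def run_benchmark(cases: List[dict], generate: Callable[[str], str]) -> float:
    """Average the per-case scores for one query-generating function."""
    scores = [score_query(generate(c["question"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)


CASES = [
    {
        "question": "Which genes are associated with melanoma?",
        "expected": ["Gene", "Disease", "GENE_ASSOCIATED_WITH_DISEASE"],
    },
]
```

To compare two setups, `run_benchmark` would be called twice on the same cases: once with a function that sends the model the bare question, and once with a function that sends `schema_prompt(question, SCHEMA)`.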
Looking ahead, the BioChatter team will continue collaborating with life science databases like Open Targets to integrate human genetics and genomics data, helping users more efficiently identify and prioritize drug targets. They are also developing a complementary system called BioGather, aimed at extracting information from other clinical data types such as genomics, medical notes, and images, to address complex problems in personalized medicine and drug development.
BioChatter empowers biomedical researchers to leverage LLMs more effectively, driving scientific advancement and innovation.