Harvard and Columbia Release Open Dataset of 16 Million Protein Multiple Sequence Alignments, Addressing the Private Training Data Issue for AlphaFold 2!
Xinzhiyuan (新智元)
Researchers at institutions including Harvard University and Columbia University have released OpenProteinSet, an open-source dataset of 16 million protein multiple sequence alignments (MSAs) and related data. The dataset addresses a long-standing problem: the training data for DeepMind's AlphaFold 2 was never made public. AlphaFold 2 set the standard for accuracy in protein structure prediction, but because its training data remained private, other researchers have had difficulty reproducing or building on it. OpenProteinSet covers the proteins in the Protein Data Bank (PDB) as well as data from UniProt clusters, making it suitable for training a wide range of AI models. The resource provides significant support for bioinformatics and protein machine learning, and is expected to advance research in biology, drug development, and related fields.
© Copyright AIbase Base 2024. Source: https://www.aibase.com/news/1496