NASA's Interagency Implementation and Advanced Concepts Team (IMPACT) collaborates with private, non-federal partners through Space Act Agreements to develop INDUS, a suite of Large Language Models (LLMs) tailored for Earth science, biological and physical sciences, heliophysics, planetary science, and astrophysics. These models are trained on curated scientific literature drawn from diverse data sources.

INDUS encompasses two types of models: encoders, which convert natural language text into numerical representations that LLMs can process, and sentence transformers, which map whole sentences or passages to embeddings suited to similarity search and retrieval. The INDUS encoders are trained on a 6-billion-token corpus spanning astrophysics, planetary science, Earth science, heliophysics, biological sciences, and physical sciences. The IMPACT-IBM collaborative team developed a custom tokenizer that improves on generic tokenizers by recognizing scientific terms such as "biomarker" and "phosphorylation." Over half of the 50,000 vocabulary entries in INDUS are unique to the scientific domains used in its training. The INDUS encoder models are fine-tuned on approximately 268 million text pairs, including title/summary and question/answer pairs.
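As a rough illustration of what the custom tokenizer provides, the sketch below compares a generic tokenizer with a domain-adapted one on a scientific term: a generic vocabulary typically fragments "phosphorylation" into several subword pieces, while a vocabulary built from scientific text can keep it intact. The Hugging Face model IDs here are assumptions for illustration; the actual INDUS releases are published under the nasa-impact organization.

```python
# Sketch: comparing a generic tokenizer with a domain-specific one.
# Model IDs below are illustrative assumptions, not confirmed releases.
from transformers import AutoTokenizer

generic = AutoTokenizer.from_pretrained("bert-base-uncased")
domain = AutoTokenizer.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")  # assumed ID

term = "phosphorylation"
print(generic.tokenize(term))  # generic WordPiece splits the term into fragments
print(domain.tokenize(term))   # a domain vocabulary can keep it as one token
```

Fewer fragments per scientific term means shorter input sequences and representations that align more closely with how the literature actually uses the terminology.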

By equipping INDUS with domain-specific vocabulary, the IMPACT-IBM team achieved superior performance on biomedical task benchmarks, scientific question-answering benchmarks, and Earth science entity recognition tests compared to open, non-domain-specific LLMs. Combined with training on diverse language tasks and retrieval-augmented generation, INDUS can address researchers' queries, retrieve relevant documents, and generate answers. For latency-sensitive applications, the team developed smaller, faster versions of the encoder and sentence transformer models.
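In a retrieval-augmented setup like the one described above, the sentence transformer supplies the retrieval step: it embeds the question and candidate passages, and the closest passages are handed to a generator. The sketch below illustrates that step; the model ID is an assumption (INDUS checkpoints are published under the nasa-impact organization on Hugging Face, but the exact identifier may differ), and the passages are invented examples.

```python
# Retrieval step of a RAG pipeline with a sentence-transformer encoder.
# Model ID is an assumed placeholder for the INDUS sentence transformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nasa-impact/nasa-smd-ibm-st-v2")  # assumed ID

passages = [
    "MODIS provides daily global observations of land surface temperature.",
    "Phosphorylation regulates protein activity in cellular signaling.",
    "The heliosphere shields the solar system from galactic cosmic rays.",
]
query = "Which instrument measures land surface temperature?"

# Embed query and passages, then rank passages by cosine similarity.
passage_embeddings = model.encode(passages, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, passage_embeddings)[0]
print(passages[scores.argmax().item()])  # top passage feeds the generator
```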

Validation tests show that INDUS can retrieve relevant passages from scientific literature when answering a test set of about 400 questions from NASA. IBM researcher Bishwaranjan Bhattacharjee commented on the overall approach, "We achieved outstanding performance by not only having custom vocabulary but also by having extensively trained encoder models and effective training strategies. For the smaller, faster versions, we used neural architecture search to obtain model architectures and employed larger model supervision for knowledge distillation during training."
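The smaller-model training that Bhattacharjee describes, larger-model supervision for knowledge distillation, can be sketched generically: a student is trained to match the temperature-softened output distribution of a larger teacher. The PyTorch snippet below is a minimal illustration of that standard loss, not the IMPACT-IBM training code; all names are hypothetical.

```python
# Generic knowledge-distillation loss: the student mimics the teacher's
# temperature-softened output distribution. Illustrative only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Random logits stand in for real model outputs (batch of 4, 50,000-way).
teacher_logits = torch.randn(4, 50000)
student_logits = torch.randn(4, 50000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```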

Key Points:

- 🚀 NASA collaborates with IBM to develop INDUS, a suite of Large Language Models (LLMs) for Earth science, biological and physical sciences, heliophysics, planetary science, and astrophysics.

- 🎓 INDUS includes two types of models, encoders and sentence transformers, trained with a custom tokenizer on a 6-billion-token corpus and fine-tuned on approximately 268 million text pairs.

- 💡 By combining a domain-specific vocabulary with diverse language tasks and retrieval-augmented generation, INDUS outperforms open, non-domain-specific LLMs at answering researchers' queries, retrieving relevant documents, and generating answers.