Google's Gemini model, just six months old, has already demonstrated remarkable capabilities in areas such as safety, coding, and debugging, though it also exhibits significant limitations. Notably, in Google's own experiments, a fine-tuned version of this large language model (LLM) outscored human experts on sleep and fitness questions.
Google researchers have introduced the Personal Health Large Language Model (PH-LLM), a fine-tuned version of Gemini that can understand and reason about time-series personal health data from wearable devices such as smartwatches and heart rate monitors. In their experiments, the model's responses and predictions significantly surpassed those of experts with years of experience in health and fitness.
Under ideal conditions, wearable technology helps people monitor their health and make meaningful changes. These devices offer a "rich and long-term source of data" that is collected passively and continuously and can be supplemented with inputs like exercise and diet logs, mood diaries, and sometimes even social media activity. However, the data these devices capture on sleep, physical activity, cardiometabolic health, and stress is rarely integrated into clinical settings, which are "fragmentary" in nature. Researchers speculate that this is likely because the data is captured without context and is computationally expensive to store and analyze. It can also be difficult to interpret.
Google researchers, however, have made breakthroughs in training PH-LLM to provide advice, answer professional exam questions, and predict self-reported sleep disturbances and sleep disorder outcomes. For the exam task, the model was given multiple-choice questions, and researchers employed "chain of thought" prompting (having the model reason step by step, mimicking human reasoning) and "zero-shot" methods (answering without first being shown worked examples of the task).
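To make the two prompting terms concrete, here is a minimal Python sketch of how a zero-shot, chain-of-thought prompt for a multiple-choice question might be assembled and how a final answer letter could be read back from the reply. It is an illustration only, not the researchers' actual prompt or code: the function names, the instruction wording, and the "Answer: <letter>" convention are all assumptions, and the question and options are placeholders.

```python
import re
from typing import Dict, Optional


def build_zero_shot_cot_prompt(question: str, options: Dict[str, str]) -> str:
    """Hypothetical helper: format one multiple-choice question.

    Zero-shot: the prompt contains no solved example questions.
    Chain of thought: the prompt asks for step-by-step reasoning
    before the final answer.
    """
    lines = [f"Question: {question}"]
    for letter in sorted(options):
        lines.append(f"({letter}) {options[letter]}")
    # Assumed instruction wording, not taken from the study.
    lines.append(
        "Think through the problem step by step, then finish with "
        "a line of the form 'Answer: <letter>'."
    )
    return "\n".join(lines)


def extract_answer(model_output: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the reply, or None."""
    matches = re.findall(r"Answer:\s*\(?([A-Z])\)?", model_output)
    return matches[-1] if matches else None


# Placeholder question and options, invented purely for illustration.
prompt = build_zero_shot_cot_prompt(
    "Placeholder exam question?",
    {"A": "first option", "B": "second option", "C": "third option"},
)
print(prompt)
print(extract_answer("Some reasoning steps...\nAnswer: B"))
```

Because no solved examples are included, the model must rely on what it already knows, which is what makes the setup zero-shot. Asking for a fixed final line is one simple way to score free-form reasoning against an answer key.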
Impressively, PH-LLM scored 79% on the sleep exam and 88% on the fitness exam, outperforming the average scores of a sample group of human experts: five professional sports trainers (average experience 13.8 years) and five sleep medicine experts (average experience 25 years). The human experts averaged 76% in sleep and 71% in fitness.
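The size of the gap differs sharply between the two exams, which a quick calculation makes plain. The short Python sketch below uses only the four percentages reported above; the dictionary layout and the function name are illustrative choices.

```python
# Exam scores in percent, as reported in this article.
scores = {
    "sleep": {"ph_llm": 79, "experts": 76},
    "fitness": {"ph_llm": 88, "experts": 71},
}


def margin(domain: str) -> int:
    """Percentage-point lead of PH-LLM over the expert average."""
    return scores[domain]["ph_llm"] - scores[domain]["experts"]


for domain in scores:
    print(f"{domain}: PH-LLM leads by {margin(domain)} percentage points")
# sleep: PH-LLM leads by 3 percentage points
# fitness: PH-LLM leads by 17 percentage points
```

The model's lead is therefore narrow on the sleep exam and much wider on the fitness exam, and each human average rests on only five experts.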
Researchers noted, "While further development and evaluation are needed in the personal health domain, these results demonstrate the broad knowledge base and capabilities of the Gemini model."
To achieve these results, researchers first created and curated three datasets for testing personalized insights and recommendations from wearables, domain-specific knowledge, and predictions of self-reported sleep quality. They collaborated with domain experts to create 857 case studies representing real-world scenarios in sleep and fitness. Sleep scenarios used individual metrics to identify potential factors and provide personalized advice to improve sleep quality. Fitness tasks utilized information from training, sleep, health indicators, and user feedback to recommend physical activity intensity for a given day.
Both types of case studies included wearable sensor data (up to 29 days for sleep and over 30 days for fitness), demographic information (age and gender), and expert analysis.
Researchers acknowledged that PH-LLM is just a beginning and, like any emerging technology, has issues to address. For instance, the model's responses are not always consistent, with "significant variability" in fictional case studies, and the LLM can sometimes be overly conservative or cautious in its answers. In fitness case studies, the model was highly sensitive to overtraining, and in one instance, a human expert noted that it failed to identify sleep deprivation as a potential cause of injury. Additionally, the case studies were drawn broadly across demographics and from relatively active individuals, so they may not fully represent the population or address broader sleep and fitness issues.