🩺 Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment


University of Georgia · UC Merced · Lehigh University
University of Queensland · National University of Singapore · UCLA

We introduce the Medical Knowledge Judgment (MKJ) Dataset 🩺, designed to evaluate LLMs' factual medical knowledge through one-hop judgment tasks. Built from the Unified Medical Language System (UMLS), MKJ measures how well LLMs encode, retain, and recall fundamental medical facts.

Dataset Pipeline

Our findings reveal that LLMs struggle to retain factual medical knowledge, showing significant performance variance across semantic categories, particularly for rare medical conditions, and often exhibiting poor calibration in the form of overconfidence.

Abstract

Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex or multi-hop reasoning, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.

  • Medical Knowledge Assessment Challenge. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how well LLMs encode, retain, and recall fundamental medical facts. 🩺
  • MKJ Dataset. To bridge this gap, we introduce the Medical Knowledge Judgment Dataset (MKJ) 📊, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ is constructed from the Unified Medical Language System (UMLS), a large-scale repository of standardized biomedical vocabularies and knowledge graphs.
  • Binary Judgment Framework. We frame knowledge assessment as a binary judgment task, requiring LLMs to verify the correctness of medical statements extracted from reliable and structured knowledge sources. This approach isolates factual recall from complex reasoning abilities (a minimal prompt sketch follows this list). ⚖️
  • Key Findings. Our experiments reveal that LLMs struggle with factual medical knowledge retention, exhibiting significant performance variance across different semantic categories, particularly for rare medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers.
  • Retrieval-Augmented Solutions. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
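
To make the task format concrete, here is a minimal sketch of how a one-hop binary judgment query might be posed and scored. The prompt wording and the `query_model` callable are illustrative placeholders, not the exact prompt or API used in the paper.

```python
# Minimal sketch of a one-hop binary judgment query (illustrative prompt
# wording; `query_model` stands in for any chat-LLM call).

def build_judgment_prompt(statement: str) -> str:
    """Wrap a single medical statement in a true/false judgment instruction."""
    return (
        "Decide whether the following medical statement is factually correct.\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: True or False."
    )

def judge(statement: str, query_model) -> bool:
    """Return the model's binary verdict for one statement."""
    reply = query_model(build_judgment_prompt(statement))
    return reply.strip().lower().startswith("true")

# Usage with a stubbed model call:
if __name__ == "__main__":
    fake_model = lambda prompt: "True"
    print(judge("Metformin is used to treat type 2 diabetes.", fake_model))
```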
Dataset Construction

    Construction of the MKJ Dataset
    The Medical Knowledge Judgment (MKJ) dataset is specifically designed to measure LLMs' one-hop factual medical knowledge through binary judgment tasks. Built from the Unified Medical Language System (UMLS), it provides a comprehensive evaluation framework for medical knowledge assessment.

    Our dataset construction process involves:

    1. Extracting medical entities and relationships from UMLS knowledge graphs.
    2. Generating binary judgment statements using template-based approaches (see the sketch after this list).
    3. Creating balanced positive and negative examples across diverse semantic categories.
    4. Ensuring coverage of rare and common medical conditions.

    MKJ enables direct assessment of factual medical knowledge without complex reasoning requirements. 🩺
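
As a rough illustration of steps 1–3, the sketch below turns a few hypothetical UMLS-style (subject, relation, object) triples into balanced true/false statements. The sample triples, the relation templates, and the object-swap strategy for negatives are assumptions for illustration, not the exact MKJ pipeline.

```python
import random

# Hypothetical UMLS-style (subject, relation, object) triples; the real MKJ
# data is drawn from UMLS concepts and relations, not reproduced here.
TRIPLES = [
    ("Metformin", "may_treat", "Type 2 diabetes mellitus"),
    ("Insulin", "is_a", "Hormone"),
    ("Amoxicillin", "may_treat", "Otitis media"),
]

# One illustrative natural-language template per relation type.
TEMPLATES = {
    "may_treat": "{subj} may be used to treat {obj}.",
    "is_a": "{subj} is a kind of {obj}.",
}

def make_examples(triples, templates, seed=0):
    """Yield (statement, label) pairs: one true statement per triple, plus one
    false statement built by swapping in an object from a different triple."""
    rng = random.Random(seed)
    objects = [obj for _, _, obj in triples]
    for subj, rel, obj in triples:
        tmpl = templates[rel]
        yield tmpl.format(subj=subj, obj=obj), True
        wrong = rng.choice([o for o in objects if o != obj])
        yield tmpl.format(subj=subj, obj=wrong), False

if __name__ == "__main__":
    for statement, label in make_examples(TRIPLES, TEMPLATES):
        print(label, "-", statement)
```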

    Key Features:
    Comprehensive Coverage: Spans multiple medical semantic types
    Balanced Design: Equal distribution of true/false statements
    UMLS-based: Built from authoritative medical knowledge sources
    One-hop Focus: Isolates factual recall from reasoning abilities

    See more details in our paper

    Performance on Rare Medical Conditions

    Semantic Category Analysis
    Our evaluation reveals significant performance variance across different medical semantic types. LLMs demonstrate particular difficulty with rare medical conditions and specialized medical terminology, highlighting gaps in their medical knowledge retention.

    Performance breakdown across medical semantic categories

    Key Observations
    Neoplastic Processes: Models struggle with cancer-related terminology and relationships
    Clinical Drugs: Frequent confusion with drug compositions and contraindications
    Hormone Categories: Poor accuracy on endocrine system knowledge
    Common vs. Rare: Significant performance drop for uncommon medical conditions

    The figure shows example failures where GPT-4o-mini incorrectly judges medical statements across different semantic categories, demonstrating the challenge of factual medical knowledge retention in LLMs.
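
For readers who want to reproduce this kind of breakdown, here is a small sketch of per-semantic-type accuracy aggregation. The record field names (`semantic_type`, `label`, `prediction`) are assumed for illustration, not taken from the released dataset.

```python
from collections import defaultdict

def accuracy_by_semantic_type(records):
    """Compute accuracy per UMLS semantic type.

    `records` is an iterable of dicts with assumed keys:
    'semantic_type' (str), 'label' (bool), 'prediction' (bool).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        t = r["semantic_type"]
        total[t] += 1
        correct[t] += int(r["prediction"] == r["label"])
    return {t: correct[t] / total[t] for t in total}

# Toy usage:
records = [
    {"semantic_type": "Neoplastic Process", "label": True, "prediction": False},
    {"semantic_type": "Clinical Drug", "label": False, "prediction": False},
]
print(accuracy_by_semantic_type(records))
```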

    Model Calibration and RAG Enhancement

    Poor Calibration in Medical Knowledge
    Our analysis reveals that LLMs exhibit poor calibration when making medical judgments, often showing overconfidence in incorrect answers. This is particularly concerning for medical applications where uncertainty quantification is crucial.
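
One common way to quantify this miscalibration is expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy. The sketch below uses a standard 10-bin formulation; it is not necessarily the exact metric reported in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| per bin, weighted by the
    fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident toy example: ~90% stated confidence but only 50% correct.
print(expected_calibration_error([0.9, 0.95, 0.92, 0.88], [1, 0, 0, 1]))
```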

    Performance Comparison: Zero-shot vs. Retrieval-Augmented Generation (RAG)
| Model | Zero-shot Accuracy | RAG Accuracy | Improvement |
|---|---|---|---|
| GPT-4o-mini | 74.2% | 82.1% | +7.9% |
| GPT-4o-mini | 68.5% | 76.8% | +8.3% |
| Claude-3-Sonnet | 71.3% | 79.2% | +7.9% |
| Llama-3.1-8B | 62.1% | 71.4% | +9.3% |
| Qwen2.5-3B | 58.7% | 68.9% | +10.2% |
| Meditron-7B | 66.8% | 75.1% | +8.3% |

    Key Findings:
    RAG Effectiveness: Retrieval-augmented generation consistently improves performance across all tested models
    Calibration Improvement: RAG reduces overconfidence and provides better uncertainty estimates
    Medical Specialization: Even specialized medical models like Meditron benefit from external knowledge retrieval
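
As a rough sketch of the retrieval-augmented setup, the snippet below prepends a few retrieved knowledge snippets (e.g., UMLS-style definitions) to the judgment prompt. The toy `retrieve` function and its in-memory corpus are placeholders for a real retriever such as BM25 or a dense index, not the system used in the paper.

```python
def retrieve(statement: str, k: int = 3) -> list[str]:
    """Placeholder retriever: return up to k knowledge snippets relevant to
    the statement. Swap in BM25 or dense retrieval over a real corpus."""
    corpus = {
        "metformin": "Metformin is a first-line oral medication for type 2 diabetes.",
        "insulin": "Insulin is a peptide hormone that regulates blood glucose.",
    }
    hits = [text for key, text in corpus.items() if key in statement.lower()]
    return hits[:k]

def rag_judgment_prompt(statement: str) -> str:
    """Prepend retrieved context to the binary judgment instruction."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(statement))
    return (
        "Use the context to decide whether the statement is factually correct.\n"
        f"Context:\n{context}\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: True or False."
    )

print(rag_judgment_prompt("Metformin may be used to treat type 2 diabetes."))
```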

    BibTeX

    @article{li2025fact,
      title={Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment},
      author={Li, Jiaxi and Wang, Yiwei and Zhang, Kai and Cai, Yujun and Hooi, Bryan and Peng, Nanyun and Chang, Kai-Wei and Lu, Jin},
      journal={arXiv preprint arXiv:2502.14275},
      year={2025}
    }