Large language models (LLMs) have been widely adopted across downstream domains, yet their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex, multi-hop reasoning, making it difficult to isolate an LLM's inherent medical knowledge from its reasoning capabilities.
Construction of the MKJ Dataset
The Medical Knowledge Judgment (MKJ) dataset is specifically designed to measure LLMs' one-hop factual medical knowledge through binary judgment tasks. Built from the Unified Medical Language System (UMLS), it provides a comprehensive framework for assessing medical knowledge.
Our dataset construction process extracts one-hop relational facts from UMLS and converts them into balanced true/false judgment statements; a minimal construction sketch follows the feature list below.
Key Features:
• Comprehensive Coverage: Spans multiple medical semantic types
• Balanced Design: Equal distribution of true/false statements
• UMLS-based: Built from authoritative medical knowledge sources
• One-hop Focus: Isolates factual recall from reasoning abilities
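To make the one-hop judgment format concrete, the sketch below shows one way a UMLS-style (head, relation, tail) triple could be turned into a paired true/false statement. The relation templates, field names, and the distractor-swap strategy for false statements are illustrative assumptions, not the released MKJ pipeline.

```python
# Illustrative sketch only: templates, relation names, and the distractor-swap
# strategy are assumptions, not the released MKJ construction pipeline.
RELATION_TEMPLATES = {
    "isa": "{head} is a kind of {tail}.",
    "has_finding_site": "{head} is found in the {tail}.",
    "may_treat": "{head} may be used to treat {tail}.",
}

def make_judgment_pair(triple, distractor_tail):
    """Turn one (head, relation, tail) triple into a true/false statement pair."""
    head, relation, tail = triple
    template = RELATION_TEMPLATES[relation]
    true_item = {"statement": template.format(head=head, tail=tail), "label": True}
    # False counterpart: swap in a distractor entity of the same semantic type,
    # which keeps the true/false distribution balanced by construction.
    false_item = {"statement": template.format(head=head, tail=distractor_tail), "label": False}
    return true_item, false_item

if __name__ == "__main__":
    pair = make_judgment_pair(
        ("Metformin", "may_treat", "type 2 diabetes mellitus"),
        distractor_tail="rheumatoid arthritis",
    )
    for item in pair:
        print(item)
```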
Semantic Category Analysis
Our evaluation reveals significant performance variance across different medical semantic types. LLMs demonstrate particular difficulty with rare medical conditions and specialized medical terminology, highlighting gaps in their medical knowledge retention.
Key Observations
• Neoplastic Processes: Models struggle with cancer-related terminology and relationships
• Clinical Drugs: Frequent confusion with drug compositions and contraindications
• Hormone Categories: Poor accuracy on endocrine system knowledge
• Common vs. Rare: Significant performance drop for uncommon medical conditions
The figure shows example failures where GPT-4o-mini incorrectly judges medical statements across different semantic categories, demonstrating the challenge of factual medical knowledge retention in LLMs.
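As a companion to the per-category observations above, here is a minimal sketch of how accuracy can be aggregated by UMLS semantic type. The record fields (`semantic_type`, `label`, `prediction`) are assumed names, not the released evaluation schema.

```python
# Minimal per-semantic-type accuracy aggregation; field names are assumptions.
from collections import defaultdict

def accuracy_by_semantic_type(records):
    """records: iterable of dicts with 'semantic_type', 'label', and 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["semantic_type"]] += 1
        correct[r["semantic_type"]] += int(r["prediction"] == r["label"])
    return {t: correct[t] / total[t] for t in total}

example = [
    {"semantic_type": "Neoplastic Process", "label": True, "prediction": False},
    {"semantic_type": "Clinical Drug", "label": False, "prediction": False},
    {"semantic_type": "Hormone", "label": True, "prediction": True},
]
print(accuracy_by_semantic_type(example))
```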
Poor Calibration in Medical Knowledge
Our analysis reveals that LLMs exhibit poor calibration when making medical judgments, often showing overconfidence in incorrect answers. This is particularly concerning for medical applications where uncertainty quantification is crucial.
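Calibration here refers to how well a model's stated confidence matches its empirical accuracy; a standard way to quantify the mismatch is expected calibration error (ECE). The snippet below is a minimal, generic ECE computation for binary judgments; the bin count and confidence-elicitation method are assumptions, not the paper's exact protocol.

```python
# Generic expected calibration error (ECE) for binary judgments; bin count and
# confidence elicitation are assumptions, not the paper's exact protocol.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model's probability for its chosen answer; correct: 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Weight each bin's |confidence - accuracy| gap by its share of samples.
            ece += in_bin.mean() * abs(confidences[in_bin].mean() - correct[in_bin].mean())
    return ece

# Overconfident toy example: high stated confidence, only half correct.
print(expected_calibration_error([0.95, 0.9, 0.92, 0.88], [1, 0, 1, 0]))
```

The table below compares zero-shot and retrieval-augmented accuracy for the evaluated models.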
| Model | Zero-shot Accuracy | RAG Accuracy | Improvement |
|---|---|---|---|
| GPT-4o-mini | 74.2% | 82.1% | +7.9% |
| GPT-4o-mini | 68.5% | 76.8% | +8.3% |
| Claude-3-Sonnet | 71.3% | 79.2% | +7.9% |
| Llama-3.1-8B | 62.1% | 71.4% | +9.3% |
| Qwen2.5-3B | 58.7% | 68.9% | +10.2% |
| Meditron-7B | 66.8% | 75.1% | +8.3% |
Key Findings:
• RAG Effectiveness: Retrieval-augmented generation consistently improves performance across all tested models (see the retrieval sketch after this list)
• Calibration Improvement: RAG reduces overconfidence and provides better uncertainty estimates
• Medical Specialization: Even specialized medical models like Meditron benefit from external knowledge retrieval
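For reference, a hedged sketch of the retrieval-augmented judgment setup is shown below. The `retrieve_passages` function is a hypothetical stand-in for whatever retriever is actually used, the prompt wording is illustrative rather than the paper's exact template, and the OpenAI client call merely exemplifies querying a chat model such as GPT-4o-mini.

```python
# Hedged sketch of retrieval-augmented judgment. `retrieve_passages` is a
# hypothetical stand-in for a real retriever, and the prompt wording is
# illustrative rather than the paper's exact template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve_passages(statement, k=3):
    # Placeholder retriever: replace with BM25 or a dense index over a medical corpus.
    return [f"<retrieved medical passage {i + 1}>" for i in range(k)]

def judge_statement(statement, use_rag=True):
    """Return True if the model judges the statement true, False otherwise."""
    context = "\n".join(retrieve_passages(statement)) if use_rag else ""
    prompt = (f"Context:\n{context}\n\n" if context else "") + (
        "Is the following medical statement true or false? "
        f"Answer 'true' or 'false'.\nStatement: {statement}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("true")

print(judge_statement("Metformin may be used to treat type 2 diabetes mellitus."))
```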
@article{li2025fact,
title={Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment},
author={Li, Jiaxi and Wang, Yiwei and Zhang, Kai and Cai, Yujun and Hooi, Bryan and Peng, Nanyun and Chang, Kai-Wei and Lu, Jin},
journal={arXiv preprint arXiv:2502.14275},
year={2025}
}