🩺 Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment


University of Georgia · UC Merced · Lehigh University
University of Queensland · National University of Singapore · UCLA

We introduce the Medical Knowledge Judgment (MKJ) Dataset 🩺, designed to evaluate LLMs' factual medical knowledge through one-hop judgment tasks. Built from the Unified Medical Language System (UMLS), MKJ measures how well LLMs encode, retain, and recall fundamental medical facts.

Dataset Pipeline

Our findings reveal that LLMs struggle to retain factual medical knowledge, showing significant performance variance across semantic categories, particularly for rare medical conditions, and often exhibiting poor calibration in the form of overconfidence.

Abstract

Large language models (LLMs) have been widely adopted in various downstream task domains. However, their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex or multi-hop reasoning, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.

  • Medical Knowledge Assessment Challenge. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate how well LLMs encode, retain, and recall fundamental medical facts. 🩺
  • MKJ Dataset. To bridge this gap, we introduce the Medical Knowledge Judgment Dataset (MKJ) 📊, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge. MKJ is constructed from the Unified Medical Language System (UMLS), a large-scale repository of standardized biomedical vocabularies and knowledge graphs.
  • Binary Judgment Framework. We frame knowledge assessment as a binary judgment task, requiring LLMs to verify the correctness of medical statements extracted from reliable and structured knowledge sources. This approach isolates factual recall from complex reasoning abilities (a minimal prompt sketch follows this list). ⚖️
  • Key Findings. Our experiments reveal that LLMs struggle with factual medical knowledge retention, exhibiting significant performance variance across different semantic categories, particularly for rare medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers.
  • Retrieval-Augmented Solutions. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
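
To make the task format concrete, here is a minimal sketch of how a one-hop binary judgment query might be posed and scored. The prompt wording and the `query_model` callable are illustrative placeholders, not the exact prompt or API used in the paper.

```python
# Minimal sketch of a one-hop binary judgment query (illustrative prompt
# wording; `query_model` stands in for any chat-LLM call).

def build_judgment_prompt(statement: str) -> str:
    """Wrap a single medical statement in a true/false judgment instruction."""
    return (
        "Decide whether the following medical statement is factually correct.\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: True or False."
    )

def judge(statement: str, query_model) -> bool:
    """Return the model's binary verdict for one statement."""
    reply = query_model(build_judgment_prompt(statement))
    return reply.strip().lower().startswith("true")

# Usage with a stubbed model call:
if __name__ == "__main__":
    fake_model = lambda prompt: "True"
    print(judge("Metformin is used to treat type 2 diabetes.", fake_model))
```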
Dataset Construction

    Construction of the MKJ Dataset
    The Medical Knowledge Judgment (MKJ) dataset is specifically designed to measure LLMs' one-hop factual medical knowledge through binary judgment tasks. Built from the Unified Medical Language System (UMLS), it provides a comprehensive evaluation framework for medical knowledge assessment.

    Our dataset construction process involves:

    1. Extracting medical entities and relationships from UMLS knowledge graphs.
    2. Generating binary judgment statements using template-based approaches (see the sketch after this list).
    3. Creating balanced positive and negative examples across diverse semantic categories.
    4. Ensuring coverage of rare and common medical conditions.

    MKJ enables direct assessment of factual medical knowledge without complex reasoning requirements. 🩺
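
As a rough illustration of steps 1–3, the sketch below turns a few hypothetical UMLS-style (subject, relation, object) triples into balanced true/false statements. The sample triples, the relation templates, and the object-swap strategy for negatives are assumptions for illustration, not the exact MKJ pipeline.

```python
import random

# Hypothetical UMLS-style (subject, relation, object) triples; the real MKJ
# data is drawn from UMLS concepts and relations, not reproduced here.
TRIPLES = [
    ("Metformin", "may_treat", "Type 2 diabetes mellitus"),
    ("Insulin", "is_a", "Hormone"),
    ("Amoxicillin", "may_treat", "Otitis media"),
]

# One illustrative natural-language template per relation type.
TEMPLATES = {
    "may_treat": "{subj} may be used to treat {obj}.",
    "is_a": "{subj} is a kind of {obj}.",
}

def make_examples(triples, templates, seed=0):
    """Yield (statement, label) pairs: one true statement per triple, plus one
    false statement built by swapping in an object from a different triple."""
    rng = random.Random(seed)
    objects = [obj for _, _, obj in triples]
    for subj, rel, obj in triples:
        tmpl = templates[rel]
        yield tmpl.format(subj=subj, obj=obj), True
        wrong = rng.choice([o for o in objects if o != obj])
        yield tmpl.format(subj=subj, obj=wrong), False

if __name__ == "__main__":
    for statement, label in make_examples(TRIPLES, TEMPLATES):
        print(label, "-", statement)
```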

    Key Features:
    Comprehensive Coverage: Spans multiple medical semantic types
    Balanced Design: Equal distribution of true/false statements
    UMLS-based: Built from authoritative medical knowledge sources
    One-hop Focus: Isolates factual recall from reasoning abilities

    See more details in our paper

    Performance on Rare Medical Conditions

    Semantic Category Analysis
    Our evaluation reveals significant performance variance across different medical semantic types. LLMs demonstrate particular difficulty with rare medical conditions and specialized medical terminology, highlighting gaps in their medical knowledge retention.

    Performance breakdown across medical semantic categories

    Key Observations
    Neoplastic Processes: Models struggle with cancer-related terminology and relationships
    Clinical Drugs: Frequent confusion with drug compositions and contraindications
    Hormone Categories: Poor accuracy on endocrine system knowledge
    Common vs. Rare: Significant performance drop for uncommon medical conditions

    The figure shows example failures where GPT-4o-mini incorrectly judges medical statements across different semantic categories, demonstrating the challenge of factual medical knowledge retention in LLMs.
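
For readers who want to reproduce this kind of breakdown, here is a small sketch of per-semantic-type accuracy aggregation. The record field names (`semantic_type`, `label`, `prediction`) are assumed for illustration, not taken from the released dataset.

```python
from collections import defaultdict

def accuracy_by_semantic_type(records):
    """Compute accuracy per UMLS semantic type.

    `records` is an iterable of dicts with assumed keys:
    'semantic_type' (str), 'label' (bool), 'prediction' (bool).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        t = r["semantic_type"]
        total[t] += 1
        correct[t] += int(r["prediction"] == r["label"])
    return {t: correct[t] / total[t] for t in total}

# Toy usage:
records = [
    {"semantic_type": "Neoplastic Process", "label": True, "prediction": False},
    {"semantic_type": "Clinical Drug", "label": False, "prediction": False},
]
print(accuracy_by_semantic_type(records))
```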

    Model Calibration and RAG Enhancement

    Poor Calibration in Medical Knowledge
    Our analysis reveals that LLMs exhibit poor calibration when making medical judgments, often showing overconfidence in incorrect answers. This is particularly concerning for medical applications where uncertainty quantification is crucial.
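
One common way to quantify this miscalibration is expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy. The sketch below uses a standard 10-bin formulation; it is not necessarily the exact metric reported in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| per bin, weighted by the
    fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Overconfident toy example: ~90% stated confidence but only 50% correct.
print(expected_calibration_error([0.9, 0.95, 0.92, 0.88], [1, 0, 0, 1]))
```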

    Performance Comparison: Zero-shot vs. Retrieval-Augmented Generation (RAG)
| Model | Zero-shot Accuracy | RAG Accuracy | Improvement |
|---|---|---|---|
| GPT-4o-mini | 74.2% | 82.1% | +7.9% |
| GPT-4o-mini | 68.5% | 76.8% | +8.3% |
| Claude-3-Sonnet | 71.3% | 79.2% | +7.9% |
| Llama-3.1-8B | 62.1% | 71.4% | +9.3% |
| Qwen2.5-3B | 58.7% | 68.9% | +10.2% |
| Meditron-7B | 66.8% | 75.1% | +8.3% |

    Key Findings:
    RAG Effectiveness: Retrieval-augmented generation consistently improves performance across all tested models
    Calibration Improvement: RAG reduces overconfidence and provides better uncertainty estimates
    Medical Specialization: Even specialized medical models like Meditron benefit from external knowledge retrieval
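
As a rough sketch of the retrieval-augmented setup, the snippet below prepends a few retrieved knowledge snippets (e.g., UMLS-style definitions) to the judgment prompt. The toy `retrieve` function and its in-memory corpus are placeholders for a real retriever such as BM25 or a dense index, not the system used in the paper.

```python
def retrieve(statement: str, k: int = 3) -> list[str]:
    """Placeholder retriever: return up to k knowledge snippets relevant to
    the statement. Swap in BM25 or dense retrieval over a real corpus."""
    corpus = {
        "metformin": "Metformin is a first-line oral medication for type 2 diabetes.",
        "insulin": "Insulin is a peptide hormone that regulates blood glucose.",
    }
    hits = [text for key, text in corpus.items() if key in statement.lower()]
    return hits[:k]

def rag_judgment_prompt(statement: str) -> str:
    """Prepend retrieved context to the binary judgment instruction."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(statement))
    return (
        "Use the context to decide whether the statement is factually correct.\n"
        f"Context:\n{context}\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: True or False."
    )

print(rag_judgment_prompt("Metformin may be used to treat type 2 diabetes."))
```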

    BibTeX

    @article{li2025fact,
      title={Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment},
      author={Li, Jiaxi and Wang, Yiwei and Zhang, Kai and Cai, Yujun and Hooi, Bryan and Peng, Nanyun and Chang, Kai-Wei and Lu, Jin},
      journal={arXiv preprint arXiv:2502.14275},
      year={2025}
    }