Enhancing Vision-Language Models in Biomedicine with Fine-Grained PMC Data

1 Background
Biomedical data is inherently multimodal, encompassing both physical measurements and textual descriptions. To support research in this domain, researchers [1] introduced PMC-15M, a large-scale biomedical dataset comprising 15 million image-text pairs extracted from 4.4 million scientific articles. Building upon this dataset, BiomedCLIP was developed as a multimodal foundation model, incorporating domain-specific adaptations for biomedical vision-language tasks. BiomedCLIP demonstrated strong performance across a range of benchmarks, including image-text retrieval, classification, and visual question answering, establishing new state-of-the-art results and significantly improving upon previous methods.
However, approximately 50% of the figures in PMC-15M are compound figures, which degrade the performance of CLIP-based models because a single caption describes several heterogeneous panels at once. To address this, the authors of [1] propose a data processing pipeline that decomposes compound figures into individual sub-figures along with their corresponding captions. Using this enhanced curation process, the PMC-Fine-Grained-46M dataset was constructed by extracting more precise image-text pairs from PMC-15M. However, PMC-Fine-Grained-46M has not been publicly released.
2 Details
We reviewed relevant research and designed a data processing pipeline to decompose compound figures together with their captions. First, an object detection model (YOLO) was trained to locate and crop individual sub-figures. Next, the Qwen2.5 large language model (LLM) was used to split the compound figure's caption into sub-captions based on explicit sub-figure labels (e.g., (a), (b), (c)). Finally, a multimodal LLM was used to match the decomposed sub-figures with their corresponding sub-captions.
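Below is a minimal sketch of the first two pipeline steps, assuming an Ultralytics YOLO detector fine-tuned for sub-figure detection and a prompt-based caption splitter. The weights path, prompt wording, and the multimodal matching step (step 3) are illustrative assumptions rather than our exact implementation.

```python
# Illustrative sketch of sub-figure detection and caption splitting.
# "subfigure_yolo.pt" is a hypothetical path to fine-tuned detector weights.
from ultralytics import YOLO
from PIL import Image

def crop_subfigures(figure_path, detector_weights="subfigure_yolo.pt"):
    """Step 1: detect sub-figure bounding boxes and crop them out."""
    detector = YOLO(detector_weights)
    image = Image.open(figure_path).convert("RGB")
    result = detector(figure_path)[0]              # Results for one image
    crops = []
    for box in result.boxes.xyxy.tolist():         # [x1, y1, x2, y2] per sub-figure
        x1, y1, x2, y2 = map(int, box)
        crops.append(image.crop((x1, y1, x2, y2)))
    return crops

def caption_split_prompt(compound_caption):
    """Step 2: build a prompt for an instruction-tuned LLM (e.g. Qwen2.5) that
    splits the caption into sub-captions keyed by explicit labels."""
    return (
        "Split the following figure caption into sub-captions, one per "
        "sub-figure label ((a), (b), (c), ...). Return a JSON object that "
        f"maps each label to its sub-caption.\n\nCaption: {compound_caption}"
    )

# Step 3 (not shown): a multimodal LLM scores each (crop, sub-caption) pair
# and keeps the best-matching assignment.
```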
3 Train & Evaluation
We constructed a fine-grained PMC dataset using the data processing pipeline described above. During preprocessing, we applied a series of filtering criteria to ensure data quality: samples were removed when the sub-caption contained fewer than 10 characters or when the sub-figure resolution was too low to be meaningful for analysis. After cleaning, the dataset was split into training, validation, and test sets containing approximately 94.91%, 0.09%, and 5% of the samples, respectively.
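As an illustration, the filtering step can be expressed as a simple predicate over each candidate pair. The caption-length threshold comes from the description above, while the minimum-resolution value and the field names are assumptions.

```python
# Quality filter for (sub-figure, sub-caption) pairs.
MIN_CAPTION_CHARS = 10   # stated threshold: drop sub-captions under 10 characters
MIN_SIDE_PIXELS = 64     # assumed lower bound on sub-figure resolution

def keep_sample(sample):
    """Return True if the pair passes the quality filter; `sample` is assumed
    to carry the sub-caption text and the sub-figure's width/height."""
    caption_ok = len(sample["sub_caption"].strip()) >= MIN_CAPTION_CHARS
    resolution_ok = min(sample["width"], sample["height"]) >= MIN_SIDE_PIXELS
    return caption_ok and resolution_ok
```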
The training paradigm follows a continual pre-training (CPT) approach built on top of the BiomedCLIP model. We used the OpenCLIP framework to implement the pre-training, which performs contrastive learning between biomedical sub-figures and their corresponding sub-captions.
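A hedged sketch of one CPT step with OpenCLIP is shown below: the model is initialized from the BiomedCLIP checkpoint and optimized with the standard symmetric contrastive loss over sub-figure/sub-caption batches. The Hugging Face hub name, learning rate, and the hand-rolled loss are illustrative; the actual runs rely on OpenCLIP's training scripts.

```python
# Sketch of continual pre-training (CPT) on top of BiomedCLIP via OpenCLIP.
import torch
import torch.nn.functional as F
import open_clip

HUB_NAME = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, _, preprocess = open_clip.create_model_and_transforms(HUB_NAME)
tokenizer = open_clip.get_tokenizer(HUB_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed settings

def cpt_step(images, sub_captions):
    """One contrastive step; `images` is a batch of tensors produced by
    `preprocess`, `sub_captions` a list of strings."""
    texts = tokenizer(sub_captions)
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(texts), dim=-1)
    logits = model.logit_scale.exp() * img_emb @ txt_emb.t()
    labels = torch.arange(len(sub_captions))
    # Symmetric InfoNCE loss over image->text and text->image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```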
Cross-modal Retrieval
To evaluate the model's cross-modal retrieval capability, we followed the methodology outlined in previous research [1]. Specifically, we precomputed the embeddings of both figures and their corresponding captions, and then performed an approximate nearest neighbor search. The pretrained image encoder and text encoder from the CLIP model were used to extract the embeddings for figures and captions, respectively.
For figure-to-text retrieval, given a figure embedding, we retrieved the top-k nearest neighbors from the caption embedding space in the test dataset. Conversely, for text-to-figure retrieval, we used a caption embedding to identify the most similar figure embeddings. The primary evaluation metric used was the top-k recall, denoted as Recall@k, which measures whether the original caption for the figure (or the original figure for the caption) is within the top-k retrieved candidates. This bidirectional retrieval evaluation provides a comprehensive assessment of the model's alignment between visual and textual modalities.
Cross-modal retrieval with the brute-force search in the OpenCLIP framework is \(O(n^2)\), as it computes pairwise similarities between all figure and caption embeddings, which becomes prohibitive for large-scale datasets. To address this scalability challenge, we integrated the Faiss library to accelerate the nearest neighbor search. Faiss provides optimized indexing structures and approximate nearest neighbor algorithms. In our experiments, Faiss matched the retrieval accuracy of the brute-force search used in the original OpenCLIP implementation while drastically reducing computation time and memory consumption, making cross-modal retrieval over large-scale datasets practical.
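The sketch below shows how Recall@k can be computed with a Faiss index over precomputed, L2-normalized embeddings (so that inner product equals cosine similarity). The exact-search `IndexFlatIP` is used for clarity; an IVF index could be swapped in for approximate search on larger galleries.

```python
# Recall@k via Faiss over precomputed figure/caption embeddings.
import numpy as np
import faiss

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose ground-truth item (same row index in the
    gallery) appears among the top-k retrieved candidates."""
    index = faiss.IndexFlatIP(gallery_emb.shape[1])     # exact inner-product search
    index.add(gallery_emb.astype(np.float32))
    _, topk = index.search(query_emb.astype(np.float32), k)
    hits = (topk == np.arange(len(query_emb))[:, None]).any(axis=1)
    return float(hits.mean())

# img2txt: figures as queries, captions as the gallery (reverse for txt2img):
# recall_at_k(figure_emb, caption_emb, k=1); recall_at_k(caption_emb, figure_emb, k=1)
```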
The experimental results for cross-modal retrieval are as follows:
| Model | img2txt Recall@1 | txt2img Recall@1 |
| --- | --- | --- |
| BiomedCLIP | 33.54% | 34.01% |
| FineGrainedPMC-CLIP | 43.68% | 43.88% |
Biomedical image classification
Following the methodology established in prior work [1], we evaluated the performance of our Fine-Grained CLIP model on biomedical image classification tasks across three benchmark datasets. Our evaluation includes both zero-shot and full-shot learning settings.
Zero-Shot Classification
In the zero-shot setting, the model performs classification without any fine-tuning on the target dataset. Instead, it leverages a short, descriptive text prompt for each class to guide classification. These prompts follow the template used in [1]. The model computes the similarity between the image embedding and the class text embeddings, and the class with the highest similarity is selected as the prediction. We use accuracy as the evaluation metric in this setting.
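A minimal sketch of the zero-shot procedure, reusing the open_clip model and tokenizer from the CPT sketch above, is given below; the prompt template shown here is a placeholder for the templates of [1].

```python
# Zero-shot classification by image-text similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(model, tokenizer, images, class_names,
                      template="this is a photo of {}"):
    """Return the index of the most similar class prompt for each image."""
    texts = tokenizer([template.format(name) for name in class_names])
    txt_emb = F.normalize(model.encode_text(texts), dim=-1)
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    similarity = img_emb @ txt_emb.t()   # cosine similarity to each class prompt
    return similarity.argmax(dim=-1)
```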
Full-Shot Classification
In the full-shot learning experiments, we freeze the pretrained image encoder (the "image tower") and append two fully connected (dense) layers for classification. Only this classification head is trained on the training set, and the resulting model is evaluated on the test set. Accuracy remains the primary metric for assessing performance.
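The sketch below shows one way to set up this frozen-tower head in PyTorch on top of an open_clip model; the hidden width and the exact head architecture are assumptions.

```python
# Frozen image tower + two dense layers for full-shot classification.
import torch
import torch.nn as nn

class FrozenTowerClassifier(nn.Module):
    def __init__(self, clip_model, embed_dim, num_classes, hidden_dim=512):
        super().__init__()
        self.visual = clip_model.visual
        for p in self.visual.parameters():       # keep the image tower frozen
            p.requires_grad = False
        self.head = nn.Sequential(               # the two trainable dense layers
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, images):
        with torch.no_grad():
            features = self.visual(images)       # pretrained image embeddings
        return self.head(features)
```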
Datasets
PatchCamelyon (PCam): Comprises 327,680 color images extracted from histopathological scans of lymph node sections.
LC25000-Lung: A subset of the LC25000 dataset, consisting of histopathological images of lung tissue. The LC25000 dataset includes a total of 25,000 images.
LC25000-Colon: Another subset of LC25000, containing colon tissue images.
Experimental Results
Zero-shot learning results:
| Dataset / Model | BiomedCLIP | FineGrainedPMC-CLIP |
| --- | --- | --- |
| PCam | 69.56% | 71.77% |
| LC25000-Lung | 60.21% | 65.62% |
| LC25000-Colon | 91.81% | 91.75% |
Full-shot learning results:
| Dataset / Model | BiomedCLIP | FineGrainedPMC-CLIP |
| --- | --- | --- |
| PCam | 84.76% | 84.83% |
| LC25000-Lung | 97.08% | 96.88% |
| LC25000-Colon | 98.31% | 99.19% |
The results show that FineGrainedPMC-CLIP outperforms BiomedCLIP by a wide margin on cross-modal retrieval and achieves higher accuracy on most classification datasets in both the zero-shot and full-shot settings, while remaining comparable on the rest (zero-shot LC25000-Colon and full-shot LC25000-Lung). This indicates that the fine-grained training pipeline enhances the model's ability to understand and classify biomedical imagery.
4 References
[1] Zhang, Sheng, et al. “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.” arXiv preprint arXiv:2303.00915 (2023).