Comparing OpenEvidence and ChatGPT: Evaluating Adherence to NCCN Melanoma Guidelines for Staging, Workup, and Treatment Options

Christina Bear · 26 min read

How we compared ChatGPT and OpenEvidence on their ability to provide accurate, guideline-based medical advice for melanoma staging, work-up, and treatment.

Introduction

Large language models (LLMs) such as ChatGPT (GPT-4) are increasingly being explored as tools to assist clinicians by generating evidence-based answers to medical questions, including staging, workup, and treatment recommendations. However, while LLMs excel at producing fluent, contextually appropriate responses, they may sometimes generate outdated or incorrect information if not carefully aligned to current guidelines. In contrast, tools like OpenEvidence are explicitly designed to cite and link to the latest medical literature, improving transparency and traceability, but they can be slower or less conversational in their output.

In this post, we evaluate two leading AI-based clinical tools, ChatGPT (GPT-4) and OpenEvidence, on their ability to provide accurate and up-to-date melanoma staging, workup, and treatment recommendations according to the latest clinical guidelines. By comparing the strengths and limitations of each approach, we aim to highlight how these technologies perform in a guideline-driven clinical context.

What We're Testing

The AI Models

  • ChatGPT (GPT-4): A general-purpose AI assistant that can answer questions on any topic

  • OpenEvidence: A specialized AI system designed specifically for medical knowledge and evidence-based medicine

The Question

Can these AI systems provide accurate, up-to-date medical advice that matches what expert physicians would recommend based on the latest clinical guidelines?

Methods

How We Tested the AI Systems

The Clinical Scenarios

We gave both AI systems the same three medical scenarios and asked them to provide staging and treatment recommendations. For each clinical scenario, we generated three independent responses (runs) from each model. This approach allowed us to measure not only the average performance but also the consistency and robustness of each model across multiple attempts. Here's an example of one of the prompts we used:

Actual Prompt Used:

  • Note: “Melanoma scenario” changed for each prompt

You are a dermatologist.

Respond in EXACTLY three sections, each beginning with the Markdown H2 header shown (## Stage, ## Work-up, ## Treatment). No other headers.

─────────────────────────────────────
Melanoma scenario
• Pathology: 1.4 mm Breslow thickness, non-ulcerated superficial-spreading melanoma on the anterior calf  
• Lymph nodes: no palpable inguinal or popliteal nodes  
• Imaging: none yet
─────────────────────────────────────

1. Stage – Provide the TNM designation and overall stage.  
2. Work-up – Summarize the recommended next diagnostic or staging steps.  
3. Treatment – Summarize the initial management you would recommend.

Cite your sources for each section, if possible, using references or links. Cite all sources inline in each section using raw identifiers only — write out full URLs, DOIs (doi:...), PMIDs (PMID: ...), or PMCs (PMC: ...), not hyperlinks or plain text.

Example Response from ChatGPT to above prompt:

Stage
The TNM classification for this melanoma scenario is as follows: T2a N0 M0. The "T2" indicates a Breslow thickness of more than 1.0 mm but not more than 2.0 mm, and "a" signifies that it is non-ulcerated. "N0" implies no regional lymph node metastasis, and "M0" indicates no distant metastasis. According to the American Joint Committee on Cancer (AJCC) 8th Edition Cancer Staging Manual, this corresponds to a Stage IB melanoma (doi:10.1002/cncr.36683).

Work-up
For a Stage IB melanoma, additional work-up may include a sentinel lymph node biopsy (SLNB) to evaluate the presence of metastasis in the regional lymph nodes, as the SLNB is the standard of care for melanomas with a Breslow thickness greater than 0.8 mm or with other high-risk features (doi:10.1200/JCO.2018.78.2005). Advanced imaging studies, such as PET/CT or MRI, are not routinely recommended unless there are clinical indications of metastasis (PMID: 31675400).

Treatment
Initial management for a Stage IB melanoma typically involves a wide local excision of the primary lesion with 1-2 cm margins, depending on the anatomical location and surgical considerations, to ensure complete removal of the melanoma (PMID: 29116527). If the SLNB is positive, further management may include a completion lymph node dissection or consideration of adjuvant therapy with immune checkpoint inhibitors such as pembrolizumab in certain cases (doi:10.1056/NEJMoa1709030). Regular follow-up for surveillance is also recommended to monitor for recurrence or new primary melanomas (PMID: 31675400).
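To make the multi-run setup concrete, here is a minimal sketch of how the repeated ChatGPT runs could be collected with the OpenAI Python SDK. The helper name and model string are illustrative, not the exact pipeline code; OpenEvidence responses were gathered separately through its own interface.

```python
# Illustrative sketch (not the exact pipeline code): send the same scenario
# prompt several times and keep each response as an independent run.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def collect_runs(scenario_prompt: str, n_runs: int = 3, model: str = "gpt-4") -> list[str]:
    """Return n_runs independent completions for one melanoma scenario prompt."""
    runs = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": scenario_prompt}],
        )
        runs.append(resp.choices[0].message.content)
    return runs
```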

The Gold Standard

For each scenario, we compared the AI responses against the official NCCN Clinical Practice Guidelines for Melanoma (v2.2024). These guidelines represent the consensus of expert physicians and are considered the standard of care.

Here are the exact gold standards we used for each scenario:

Stage 0/IA Melanoma (Melanoma in situ)

Stage 0: Melanoma in situ
Stage IA: Tumor <0.8 mm thick, no ulceration

Work-up
History and physical examination (H&P)
Routine imaging and laboratory tests not recommended
Imaging only if needed to evaluate specific signs or symptoms

Treatment
Wide excision (category 1 for stage IA)
Proceed to follow-up (ME-10)

Stage IB (T2a) Melanoma

Stage IB (T2a):
T2a: Tumor >1.0–2.0 mm thick without ulceration (Stage IB)

Work-up
History and physical examination (H&P)
Baseline imaging and laboratory tests not recommended, unless:
Needed for surgical planning
Prior to systemic treatment discussion/initiation
Imaging if needed to evaluate specific signs or symptoms
Discuss and offer sentinel node biopsy (SLNB)

Treatment
Wide excision (category 1)
Either without SLNB
Or with SLNB
If sentinel node negative →
Clinical trial for stage II
Or observation (ME-11)
Then proceed to follow-up (ME-10 and ME-11)
If sentinel node positive → proceed to Stage III workup and treatment (ME-5)

Stage II (T2b or higher) Melanoma

Stage II (T2b or higher):
T2b or higher: Tumor >1.0–2.0 mm thick with ulceration (T2b), or any thicker tumor

Work-up
History and physical examination (H&P)
Baseline imaging and laboratory tests not recommended, unless:
Needed for surgical planning
Prior to systemic treatment discussion/initiation
Imaging if needed to evaluate specific signs or symptoms
Discuss and offer sentinel node biopsy (SLNB)

Treatment
Wide excision (category 1)
Either without SLNB
Or with SLNB
If sentinel node negative →
Clinical trial for stage II
Or observation (ME-11)
Or for pathological stage IIB or IIC:
Pembrolizumab (category 1)
Nivolumab (category 1)
+/- primary tumor site radiation therapy (category 2B)
Then proceed to follow-up (ME-10 and ME-11)
If sentinel node positive → proceed to Stage III workup and treatment (ME-5)

How We Evaluated the Responses

1. Similarity Metrics (How Close to the Gold Standard?)

We used three different ways to measure how similar the AI responses were to the expert guidelines (a minimal computation sketch follows the list):

  • SBERT Similarity: Measures how similar the meaning is between answers (ignores exact words)

  • ROUGE Similarity: Measures how many words/phrases overlap between answers

  • BLEU Similarity: Measures exact word matching (very strict - low scores are normal)
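As promised above, here is a minimal sketch of how these three metrics can be computed against the gold standard. The specific libraries (sentence-transformers, rouge-score, NLTK), the SBERT model name, and the choice of ROUGE-L are assumptions for illustration, not necessarily the exact configuration we used.

```python
# Illustrative metric computation: SBERT cosine similarity, ROUGE-L, and BLEU.
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def similarity_scores(model_answer: str, gold_standard: str) -> dict:
    # SBERT: cosine similarity of sentence embeddings (meaning-level match)
    emb = _sbert.encode([model_answer, gold_standard], convert_to_tensor=True)
    sbert = util.cos_sim(emb[0], emb[1]).item()

    # ROUGE-L: longest-common-subsequence overlap (word/phrase-level match)
    rouge = rouge_scorer.RougeScorer(["rougeL"]).score(
        gold_standard, model_answer)["rougeL"].fmeasure

    # BLEU: exact n-gram matching (very strict, so low scores are expected)
    bleu = sentence_bleu([gold_standard.split()], model_answer.split(),
                         smoothing_function=SmoothingFunction().method1)
    return {"sbert": sbert, "rouge": rouge, "bleu": bleu}
```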

2. AI Physician Grading (Medical Expert Evaluation)

We created an AI "physician grader" that evaluates responses the way a clinician would. Its scores are meant to be compared against the other grading metrics and to support manual review by human physicians, which serves as the ultimate gold standard. Here's how it works:

What is a System Prompt?

Think of a system prompt as the "job description" we give to an AI. It tells the AI what role to play and how to behave. In our case, we told the AI: "You are a dermatologist and expert in melanoma. Grade this answer against the gold standard."

The Grading System

Our AI physician grader evaluates each response on two main categories:

Medical Accuracy (0-6 points total):

  • Stage: Is the cancer staging correct? (0-2 points)

  • Workup: Are the recommended tests appropriate? (0-2 points)

  • Treatment: Is the treatment plan correct? (0-2 points)

Communication Quality (0-10 points total):

  • Accuracy: Are the medical facts correct? (0-2 points)

  • Relevance: Does it answer the specific question? (0-2 points)

  • Depth: Is there enough detail? (0-2 points)

  • Clarity: Is it well-written and clear? (0-2 points)

  • Completeness: Does it cover everything needed? (0-2 points)

Total Score: 0-16 points (medical accuracy + communication quality)

3. Citation Analysis and Validity Checks

To evaluate the reliability and recency of the references provided by each model, we performed a detailed citation analysis for every answer. Our process, sketched in code after this list, included:

  • Extraction: All citations (DOIs, PMIDs, URLs) were extracted from each model's output for every run and prompt variant. Duplicate citations within runs were counted only once per model answer.

  • Validation: Each citation was checked for validity by attempting to resolve it via official registries (CrossRef, PubMed, or direct URL access). Citations that did not resolve or were not found in the registry were marked as invalid.

  • Recency: For valid citations, we extracted the publication year. Citations from before 2021 were flagged as 'old' to assess whether models referenced up-to-date literature.
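Here is the simplified sketch referenced above. It assumes the public CrossRef REST API and NCBI E-utilities endpoints; the regexes and error handling are stripped down for illustration.

```python
# Illustrative citation extraction and validation against public registries.
import re
import requests

DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\)\]]+")
PMID_RE = re.compile(r"PMID:?\s*(\d{6,9})", re.IGNORECASE)

def extract_citations(answer: str) -> dict:
    """Pull unique DOIs and PMIDs out of one model answer."""
    return {
        "dois": sorted({m.rstrip(".,;") for m in DOI_RE.findall(answer)}),
        "pmids": sorted(set(PMID_RE.findall(answer))),
    }

def check_doi(doi: str) -> tuple[bool, int | None]:
    """Resolve a DOI via CrossRef; return (is_valid, publication_year)."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if r.status_code != 200:
        return False, None
    date_parts = r.json()["message"].get("issued", {}).get("date-parts", [[None]])
    return True, date_parts[0][0]

def check_pmid(pmid: str) -> bool:
    """Check that a PMID exists via the NCBI E-utilities esummary endpoint."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=10,
    )
    result = r.json().get("result", {}) if r.status_code == 200 else {}
    return pmid in result and "error" not in result.get(pmid, {})

def is_old(year: int | None, cutoff: int = 2021) -> bool:
    # Citations published before the cutoff year are flagged as 'old'
    return year is not None and year < cutoff
```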

The Complete Results

Here's the full comparison table showing how both AI systems performed across all metrics:

| Metric | GPT-4 | OpenEvidence | Winner | What This Measures |
|---|---|---|---|---|
| SBERT Similarity | 0.717 ± 0.029 | 0.709 ± 0.040 | GPT-4 | How similar the meaning is |
| ROUGE Similarity | 0.156 ± 0.020 | 0.177 ± 0.020 | OpenEvidence | How much text overlaps |
| BLEU Similarity | 0.009 ± 0.003 | 0.018 ± 0.009 | OpenEvidence | Exact word matching |
| LLM Score (0-16) | 11.667 ± 1.414 | 12.889 ± 2.315 | OpenEvidence | Overall physician evaluation |
| LLM Normalized (0-1) | 0.729 ± 0.088 | 0.806 ± 0.145 | OpenEvidence | Physician score scaled to 0-1 |
| Section Score (0-6) | 4.222 ± 0.441 | 4.667 ± 0.866 | OpenEvidence | Medical accuracy only |
| Global Score (0-10) | 7.444 ± 1.014 | 8.222 ± 1.481 | OpenEvidence | Communication quality only |

Citation Results

| Model | Valid | Invalid | Old (<2021) |
|---|---|---|---|
| ChatGPT | 18/22 (81.8%) | 4/22 (18.2%) | 16/22 (72.7%) |
| OpenEvidence | 121/121 (100.0%) | 0/121 (0.0%) | 62/121 (51.2%) |
  • Valid: Citation resolves to a real reference (DOI/PMID/URL)

  • Invalid: Citation does not resolve, is fake, or not in the official registry (CrossRef/PubMed)

  • Old: Citation is from before 2021.

  • Note: see invalid citations below in supplemental material

Key Findings

OpenEvidence Wins 6 Out of 7 Metrics (85.7% Win Rate)

GPT-4 wins only 1 metric:

  • SBERT Similarity (0.717 vs 0.709) - semantic similarity

OpenEvidence wins:

  • ROUGE Similarity (0.177 vs 0.156) - text overlap

  • BLEU Similarity (0.018 vs 0.009) - exact word matching

  • LLM Score (12.9 vs 11.7) - physician evaluation

  • LLM Normalized (0.81 vs 0.73) - scaled physician evaluation

  • Section Score (4.7 vs 4.2) - medical accuracy

  • Global Score (8.2 vs 7.4) - communication quality

Key findings from the variability results (a small aggregation sketch follows this list):

  • OpenEvidence's LLM scores show higher variability for some prompts, indicating it sometimes gives a wider range of answer quality.

  • ChatGPT's scores are generally more consistent (lower standard deviation), but its average scores are often lower than OpenEvidence's.

  • For the easiest prompt (stage_0_ia), both models are highly consistent (very low standard deviation).

  • For more complex prompts (stage_ib_t2a, stage_ii_t2b_or_higher), OpenEvidence's higher mean is sometimes accompanied by higher variability, suggesting it occasionally produces both very strong and weaker answers.

  • Overall: OpenEvidence is more likely to produce top scores, but with a bit more spread; ChatGPT is steadier, but less likely to hit the highest marks.
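For reference, here is a small sketch of the aggregation behind these mean ± SD observations; the input layout is an assumption for illustration, not the exact pipeline code.

```python
# Illustrative aggregation: mean and standard deviation of LLM scores per
# (model, prompt) group across runs.
from collections import defaultdict
from statistics import mean, stdev

def summarize(scores: list[dict]) -> dict:
    """scores: e.g. [{"model": "ChatGPT", "prompt": "stage_ib_t2a", "llm_score": 11}, ...]"""
    grouped = defaultdict(list)
    for s in scores:
        grouped[(s["model"], s["prompt"])].append(s["llm_score"])
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in grouped.items()
    }
```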

Key findings from citation analysis:

  • OpenEvidence consistently provided valid citations: Across all runs and prompt variants, OpenEvidence produced only valid citations, with none found to be hallucinated or broken. This demonstrates the strength of specialized medical LLMs in evidence-based referencing.

  • ChatGPT occasionally hallucinated or provided broken citations: Several citations from ChatGPT did not resolve or were not found in official registries, highlighting a known limitation of general-purpose LLMs in generating reliable references.

  • Recency gap: Both models frequently cited older literature, but OpenEvidence had a higher proportion of recent (post-2020) references compared to ChatGPT.

What This Means

  1. OpenEvidence is more accurate - It provides more medically correct information

  2. OpenEvidence is more complete - It covers more of the required details

  3. OpenEvidence is clearer - It communicates medical information better

  4. OpenEvidence can give the best answers, but is less consistent - ChatGPT is more consistent, but rarely the most accurate

  5. OpenEvidence cites more extensively - It includes more references, which are usually valid, though sometimes slightly older

  6. Overall, specialized medical LLMs work better for this task - A system designed for medicine outperforms a general-purpose LLM

Conclusions

Key Insights

  • In straightforward melanoma clinical scenarios, the specialized LLM outperformed the general-purpose one. In our evaluation of clear-cut clinical scenarios involving melanoma staging, work-up, and treatment recommendations, OpenEvidence, a tool designed for evidence-based medicine, produced more accurate, complete, and guideline-concordant answers than the general-purpose LLM across most metrics.

  • Guideline adherence can be systematically assessed. Using a structured evaluation pipeline that combined semantic similarity, physician-style grading, and human validation allowed us to measure how closely LLM responses followed the NCCN melanoma guidelines.

  • Human oversight remains necessary. Even in these straightforward melanoma cases, both LLMs occasionally omitted important details or introduced minor inaccuracies, showing that expert review is still essential.

  • Evaluation frameworks are valuable for benchmarking. Our structured, multi-metric approach demonstrates that with appropriate tools and benchmarks, it is possible to meaningfully compare LLMs for specific clinical tasks such as interpreting melanoma guidelines.

Limitations

  • Focused on melanoma staging scenarios. Our evaluation was limited to straightforward melanoma cases and may not generalize to other skin cancers or more complex situations.

  • Emphasized guideline adherence over outcomes. We assessed how well LLMs followed guidelines but did not evaluate whether their recommendations would improve patient outcomes.

  • Did not measure clinical impact. The study did not test the effect of using LLMs in real clinical settings.

  • Single evaluation per model. Each LLM was tested on a single set of runs (three per scenario), which may not capture the full variability in performance.

Future Work

  • Broaden to other skin cancers and more complex/multi-step clinical scenarios.

  • Study clinical utility. Research should measure how LLM recommendations affect care quality, safety, and efficiency in practice.

  • Develop quality control tools. Building automated checks for LLM outputs could help maintain accuracy and reliability at scale.


The Technical Details (For Those Who Want to Dig Deeper)

The AI Grader

How We Built the AI Physician Grader

We used OpenAI's GPT-4 to create an AI "physician" that evaluates medical responses. Here's the system prompt we used:

You are a dermatologist and expert in melanoma. You will GRADE a model answer against the following gold standard excerpt from the most recent NCCN melanoma guidelines. Base your evaluation strictly on this reference.

GRADE the model answer on the following:

Section Accuracy (score each 0, 1, or 2 - WHOLE NUMBERS ONLY):
- Stage: 2 = fully correct, 1 = minor error/omission, 0 = major error/omission
- Workup: 2 = fully correct, 1 = minor error/omission, 0 = major error/omission
- Treatment: 2 = fully correct, 1 = minor error/omission, 0 = major error/omission

Global Criteria (score each 0, 1, or 2 - WHOLE NUMBERS ONLY):
1. ACCURACY: Is the medical information factually correct?
2. RELEVANCE: Does the answer address the specific question asked?
3. DEPTH: Does the answer provide sufficient detail and explanation?
4. CLARITY: Is the answer clearly written and easy to understand?
5. COMPLETENESS: Does the answer cover all necessary aspects of the question?

For each issue, be as specific as possible about what is incorrect, missing, or unclear.

Structured Outputs Explained

Instead of having the AI write free-form text like "This answer is pretty good but missing some details," we made it fill out a specific form with exact scores and specific issues. This ensures consistent, comparable results.

The AI returns graded results in this exact format:

{
  "section_accuracy": {
    "Stage": 2,
    "Workup": 1, 
    "Treatment": 2
  },
  "global": {
    "Accuracy": 2,
    "Relevance": 2,
    "Depth": 1,
    "Clarity": 2,
    "Completeness": 1
  },
  "issues": [
    "Treatment section omits specific adjuvant therapy options",
    "Workup section suggests unnecessary imaging"
  ]
}
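For completeness, here is an illustrative sketch of how such a grading call could be wired up with the OpenAI Python SDK and how the aggregate scores in the results tables are derived from the returned JSON. The model name and helper function are assumptions, not the exact pipeline code.

```python
# Illustrative grading call: system prompt + gold standard + model answer in,
# structured JSON scores out.
import json
from openai import OpenAI

client = OpenAI()

def grade_answer(grader_prompt: str, gold_standard: str, model_answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed grader model
        messages=[
            {"role": "system", "content": grader_prompt},
            {"role": "user", "content": (
                f"GOLD STANDARD:\n{gold_standard}\n\n"
                f"MODEL ANSWER:\n{model_answer}\n\n"
                "Return your grades as JSON only."
            )},
        ],
        response_format={"type": "json_object"},  # force structured JSON output
    )
    graded = json.loads(resp.choices[0].message.content)

    # Derive the aggregate scores reported in the results tables
    graded["section_score"] = sum(graded["section_accuracy"].values())      # 0-6
    graded["global_score"] = sum(graded["global"].values())                 # 0-10
    graded["llm_score"] = graded["section_score"] + graded["global_score"]  # 0-16
    return graded
```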

Human Validation

To ensure our AI grading was reliable, we created a system (sketched in code below) that:

  1. Exports all data to CSV files that human physicians can review

  2. Provides detailed scoring breakdowns for each response

  3. Lists specific issues found by the AI grader

  4. Includes the gold standard guidelines for comparison

This allows human physicians to:

  • Review each AI response against the guidelines

  • Compare their scores with the AI scores

  • Identify any discrepancies

  • Provide their own expert evaluation
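As a rough illustration of the export step referenced above, this sketch writes one CSV row per graded run; the column names are assumptions, not the exact schema.

```python
# Illustrative CSV export for human physician review.
import csv

def export_for_review(graded_runs: list[dict], path: str = "review_export.csv") -> None:
    fields = ["model", "prompt", "run", "section_score", "global_score",
              "llm_score", "issues", "gold_standard"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for run in graded_runs:
            row = dict(run)
            # Join the grader's list of issues into a single readable cell
            row["issues"] = "; ".join(row.get("issues", []))
            writer.writerow({k: row.get(k, "") for k in fields})
```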

Supplemental Materials

Detailed Results by Prompt

Stage 0/IA Melanoma (Melanoma in situ)

| Model | Run | SBERT | ROUGE | BLEU | LLM Score | Key Issues |
|---|---|---|---|---|---|---|
| ChatGPT | 1 | 0.744 | 0.127 | 0.006 | 11/16 | Suggested dermatoscopic evaluation and Mohs surgery (not in guidelines); omitted follow-up reference |
| ChatGPT | 2 | 0.751 | 0.178 | 0.009 | 15/16 | Omitted "category 1 for stage IA" specification; missing follow-up guideline reference |
| ChatGPT | 3 | 0.745 | 0.132 | 0.007 | 16/16 | Perfect score - no issues identified |
| OpenEvidence | 1 | 0.737 | 0.192 | 0.029 | 14/16 | Suggested non-surgical options for melanoma in situ |
| OpenEvidence | 2 | 0.738 | 0.171 | 0.010 | 11/16 | Omitted history and physical examination; missing "wide excision" specification |
| OpenEvidence | 3 | 0.725 | 0.158 | 0.007 | 14/16 | Suggested 9-10 mm margins (not specified in guidelines) |

Stage IB (T2a) Melanoma

| Model | Run | SBERT | ROUGE | BLEU | LLM Score | Key Issues |
|---|---|---|---|---|---|---|
| ChatGPT | 1 | 0.690 | 0.163 | 0.009 | 11/16 | Suggested high-resolution ultrasound; omitted clinical trial/observation options; missing Stage III referral |
| ChatGPT | 2 | 0.692 | 0.163 | 0.011 | 11/16 | Suggested baseline imaging; omitted clinical trial/observation; missing follow-up procedures |
| ChatGPT | 3 | 0.745 | 0.177 | 0.011 | 13/16 | Omitted clinical trial or observation options for negative SLNB |
| OpenEvidence | 1 | 0.758 | 0.201 | 0.023 | 16/16 | Perfect score - no issues identified |
| OpenEvidence | 2 | 0.742 | 0.214 | 0.026 | 16/16 | Perfect score - no issues identified |
| OpenEvidence | 3 | 0.661 | 0.161 | 0.003 | 16/16 | Perfect score - no issues identified |

Stage II (T2b or higher) Melanoma

| Model | Run | SBERT | ROUGE | BLEU | LLM Score | Key Issues |
|---|---|---|---|---|---|---|
| ChatGPT | 1 | 0.709 | 0.149 | 0.010 | 11/16 | Suggested baseline CT/PET imaging; omitted pembrolizumab/nivolumab for stage IIB/IIC |
| ChatGPT | 2 | 0.702 | 0.141 | 0.004 | 11/16 | Omitted specific adjuvant therapy options; missing clinical trial/observation |
| ChatGPT | 3 | 0.678 | 0.177 | 0.012 | 9/16 | Incorrect T2b definition (1.01-2.0 mm); suggested unnecessary imaging |
| OpenEvidence | 1 | 0.648 | 0.158 | 0.021 | 11/16 | Omitted specific conditions for baseline imaging; missing adjuvant therapy options |
| OpenEvidence | 2 | 0.683 | 0.174 | 0.023 | 13/16 | Omitted pembrolizumab/nivolumab for pathological stage IIB/IIC |
| OpenEvidence | 3 | 0.685 | 0.167 | 0.023 | 11/16 | Omitted specific adjuvant therapy options; missing follow-up procedures |

Citation Validity Supplemental Material

Invalid citations found

| Model | Prompt | Run | Type | Citation |
|---|---|---|---|---|
| ChatGPT | stage_0_ia | 1 | DOI | 10.1007/s12094-014-1218-5 |
| ChatGPT | stage_ib_t2a | 1 | DOI | 10.1001/jama.2017.16261 |
| ChatGPT | stage_ib_t2a | 3 | DOI | 10.1002/cncr.36683 |
| ChatGPT | stage_ib_t2a | 3 | DOI | 10.1200/JCO.2018.78.2005 |

Full citation list

| Model | Prompt | Run | Type | Citation | Valid | Year | Old |
|---|---|---|---|---|---|---|---|
| ChatGPT | stage_0_ia | 1 | DOI | 10.1007/s12094-014-1218-5 | INVALID | - | |
| ChatGPT | stage_0_ia | 1 | DOI | 10.3322/caac.21348 | VALID | 2016 | OLD |
| ChatGPT | stage_ib_t2a | 1 | DOI | 10.1002/cncr.32764 | VALID | 2020 | OLD |
| ChatGPT | stage_ib_t2a | 1 | DOI | 10.1001/jama.2017.16261 | INVALID | - | |
| ChatGPT | stage_ii_t2b_or_higher | 1 | DOI | 10.1007/978-3-319-40618-3_41 | VALID | 2017 | OLD |
| ChatGPT | stage_ii_t2b_or_higher | 1 | DOI | 10.1200/JCO.2009.23.4799 | VALID | 2009 | OLD |
| ChatGPT | stage_ii_t2b_or_higher | 1 | DOI | 10.1097/CMR.0000000000000743 | VALID | 2021 | |
| ChatGPT | stage_0_ia | 2 | DOI | 10.1007/978-3-319-40618-3_48 | VALID | 2017 | OLD |
| ChatGPT | stage_0_ia | 2 | DOI | 10.1016/j.jaad.2011.06.038 | VALID | 2012 | OLD |
| ChatGPT | stage_0_ia | 2 | DOI | 10.1001/jamadermatol.2013.7117 | VALID | 2014 | OLD |
| ChatGPT | stage_ib_t2a | 2 | DOI | 10.3322/caac.21392 | VALID | 2017 | OLD |
| ChatGPT | stage_ib_t2a | 2 | DOI | 10.1097/CMR.0000000000000785 | VALID | 2021 | |
| ChatGPT | stage_ii_t2b_or_higher | 2 | DOI | 10.1200/JCO.2016.67.1529 | VALID | 2016 | OLD |
| ChatGPT | stage_ii_t2b_or_higher | 2 | DOI | 10.1016/j.jaad.2018.02.022 | VALID | 2018 | OLD |
| ChatGPT | stage_ii_t2b_or_higher | 2 | DOI | 10.1016/j.jaad.2018.02.022 | VALID | 2018 | OLD |
| ChatGPT | stage_0_ia | 3 | DOI | 10.3322/caac.21388 | VALID | 2017 | OLD |
| ChatGPT | stage_0_ia | 3 | DOI | 10.1016/j.jaad.2018.03.037 | VALID | 2018 | OLD |
| ChatGPT | stage_0_ia | 3 | DOI | 10.1016/j.jaad.2018.03.037 | VALID | 2018 | OLD |
| ChatGPT | stage_ib_t2a | 3 | DOI | 10.1002/cncr.36683 | INVALID | - | |
| ChatGPT | stage_ib_t2a | 3 | DOI | 10.1200/JCO.2018.78.2005 | INVALID | - | |
| ChatGPT | stage_ib_t2a | 3 | DOI | 10.1056/NEJMoa1709030 | VALID | 2017 | OLD |
| ChatGPT | stage_ii_t2b_or_higher | 3 | DOI | 10.1007/978-3-319-40618-3 | VALID | 2017 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1001/jamadermatol.2023.4193 | VALID | 2023 | |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1001/jamasurg.2023.6904 | VALID | 2024 | |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 1 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1001/jamadermatol.2023.4193 | VALID | 2023 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1001/jamasurg.2023.6904 | VALID | 2024 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ii_t2b_or_higher | 3 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1001/jamasurg.2023.6904 | VALID | 2024 | |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ib_t2a | 1 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 1 | PMID | 31758078 | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1038/s41379-019-0402-x | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.3390/jcm13061607 | VALID | 2024 | |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1001/jamasurg.2023.6904 | VALID | 2024 | |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1097/PRS.0000000000002367 | VALID | 2016 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1007/s11912-019-0843-x | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 2 | DOI | 10.1001/jamanetworkopen.2022.50613 | VALID | 2023 | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1016/j.jaad.2018.08.055 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.3322/caac.21409 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1038/s41379-019-0402-x | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1056/NEJMra2034861 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.3390/jcm13061607 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1001/jamasurg.2023.6904 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1097/PRS.0000000000002367 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1007/s11912-019-0843-x | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1200/JCO.2017.75.7724 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 2 | URL | https://doi.org/10.1001/jamanetworkopen.2022.50613 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1038/s41379-019-0402-x | VALID | 2020 | OLD |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1001/jamasurg.2023.6904 | VALID | 2024 | |
| OpenEvidence | stage_ib_t2a | 3 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1016/j.jaad.2018.08.055 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.3322/caac.21409 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1038/s41379-019-0402-x | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1111/bjd.16892 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1200/JCO.2017.75.7724 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1001/jamasurg.2023.6904 | VALID | - | |
| OpenEvidence | stage_ib_t2a | 3 | URL | https://doi.org/10.1056/NEJMra2034861 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.1016/j.suc.2014.07.001 | VALID | 2014 | OLD |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.1001/jamadermatol.2016.2668 | VALID | 2016 | OLD |
| OpenEvidence | stage_0_ia | 1 | DOI | 10.1097/PRS.0000000000002367 | VALID | 2016 | OLD |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.1016/j.jaad.2018.08.055 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.3322/caac.21409 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.1056/NEJMra2034861 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.1016/j.suc.2014.07.001 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.1001/jamadermatol.2016.2668 | VALID | - | |
| OpenEvidence | stage_0_ia | 1 | URL | https://doi.org/10.1097/PRS.0000000000002367 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1056/NEJMra2034861 | VALID | 2021 | |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1016/j.suc.2014.07.001 | VALID | 2014 | OLD |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1001/jamadermatol.2016.2668 | VALID | 2016 | OLD |
| OpenEvidence | stage_0_ia | 2 | DOI | 10.1097/PRS.0000000000002367 | VALID | 2016 | OLD |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1016/j.jaad.2018.08.055 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1111/bjd.16892 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1056/NEJMra2034861 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1016/j.suc.2014.07.001 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1200/JCO.2017.75.7724 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1001/jamadermatol.2016.2668 | VALID | - | |
| OpenEvidence | stage_0_ia | 2 | URL | https://doi.org/10.1097/PRS.0000000000002367 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1016/j.jaad.2018.08.055 | VALID | 2019 | OLD |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1111/bjd.16892 | VALID | 2018 | OLD |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.3322/caac.21409 | VALID | 2017 | OLD |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1016/j.cps.2021.05.004 | VALID | 2021 | |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1016/j.suc.2014.07.001 | VALID | 2014 | OLD |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1200/JCO.2017.75.7724 | VALID | 2018 | OLD |
| OpenEvidence | stage_0_ia | 3 | DOI | 10.1016/j.jaad.2019.01.051 | VALID | 2019 | OLD |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1016/j.jaad.2018.08.055 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1111/bjd.16892 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.3322/caac.21409 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1016/j.cps.2021.05.004 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1016/j.suc.2014.07.001 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1200/JCO.2017.75.7724 | VALID | - | |
| OpenEvidence | stage_0_ia | 3 | URL | https://doi.org/10.1016/j.jaad.2019.01.051 | VALID | - | |