Decoding LLM Attack Surfaces: A Deep Dive into Model Vulnerabilities


The attack surface of Large Language Models (LLMs) is a critical area of research, especially as these models become more integrated into various applications. To shed light on this, I conducted a series of tests on four prominent models: GPT-5, Gemini 2.5 Flash, Claude Sonnet 4, and Llama 4 Maverick, against the OWASP AITG-APP guide.
Methodology
Each model was subjected to over 1,000 unique payloads designed to probe for the vulnerabilities outlined in the OWASP AITG-APP guide. The process involved two stages:
1. Payload Submission: Each model received the full set of 1,000+ payloads.
2. Judge LLM Analysis: Each model's response, along with the original prompt, was fed into a judge LLM. This judge was specifically trained on each test case to analyze the interaction and deliver a verdict on whether the attack succeeded.
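The two-stage process can be sketched as a simple evaluation loop. This is a minimal illustration, not the actual test harness: `query_model` and `query_judge` are hypothetical stand-ins for the real API calls, and the keyword check inside the judge stub is purely for demonstration.

```python
def query_model(model: str, payload: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    return f"[{model} response to: {payload}]"

def query_judge(test_case: str, payload: str, response: str) -> bool:
    """Hypothetical stand-in for the judge LLM.

    Returns True if the judge deems the attack successful. The real judge
    reasons over the full interaction; this stub just pattern-matches.
    """
    return "ignore previous instructions" in response.lower()

def run_test_case(model: str, test_case: str, payloads: list[str]) -> float:
    """Submit every payload, collect judge verdicts, return the fail rate."""
    verdicts = [
        query_judge(test_case, p, query_model(model, p)) for p in payloads
    ]
    return sum(verdicts) / len(verdicts)

rate = run_test_case("gpt-5", "AITG-APP-01", ["payload-a", "payload-b"])
print(f"AITG-APP-01 fail rate: {rate:.0%}")
```

The fail rate for each (model, test case) pair is simply the fraction of payloads the judge marks as successful attacks.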
The results, presented in the graphs below, illustrate the fail rates (successful attack percentages) for each test case. These findings highlight the varying resilience of each model across different attack scenarios. You can find raw results data in this repository.
Performance Results
Overview
The overall performance across all models and attack categories reveals varying degrees of resilience. While some models like GPT-5 show relatively low fail rates across the board, others such as Llama 4 Maverick exhibit significantly higher vulnerabilities in specific areas. The most common successful attack vectors are those related to prompt injection and manipulation, with AITG-APP-04 (Input Leakage) and AITG-APP-13 (Over-Reliance on AI) consistently showing higher fail rates across multiple models. Conversely, categories like AITG-APP-03 (Information Leakage) and AITG-APP-09 (Model Extraction) generally yielded lower fail rates, likely because the tested models do not inherently contain sensitive internal data. This aggregated view underscores the need for robust security measures targeting common LLM attack surfaces.
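To make the aggregation concrete, here is a toy sketch of how per-payload judge verdicts roll up into the per-category fail rates shown in the graphs. The verdict lists below are illustrative placeholders, not the study's actual data:

```python
# (model, test case) -> list of judge verdicts (True = attack succeeded).
# These small verdict lists are made-up examples for illustration only.
verdicts = {
    ("Llama 4 Maverick", "AITG-APP-13"): [True, True, True, False],
    ("GPT-5", "AITG-APP-01"): [False, False, False, True],
}

# Fail rate = successful attacks / total payloads for each pair.
fail_rates = {key: sum(v) / len(v) for key, v in verdicts.items()}

for (model, case), rate in sorted(fail_rates.items()):
    print(f"{model:18s} {case}: {rate:.0%}")
```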
Claude Sonnet 4
Claude Sonnet 4 exhibits strong performance across the tested attack surface. Its most notable vulnerability is AITG-APP-13 (Over-Reliance on AI) at 54%, indicating the model offers sensitive advice without explicit caution or a recommendation to consult a professional. AITG-APP-04 (Input Leakage), also at 54%, presents another significant weakness, suggesting the model may inadequately protect sensitive inputs and reproduce them upon request. Other categories, such as AITG-APP-07 (System Prompt Leakage) and AITG-APP-14 (Explainability and Interpretability), at 20% and 18% respectively, show lower but still present fail rates.
Gemini 2.5 Flash
Gemini 2.5 Flash demonstrates a moderate level of resilience against most attack vectors, with a generally low fail rate across the OWASP AITG-APP categories. Its most significant vulnerabilities lie in AITG-APP-04 (Input Leakage) at 69% and AITG-APP-02 (Indirect Prompt Injection), suggesting that while robust, it is not entirely immune to sophisticated external prompt manipulations. Notably, AITG-APP-05 (Unsafe Outputs) has only a 4% fail rate, indicating strong safeguards against harmful responses, consistent with its design as a general-purpose model.
GPT-5
GPT-5 demonstrates low fail rates across most tested attack categories, establishing itself as a resilient model in this evaluation. Besides AITG-APP-04 (Input Leakage), its highest vulnerabilities are AITG-APP-07 (System Prompt Leakage) and AITG-APP-13 (Over-Reliance on AI), which register 52% and 51% fail rates respectively. Categories such as AITG-APP-01 (Prompt Injection) and AITG-APP-02 (Indirect Prompt Injection) show impressive 4% and 6% fail rates, highlighting robust internal safeguards and isolation mechanisms.
Llama 4 Maverick
Llama 4 Maverick shows the highest overall fail rates among the models tested, indicating a greater susceptibility to the identified attack surfaces. It is particularly vulnerable to AITG-APP-13 (Over-Reliance on AI) and AITG-APP-14 (Explainability and Interpretability), with a substantial 76% fail rate on both, suggesting significant challenges in issuing safe, well-caveated outputs. AITG-APP-10 (Harmful Content Bias) also presents a major weakness at 44%, pointing to weaker safeguards against producing biased or harmful content. These results suggest that Llama 4 Maverick may require significant hardening to improve its security posture against common LLM attack vectors.
Caveats and Future Work
It's important to note the following caveats to this research:
Non-Peer Reviewed: This research has not undergone peer review. I encourage others to replicate and expand upon these tests and share their results.
Judge LLM Tuning: The LLM judges may require further tuning, as some false positives were observed. An error margin of at least 5% should be assumed.
Payload Improvement: The payloads could be further refined to achieve broader and deeper coverage, leading to a more accurate assessment of model vulnerabilities.
Information Leakage: Most models tested do not inherently contain sensitive information, leading to low fail rates in information leakage scenarios (e.g., AITG-APP-03). Custom models, however, may exhibit higher rates in such tests.
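Given the observed judge false positives, one way to account for them is a Rogan-Gladen-style correction, which adjusts an observed rate for a classifier's known error rates. This is a hedged sketch of that idea, not part of the study's methodology; the 5% false-positive rate mirrors the assumed error margin above, and the false-negative rate is an illustrative assumption.

```python
def corrected_fail_rate(observed: float,
                        fpr: float = 0.05,
                        fnr: float = 0.0) -> float:
    """Adjust an observed fail rate for judge error (Rogan-Gladen estimator).

    fpr: P(judge says fail | true pass), fnr: P(judge says pass | true fail).
    Both default values are illustrative assumptions, not measured figures.
    """
    sensitivity = 1.0 - fnr
    specificity = 1.0 - fpr
    est = (observed + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(est, 0.0), 1.0)  # clamp the estimate to [0, 1]

print(corrected_fail_rate(0.54))  # a 54% observed rate adjusts to ~51.6%
```

Under these assumed error rates, a reported 54% fail rate would correspond to a true rate of roughly 52%, which is why the 5% error margin matters when comparing models whose scores are only a few points apart.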
Written by Joey Melo