Integrating the 14 Failure Modes of Multi-Agent LLM Systems into an FMEA Framework: A Deep Technical Analysis

Multi-agent systems (MAS) powered by Large Language Models (LLMs) offer scalable, modular, and collaborative AI capabilities. However, reliably engineering such systems requires careful mitigation of complex, interwoven failure modes. The Multi-Agent System Failure Taxonomy (MAST) provides the first empirically grounded, structured classification of 14 critical failure modes categorized into:
- Specification and System Design Issues
- Inter-Agent Misalignment and Coordination Breakdowns
- Task Verification and Quality Control Failures
Applying a Failure Modes and Effects Analysis (FMEA) lens to these modes helps systematically analyze their impacts, root causes, and detection challenges, and prioritize risk mitigation.
MAST Failure Mode Categories and Their Significance
Based on analysis of 7 popular MAS frameworks and 200+ execution traces covering diverse tasks, the MAST taxonomy divides failures as follows:
| Category | Percentage of Failures | Description |
| --- | --- | --- |
| Specification Issues | 41.77% | Failures from ambiguous instructions, poor role definition, or inadequate system/specification design. |
| Inter-Agent Misalignment | 36.94% | Failures due to ineffective communication, ignored inputs, conflicting agent behavior, or task derailment. |
| Task Verification and Quality Control | 21.30% | Failures caused by incomplete, incorrect, or missing output verification, leading to error propagation. |
This distribution underscores that nearly 80% of MAS failures arise from specification and coordination shortcomings, with verification issues also playing a critical role.
The 14 Failure Modes (MAST Details)
| Failure Mode | Category | Description |
| --- | --- | --- |
| Disobey Task Specification | Specification Issues | Agents fail to follow task instructions due to ambiguity or conflict. |
| Disobey Role Specification | Specification Issues | Role definitions are unclear or violated, causing overlapping or conflicting actions. |
| Step Repetition | Specification Issues | Agents redundantly repeat steps due to poor state tracking or ambiguous stopping criteria. |
| Loss of Conversation History | Specification Issues | Agents lose track of important context, causing task derailment. |
| Unaware of Termination Conditions | Specification Issues | Agents lack proper termination checks, leading to infinite loops or premature stops. |
| Conversation Reset | Inter-Agent Misalignment | Shared context is lost due to session mishandling, causing restarts or repetition. |
| Fail to Ask for Clarification | Inter-Agent Misalignment | Agents neglect to query ambiguous inputs, propagating errors. |
| Task Derailment | Inter-Agent Misalignment | Agents diverge from the assigned goal, producing incoherent or irrelevant outputs. |
| Information Withholding | Inter-Agent Misalignment | One or more agents deliberately or accidentally withhold information, weakening collaboration. |
| Ignored Input from Other Agents | Inter-Agent Misalignment | Agents disregard peer input, breaking down cooperation and causing redundant work. |
| Reasoning-Action Mismatch | Inter-Agent Misalignment | Agents' logical reasoning does not map correctly to the actions they take. |
| Premature Termination | Task Verification | Tasks end before completion, sacrificing completeness and wasting work. |
| No or Incomplete Verification | Task Verification | Weak or absent verification fails to detect errors or inconsistencies. |
| Incorrect Verification | Task Verification | Verification falsely reports success, allowing flawed outputs to persist. |
FMEA Table: Mapping Failure Modes to Risk and Detection
| Failure Mode | Potential Effects | Root Cause | S/O/D | RPN | Detection Method |
| --- | --- | --- | --- | --- | --- |
| Disobey Task Specification | Wrong output, wasted resources | Ambiguous/conflicting instructions | 5/4/4 | 80 | Task audits, prompt clarity checks |
| Disobey Role Specification | Overlapping/conflicting agent actions | Poor role clarity/enforcement | 5/4/3 | 60 | Role compliance review |
| Step Repetition | Inefficiency, redundant computations | Poor state or memory tracking | 4/4/3 | 48 | Execution trace logs, step counters |
| Loss of Conversation History | Context loss, derailment | Insufficient persistence/storage | 5/3/5 | 75 | Conversation replay, state audits |
| Unaware of Termination | Endless loops, incomplete tasks | Missing/ambiguous termination checks | 4/3/4 | 48 | End condition verification |
| Conversation Reset | Rework, lost progress | Session or state resets | 5/3/4 | 60 | Event trace and session logs |
| Fail to Ask for Clarification | Error propagation | Weak inter-agent protocol | 3/4/4 | 48 | Message inspection |
| Task Derailment | Goal divergence, incoherent output | Incomplete/inaccurate goals | 5/3/4 | 60 | Output-goal alignment checks |
| Information Withholding | Reduced intelligence, incomplete tasks | Communication lapses | 4/3/4 | 48 | Communication logs |
| Ignored Input from Others | Redundancy, missed insights | Protocol non-enforcement | 4/3/4 | 48 | Peer interaction audits |
| Reasoning-Action Mismatch | Invalid outputs | Faulty reasoning chains | 5/2/5 | 50 | Logical consistency tests |
| Premature Termination | Incomplete results | Wrong stop signals | 4/2/4 | 32 | Execution logs |
| No or Incomplete Verification | Error propagation | Weak QA or missing checks | 5/3/5 | 75 | Automated output validations |
| Incorrect Verification | False success, flawed final output | Ineffective QC processes | 5/3/5 | 75 | Secondary audits, human review |
Severity (S), Occurrence (O), and Detectability (D) are each scored on a 1–5 scale; their product is the Risk Priority Number (RPN = S × O × D). Higher RPNs flag the failure modes most in need of urgent intervention.
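The ranking logic described above is straightforward to script. The sketch below computes and sorts RPNs for a few of the table's failure modes; the score values are taken directly from the FMEA table, and the function names are illustrative.

```python
# Minimal FMEA risk-ranking sketch. Scores are S/O/D values from the FMEA
# table above, each on a 1-5 scale; RPN = S * O * D.
failure_modes = {
    "Disobey Task Specification": (5, 4, 4),
    "Loss of Conversation History": (5, 3, 5),
    "No or Incomplete Verification": (5, 3, 5),
    "Premature Termination": (4, 2, 4),
}

def rpn(s, o, d):
    """Risk Priority Number: severity x occurrence x detectability."""
    return s * o * d

# Rank failure modes by RPN, highest (most urgent) first.
ranked = sorted(failure_modes.items(), key=lambda kv: rpn(*kv[1]), reverse=True)
for name, (s, o, d) in ranked:
    print(f"{name}: RPN={rpn(s, o, d)}")
```

Running this puts Disobey Task Specification (RPN 80) at the top of the intervention queue, matching the table.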
Technical Insights & Interactions
Cascading Failure Effect: Early specification flaws (e.g., ambiguous roles) cascade into downstream coordination breakdowns and verification failures, magnifying impact.
Detection Difficulty: Subtle failures like conversation resets and ignored inputs demand sophisticated runtime state monitoring, meta-agent oversight, and cross-agent consistency checking.
Communication Protocols are Critical: MAS depend heavily on well-defined, standardized protocols. Misalignment here reduces collective problem-solving capacity.
Verification is a Gatekeeper: Without rigorous multi-layered verification, errors silently propagate, undermining system reliability.
Practical FMEA Application: Risk Mitigation Strategies
Risk Prioritization:
- Focus on high RPN failure modes such as Disobey Task Specification, Loss of Conversation History, and No/Incorrect Verification.
Specification & Role Clarity:
- Employ formal schemas (e.g., JSON Schema) and strict role definition documents.
- Run pre-execution audits to enforce prompt and role compliance.
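A pre-execution role audit can be as simple as checking each agent's specification against a required schema before the run starts. The sketch below uses only the standard library; the field names (`name`, `goal`, `allowed_tools`, `termination`) are illustrative assumptions, not fields from any particular framework.

```python
# Hypothetical pre-execution audit: validate an agent's role specification
# against a minimal required schema before the MAS run begins.
REQUIRED_FIELDS = {"name": str, "goal": str, "allowed_tools": list, "termination": str}

def audit_role_spec(spec: dict) -> list:
    """Return a list of violations; an empty list means the spec passes."""
    violations = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in spec:
            violations.append(f"missing field: {field}")
        elif not isinstance(spec[field], ftype):
            violations.append(f"wrong type for {field}: expected {ftype.__name__}")
    return violations

validator_spec = {
    "name": "ValidatorAgent",
    "goal": "Verify reconciled records for semantic correctness",
    "allowed_tools": ["schema_check", "semantic_diff"],
    "termination": "all records verified or escalation raised",
}
print(audit_role_spec(validator_spec))  # empty list -> spec is compliant
```

Rejecting a run at this stage is far cheaper than debugging Disobey Role Specification failures mid-execution.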
Robust Conversation State Management:
- Persist context in both working memory and durable stores to prevent resets and history loss.
- Use checkpointing and session-restoration techniques.
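Checkpointing can be sketched with nothing more than atomic file writes: persist the shared history after each turn so a session reset restores context instead of starting blank. The file layout and function names below are illustrative assumptions.

```python
# Sketch of conversation checkpointing: persist shared context to disk after
# each turn so a session reset restores history rather than losing it.
import json
import os
import tempfile

def checkpoint(history: list, path: str) -> None:
    """Atomically write conversation history to a checkpoint file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(history, f)
    os.replace(tmp, path)  # atomic rename: never leaves a partial checkpoint

def restore(path: str) -> list:
    """Load the last checkpoint, or an empty history if none exists."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "mas_checkpoint.json")
checkpoint([{"agent": "AnalystAgent", "msg": "fetched 120 records"}], path)
print(restore(path))
```

In production the same pattern would target a database or object store rather than a temp file, but the invariant is identical: a Conversation Reset should recover the latest checkpoint, not an empty history.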
Standardized Communication & Alignment Checks:
- Enforce message formats and communication protocols.
- Deploy meta-agents that continually monitor agent interactions.
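A standardized message envelope plus a meta-agent-style gate can be sketched as follows. The envelope fields and allowed intents are assumptions for illustration, not a standard protocol.

```python
# Sketch of a standardized inter-agent message envelope and a meta-agent
# check that rejects malformed messages before they reach a peer.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    intent: str      # e.g. "request", "result", "clarification"
    payload: dict
    msg_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)

ALLOWED_INTENTS = {"request", "result", "clarification", "ack"}

def enforce_protocol(msg: AgentMessage) -> bool:
    """Meta-agent gate: reject messages that break the protocol."""
    return bool(msg.sender and msg.recipient) and msg.intent in ALLOWED_INTENTS

ok = AgentMessage("AnalystAgent", "ValidatorAgent", "result", {"rows": 120})
bad = AgentMessage("AnalystAgent", "ValidatorAgent", "shout", {})
print(enforce_protocol(ok), enforce_protocol(bad))
```

Rejecting `bad` at the gate surfaces the protocol violation immediately, instead of letting it manifest later as Ignored Input or Task Derailment.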
Multi-Layered Verification Pipelines:
- Combine automated output validation, logical consistency checks, and human-in-the-loop reviews.
- Use domain-specific validators to check outputs precisely.
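A layered pipeline can be modeled as a chain of predicates, each able to veto an output; a record passes only if every layer passes. The layer implementations below are toy stand-ins for real format and semantic checks.

```python
# Sketch of a layered verification pipeline: each layer is a predicate that
# can veto an output; an output passes only if every layer accepts it.
def format_check(record: dict) -> bool:
    """Shape check: required keys are present."""
    return {"account", "amount"} <= record.keys()

def semantic_check(record: dict) -> bool:
    """Content check: amounts must be non-negative numbers."""
    return isinstance(record["amount"], (int, float)) and record["amount"] >= 0

def verify(record: dict, layers) -> tuple:
    """Run layers in order; report the first layer that rejects, if any."""
    for layer in layers:
        if not layer(record):
            return False, layer.__name__
    return True, "passed"

layers = [format_check, semantic_check]
print(verify({"account": "A-1", "amount": 250.0}, layers))  # passes all layers
print(verify({"account": "A-2", "amount": -40.0}, layers))  # fails semantic_check
```

Ordering matters: the cheap shape check runs first and short-circuits, so the semantic layer can safely assume the keys exist. This is exactly the gap in the reconciliation example later in this article, where format checks passed while content logic was never verified.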
Fine-Grained Logging and Monitoring:
- Maintain detailed execution logs with timestamps, message tracing, and error flags.
- Apply anomaly detection or machine learning on logs for early failure detection.
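Even a naive anomaly rule over execution logs catches some failure modes early. The sketch below flags an agent that repeats the same step past a threshold, a direct signal for the Step Repetition failure mode; the threshold and log fields are illustrative.

```python
# Sketch of fine-grained execution logging with a naive anomaly rule:
# flag any (agent, step) pair repeated at least `threshold` times, a
# signal for the Step Repetition failure mode.
import collections
import time

log = []

def log_step(agent: str, step: str) -> None:
    """Append a timestamped step record to the execution log."""
    log.append({"ts": time.time(), "agent": agent, "step": step})

def detect_repetition(entries: list, threshold: int = 3) -> list:
    """Return (agent, step) pairs that hit the repetition threshold."""
    counts = collections.Counter((e["agent"], e["step"]) for e in entries)
    return [key for key, n in counts.items() if n >= threshold]

for _ in range(4):
    log_step("AnalystAgent", "fetch_data")   # agent stuck in a loop
log_step("ValidatorAgent", "verify_batch")

print(detect_repetition(log))
```

Counting is a stand-in for the heavier machinery (trace analysis, learned anomaly detectors) that production monitoring would layer on top of the same logs.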
Continuous Feedback and Adaptation:
- Use failure-pattern data to iteratively improve roles, prompts, and protocols.
- Update training and prompts with competency mapping to maintain role-specific performance.
Illustrative Real-World Example
In a financial data reconciliation MAS:
- AnalystAgent fetched data; ValidatorAgent verified it.
- Due to Loss of Conversation History, the Validator used stale data.
- Because of Incorrect Verification (format checks only, no content logic), inaccurate records passed undetected into final reports.
Addressing these failures meant improving context persistence and adding semantic validations, which cut error rates drastically.
Advanced Research and Practice
The Chat-of-Thought framework (Constantinides et al., 2025) demonstrated multi-agent specialized roles (Facilitator, Validator, Reliability Engineer) collaborating to automate FMEA generation, showing the power of well-organized MAS collaboration (arXiv:2506.10086).
Research into agent competency mapping and fine-tuning for task-specific roles ensures better role differentiation and reliability (arXiv:2404.04834).
Popular tools like LangGraph, OpenTelemetry, Cerberus, and Great Expectations facilitate structured workflow, traceability, schema validation, and data quality assurance in MAS.
Summary Table: MAS Failures to FMEA Actions
| Failure Category | Key Risks | FMEA Focus | Detection Focus |
| --- | --- | --- | --- |
| Specification & Role Definition | Ambiguity, overlapping roles | Clear roles, strict task schemas | Prompt and role compliance checks |
| Coordination & Communication | Misalignment, information silos | Protocol standardization, active messaging | Multi-modal logs and peer audits |
| Verification & Quality Control | Undetected or false verification | Rigorous multi-layer validation | Automated + manual output reviews |
References
Cemri et al. "Why Do Multi-Agent LLM Systems Fail?" UC Berkeley, 2025. arXiv:2503.13657
Constantinides et al. "Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information." IBM Research, 2025. arXiv:2506.10086
Kaur and Kumar. "Competency Mapping and LLM Fine-Tuning for MAS." 2024. arXiv:2404.04834
Gradient Flow, Hugging Face Community Papers and Practical MAS Failure Analyses
In conclusion, integrating the 14 failure modes into an FMEA framework gives MAS designers a robust roadmap to identify, prioritize, detect, and mitigate systemic vulnerabilities. Systematic attention to specification, communication, and especially verification, paired with improved tooling, role clarity, and continuous feedback loops, charts a path toward robust, scalable, and trustworthy multi-agent LLM deployments.
Bibliography:
https://www.linkedin.com/pulse/why-do-multi-agent-llm-systems-fail-groundbreaking-amit-1io2c
https://github.com/multi-agent-systems-failure-taxonomy/MAST
https://www.llmwatch.com/p/multi-agent-failure-why-complex-ai
https://ai.plainenglish.io/multi-agent-ai-systems-are-failing-heres-why-and-what-s-next-2cbc196ff58a