BART Architecture: Runtime Analysis and Implementation Considerations

BART (Bidirectional and Auto-Regressive Transformers) is Facebook AI's approach to sequence-to-sequence modeling, combining the bidirectional encoding of BERT with the auto-regressive decoding of GPT. This article examines BART's architecture, computational flow, and performance characteristics from a systems implementation perspective.
BART Architecture Overview
BART follows an encoder-decoder structure that differs from models like BERT by enabling both understanding and generation tasks:
Encoder: 6 or 12 bidirectional Transformer layers (similar to BERT)
Decoder: 6 or 12 auto-regressive Transformer layers (similar to GPT)
Cross-attention: Mechanism connecting encoder and decoder
Standard configurations include:
BART-Base: 6 encoder/decoder layers, 768 hidden dimensions, 12 attention heads (~140M parameters)
BART-Large: 12 encoder/decoder layers, 1024 hidden dimensions, 16 attention heads (~400M parameters)
Key architectural features include:
50,265 token vocabulary based on RoBERTa's byte-level BPE tokenizer
Learned positional embeddings (up to 1024 positions)
Layer normalization applied to the summed token and position embeddings in both encoder and decoder
Post-layer-norm Transformer blocks (layer normalization follows each residual connection)
These hyperparameters can be read directly from the released configurations, as in the sketch below
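A minimal configuration check, assuming the Hugging Face transformers library and the facebook/bart-base and facebook/bart-large checkpoints on the Hub:

```python
# Sketch: inspect the released BART configurations (assumes `transformers` is installed).
from transformers import AutoConfig

for name in ["facebook/bart-base", "facebook/bart-large"]:
    cfg = AutoConfig.from_pretrained(name)
    print(
        name,
        f"enc_layers={cfg.encoder_layers}",
        f"dec_layers={cfg.decoder_layers}",
        f"d_model={cfg.d_model}",
        f"heads={cfg.encoder_attention_heads}",
        f"vocab={cfg.vocab_size}",
        f"max_pos={cfg.max_position_embeddings}",
    )
```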
Computational Pipeline: From Input to Output
BART's dual-component architecture creates a distinct processing flow:
1. Input Processing
Tokenization: Text converted to byte-level BPE tokens using the RoBERTa/GPT-2 tokenizer (see the sketch below)
Special Tokens: <s> and </s> added to mark sequence boundaries
Vocabulary Mapping: Tokens mapped to IDs from the 50,265-token vocabulary
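A short tokenization sketch, assuming the Hugging Face transformers library and the facebook/bart-base checkpoint, showing the boundary tokens and vocabulary IDs described above:

```python
# Sketch: byte-level BPE tokenization with the BART tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
enc = tok("BART combines a bidirectional encoder with an autoregressive decoder.")
print(enc["input_ids"])                              # IDs from the 50,265-token vocabulary
print(tok.convert_ids_to_tokens(enc["input_ids"]))   # starts with '<s>', ends with '</s>'
```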
2. Encoder Stack
Embedding Layer:
Token embeddings: Maps tokens to 768/1024-dimensional vectors
Position embeddings: Adds learned positional information
Embedding addition and normalization
Dropout (p=0.1) during training
Encoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):
Self-Attention Sublayer:
Input: Previous layer output (or embedding output for first layer)
Computes multi-head self-attention (all tokens attend to all tokens)
Layer normalization and residual connection
Feed-Forward Sublayer:
Two linear transformations with GELU activation
Dimensions: hidden_size → 4*hidden_size → hidden_size
Layer normalization and residual connection
Encoder Output:
Sequence of contextualized representations for the input tokens (shape: [batch_size, enc_seq_length, hidden_size]), already normalized by the last layer's post-sublayer layer norms (a minimal layer sketch follows below)
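The layer structure above can be summarized in a short PyTorch sketch. This illustrates the post-LN layout with a GELU feed-forward block using BART-Base dimensions; it is not the exact fairseq/transformers implementation:

```python
# Minimal sketch of one post-LN BART-style encoder layer (illustrative only).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                       # BART uses GELU activations
            nn.Linear(4 * d_model, d_model),
        )
        self.ffn_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sublayer: residual connection, then layer norm (post-LN)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.attn_norm(x + self.dropout(attn_out))
        # Feed-forward sublayer: residual connection, then layer norm
        x = self.ffn_norm(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 16, 768)        # [batch_size, enc_seq_length, hidden_size]
print(EncoderLayer()(x).shape)     # torch.Size([2, 16, 768])
```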
3. Decoder Stack
Embedding Layer:
Token embeddings: Maps target tokens to vectors
Position embeddings: Adds position information
Embedding addition and normalization
Dropout (p=0.1) during training
Decoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):
Masked Self-Attention Sublayer:
Causal masking so each position attends only to itself and earlier positions
Multi-head attention, normalization, and residual connection
Cross-Attention Sublayer:
Attends to encoder outputs (queries from decoder, keys/values from encoder)
Connects encoder and decoder information flow
Normalization and residual connection
Feed-Forward Sublayer:
Two linear transformations with GELU activation
Layer normalization and residual connection
Output Projection:
Linear projection of the final decoder states to the vocabulary size (50,265)
Weight tying with the input embeddings (shares parameters; see the sketch below)
Output during generation: probability distribution over the vocabulary for the next token
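The two decoder-specific pieces, causal masking and the weight-tied output projection, can be illustrated with a small PyTorch sketch (illustrative shapes only, not the library implementation):

```python
import torch
import torch.nn as nn

vocab_size, d_model, tgt_len = 50265, 768, 5

# Causal mask: position i may attend to positions <= i only (-inf blocks future positions)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
print(causal_mask)

# Weight tying: the output projection reuses the token-embedding matrix
tok_embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_embed.weight            # shared parameters

decoder_states = torch.randn(1, tgt_len, d_model)
logits = lm_head(decoder_states)             # [1, tgt_len, 50265] scores over the vocabulary
print(logits.shape)
```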
4. Generation Process
During inference, BART generates output auto-regressively:
Encoder processes the full input sequence
Decoder generates one token at a time
Each new token is appended to previously generated ones
Process continues until EOS token or maximum length is reached
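This flow can be made concrete with a bare-bones greedy decoding loop. The sketch assumes the Hugging Face transformers library and the facebook/bart-base checkpoint; it deliberately omits the key/value cache and beam search (which model.generate() provides) so the encoder-once / decoder-per-step structure stays visible:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

inputs = tok("BART is a sequence-to-sequence <mask>.", return_tensors="pt")
with torch.no_grad():
    enc_out = model.get_encoder()(**inputs)              # 1. encode the full input once
    dec_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(20):                                   # 2. one decoder pass per new token
        logits = model(encoder_outputs=enc_out,
                       attention_mask=inputs.attention_mask,
                       decoder_input_ids=dec_ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        dec_ids = torch.cat([dec_ids, next_id], dim=-1)   # 3. append the new token
        if next_id.item() == model.config.eos_token_id:   # 4. stop at EOS or max length
            break

print(tok.decode(dec_ids[0], skip_special_tokens=True))
```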
BART Training Methodology
BART's training process combines denoising pre-training with task-specific fine-tuning:
Pre-training Phase
Corruption Strategies:
Token masking: Random tokens replaced with [MASK]
Token deletion: Random tokens removed
Text infilling: Spans of text replaced with a single [MASK] token, with span lengths drawn from a Poisson distribution (λ=3); see the sketch after this list
Sentence permutation: Sentences shuffled in random order
Document rotation: Document rotated to start with random token
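A toy, word-level sketch of the text-infilling corruption is below; the real pre-training operates on BPE tokens, and the helper name and span-start probability are illustrative assumptions:

```python
# Sketch: text infilling - random spans replaced by a single mask token,
# with span lengths drawn from Poisson(lambda=3).
import numpy as np

def text_infilling(tokens, mask_token="<mask>", lam=3.0, start_prob=0.15, rng=None):
    rng = rng or np.random.default_rng(0)
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = int(rng.poisson(lam))
            out.append(mask_token)   # the whole span becomes one mask token
            i += span                # a 0-length span simply inserts a mask
        else:
            out.append(tokens[i])
            i += 1
    return out

words = "BART is trained by corrupting text and learning to reconstruct it".split()
print(text_infilling(words))   # model target during pre-training: the original, uncorrupted text
```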
Training Objective:
Reconstruct the original text from corrupted input
Cross-entropy loss between predicted and actual tokens
Training Parameters:
Adam optimizer with β₁=0.9, β₂=0.999
Learning rate: 3e-5 with polynomial decay
Dropout: 0.1
Attention dropout: 0.1
Weight decay: 0.01
Label smoothing: 0.1
Training corpus: Similar to RoBERTa (160GB of text)
Hardware: 256 NVIDIA V100 GPUs for several days
Fine-tuning Approaches
BART excels at different types of tasks with specific fine-tuning strategies:
Sequence Classification:
Encoder processes input
The same input is fed to the decoder, with an end-of-sequence token appended
Final decoder hidden state at that last token used for classification
All parameters updated during fine-tuning
Token Classification:
Encoder processes input
Final encoder outputs used for token-level predictions
Decoder not typically used
Sequence Generation:
Full encoder-decoder architecture used
Auto-regressive generation with beam search (beam size: 4-5)
Length penalty applied to favor longer or shorter outputs
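A generation sketch using the summarization-fine-tuned facebook/bart-large-cnn checkpoint (an example checkpoint, assuming the Hugging Face transformers library); the decoding hyperparameters shown mirror commonly used summarization settings for that model:

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

article = "Your long input document goes here. " * 20
batch = tok(article, truncation=True, max_length=1024, return_tensors="pt")
summary_ids = model.generate(
    **batch,
    num_beams=4,             # beam search, beam size 4
    length_penalty=2.0,      # >1.0 favors longer outputs
    max_length=142,
    min_length=56,
    no_repeat_ngram_size=3,  # simple repetition filter
    early_stopping=True,
)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```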
Machine Translation:
A small new encoder, trained from scratch, replaces BART's input embedding layer
Most original BART parameters are frozen during the first training stage
The new encoder maps the source language into representations that BART can decode into the target language
Runtime Performance Characteristics
BART's encoder-decoder architecture has specific performance implications:
Computational Complexity
For input sequence length n, output sequence length m, and hidden dimension d:
Encoder self-attention: O(n² × d) operations
Decoder self-attention: O(m² × d) operations (cumulative during generation)
Cross-attention: O(n × m × d) operations
Feed-forward networks: O((n+m) × d²) operations
Overall complexity: O(n² × d + m² × d + n × m × d + (n+m) × d²)
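A back-of-envelope calculator for these terms (illustrative constants only; real FLOP counts depend on implementation details such as projections and softmax):

```python
# Rough operation counts from the asymptotic terms above, for BART-Large defaults.
def rough_ops(n, m, d=1024, layers=12):
    attn_enc = layers * n * n * d            # encoder self-attention
    attn_dec = layers * m * m * d            # decoder self-attention (cumulative)
    cross    = layers * n * m * d            # cross-attention
    ffn      = layers * (n + m) * 4 * d * d  # feed-forward (d -> 4d -> d)
    return attn_enc + attn_dec + cross + ffn

print(f"{rough_ops(n=1024, m=128):.3e}")   # long input, short output
print(f"{rough_ops(n=128,  m=128):.3e}")   # short input: the FFN term dominates
```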
Memory Requirements
Model parameters: O(L × d²) - constant during inference
Attention weight matrices: O(L × h × n²) kept for backpropagation during training; transient at inference
Decoder self-attention key/value cache: O(L × m × d) - grows linearly during generation (see the cache-size sketch below)
Cross-attention key/value cache: O(L × n × d) - computed once from the encoder outputs, then constant
Encoder and decoder activations: O(L × (n+m) × d) - the decoder share grows linearly during generation
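A rough cache-size estimate under these assumptions (standard incremental decoding with per-layer K and V tensors, FP16 storage; the helper below is illustrative):

```python
# Sketch: key/value cache size during generation for a BART-Large-sized model.
def kv_cache_bytes(n, m, d=1024, layers=12, batch=1, bytes_per_elem=2):
    self_attn  = 2 * layers * batch * m * d   # K and V; grows linearly with generated length m
    cross_attn = 2 * layers * batch * n * d   # computed once from the encoder, then fixed
    return (self_attn + cross_attn) * bytes_per_elem

print(f"{kv_cache_bytes(n=1024, m=142) / 2**20:.1f} MiB")   # summarization-sized example
```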
Performance Bottlenecks
Generation latency:
Auto-regressive generation requires m sequential steps
Each step requires full decoder forward pass
Cannot be parallelized across output positions at inference time
Memory bandwidth:
Attention key/value caches grow during generation
Frequent memory access patterns may exceed cache capacity
Encoder-decoder balance:
Encoder computes once; decoder computes m times
Load balancing between components affects utilization
System Optimization Strategies
Several approaches can optimize BART's performance in production:
Generation Optimization
Attention caching:
Cache key/value projections from previous decoding steps
Reduces redundant computation during generation
Critical for efficient auto-regressive decoding
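The effect of the cache can be observed by toggling use_cache in transformers' generate() (a rough, hardware-dependent comparison; assumes facebook/bart-base and a forced output length so both runs decode the same number of steps):

```python
import time
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()
batch = tok("Attention caching avoids recomputing keys and values.", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    model.generate(**batch, min_length=64, max_length=64, num_beams=1, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```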
Beam search optimization:
Batch multiple beams for parallel computation
Prune unlikely beams early to reduce computation
Optimize hypothesis tracking data structures
Length normalization:
Tune length penalty to control output size
Affects both quality and computation time
Encoder Optimizations
Encoder result caching:
Cache encoder outputs for frequently used inputs
Particularly effective for retrieval/QA systems
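A minimal memoization sketch, assuming the Hugging Face transformers library; the dictionary cache and helper name are hypothetical, and in a real serving path the cached encoder states would be fed to the decoder as in the manual loop shown earlier:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

_encoder_cache = {}   # text -> (attention_mask, encoder hidden states)

def encode_cached(text: str):
    if text not in _encoder_cache:
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            _encoder_cache[text] = (batch.attention_mask, model.get_encoder()(**batch))
    return _encoder_cache[text]

# First call pays the O(n^2) encoder cost; the second call is a cache hit.
mask, enc_out = encode_cached("A context passage that many requests share.")
mask, enc_out = encode_cached("A context passage that many requests share.")
```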
Attention pattern optimization:
Sparse attention patterns can reduce the O(n²) complexity
Particularly beneficial for long documents
Hardware-Specific Tuning
Mixed precision:
FP16/BF16 computation reduces memory bandwidth requirements
Can double throughput on compatible hardware
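A half-precision loading sketch, assuming a CUDA GPU with FP16 support and the facebook/bart-large checkpoint; output quality should be spot-checked against the FP32 baseline:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large", torch_dtype=torch.float16   # FP16 weights and activations
).to("cuda").eval()

batch = tok("Mixed precision halves weight and activation memory traffic.",
            return_tensors="pt").to("cuda")
out = model.generate(**batch, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))
```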
Model parallelism:
Shard encoder and decoder across multiple devices
Reduces per-device memory requirements
Introduces communication overhead
Kernel fusion:
Combine multiple small operations (e.g., bias add, activation, dropout)
Reduces memory round-trips
Practical Implementation Considerations
When implementing BART in production environments:
Memory Management
Dynamic batch sizing:
Adjust batch size based on sequence lengths
Maximizes GPU utilization while preventing OOM errors
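A sketch of token-budget batching; make_batches and its budget are hypothetical and would be tuned to the available GPU memory:

```python
def make_batches(texts, tokenizer, max_tokens_per_batch=8192, max_length=1024):
    """Group texts so that (longest length in batch) x (batch size) stays under a token budget."""
    lengths = [min(len(tokenizer(t).input_ids), max_length) for t in texts]
    order = sorted(range(len(texts)), key=lambda i: lengths[i])  # bucket similar lengths together
    batches, current, current_max = [], [], 0
    for i in order:
        longest = max(current_max, lengths[i])
        if current and longest * (len(current) + 1) > max_tokens_per_batch:
            batches.append([texts[j] for j in current])          # flush the current batch
            current, current_max = [], 0
            longest = lengths[i]
        current.append(i)
        current_max = longest
    if current:
        batches.append([texts[j] for j in current])
    return batches

# Usage (with any Hugging Face tokenizer):
# batches = make_batches(docs, AutoTokenizer.from_pretrained("facebook/bart-base"))
```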
Attention mechanism optimization:
Efficient implementations of multi-head attention
FP16 accumulation with selective FP32 conversion
Generation Control
Early stopping conditions:
Configurable criteria to terminate generation
Reduces unnecessary computation
Output filtering:
Post-processing to remove repetition/artifacts
Length constraints based on input characteristics
Input Preprocessing
Efficient tokenization:
Optimized byte-level BPE implementation
Batch tokenization for throughput
Input truncation strategies:
Document segmentation for long inputs
Affects encoding quality and performance
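A batched tokenization sketch with padding and truncation to BART's 1024-position limit, assuming the Hugging Face transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
docs = ["A short request.", "A much longer document " * 400]
batch = tok(docs, padding=True, truncation=True, max_length=1024, return_tensors="pt")
print(batch.input_ids.shape)   # [2, 1024]: the long document is truncated, the short one padded
```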
Conclusion
BART represents a versatile and powerful sequence-to-sequence architecture that combines the strengths of bidirectional encoding and auto-regressive decoding. Its dual-component design enables both understanding and generation capabilities but introduces computational challenges, particularly during the generation phase.
The encoder-decoder architecture creates distinct performance characteristics compared to encoder-only (BERT) or decoder-only (GPT) models. The sequential nature of generation remains a fundamental bottleneck, but techniques like attention caching, mixed precision computation, and hardware-specific optimizations can significantly improve performance.
For system implementations, careful attention to memory management, generation control, and preprocessing optimizations is essential to achieve maximum efficiency while maintaining output quality.
In the next BART article I will train BART for different use cases. Stay tuned…