BART Architecture: Runtime Analysis and Implementation Considerations

BART (Bidirectional and Auto-Regressive Transformers) is Facebook AI's approach to sequence-to-sequence modeling, combining the bidirectional encoding of BERT with the auto-regressive decoding of GPT. This article examines BART's architecture, computational flow, and performance characteristics from a systems implementation perspective.
BART Architecture Overview
BART follows an encoder-decoder structure that differs from models like BERT by enabling both understanding and generation tasks:
Encoder: 6 or 12 bidirectional Transformer layers (similar to BERT)
Decoder: 6 or 12 auto-regressive Transformer layers (similar to GPT)
Cross-attention: Mechanism connecting encoder and decoder
Standard configurations include:
BART-Base: 6 encoder/decoder layers, 768 hidden dimensions, 12 attention heads (~140M parameters)
BART-Large: 12 encoder/decoder layers, 1024 hidden dimensions, 16 attention heads (~400M parameters)
Key architectural features include:
50,265 token vocabulary based on RoBERTa's byte-level BPE tokenizer
Learned positional embeddings (up to 1024 positions)
Layer normalization applied to the summed token and position embeddings in both encoder and decoder
Post-layer-norm Transformer blocks (layer normalization follows each residual connection)
These hyperparameters can be read directly from the released configurations, as in the sketch below
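A minimal configuration check, assuming the Hugging Face transformers library and the facebook/bart-base and facebook/bart-large checkpoints on the Hub:

```python
# Sketch: inspect the released BART configurations (assumes `transformers` is installed).
from transformers import AutoConfig

for name in ["facebook/bart-base", "facebook/bart-large"]:
    cfg = AutoConfig.from_pretrained(name)
    print(
        name,
        f"enc_layers={cfg.encoder_layers}",
        f"dec_layers={cfg.decoder_layers}",
        f"d_model={cfg.d_model}",
        f"heads={cfg.encoder_attention_heads}",
        f"vocab={cfg.vocab_size}",
        f"max_pos={cfg.max_position_embeddings}",
    )
```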
Computational Pipeline: From Input to Output
BART's dual-component architecture creates a distinct processing flow:
1. Input Processing
Tokenization: Text converted to byte-level BPE tokens using the RoBERTa/GPT-2 tokenizer (see the sketch below)
Special Tokens: <s> and </s> added to mark sequence boundaries
Vocabulary Mapping: Tokens mapped to IDs from the 50,265-token vocabulary
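A short tokenization sketch, assuming the Hugging Face transformers library and the facebook/bart-base checkpoint, showing the boundary tokens and vocabulary IDs described above:

```python
# Sketch: byte-level BPE tokenization with the BART tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
enc = tok("BART combines a bidirectional encoder with an autoregressive decoder.")
print(enc["input_ids"])                              # IDs from the 50,265-token vocabulary
print(tok.convert_ids_to_tokens(enc["input_ids"]))   # starts with '<s>', ends with '</s>'
```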
2. Encoder Stack
Embedding Layer:
Token embeddings: Maps tokens to 768/1024-dimensional vectors
Position embeddings: Adds learned positional information
Embedding addition and normalization
Dropout (p=0.1) during training
Encoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):
Self-Attention Sublayer:
Input: Previous layer output (or embedding output for first layer)
Computes multi-head self-attention (all tokens attend to all tokens)
Layer normalization and residual connection
Feed-Forward Sublayer:
Two linear transformations with GELU activation
Dimensions: hidden_size → 4*hidden_size → hidden_size
Layer normalization and residual connection
Encoder Output:
Sequence of contextualized representations for the input tokens (shape: [batch_size, enc_seq_length, hidden_size]), already normalized by the last layer's post-sublayer layer norms (a minimal layer sketch follows below)
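The layer structure above can be summarized in a short PyTorch sketch. This illustrates the post-LN layout with a GELU feed-forward block using BART-Base dimensions; it is not the exact fairseq/transformers implementation:

```python
# Minimal sketch of one post-LN BART-style encoder layer (illustrative only).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                       # BART uses GELU activations
            nn.Linear(4 * d_model, d_model),
        )
        self.ffn_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sublayer: residual connection, then layer norm (post-LN)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.attn_norm(x + self.dropout(attn_out))
        # Feed-forward sublayer: residual connection, then layer norm
        x = self.ffn_norm(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 16, 768)        # [batch_size, enc_seq_length, hidden_size]
print(EncoderLayer()(x).shape)     # torch.Size([2, 16, 768])
```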
3. Decoder Stack
Embedding Layer:
Token embeddings: Maps target tokens to vectors
Position embeddings: Adds position information
Embedding addition and normalization
Dropout (p=0.1) during training
Decoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):
Masked Self-Attention Sublayer:
Causal masking so each position attends only to itself and earlier positions
Multi-head attention, normalization, and residual connection
Cross-Attention Sublayer:
Attends to encoder outputs (queries from decoder, keys/values from encoder)
Connects encoder and decoder information flow
Normalization and residual connection
Feed-Forward Sublayer:
Two linear transformations with GELU activation
Layer normalization and residual connection
Output Projection:
Linear projection of the final decoder states to the vocabulary size (50,265)
Weight tying with the input embeddings (shares parameters; see the sketch below)
Output during generation: probability distribution over the vocabulary for the next token
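The two decoder-specific pieces, causal masking and the weight-tied output projection, can be illustrated with a small PyTorch sketch (illustrative shapes only, not the library implementation):

```python
import torch
import torch.nn as nn

vocab_size, d_model, tgt_len = 50265, 768, 5

# Causal mask: position i may attend to positions <= i only (-inf blocks future positions)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
print(causal_mask)

# Weight tying: the output projection reuses the token-embedding matrix
tok_embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_embed.weight            # shared parameters

decoder_states = torch.randn(1, tgt_len, d_model)
logits = lm_head(decoder_states)             # [1, tgt_len, 50265] scores over the vocabulary
print(logits.shape)
```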
4. Generation Process
During inference, BART generates output auto-regressively:
Encoder processes the full input sequence
Decoder generates one token at a time
Each new token is appended to previously generated ones
Process continues until EOS token or maximum length is reached
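This flow can be made concrete with a bare-bones greedy decoding loop. The sketch assumes the Hugging Face transformers library and the facebook/bart-base checkpoint; it deliberately omits the key/value cache and beam search (which model.generate() provides) so the encoder-once / decoder-per-step structure stays visible:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

inputs = tok("BART is a sequence-to-sequence <mask>.", return_tensors="pt")
with torch.no_grad():
    enc_out = model.get_encoder()(**inputs)              # 1. encode the full input once
    dec_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(20):                                   # 2. one decoder pass per new token
        logits = model(encoder_outputs=enc_out,
                       attention_mask=inputs.attention_mask,
                       decoder_input_ids=dec_ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        dec_ids = torch.cat([dec_ids, next_id], dim=-1)   # 3. append the new token
        if next_id.item() == model.config.eos_token_id:   # 4. stop at EOS or max length
            break

print(tok.decode(dec_ids[0], skip_special_tokens=True))
```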
BART Training Methodology
BART's training process combines denoising pre-training with task-specific fine-tuning:
Pre-training Phase
Corruption Strategies:
Token masking: Random tokens replaced with [MASK]
Token deletion: Random tokens removed
Text infilling: Spans of text replaced with a single [MASK] token, with span lengths drawn from a Poisson distribution (λ=3); see the sketch after this list
Sentence permutation: Sentences shuffled in random order
Document rotation: Document rotated to start with random token
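A toy, word-level sketch of the text-infilling corruption is below; the real pre-training operates on BPE tokens, and the helper name and span-start probability are illustrative assumptions:

```python
# Sketch: text infilling - random spans replaced by a single mask token,
# with span lengths drawn from Poisson(lambda=3).
import numpy as np

def text_infilling(tokens, mask_token="<mask>", lam=3.0, start_prob=0.15, rng=None):
    rng = rng or np.random.default_rng(0)
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = int(rng.poisson(lam))
            out.append(mask_token)   # the whole span becomes one mask token
            i += span                # a 0-length span simply inserts a mask
        else:
            out.append(tokens[i])
            i += 1
    return out

words = "BART is trained by corrupting text and learning to reconstruct it".split()
print(text_infilling(words))   # model target during pre-training: the original, uncorrupted text
```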
Training Objective:
Reconstruct the original text from corrupted input
Cross-entropy loss between predicted and actual tokens
Training Parameters:
Adam optimizer with β₁=0.9, β₂=0.999
Learning rate: 3e-5 with polynomial decay
Dropout: 0.1
Attention dropout: 0.1
Weight decay: 0.01
Label smoothing: 0.1
Training corpus: Similar to RoBERTa (160GB of text)
Hardware: 256 NVIDIA V100 GPUs for several days
Fine-tuning Approaches
BART excels at different types of tasks with specific fine-tuning strategies:
Sequence Classification:
Encoder processes input
The same input is fed to the decoder, with an end-of-sequence token appended
Final decoder hidden state at that last token used for classification
All parameters updated during fine-tuning
Token Classification:
Encoder processes input
Final encoder outputs used for token-level predictions
Decoder not typically used
Sequence Generation:
Full encoder-decoder architecture used
Auto-regressive generation with beam search (beam size: 4-5)
Length penalty applied to favor longer or shorter outputs
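A generation sketch using the summarization-fine-tuned facebook/bart-large-cnn checkpoint (an example checkpoint, assuming the Hugging Face transformers library); the decoding hyperparameters shown mirror commonly used summarization settings for that model:

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

article = "Your long input document goes here. " * 20
batch = tok(article, truncation=True, max_length=1024, return_tensors="pt")
summary_ids = model.generate(
    **batch,
    num_beams=4,             # beam search, beam size 4
    length_penalty=2.0,      # >1.0 favors longer outputs
    max_length=142,
    min_length=56,
    no_repeat_ngram_size=3,  # simple repetition filter
    early_stopping=True,
)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```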
Machine Translation:
A small new encoder, trained from scratch, replaces BART's input embedding layer
Most original BART parameters are frozen during the first training stage
The new encoder maps the source language into representations that BART can decode into the target language
Runtime Performance Characteristics
BART's encoder-decoder architecture has specific performance implications:
Computational Complexity
For input sequence length n, output sequence length m, and hidden dimension d:
Encoder self-attention: O(n² × d) operations
Decoder self-attention: O(m² × d) operations (cumulative during generation)
Cross-attention: O(n × m × d) operations
Feed-forward networks: O((n+m) × d²) operations
Overall complexity: O(n² × d + m² × d + n × m × d + (n+m) × d²)
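A back-of-envelope calculator for these terms (illustrative constants only; real FLOP counts depend on implementation details such as projections and softmax):

```python
# Rough operation counts from the asymptotic terms above, for BART-Large defaults.
def rough_ops(n, m, d=1024, layers=12):
    attn_enc = layers * n * n * d            # encoder self-attention
    attn_dec = layers * m * m * d            # decoder self-attention (cumulative)
    cross    = layers * n * m * d            # cross-attention
    ffn      = layers * (n + m) * 4 * d * d  # feed-forward (d -> 4d -> d)
    return attn_enc + attn_dec + cross + ffn

print(f"{rough_ops(n=1024, m=128):.3e}")   # long input, short output
print(f"{rough_ops(n=128,  m=128):.3e}")   # short input: the FFN term dominates
```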
Memory Requirements
Model parameters: O(L × d²) - constant during inference
Attention weight matrices: O(L × h × n²) kept for backpropagation during training; transient at inference
Decoder self-attention key/value cache: O(L × m × d) - grows linearly during generation (see the cache-size sketch below)
Cross-attention key/value cache: O(L × n × d) - computed once from the encoder outputs, then constant
Encoder and decoder activations: O(L × (n+m) × d) - the decoder share grows linearly during generation
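A rough cache-size estimate under these assumptions (standard incremental decoding with per-layer K and V tensors, FP16 storage; the helper below is illustrative):

```python
# Sketch: key/value cache size during generation for a BART-Large-sized model.
def kv_cache_bytes(n, m, d=1024, layers=12, batch=1, bytes_per_elem=2):
    self_attn  = 2 * layers * batch * m * d   # K and V; grows linearly with generated length m
    cross_attn = 2 * layers * batch * n * d   # computed once from the encoder, then fixed
    return (self_attn + cross_attn) * bytes_per_elem

print(f"{kv_cache_bytes(n=1024, m=142) / 2**20:.1f} MiB")   # summarization-sized example
```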
Performance Bottlenecks
Generation latency:
Auto-regressive generation requires m sequential steps
Each step requires full decoder forward pass
Cannot be parallelized across output positions at inference time
Memory bandwidth:
Attention key/value caches grow during generation
Frequent memory access patterns may exceed cache capacity
Encoder-decoder balance:
Encoder computes once; decoder computes m times
Load balancing between components affects utilization
System Optimization Strategies
Several approaches can optimize BART's performance in production:
Generation Optimization
Attention caching:
Cache key/value projections from previous decoding steps
Reduces redundant computation during generation
Critical for efficient auto-regressive decoding
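The effect of the cache can be observed by toggling use_cache in transformers' generate() (a rough, hardware-dependent comparison; assumes facebook/bart-base and a forced output length so both runs decode the same number of steps):

```python
import time
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()
batch = tok("Attention caching avoids recomputing keys and values.", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    model.generate(**batch, min_length=64, max_length=64, num_beams=1, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```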
Beam search optimization:
Batch multiple beams for parallel computation
Prune unlikely beams early to reduce computation
Optimize hypothesis tracking data structures
Length normalization:
Tune length penalty to control output size
Affects both quality and computation time
Encoder Optimizations
Encoder result caching:
Cache encoder outputs for frequently used inputs
Particularly effective for retrieval/QA systems
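A minimal memoization sketch, assuming the Hugging Face transformers library; the dictionary cache and helper name are hypothetical, and in a real serving path the cached encoder states would be fed to the decoder as in the manual loop shown earlier:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

_encoder_cache = {}   # text -> (attention_mask, encoder hidden states)

def encode_cached(text: str):
    if text not in _encoder_cache:
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            _encoder_cache[text] = (batch.attention_mask, model.get_encoder()(**batch))
    return _encoder_cache[text]

# First call pays the O(n^2) encoder cost; the second call is a cache hit.
mask, enc_out = encode_cached("A context passage that many requests share.")
mask, enc_out = encode_cached("A context passage that many requests share.")
```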
Attention pattern optimization:
Sparse attention patterns can reduce the O(n²) complexity
Particularly beneficial for long documents
Hardware-Specific Tuning
Mixed precision:
FP16/BF16 computation reduces memory bandwidth requirements
Can double throughput on compatible hardware
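A half-precision loading sketch, assuming a CUDA GPU with FP16 support and the facebook/bart-large checkpoint; output quality should be spot-checked against the FP32 baseline:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large", torch_dtype=torch.float16   # FP16 weights and activations
).to("cuda").eval()

batch = tok("Mixed precision halves weight and activation memory traffic.",
            return_tensors="pt").to("cuda")
out = model.generate(**batch, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))
```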
Model parallelism:
Shard encoder and decoder across multiple devices
Reduces per-device memory requirements
Introduces communication overhead
Kernel fusion:
Combine multiple small operations (e.g., bias add, activation, dropout)
Reduces memory round-trips
Practical Implementation Considerations
When implementing BART in production environments:
Memory Management
Dynamic batch sizing:
Adjust batch size based on sequence lengths
Maximizes GPU utilization while preventing OOM errors
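A sketch of token-budget batching; make_batches and its budget are hypothetical and would be tuned to the available GPU memory:

```python
def make_batches(texts, tokenizer, max_tokens_per_batch=8192, max_length=1024):
    """Group texts so that (longest length in batch) x (batch size) stays under a token budget."""
    lengths = [min(len(tokenizer(t).input_ids), max_length) for t in texts]
    order = sorted(range(len(texts)), key=lambda i: lengths[i])  # bucket similar lengths together
    batches, current, current_max = [], [], 0
    for i in order:
        longest = max(current_max, lengths[i])
        if current and longest * (len(current) + 1) > max_tokens_per_batch:
            batches.append([texts[j] for j in current])          # flush the current batch
            current, current_max = [], 0
            longest = lengths[i]
        current.append(i)
        current_max = longest
    if current:
        batches.append([texts[j] for j in current])
    return batches

# Usage (with any Hugging Face tokenizer):
# batches = make_batches(docs, AutoTokenizer.from_pretrained("facebook/bart-base"))
```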
Attention mechanism optimization:
Efficient implementations of multi-head attention
FP16 accumulation with selective FP32 conversion
Generation Control
Early stopping conditions:
Configurable criteria to terminate generation
Reduces unnecessary computation
Output filtering:
Post-processing to remove repetition/artifacts
Length constraints based on input characteristics
Input Preprocessing
Efficient tokenization:
Optimized byte-level BPE implementation
Batch tokenization for throughput
Input truncation strategies:
Document segmentation for long inputs
Affects encoding quality and performance
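A batched tokenization sketch with padding and truncation to BART's 1024-position limit, assuming the Hugging Face transformers library:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
docs = ["A short request.", "A much longer document " * 400]
batch = tok(docs, padding=True, truncation=True, max_length=1024, return_tensors="pt")
print(batch.input_ids.shape)   # [2, 1024]: the long document is truncated, the short one padded
```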
Conclusion
BART represents a versatile and powerful sequence-to-sequence architecture that combines the strengths of bidirectional encoding and auto-regressive decoding. Its dual-component design enables both understanding and generation capabilities but introduces computational challenges, particularly during the generation phase.
The encoder-decoder architecture creates distinct performance characteristics compared to encoder-only (BERT) or decoder-only (GPT) models. The sequential nature of generation remains a fundamental bottleneck, but techniques like attention caching, mixed precision computation, and hardware-specific optimizations can significantly improve performance.
For system implementations, careful attention to memory management, generation control, and preprocessing optimizations is essential to achieve maximum efficiency while maintaining output quality.
In the next BART article I will train BART for different use cases. Stay tuned…