BART Architecture: Runtime Analysis and Implementation Considerations

BART (Bidirectional and Auto-Regressive Transformers) is Facebook AI's approach to sequence-to-sequence modeling, combining the bidirectional encoding of BERT with the auto-regressive decoding of GPT. This article examines BART's architecture, computational flow, and performance characteristics from a systems implementation perspective.

BART Architecture Overview

BART follows an encoder-decoder structure that differs from models like BERT by enabling both understanding and generation tasks:

  • Encoder: 6 or 12 bidirectional Transformer layers (similar to BERT)

  • Decoder: 6 or 12 auto-regressive Transformer layers (similar to GPT)

  • Cross-attention: Mechanism connecting encoder and decoder

Standard configurations include:

  • BART-Base: 6 encoder/decoder layers, 768 hidden dimensions, 12 attention heads (~140M parameters)

  • BART-Large: 12 encoder/decoder layers, 1024 hidden dimensions, 16 attention heads (~400M parameters)

Key architectural features include:

  • 50,265 token vocabulary based on RoBERTa's byte-level BPE tokenizer

  • Learned positional embeddings (up to 1024 positions)

  • Layer normalization applied to the embedding outputs in both encoder and decoder

  • Post-layer-norm Transformer blocks: layer normalization follows each sublayer's residual connection (BART does not add a separate final layer norm on top of the stacks)
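
These configuration values can be checked directly against the published checkpoints. The sketch below assumes the Hugging Face transformers library and the public facebook/bart-base and facebook/bart-large model IDs.

```python
# Inspect the published configurations (assumes Hugging Face `transformers`
# is installed and the facebook/bart-* checkpoints are reachable).
from transformers import BartConfig

for name in ("facebook/bart-base", "facebook/bart-large"):
    cfg = BartConfig.from_pretrained(name)
    print(
        name,
        cfg.encoder_layers,           # 6 (base) / 12 (large)
        cfg.decoder_layers,           # 6 (base) / 12 (large)
        cfg.d_model,                  # 768 (base) / 1024 (large)
        cfg.encoder_attention_heads,  # 12 (base) / 16 (large)
        cfg.vocab_size,               # 50265
        cfg.max_position_embeddings,  # 1024
    )
```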

Computational Pipeline: From Input to Output

BART's dual-component architecture creates a distinct processing flow:

1. Input Processing

  • Tokenization: Text converted to tokens with RoBERTa's byte-level BPE tokenizer (GPT-2 style, not SentencePiece)

  • Special Tokens: <s> and </s> added to mark sequence boundaries

  • Vocabulary Mapping: Tokens mapped to IDs from the 50,265-token vocabulary
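
As a concrete illustration of these three steps, the snippet below (assuming the Hugging Face transformers library) tokenizes a sentence and shows the <s>/</s> boundary tokens and their vocabulary IDs.

```python
# Tokenization, special tokens, and vocabulary mapping in one call
# (assumes Hugging Face `transformers`; <s> has ID 0, </s> has ID 2).
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
enc = tokenizer("BART pairs a BERT-style encoder with a GPT-style decoder.")
print(enc["input_ids"])                                   # starts with 0 (<s>), ends with 2 (</s>)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # ['<s>', ..., '</s>']
print(len(tokenizer))                                     # 50265
```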

2. Encoder Stack

  • Embedding Layer:

    • Token embeddings: Maps tokens to 768/1024-dimensional vectors

    • Position embeddings: Adds learned positional information

    • Embedding addition and normalization

    • Dropout (p=0.1) during training

  • Encoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):

    • Self-Attention Sublayer:

      • Input: Previous layer output (or embedding output for first layer)

      • Computes multi-head self-attention (all tokens attend to all tokens)

      • Layer normalization and residual connection

    • Feed-Forward Sublayer:

      • Two linear transformations with GELU activation (BART replaces the original Transformer's ReLU with GELU)

      • Dimensions: hidden_size → 4*hidden_size → hidden_size

      • Layer normalization and residual connection

  • Layer Output:

    • The last encoder layer's output is already normalized by its final post-sublayer layer norm; BART adds no separate final layer normalization

Output: Sequence of contextualized representations for input tokens (shape: [batch_size, enc_seq_length, hidden_size])
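
This shape can be verified by running only the encoder; a minimal sketch assuming PyTorch and the Hugging Face transformers library:

```python
# Run only the encoder stack and check the output shape
# [batch_size, enc_seq_length, hidden_size] (hidden_size = 768 for BART-Base).
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base").eval()

inputs = tokenizer("A short input sequence.", return_tensors="pt")
with torch.no_grad():
    enc_out = model.get_encoder()(**inputs)
print(enc_out.last_hidden_state.shape)  # e.g. torch.Size([1, enc_seq_length, 768])
```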

3. Decoder Stack

  • Embedding Layer:

    • Token embeddings: Maps target tokens to vectors

    • Position embeddings: Adds position information

    • Embedding addition and normalization

    • Dropout (p=0.1) during training

  • Decoder Transformer Layers (x6 for BART-Base, x12 for BART-Large):

    • Masked Self-Attention Sublayer:

      • Causal masking to ensure tokens only attend to previous positions

      • Multi-head attention, normalization, and residual connection

    • Cross-Attention Sublayer:

      • Attends to encoder outputs (queries from decoder, keys/values from encoder)

      • Connects encoder and decoder information flow

      • Normalization and residual connection

    • Feed-Forward Sublayer:

      • Two linear transformations with GELU activation

      • Layer normalization and residual connection

  • Decoder Output:

    • The last decoder layer's output, already normalized by its final post-sublayer layer norm, feeds the output projection

  • Output Projection:

    • Linear projection to vocabulary size (50,265)

    • Weight-tying with input embeddings (shares parameters)

Output during generation: Probability distribution over vocabulary for next token
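
The weight tying mentioned above can be verified directly; a small check assuming the Hugging Face transformers library:

```python
# The LM head shares its weight matrix with the shared token embedding table.
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
print(model.lm_head.weight.shape)  # torch.Size([50265, 768])
print(model.lm_head.weight.data_ptr() == model.model.shared.weight.data_ptr())  # True
```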

4. Generation Process

During inference, BART generates output auto-regressively:

  1. Encoder processes the full input sequence

  2. Decoder generates one token at a time

  3. Each new token is appended to previously generated ones

  4. Process continues until EOS token or maximum length is reached
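
In practice this loop is wrapped by the generate API. The sketch below (Hugging Face transformers, using the public facebook/bart-large-cnn summarization checkpoint) encodes once and then decodes auto-regressively with beam search.

```python
# End-to-end generation: encode once, then decode token by token until
# EOS or max_length (assumes Hugging Face `transformers` and PyTorch).
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

article = "Replace this with the document you want to summarize ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,        # beam search, as typically used for summarization
        max_length=128,     # hard stop if EOS is not produced earlier
        early_stopping=True,
    )
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```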

BART Training Methodology

BART's training process combines denoising pre-training with task-specific fine-tuning:

Pre-training Phase

  • Corruption Strategies:

    • Token masking: Random tokens replaced with [MASK]

    • Token deletion: Random tokens removed

    • Text infilling: Spans of text replaced with single [MASK]

    • Sentence permutation: Sentences shuffled in random order

    • Document rotation: Document rotated to start with random token

  • Training Objective:

    • Reconstruct the original text from corrupted input

    • Cross-entropy loss between predicted and actual tokens

  • Training Parameters:

    • Adam optimizer with β₁=0.9, β₂=0.999

    • Learning rate: 3e-5 with polynomial decay

    • Dropout: 0.1

    • Attention dropout: 0.1

    • Weight decay: 0.01

    • Label smoothing: 0.1

    • Training corpus: Similar to RoBERTa (160GB of text)

    • Hardware: 256 NVIDIA V100 GPUs for several days
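
To make the corruption concrete, here is an illustrative text-infilling sketch in Python. It is not the original fairseq implementation; the 30% mask ratio and Poisson(λ=3) span lengths follow the paper's description, but the function itself is a simplified stand-in.

```python
# Illustrative text infilling: random spans (Poisson-distributed lengths)
# collapse into a single <mask>; the model must reconstruct the original
# sequence, trained with token-level cross-entropy.
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.0, seed=0):
    rng = np.random.default_rng(seed)
    budget = int(round(mask_ratio * len(tokens)))  # roughly this many tokens get removed
    corrupted, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = max(1, int(rng.poisson(poisson_lambda)))  # zero-length spans (pure insertion) omitted here
            corrupted.append(mask_token)                     # the whole span becomes one <mask>
            i += span
            budget -= span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```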

Fine-tuning Approaches

BART excels at different types of tasks with specific fine-tuning strategies:

  1. Sequence Classification:

    • The same input is fed to both the encoder and the decoder, with an end-of-sequence token appended

    • The final decoder hidden state at that last token is used for classification, analogous to BERT's [CLS] but placed at the end so it can attend over the complete input (a minimal sketch follows this list)

    • All parameters updated during fine-tuning

  2. Token Classification:

    • Encoder processes input

    • Final encoder outputs used for token-level predictions

    • Decoder not typically used

  3. Sequence Generation:

    • Full encoder-decoder architecture used

    • Auto-regressive generation with beam search (beam size: 4-5)

    • Length penalty applied to favor longer or shorter outputs

  4. Machine Translation:

    • A new, randomly initialized encoder replaces BART's token embedding layer

    • In a first training stage most BART parameters are frozen and mainly the new encoder is updated; a short second stage fine-tunes all parameters

    • The new encoder maps the source language into representations that the pre-trained BART model can decode into English
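
For case 1 (sequence classification), a minimal fine-tuning step might look like the following. This assumes the Hugging Face transformers library, whose BartForSequenceClassification uses the decoder hidden state at the end-of-sequence position for its classification head, matching the description above.

```python
# Sequence classification: the same input goes through encoder and decoder,
# and the decoder state at the final </s> token feeds a classification head.
import torch
from transformers import BartTokenizer, BartForSequenceClassification

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
labels = torch.tensor([1])
outputs = model(**inputs, labels=labels)          # all parameters receive gradients
outputs.loss.backward()                           # one fine-tuning step (optimizer omitted)
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, torch.Size([1, 2])
```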

Runtime Performance Characteristics

BART's encoder-decoder architecture has specific performance implications:

Computational Complexity

For input sequence length n, output sequence length m, and hidden dimension d:

  • Encoder self-attention: O(n² × d) operations

  • Decoder self-attention: O(m² × d) operations (cumulative during generation)

  • Cross-attention: O(n × m × d) operations

  • Feed-forward networks: O((n+m) × d²) operations

  • Overall complexity: O(n² × d + m² × d + n × m × d + (n+m) × d²)
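
The sketch below turns these terms into a rough, order-of-magnitude cost estimate; the constants are deliberately simplified (heads, softmax, and layer norms are ignored), so it only shows how the terms scale with n, m, and d.

```python
# Rough per-forward-pass cost from the complexity terms above (illustrative only).
def bart_cost_estimate(n, m, d, layers=12):
    enc_self_attn = layers * n * n * d
    dec_self_attn = layers * m * m * d
    cross_attn    = layers * n * m * d
    feed_forward  = layers * (n + m) * 4 * d * d   # hidden_size -> 4*hidden_size -> hidden_size
    return enc_self_attn + dec_self_attn + cross_attn + feed_forward

# BART-Large summarizing a 1024-token input into 128 output tokens.
print(f"{bart_cost_estimate(n=1024, m=128, d=1024, layers=12):.2e}")
```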

Memory Requirements

  • Model parameters: O(L × d²) - constant during inference

  • Attention weight matrices (training or a full forward pass): O(L × h × n²) for encoder self-attention, O(L × h × m²) for decoder self-attention, and O(L × h × n × m) for cross-attention, where h is the number of attention heads

  • Decoder key/value cache (incremental decoding): O(L × m × d) - grows linearly as tokens are generated

  • Cross-attention key/value cache: O(L × n × d) - computed once from the encoder outputs, then fixed

  • Encoder and decoder activations: O(L × (n+m) × d) - the decoder share grows linearly during generation
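
A back-of-the-envelope calculation of the key/value caches during incremental decoding, under illustrative assumptions (FP16 values, batch size 1):

```python
# Key/value cache size for incremental decoding (illustrative: FP16, batch size 1).
def kv_cache_bytes(layers, n, m, d, bytes_per_value=2):
    decoder_kv = layers * 2 * m * d * bytes_per_value  # keys + values, grows with each generated token
    cross_kv   = layers * 2 * n * d * bytes_per_value  # computed once from the encoder outputs
    return decoder_kv + cross_kv

# BART-Large: 12 decoder layers, d = 1024, 1024 input tokens, 128 generated tokens.
print(kv_cache_bytes(layers=12, n=1024, m=128, d=1024) / 2**20, "MiB")
```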

Performance Bottlenecks

  1. Generation latency:

    • Auto-regressive generation requires m sequential steps

    • Each step requires a pass through every decoder layer (only for the newest token when key/value caching is used)

    • Steps cannot be parallelized across output positions, unlike training with teacher forcing

  2. Memory bandwidth:

    • Attention key/value caches grow during generation

    • Reading these growing caches at every step tends to make decoding memory-bandwidth-bound rather than compute-bound

  3. Encoder-decoder balance:

    • Encoder computes once; decoder computes m times

    • Load balancing between components affects utilization

System Optimization Strategies

Several approaches can optimize BART's performance in production:

Generation Optimization

  1. Attention caching:

    • Cache key/value projections from previous decoding steps

    • Reduces redundant computation during generation

    • Critical for efficient auto-regressive decoding

  2. Beam search optimization:

    • Batch multiple beams for parallel computation

    • Prune unlikely beams early to reduce computation

    • Optimize hypothesis tracking data structures

  3. Length normalization:

    • Tune length penalty to control output size

    • Affects both quality and computation time
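
The first point (attention caching) is what generate performs internally via use_cache; the hedged sketch below spells out a greedy decoding loop with past_key_values so the caching is visible. With an un-fine-tuned checkpoint the output is only a denoising reconstruction, which is enough to illustrate the mechanics.

```python
# Greedy decoding with explicit key/value caching: the encoder runs once,
# and each decoder step only processes the newest token (assumes Hugging Face
# `transformers` and PyTorch; `generate()` does the same caching internally).
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

inputs = tokenizer("An example input sequence.", return_tensors="pt")
encoder_outputs = model.get_encoder()(**inputs)           # computed exactly once

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
past_key_values = None
with torch.no_grad():
    for _ in range(20):
        out = model(
            encoder_outputs=encoder_outputs,
            attention_mask=inputs["attention_mask"],
            decoder_input_ids=decoder_input_ids[:, -1:],  # only the newest token each step
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = out.past_key_values             # cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```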

Encoder Optimizations

  1. Encoder result caching:

    • Cache encoder outputs for frequently used inputs

    • Particularly effective for retrieval/QA systems

  2. Attention pattern optimization:

    • Sparse attention patterns can reduce the O(n²) complexity

    • Particularly beneficial for long documents

Hardware-Specific Tuning

  1. Mixed precision:

    • FP16/BF16 computation reduces memory bandwidth requirements

    • Can double throughput on compatible hardware

  2. Model parallelism:

    • Shard encoder and decoder across multiple devices

    • Reduces per-device memory requirements

    • Introduces communication overhead

  3. Kernel fusion:

    • Combine multiple small operations (e.g., bias add, activation, dropout)

    • Reduces memory round-trips
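
For the first point, a minimal mixed-precision inference sketch (requires a CUDA-capable GPU; the checkpoint and generation settings are illustrative):

```python
# FP16 inference: load weights in half precision and run generation under autocast.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained(
    "facebook/bart-large-cnn", torch_dtype=torch.float16
).to("cuda").eval()

inputs = tokenizer("Document to summarize ...", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```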

Practical Implementation Considerations

When implementing BART in production environments:

Memory Management

  • Dynamic batch sizing:

    • Adjust batch size based on sequence lengths

    • Maximizes GPU utilization while preventing OOM errors

  • Attention mechanism optimization:

    • Efficient implementations of multi-head attention

    • FP16 accumulation with selective FP32 conversion
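
A simple token-budget batching scheme captures the dynamic batch sizing idea: sort by length, then pack sequences so the padded batch size stays under a fixed budget. The helper below is hypothetical, not a library API.

```python
# Pack length-sorted sequences into batches whose padded size
# (batch_size * longest_sequence) stays under a fixed token budget.
def token_budget_batches(texts, lengths, max_tokens=8192):
    order = sorted(range(len(texts)), key=lambda i: lengths[i])
    batches, current, current_max = [], [], 0
    for i in order:
        new_max = max(current_max, lengths[i])
        if current and (len(current) + 1) * new_max > max_tokens:
            batches.append(current)          # flush: adding this item would exceed the budget
            current, new_max = [], lengths[i]
        current.append(texts[i])
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Example: padded batch cost stays roughly constant despite mixed lengths.
docs = ["a" * k for k in (10, 20, 900, 950, 30)]
print([len(b) for b in token_budget_batches(docs, [len(d) for d in docs], max_tokens=2000)])
```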

Generation Control

  • Early stopping conditions:

    • Configurable criteria to terminate generation

    • Reduces unnecessary computation

  • Output filtering:

    • Post-processing to remove repetition/artifacts

    • Length constraints based on input characteristics
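
Most of these controls map directly onto generate arguments in Hugging Face transformers; the values below are illustrative rather than recommendations.

```python
# Generation control: length bounds, early stopping, and repetition filtering.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

inputs = tokenizer("Document to summarize ...", return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    ids = model.generate(
        inputs["input_ids"],
        min_length=56,             # forbid EOS before this many tokens
        max_length=142,            # hard cap on output length
        num_beams=4,
        length_penalty=2.0,        # >1.0 nudges beam search toward longer outputs
        no_repeat_ngram_size=3,    # block repeated trigrams instead of post-hoc filtering
        early_stopping=True,       # stop once enough beams have finished
    )
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```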

Input Preprocessing

  • Efficient tokenization:

    • Optimized byte-level BPE implementation

    • Batch tokenization for throughput

  • Input truncation strategies:

    • Document segmentation for long inputs

    • Affects encoding quality and performance
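
For the tokenization point, the fast (Rust-backed) tokenizer handles batching, padding, and truncation to BART's 1024-position limit in a single call; a minimal sketch assuming Hugging Face transformers:

```python
# Batch tokenization with padding and truncation to the 1024-position limit.
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
batch = tokenizer(
    ["first document ...", "a much longer second document ..."],
    padding=True,          # pad to the longest sequence in the batch
    truncation=True,       # cut anything beyond max_length
    max_length=1024,       # BART's learned positional embedding limit
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```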

Conclusion

BART represents a versatile and powerful sequence-to-sequence architecture that combines the strengths of bidirectional encoding and auto-regressive decoding. Its dual-component design enables both understanding and generation capabilities but introduces computational challenges, particularly during the generation phase.

The encoder-decoder architecture creates distinct performance characteristics compared to encoder-only (BERT) or decoder-only (GPT) models. The sequential nature of generation remains a fundamental bottleneck, but techniques like attention caching, mixed precision computation, and hardware-specific optimizations can significantly improve performance.

For system implementations, careful attention to memory management, generation control, and preprocessing optimizations is essential to achieve maximum efficiency while maintaining output quality.

In the next BART article I will train BART for different use cases. Stay tuned…
