Transformer Architectures Across Frameworks: TensorFlow vs. PyTorch


This discussion is based on the post at https://medium.com/lexiconia/transformers-tensorflow-vs-pytorch-implementation-3f4e5a7239e3
[1] Similarities Between TensorFlow and PyTorch Implementations
Model Architecture
Both implementations follow the standard Transformer architecture:
- Tokenization and padding
- Positional encoding
- Multi-head attention
- Feed-forward neural networks
- Layer normalization
- Encoder/decoder stacking
Functionality
Core functionalities such as positional_encoding, create_padding_mask, multi-head attention, and the encoder layer are implemented in both frameworks. Masking is used in both to handle padding tokens properly.
Math and Logic
Matrix multiplications, softmax, and scaling in attention are handled similarly.
The computation of angle rates and sine/cosine functions for positional encoding is mathematically identical.
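To make the angle-rate computation concrete, here is a short, framework-neutral NumPy sketch of the sinusoidal positional encoding; the max_len and d_model values are illustrative placeholders, not taken from the article's code.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                     # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])            # even indices get sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])            # odd indices get cosine
    return angles[np.newaxis, ...]                       # (1, max_len, d_model) for broadcasting

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (1, 50, 128)
```

Both frameworks then use the same table by slicing it to the current sequence length and adding it to the token embeddings.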
[2] Differences Between TensorFlow and PyTorch Implementations
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Syntax | Uses tf.keras.layers, tf.Tensor, @tf.function (if optimized) | Uses torch.nn.Module, torch.Tensor, and decorators such as @torch.no_grad or @staticmethod |
| Layer Definition | Inherits from tf.keras.layers.Layer | Inherits from torch.nn.Module |
| Sequential Models | tf.keras.Sequential([...]) | torch.nn.Sequential(...) |
| Tensor Operations | tf.reshape, tf.transpose, broadcasting via tf.expand_dims | torch.reshape, torch.permute, unsqueeze |
| Training Paradigm | High-level with model.compile() and model.fit() | Low-level training loop with manual optimizer.zero_grad(), loss.backward(), and optimizer.step() |
| Data Handling | TensorFlow uses tf.data.Dataset | PyTorch uses torch.utils.data.DataLoader |
| Device Management | Less explicit (eager mode runs on CPU/GPU transparently) | Explicit: models and tensors must be moved with model.to(device) and tensor.to(device) |
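The Layer Definition and Device Management rows can be seen together in a minimal sketch of the same trivial block written in both frameworks; the block itself (a single dense layer) is a hypothetical stand-in, not code from the article.

```python
import tensorflow as tf
import torch
import torch.nn as nn

# TensorFlow: subclass tf.keras.layers.Layer and implement call()
class TinyBlockTF(tf.keras.layers.Layer):
    def __init__(self, d_model=64):
        super().__init__()
        self.dense = tf.keras.layers.Dense(d_model, activation="relu")

    def call(self, x):
        return self.dense(x)

# PyTorch: subclass nn.Module and implement forward()
class TinyBlockTorch(nn.Module):
    def __init__(self, d_in=64, d_model=64):
        super().__init__()
        self.linear = nn.Linear(d_in, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

# TF eager mode places ops on CPU/GPU transparently; PyTorch needs explicit moves.
y_tf = TinyBlockTF()(tf.random.normal((2, 10, 64)))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
block = TinyBlockTorch().to(device)
y_pt = block(torch.randn(2, 10, 64, device=device))
```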
Side-by-side comparison of the EncoderLayer and Transformer classes in TensorFlow and PyTorch
1. EncoderLayer Comparison
| Aspect | TensorFlow (tf.keras.layers.Layer) | PyTorch (torch.nn.Module) |
| --- | --- | --- |
| Inheritance | class EncoderLayer(tf.keras.layers.Layer) | class EncoderLayer(nn.Module) |
| Attention Layer | Custom MultiHeadAttention layer with a call() method | Custom MultiHeadAttention layer with a forward() method |
| Feedforward (FFN) | tf.keras.Sequential([...]) | nn.Sequential(...) |
| Normalization | tf.keras.layers.LayerNormalization(epsilon=1e-6) | nn.LayerNorm(d_model, eps=1e-6) |
| Forward Call | def call(self, x, mask=None) | def forward(self, x, mask=None) |
| Dropout | tf.keras.layers.Dropout(0.2) | nn.Dropout(0.2) |
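As a concrete reference for the PyTorch column, here is a minimal EncoderLayer sketch under two assumptions: the built-in nn.MultiheadAttention stands in for the article's custom MultiHeadAttention class, and the hyperparameter defaults (d_model, num_heads, dff) are illustrative rather than taken from the original code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=128, num_heads=8, dff=512, dropout=0.2):
        super().__init__()
        # Built-in attention used here in place of the article's custom MultiHeadAttention
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dff),
            nn.ReLU(),
            nn.Linear(dff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model, eps=1e-6)
        self.norm2 = nn.LayerNorm(d_model, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sublayer with residual connection and layer normalization
        attn_out, _ = self.mha(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sublayer with residual connection
        ffn_out = self.ffn(x)
        return self.norm2(x + self.dropout(ffn_out))
```

The TensorFlow version differs mainly in naming: tf.keras.layers.Layer, call(), tf.keras.Sequential, and LayerNormalization(epsilon=1e-6).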
2. Transformer Class Comparison
| Aspect | TensorFlow (Transformer(tf.keras.Model)) | PyTorch (Transformer(nn.Module)) |
| --- | --- | --- |
| Embedding | tf.keras.layers.Embedding(...) | nn.Embedding(...) |
| Positional Encoding | x += self.positional_encoding[:, :tf.shape(x)[1], :] | x += self.positional_encoding[:, :x.size(1), :] |
| Stacking Encoder Layers | Python list: [EncoderLayer(...) for _ in range(n)] | nn.ModuleList([EncoderLayer(...) for _ in range(n)]) |
| Output Layer | tf.keras.layers.Dense(input_vocab_size) | nn.Linear(d_model, input_vocab_size) |
| Forward Pass | def call(self, inputs) | def forward(self, inputs) |
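Continuing the sketch, a minimal PyTorch Transformer (encoder-only, matching the table's rows) could look like the following; it reuses the EncoderLayer sketched above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, input_vocab_size, d_model=128, num_layers=2, max_len=50):
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, d_model)

        # Precompute the sinusoidal table and register it as a (non-trainable) buffer
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)     # (1, d_model)
        angles = pos / torch.pow(10000.0, (2 * (i // 2)) / d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles[:, 0::2])
        pe[:, 1::2] = torch.cos(angles[:, 1::2])
        self.register_buffer("positional_encoding", pe.unsqueeze(0))    # (1, max_len, d_model)

        # nn.ModuleList keeps the stacked layers visible to .parameters() and .to(device)
        self.enc_layers = nn.ModuleList(
            [EncoderLayer(d_model) for _ in range(num_layers)]  # EncoderLayer from the sketch above
        )
        self.final_layer = nn.Linear(d_model, input_vocab_size)

    def forward(self, inputs, mask=None):
        x = self.embedding(inputs)                            # (batch, seq, d_model)
        x = x + self.positional_encoding[:, :x.size(1), :]    # slice to the actual sequence length
        for layer in self.enc_layers:
            x = layer(x, mask)
        return self.final_layer(x)                            # logits over the input vocabulary
```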
Summary of Key Differences
- Method Names: TensorFlow uses call() for forward logic; PyTorch uses forward().
- Layer Definition: TensorFlow layers are Keras-based; PyTorch builds on torch.nn.
- Weight Registration: PyTorch needs nn.ModuleList to track submodules (see the sketch after this list).
- Execution: TensorFlow is graph-based (eager by default now); PyTorch is natively eager.
Training Routine Comparison
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Dataset Handling | Python list, tokenized and padded, then converted with tf.convert_to_tensor() | Not shown in the current code, but torch.tensor(...) would be used for similar data |
| Loss Function | tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none") | Typically nn.CrossEntropyLoss(reduction='none') |
| Masking | Applies a mask on the loss with tf.math.logical_not and scales the losses accordingly | Would use (targets != pad_idx) boolean masks with PyTorch tensor operations |
| Gradient Calculation | with tf.GradientTape() as tape: for automatic differentiation | loss.backward() in the standard autograd flow |
| Optimizer | tf.keras.optimizers.Adam(...) | torch.optim.Adam(...) |
| Weight Update | tape.gradient(...) and optimizer.apply_gradients(...) | optimizer.step() after optimizer.zero_grad() |
| Epoch Logging | if epoch % 50 == 0: print(...) | Same idea, implemented manually |
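To make the PyTorch column concrete, here is a minimal, self-contained training loop following the same pattern: per-token cross-entropy with reduction='none', a padding mask, Adam, and periodic logging. The stand-in model, toy batch, and pad_idx = 0 are hypothetical; the article's actual Transformer and data would slot in their place.

```python
import torch
import torch.nn as nn

pad_idx = 0
vocab_size, d_model = 100, 32

# Stand-in model: embedding + linear head (the real Transformer would go here)
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(reduction="none")  # per-token losses so padding can be masked

# Toy batch: (batch, seq_len) token ids, with 0 used as padding
inputs = torch.randint(1, vocab_size, (4, 10))
targets = inputs.clone()
targets[:, 7:] = pad_idx  # pretend the tail of each sequence is padding

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(inputs)                                       # (batch, seq, vocab)
    loss_per_token = criterion(logits.transpose(1, 2), targets)  # (batch, seq)
    mask = (targets != pad_idx).float()                          # 1 for real tokens, 0 for padding
    loss = (loss_per_token * mask).sum() / mask.sum()            # average over non-pad tokens only
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```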
Inference Routine Comparison
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Inputs | Sentence string → tokenized → padded → tensor | Same concept; would use torch.tensor(...) |
| Beam Search | Implemented as a custom function using tf.nn.softmax(logits) | Would require similar logic using torch.softmax(...) |
| Model Prediction | outputs = transformer(test_tensor) | with torch.no_grad(): outputs = transformer(inputs) |
| Output Processing | Beam search used to get the best sequence of token indices | Same idea would be used in PyTorch |
| Detokenization | Uses detokenize() to convert predicted tokens back to a string | Same function reused in both implementations |
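A matching PyTorch inference sketch is shown below; greedy decoding with torch.softmax stands in for the article's beam search, and the trained transformer model, tokenized input, and padding length are assumed rather than taken from the original code.

```python
import torch

def predict(transformer, token_ids, max_len=10, pad_idx=0):
    """Greedy stand-in for the article's beam search; returns the best token id per position."""
    ids = token_ids + [pad_idx] * (max_len - len(token_ids))   # pad to a fixed length
    inputs = torch.tensor([ids])                               # (1, max_len) batch of one
    transformer.eval()
    with torch.no_grad():                                      # disable gradient tracking at inference
        logits = transformer(inputs)                           # (1, max_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)
    return probs.argmax(dim=-1).squeeze(0).tolist()            # detokenize() would map these back to text
```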
Summary of Training & Inference Differences
| Feature | TensorFlow | PyTorch |
| --- | --- | --- |
| Autodiff | tf.GradientTape() | loss.backward() |
| Step control | optimizer.apply_gradients(...) | optimizer.step() |
| Execution | Implicit graph-based | Explicit control |
| Inference context | No special decorator needed | Use torch.no_grad() to disable gradients |