Transformer Architectures Across Frameworks: TensorFlow vs. PyTorch


This discussion is based on the post at https://medium.com/lexiconia/transformers-tensorflow-vs-pytorch-implementation-3f4e5a7239e3
[1] Similarities Between TensorFlow and PyTorch Implementations
Model Architecture
Both implementations follow the standard Transformer architecture:
- Tokenization and padding
- Positional encoding
- Multi-head attention
- Feed-forward neural networks
- Layer normalization
- Encoder/decoder stacking
Functionality
Core functionalities such as positional_encoding, create_padding_mask, multi-head attention, and the encoder layer are implemented in both frameworks. Masking is used in both to handle padding tokens properly.
Math and Logic
Matrix multiplications, softmax, and scaling in attention are handled similarly.
The computation of angle rates and sine/cosine functions for positional encoding is mathematically identical.
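To make the angle-rate computation concrete, here is a short, framework-neutral NumPy sketch of the sinusoidal positional encoding; the max_len and d_model values are illustrative placeholders, not taken from the article's code.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                     # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])            # even indices get sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])            # odd indices get cosine
    return angles[np.newaxis, ...]                       # (1, max_len, d_model) for broadcasting

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (1, 50, 128)
```

Both frameworks then use the same table by slicing it to the current sequence length and adding it to the token embeddings.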
[2] Differences Between TensorFlow and PyTorch Implementations
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Syntax | Uses tf.keras.layers, tf.Tensor, @tf.function (if optimized) | Uses torch.nn.Module, torch.Tensor, and decorators such as @torch.no_grad or @staticmethod |
| Layer Definition | Inherits from tf.keras.layers.Layer | Inherits from torch.nn.Module |
| Sequential Models | tf.keras.Sequential([...]) | torch.nn.Sequential(...) |
| Tensor Operations | tf.reshape, tf.transpose, broadcasting via tf.expand_dims | torch.reshape, torch.permute, unsqueeze |
| Training Paradigm | High-level with model.compile() and model.fit() | Low-level training loop with manual optimizer.zero_grad(), loss.backward(), and optimizer.step() |
| Data Handling | TensorFlow uses tf.data.Dataset | PyTorch uses torch.utils.data.DataLoader |
| Device Management | Less explicit (eager mode runs on CPU/GPU transparently) | Explicit: models and tensors must be moved with model.to(device) and tensor.to(device) |
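The Layer Definition and Device Management rows can be seen together in a minimal sketch of the same trivial block written in both frameworks; the block itself (a single dense layer) is a hypothetical stand-in, not code from the article.

```python
import tensorflow as tf
import torch
import torch.nn as nn

# TensorFlow: subclass tf.keras.layers.Layer and implement call()
class TinyBlockTF(tf.keras.layers.Layer):
    def __init__(self, d_model=64):
        super().__init__()
        self.dense = tf.keras.layers.Dense(d_model, activation="relu")

    def call(self, x):
        return self.dense(x)

# PyTorch: subclass nn.Module and implement forward()
class TinyBlockTorch(nn.Module):
    def __init__(self, d_in=64, d_model=64):
        super().__init__()
        self.linear = nn.Linear(d_in, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

# TF eager mode places ops on CPU/GPU transparently; PyTorch needs explicit moves.
y_tf = TinyBlockTF()(tf.random.normal((2, 10, 64)))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
block = TinyBlockTorch().to(device)
y_pt = block(torch.randn(2, 10, 64, device=device))
```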
Side-by-side comparison of the EncoderLayer and Transformer classes in TensorFlow and PyTorch
1. EncoderLayer Comparison
| Aspect | TensorFlow (tf.keras.layers.Layer) | PyTorch (torch.nn.Module) |
| --- | --- | --- |
| Inheritance | class EncoderLayer(tf.keras.layers.Layer) | class EncoderLayer(nn.Module) |
| Attention Layer | Custom MultiHeadAttention layer with a call() method | Custom MultiHeadAttention layer with a forward() method |
| Feedforward (FFN) | tf.keras.Sequential([...]) | nn.Sequential(...) |
| Normalization | tf.keras.layers.LayerNormalization(epsilon=1e-6) | nn.LayerNorm(d_model, eps=1e-6) |
| Forward Call | def call(self, x, mask=None) | def forward(self, x, mask=None) |
| Dropout | tf.keras.layers.Dropout(0.2) | nn.Dropout(0.2) |
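As a concrete reference for the PyTorch column, here is a minimal EncoderLayer sketch under two assumptions: the built-in nn.MultiheadAttention stands in for the article's custom MultiHeadAttention class, and the hyperparameter defaults (d_model, num_heads, dff) are illustrative rather than taken from the original code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=128, num_heads=8, dff=512, dropout=0.2):
        super().__init__()
        # Built-in attention used here in place of the article's custom MultiHeadAttention
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dff),
            nn.ReLU(),
            nn.Linear(dff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model, eps=1e-6)
        self.norm2 = nn.LayerNorm(d_model, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sublayer with residual connection and layer normalization
        attn_out, _ = self.mha(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sublayer with residual connection
        ffn_out = self.ffn(x)
        return self.norm2(x + self.dropout(ffn_out))
```

The TensorFlow version differs mainly in naming: tf.keras.layers.Layer, call(), tf.keras.Sequential, and LayerNormalization(epsilon=1e-6).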
2. Transformer Class Comparison
| Aspect | TensorFlow (Transformer(tf.keras.Model)) | PyTorch (Transformer(nn.Module)) |
| --- | --- | --- |
| Embedding | tf.keras.layers.Embedding(...) | nn.Embedding(...) |
| Positional Encoding | x += self.positional_encoding[:, :tf.shape(x)[1], :] | x += self.positional_encoding[:, :x.size(1), :] |
| Stacking Encoder Layers | Python list: [EncoderLayer(...) for _ in range(n)] | nn.ModuleList([EncoderLayer(...) for _ in range(n)]) |
| Output Layer | tf.keras.layers.Dense(input_vocab_size) | nn.Linear(d_model, input_vocab_size) |
| Forward Pass | def call(self, inputs) | def forward(self, inputs) |
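Continuing the sketch, a minimal PyTorch Transformer (encoder-only, matching the table's rows) could look like the following; it reuses the EncoderLayer sketched above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, input_vocab_size, d_model=128, num_layers=2, max_len=50):
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, d_model)

        # Precompute the sinusoidal table and register it as a (non-trainable) buffer
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        i = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)     # (1, d_model)
        angles = pos / torch.pow(10000.0, (2 * (i // 2)) / d_model)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles[:, 0::2])
        pe[:, 1::2] = torch.cos(angles[:, 1::2])
        self.register_buffer("positional_encoding", pe.unsqueeze(0))    # (1, max_len, d_model)

        # nn.ModuleList keeps the stacked layers visible to .parameters() and .to(device)
        self.enc_layers = nn.ModuleList(
            [EncoderLayer(d_model) for _ in range(num_layers)]  # EncoderLayer from the sketch above
        )
        self.final_layer = nn.Linear(d_model, input_vocab_size)

    def forward(self, inputs, mask=None):
        x = self.embedding(inputs)                            # (batch, seq, d_model)
        x = x + self.positional_encoding[:, :x.size(1), :]    # slice to the actual sequence length
        for layer in self.enc_layers:
            x = layer(x, mask)
        return self.final_layer(x)                            # logits over the input vocabulary
```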
Summary of Key Differences
- Method Names: TensorFlow uses call() for forward logic; PyTorch uses forward().
- Layer Definition: TensorFlow layers are Keras-based; PyTorch builds on torch.nn.
- Weight Registration: PyTorch needs nn.ModuleList to track submodules (see the sketch after this list).
- Execution: TensorFlow is graph-based (eager by default now); PyTorch is natively eager.
Training Routine Comparison
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Dataset Handling | Python list, tokenized and padded, then converted with tf.convert_to_tensor() | Not shown in the current code, but torch.tensor(...) would be used for similar data |
| Loss Function | tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none") | Typically nn.CrossEntropyLoss(reduction='none') |
| Masking | Applies a mask on the loss with tf.math.logical_not and scales the losses accordingly | Would use (targets != pad_idx) boolean masks with PyTorch tensor operations |
| Gradient Calculation | with tf.GradientTape() as tape: for automatic differentiation | loss.backward() in the standard autograd flow |
| Optimizer | tf.keras.optimizers.Adam(...) | torch.optim.Adam(...) |
| Weight Update | tape.gradient(...) and optimizer.apply_gradients(...) | optimizer.step() after optimizer.zero_grad() |
| Epoch Logging | if epoch % 50 == 0: print(...) | Same idea, implemented manually |
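To make the PyTorch column concrete, here is a minimal, self-contained training loop following the same pattern: per-token cross-entropy with reduction='none', a padding mask, Adam, and periodic logging. The stand-in model, toy batch, and pad_idx = 0 are hypothetical; the article's actual Transformer and data would slot in their place.

```python
import torch
import torch.nn as nn

pad_idx = 0
vocab_size, d_model = 100, 32

# Stand-in model: embedding + linear head (the real Transformer would go here)
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(reduction="none")  # per-token losses so padding can be masked

# Toy batch: (batch, seq_len) token ids, with 0 used as padding
inputs = torch.randint(1, vocab_size, (4, 10))
targets = inputs.clone()
targets[:, 7:] = pad_idx  # pretend the tail of each sequence is padding

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(inputs)                                       # (batch, seq, vocab)
    loss_per_token = criterion(logits.transpose(1, 2), targets)  # (batch, seq)
    mask = (targets != pad_idx).float()                          # 1 for real tokens, 0 for padding
    loss = (loss_per_token * mask).sum() / mask.sum()            # average over non-pad tokens only
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```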
Inference Routine Comparison
| Aspect | TensorFlow | PyTorch |
| --- | --- | --- |
| Inputs | Sentence string → tokenized → padded → tensor | Same concept; would use torch.tensor(...) |
| Beam Search | Implemented as a custom function using tf.nn.softmax(logits) | Would require similar logic using torch.softmax(...) |
| Model Prediction | outputs = transformer(test_tensor) | with torch.no_grad(): outputs = transformer(inputs) |
| Output Processing | Beam search used to get the best sequence of token indices | Same idea would be used in PyTorch |
| Detokenization | Uses detokenize() to convert predicted tokens back to a string | Same function reused in both implementations |
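A matching PyTorch inference sketch is shown below; greedy decoding with torch.softmax stands in for the article's beam search, and the trained transformer model, tokenized input, and padding length are assumed rather than taken from the original code.

```python
import torch

def predict(transformer, token_ids, max_len=10, pad_idx=0):
    """Greedy stand-in for the article's beam search; returns the best token id per position."""
    ids = token_ids + [pad_idx] * (max_len - len(token_ids))   # pad to a fixed length
    inputs = torch.tensor([ids])                               # (1, max_len) batch of one
    transformer.eval()
    with torch.no_grad():                                      # disable gradient tracking at inference
        logits = transformer(inputs)                           # (1, max_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)
    return probs.argmax(dim=-1).squeeze(0).tolist()            # detokenize() would map these back to text
```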
Summary of Training & Inference Differences
| Feature | TensorFlow | PyTorch |
| --- | --- | --- |
| Autodiff | tf.GradientTape() | loss.backward() |
| Step control | optimizer.apply_gradients(...) | optimizer.step() |
| Execution | Implicit graph-based | Explicit control |
| Inference context | No special decorator needed | Use torch.no_grad() to disable gradients |