Exploring Parameter Reduction in ResNeXt Architectures

Normal Convolution/Grouped Convolution
ResNeXt introduces a simple and highly effective architectural innovation to convolutional neural networks: cardinality, the number of parallel paths or groups within a convolutional layer. Unlike traditional methods that focus solely on depth or width, ResNeXt leverages grouped convolutions to split computations across multiple branches, reducing parameter count while maintaining and often improving performance.
What’s Grouped Convolution?
Grouped convolutions are a variation of standard convolutions in which the input and output channels are split into separate groups, and convolutions are applied independently within each group. A normal convolution (groups = 1) lets every filter process all input channels. A grouped convolution (groups > 1) divides the input channels into G groups, each convolved with its own set of filters.
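To make this concrete, here is a minimal PyTorch sketch (with small, hypothetical channel sizes) showing that a grouped convolution is equivalent to splitting the input channels into groups, convolving each group with its own filters, and concatenating the results:
import torch
import torch.nn as nn
import torch.nn.functional as F

# A grouped convolution with 2 groups: 4 input channels, 6 output channels.
groups = 2
conv = nn.Conv2d(in_channels=4, out_channels=6, kernel_size=3, padding=1, groups=groups, bias=False)

x = torch.randn(1, 4, 8, 8)
y = conv(x)

# Reproduce it manually: split the input channels and the filters into groups,
# convolve each group independently, then concatenate the outputs.
x_chunks = x.chunk(groups, dim=1)            # 2 chunks of 4 / 2 = 2 channels each
w_chunks = conv.weight.chunk(groups, dim=0)  # 2 chunks of 6 / 2 = 3 filters each
y_manual = torch.cat(
    [F.conv2d(xc, wc, padding=1) for xc, wc in zip(x_chunks, w_chunks)],
    dim=1,
)

print(torch.allclose(y, y_manual, atol=1e-6))  # True: each group is convolved on its own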
What’s Cardinality?
Cardinality refers to the number of parallel paths or groups in a convolutional block. It represents a third dimension of network design, complementing depth and width:
depth: Number of layers in the architecture
width: Number of output channels per layer
cardinality: Number of independent paths per layer
Benefits of Higher Cardinality
Feature diversity: Different groups learn complementary feature representations
Regularization effect: Reduced parameter sharing acts as implicit regularization
Computational efficiency: Parallel groups enable efficient computation
Scalability: Easy to adjust network capacity by changing cardinality
Why Does ResNeXt Reduce the Number of Parameters?
In convolutional neural networks, gradients during backpropagation flow through the output feature maps, the kernels, and the input channels. Standard convolutions create dense connections between all input and output channels: every input channel contributes to every output channel through learnable weights, resulting in comprehensive feature mixing but a high parameter count.
Grouped convolutions, however, partition the input channels into separate groups, and each group undergoes an independent convolution operation. This architectural choice fundamentally changes how information flows through the network.
The parameter reduction occurs because, instead of each input channel connecting to all output channels, connections are limited to channels within the same group (see the weight-shape check after this list):
Reduced connectivity: Each input channel only affects output channels within its group
Independent processing: Groups learn specialized feature representations
Maintained expressiveness: Multiple groups capture diverse feature patterns
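A quick way to see this reduced connectivity is to compare the learnable weight shapes of a standard and a grouped convolution. This is a small sketch using the same 64-to-128 channel layer as the worked example further down:
import torch.nn as nn

# Standard conv: every filter sees all 64 input channels.
standard = nn.Conv2d(64, 128, kernel_size=3, groups=1, bias=False)
# Grouped conv: every filter only sees 64 / 32 = 2 input channels.
grouped = nn.Conv2d(64, 128, kernel_size=3, groups=32, bias=False)

print(standard.weight.shape)  # torch.Size([128, 64, 3, 3])
print(grouped.weight.shape)   # torch.Size([128, 2, 3, 3])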
Understanding Grouped Convolutions
Standard and Grouped Convolutions
Standard convolution: Parameters = C_in × C_out × K × K
Grouped convolution: Parameters = (C_in × C_out × K × K) / G
C_in: Input channels
C_out: Output channels
K: Kernel size
G: Number of groups
Parameter count in normal convolution layers:
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, groups=1)
total parameter count: 64 × 128 × 3 × 3 = 73728
With many input and output channels, this all-to-all connectivity produces a large number of parameters.
Parameter count in grouped convolution layers:
nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, groups=32)
The 64 input channels are divided into 32 groups, so each group takes 2 input channels (64 / 32) and produces 4 output channels (128 / 32).
total parameter count for every group: 2 × 4 × 3 × 3 = 72
since we have 32 groups, total parameter count: 72 × 32 = 2304
Each group is convolved independently, which leads to fewer connections and fewer parameters: a 96.9% reduction (from 73728 to 2304) while maintaining similar or higher accuracy.
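You can verify these counts directly in PyTorch (bias is disabled here so the numbers match the weight-only counts above):
import torch.nn as nn

normal = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, groups=1, bias=False)
grouped = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, groups=32, bias=False)

n_normal = sum(p.numel() for p in normal.parameters())    # 64 * 128 * 3 * 3 = 73728
n_grouped = sum(p.numel() for p in grouped.parameters())  # 2 * 4 * 3 * 3 * 32 = 2304

print(n_normal, n_grouped)       # 73728 2304
print(1 - n_grouped / n_normal)  # 0.96875, i.e. ~96.9% fewer parameters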
PyTorch already supports grouped convolution, so the only difference between the ResNet and ResNeXt implementations is the "groups" parameter defined in the second convolution layer of the BasicBlock and Bottleneck.
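As a rough sketch of what that looks like in code, here is a minimal ResNeXt-style Bottleneck block. Stride handling and the downsample shortcut are omitted, and the channel sizes are only an illustration in the spirit of a ResNeXt-50 (32x4d) stage:
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # The only change versus a ResNet bottleneck is `groups=cardinality`
    # on the middle 3x3 convolution.
    def __init__(self, in_channels, width, out_channels, cardinality=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, width, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))   # grouped 3x3 convolution
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)

# Example: a block with 256 input/output channels and cardinality 32.
block = Bottleneck(in_channels=256, width=128, out_channels=256, cardinality=32)
y = block(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56])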
Gradient Flow and Training Dynamics
Localized Gradient Updates
Grouped convolutions create more localized gradient flow patterns (see the sketch after this list):
Within-group updates: Gradients primarily affect parameters within the same group
Reduced interference: Different groups can learn independently without interference
Stable training: More stable gradient flow can lead to better convergence
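A small sketch (with hypothetical layer sizes) makes the within-group behaviour visible: if the loss only depends on one group's output channels, only that group's filters receive nonzero gradients.
import torch
import torch.nn as nn

# 8 input channels, 8 output channels, 4 groups -> 2 output channels per group.
conv = nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, padding=1, groups=4, bias=False)
x = torch.randn(1, 8, 16, 16)

out = conv(x)
loss = out[:, :2].sum()  # a loss that only touches the first group's outputs
loss.backward()

# Sum of absolute gradients per filter: nonzero for the first 2 filters,
# exactly zero for the filters belonging to the other groups.
print(conv.weight.grad.abs().sum(dim=(1, 2, 3)))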
Regularization Effects
The architectural constraints of grouped convolutions provide implicit regularization:
Reduced overfitting: Fewer parameters decrease the risk of memorizing training data
Better generalization: Forced specialization within groups improves feature quality
Robust representations: Multiple independent paths create more robust feature hierarchies