The largest neural networks today — 70B, 175B, even trillion-parameter models — cannot fit on a single GPU. Training them requires distributing the model itself across dozens or hundreds of devices. But "distributing the model" is not one technique — it is three distinct strategies (data parallelism, tensor parallelism, and pipeline parallelism), each with different communication patterns, memory requirements, and efficiency characteristics. Understanding the distinctions is essential for configuring training runs that scale efficiently rather than wastefully.
1. Data Parallelism: Replicate and Synchronize
Data parallelism is the oldest and simplest distributed training strategy, and it remains the dominant approach for models that fit on a single device. In data parallelism, the full model is replicated on each GPU, and each GPU processes a different mini-batch of training data. At the end of the backward pass, gradients from all replicas are aggregated (AllReduce) and each replica updates its weights identically, maintaining synchrony.
Data parallelism scales linearly for throughput: double the GPUs, double the data processed per second (holding batch size constant). The communication overhead is one AllReduce per optimizer step, with volume proportional to model size — manageable for models up to ~10B parameters on fast networks, but increasingly expensive as model size grows.
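The synchronization step can be sketched in plain Python, with lists standing in for gradient tensors and a mean across replicas standing in for the AllReduce (a hedged toy model, not a real NCCL call):

```python
# Toy simulation of synchronous data parallelism: each "replica" computes
# gradients on its own mini-batch, then an AllReduce (here, an element-wise
# mean across replicas) makes every replica's weight update identical.

def all_reduce_mean(grads_per_replica):
    """Average gradients element-wise across replicas (the AllReduce step)."""
    n = len(grads_per_replica)
    return [sum(g[i] for g in grads_per_replica) / n
            for i in range(len(grads_per_replica[0]))]

def sgd_step(weights, grads_per_replica, lr=0.1):
    """Every replica applies the same averaged gradient, staying in sync."""
    avg = all_reduce_mean(grads_per_replica)
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [1.0, 2.0]
# Two replicas saw different mini-batches, so their local gradients differ,
# but after the AllReduce both apply the identical averaged update.
grads = [[0.2, 0.4], [0.6, 0.8]]
new_weights = sgd_step(weights, grads)
```

Because every replica sees the same averaged gradient, the replicas never drift apart, which is exactly the synchrony property the paragraph above describes.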
ZeRO (Zero Redundancy Optimizer) extends data parallelism by partitioning the optimizer state, gradients, and model parameters across data-parallel ranks instead of replicating them. ZeRO Stage 1 shards optimizer states (Adam momentum and variance), reducing per-GPU optimizer memory from O(model_size) to O(model_size / DP_degree). ZeRO Stage 2 additionally shards gradients. ZeRO Stage 3 shards all three: optimizer states, gradients, and parameters. The trade-off is increased communication volume (at Stage 3, an AllGather of parameters before each forward pass and a ReduceScatter after the backward), but the memory savings often allow larger batch sizes or larger model sizes that more than compensate.
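The per-GPU savings can be estimated with the ZeRO paper's mixed-precision accounting (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of optimizer state per parameter: fp32 master weights, momentum, and variance). A rough back-of-envelope sketch, ignoring activations and fragmentation:

```python
def zero_memory_per_gpu_gb(num_params, dp_degree, stage):
    """Rough per-GPU model-state memory (GB) for mixed-precision Adam,
    using the ZeRO paper's accounting: 2 B fp16 params + 2 B fp16 grads
    + 12 B optimizer state (fp32 master params, momentum, variance).
    Activation memory is deliberately ignored in this estimate."""
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:            # Stage 1: shard optimizer states
        optim /= dp_degree
    if stage >= 2:            # Stage 2: also shard gradients
        grads /= dp_degree
    if stage >= 3:            # Stage 3: also shard parameters
        params /= dp_degree
    return (params + grads + optim) / 1e9

# A 7B-parameter model on 64 data-parallel ranks:
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_per_gpu_gb(7e9, 64, stage):.1f} GB")
```

For a 7B model on 64 ranks this estimate drops from 112 GB of replicated model state (plain data parallelism) to under 2 GB per GPU at Stage 3, which is why ZeRO-3 makes multi-billion-parameter models fit at all.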
2. Tensor Parallelism: Split Layers Across Devices
When a single model layer is too large to fit on one device, or when we want to reduce the memory footprint of individual layers beyond what ZeRO-3 can achieve, tensor parallelism splits the computations within each layer across multiple devices.
The canonical example is the MLP (multi-layer perceptron) block in a transformer. A standard transformer MLP has two linear projections: an up-projection (d_model → 4×d_model) followed by a down-projection (4×d_model → d_model). With tensor parallelism degree N, the up-projection weight matrix is split column-wise across N devices (each device holds 4×d_model / N columns), and the down-projection is split row-wise correspondingly. Each device computes its portion of the result independently; an AllReduce at the end of the block aggregates the final output.
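The column/row split can be verified end to end in a few lines of plain Python, with nested lists as matrices and a sum over partial outputs standing in for the AllReduce (a hedged sketch; the activation between the two projections is omitted for brevity):

```python
# Tensor-parallel MLP sketch: W1 (up-projection) split column-wise,
# W2 (down-projection) split row-wise. Each "device" computes a partial
# output; summing the partials (the AllReduce) equals the unsplit result.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split_cols(m, n):
    """n vertical slices of a matrix (column-wise split of W1)."""
    w = len(m[0]) // n
    return [[row[i * w:(i + 1) * w] for row in m] for i in range(n)]

def split_rows(m, n):
    """n horizontal slices of a matrix (row-wise split of W2)."""
    h = len(m) // n
    return [m[i * h:(i + 1) * h] for i in range(n)]

def tp_mlp(x, w1, w2, n):
    partials = []
    for w1_shard, w2_shard in zip(split_cols(w1, n), split_rows(w2, n)):
        h = matmul(x, w1_shard)               # local up-projection
        partials.append(matmul(h, w2_shard))  # local down-projection
    # AllReduce: element-wise sum of the per-device partial outputs.
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]
```

The key point the sketch demonstrates: because W1 is split by columns and W2 by matching rows, no communication is needed between the two projections; a single sum at the end of the block suffices.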
Self-attention layers parallelize similarly: attention heads are distributed across devices. With N-way tensor parallelism and H attention heads, each device handles H/N heads. This requires H to be divisible by N — a constraint on model architecture design when targeting specific tensor parallel degrees.
The communication pattern for tensor parallelism is frequent small AllReduces (one per layer boundary) rather than one large AllReduce per step. This is a crucial property: frequent small collectives benefit from the highest-bandwidth, lowest-latency interconnect available — specifically, NVLink within a node. For this reason, tensor parallelism is almost always bounded to within a single node. Multi-node tensor parallelism is possible but rarely cost-effective because inter-node bandwidth (e.g., InfiniBand at 200 Gb/s, roughly 25 GB/s) is more than an order of magnitude lower than within-node NVLink bandwidth (600 GB/s for NVLink 3.0).
3. Pipeline Parallelism: Split Layers Across Stages
Pipeline parallelism takes a different approach: rather than splitting individual layers, it splits the sequence of model layers. A 96-layer transformer with 4-stage pipeline parallelism assigns layers 1–24 to Stage 0, 25–48 to Stage 1, 49–72 to Stage 2, and 73–96 to Stage 3. Data flows forward through the pipeline (activations from Stage 0 to Stage 1 to Stage 2 to Stage 3), then gradients flow backward.
The communication pattern is fundamentally different from data and tensor parallelism: point-to-point activation transfers between adjacent pipeline stages, rather than collective AllReduce across all devices. The volume per communication is proportional to activation size (batch_size × sequence_length × d_model), not model size. This is much smaller and tolerates higher inter-stage latency — meaning pipeline parallelism can efficiently cross node boundaries where tensor parallelism cannot.
The key challenge with pipeline parallelism is the "pipeline bubble." In naive pipeline execution (GPipe), each stage processes its micro-batches, then sits idle waiting for the backward pass to catch up. If there are K pipeline stages, the pipeline bubble fraction is (K-1)/(M+K-1) where M is the number of micro-batches per optimizer step. Keeping the bubble fraction below 5% requires M much larger than K: for a 4-stage pipeline, at least 58 micro-batches per step, since 3/(58+3) ≈ 4.9%.
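The bubble arithmetic is simple enough to check directly; this sketch just evaluates the (K-1)/(M+K-1) formula above:

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a naive (GPipe-style) pipeline:
    (K - 1) / (M + K - 1) for K stages and M micro-batches per step."""
    k, m = stages, microbatches
    return (k - 1) / (m + k - 1)

# For a 4-stage pipeline, the bubble shrinks as micro-batches increase:
for m in (4, 16, 58):
    print(f"M={m}: bubble = {bubble_fraction(4, m):.1%}")
```

With M = K = 4 the pipeline idles about 43% of the time; only at M = 58 does the bubble finally dip below 5%, which is why pipeline parallelism demands large per-step micro-batch counts.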
The 1F1B (one-forward-one-backward) schedule dramatically improves on GPipe by interleaving forward and backward passes: as soon as the last stage finishes the forward pass of micro-batch 1, it immediately runs that micro-batch's backward pass, while earlier stages continue forward passes on subsequent micro-batches. This caps each stage's in-flight activations at roughly the pipeline depth K, rather than the full micro-batch count M as in GPipe, sharply reducing memory while maintaining similar throughput. The interleaved 1F1B variant (Megatron-LM's preferred schedule) further reduces bubble fraction by assigning each stage several non-contiguous groups of layers, at the cost of more complex communication patterns.
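The per-stage operation order of plain (non-interleaved) 1F1B can be generated with a short sketch: each stage runs a warm-up of forward passes, then strictly alternates one forward and one backward, then drains the remaining backwards. This is a simplified model of the schedule, ignoring timing and communication:

```python
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Operation order for one stage under plain 1F1B.
    Returns a list of ("F", mb) / ("B", mb) pairs."""
    # Warm-up: stage s runs (num_stages - stage - 1) forwards before
    # its first backward can possibly arrive from later stages.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("F", m) for m in range(warmup)]
    f, b = warmup, 0
    # Steady state: alternate one forward, one backward.
    while f < num_microbatches:
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    # Cool-down: drain the remaining backwards.
    while b < num_microbatches:
        ops.append(("B", b)); b += 1
    return ops
```

Walking the schedule for stage 0 of a 4-stage pipeline shows the memory claim concretely: the count of forwards minus backwards (micro-batches whose activations are still held) never exceeds the pipeline depth, no matter how large M grows.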
4. Combining All Three: 3D Parallelism
3D parallelism — simultaneously applying data, tensor, and pipeline parallelism — is the standard configuration for training very large models (70B+) on large clusters. Megatron-LM's implementation organizes N total GPUs into a 3D grid: TP × PP × DP = N.
A representative configuration: training a 70B model on 256 A100s (8 per node, 32 nodes). Set TP=8 (spanning the 8 GPUs within each node, over NVLink), PP=4 (each pipeline replica spans 4 nodes, one stage per node), and DP=8 (8 independent data-parallel replicas of the full TP+PP job). This uses 8 × 4 × 8 = 256 GPUs exactly.
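One common way to lay out such a grid is to make TP the fastest-varying axis, then PP, then DP, so that a rank's TP peers are consecutive GPUs on the same node. A sketch of that rank-to-coordinate mapping (the axis ordering is an assumption here; frameworks differ in their conventions):

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (tp, pp, dp) grid coordinates, with TP
    fastest-varying so each rank's TP peers are consecutive GPUs
    (i.e., on the same NVLink-connected node)."""
    assert 0 <= rank < tp * pp * dp
    return (rank % tp, (rank // tp) % pp, rank // (tp * pp))

tp, pp, dp = 8, 4, 8                    # the 256-GPU example above
assert tp * pp * dp == 256
# Ranks 0-7 share tp coordinates 0-7 at (pp=0, dp=0): one TP group, one node.
print([rank_to_coords(r, tp, pp, dp) for r in range(3)])
```

With this ordering, the heavy TP collectives stay among consecutive ranks (same node), while PP and DP communication naturally lands on the inter-node network, matching the hierarchy described next.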
The communication hierarchy maps to the network hierarchy: TP communication (AllReduce, AllGather) stays on NVLink; PP communication (activation send/receive) crosses node boundaries on InfiniBand; DP gradient AllReduce is the outer-most collective, crossing the full cluster. By aligning communication patterns with available bandwidth at each network tier, 3D parallelism achieves near-linear scaling efficiency that would be impossible with any single parallelism strategy at this scale.
Configuration sensitivity is real: on a cluster with 16 GPUs per node (80GB each), changing TP from 4 to 8 might allow a larger model shard per pipeline stage, but with total GPU count fixed it halves the DP degree, and with it per-step throughput. The optimal 3D parallelism configuration is model-size, cluster-topology, and batch-size dependent. Automated configuration search tools — and platforms like Deepiix that encode these heuristics — significantly reduce the manual tuning burden.
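The first step of such a search is just enumerating the legal factorizations. A hedged sketch of that step (the candidate TP degrees and constraints are illustrative heuristics, not a complete cost model):

```python
def valid_3d_configs(total_gpus, gpus_per_node, num_layers):
    """Enumerate (TP, PP, DP) factorizations of the cluster, keeping TP
    within a node and PP no deeper than the layer count. This only
    filters for feasibility; a real search would also score each
    candidate on memory fit and predicted throughput."""
    configs = []
    for tp in (1, 2, 4, 8, 16):          # power-of-two TP degrees (heuristic)
        if tp > gpus_per_node or total_gpus % tp:
            continue
        rest = total_gpus // tp
        for pp in range(1, min(rest, num_layers) + 1):
            if rest % pp == 0:
                configs.append((tp, pp, rest // pp))
    return configs

# 256 GPUs, 8 per node, a 96-layer model:
for cfg in valid_3d_configs(256, 8, 96)[:5]:
    print(cfg)
```

Even this crude filter shows why manual tuning is painful: dozens of feasible grids survive, and choosing among them requires the memory and throughput modeling that automated tools provide.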
5. Expert Parallelism: MoE Models
Mixture-of-Experts (MoE) models introduce a fourth parallelism dimension. In MoE transformers, each transformer block contains many "expert" sub-networks (FFN variants), and a routing function selects a small subset of experts to process each token. Since most experts are inactive for any given token, MoE models achieve larger parameter counts without proportional compute increases.
Expert parallelism distributes the expert sub-networks across devices. For a model with 64 experts and 8-way expert parallelism, each device hosts 8 experts. The routing function sends each token to whichever devices hold its assigned experts — an all-to-all exchange that becomes a bottleneck at large expert counts. GShard, Switch Transformer, and GPT-MoE implementations handle this with careful load balancing to ensure all experts receive roughly equal token volume, preventing routing hot-spots that degrade throughput.
Key Takeaways
- Data parallelism scales throughput; ZeRO-3 extends it to large models. The right starting point for models that fit in aggregate GPU memory across your cluster.
- Tensor parallelism should be confined within-node. NVLink bandwidth supports frequent small AllReduces; InfiniBand typically cannot.
- Pipeline parallelism crosses node boundaries efficiently. Low-volume activation transfers tolerate inter-node latency better than gradient AllReduce.
- 3D parallelism is required for 70B+ models at scale. Single-strategy parallelism cannot match the communication efficiency of the hierarchical 3D approach.
- Pipeline bubble fraction determines efficiency. Sufficient micro-batches per step (M >> K) and 1F1B scheduling are essential for practical pipeline parallelism.
Conclusion
Model parallelism is not a single choice — it is a configuration space defined by three orthogonal axes, each with distinct tradeoffs and network topology requirements. The teams that train the largest models most efficiently are those that have internalized these distinctions deeply enough to configure parallelism for their specific model architecture and cluster topology, rather than accepting default settings that may leave 30–50% of hardware performance unused.
Deepiix's platform provides automated parallelism configuration that analyzes your model and cluster, recommends an optimal TP × PP × DP configuration, and validates it against profiling data before committing to a full training run. Talk to us about optimizing your parallelism strategy.