Reliability

Smart Checkpointing: Never Lose a Training Run Again

By Mei Lin — June 5, 2025 — 11 min read


Losing 47 hours of a 48-hour training run to a node failure is one of the most demoralizing experiences in ML engineering. I have been there — at ByteDance, managing a platform that ran over 10,000 training jobs per day, this kind of failure happened multiple times per week at scale. The question is never "will something fail?" but "how quickly can you recover?"

The answer, in our case, went from "restart from scratch" to "resume within 3 minutes" by building a checkpoint system designed from first principles for the specific demands of large model training. This post covers the design decisions that make that possible.

Why Standard Checkpointing Is Broken

The conventional approach to training checkpointing is straightforward: every N steps, serialize the model weights and optimizer state to disk. The problem is that "every N steps" involves a fundamental tradeoff between checkpoint frequency and overhead.

For a 13B-parameter model trained with the Adam optimizer, a full checkpoint requires saving approximately (assuming a mixed-precision setup with FP16 weights and FP32 optimizer moments, which is what the 130GB figure below implies):

- Model weights: 13B parameters × 2 bytes (FP16) ≈ 26GB
- Adam first moment (m): 13B × 4 bytes (FP32) ≈ 52GB
- Adam second moment (v): 13B × 4 bytes (FP32) ≈ 52GB

Total: roughly 130GB per checkpoint.
Writing 130GB to NVMe SSDs takes 30–60 seconds. Writing to networked storage (NFS, Lustre, S3) takes 3–8 minutes. If you checkpoint every 500 steps and a step takes 1.2 seconds, you are checkpointing every 10 minutes, and on networked storage 30–80% of that interval goes to checkpoint I/O.

The usual "solution" is to checkpoint less frequently, every 2,000 or 5,000 steps. But this means that when a failure occurs, you lose up to 100 minutes of training. On a 256-GPU cluster at $4/GPU-hour, that is roughly $1,700 of wasted compute per incident. Across a fleet running thousands of such jobs, organizations that train large models regularly accept millions of dollars in annual losses this way.
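The arithmetic behind these figures is worth making explicit. A quick back-of-the-envelope sketch, using only the numbers quoted above:

```python
# Back-of-the-envelope numbers from the text: 1.2 s steps, a checkpoint
# every 500 steps, and a 3-8 minute write to networked storage.
step_time_s = 1.2
steps_per_checkpoint = 500
interval_s = step_time_s * steps_per_checkpoint          # 600 s = 10 minutes

for write_s in (180, 480):                               # 3 and 8 minutes
    overhead = write_s / interval_s
    print(f"write {write_s / 60:.0f} min -> {overhead:.0%} of the interval on I/O")

# Cost of losing up to 100 minutes on a 256-GPU cluster at $4/GPU-hour.
gpus, rate_per_gpu_hour, lost_minutes = 256, 4.0, 100
lost_cost = gpus * rate_per_gpu_hour * lost_minutes / 60
print(f"~${lost_cost:,.0f} of wasted compute per incident")
```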

Incremental Delta Checkpointing

The core insight behind Deepiix's checkpoint system is that consecutive checkpoints are highly redundant. Between step 5000 and step 5100, only a small fraction of model weights change significantly. The optimizer moments change more broadly, but even they carry substantial redundancy from one checkpoint to the next.

Incremental checkpointing saves only the delta — the changes since the last checkpoint — rather than the full state. The system maintains a base snapshot and a sequence of compact delta files. Recovery involves applying the deltas to the base to reconstruct any point in the training history.

In practice, deltas between checkpoints 100 steps apart are typically 3–8% the size of a full checkpoint, depending on the model architecture and optimizer. This means a model that previously required 130GB per checkpoint might now require 4–10GB per delta — a 13–32x reduction in checkpoint I/O.
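A minimal sketch of the idea in plain Python, with toy byte buffers standing in for serialized tensors (the helper names `make_delta` and `apply_delta` are illustrative, not Deepiix's API):

```python
import zlib

def make_delta(base: bytes, current: bytes) -> bytes:
    """XOR the new state against the base; unchanged bytes become zeros,
    which compress extremely well."""
    xored = bytes(a ^ b for a, b in zip(base, current))
    return zlib.compress(xored)

def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Reverse the XOR to reconstruct the checkpointed state losslessly."""
    xored = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(base, xored))

# Toy "checkpoint": 1 MB of state in which roughly 5% of bytes changed.
base = bytes(1_000_000)                       # base snapshot (all zeros here)
current = bytearray(base)
for i in range(0, len(current), 20):          # touch every 20th byte (~5%)
    current[i] = 0xFF
current = bytes(current)

delta = make_delta(base, current)
print(f"delta is {len(delta) / len(current):.1%} of a full checkpoint")
assert apply_delta(base, delta) == current    # lossless reconstruction
```

Real tensors change by small float increments rather than flipping whole bytes, so production delta encoders work on tensor values with thresholds and sparse indices, but the shape of the pipeline (diff, compress, replay) is the same.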

Tensor-Level Compression

Beyond delta encoding, Deepiix applies tensor-level compression tailored to the statistical properties of neural network weight tensors and optimizer moments, among them the facts that weight values cluster tightly around zero, that the exponent bytes of floating-point tensors carry far less entropy than the mantissa bytes, and that optimizer second moments drift slowly between checkpoints. These observations inform the compression strategy.

Combining delta encoding with tensor-specific compression, Deepiix achieves a 70% overall reduction in checkpoint storage versus uncompressed full checkpoints: down from 130GB to approximately 39GB for a 13B-parameter model with Adam. The compression and decompression operations run on CPU in parallel with GPU training, adding negligible overhead to step time.

Asynchronous Checkpoint I/O

Even at 39GB, serializing a checkpoint synchronously would interrupt training for 1–3 minutes. Deepiix uses an asynchronous checkpoint writer that serializes tensors to a pinned memory buffer on the CPU while the GPU continues training. The background writer process then flushes the buffer to disk without blocking the main training process.

The asynchronous approach requires maintaining a consistent snapshot of the training state at the checkpoint moment — if the model updates weight tensors during serialization, the checkpoint could contain a partially-updated, inconsistent state. Deepiix handles this with copy-on-write semantics: when a checkpoint is triggered, tensor pages that will be modified by the next training step are duplicated before modification. This is implemented at the CUDA memory allocator level and adds approximately 0.3–0.8% overhead per step during checkpointing.
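A simplified sketch of the stage-then-flush pattern, with plain Python threads and an explicit snapshot copy standing in for the pinned-buffer and allocator-level copy-on-write machinery (`AsyncCheckpointWriter` is a hypothetical name):

```python
import os
import tempfile
import threading

class AsyncCheckpointWriter:
    """Copy the state into a staging buffer, then flush it to disk on a
    background thread so the training loop never blocks on I/O."""

    def __init__(self):
        self._thread = None

    def checkpoint(self, state: bytes, path: str):
        staged = bytes(state)                 # snapshot before training mutates it
        self.wait()                           # at most one in-flight write
        self._thread = threading.Thread(target=self._flush, args=(staged, path))
        self._thread.start()

    def _flush(self, staged: bytes, path: str):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(staged)
            f.flush()
            os.fsync(f.fileno())              # durable before rename
        os.replace(tmp, path)                 # atomic publish of the checkpoint

    def wait(self):
        if self._thread is not None:
            self._thread.join()

# "Training" keeps mutating state while the writer flushes the old snapshot.
state = bytearray(b"step-100 weights")
writer = AsyncCheckpointWriter()
with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.bin")
    writer.checkpoint(bytes(state), ckpt)
    state[:8] = b"step-101"                   # training continues immediately
    writer.wait()
    with open(ckpt, "rb") as f:
        assert f.read() == b"step-100 weights"
```

The write-to-temp-then-rename step also gives crash consistency: a failure mid-write can never leave a half-written file at the checkpoint path.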

Automatic Recovery

Checkpointing is only half the equation — the other half is recovery. When a node fails or a job is preempted, the Deepiix platform automatically detects the failure event (via the scheduler's health monitoring), identifies the latest valid checkpoint from the job's checkpoint log, and schedules a recovery job on available nodes.

Recovery jobs load the checkpoint, reconstruct the full optimizer state from the delta chain, and resume training from the saved step. For delta chains longer than 20 snapshots, the system automatically consolidates them into a new base snapshot to keep recovery time bounded.
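The replay-and-consolidate logic can be sketched as follows, reusing toy XOR deltas over byte buffers (all names illustrative):

```python
import zlib

def make_delta(prev: bytes, curr: bytes) -> bytes:
    """Compressed XOR diff between two consecutive serialized states."""
    return zlib.compress(bytes(a ^ b for a, b in zip(prev, curr)))

def apply_delta(prev: bytes, delta: bytes) -> bytes:
    """Invert make_delta to step the state forward by one checkpoint."""
    return bytes(a ^ b for a, b in zip(prev, zlib.decompress(delta)))

def recover(base: bytes, chain: list) -> bytes:
    """Replay every delta in order onto the base snapshot."""
    state = base
    for delta in chain:
        state = apply_delta(state, delta)
    return state

def maybe_consolidate(base: bytes, chain: list, max_chain: int = 20):
    """Once the chain exceeds max_chain deltas, fold it into a fresh base
    snapshot so recovery time stays bounded."""
    if len(chain) <= max_chain:
        return base, chain
    return recover(base, chain), []

# Build a 25-delta chain, recover the latest state, then consolidate.
states = [bytes([i]) * 64 for i in range(26)]
base = states[0]
chain = [make_delta(a, b) for a, b in zip(states, states[1:])]
assert recover(base, chain) == states[-1]
base, chain = maybe_consolidate(base, chain)  # 25 > 20, so a new base is cut
assert base == states[-1] and chain == []
```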

End-to-end, from failure detection to resumed training, typically takes 2–4 minutes. Compare this to the 1–8 hours of compute lost in a naive restart-from-scratch workflow, and the ROI of a proper checkpoint system becomes clear: for any training run longer than 2 hours on multi-GPU hardware, automatic recovery pays for itself on the first failure.

Implications for Spot Instance Training

Fast, low-overhead checkpointing unlocks a cost optimization that many teams leave on the table: training on spot or preemptible instances. Cloud spot GPU instances are typically 60–70% cheaper than on-demand, but their interruption rate (1–5% per hour for A100 spots) has historically made them unsuitable for long training runs.

With Deepiix's checkpoint system, a job interrupted by a spot preemption resumes within 3 minutes from the last checkpoint — typically losing no more than 100 steps of training. The expected cost of interruptions (lost compute + recovery overhead) is approximately 0.5–2% of total training cost, well below the 60–70% savings from spot pricing.
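The expected-overhead claim follows from a short calculation. The parameters below (8 spot instances, a 1%-per-hour preemption rate each, a 48-hour run) are illustrative assumptions, not measurements:

```python
# Illustrative spot-training parameters; any instance preemption
# interrupts the whole job, so job-level risk scales with instance count.
instances, per_instance_rate = 8, 0.01        # 1%/hour preemption per instance
hours = 48                                    # length of the training run
step_time_s, steps_lost, recovery_min = 1.2, 100, 3

job_interrupts = instances * per_instance_rate * hours
lost_min = job_interrupts * (steps_lost * step_time_s / 60 + recovery_min)
overhead = lost_min / (hours * 60)
print(f"{job_interrupts:.1f} expected interruptions -> {overhead:.2%} overhead")
```

Under these assumptions the interruption overhead lands well under 1% of run time, comfortably inside the 0.5–2% range quoted above and far below the 60–70% spot discount.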

We now recommend spot instances as the default choice for all non-interactive training jobs under the Deepiix platform, with on-demand capacity reserved for production inference and interactive debugging sessions.
