
[Bugfix] Validate fp16.loss_scale is finite in DeepSpeedFP16Config#7892

Open
s-zx wants to merge 3 commits into deepspeedai:master from
s-zx:fix/7852-fp16-loss-scale-validation

Conversation


@s-zx s-zx commented Mar 8, 2026

Summary

DeepSpeedFP16Config.loss_scale accepted float('inf') with no validation, so the engine initialised successfully. During training, every gradient then silently became NaN: the static loss scaler multiplied each loss value by infinity. Other config fields (e.g. stage3_max_live_parameters) already use Pydantic ge= constraints and correctly raise ValidationError for invalid inputs, but loss_scale was unguarded.

Root Cause

loss_scale was declared as a plain float = 0 with no constraint. Pydantic v2 accepts inf for bare float fields.

Fix

  1. Change the field declaration to Field(0, ge=0) — rejects negative values at the Pydantic level.
  2. Add a field_validator("loss_scale") that calls math.isfinite(v) — rejects inf and nan with a clear error message.
  3. The semantic where loss_scale=0 selects dynamic loss scaling is preserved unchanged.
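Taken together, the change can be sketched as a minimal standalone Pydantic model. Note this is an illustration only: the real `DeepSpeedFP16Config` lives in DeepSpeed's config module and inherits DeepSpeed's own config base class, and `FP16Config` here is a hypothetical stand-in.

```python
import math

from pydantic import BaseModel, Field, field_validator


class FP16Config(BaseModel):
    # Default 0 keeps the existing semantics: loss_scale=0 selects
    # dynamic loss scaling. ge=0 rejects negatives at the Pydantic level.
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")
    @classmethod
    def _check_finite(cls, v: float) -> float:
        # Pydantic v2 accepts inf/nan for bare float fields, so they
        # must be rejected explicitly with a clear error message.
        if not math.isfinite(v):
            raise ValueError(f"loss_scale must be finite, got {v}")
        return v
```

With this in place, `FP16Config(loss_scale=0)` still selects dynamic loss scaling, while `inf`, `nan`, and negative values all raise `ValidationError`.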

Tests

Added to tests/unit/runtime/test_ds_config_model.py:

  • test_fp16_loss_scale_valid — parametrized over [0, 1.0, 128.0, 65536.0]
  • test_fp16_loss_scale_rejects_non_finite — parametrized over [inf, -inf, nan]
  • test_fp16_loss_scale_rejects_negative — checks -1.0

Fixes #7852

`DeepSpeedFP16Config.loss_scale` accepted `float("inf")` without any
validation, silently allowing the engine to initialise.  During
training this caused all gradients to become NaN because the static
loss scaler multiplied every loss by infinity.  Other config fields
such as `stage3_max_live_parameters` already used Pydantic `ge=`
constraints and would correctly raise `ValidationError` for invalid
inputs.

Fix: add `Field(0, ge=0)` so that negative values are rejected at the
Pydantic level, and add a `field_validator` that calls `math.isfinite`
so that `inf` and `nan` are also rejected with a clear error message.
The existing semantic where `loss_scale=0` selects dynamic loss scaling
is preserved unchanged.

Add unit tests to `tests/unit/runtime/test_ds_config_model.py` covering
the valid range (0, 1.0, 128.0, 65536.0), the previously accepted
non-finite inputs (`inf`, `-inf`, `nan`), and the negative value `-1.0`.

Fixes deepspeedai#7852

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d7af93551


@pytest.mark.parametrize("loss_scale", [float("inf"), float("-inf"), float("nan")])
def test_fp16_loss_scale_rejects_non_finite(loss_scale):
    """Non-finite loss_scale values must raise ValidationError (issue #7852)."""
    with pytest.raises(ValidationError, match="finite"):


P1: Relax non-finite error match for constrained loss_scale

This assertion is too strict for -inf/nan: DeepSpeedFP16Config.loss_scale now has Field(..., ge=0), and Pydantic runs field constraints before the default @field_validator (after-mode), so some non-finite inputs can raise a ge validation message instead of one containing "finite". In that case this test fails even though the config correctly rejects the invalid value, causing a false negative in CI for the new coverage.
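The ordering the review describes can be demonstrated with a throwaway model (a hypothetical stand-in for `DeepSpeedFP16Config`, not the actual class): `-inf` fails the `ge=0` constraint before the after-mode validator runs, while `inf` passes `ge=0` and is caught by the validator, so the two produce different error messages.

```python
import math

from pydantic import BaseModel, Field, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")  # after-mode: runs only if ge=0 passed
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("loss_scale must be finite")
        return v


try:
    Cfg(loss_scale=float("-inf"))
except ValidationError as e:
    # ge=0 failed first, so the message mentions the bound, not "finite"
    print(e.errors()[0]["msg"])

try:
    Cfg(loss_scale=float("inf"))
except ValidationError as e:
    # inf passes ge=0, so this time the custom validator fired
    print(e.errors()[0]["msg"])
```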


s-zx (Author):

Good catch — fixed. Removed the match="finite" argument from the parametrized test. Pydantic runs the ge=0 field constraint before the after-mode field_validator, so for -inf/nan the error message may be "greater than or equal to 0" rather than our "finite" message. The assertion now only checks that a ValidationError is raised, which is the important invariant.

s-zx (Author):

Addressed — relaxed the test assertion to accept either a message containing "finite" (from our field_validator) or "greater than or equal" (from Pydantic's ge=0 constraint), since Pydantic may run field constraints before the validator for -inf/nan.
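In pytest terms the flexible variant is `pytest.raises(ValidationError, match=r"finite|greater than or equal")`, since `match` uses `re.search`. A dependency-light sketch of the same check, against a hypothetical stand-in model rather than the real `DeepSpeedFP16Config`:

```python
import math
import re

from pydantic import BaseModel, Field, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("loss_scale must be finite")
        return v


# Accept either error source: the custom validator ("finite") or the
# ge=0 constraint ("greater than or equal").
PATTERN = re.compile(r"finite|greater than or equal")

for bad in (float("inf"), float("-inf"), float("nan")):
    try:
        Cfg(loss_scale=bad)
        raise AssertionError(f"{bad} was not rejected")
    except ValidationError as e:
        assert PATTERN.search(str(e)), str(e)
```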

…constraint order

Pydantic runs field constraints (ge=0) before the after-mode
field_validator, so for -inf and nan the ValidationError may contain
"greater than or equal to" rather than "finite".  The assertion
`match="finite"` was therefore too strict and would fail in CI even
though the invalid value is correctly rejected.

Drop the match argument; validating that a ValidationError is raised
is sufficient to confirm the fix.
Collaborator

@tohtana tohtana left a comment


Hi @s-zx,
This has some overlap with #7889. Can we discuss how to consolidate in threads of #7889?
#7889 catches True and False as invalid values, but (from my understanding) it currently doesn't raise Pydantic validation errors.
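One relevant Pydantic v2 detail on that last point: a plain `ValueError` raised inside a `field_validator` is wrapped into a `ValidationError` by the model, so a `math.isfinite` check only bypasses Pydantic's error type if it runs outside the model. A minimal illustration with hypothetical names:

```python
import math

from pydantic import BaseModel, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = 0

    @field_validator("loss_scale")
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            # This ValueError surfaces to callers as a ValidationError
            raise ValueError("loss_scale must be finite")
        return v


try:
    Cfg(loss_scale=float("inf"))
except ValidationError as e:
    # the ValueError was wrapped; its error type is "value_error"
    assert e.errors()[0]["type"] == "value_error"
```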

…ssage

Pydantic may run the ge=0 field constraint before the field_validator,
producing a "greater than or equal" message for -inf/nan instead of our
"finite" message. Use a flexible match to accept either.

Addresses Codex review feedback.

tohtana commented Mar 16, 2026

Hi @s-zx
#7889 now checks math.isfinite(v) and raises ValueError. Do we still need this change?


Successfully merging this pull request may close these issues.

[Bug] Missing input validation for fp16 loss_scale accepts float('inf')
