
[Bugfix] Validate fp16.loss_scale is finite in DeepSpeedFP16Config#7892

Open
s-zx wants to merge 3 commits into deepspeedai:master from
s-zx:fix/7852-fp16-loss-scale-validation

Conversation


@s-zx s-zx commented Mar 8, 2026

Summary

DeepSpeedFP16Config.loss_scale accepted float('inf') with no validation, so the engine initialised successfully. During training, every gradient then silently became NaN: the static loss scaler multiplied each loss value by infinity. Other config fields (e.g. stage3_max_live_parameters) already use Pydantic ge= constraints and correctly raise ValidationError for invalid inputs, but loss_scale was unguarded.

Root Cause

loss_scale was declared as a plain float = 0 with no constraint. Pydantic v2 accepts inf for bare float fields.

Fix

  1. Change the field declaration to Field(0, ge=0) — rejects negative values at the Pydantic level.
  2. Add a field_validator("loss_scale") that calls math.isfinite(v) — rejects inf and nan with a clear error message.
  3. The semantic where loss_scale=0 selects dynamic loss scaling is preserved unchanged.
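Taken together, the change can be sketched as a minimal standalone Pydantic model. Note this is an illustration only: the real `DeepSpeedFP16Config` lives in DeepSpeed's config module and inherits DeepSpeed's own config base class, and `FP16Config` here is a hypothetical stand-in.

```python
import math

from pydantic import BaseModel, Field, field_validator


class FP16Config(BaseModel):
    # Default 0 keeps the existing semantics: loss_scale=0 selects
    # dynamic loss scaling. ge=0 rejects negatives at the Pydantic level.
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")
    @classmethod
    def _check_finite(cls, v: float) -> float:
        # Pydantic v2 accepts inf/nan for bare float fields, so they
        # must be rejected explicitly with a clear error message.
        if not math.isfinite(v):
            raise ValueError(f"loss_scale must be finite, got {v}")
        return v
```

With this in place, `FP16Config(loss_scale=0)` still selects dynamic loss scaling, while `inf`, `nan`, and negative values all raise `ValidationError`.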

Tests

Added to tests/unit/runtime/test_ds_config_model.py:

  • test_fp16_loss_scale_valid — parametrized over [0, 1.0, 128.0, 65536.0]
  • test_fp16_loss_scale_rejects_non_finite — parametrized over [inf, -inf, nan]
  • test_fp16_loss_scale_rejects_negative — checks -1.0

Fixes #7852

`DeepSpeedFP16Config.loss_scale` accepted `float("inf")` without any
validation, silently allowing the engine to initialise.  During
training this caused all gradients to become NaN because the static
loss scaler multiplied every loss by infinity.  Other config fields
such as `stage3_max_live_parameters` already used Pydantic `ge=`
constraints and would correctly raise `ValidationError` for invalid
inputs.

Fix: add `Field(0, ge=0)` so that negative values are rejected at the
Pydantic level, and add a `field_validator` that calls `math.isfinite`
so that `inf` and `nan` are also rejected with a clear error message.
The existing semantic where `loss_scale=0` selects dynamic loss scaling
is preserved unchanged.

Add unit tests to `tests/unit/runtime/test_ds_config_model.py` covering
the valid range (0, 1.0, 128.0, 65536.0), the previously accepted
non-finite inputs (`inf`, `-inf`, `nan`), and the negative value `-1.0`.

Fixes deepspeedai#7852

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d7af93551


@pytest.mark.parametrize("loss_scale", [float("inf"), float("-inf"), float("nan")])
def test_fp16_loss_scale_rejects_non_finite(loss_scale):
    """Non-finite loss_scale values must raise ValidationError (issue #7852)."""
    with pytest.raises(ValidationError, match="finite"):


P1: Relax non-finite error match for constrained loss_scale

This assertion is too strict for -inf/nan: DeepSpeedFP16Config.loss_scale now has Field(..., ge=0), and Pydantic runs field constraints before the default @field_validator (after-mode), so some non-finite inputs can raise a ge validation message instead of one containing "finite". In that case this test fails even though the config correctly rejects the invalid value, causing a false negative in CI for the new coverage.
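The ordering the review describes can be demonstrated with a throwaway model (a hypothetical stand-in for `DeepSpeedFP16Config`, not the actual class): `-inf` fails the `ge=0` constraint before the after-mode validator runs, while `inf` passes `ge=0` and is caught by the validator, so the two produce different error messages.

```python
import math

from pydantic import BaseModel, Field, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")  # after-mode: runs only if ge=0 passed
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("loss_scale must be finite")
        return v


try:
    Cfg(loss_scale=float("-inf"))
except ValidationError as e:
    # ge=0 failed first, so the message mentions the bound, not "finite"
    print(e.errors()[0]["msg"])

try:
    Cfg(loss_scale=float("inf"))
except ValidationError as e:
    # inf passes ge=0, so this time the custom validator fired
    print(e.errors()[0]["msg"])
```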


s-zx (Author):

Good catch — fixed. Removed the match="finite" argument from the parametrized test. Pydantic runs the ge=0 field constraint before the after-mode field_validator, so for -inf/nan the error message may be "greater than or equal to 0" rather than our "finite" message. The assertion now only checks that a ValidationError is raised, which is the important invariant.

s-zx (Author):

Addressed — relaxed the test assertion to accept either a message containing "finite" (from our field_validator) or "greater than or equal" (from Pydantic's ge=0 constraint), since Pydantic may run field constraints before the validator for -inf/nan.
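In pytest terms the flexible variant is `pytest.raises(ValidationError, match=r"finite|greater than or equal")`, since `match` uses `re.search`. A dependency-light sketch of the same check, against a hypothetical stand-in model rather than the real `DeepSpeedFP16Config`:

```python
import math
import re

from pydantic import BaseModel, Field, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = Field(0, ge=0)

    @field_validator("loss_scale")
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            raise ValueError("loss_scale must be finite")
        return v


# Accept either error source: the custom validator ("finite") or the
# ge=0 constraint ("greater than or equal").
PATTERN = re.compile(r"finite|greater than or equal")

for bad in (float("inf"), float("-inf"), float("nan")):
    try:
        Cfg(loss_scale=bad)
        raise AssertionError(f"{bad} was not rejected")
    except ValidationError as e:
        assert PATTERN.search(str(e)), str(e)
```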

…constraint order

Pydantic runs field constraints (ge=0) before the after-mode
field_validator, so for -inf and nan the ValidationError may contain
"greater than or equal to" rather than "finite".  The assertion
`match="finite"` was therefore too strict and would fail in CI even
though the invalid value is correctly rejected.

Drop the match argument; validating that a ValidationError is raised
is sufficient to confirm the fix.
Collaborator

@tohtana tohtana left a comment


Hi @s-zx,
This has some overlap with #7889. Can we discuss how to consolidate in threads of #7889?
#7889 catches True and False as invalid values, but (from my understanding) it currently doesn't raise Pydantic validation errors.
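One relevant Pydantic v2 detail on that last point: a plain `ValueError` raised inside a `field_validator` is wrapped into a `ValidationError` by the model, so a `math.isfinite` check only bypasses Pydantic's error type if it runs outside the model. A minimal illustration with hypothetical names:

```python
import math

from pydantic import BaseModel, ValidationError, field_validator


class Cfg(BaseModel):
    loss_scale: float = 0

    @field_validator("loss_scale")
    @classmethod
    def _finite(cls, v: float) -> float:
        if not math.isfinite(v):
            # This ValueError surfaces to callers as a ValidationError
            raise ValueError("loss_scale must be finite")
        return v


try:
    Cfg(loss_scale=float("inf"))
except ValidationError as e:
    # the ValueError was wrapped; its error type is "value_error"
    assert e.errors()[0]["type"] == "value_error"
```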

…ssage

Pydantic may run the ge=0 field constraint before the field_validator,
producing a "greater than or equal" message for -inf/nan instead of our
"finite" message. Use a flexible match to accept either.

Addresses Codex review feedback.

tohtana commented Mar 16, 2026

Hi @s-zx
#7889 now checks math.isfinite(v) and raises ValueError. Do we still need this change?


Successfully merging this pull request may close these issues.

[Bug] Missing input validation for fp16 loss_scale accepts float('inf')
