[Bugfix] Validate fp16.loss_scale is finite in DeepSpeedFP16Config#7892
s-zx wants to merge 3 commits into deepspeedai:master from
Conversation
`DeepSpeedFP16Config.loss_scale` accepted `float("inf")` without any
validation, silently allowing the engine to initialise. During
training this caused all gradients to become NaN because the static
loss scaler multiplied every loss by infinity, while other config
fields such as `stage3_max_live_parameters` already used Pydantic
`ge=` constraints and would correctly raise `ValidationError` for
invalid inputs.
Fix: add `Field(0, ge=0)` so that negative values are rejected at the
Pydantic level, and add a `field_validator` that calls `math.isfinite`
so that `inf` and `nan` are also rejected with a clear error message.
The existing semantic where `loss_scale=0` selects dynamic loss scaling
is preserved unchanged.
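The two checks described above can be sketched in plain Python. This is a hedged stand-in for the actual Pydantic field constraint and validator, not DeepSpeed's code; `validate_loss_scale` is a hypothetical name:

```python
import math

def validate_loss_scale(v: float) -> float:
    """Hypothetical stand-in for the PR's two checks on loss_scale."""
    # Mirrors Field(0, ge=0): negative values are rejected.
    if v < 0:
        raise ValueError("loss_scale must be greater than or equal to 0")
    # Mirrors the field_validator calling math.isfinite: inf and nan are rejected.
    if not math.isfinite(v):
        raise ValueError("loss_scale must be finite")
    # 0 is allowed and keeps its existing meaning: dynamic loss scaling.
    return v
```

Note that `nan < 0` is `False`, so `nan` falls through the first check and is caught by `math.isfinite`, while `-inf` is already rejected by the `ge=0`-style comparison.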
Add unit tests to `tests/unit/runtime/test_ds_config_model.py` covering
the valid range (0, 1.0, 128.0, 65536.0), the previously accepted
non-finite inputs (`inf`, `-inf`, `nan`), and the negative value `-1.0`.
Fixes deepspeedai#7852
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6d7af93551
```python
@pytest.mark.parametrize("loss_scale", [float("inf"), float("-inf"), float("nan")])
def test_fp16_loss_scale_rejects_non_finite(loss_scale):
    """Non-finite loss_scale values must raise ValidationError (issue #7852)."""
    with pytest.raises(ValidationError, match="finite"):
```
Relax non-finite error match for constrained loss_scale
This assertion is too strict for -inf/nan: DeepSpeedFP16Config.loss_scale now has Field(..., ge=0), and Pydantic runs field constraints before the default @field_validator (after-mode), so some non-finite inputs can raise a ge validation message instead of one containing "finite". In that case this test fails even though the config correctly rejects the invalid value, causing a false negative in CI for the new coverage.
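The ordering the reviewer describes can be simulated with stdlib Python. This is a sketch, not Pydantic itself, and which message a given input produces in real Pydantic may differ; `first_failing_check` is a hypothetical name:

```python
import math

def first_failing_check(v: float) -> str:
    """Simulate the described order: the ge=0 field constraint runs first,
    the after-mode finiteness validator second."""
    if not (v >= 0):          # ge=0 constraint; note -inf >= 0 is False
        return "greater than or equal to 0"
    if not math.isfinite(v):  # field_validator using math.isfinite
        return "finite"
    return "ok"
```

Under this simulation `-inf` trips the `ge` check and never reaches the finiteness validator, which is exactly why `match="finite"` can fail for it.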
Good catch — fixed. Removed the match="finite" argument from the parametrized test. Pydantic runs the ge=0 field constraint before the after-mode field_validator, so for -inf/nan the error message may be "greater than or equal to 0" rather than our "finite" message. The assertion now only checks that a ValidationError is raised, which is the important invariant.
Addressed — relaxed the test assertion to accept either a message containing "finite" (from our field_validator) or "greater than or equal" (from Pydantic's ge=0 constraint), since Pydantic may run field constraints before the validator for -inf/nan.
…constraint order

Pydantic runs field constraints (ge=0) before the after-mode field_validator, so for -inf and nan the ValidationError may contain "greater than or equal to" rather than "finite". The assertion `match="finite"` was therefore too strict and would fail in CI even though the invalid value is correctly rejected. Drop the match argument; validating that a ValidationError is raised is sufficient to confirm the fix.

…ssage

Pydantic may run the ge=0 field constraint before the field_validator, producing a "greater than or equal" message for -inf/nan instead of our "finite" message. Use a flexible match to accept either. Addresses Codex review feedback.
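The "flexible match" idea can be expressed as a single regex alternation. This is a sketch of the pattern, not necessarily the exact one in the test; `pytest.raises(match=...)` applies its pattern with `re.search`, so an unanchored alternation accepting either message works:

```python
import re

# Accept either our validator's "finite" message or Pydantic's ge=0
# constraint message ("greater than or equal to 0").
FLEXIBLE_MATCH = r"finite|greater than or equal"

assert re.search(FLEXIBLE_MATCH, "loss_scale must be finite")
assert re.search(FLEXIBLE_MATCH, "Input should be greater than or equal to 0")
assert not re.search(FLEXIBLE_MATCH, "some unrelated error")
```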
Summary
`DeepSpeedFP16Config.loss_scale` silently accepted `float('inf')` with no validation, allowing the engine to initialise successfully. During training this caused all gradients to become NaN silently: the static loss scaler multiplied every loss value by infinity. Other config fields (e.g. `stage3_max_live_parameters`) already used Pydantic `ge=` constraints and correctly raised `ValidationError` for invalid inputs, but `loss_scale` was unguarded.
Root Cause
`loss_scale` was declared as a plain `float = 0` with no constraint. Pydantic v2 accepts `inf` for bare `float` fields.
Fix
- `Field(0, ge=0)`: rejects negative values at the Pydantic level.
- A `field_validator("loss_scale")` that calls `math.isfinite(v)`: rejects `inf` and `nan` with a clear error message.
- The existing semantic where `loss_scale=0` selects dynamic loss scaling is preserved unchanged.
Tests
Added to `tests/unit/runtime/test_ds_config_model.py`:
- `test_fp16_loss_scale_valid`: parametrized over `[0, 1.0, 128.0, 65536.0]`
- `test_fp16_loss_scale_rejects_non_finite`: parametrized over `[inf, -inf, nan]`
- `test_fp16_loss_scale_rejects_negative`: checks `-1.0`

Fixes #7852