fix(superoffload): preserve multi-group updates with shared CPU buffers (#7905) #7906
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f35fcc8790
```python
copy_event = get_accelerator().Event()
get_accelerator().current_stream().record_event(copy_event)
```
Guard async copy event creation on synchronized accelerators
The new CPU-offload path unconditionally creates and records an accelerator event after fp32_grad_tensor.copy_, but on synchronized accelerators (e.g. CPU_Accelerator) Event and current_stream() are None, so this branch throws before finishing gradient partitioning. This regression is triggered when super_offload is enabled in CPU-only environments, where subgroup_to_device[i] == 'cpu' is always true.
SuperOffload is currently only supported on superchips, so this path assumes CUDA-style async stream/event support.
@xylian86 can we add an assert during DS initialization that SuperOffload is running on CUDA accelerator or specifically Superchips?
…ZeRO-3 subgroup buffering
```diff
 if len(bucket.params) == 0:
     self._cur_bucket_index = i
-if getattr(param, "ds_grad_is_ready", True):
+if getattr(param, "ds_grad_is_ready", True) and param.grad is not None:
```
While this approach works, I am curious why execution gets this far for params without grads. We have a mechanism for excluding frozen params in Z1/2 and Z3. Could it be that we are incorrectly adding backward hooks to frozen parameters?
Looking further, I see that for Z1/Z2 we exclude frozen parameters from backward hooks. It seems we don't have a similar exclusion in Z3/SuperOffload. Can you please double-check this logic?
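An illustrative sketch (hypothetical names, plain stand-in objects rather than torch parameters) of the Z1/Z2-style exclusion the comment describes: parameters with `requires_grad == False` never get a hook registered, so downstream `grad is None` checks become unnecessary:

```python
class Param:
    """Minimal stand-in for a model parameter."""
    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad

def register_reduce_hooks(params, register_hook):
    hooked = []
    for p in params:
        if not p.requires_grad:
            # Frozen: will never produce a gradient, so skip the hook.
            continue
        register_hook(p)
        hooked.append(p.name)
    return hooked
```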
```python
while self.params_in_ipg_bucket_buffer:
    buffered_param = self.params_in_ipg_bucket_buffer.popleft()
    if buffered_param.grad is None:
```
I think if we can filter out frozen params early then we can simplify the code avoiding these checks.
```python
self._DeepSpeedZeroOptimizer_Stage3__add_grad_to_ipg_bucket(buffered_param)
deferred_params = deque()

while self.params_in_ipg_bucket_buffer:
```
Can we achieve further simplification that avoids double nesting by defining self.params_in_ipg_bucket_buffer as a dict of lists for each optimizer group?
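A sketch of that suggested restructuring: one deferred-parameter deque per optimizer sub-group instead of a single flat buffer, so draining a group is a single FIFO loop with no requeueing pass. Class and method names here are hypothetical:

```python
from collections import defaultdict, deque

class PerGroupBucketBuffer:
    """Keeps a separate FIFO of deferred params for each optimizer group."""
    def __init__(self):
        self._buffers = defaultdict(deque)

    def defer(self, group_id, param):
        self._buffers[group_id].append(param)

    def drain(self, group_id):
        # Pop everything queued for this group, in arrival order.
        buf = self._buffers[group_id]
        while buf:
            yield buf.popleft()
```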
Fix issue #7905
The figures below compare per-iteration time and GPU memory usage against the non-offload baseline. The second figure presents a correctness check of the updated version.
