Improve normalization of losses and metrics, fix bugs, run distributed model tests on cpu #477

Draft
jlamypoirier wants to merge 14 commits into main from jlp_reduce_losses_with_counts

@jlamypoirier
Collaborator

✨ Description

@jlamypoirier jlamypoirier changed the base branch from main to jlp_batch_fixes March 24, 2026 10:23
jlamypoirier and others added 4 commits March 25, 2026 09:13
fakeredis 2.34 introduced Resp3Writer hardcoded for all TCP connections
regardless of protocol negotiation. When XREADGROUP BLOCK times out on
an empty stream, Resp3Writer.dump(None) sends RESP3 null (b'_\r\n').
The redis-py RESP2 parser (used by default) raises Protocol Error: b'_'.

Fix: monkey-patch TCPFakeRequestHandler.setup in fake_redis_server() to
replace Resp3Writer with Resp2Writer, restoring correct RESP2 null
encoding (b'*-1\r\n') for blocking timeouts. The patch is guarded on
the presence of Resp3Writer (2.34+ only) and raises explicitly if
Resp2Writer is missing so future breakage is immediately diagnosable.
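
The guarded patch described above could be sketched roughly as follows. This is an illustration only: the classes are local stand-ins rather than fakeredis's real ones. Only the names `Resp2Writer`, `Resp3Writer`, and `TCPFakeRequestHandler.setup` come from the commit message; the `writer` attribute, the `dump` signature, and the `patch_handler` helper are assumptions made for the sketch.

```python
import types


class Resp2Writer:
    """Stand-in writer: encodes None as the RESP2 null array."""

    def dump(self, value):
        if value is None:
            return b"*-1\r\n"
        raise NotImplementedError("sketch only handles None")


class Resp3Writer:
    """Stand-in writer: encodes None as the RESP3 null."""

    def dump(self, value):
        if value is None:
            return b"_\r\n"
        raise NotImplementedError("sketch only handles None")


class TCPFakeRequestHandler:
    """Stand-in handler that always picks the RESP3 writer in setup()."""

    def setup(self):
        self.writer = Resp3Writer()  # assumed attribute name


def patch_handler(handler_cls, module):
    """Force RESP2 encoding only if the module ships a Resp3Writer."""
    if not hasattr(module, "Resp3Writer"):
        return False  # older version: nothing to patch
    if not hasattr(module, "Resp2Writer"):
        # Fail loudly so a future rename is easy to diagnose.
        raise RuntimeError("Resp3Writer found but Resp2Writer is missing")
    original_setup = handler_cls.setup

    def setup(self):
        original_setup(self)
        self.writer = module.Resp2Writer()

    handler_cls.setup = setup
    return True


# A namespace playing the role of the library module in this sketch.
fake_module = types.SimpleNamespace(Resp2Writer=Resp2Writer, Resp3Writer=Resp3Writer)
patched = patch_handler(TCPFakeRequestHandler, fake_module)
handler = TCPFakeRequestHandler()
handler.setup()
timeout_reply = handler.writer.dump(None)
```

After patching, a blocking timeout in this sketch is encoded as the RESP2 null array, which a RESP2 parser accepts.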
- Add `divisor` parameter to fused loss functions (entropy, z-loss, grpo) to allow normalizing by actual token count rather than total sequence positions
- Fix `_get_grad_output` to not pre-divide by parallel/split factors (handled by divisor)
- Fix loss accumulation across cross-entropy splits in LM head
- Fix variable naming bug in `_set_distributed_reduction_map`
- Update tests to pass explicit divisor and match new normalization behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
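
The effect of a `divisor` can be illustrated with a plain-Python sketch. This is not the fused implementation from the PR; `masked_cross_entropy` and `mask` are hypothetical names, and only the idea of dividing by an explicit `divisor` comes from the commit message. With padding present, dividing by the number of sequence positions understates the per-token loss, while dividing by the actual token count does not.

```python
import math


def masked_cross_entropy(logits, targets, mask, divisor):
    """Sum per-token cross-entropy over unmasked positions, then divide by `divisor`."""
    total = 0.0
    for row, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[target]
    return total / divisor


logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5], [3.0, 1.0]]
targets = [0, 1, 0, 1]
mask = [True, True, False, False]  # the last two positions are padding

num_positions = len(mask)  # every sequence position, padding included
num_tokens = sum(mask)     # only the positions that carry a real label

loss_by_positions = masked_cross_entropy(logits, targets, mask, num_positions)
loss_by_tokens = masked_cross_entropy(logits, targets, mask, num_tokens)
```

Here half the positions are padding, so `loss_by_positions` is half of `loss_by_tokens`.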
- Fix schedule tflops to divide by 1e12 (was reporting raw flops)
- Change loss reductions from AVG to SUM (needed with token-count weighting)
- Add CPU/gloo fallback support in distributed test configs
- Fix pp tied weight bias ignore_duplicates
- Adjust micro_batch_size and compare targets for distributed configs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
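
A small worked example may help show why SUM fits token-count weighting better than AVG. The sketch simulates two ranks with plain lists (no process groups are involved) and uses made-up loss values. It assumes each rank has already divided its local loss sum by the global token count.

```python
# Per-token losses held by two simulated ranks (illustrative values only).
rank_losses = [
    [1.0, 1.0, 1.0],  # rank 0: three real tokens
    [4.0],            # rank 1: one real token, the rest padding
]

total_tokens = sum(len(losses) for losses in rank_losses)

# Each rank divides its local sum by the global token count,
# so a SUM reduction recovers the true per-token mean.
partials = [sum(losses) / total_tokens for losses in rank_losses]
sum_reduced = sum(partials)

# An AVG reduction over per-rank means weights both ranks equally,
# regardless of how many tokens each one holds.
rank_means = [sum(losses) / len(losses) for losses in rank_losses]
avg_reduced = sum(rank_means) / len(rank_means)

true_mean = sum(sum(losses) for losses in rank_losses) / total_tokens
```

In this example `sum_reduced` equals `true_mean`, while `avg_reduced` overweights the rank with a single token.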
@jlamypoirier jlamypoirier changed the base branch from jlp_batch_fixes to main March 26, 2026 16:10
@jlamypoirier jlamypoirier changed the title from "Improve normalization of losses and metrics" to "Improve normalization of losses and metrics, fix bugs" Mar 26, 2026
jlamypoirier and others added 2 commits March 26, 2026 12:33
MTPLlamaModel uses mtp_norms[0] for the first prediction head instead
of model.norm (as in standard Llama). The converter was inheriting the
Llama mapping (head.final_norm → model.norm), so the native HuggingFace
model loaded converted checkpoints with mtp_norms[0] uninitialized.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
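
The mismatch can be pictured as a name mapping in which the MTP variant needs to override one inherited entry. The sketch below is hypothetical throughout: the key strings are loosely modeled on the names in the commit message, and the real converter is likely structured quite differently.

```python
# Hypothetical name maps; the real converter classes are not shown on this page.
LLAMA_NAME_MAP = {
    "head.final_norm.weight": "model.norm.weight",
}

# The MTP variant overrides the inherited entry so the first prediction
# head's norm lands where the MTP model reads it.
MTP_LLAMA_NAME_MAP = {
    **LLAMA_NAME_MAP,
    "head.final_norm.weight": "model.mtp_norms.0.weight",
}


def convert_names(state, name_map):
    """Rename the keys of `state` according to `name_map`; unknown keys raise KeyError."""
    return {name_map[key]: value for key, value in state.items()}


state = {"head.final_norm.weight": [1.0, 1.0]}
inherited = convert_names(state, LLAMA_NAME_MAP)   # leaves mtp_norms.0 unset
fixed = convert_names(state, MTP_LLAMA_NAME_MAP)   # fills mtp_norms.0
```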
…g datasets

Add __getstate__/__setstate__ to DistributedDim to drop the process group
when pickling, so DataLoader worker processes can be spawned even when the
dataset or collate_fn captures a DistributedConfig with active process groups.

Also expand test_data_streaming to cover num_workers=1 and increase
_NUM_BATCHES from 2 to 10 for better coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
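
The pickling approach can be sketched with a stand-in class. `DimSketch` and its attributes are hypothetical, and a `threading.Lock` plays the role of the live process group, since it is also unpicklable. The sketch leaves the handle as `None` after unpickling; how the real class restores or recreates its group is not shown here.

```python
import pickle
import threading


class DimSketch:
    """Hypothetical stand-in for a dimension object holding a live handle."""

    def __init__(self, name, size):
        self.name = name
        self.size = size
        # A lock cannot be pickled, much like a live process group.
        self.group = threading.Lock()

    def __getstate__(self):
        state = self.__dict__.copy()
        state["group"] = None  # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


dim = DimSketch("data", 4)
restored = pickle.loads(pickle.dumps(dim))
```

Without `__getstate__`, `pickle.dumps(dim)` would raise a `TypeError` because of the lock, which is the kind of failure that prevents worker processes from being spawned.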
@jlamypoirier jlamypoirier changed the title from "Improve normalization of losses and metrics, fix bugs" to "Improve normalization of losses and metrics, fix bugs, run distributed model tests on cpu" Mar 26, 2026
jlamypoirier and others added 2 commits March 26, 2026 13:46
Add tests for padding, multi-token prediction, micro-batch splits,
prediction mask, label counts, GRPO data, position index, inference
phase, document count, and cumulative sequence lengths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pand preprocessing tests

- Guard cross-document label masking against documents shorter than prediction distance
- Fix num_documents to exclude the padding pseudo-document from the count
- Add comprehensive test coverage: all split/target indices, predicted_tokens in (1,3),
  padding variants, and complex multi-document cases with loss masking spans and GRPO data
- Refactor test helpers into cached properties indexed by [split_index][target_index]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
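
One way such a guard and document count might look is sketched below. The function names, the list-based mask, and the convention that padding is stored as a trailing pseudo-document are all assumptions made for illustration, not the PR's actual preprocessing code.

```python
def label_mask(document_lengths, distance):
    """Return True where position i and its label at i + distance share a document."""
    mask = []
    for length in document_lengths:
        # The last `distance` positions have no in-document label. Clamp at zero
        # so a document shorter than `distance` never yields a negative count.
        valid = max(length - distance, 0)
        mask.extend([True] * valid + [False] * (length - valid))
    return mask


def count_documents(document_lengths, has_padding):
    """Count real documents, assuming padding is a trailing pseudo-document."""
    return len(document_lengths) - (1 if has_padding else 0)


lengths = [5, 2, 3]  # the last entry plays the role of padding here
mask = label_mask(lengths, distance=3)
num_documents = count_documents(lengths, has_padding=True)
```

Without the clamp, the two-token document would produce a mask segment longer than the document itself, shifting every later position.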
2 participants