Improve normalization of losses and metrics, fix bugs, run distributed model tests on cpu by jlamypoirier · Pull Request #477 · ServiceNow/Fast-LLM

jlamypoirier · 2026-03-24T10:22:44Z

✨ Description

fakeredis 2.34 introduced `Resp3Writer`, which is hardcoded for all TCP connections regardless of protocol negotiation. When `XREADGROUP BLOCK` times out on an empty stream, `Resp3Writer.dump(None)` sends a RESP3 null (`b'_\r\n'`). The redis-py RESP2 parser (used by default) then raises `Protocol Error: b'_'`.

Fix: monkey-patch `TCPFakeRequestHandler.setup` in `fake_redis_server()` to replace `Resp3Writer` with `Resp2Writer`, restoring the correct RESP2 null encoding (`b'*-1\r\n'`) for blocking timeouts. The patch is guarded on the presence of `Resp3Writer` (2.34+ only) and raises explicitly if `Resp2Writer` is missing, so future breakage is immediately diagnosable.
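The protocol mismatch described above comes down to the first byte of the null reply. The following is a minimal sketch in plain Python, not the actual patch: the helper names (`encode_null_resp2`, `encode_null_resp3`, `resp2_accepts`) are hypothetical and do not mirror fakeredis or redis-py internals.

```python
# Illustrative only: hypothetical helpers, not fakeredis or redis-py code.

# Type markers defined by RESP2: simple string, error, integer, bulk string, array.
RESP2_TYPE_BYTES = {b"+", b"-", b":", b"$", b"*"}


def encode_null_resp2() -> bytes:
    """RESP2 null array, the encoding the commit message quotes for a blocking timeout."""
    return b"*-1\r\n"


def encode_null_resp3() -> bytes:
    """RESP3 dedicated null type."""
    return b"_\r\n"


def resp2_accepts(payload: bytes) -> bool:
    """Return True if the payload starts with a type marker that RESP2 defines."""
    return payload[:1] in RESP2_TYPE_BYTES
```

A parser that only knows the RESP2 markers has no rule for a leading `_`, which is consistent with the `Protocol Error: b'_'` quoted in the commit message.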

- Add `divisor` parameter to fused loss functions (entropy, z-loss, grpo) to allow normalizing by actual token count rather than total sequence positions
- Fix `_get_grad_output` to not pre-divide by parallel/split factors (handled by divisor)
- Fix loss accumulation across cross-entropy splits in LM head
- Fix variable naming bug in `_set_distributed_reduction_map`
- Update tests to pass explicit divisor and match new normalization behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
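The effect of an explicit divisor can be illustrated with a small sketch. This is plain Python with made-up numbers, assuming a hypothetical `summed_loss` helper; it is not the signature of the fused loss functions in Fast-LLM.

```python
def summed_loss(per_token_losses, mask, divisor):
    """Sum the losses at unmasked positions and divide by an explicit divisor."""
    total = sum(loss for loss, keep in zip(per_token_losses, mask) if keep)
    return total / divisor


# Two hypothetical micro-batches of four positions each.
# False marks padding or masked labels that should not contribute.
losses_a, mask_a = [2.0, 4.0, 0.0, 0.0], [True, True, False, False]
losses_b, mask_b = [1.0, 1.0, 1.0, 3.0], [True, True, True, True]

token_count = sum(mask_a) + sum(mask_b)     # positions that actually contribute
position_count = len(mask_a) + len(mask_b)  # all sequence positions

# Dividing every micro-batch by the same global count lets the pieces simply add up.
by_tokens = summed_loss(losses_a, mask_a, token_count) + summed_loss(losses_b, mask_b, token_count)
by_positions = summed_loss(losses_a, mask_a, position_count) + summed_loss(losses_b, mask_b, position_count)
```

With the token count as divisor, the accumulated value is the mean loss over contributing tokens. With the position count, masked positions dilute the result.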

…_losses_with_counts

- Fix schedule tflops divide by 1e12 (was reporting raw flops)
- Change loss reductions from AVG to SUM (needed with token-count weighting)
- Add CPU/gloo fallback support in distributed test configs
- Fix pp tied weight bias ignore_duplicates
- Adjust micro_batch_size and compare targets for distributed configs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
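One way to read the AVG to SUM change: once each rank has already divided its local loss sum by the global token count, summing across ranks yields the global mean, while averaging would divide by the number of ranks a second time. The sketch below simulates the reductions with plain Python lists and made-up numbers; `reduce_sum` and `reduce_avg` are hypothetical stand-ins for collective operations, not Fast-LLM or `torch.distributed` calls.

```python
def reduce_sum(values):
    """Stand-in for a SUM reduction across ranks."""
    return sum(values)


def reduce_avg(values):
    """Stand-in for an AVG reduction across ranks."""
    return sum(values) / len(values)


# Hypothetical per-rank loss sums and contributing token counts.
rank_loss_sums = [6.0, 1.0]
rank_token_counts = [3, 1]
global_tokens = reduce_sum(rank_token_counts)

# Each rank pre-divides by the global token count.
contributions = [loss_sum / global_tokens for loss_sum in rank_loss_sums]

global_mean = reduce_sum(contributions)   # token-weighted mean over all ranks
over_divided = reduce_avg(contributions)  # additionally divided by the rank count

# Averaging per-rank means ignores that ranks hold different token counts.
mean_of_means = reduce_avg([s / c for s, c in zip(rank_loss_sums, rank_token_counts)])
```

Here the true mean over four tokens is 7.0 / 4, which only the SUM path reproduces.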

MTPLlamaModel uses mtp_norms[0] for the first prediction head instead of model.norm (as in standard Llama). The converter was inheriting the Llama mapping (head.final_norm → model.norm), so the native HuggingFace model loaded converted checkpoints with mtp_norms[0] uninitialized. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…g datasets

Add `__getstate__`/`__setstate__` to `DistributedDim` to drop the process group when pickling, so `DataLoader` worker processes can be spawned even when the dataset or `collate_fn` captures a `DistributedConfig` with active process groups.

Also expand `test_data_streaming` to cover `num_workers=1` and increase `_NUM_BATCHES` from 2 to 10 for better coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
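The pickling pattern described above can be sketched as follows. This is a minimal illustration and not the `DistributedDim` implementation: `DimSketch` is a hypothetical class, and a `threading.Lock` stands in for the process group because it is likewise not picklable.

```python
import pickle
import threading


class DimSketch:
    """Hypothetical stand-in for an object that holds a non-picklable handle."""

    def __init__(self, name: str, size: int):
        self.name = name
        self.size = size
        # A lock cannot be pickled; it plays the role of the process group here.
        self.group = threading.Lock()

    def __getstate__(self):
        # Copy the attributes and drop the handle before pickling.
        state = self.__dict__.copy()
        state["group"] = None
        return state

    def __setstate__(self, state):
        # The restored copy has no handle; it would need to be re-created if required.
        self.__dict__.update(state)


restored = pickle.loads(pickle.dumps(DimSketch("data", 4)))
```

Without `__getstate__`, `pickle.dumps` would fail on the lock attribute. With it, the copy that reaches a worker process keeps the plain fields and simply lacks the handle.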

Add tests for padding, multi-token prediction, micro-batch splits, prediction mask, label counts, GRPO data, position index, inference phase, document count, and cumulative sequence lengths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pand preprocessing tests

- Guard cross-document label masking against documents shorter than prediction distance
- Fix num_documents to exclude the padding pseudo-document from the count
- Add comprehensive test coverage: all split/target indices, predicted_tokens in (1,3), padding variants, and complex multi-document cases with loss masking spans and GRPO data
- Refactor test helpers into cached properties indexed by [split_index][target_index]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jlamypoirier added 5 commits March 20, 2026 17:33

fixes

19c6c8a

fix

1b6fcd0

fix

573c6d8

fix

3658c02

stuff

ab39e26

jlamypoirier changed the base branch from main to jlp_batch_fixes March 24, 2026 10:23

jlamypoirier mentioned this pull request Mar 25, 2026

[WIP][PipelineRL] Normalization of new_logprobs and addition of other RL metrics #476

Draft

25 tasks

jlamypoirier and others added 4 commits March 25, 2026 09:13

stuff

2255845

Merge remote-tracking branch 'origin/jlp_batch_fixes' into jlp_reduce…

ac7fa09

…_losses_with_counts

jlamypoirier mentioned this pull request Mar 25, 2026

[Prototype] Normalising by valid tokens #426

Closed

25 tasks

jlamypoirier changed the base branch from jlp_batch_fixes to main March 26, 2026 16:10

jlamypoirier changed the title ~~Improve normalization of losses and metrics~~ Improve normalization of losses and metrics, fix bugs Mar 26, 2026

jlamypoirier and others added 2 commits March 26, 2026 12:33

jlamypoirier changed the title ~~Improve normalization of losses and metrics, fix bugs~~ Improve normalization of losses and metrics, fix bugs, run distributed model tests on cpu Mar 26, 2026

jlamypoirier and others added 2 commits March 26, 2026 13:46

jlamypoirier wants to merge 14 commits into main from jlp_reduce_losses_with_counts

2 participants
