[serve][docs] Introduce gang scheduling documentation#61737

Draft
jeffreywang-anyscale wants to merge 2 commits into master from gang-docs

Conversation

@jeffreywang-anyscale
Contributor

Description

Introduces documentation for the new gang scheduling feature in Ray Serve, covering the feature's motivation, usage, configuration options, and internals.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

In this example, each gang of 2 replicas creates a single gang placement group with 2 bundles (one `{"CPU": 1, "GPU": 1}` bundle per replica) upon scheduling. Note that `ray_actor_options={"num_cpus": 0}` is set so the replica actor doesn't request resources outside the placement group — all resource reservation is handled through the bundles.
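A minimal sketch of the deployment described above, assuming a `gang_size` deployment option as introduced by this PR (the exact parameter name is an assumption; the other options are existing Serve APIs):

```python
# Hypothetical sketch: a gang of 2 replicas, one {"CPU": 1, "GPU": 1} bundle each.
from ray import serve


@serve.deployment(
    num_replicas=2,
    gang_size=2,  # assumption: gang option name from this PR
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],  # one bundle per replica
    placement_group_strategy="PACK",
    ray_actor_options={"num_cpus": 0},  # resources come from the bundles, not the actor
)
class GangDeployment:
    def __call__(self, request):
        return "ok"


app = GangDeployment.bind()
```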
Contributor Author


Add a diagram


In the Ray Serve autoscaler, gang quantization is handled automatically by a `GangSchedulingAutoscalingPolicy` wrapper that is injected around the base autoscaling policy.

**Example**: With `gang_size=4` and 8 current replicas, if the base autoscaling policy recommends 5 replicas (scale down), the gang-aware policy rounds down to 4, releasing one complete gang. If the policy recommends 10 replicas (scale up), the gang-aware policy rounds up to 12, creating one complete new gang.
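The rounding behavior in the example can be sketched as a standalone function (`quantize_to_gangs` is an illustrative name, not the Serve API; the actual logic lives inside `GangSchedulingAutoscalingPolicy`):

```python
def quantize_to_gangs(current_replicas: int, recommended: int, gang_size: int) -> int:
    """Round an autoscaling recommendation to a whole number of gangs.

    Scale-downs round down so only complete gangs are released;
    scale-ups round up so only complete gangs are created.
    """
    if recommended < current_replicas:
        return (recommended // gang_size) * gang_size
    # Ceiling division without importing math.
    return -(-recommended // gang_size) * gang_size
```

With `gang_size=4` and 8 current replicas, a recommendation of 5 quantizes to 4 and a recommendation of 10 quantizes to 12, matching the example above.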
Contributor Author


Add a diagram

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive documentation for the new gang scheduling feature in Ray Serve. The documentation covers the feature's motivation, usage, configuration options, and internal workings. The changes are well-structured and the examples are clear. I've provided a few suggestions to improve clarity and accuracy in the documentation.

Gang scheduling enables you to co-schedule groups of deployment replicas atomically. A **gang** is a set of replicas that are reserved and started together using a single [Ray placement group](ray-placement-group-doc-ref). If the cluster doesn't have enough resources for the entire gang, none of the replicas in that gang are started.

This is useful for workloads where a partial set of replicas is useless, such as:
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention and expert parallelism ranks are required to coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention and expert parallelism group needs to go through a failover mechanism to re-establish collectives.
Contributor


medium

This sentence has a minor grammatical issue and could be rephrased for better clarity. The phrase 'are required coordinate' should be 'are required to coordinate'. Also, using 'and' instead of a hyphen between 'data parallel attention' and 'expert parallelism' would improve readability.

Suggested change
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention - expert parallelism ranks are required coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention - expert parallelism group needs to go through failover mechanism to re-establish collectives.
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention and expert parallelism ranks are required to coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention and expert parallelism group needs to go through failover mechanism to re-establish collectives.


### PACK (default)

Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within a data parallel attention and expert parallelism deployment for MoE LLMs.
Contributor


medium

This phrasing is a bit awkward. Using 'and' instead of a hyphen would make it more readable.

Suggested change
Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within data parallel attention - expert parallelism deployment for MoE LLMs.
Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within a data parallel attention and expert parallelism deployment for MoE LLMs.


If each replica needed multiple bundles (for example, one for the replica actor and one for a worker), the gang PG would contain `gang_size * len(placement_group_bundles)` total bundles. Replica 0 would occupy bundle indices 0 and 1, replica 1 would occupy indices 2 and 3, and so on.
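The index arithmetic above can be sketched as follows (the function name is illustrative, not part of the Serve API):

```python
def replica_bundle_indices(replica_rank: int, bundles_per_replica: int) -> list[int]:
    """Return the bundle indices a replica occupies inside the gang placement group.

    Bundles are laid out contiguously: replica 0 takes the first
    `bundles_per_replica` indices, replica 1 the next block, and so on.
    """
    start = replica_rank * bundles_per_replica
    return list(range(start, start + bundles_per_replica))
```

With `gang_size=4` and two bundles per replica, the gang placement group holds 8 bundles in total, and replica 3 occupies indices 6 and 7.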

You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The label selectors are replicated for each replica in the gang, steering all of the gang's bundles to nodes matching the selectors. For example, to schedule all gang members on nodes with A100 GPUs:
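A sketch of such a configuration, assuming the `placement_group_bundle_label_selector` and `gang_size` options from this PR and Ray's `ray.io/accelerator-type` node label (both the option names and the label key are assumptions here):

```python
from ray import serve


@serve.deployment(
    num_replicas=4,
    gang_size=4,  # assumption: gang option name from this PR
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],
    # Replicated for each replica: every bundle in the gang targets A100 nodes.
    placement_group_bundle_label_selector=[{"ray.io/accelerator-type": "A100"}],
    ray_actor_options={"num_cpus": 0},
)
class A100GangDeployment:
    def __call__(self, request):
        return "ok"
```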
Contributor


medium

The phrase 'per-replica label selector' is a bit misleading, as the selector is defined per-bundle within placement_group_bundles, not per-replica. This could be rephrased for better accuracy.

Suggested change
You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The per-replica label selector is replicated across all replicas in the gang, so every replica is steered to nodes matching the selector. For example, to schedule all gang members on nodes with A100 GPUs:
You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The label selectors are replicated for each replica in the gang, steering all of the gang's bundles to nodes matching the selectors. For example, to schedule all gang members on nodes with A100 GPUs:

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>


Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects
