[serve][docs] Introduce gang scheduling documentation#61737

Draft
jeffreywang-anyscale wants to merge 2 commits into master from gang-docs

Conversation

@jeffreywang-anyscale
Contributor

Description

Introduces documentation for the new gang scheduling feature in Ray Serve, covering the feature's motivation, usage, configuration options, and internals.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

In this example, each gang of 2 replicas creates a single gang placement group with 2 bundles (one `{"CPU": 1, "GPU": 1}` bundle per replica) upon scheduling. Note that `ray_actor_options={"num_cpus": 0}` is set so the replica actor doesn't request resources outside the placement group — all resource reservation is handled through the bundles.
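A minimal sketch of the deployment described above, assuming a `gang_size` deployment option as introduced by this PR (the exact parameter name is an assumption; the other options are existing Serve APIs):

```python
# Hypothetical sketch: a gang of 2 replicas, one {"CPU": 1, "GPU": 1} bundle each.
from ray import serve


@serve.deployment(
    num_replicas=2,
    gang_size=2,  # assumption: gang option name from this PR
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],  # one bundle per replica
    placement_group_strategy="PACK",
    ray_actor_options={"num_cpus": 0},  # resources come from the bundles, not the actor
)
class GangDeployment:
    def __call__(self, request):
        return "ok"


app = GangDeployment.bind()
```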
Contributor Author


Add a diagram


In the Ray Serve autoscaler, gang quantization is handled automatically by a `GangSchedulingAutoscalingPolicy` wrapper that is injected around the base autoscaling policy.

**Example**: With `gang_size=4` and 8 current replicas, if the base autoscaling policy recommends 5 replicas (scale down), the gang-aware policy rounds down to 4, releasing one complete gang. If the policy recommends 10 replicas (scale up), the gang-aware policy rounds up to 12, creating one complete new gang.
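The rounding behavior in the example can be sketched as a standalone function (`quantize_to_gangs` is an illustrative name, not the Serve API; the actual logic lives inside `GangSchedulingAutoscalingPolicy`):

```python
def quantize_to_gangs(current_replicas: int, recommended: int, gang_size: int) -> int:
    """Round an autoscaling recommendation to a whole number of gangs.

    Scale-downs round down so only complete gangs are released;
    scale-ups round up so only complete gangs are created.
    """
    if recommended < current_replicas:
        return (recommended // gang_size) * gang_size
    # Ceiling division without importing math.
    return -(-recommended // gang_size) * gang_size
```

With `gang_size=4` and 8 current replicas, a recommendation of 5 quantizes to 4 and a recommendation of 10 quantizes to 12, matching the example above.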
Contributor Author


Add a diagram

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive documentation for the new gang scheduling feature in Ray Serve. The documentation covers the feature's motivation, usage, configuration options, and internal workings. The changes are well-structured and the examples are clear. I've provided a few suggestions to improve clarity and accuracy in the documentation.

Gang scheduling enables you to co-schedule groups of deployment replicas atomically. A **gang** is a set of replicas that are reserved and started together using a single [Ray placement group](ray-placement-group-doc-ref). If the cluster doesn't have enough resources for the entire gang, none of the replicas in that gang are started.

This is useful for workloads where a partial set of replicas is useless, such as:
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention and expert parallelism ranks are required to coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention and expert parallelism group needs to go through a failover mechanism to re-establish collectives.
Contributor


medium

This sentence has a minor grammatical issue and could be rephrased for better clarity. The phrase 'are required coordinate' should be 'are required to coordinate'. Also, using 'and' instead of a hyphen between 'data parallel attention' and 'expert parallelism' would improve readability.

Suggested change
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention - expert parallelism ranks are required coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention - expert parallelism group needs to go through failover mechanism to re-establish collectives.
- **Data parallel attention deployment**: In WideEP deployments, data parallel attention and expert parallelism ranks are required to coordinate with each other to perform dispatch-combine collective communication. Any rank failure leads to dispatch-combine collective hangs, and the entire data parallel attention and expert parallelism group needs to go through failover mechanism to re-establish collectives.


### PACK (default)

Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within a data parallel attention and expert parallelism deployment for MoE LLMs.
Contributor


medium

This phrasing is a bit awkward. Using 'and' instead of a hyphen would make it more readable.

Suggested change
Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within data parallel attention - expert parallelism deployment for MoE LLMs.
Packs all replicas in a gang onto as few nodes as possible. This is best for workloads that benefit from locality, such as data parallel ranks within a data parallel attention and expert parallelism deployment for MoE LLMs.


If each replica needed multiple bundles (for example, one for the replica actor and one for a worker), the gang PG would contain `gang_size * len(placement_group_bundles)` total bundles. Replica 0 would occupy bundle indices 0 and 1, replica 1 would occupy indices 2 and 3, and so on.
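The index arithmetic above can be sketched as follows (the function name is illustrative, not part of the Serve API):

```python
def replica_bundle_indices(replica_rank: int, bundles_per_replica: int) -> list[int]:
    """Return the bundle indices a replica occupies inside the gang placement group.

    Bundles are laid out contiguously: replica 0 takes the first
    `bundles_per_replica` indices, replica 1 the next block, and so on.
    """
    start = replica_rank * bundles_per_replica
    return list(range(start, start + bundles_per_replica))
```

With `gang_size=4` and two bundles per replica, the gang placement group holds 8 bundles in total, and replica 3 occupies indices 6 and 7.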

You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The label selectors are replicated for each replica in the gang, steering all of the gang's bundles to nodes matching the selectors. For example, to schedule all gang members on nodes with A100 GPUs:
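A sketch of such a configuration, assuming the `placement_group_bundle_label_selector` and `gang_size` options from this PR and Ray's `ray.io/accelerator-type` node label (both the option names and the label key are assumptions here):

```python
from ray import serve


@serve.deployment(
    num_replicas=4,
    gang_size=4,  # assumption: gang option name from this PR
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],
    # Replicated for each replica: every bundle in the gang targets A100 nodes.
    placement_group_bundle_label_selector=[{"ray.io/accelerator-type": "A100"}],
    ray_actor_options={"num_cpus": 0},
)
class A100GangDeployment:
    def __call__(self, request):
        return "ok"
```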
Contributor


medium

The phrase 'per-replica label selector' is a bit misleading, as the selector is defined per-bundle within placement_group_bundles, not per-replica. This could be rephrased for better accuracy.

Suggested change
You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The per-replica label selector is replicated across all replicas in the gang, so every replica is steered to nodes matching the selector. For example, to schedule all gang members on nodes with A100 GPUs:
You can also use `placement_group_bundle_label_selector` to control which nodes the gang's bundles are placed on. The label selectors are replicated for each replica in the gang, steering all of the gang's bundles to nodes matching the selectors. For example, to schedule all gang members on nodes with A100 GPUs:

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>


Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects
