[serve] Stabilize gang scheduling tests #61727
jeffreywang-anyscale wants to merge 3 commits into master
Conversation
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This pull request focuses on improving the stability of gang scheduling tests. The changes include replacing flaky file-based synchronization with more reliable actor-based mechanisms, using serve.status() with use_controller=True to avoid dependencies on the dashboard, and relaxing test timeouts. Additionally, error handling in tests is enhanced to differentiate between expected failures during recovery and actual issues. A bug fix is also included in deployment_state.py to prevent errors when recovering an already assigned replica rank, making the system more robust.
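The move from file-based to actor-based synchronization replaces filesystem polling with an explicit signal object. A minimal stand-in sketch of the pattern is below, using a plain `threading.Event` instead of an actual Ray signal actor (in the real tests this would be a `@ray.remote` class); all names here are illustrative, not the PR's code:

```python
import threading

class SignalActor:
    """Stand-in for a Ray signal actor, shown with plain threads to
    illustrate the synchronization pattern. Illustrative only."""

    def __init__(self):
        self._event = threading.Event()

    def send(self):
        self._event.set()

    def wait(self, timeout=None):
        return self._event.wait(timeout)

signal = SignalActor()
results = []

def replica_task():
    # The "replica" blocks until the test explicitly releases it --
    # no temp files, no filesystem polling, no stale-file races.
    signal.wait()
    results.append("proceeded")

t = threading.Thread(target=replica_task)
t.start()
signal.send()  # test side: release the waiting "replica"
t.join()
print(results)  # -> ['proceeded']
```

Compared with touching a sentinel file on disk, the signal object gives the test a single, race-free source of truth for "the replica may proceed."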
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
# is dead but the handle may still route to the dead replica actors
# until the controller detects the failure and restarts them.
# After full recovery, no errors should occur.
assert len(errors_after_recovery) == 0
Race condition in error classification may cause flaky test
Medium Severity
The recovered event is checked after .result() returns, but a request can be submitted before recovered.set() and fail after it. The background thread calls handle.remote().result() which blocks; if the request was dispatched to a replica that's still unhealthy right before fully_recovered() returns, the exception may only be raised after the main thread calls recovered.set(). The thread then sees recovered.is_set() as True and appends to errors_after_recovery, causing assert len(errors_after_recovery) == 0 to fail spuriously. Since this PR aims to stabilize flaky tests, this new race condition works against that goal.
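The race described above can be avoided by classifying a failure by when the request was submitted, not by when its exception surfaced. A deterministic stand-in sketch of that idea (plain Python, illustrative names only, not the PR's test code):

```python
import threading

recovered = threading.Event()
errors_after_recovery = []

def classify_failure(submitted_after_recovery: bool) -> None:
    # Classify by the recovery flag snapshotted at *submission* time.
    # Checking recovered.is_set() when the exception surfaces races
    # with recovered.set() on the main thread.
    if submitted_after_recovery:
        errors_after_recovery.append("unexpected failure")

# A request submitted before recovery completed...
snapshot = recovered.is_set()  # False: still recovering at submission
recovered.set()                # recovery finishes while request is in flight
classify_failure(snapshot)     # ...then fails: correctly not counted

assert errors_after_recovery == []
```

With a completion-time check, the same in-flight request would land in `errors_after_recovery` and trip the assertion spuriously.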
# Skip if the rank is already assigned (e.g., health-check failure
# put the replica into RECOVERING without a controller crash, so the
# rank was never released).
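The fix makes rank recovery idempotent: recovering a replica whose rank was never released must be a no-op rather than an error. A simplified sketch of that behavior (illustrative, not the actual `deployment_state.py` code):

```python
class RankManager:
    """Illustrative rank bookkeeping: re-recovering a replica whose
    rank is still assigned is skipped instead of raising."""

    def __init__(self, num_ranks=4):
        self._ranks = {}                   # replica_id -> rank
        self._free = set(range(num_ranks))

    def recover_rank(self, replica_id, rank):
        existing = self._ranks.get(replica_id)
        if existing is not None:
            # Rank already assigned (e.g., a health-check failure put
            # the replica into RECOVERING without a controller crash,
            # so the rank was never released): skip reassignment.
            assert existing == rank
            return existing
        self._free.discard(rank)
        self._ranks[replica_id] = rank
        return rank

mgr = RankManager()
mgr.recover_rank("replica-a", 0)
# Second recovery of the same replica is a no-op, not an error.
assert mgr.recover_rank("replica-a", 0) == 0
```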
nice find, can you add a test for this in test_replica_ranks


Description
Stabilizing flaky tests:

Approach
list_actors() (which requires the dashboard) fails. Using use_controller=True avoids this by querying replica states through serve.status(), which goes through the Serve controller via GCS.

Related issues
Additional information
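The stabilized tests repeatedly poll replica state (e.g., via serve.status()) until a condition holds or a relaxed timeout expires. A generic polling helper in the spirit of Ray's wait_for_condition test utility can be sketched as follows; this is an illustrative reimplementation, not the library function:

```python
import time

def wait_for_condition(condition, timeout=30.0, retry_interval=0.5):
    """Poll `condition` until it returns truthy or `timeout` elapses.
    Transient exceptions are tolerated and retried, since the system
    may be mid-recovery when the predicate runs."""
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return True
        except Exception as e:  # e.g., controller not reachable yet
            last_exc = e
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s: {last_exc}")

# Trivial stand-in predicate; a real test would check, say, that all
# replicas reported by serve.status() are RUNNING.
assert wait_for_condition(lambda: True, timeout=1.0)
```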