[serve] Stabilize gang scheduling tests #61727

Open
jeffreywang-anyscale wants to merge 3 commits into master from gang-flaky-tests

jeffreywang-anyscale commented Mar 13, 2026

Description

Stabilizing flaky tests:
[Image: Screenshot 2026-03-13 at 12 21 04 PM]

Approach

  • The dashboard may be unavailable when a previous test's dashboard process is still holding port 8265 as the next test starts a new Ray cluster. The new cluster's dashboard then fails to bind to that port, so list_actors(), which requires the dashboard, fails. Passing use_controller=True avoids this by querying replica states through serve.status(), which goes through the Serve controller via GCS.
  • Replace file-based synchronization with signal actors.
  • Relax timeouts.
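
The port conflict in the first bullet can be reproduced without Ray. The sketch below is only an illustration using plain sockets: `try_bind` is a hypothetical helper, and an OS-assigned port stands in for 8265 so the example does not collide with a real dashboard.

```python
import socket
from typing import Optional


def try_bind(port: int) -> Optional[socket.socket]:
    """Return a listening socket on 127.0.0.1:port, or None if the bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "address already in use": another socket still owns the port.
        sock.close()
        return None


# Stand-in for the previous test's dashboard; port 0 lets the OS pick a free port.
first = try_bind(0)
port = first.getsockname()[1]

# Stand-in for the new cluster's dashboard trying to bind the same port.
second = try_bind(port)
second_failed = second is None

# Once the old holder releases the port, a new bind succeeds.
first.close()
third = try_bind(port)
rebound = third is not None
if third is not None:
    third.close()
```

Any check that depends on the second listener, as list_actors() depends on the dashboard, fails until the first holder exits, which is why a code path that does not need the dashboard is less sensitive to test ordering.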

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
jeffreywang-anyscale requested a review from a team as a code owner March 13, 2026 19:21
jeffreywang-anyscale added the go (add ONLY when ready to merge, run all tests) label Mar 13, 2026
gemini-code-assist bot left a comment

Code Review

This pull request focuses on improving the stability of gang scheduling tests. The changes include replacing flaky file-based synchronization with more reliable actor-based mechanisms, using serve.status() with use_controller=True to avoid dependencies on the dashboard, and relaxing test timeouts. Additionally, error handling in tests is enhanced to differentiate between expected failures during recovery and actual issues. A bug fix is also included in deployment_state.py to prevent errors when recovering an already assigned replica rank, making the system more robust.
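
To illustrate why a signal is more reliable than polling a file, here is a minimal in-process sketch. It is not Ray's signal actor: `Signal` and `worker` are hypothetical stand-ins built on `threading.Event`, and a thread plays the role of the replica.

```python
import threading


class Signal:
    """In-process stand-in for a signal actor: waiters block until send() is called."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def send(self) -> None:
        self._event.set()

    def wait(self, timeout: float = None) -> bool:
        # Returns True if the signal was sent, False if the timeout expired.
        return self._event.wait(timeout)


def worker(signal: Signal, log: list) -> None:
    log.append("started")
    # Block until the test releases the worker. There is no sleep/poll loop
    # on a marker file, so there is no stale-file or filesystem-latency window.
    if signal.wait(timeout=5):
        log.append("released")


signal = Signal()
log = []
thread = threading.Thread(target=worker, args=(signal, log))
thread.start()

# A test would assert on the "blocked" state here, then release the worker.
signal.send()
thread.join(timeout=5)
```

The waiter is released exactly when `send()` is called, so the test controls ordering directly instead of inferring it from the presence of a file.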

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
ray-gardener bot added the serve (Ray Serve Related Issue) label Mar 14, 2026
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

# is dead but the handle may still route to the dead replica actors
# until the controller detects the failure and restarts them.
# After full recovery, no errors should occur.
assert len(errors_after_recovery) == 0

Race condition in error classification may cause flaky test

Medium Severity

The recovered event is checked after .result() returns, but a request can be submitted before recovered.set() and fail after it. The background thread calls handle.remote().result() which blocks; if the request was dispatched to a replica that's still unhealthy right before fully_recovered() returns, the exception may only be raised after the main thread calls recovered.set(). The thread then sees recovered.is_set() as True and appends to errors_after_recovery, causing assert len(errors_after_recovery) == 0 to fail spuriously. Since this PR aims to stabilize flaky tests, this new race condition works against that goal.
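
The interleaving described above can be made deterministic in a small sketch. Everything here is hypothetical scaffolding rather than the PR's test code: `slow_failing_request` stands in for `handle.remote().result()`, and sampling the flag at submission time is shown only as one possible mitigation.

```python
import threading

recovered = threading.Event()
request_started = threading.Event()
may_fail = threading.Event()

errors_checked_after = []   # flag read after the request fails (pattern under review)
errors_checked_before = []  # flag sampled when the request is submitted


def slow_failing_request() -> None:
    """Stand-in for a blocking request routed to a replica that is still unhealthy."""
    request_started.set()
    may_fail.wait(timeout=5)
    raise RuntimeError("replica died")


def client() -> None:
    submitted_after_recovery = recovered.is_set()  # sampled before submitting
    try:
        slow_failing_request()
    except RuntimeError as exc:
        if recovered.is_set():
            errors_checked_after.append(exc)
        if submitted_after_recovery:
            errors_checked_before.append(exc)


thread = threading.Thread(target=client)
thread.start()
request_started.wait(timeout=5)  # the request is now in flight
recovered.set()                  # main thread declares recovery
may_fail.set()                   # the in-flight request fails only now
thread.join(timeout=5)
```

In this run the request was submitted before recovery but failed after it, so the post-hoc check records it as a post-recovery error while the submission-time check does not.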

Comment on lines +3404 to +3406
# Skip if the rank is already assigned (e.g., health-check failure
# put the replica into RECOVERING without a controller crash, so the
# rank was never released).

nice find, can you add a test for this in test_replica_ranks
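
A rough sketch of the behavior such a test could pin down, assuming the guard makes rank recovery a no-op when the rank is already held. `RankTable` and its methods are hypothetical and do not mirror the actual code in deployment_state.py.

```python
class RankTable:
    """Hypothetical rank bookkeeping: one rank per replica ID."""

    def __init__(self) -> None:
        self._ranks = {}

    def assign(self, replica_id: str, rank: int) -> None:
        if replica_id in self._ranks:
            raise RuntimeError(f"{replica_id} already has rank {self._ranks[replica_id]}")
        self._ranks[replica_id] = rank

    def recover(self, replica_id: str, rank: int) -> bool:
        """Re-register a rank during recovery; skip if it was never released."""
        if replica_id in self._ranks:
            # Assumed scenario: a health-check failure moved the replica to
            # RECOVERING without a controller crash, so the rank is still held.
            return False
        self.assign(replica_id, rank)
        return True

    def get(self, replica_id: str) -> int:
        return self._ranks[replica_id]


table = RankTable()
table.assign("replica-a", 0)

skipped = table.recover("replica-a", 0)    # already assigned: no error, no change
restored = table.recover("replica-b", 1)   # not yet assigned: recovery registers it
```

Without the membership check in `recover`, the first call would raise, which corresponds to the failure mode the quoted comment describes.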
