[serve] Stabilize gang scheduling tests #61727
jeffreywang-anyscale wants to merge 3 commits into master
Conversation
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This pull request focuses on improving the stability of gang scheduling tests. The changes include replacing flaky file-based synchronization with more reliable actor-based mechanisms, using serve.status() with use_controller=True to avoid dependencies on the dashboard, and relaxing test timeouts. Additionally, error handling in tests is enhanced to differentiate between expected failures during recovery and actual issues. A bug fix is also included in deployment_state.py to prevent errors when recovering an already assigned replica rank, making the system more robust.
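The move from file-based to actor-based synchronization replaces filesystem polling with an explicit signal object. A minimal stand-in sketch of the pattern is below, using a plain `threading.Event` instead of an actual Ray signal actor (in the real tests this would be a `@ray.remote` class); all names here are illustrative, not the PR's code:

```python
import threading

class SignalActor:
    """Stand-in for a Ray signal actor, shown with plain threads to
    illustrate the synchronization pattern. Illustrative only."""

    def __init__(self):
        self._event = threading.Event()

    def send(self):
        self._event.set()

    def wait(self, timeout=None):
        return self._event.wait(timeout)

signal = SignalActor()
results = []

def replica_task():
    # The "replica" blocks until the test explicitly releases it --
    # no temp files, no filesystem polling, no stale-file races.
    signal.wait()
    results.append("proceeded")

t = threading.Thread(target=replica_task)
t.start()
signal.send()  # test side: release the waiting "replica"
t.join()
print(results)  # -> ['proceeded']
```

Compared with touching a sentinel file on disk, the signal object gives the test a single, race-free source of truth for "the replica may proceed."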
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
# is dead but the handle may still route to the dead replica actors
# until the controller detects the failure and restarts them.
# After full recovery, no errors should occur.
assert len(errors_after_recovery) == 0
Race condition in error classification may cause flaky test
Medium Severity
The recovered event is checked after .result() returns, but a request can be submitted before recovered.set() and fail after it. The background thread calls handle.remote().result() which blocks; if the request was dispatched to a replica that's still unhealthy right before fully_recovered() returns, the exception may only be raised after the main thread calls recovered.set(). The thread then sees recovered.is_set() as True and appends to errors_after_recovery, causing assert len(errors_after_recovery) == 0 to fail spuriously. Since this PR aims to stabilize flaky tests, this new race condition works against that goal.
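The race described above can be avoided by classifying a failure by when the request was submitted, not by when its exception surfaced. A deterministic stand-in sketch of that idea (plain Python, illustrative names only, not the PR's test code):

```python
import threading

recovered = threading.Event()
errors_after_recovery = []

def classify_failure(submitted_after_recovery: bool) -> None:
    # Classify by the recovery flag snapshotted at *submission* time.
    # Checking recovered.is_set() when the exception surfaces races
    # with recovered.set() on the main thread.
    if submitted_after_recovery:
        errors_after_recovery.append("unexpected failure")

# A request submitted before recovery completed...
snapshot = recovered.is_set()  # False: still recovering at submission
recovered.set()                # recovery finishes while request is in flight
classify_failure(snapshot)     # ...then fails: correctly not counted

assert errors_after_recovery == []
```

With a completion-time check, the same in-flight request would land in `errors_after_recovery` and trip the assertion spuriously.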
# Skip if the rank is already assigned (e.g., health-check failure
# put the replica into RECOVERING without a controller crash, so the
# rank was never released).
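The fix makes rank recovery idempotent: recovering a replica whose rank was never released must be a no-op rather than an error. A simplified sketch of that behavior (illustrative, not the actual `deployment_state.py` code):

```python
class RankManager:
    """Illustrative rank bookkeeping: re-recovering a replica whose
    rank is still assigned is skipped instead of raising."""

    def __init__(self, num_ranks=4):
        self._ranks = {}                   # replica_id -> rank
        self._free = set(range(num_ranks))

    def recover_rank(self, replica_id, rank):
        existing = self._ranks.get(replica_id)
        if existing is not None:
            # Rank already assigned (e.g., a health-check failure put
            # the replica into RECOVERING without a controller crash,
            # so the rank was never released): skip reassignment.
            assert existing == rank
            return existing
        self._free.discard(rank)
        self._ranks[replica_id] = rank
        return rank

mgr = RankManager()
mgr.recover_rank("replica-a", 0)
# Second recovery of the same replica is a no-op, not an error.
assert mgr.recover_rank("replica-a", 0) == 0
```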
nice find, can you add a test for this in test_replica_ranks


Description
Stabilizing flaky tests:

Approach
list_actors() (which requires the dashboard) fails. Using use_controller=True avoids this by querying replica states through serve.status(), which goes through the Serve controller via GCS.

Related issues
Additional information
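The stabilized tests repeatedly poll replica state (e.g., via serve.status()) until a condition holds or a relaxed timeout expires. A generic polling helper in the spirit of Ray's wait_for_condition test utility can be sketched as follows; this is an illustrative reimplementation, not the library function:

```python
import time

def wait_for_condition(condition, timeout=30.0, retry_interval=0.5):
    """Poll `condition` until it returns truthy or `timeout` elapses.
    Transient exceptions are tolerated and retried, since the system
    may be mid-recovery when the predicate runs."""
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return True
        except Exception as e:  # e.g., controller not reachable yet
            last_exc = e
        time.sleep(retry_interval)
    raise TimeoutError(f"Condition not met within {timeout}s: {last_exc}")

# Trivial stand-in predicate; a real test would check, say, that all
# replicas reported by serve.status() are RUNNING.
assert wait_for_condition(lambda: True, timeout=1.0)
```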