fix(execute_class): add async lock to prevent double deploy#274

Open
deanq wants to merge 2 commits into main from fix/AE-2370-ensure-initialized-race

@deanq (Member) commented Mar 15, 2026

Summary

  • Add asyncio.Lock to RemoteClassWrapper._ensure_initialized() to prevent concurrent calls from both deploying resources (AE-2370)
  • Uses double-checked locking: fast-path if self._initialized check before lock acquisition, second check inside the lock
  • Add bug probe test TestNEW1_EnsureInitializedRace with 3 tests validating the fix

What was happening

Two concurrent requests both pass the if not self._initialized check, both call get_or_deploy_resource, both deploy — wasting resources and orphaning one stub. The second assignment silently overwrites the first.
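
The race described above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical `UnlockedWrapper` stand-in (not the real `RemoteClassWrapper`) whose `_deploy` coroutine takes the place of `get_or_deploy_resource` and simply yields to the event loop once:

```python
import asyncio


class UnlockedWrapper:
    """Hypothetical stand-in for the pre-fix wrapper; not the real RemoteClassWrapper."""

    def __init__(self):
        self._initialized = False
        self._stub = None
        self.deploy_calls = 0

    async def _deploy(self):
        # Stand-in for get_or_deploy_resource: count the call, then yield
        # to the event loop the way a real network call would.
        self.deploy_calls += 1
        await asyncio.sleep(0)
        return object()

    async def _ensure_initialized(self):
        if not self._initialized:              # both tasks pass this check
            self._stub = await self._deploy()  # the later assignment overwrites the earlier one
            self._initialized = True


async def run_unlocked():
    wrapper = UnlockedWrapper()
    await asyncio.gather(
        wrapper._ensure_initialized(),
        wrapper._ensure_initialized(),
    )
    return wrapper.deploy_calls


if __name__ == "__main__":
    print(asyncio.run(run_unlocked()))  # prints 2: both calls deployed
```

The first task suspends inside `_deploy` before it sets the flag, so the second task still sees `_initialized` as `False` and deploys again.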

Changes

File | Change
execute_class.py | Add import asyncio, self._init_lock in __init__, wrap _ensure_initialized with double-checked lock
tests/bug_probes/test_class_execution.py | 3 tests: concurrent-calls-deploy-once, initialized-flag-set, second-call-skips

Test plan

  • make quality-check passes (85.50% coverage)
  • Bug probe TestNEW1_EnsureInitializedRace validates concurrent calls deploy exactly once
  • Existing test suite unaffected (53 passed, 1 skipped)

Closes AE-2370

@runpod-Henrik (Contributor) left a comment

1. The fix — correct

Double-checked locking is the right pattern here. Fast path avoids lock overhead on every method call after initialization; slow path acquires the lock and re-checks before deploying.

if self._initialized:           # fast path — no lock after init
    return
async with self._init_lock:
    if self._initialized:       # re-check after acquiring lock
        return
    ...
    self._initialized = True    # set only after stub is ready
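
For a runnable version of the pattern above, here is a minimal sketch. `LockedWrapper` and its `_deploy` coroutine are hypothetical stand-ins; only the shape of `_ensure_initialized` mirrors the snippet.

```python
import asyncio


class LockedWrapper:
    """Hypothetical stand-in; only the locking shape mirrors the PR."""

    def __init__(self):
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self._stub = None
        self.deploy_calls = 0

    async def _deploy(self):
        # Stand-in for the real deploy call: count it and yield once.
        self.deploy_calls += 1
        await asyncio.sleep(0)
        return object()

    async def _ensure_initialized(self):
        if self._initialized:            # fast path: no lock once initialized
            return
        async with self._init_lock:
            if self._initialized:        # re-check: another task may have finished first
                return
            self._stub = await self._deploy()
            self._initialized = True     # set only after the stub is ready


async def run_locked():
    wrapper = LockedWrapper()
    # Five concurrent callers; only the first should deploy.
    await asyncio.gather(*(wrapper._ensure_initialized() for _ in range(5)))
    return wrapper.deploy_calls, wrapper._initialized


if __name__ == "__main__":
    print(asyncio.run(run_locked()))  # prints (1, True)
```

The four later callers queue on the lock, then return at the re-check once the first caller has set the flag.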

Three details that are all correct:

  • asyncio.Lock() in __init__ — safe on Python ≥3.10 (Flash's minimum). In 3.10+, locks bind lazily to the running loop on first await, not at construction time. No DeprecationWarning or RuntimeError.
  • _initialized = True after stub_resource() — if stub creation raises, _initialized stays False and the lock is released by async with. Retry works correctly. This is tested by test_deploy_failure_releases_lock_and_allows_retry.
  • _init_lock not accessible via __getattr__: __getattr__ only fires on missing attributes. Since _init_lock is set in __init__, it's found by normal attribute lookup before __getattr__ is called. No interaction.

2. Question: asyncio.sleep(0.05) timing assumption in concurrency test

test_concurrent_calls_deploy_only_once creates two tasks and sleeps 50ms to let both reach the gate:

task1 = asyncio.create_task(wrapper_instance._ensure_initialized())
task2 = asyncio.create_task(wrapper_instance._ensure_initialized())
await asyncio.sleep(0.05)   # hope both tasks reached gate.wait() by now
gate.set()

If the host is slow (loaded CI runner), task2 may not have reached await gate.wait() before gate.set() fires — task2 then starts after _initialized is already True and the test still passes, but it no longer proves the lock works. The test becomes a timing-sensitive no-op rather than a race proof.

A more reliable pattern uses a counter to confirm both tasks are in-flight before releasing:

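arrived = 0
all_arrived = asyncio.Event()

async def slow_deploy(config):
    nonlocal deploy_call_count
    deploy_call_count += 1
    await gate.wait()
    return MagicMock()

async def tracked_init():
    # Count arrivals at task entry. With the lock in place only one task
    # reaches slow_deploy, so a counter inside it would never hit 2.
    nonlocal arrived
    arrived += 1
    if arrived >= 2:
        all_arrived.set()
    await wrapper_instance._ensure_initialized()

task1 = asyncio.create_task(tracked_init())
task2 = asyncio.create_task(tracked_init())

# Wait until both tasks have started before releasing the gate
await all_arrived.wait()
gate.set()
await asyncio.gather(task1, task2)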

Not blocking — the current test catches the bug reliably on any reasonable machine — but worth knowing for CI robustness.


3. Gap: lock is per-instance, not per resource

If the same resource config is passed to two separate create_remote_class() calls, two RemoteClassWrapper instances are created with two independent _init_lock instances. Concurrent initialization of those two wrappers could still double-deploy at the ResourceManager level. That's out of scope for this PR — but worth confirming: does ResourceManager.get_or_deploy_resource guard against concurrent deploys for the same resource config from different wrapper instances? If not, that's a separate ticket.
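
If a cross-instance guard turns out to be needed, one possible shape is a lock per resource key. The sketch below is purely illustrative: `DeployRegistry`, its string keys, and `_deploy` are hypothetical and say nothing about how `ResourceManager` actually behaves.

```python
import asyncio
from collections import defaultdict


class DeployRegistry:
    """Hypothetical per-resource guard; not ResourceManager's actual behavior."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)  # one lock per resource key
        self._deployed = {}
        self.deploy_calls = 0

    async def _deploy(self, key):
        # Stand-in for a real deployment: count it and yield once.
        self.deploy_calls += 1
        await asyncio.sleep(0)
        return f"stub-for-{key}"

    async def get_or_deploy(self, key):
        if key in self._deployed:                # fast path
            return self._deployed[key]
        async with self._locks[key]:
            if key not in self._deployed:        # re-check under the key's lock
                self._deployed[key] = await self._deploy(key)
            return self._deployed[key]


async def run_registry():
    registry = DeployRegistry()
    results = await asyncio.gather(
        registry.get_or_deploy("gpu-a"),
        registry.get_or_deploy("gpu-a"),  # same key: waits, then reuses the stub
        registry.get_or_deploy("gpu-b"),  # different key: deploys independently
    )
    return registry.deploy_calls, results


if __name__ == "__main__":
    print(asyncio.run(run_registry()))
```

Keying the lock on the resource rather than the wrapper means callers for different resources never block each other, while callers for the same resource serialize regardless of which wrapper instance they came from.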


4. Tests — solid

Four tests covering concurrent deploy-once, flag set correctly, idempotency, and failure-path lock release. The failure-path test (added in the second commit) is the most important correctness guarantee and it's well structured.
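
The failure-path guarantee can be sketched without the real test fixtures. Here `LockedWrapper` and `flaky_deploy` are hypothetical stand-ins. The first deploy raises, and the check is that the flag stays `False`, the lock is released, and a second call retries.

```python
import asyncio


class LockedWrapper:
    """Hypothetical stand-in with an injectable deploy coroutine."""

    def __init__(self, deploy):
        self._deploy = deploy
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self._stub = None

    async def _ensure_initialized(self):
        if self._initialized:
            return
        async with self._init_lock:
            if self._initialized:
                return
            self._stub = await self._deploy()  # may raise; the flag is still False then
            self._initialized = True


async def run_retry():
    call_count = 0

    async def flaky_deploy():
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            raise RuntimeError("transient deploy failure")
        return object()

    wrapper = LockedWrapper(flaky_deploy)
    try:
        await wrapper._ensure_initialized()
    except RuntimeError:
        pass
    # State after the failed attempt: not initialized, lock released by async with.
    failed_state = (wrapper._initialized, wrapper._init_lock.locked())
    await wrapper._ensure_initialized()  # second attempt retries and succeeds
    return failed_state, wrapper._initialized, call_count


if __name__ == "__main__":
    print(asyncio.run(run_retry()))  # prints ((False, False), True, 2)
```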


Verdict

PASS. The fix is correct, uses the right asyncio primitives for Python ≥3.10, and the flag ordering is right. Two asks: (1) acknowledge the timing assumption in the concurrency test or use the counter pattern above, and (2) confirm whether ResourceManager provides the cross-instance guarantee, or file a follow-up ticket if it doesn't.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses AE-2370 by preventing concurrent _ensure_initialized() calls on the same RemoteClassWrapper instance from triggering multiple deployments, using an asyncio.Lock with double-checked locking.

Changes:

  • Add an asyncio.Lock (self._init_lock) to RemoteClassWrapper and guard _ensure_initialized() with double-checked locking.
  • Add bug-probe tests validating single-deploy behavior under concurrent calls and retry behavior after transient deploy failures.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File | Description
src/runpod_flash/execute_class.py | Adds an async init lock and wraps initialization to prevent double deployment under concurrency.
tests/bug_probes/test_class_execution.py | Adds async race-condition regression tests for _ensure_initialized() (including retry-on-failure).
tests/bug_probes/__init__.py | Initializes the new bug_probes test package (empty file).

Comment on lines +10 to +15

"""AE-2370: _ensure_initialized has no async lock — concurrent calls cause double deploy.

Without a lock, two concurrent calls to _ensure_initialized both pass
the `if not self._initialized` check and both call get_or_deploy_resource,
causing a double deploy and orphaning one stub.
"""

Comment on lines +70 to +76

task1 = asyncio.create_task(wrapper_instance._ensure_initialized())
task2 = asyncio.create_task(wrapper_instance._ensure_initialized())

await asyncio.sleep(0.05)
gate.set()

await asyncio.gather(task1, task2)

Comment on lines +115 to +120

@pytest.mark.asyncio
async def test_deploy_failure_releases_lock_and_allows_retry(
    self, wrapper_instance
):
    """If deploy fails, the lock must be released and a subsequent call must retry."""
    call_count = 0

@deanq force-pushed the fix/AE-2370-ensure-initialized-race branch from c5ca393 to b076846 on March 17, 2026 20:00

@runpod-Henrik (Contributor) left a comment

Follow-up on prior review

The four tests are solid — flag-set, idempotency, concurrent deploy-once, and the failure-path retry are all covered.

Two open items from the prior review remain, both non-blocking:

  1. asyncio.sleep(0.05) timing — counter pattern not adopted. The sleep-based approach works in practice but can silently degrade to a no-op on a loaded CI runner where task2 doesn't reach gate.wait() before gate.set() fires. Low risk, but worth knowing.

  2. Cross-instance double-deploy — if two separate create_remote_class() calls produce two RemoteClassWrapper instances for the same resource config, they each have their own _init_lock and could still race at the ResourceManager level. Confirming whether ResourceManager.get_or_deploy_resource guards this case, or filing a follow-up ticket if it doesn't, would close the loop.


Verdict: PASS — fix is correct, flag ordering is right.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq added 2 commits March 19, 2026 11:02
…double deploy

Without a lock, concurrent calls to _ensure_initialized both pass the
check and both call get_or_deploy_resource, wasting resources and
orphaning one stub. Uses double-checked locking: fast-path check before
lock acquisition, second check inside the lock.

Closes AE-2370
- Replace misleading carried-over comment with accurate description
- Add inline comments explaining double-checked locking pattern
- Add failure-path test: deploy exception releases lock, allows retry

@deanq force-pushed the fix/AE-2370-ensure-initialized-race branch from b076846 to 1d97119 on March 19, 2026 18:04
3 participants