
Migration 0101 downgrade fails with out of shared memory due to per-DAG savepoints exhausting PostgreSQL lock table #63535

@vatsrahul1001

Description

Apache Airflow version

main (development)

If "Other Airflow 3 version" selected, which one?

No response

What happened?

The downgrade for migration 0101_3_2_0_ui_improvements_for_deadlines creates a savepoint for every serialized_dag row via conn.begin_nested(), to isolate per-DAG errors while restoring deadline alert data:

for serialized_dag_id, dag_id, data, data_compressed in batch_results:
    savepoint = conn.begin_nested()
    try:
        # ... process DAG ...
        savepoint.commit()
    except Exception as e:
        savepoint.rollback()
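The isolation this pattern is meant to provide can be sketched with a minimal, runnable example. This uses the stdlib sqlite3 driver and raw SAVEPOINT statements rather than Airflow's SQLAlchemy connection, and the table and column names are illustrative only:

```python
import sqlite3

# Minimal sketch of the per-row savepoint pattern: a failing row is rolled
# back to its own savepoint while the surrounding transaction keeps the rest.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE dag (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

conn.execute("BEGIN")
for dag_id, name in [(1, "ok"), (2, None), (3, "also ok")]:  # row 2 violates NOT NULL
    conn.execute("SAVEPOINT per_dag")
    try:
        conn.execute("INSERT INTO dag (id, name) VALUES (?, ?)", (dag_id, name))
        conn.execute("RELEASE SAVEPOINT per_dag")
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK TO SAVEPOINT per_dag")  # discard only this row
        conn.execute("RELEASE SAVEPOINT per_dag")
conn.execute("COMMIT")

count = conn.execute("SELECT COUNT(*) FROM dag").fetchone()[0]  # 2 rows survive
```

On PostgreSQL the same per-row isolation works, but, as described below, the lock entries each savepoint acquires survive its RELEASE until the outer COMMIT.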

On PostgreSQL, locks acquired within a savepoint are not released when the savepoint is committed — they persist in the parent transaction until the outer COMMIT. Since the entire migration runs in a single Alembic transaction, lock entries accumulate across all DAGs.
PostgreSQL's shared lock table holds max_locks_per_transaction × (max_connections + max_prepared_transactions) entries (defaults: 64 × (100 + 0) = 6,400). Each savepoint acquires multiple lock entries (relation-level locks on serialized_dag and deadline_alert, plus the savepoint's virtual transaction ID). With thousands of DAGs the lock table overflows.

Observed behavior:
With 100K serialized_dag entries the migration failed after processing ~20K DAGs (batch 2 of 11):

psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
CONTEXT:  parallel worker

This was followed by a secondary RecursionError as SQLAlchemy tried to unwind 970+ nested savepoint objects via recursive _cancel() calls during rollback:
RecursionError: maximum recursion depth exceeded
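The secondary failure can be illustrated with a toy model of a recursively unwound savepoint chain. The class below is purely illustrative (it is not SQLAlchemy's actual implementation); it just shows that unwinding ~2,000 nested objects recursively blows past CPython's default recursion limit of 1,000 frames:

```python
class FakeSavepoint:
    """Toy stand-in for a nested savepoint that cancels its parent recursively."""

    def __init__(self, parent=None):
        self.parent = parent

    def _cancel(self):
        # Each level adds a stack frame, so a long chain exhausts the limit.
        if self.parent is not None:
            self.parent._cancel()

# Build a chain of 2,000 nested savepoints, then unwind from the innermost.
sp = None
for _ in range(2000):
    sp = FakeSavepoint(sp)

try:
    sp._cancel()
    hit_limit = False
except RecursionError:
    hit_limit = True  # default recursion limit (1,000) is exceeded
```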

Deployments with more than ~1K–2K DAGs will likely hit this on default PostgreSQL settings. The exact threshold depends on how many lock entries each savepoint consumes and the configured max_locks_per_transaction and max_connections.
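The ~1K–2K figure follows from back-of-the-envelope arithmetic. Note the 3–5 lock entries per savepoint used below is an assumed range (consistent with the two relation locks plus the virtual transaction ID mentioned above), not a measured value:

```python
# Shared lock table capacity, per the PostgreSQL defaults cited above.
max_locks_per_transaction = 64   # PostgreSQL default
max_connections = 100            # PostgreSQL default
lock_table_capacity = max_locks_per_transaction * max_connections  # 6,400 entries

# Assumed lock entries consumed per per-DAG savepoint (illustrative range).
locks_per_savepoint_low, locks_per_savepoint_high = 3, 5

# Estimated number of DAGs processable before the lock table overflows.
threshold_high = lock_table_capacity // locks_per_savepoint_low   # ~2,133 DAGs
threshold_low = lock_table_capacity // locks_per_savepoint_high   # 1,280 DAGs
```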

What you think should happen instead?

No response

How to reproduce

Steps to reproduce:

  • Start Airflow on 3.1.8 with PostgreSQL backend (default max_locks_per_transaction = 64)
  • Create 2K+ DAGs with deadline alert data, so that serialized_dag contains 2K+ rows
  • Upgrade to 3.2.0 (airflow db migrate)
  • Run airflow db downgrade -n 3.1.8
  • Migration 0101 fails with out of shared memory error

Operating System

macOS

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
