Skip to content

feat(cli): add reingest command for re-ingesting execution results#610

Open
lewisjared wants to merge 14 commits intomainfrom
feat/reingest-enum-mode
Open

feat(cli): add reingest command for re-ingesting execution results#610
lewisjared wants to merge 14 commits intomainfrom
feat/reingest-enum-mode

Conversation

@lewisjared
Copy link
Copy Markdown
Contributor

@lewisjared lewisjared commented Mar 31, 2026

Description

Add ref executions reingest command that re-runs build_execution_result() on existing output files and re-ingests the results into the database without re-executing diagnostics. This is useful when new series definitions or metadata extraction logic has been added.

Key features:

  • Three modes via ReingestMode enum: additive (keep existing, add new), replace (delete and re-ingest), versioned (new execution record)
  • Enum-based --mode option for better UX (tab completion, validated choices shown in help)
  • Safety guards: requires at least one filter, confirmation prompt, dry-run support
  • Scratch directory pattern: copies results to scratch before re-extraction, cleans up via try/finally
  • Savepoint-based rollback: DB mutations happen in a single savepoint so failures don't corrupt state
  • Path traversal safety: validates all computed paths stay within expected directories
  • Versioned re-run support: repeated versioned reingests produce unique output fragments (_v{hash}_n{count})
  • Failed execution recovery: mark_successful is called for all modes, allowing --include-failed to recover failed executions

Changes

  • packages/climate-ref/src/climate_ref/executor/reingest.py -- new module with reingest logic, mode enum, execution definition reconstruction, and metric ingestion
  • packages/climate-ref/src/climate_ref/cli/executions.py -- new reingest CLI command with enum-based mode, filters, dry-run, and confirmation
  • packages/climate-ref/src/climate_ref/executor/__init__.py -- no eager reingest import (avoids breaking imports-without-extras)
  • packages/climate-ref/tests/unit/executor/test_reingest.py -- 22 tests covering all modes, idempotency, rollback, dirty flag preservation, scratch cleanup, versioned re-runs, and failed execution recovery

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Add `ref executions reingest` command that re-runs build_execution_result()
on existing outputs without re-executing diagnostics. Uses ReingestMode enum
for the --mode option, providing tab completion and validated choices
(additive, replace, versioned) via Typer's native enum support.
…d scratch cleanup

Collapse four near-identical metric ingestion functions into two by adding
an optional `existing` parameter for additive dedup. Hoist the dimension
query into `_ingest_metrics` to avoid duplicate DB queries. Fix double
iteration of `iter_results()` generator. Add try/finally scratch directory
cleanup to prevent disk accumulation during batch reingests. Remove
redundant dirty-flag save/restore in CLI.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “reingest” workflow to the Climate REF CLI/executor layer to re-run build_execution_result() against existing on-disk outputs and ingest updated metrics/metadata into the DB without re-executing diagnostics.

Changes:

  • Introduces climate_ref.executor.reingest with ReingestMode and reingest/query helpers.
  • Adds ref executions reingest CLI command with safety guards (filters required, confirm prompt, dry-run) and mode selection.
  • Adds unit tests for reingest modes and filtering, plus a changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
packages/climate-ref/src/climate_ref/executor/reingest.py Implements reingest logic, scratch copying, savepoint-based DB mutation, and execution querying.
packages/climate-ref/src/climate_ref/cli/executions.py Adds ref executions reingest command wiring and UX/safety features.
packages/climate-ref/src/climate_ref/executor/init.py Exposes ReingestMode and reingest_execution in executor exports.
packages/climate-ref/tests/unit/executor/test_reingest.py New unit tests covering reingest behavior across modes and filters.
changelog/610.feature.md Documents the new CLI command and modes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
core ?
providers 91.85% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.
see 81 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Fix ensure_relative_path to fall back to scratch base when output
  bundle contains absolute paths under scratch directory
- Always clear ExecutionOutput rows before re-inserting in all modes
  (fixes duplicate outputs in additive mode)
- Add _n suffix for repeated versioned reingests on the same execution
- Mark execution as successful for all modes, not just versioned
- Add path traversal safety checks in copy operations
- Document single-threaded scratch directory assumption
- Fix test to use ResultOutputType enum instead of raw string
- Fix comment wording (session -> transaction)
- Add tests for versioned re-run uniqueness, mark_successful on
  failed executions, and scratch directory cleanup
The reingest module imports pandas, CMECMetric, and other heavy
dependencies that are not available when the package is installed
without extras. The CLI imports directly from
climate_ref.executor.reingest, so the re-export is unnecessary.
Add tests for previously uncovered helper functions:
- _copy_results_to_scratch: path traversal rejection and directory copy
- _get_existing_metric_dimensions: empty and populated cases
- _handle_reingest_output_bundle: output registration with real data
- _ingest_metrics: scalar ingestion and additive dedup with real CMEC bundles
- reconstruct_execution_definition: with linked CMIP6 datasets

Add output_dir_with_data fixture with valid 2-level nested CMECMetric
JSON and output bundles with plot entries.
Extract mock_result_factory fixture and _patch_build_result helper to
eliminate 11 repetitions of 8-line mock ExecutionResult boilerplate.
Consolidate shared TSeries data into SAMPLE_SERIES constant and
results dir setup into _create_results_dir helper. Add autouse fixture
for provider registration in TestGetExecutionsForReingest.
Add 22 new tests covering the reingest CLI command and additional
reingest module paths. Fix transaction handling bug in the CLI
reingest command where db.session.begin() failed when a transaction
was already active.

Make --mode a required option so the user must explicitly choose
the reingest strategy (additive, replace, or versioned).

CLI tests (10): no-filter guard, no-results, dry-run, cancellation,
force mode, group IDs, mode parametrized, include-failed, success
path, missing mode error.

Unit tests (13): unsuccessful result, scratch fallback, copy helpers,
series ingestion, provider/diagnostic filters, empty datasets.
…n logic

Extract shared functions (ingest_scalar_values, ingest_series_values,
register_execution_outputs) into result_handling.py and have reingest.py
import them instead of maintaining parallel implementations.
…uivalence tests

Both handle_execution_result and reingest now delegate to a single
ingest_execution_result() function for output registration and metric
ingestion, preventing the two paths from drifting.

Adds unit test verifying versioned reingest produces equivalent DB state
to fresh ingestion, and integration tests that solve with the example
provider then reingest in all three modes (additive, replace, versioned).
@lewisjared lewisjared requested a review from Copilot April 3, 2026 11:30
The '--mode' text in typer's error output includes ANSI bold escape
codes around the dashes, causing the exact string match to fail in CI.
Split into two assertions that match the key parts independently.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

packages/climate-ref/src/climate_ref/executor/reingest.py:502

  • For non-versioned modes, this function copies into original_results_dir but never validates that original_results_dir stays within config.paths.results. The only safety check is for target_results_dir, which isn't the destination used in the else-branch. Add the same (resolved) containment validation for original_results_dir before copying.
    packages/climate-ref/src/climate_ref/executor/reingest.py:559
  • _copy_results_to_scratch() can raise (e.g., ValueError for unsafe fragments). reingest_execution() calls it outside of any try/except, so a single malformed Execution.output_fragment can abort the entire CLI run rather than being treated as a skipped execution. Consider catching exceptions from _copy_results_to_scratch (and from the subsequent filesystem copy-back) and returning False after logging, consistent with other skip paths.
    packages/climate-ref/src/climate_ref/executor/reingest.py:603
  • The results tree is updated on disk before the surrounding DB transaction is committed (the CLI commits after reingest_execution returns). If the outer commit fails/rolls back for any reason, DB state and filesystem state can diverge. To keep them consistent, consider moving the copy-back to the caller after a successful commit (e.g., return the scratch_dir/target_execution and copy after the transaction) or committing within reingest_execution before mutating the results directory.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +380 to +386
return False

cv = CV.load_from_file(config.paths.dimensions_cv)

try:
target_execution = _apply_reingest_mode(
config=config,
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path-traversal guard here relies on Path.is_relative_to() for src/dst. This check is purely lexical and does not eliminate .. segments or resolve symlinks, so an output_fragment like ../outside can still pass while escaping the base when used by shutil. Consider validating using resolved paths (e.g., (base / fragment).resolve().is_relative_to(base.resolve())) and/or explicitly rejecting absolute paths and any .. parts before calling rmtree/copytree.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants