feat(cli): add reingest command for re-ingesting execution results#610
feat(cli): add reingest command for re-ingesting execution results#610lewisjared wants to merge 14 commits intomainfrom
Conversation
Add `ref executions reingest` command that re-runs build_execution_result() on existing outputs without re-executing diagnostics. Uses ReingestMode enum for the --mode option, providing tab completion and validated choices (additive, replace, versioned) via Typer's native enum support.
…d scratch cleanup Collapse four near-identical metric ingestion functions into two by adding an optional `existing` parameter for additive dedup. Hoist the dimension query into `_ingest_metrics` to avoid duplicate DB queries. Fix double iteration of `iter_results()` generator. Add try/finally scratch directory cleanup to prevent disk accumulation during batch reingests. Remove redundant dirty-flag save/restore in CLI.
There was a problem hiding this comment.
Pull request overview
Adds a new “reingest” workflow to the Climate REF CLI/executor layer to re-run build_execution_result() against existing on-disk outputs and ingest updated metrics/metadata into the DB without re-executing diagnostics.
Changes:
- Introduces
climate_ref.executor.reingestwithReingestModeand reingest/query helpers. - Adds
ref executions reingestCLI command with safety guards (filters required, confirm prompt, dry-run) and mode selection. - Adds unit tests for reingest modes and filtering, plus a changelog entry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/climate-ref/src/climate_ref/executor/reingest.py | Implements reingest logic, scratch copying, savepoint-based DB mutation, and execution querying. |
| packages/climate-ref/src/climate_ref/cli/executions.py | Adds ref executions reingest command wiring and UX/safety features. |
| packages/climate-ref/src/climate_ref/executor/init.py | Exposes ReingestMode and reingest_execution in executor exports. |
| packages/climate-ref/tests/unit/executor/test_reingest.py | New unit tests covering reingest behavior across modes and filters. |
| changelog/610.feature.md | Documents the new CLI command and modes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests.
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
- Fix ensure_relative_path to fall back to scratch base when output bundle contains absolute paths under scratch directory - Always clear ExecutionOutput rows before re-inserting in all modes (fixes duplicate outputs in additive mode) - Add _n suffix for repeated versioned reingests on the same execution - Mark execution as successful for all modes, not just versioned - Add path traversal safety checks in copy operations - Document single-threaded scratch directory assumption - Fix test to use ResultOutputType enum instead of raw string - Fix comment wording (session -> transaction) - Add tests for versioned re-run uniqueness, mark_successful on failed executions, and scratch directory cleanup
The reingest module imports pandas, CMECMetric, and other heavy dependencies that are not available when the package is installed without extras. The CLI imports directly from climate_ref.executor.reingest, so the re-export is unnecessary.
Add tests for previously uncovered helper functions: - _copy_results_to_scratch: path traversal rejection and directory copy - _get_existing_metric_dimensions: empty and populated cases - _handle_reingest_output_bundle: output registration with real data - _ingest_metrics: scalar ingestion and additive dedup with real CMEC bundles - reconstruct_execution_definition: with linked CMIP6 datasets Add output_dir_with_data fixture with valid 2-level nested CMECMetric JSON and output bundles with plot entries.
Extract mock_result_factory fixture and _patch_build_result helper to eliminate 11 repetitions of 8-line mock ExecutionResult boilerplate. Consolidate shared TSeries data into SAMPLE_SERIES constant and results dir setup into _create_results_dir helper. Add autouse fixture for provider registration in TestGetExecutionsForReingest.
Add 22 new tests covering the reingest CLI command and additional reingest module paths. Fix transaction handling bug in the CLI reingest command where db.session.begin() failed when a transaction was already active. Make --mode a required option so the user must explicitly choose the reingest strategy (additive, replace, or versioned). CLI tests (10): no-filter guard, no-results, dry-run, cancellation, force mode, group IDs, mode parametrized, include-failed, success path, missing mode error. Unit tests (13): unsuccessful result, scratch fallback, copy helpers, series ingestion, provider/diagnostic filters, empty datasets.
…n logic Extract shared functions (ingest_scalar_values, ingest_series_values, register_execution_outputs) into result_handling.py and have reingest.py import them instead of maintaining parallel implementations.
…uivalence tests Both handle_execution_result and reingest now delegate to a single ingest_execution_result() function for output registration and metric ingestion, preventing the two paths from drifting. Adds unit test verifying versioned reingest produces equivalent DB state to fresh ingestion, and integration tests that solve with the example provider then reingest in all three modes (additive, replace, versioned).
The '--mode' text in typer's error output includes ANSI bold escape codes around the dashes, causing the exact string match to fail in CI. Split into two assertions that match the key parts independently.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
packages/climate-ref/src/climate_ref/executor/reingest.py:502
- For non-versioned modes, this function copies into
original_results_dirbut never validates thatoriginal_results_dirstays withinconfig.paths.results. The only safety check is fortarget_results_dir, which isn't the destination used in the else-branch. Add the same (resolved) containment validation fororiginal_results_dirbefore copying.
packages/climate-ref/src/climate_ref/executor/reingest.py:559 - _copy_results_to_scratch() can raise (e.g., ValueError for unsafe fragments). reingest_execution() calls it outside of any try/except, so a single malformed Execution.output_fragment can abort the entire CLI run rather than being treated as a skipped execution. Consider catching exceptions from _copy_results_to_scratch (and from the subsequent filesystem copy-back) and returning False after logging, consistent with other skip paths.
packages/climate-ref/src/climate_ref/executor/reingest.py:603 - The results tree is updated on disk before the surrounding DB transaction is committed (the CLI commits after reingest_execution returns). If the outer commit fails/rolls back for any reason, DB state and filesystem state can diverge. To keep them consistent, consider moving the copy-back to the caller after a successful commit (e.g., return the scratch_dir/target_execution and copy after the transaction) or committing within reingest_execution before mutating the results directory.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return False | ||
|
|
||
| cv = CV.load_from_file(config.paths.dimensions_cv) | ||
|
|
||
| try: | ||
| target_execution = _apply_reingest_mode( | ||
| config=config, |
There was a problem hiding this comment.
The path-traversal guard here relies on Path.is_relative_to() for src/dst. This check is purely lexical and does not eliminate .. segments or resolve symlinks, so an output_fragment like ../outside can still pass while escaping the base when used by shutil. Consider validating using resolved paths (e.g., (base / fragment).resolve().is_relative_to(base.resolve())) and/or explicitly rejecting absolute paths and any .. parts before calling rmtree/copytree.
Description
Add
ref executions reingestcommand that re-runsbuild_execution_result()on existing output files and re-ingests the results into the database without re-executing diagnostics. This is useful when new series definitions or metadata extraction logic has been added.Key features:
ReingestModeenum:additive(keep existing, add new),replace(delete and re-ingest),versioned(new execution record)--modeoption for better UX (tab completion, validated choices shown in help)_v{hash}_n{count})mark_successfulis called for all modes, allowing--include-failedto recover failed executionsChanges
packages/climate-ref/src/climate_ref/executor/reingest.py-- new module with reingest logic, mode enum, execution definition reconstruction, and metric ingestionpackages/climate-ref/src/climate_ref/cli/executions.py-- newreingestCLI command with enum-based mode, filters, dry-run, and confirmationpackages/climate-ref/src/climate_ref/executor/__init__.py-- no eager reingest import (avoids breaking imports-without-extras)packages/climate-ref/tests/unit/executor/test_reingest.py-- 22 tests covering all modes, idempotency, rollback, dirty flag preservation, scratch cleanup, versioned re-runs, and failed execution recoveryChecklist
Please confirm that this pull request has done the following:
changelog/