tests: Replace massive dataset caches with deterministic hashes #282
ParticularlyPythonicBS wants to merge 1 commit into unstable from
Conversation
Walkthrough
Replaced element-level set comparisons with deterministic SHA-256 hashes in tests and a capture utility. The capture script now auto-confirms, uses updated testing config paths, instantiates the sequencer in BUILD_ONLY mode, and excludes components with names containing …
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/utilities/capture_set_values_for_cache.py (1)
20-32: 🧹 Nitpick | 🔵 Trivial
Remove or update the misleading warning and dead code.
The warning message (lines 20-28) still reads as if prompting for user input, but the script now runs automatically. Lines 31-32 are dead code since `t = 'Y'` always passes the check. Consider either:
- Removing the warning and dead check entirely if automation is intended
- Or updating the warning to reflect automatic execution
♻️ Option 1: Remove dead code and update warning
```diff
 print(
-    'WARNING: Continuing to execute this file will '
-    'update the cached values in the testing_data folder'
-    'from the sqlite databases in the same folder. '
-    'This should only need to be done if the schema or'
-    'model have changed and that database has been updated.'
-    '\nRunning this basically resets the expected value sets'
-    'for Utopia, TestSystem, and Mediumville'
+    'INFO: Updating cached set hashes in testing_data folder '
+    'from the sqlite databases. This resets expected value sets '
+    'for Utopia, TestSystem, and Mediumville.'
 )
-
-t = 'Y'  # automated run
-if t not in {'y', 'Y'}:
-    sys.exit(0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utilities/capture_set_values_for_cache.py` around lines 20 - 32, The printed multi-line warning and the dead variable check (the print(...) block and the variable t = 'Y' plus the if t not in {'y','Y'}: sys.exit(0)) are misleading because the script runs automatically; either delete the entire warning and the t/check lines to reflect automation, or replace the warning text with a concise message that clearly states the script will run non-interactively and remove the unreachable conditional; locate and update the exact print(...) block and the variable t assignment to implement the chosen change.

tests/test_set_consistency.py (1)
110-113: 🧹 Nitpick | 🔵 Trivial
Consider using consistent assertion style.
Lines 110-113 use `if x: assert False` (requiring `noqa B011`) while lines 106-109 use the cleaner `assert not x` pattern. Using a consistent style improves readability.
♻️ Proposed fix
```diff
-if cache_extra_sets:
-    assert False, 'Cache has extra sets'  # noqa B011
-if model_extra_sets:
-    assert False, 'Model has extra sets'  # noqa B011
+assert not cache_extra_sets, f'Cache has extra sets: {cache_extra_sets}'
+assert not model_extra_sets, f'Model has extra sets: {model_extra_sets}'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_set_consistency.py` around lines 110 - 113, Replace the conditional-assert pattern with direct assertions for consistency: change the "if cache_extra_sets: assert False, 'Cache has extra sets'" and "if model_extra_sets: assert False, 'Model has extra sets'" to "assert not cache_extra_sets, 'Cache has extra sets'" and "assert not model_extra_sets, 'Model has extra sets'" respectively (update the assertions near variables cache_extra_sets and model_extra_sets and remove the unnecessary "# noqa B011").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_set_consistency.py`:
- Around line 75-78: The two set-difference computations are duplicated: replace
uses of cache_extra_sets with the already-filtered missing_in_model (computed as
cached_sets.keys() - model_sets.keys() then filtered for '_index' and '_domain')
and delete the redundant cache_extra_sets definition; update the assertions that
currently check both variables (the ones around the earlier block and lines that
repeat the same check) to use missing_in_model only so the logic is consolidated
and the duplicate computation/assertion is removed.
- Around line 64-66: The empty special-case in the loop over model_sets (the `if
set_name == 'cost_emission_rpe': pass` block in tests/test_set_consistency.py)
is dead code — either remove that conditional entirely or replace it with the
intended behavior (e.g., skip this set explicitly with `continue`, or add the
specific assertions/handling that were intended for `cost_emission_rpe`); update
the loop that iterates `for set_name, s in model_sets.items():` accordingly and
ensure any referenced variables (`set_name`, `s`, `model_sets`) remain used or
removed to avoid linter/test warnings.
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 62-68: The hash_set() function is duplicated across tests; extract
it into a shared test utility module (e.g., create a
tests/utilities/hash_utils.py with hash_set), update the current file's hash_set
call to import hash_set from that module, and remove the duplicate
implementation from tests/test_set_consistency.py so both tests import and use
the single shared hash_set function to ensure consistency and maintainability.
- Around line 62-68: The helper function hash_set is currently defined inside
the for scenario in scenarios: loop causing it to be redefined each iteration;
move the entire hash_set(s: Any) definition to module level before the loop so
it is defined once and reused, preserving its current behavior
(sorted_elements/TypeError fallback, JSON encode and sha256) and update any
references inside the loop to call hash_set as before.
---
Outside diff comments:
In `@tests/test_set_consistency.py`:
- Around line 110-113: Replace the conditional-assert pattern with direct
assertions for consistency: change the "if cache_extra_sets: assert False,
'Cache has extra sets'" and "if model_extra_sets: assert False, 'Model has extra
sets'" to "assert not cache_extra_sets, 'Cache has extra sets'" and "assert not
model_extra_sets, 'Model has extra sets'" respectively (update the assertions
near variables cache_extra_sets and model_extra_sets and remove the unnecessary
"# noqa B011").
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 20-32: The printed multi-line warning and the dead variable check
(the print(...) block and the variable t = 'Y' plus the if t not in {'y','Y'}:
sys.exit(0)) are misleading because the script runs automatically; either delete
the entire warning and the t/check lines to reflect automation, or replace the
warning text with a concise message that clearly states the script will run
non-interactively and remove the unreachable conditional; locate and update the
exact print(...) block and the variable t assignment to implement the chosen
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2da66896-5ba2-47f7-b8c1-a53cc505a224
📒 Files selected for processing (5)
tests/test_set_consistency.py
tests/testing_data/mediumville_sets.json
tests/testing_data/test_system_sets.json
tests/testing_data/utopia_sets.json
tests/utilities/capture_set_values_for_cache.py
```python
for set_name, s in model_sets.items():
    if set_name == 'cost_emission_rpe':
        pass
```
Remove dead code or complete the implementation.
The if set_name == 'cost_emission_rpe': pass block does nothing. This appears to be leftover debug code or an incomplete special-case handler. Either remove it or implement the intended logic.
🧹 Proposed fix
```diff
 for set_name, s in model_sets.items():
-    if set_name == 'cost_emission_rpe':
-        pass
     if '_index' in set_name or '_domain' in set_name:
         continue
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
for set_name, s in model_sets.items():
    if set_name == 'cost_emission_rpe':
        pass
```

After:

```python
for set_name, s in model_sets.items():
    if '_index' in set_name or '_domain' in set_name:
        continue
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_set_consistency.py` around lines 64 - 66, The empty special-case
in the loop over model_sets (the `if set_name == 'cost_emission_rpe': pass`
block in tests/test_set_consistency.py) is dead code — either remove that
conditional entirely or replace it with the intended behavior (e.g., skip this
set explicitly with `continue`, or add the specific assertions/handling that
were intended for `cost_emission_rpe`); update the loop that iterates `for
set_name, s in model_sets.items():` accordingly and ensure any referenced
variables (`set_name`, `s`, `model_sets`) remain used or removed to avoid
linter/test warnings.
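As a sanity check of the skip pattern the review suggests, here is a minimal standalone sketch; the set names and contents below are made up for illustration and are not from the Temoa model:

```python
# Hypothetical sketch of the suggested skip logic: set names containing
# '_index' or '_domain' are pyomo-internal and excluded from comparison.
model_sets = {
    'tech_all': {'coal', 'solar'},
    'tech_all_index': {0, 1},       # pyomo-internal, should be skipped
    'commodity_domain': {'c1'},     # pyomo-internal, should be skipped
}

kept = {}
for set_name, s in model_sets.items():
    if '_index' in set_name or '_domain' in set_name:
        continue
    kept[set_name] = s

print(sorted(kept))  # → ['tech_all']
```

Using `continue` makes the intent explicit, unlike the dead `pass` branch flagged above.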
```python
missing_in_model = cached_sets.keys() - model_sets.keys()
# drop any set that has "_index" in the name as they are no longer reported by newer version of
# pyomo
missing_in_model = {s for s in missing_in_model if '_index' not in s and '_domain' not in s}
```
🧹 Nitpick | 🔵 Trivial
missing_in_model and cache_extra_sets are identical after filtering.
Both compute cached_sets.keys() - model_sets.keys() with the same '_index'/'_domain' filter applied. This results in duplicate computation and redundant assertions (lines 106 and 110-111 check the same condition).
♻️ Proposed consolidation

```diff
 missing_in_model = cached_sets.keys() - model_sets.keys()
 # drop any set that has "_index" in the name as they are no longer reported by newer version of
 # pyomo
 missing_in_model = {s for s in missing_in_model if '_index' not in s and '_domain' not in s}
 ...
-cache_extra_sets = {
-    k
-    for k in cached_sets.keys() - model_sets.keys()
-    if '_index' not in k and '_domain' not in k
-}
 ...
 if cache_extra_sets:
     print('\nCache extra sets compared to model: ')
 ...
-assert not missing_in_model, f'one or more cached set not in model: {missing_in_model}'
 ...
-if cache_extra_sets:
-    assert False, 'Cache has extra sets'  # noqa B011
+assert not missing_in_model, f'Cache has extra sets not in model: {missing_in_model}'
```

Use missing_in_model throughout and remove cache_extra_sets.
Also applies to: 91-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_set_consistency.py` around lines 75 - 78, The two set-difference
computations are duplicated: replace uses of cache_extra_sets with the
already-filtered missing_in_model (computed as cached_sets.keys() -
model_sets.keys() then filtered for '_index' and '_domain') and delete the
redundant cache_extra_sets definition; update the assertions that currently
check both variables (the ones around the earlier block and lines that repeat
the same check) to use missing_in_model only so the logic is consolidated and
the duplicate computation/assertion is removed.
```python
def hash_set(s: Any) -> str:
    try:
        sorted_elements = sorted(s)
    except TypeError:
        sorted_elements = sorted([str(e) for e in s])
    s_bytes = json.dumps(sorted_elements).encode('utf-8')
    return hashlib.sha256(s_bytes).hexdigest()
```
🧹 Nitpick | 🔵 Trivial
Consider extracting hash_set() to a shared test utility.
This function is duplicated verbatim in tests/test_set_consistency.py (lines 49-55). Extracting it to a shared module (e.g., tests/utilities/hash_utils.py) would ensure consistency and simplify maintenance if the hashing logic needs to change.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 62-62: Dynamically typed expressions (typing.Any) are disallowed in s
(ANN401)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 62 - 68, The
hash_set() function is duplicated across tests; extract it into a shared test
utility module (e.g., create a tests/utilities/hash_utils.py with hash_set),
update the current file's hash_set call to import hash_set from that module, and
remove the duplicate implementation from tests/test_set_consistency.py so both
tests import and use the single shared hash_set function to ensure consistency
and maintainability.
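To illustrate why this helper is worth extracting once, here it is as quoted in the review, with a quick check that the digest is order-independent; the demo values are hypothetical:

```python
import hashlib
import json
from typing import Any


def hash_set(s: Any) -> str:
    """Deterministic SHA-256 digest of a set's elements (as quoted in the review)."""
    try:
        sorted_elements = sorted(s)
    except TypeError:
        # Mixed or unorderable element types: fall back to a string sort
        sorted_elements = sorted([str(e) for e in s])
    s_bytes = json.dumps(sorted_elements).encode('utf-8')
    return hashlib.sha256(s_bytes).hexdigest()


# Same elements, different construction order -> identical digest
print(hash_set({'b', 'a', 'c'}) == hash_set({'c', 'a', 'b'}))  # True
```

Sorting before serializing is what makes the digest independent of set iteration order, which is the whole point of caching hashes instead of element lists.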
🧹 Nitpick | 🔵 Trivial
Move hash_set() outside the loop.
The function is currently defined inside the for scenario in scenarios: loop (line 54), causing it to be recreated on every iteration. Move it to module level before the loop for clarity and to avoid unnecessary redefinition.
♻️ Proposed fix
Move the function definition before the loop:

```diff
+def hash_set(s: Any) -> str:
+    try:
+        sorted_elements = sorted(s)
+    except TypeError:
+        sorted_elements = sorted([str(e) for e in s])
+    s_bytes = json.dumps(sorted_elements).encode('utf-8')
+    return hashlib.sha256(s_bytes).hexdigest()
+
+
 for scenario in scenarios:
     config = TemoaConfig.build_config(
         config_file=scenario['config_file'], output_path=output_path, silent=True
     )
     ts = TemoaSequencer(config=config, mode_override=TemoaMode.BUILD_ONLY)
     built_instance = ts.build_model()  # catch the built model
-    def hash_set(s: Any) -> str:
-        try:
-            sorted_elements = sorted(s)
-        except TypeError:
-            sorted_elements = sorted([str(e) for e in s])
-        s_bytes = json.dumps(sorted_elements).encode('utf-8')
-        return hashlib.sha256(s_bytes).hexdigest()
-
     model_sets = built_instance.component_map(ctype=pyo.Set)
```

🧰 Tools
🪛 Ruff (0.15.6)
[warning] 62-62: Dynamically typed expressions (typing.Any) are disallowed in s
(ANN401)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 62 - 68, The
helper function hash_set is currently defined inside the for scenario in
scenarios: loop causing it to be redefined each iteration; move the entire
hash_set(s: Any) definition to module level before the loop so it is defined
once and reused, preserving its current behavior (sorted_elements/TypeError
fallback, JSON encode and sha256) and update any references inside the loop to
call hash_set as before.
Includes fix to capture script so it uses the standard test configs
0094585 to afd4aee
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/utilities/capture_set_values_for_cache.py (1)
76-77: 🧹 Nitpick | 🔵 Trivial
Sort the cache keys when writing the JSON fixture.
The hashes are deterministic, but the file order still depends on `component_map()` iteration order. `sort_keys=True` will keep regenerated caches stable and avoid noisy diffs when component ordering changes.
♻️ Proposed change

```diff
-json.dump(sets_dict, f_out, indent=2)
+json.dump(sets_dict, f_out, indent=2, sort_keys=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utilities/capture_set_values_for_cache.py` around lines 76 - 77, When writing the JSON fixture, ensure the keys are sorted to produce stable output by adding sort_keys=True to the json.dump call that writes sets_dict to scenario['output_file']; update the json.dump(sets_dict, f_out, indent=2) invocation to include sort_keys=True so regenerated caches remain stable regardless of component_map() iteration order.
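A minimal sketch of the effect of `sort_keys=True`, using toy dicts rather than the real fixture data: two dicts holding the same entries in different insertion orders serialize to byte-identical JSON.

```python
import json

# Toy stand-ins for the sets_dict the capture script writes out
a = {'utopia': 'abc123', 'mediumville': 'def456'}
b = {'mediumville': 'def456', 'utopia': 'abc123'}

# sort_keys=True removes insertion order from the serialized output
stable_a = json.dumps(a, indent=2, sort_keys=True)
stable_b = json.dumps(b, indent=2, sort_keys=True)
print(stable_a == stable_b)  # True
```

Without `sort_keys=True` the two dumps would differ whenever insertion order differs, producing exactly the noisy fixture diffs the comment warns about.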
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 30-32: The script currently hardcodes t = 'Y' which forces an
automated "yes" and immediately overwrites cached fixtures; change the flow in
the top-level confirmation block (the variable t and the if t not in {'y','Y'}:
sys.exit(0) check) to require an explicit opt-in: read a prompt from input()
unless a CLI flag like --yes is present (parse args or check os.environ), and
only set t = 'Y' when that flag is supplied; ensure the existing sys.exit(0)
behavior remains for non-consent cases and keep the set {'y','Y'} check for user
confirmation.
---
Outside diff comments:
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 76-77: When writing the JSON fixture, ensure the keys are sorted
to produce stable output by adding sort_keys=True to the json.dump call that
writes sets_dict to scenario['output_file']; update the json.dump(sets_dict,
f_out, indent=2) invocation to include sort_keys=True so regenerated caches
remain stable regardless of component_map() iteration order.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d4fed7e6-ae2c-42ce-827d-a21652dc1179
📒 Files selected for processing (5)
tests/test_set_consistency.py
tests/testing_data/mediumville_sets.json
tests/testing_data/test_system_sets.json
tests/testing_data/utopia_sets.json
tests/utilities/capture_set_values_for_cache.py
```python
t = 'Y'  # automated run
if t not in {'y', 'Y'}:
    sys.exit(0)
```
Restore an explicit opt-in before rewriting the cached fixtures.
Line 30 hardcodes the confirmation to "Y", so any invocation immediately refreshes the databases and overwrites all three checked-in cache files. Please keep a prompt or --yes flag here so accidental runs do not silently rewrite expected test data.
💡 Minimal direction
```diff
-t = 'Y'  # automated run
-if t not in {'y', 'Y'}:
+if '--yes' not in sys.argv and input('Continue and overwrite cached set hashes? [y/N] ') not in {'y', 'Y'}:
     sys.exit(0)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
t = 'Y'  # automated run
if t not in {'y', 'Y'}:
    sys.exit(0)
```

After:

```python
if '--yes' not in sys.argv and input('Continue and overwrite cached set hashes? [y/N] ') not in {'y', 'Y'}:
    sys.exit(0)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 30 - 32, The
script currently hardcodes t = 'Y' which forces an automated "yes" and
immediately overwrites cached fixtures; change the flow in the top-level
confirmation block (the variable t and the if t not in {'y','Y'}: sys.exit(0)
check) to require an explicit opt-in: read a prompt from input() unless a CLI
flag like --yes is present (parse args or check os.environ), and only set t =
'Y' when that flag is supplied; ensure the existing sys.exit(0) behavior remains
for non-consent cases and keep the set {'y','Y'} check for user confirmation.
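One way to sketch the requested opt-in guard; the function name, signature, and prompt text below are illustrative, not from the PR:

```python
def confirmed(argv, ask=input):
    """Return True only when the user explicitly opted in.

    Hypothetical sketch: a --yes flag makes the run non-interactive;
    otherwise the user must answer the prompt with 'y' or 'Y'.
    """
    if '--yes' in argv:
        return True
    return ask('Continue and overwrite cached set hashes? [y/N] ') in {'y', 'Y'}


# Non-interactive run (e.g. CI would pass sys.argv[1:] containing --yes):
print(confirmed(['--yes']))  # True
# Interactive refusal (prompt stubbed out for the example):
print(confirmed([], ask=lambda _: 'n'))  # False
```

Injecting the prompt function as a parameter keeps the guard testable without patching builtins, while the real script would call `sys.exit(0)` when `confirmed` returns False.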
This PR significantly reduces the Git footprint by addressing the massive `{}_sets.json` arrays in tests/testing_data. Currently, small valid set changes can cause ~800 KB files to flood pull requests with noisy, unreviewable Git diffs.
This update switches the validation paradigm from directly comparing lists of thousands of elements to comparing deterministic SHA-256 cryptographic hashes of the subsets.
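The new comparison flow can be sketched as follows; this uses a simplified `hash_set` and toy set contents for illustration, while the real code lives in tests/test_set_consistency.py and the capture utility:

```python
import hashlib
import json


def hash_set(s):
    # Simplified variant: always stringify elements before sorting
    elements = sorted(str(e) for e in s)
    return hashlib.sha256(json.dumps(elements).encode('utf-8')).hexdigest()


# What the JSON fixture now stores: one 64-char hex digest per set,
# instead of the full (potentially thousands-long) element list
cached = {'tech_all': hash_set({'coal', 'solar', 'wind'})}
# What the built model reports at test time
model = {'tech_all': {'wind', 'solar', 'coal'}}

for name, elements in model.items():
    assert hash_set(elements) == cached[name], f'set {name} changed'
print('all sets match')
```

A changed set still fails the test (the digests diverge), but the fixture diff shrinks from hundreds of kilobytes of elements to one replaced 64-character line per set.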
Pros:
Cons:
Updating Test Data Workflow: No workflow mechanisms change for maintainers. If they make changes to logic that causes test failures:
uv run python tests/utilities/capture_set_values_for_cache.py

Summary by CodeRabbit