tests: Replace massive dataset caches with deterministic hashes #282
ParticularlyPythonicBS wants to merge 1 commit into unstable from
Conversation
Walkthrough
Replaced element-level set comparisons with deterministic SHA-256 hashes in tests and a capture utility. The capture script now auto-confirms, uses updated testing config paths, instantiates the sequencer in BUILD_ONLY mode, and excludes components with names containing …
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/utilities/capture_set_values_for_cache.py (1)
20-32: 🧹 Nitpick | 🔵 Trivial
Remove or update the misleading warning and dead code.
The warning message (lines 20-28) still reads as if prompting for user input, but the script now runs automatically. Lines 31-32 are dead code since `t = 'Y'` always passes the check. Consider either:
- Removing the warning and dead check entirely if automation is intended
- Or updating the warning to reflect automatic execution
♻️ Option 1: Remove dead code and update warning
```diff
 print(
-    'WARNING: Continuing to execute this file will '
-    'update the cached values in the testing_data folder'
-    'from the sqlite databases in the same folder. '
-    'This should only need to be done if the schema or'
-    'model have changed and that database has been updated.'
-    '\nRunning this basically resets the expected value sets'
-    'for Utopia, TestSystem, and Mediumville'
+    'INFO: Updating cached set hashes in testing_data folder '
+    'from the sqlite databases. This resets expected value sets '
+    'for Utopia, TestSystem, and Mediumville.'
 )
-
-t = 'Y'  # automated run
-if t not in {'y', 'Y'}:
-    sys.exit(0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utilities/capture_set_values_for_cache.py` around lines 20 - 32, The printed multi-line warning and the dead variable check (the print(...) block and the variable t = 'Y' plus the if t not in {'y','Y'}: sys.exit(0)) are misleading because the script runs automatically; either delete the entire warning and the t/check lines to reflect automation, or replace the warning text with a concise message that clearly states the script will run non-interactively and remove the unreachable conditional; locate and update the exact print(...) block and the variable t assignment to implement the chosen change.

tests/test_set_consistency.py (1)
110-113: 🧹 Nitpick | 🔵 Trivial
Consider using consistent assertion style.
Lines 110-113 use `if x: assert False` (requiring `noqa B011`) while lines 106-109 use the cleaner `assert not x` pattern. Using a consistent style improves readability.
♻️ Proposed fix
```diff
-if cache_extra_sets:
-    assert False, 'Cache has extra sets'  # noqa B011
-if model_extra_sets:
-    assert False, 'Model has extra sets'  # noqa B011
+assert not cache_extra_sets, f'Cache has extra sets: {cache_extra_sets}'
+assert not model_extra_sets, f'Model has extra sets: {model_extra_sets}'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_set_consistency.py` around lines 110 - 113, Replace the conditional-assert pattern with direct assertions for consistency: change the "if cache_extra_sets: assert False, 'Cache has extra sets'" and "if model_extra_sets: assert False, 'Model has extra sets'" to "assert not cache_extra_sets, 'Cache has extra sets'" and "assert not model_extra_sets, 'Model has extra sets'" respectively (update the assertions near variables cache_extra_sets and model_extra_sets and remove the unnecessary "# noqa B011").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_set_consistency.py`:
- Around line 75-78: The two set-difference computations are duplicated: replace
uses of cache_extra_sets with the already-filtered missing_in_model (computed as
cached_sets.keys() - model_sets.keys() then filtered for '_index' and '_domain')
and delete the redundant cache_extra_sets definition; update the assertions that
currently check both variables (the ones around the earlier block and lines that
repeat the same check) to use missing_in_model only so the logic is consolidated
and the duplicate computation/assertion is removed.
- Around line 64-66: The empty special-case in the loop over model_sets (the `if
set_name == 'cost_emission_rpe': pass` block in tests/test_set_consistency.py)
is dead code — either remove that conditional entirely or replace it with the
intended behavior (e.g., skip this set explicitly with `continue`, or add the
specific assertions/handling that were intended for `cost_emission_rpe`); update
the loop that iterates `for set_name, s in model_sets.items():` accordingly and
ensure any referenced variables (`set_name`, `s`, `model_sets`) remain used or
removed to avoid linter/test warnings.
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 62-68: The hash_set() function is duplicated across tests; extract
it into a shared test utility module (e.g., create a
tests/utilities/hash_utils.py with hash_set), update the current file's hash_set
call to import hash_set from that module, and remove the duplicate
implementation from tests/test_set_consistency.py so both tests import and use
the single shared hash_set function to ensure consistency and maintainability.
- Around line 62-68: The helper function hash_set is currently defined inside
the for scenario in scenarios: loop causing it to be redefined each iteration;
move the entire hash_set(s: Any) definition to module level before the loop so
it is defined once and reused, preserving its current behavior
(sorted_elements/TypeError fallback, JSON encode and sha256) and update any
references inside the loop to call hash_set as before.
---
Outside diff comments:
In `@tests/test_set_consistency.py`:
- Around line 110-113: Replace the conditional-assert pattern with direct
assertions for consistency: change the "if cache_extra_sets: assert False,
'Cache has extra sets'" and "if model_extra_sets: assert False, 'Model has extra
sets'" to "assert not cache_extra_sets, 'Cache has extra sets'" and "assert not
model_extra_sets, 'Model has extra sets'" respectively (update the assertions
near variables cache_extra_sets and model_extra_sets and remove the unnecessary
"# noqa B011").
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 20-32: The printed multi-line warning and the dead variable check
(the print(...) block and the variable t = 'Y' plus the if t not in {'y','Y'}:
sys.exit(0)) are misleading because the script runs automatically; either delete
the entire warning and the t/check lines to reflect automation, or replace the
warning text with a concise message that clearly states the script will run
non-interactively and remove the unreachable conditional; locate and update the
exact print(...) block and the variable t assignment to implement the chosen
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2da66896-5ba2-47f7-b8c1-a53cc505a224
📒 Files selected for processing (5)
tests/test_set_consistency.py
tests/testing_data/mediumville_sets.json
tests/testing_data/test_system_sets.json
tests/testing_data/utopia_sets.json
tests/utilities/capture_set_values_for_cache.py
```python
for set_name, s in model_sets.items():
    if set_name == 'cost_emission_rpe':
        pass
```
Remove dead code or complete the implementation.
The if set_name == 'cost_emission_rpe': pass block does nothing. This appears to be leftover debug code or an incomplete special-case handler. Either remove it or implement the intended logic.
🧹 Proposed fix
```diff
 for set_name, s in model_sets.items():
-    if set_name == 'cost_emission_rpe':
-        pass
     if '_index' in set_name or '_domain' in set_name:
         continue
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
for set_name, s in model_sets.items():
    if set_name == 'cost_emission_rpe':
        pass
```

After:

```python
for set_name, s in model_sets.items():
    if '_index' in set_name or '_domain' in set_name:
        continue
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_set_consistency.py` around lines 64 - 66, The empty special-case
in the loop over model_sets (the `if set_name == 'cost_emission_rpe': pass`
block in tests/test_set_consistency.py) is dead code — either remove that
conditional entirely or replace it with the intended behavior (e.g., skip this
set explicitly with `continue`, or add the specific assertions/handling that
were intended for `cost_emission_rpe`); update the loop that iterates `for
set_name, s in model_sets.items():` accordingly and ensure any referenced
variables (`set_name`, `s`, `model_sets`) remain used or removed to avoid
linter/test warnings.
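As a sanity check of the skip pattern the review suggests, here is a minimal standalone sketch; the set names and contents below are made up for illustration and are not from the Temoa model:

```python
# Hypothetical sketch of the suggested skip logic: set names containing
# '_index' or '_domain' are pyomo-internal and excluded from comparison.
model_sets = {
    'tech_all': {'coal', 'solar'},
    'tech_all_index': {0, 1},       # pyomo-internal, should be skipped
    'commodity_domain': {'c1'},     # pyomo-internal, should be skipped
}

kept = {}
for set_name, s in model_sets.items():
    if '_index' in set_name or '_domain' in set_name:
        continue
    kept[set_name] = s

print(sorted(kept))  # → ['tech_all']
```

Using `continue` makes the intent explicit, unlike the dead `pass` branch flagged above.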
```python
missing_in_model = cached_sets.keys() - model_sets.keys()
# drop any set that has "_index" in the name as they are no longer reported by newer version of
# pyomo
missing_in_model = {s for s in missing_in_model if '_index' not in s and '_domain' not in s}
```
🧹 Nitpick | 🔵 Trivial
missing_in_model and cache_extra_sets are identical after filtering.
Both compute cached_sets.keys() - model_sets.keys() with the same '_index'/'_domain' filter applied. This results in duplicate computation and redundant assertions (lines 106 and 110-111 check the same condition).
♻️ Proposed consolidation

```diff
 missing_in_model = cached_sets.keys() - model_sets.keys()
 # drop any set that has "_index" in the name as they are no longer reported by newer version of
 # pyomo
 missing_in_model = {s for s in missing_in_model if '_index' not in s and '_domain' not in s}
 ...
-cache_extra_sets = {
-    k
-    for k in cached_sets.keys() - model_sets.keys()
-    if '_index' not in k and '_domain' not in k
-}
 ...
 if cache_extra_sets:
     print('\nCache extra sets compared to model: ')
 ...
-assert not missing_in_model, f'one or more cached set not in model: {missing_in_model}'
 ...
-if cache_extra_sets:
-    assert False, 'Cache has extra sets'  # noqa B011
+assert not missing_in_model, f'Cache has extra sets not in model: {missing_in_model}'
```

Use missing_in_model throughout and remove cache_extra_sets.
Also applies to: 91-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_set_consistency.py` around lines 75 - 78, The two set-difference
computations are duplicated: replace uses of cache_extra_sets with the
already-filtered missing_in_model (computed as cached_sets.keys() -
model_sets.keys() then filtered for '_index' and '_domain') and delete the
redundant cache_extra_sets definition; update the assertions that currently
check both variables (the ones around the earlier block and lines that repeat
the same check) to use missing_in_model only so the logic is consolidated and
the duplicate computation/assertion is removed.
```python
def hash_set(s: Any) -> str:
    try:
        sorted_elements = sorted(s)
    except TypeError:
        sorted_elements = sorted([str(e) for e in s])
    s_bytes = json.dumps(sorted_elements).encode('utf-8')
    return hashlib.sha256(s_bytes).hexdigest()
```
🧹 Nitpick | 🔵 Trivial
Consider extracting hash_set() to a shared test utility.
This function is duplicated verbatim in tests/test_set_consistency.py (lines 49-55). Extracting it to a shared module (e.g., tests/utilities/hash_utils.py) would ensure consistency and simplify maintenance if the hashing logic needs to change.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 62-62: Dynamically typed expressions (typing.Any) are disallowed in s
(ANN401)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 62 - 68, The
hash_set() function is duplicated across tests; extract it into a shared test
utility module (e.g., create a tests/utilities/hash_utils.py with hash_set),
update the current file's hash_set call to import hash_set from that module, and
remove the duplicate implementation from tests/test_set_consistency.py so both
tests import and use the single shared hash_set function to ensure consistency
and maintainability.
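To illustrate why this helper is worth extracting once, here it is as quoted in the review, with a quick check that the digest is order-independent; the demo values are hypothetical:

```python
import hashlib
import json
from typing import Any


def hash_set(s: Any) -> str:
    """Deterministic SHA-256 digest of a set's elements (as quoted in the review)."""
    try:
        sorted_elements = sorted(s)
    except TypeError:
        # Mixed or unorderable element types: fall back to a string sort
        sorted_elements = sorted([str(e) for e in s])
    s_bytes = json.dumps(sorted_elements).encode('utf-8')
    return hashlib.sha256(s_bytes).hexdigest()


# Same elements, different construction order -> identical digest
print(hash_set({'b', 'a', 'c'}) == hash_set({'c', 'a', 'b'}))  # True
```

Sorting before serializing is what makes the digest independent of set iteration order, which is the whole point of caching hashes instead of element lists.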
🧹 Nitpick | 🔵 Trivial
Move hash_set() outside the loop.
The function is currently defined inside the for scenario in scenarios: loop (line 54), causing it to be recreated on every iteration. Move it to module level before the loop for clarity and to avoid unnecessary redefinition.
♻️ Proposed fix
Move the function definition before the loop:

```diff
+def hash_set(s: Any) -> str:
+    try:
+        sorted_elements = sorted(s)
+    except TypeError:
+        sorted_elements = sorted([str(e) for e in s])
+    s_bytes = json.dumps(sorted_elements).encode('utf-8')
+    return hashlib.sha256(s_bytes).hexdigest()
+
+
 for scenario in scenarios:
     config = TemoaConfig.build_config(
         config_file=scenario['config_file'], output_path=output_path, silent=True
     )
     ts = TemoaSequencer(config=config, mode_override=TemoaMode.BUILD_ONLY)
     built_instance = ts.build_model()  # catch the built model
-    def hash_set(s: Any) -> str:
-        try:
-            sorted_elements = sorted(s)
-        except TypeError:
-            sorted_elements = sorted([str(e) for e in s])
-        s_bytes = json.dumps(sorted_elements).encode('utf-8')
-        return hashlib.sha256(s_bytes).hexdigest()
-
     model_sets = built_instance.component_map(ctype=pyo.Set)
```

🧰 Tools
🪛 Ruff (0.15.6)
[warning] 62-62: Dynamically typed expressions (typing.Any) are disallowed in s
(ANN401)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 62 - 68, The
helper function hash_set is currently defined inside the for scenario in
scenarios: loop causing it to be redefined each iteration; move the entire
hash_set(s: Any) definition to module level before the loop so it is defined
once and reused, preserving its current behavior (sorted_elements/TypeError
fallback, JSON encode and sha256) and update any references inside the loop to
call hash_set as before.
Includes fix to capture script so it uses the standard test configs
0094585 to afd4aee
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/utilities/capture_set_values_for_cache.py (1)
76-77: 🧹 Nitpick | 🔵 Trivial
Sort the cache keys when writing the JSON fixture.
The hashes are deterministic, but the file order still depends on `component_map()` iteration order. `sort_keys=True` will keep regenerated caches stable and avoid noisy diffs when component ordering changes.
♻️ Proposed change

```diff
-json.dump(sets_dict, f_out, indent=2)
+json.dump(sets_dict, f_out, indent=2, sort_keys=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utilities/capture_set_values_for_cache.py` around lines 76 - 77, When writing the JSON fixture, ensure the keys are sorted to produce stable output by adding sort_keys=True to the json.dump call that writes sets_dict to scenario['output_file']; update the json.dump(sets_dict, f_out, indent=2) invocation to include sort_keys=True so regenerated caches remain stable regardless of component_map() iteration order.
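A minimal sketch of the effect of `sort_keys=True`, using toy dicts rather than the real fixture data: two dicts holding the same entries in different insertion orders serialize to byte-identical JSON.

```python
import json

# Toy stand-ins for the sets_dict the capture script writes out
a = {'utopia': 'abc123', 'mediumville': 'def456'}
b = {'mediumville': 'def456', 'utopia': 'abc123'}

# sort_keys=True removes insertion order from the serialized output
stable_a = json.dumps(a, indent=2, sort_keys=True)
stable_b = json.dumps(b, indent=2, sort_keys=True)
print(stable_a == stable_b)  # True
```

Without `sort_keys=True` the two dumps would differ whenever insertion order differs, producing exactly the noisy fixture diffs the comment warns about.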
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 30-32: The script currently hardcodes t = 'Y' which forces an
automated "yes" and immediately overwrites cached fixtures; change the flow in
the top-level confirmation block (the variable t and the if t not in {'y','Y'}:
sys.exit(0) check) to require an explicit opt-in: read a prompt from input()
unless a CLI flag like --yes is present (parse args or check os.environ), and
only set t = 'Y' when that flag is supplied; ensure the existing sys.exit(0)
behavior remains for non-consent cases and keep the set {'y','Y'} check for user
confirmation.
---
Outside diff comments:
In `@tests/utilities/capture_set_values_for_cache.py`:
- Around line 76-77: When writing the JSON fixture, ensure the keys are sorted
to produce stable output by adding sort_keys=True to the json.dump call that
writes sets_dict to scenario['output_file']; update the json.dump(sets_dict,
f_out, indent=2) invocation to include sort_keys=True so regenerated caches
remain stable regardless of component_map() iteration order.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d4fed7e6-ae2c-42ce-827d-a21652dc1179
📒 Files selected for processing (5)
tests/test_set_consistency.py
tests/testing_data/mediumville_sets.json
tests/testing_data/test_system_sets.json
tests/testing_data/utopia_sets.json
tests/utilities/capture_set_values_for_cache.py
```python
t = 'Y'  # automated run
if t not in {'y', 'Y'}:
    sys.exit(0)
```
Restore an explicit opt-in before rewriting the cached fixtures.
Line 30 hardcodes the confirmation to "Y", so any invocation immediately refreshes the databases and overwrites all three checked-in cache files. Please keep a prompt or --yes flag here so accidental runs do not silently rewrite expected test data.
💡 Minimal direction
```diff
-t = 'Y'  # automated run
-if t not in {'y', 'Y'}:
+if '--yes' not in sys.argv and input('Continue and overwrite cached set hashes? [y/N] ') not in {'y', 'Y'}:
     sys.exit(0)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
t = 'Y'  # automated run
if t not in {'y', 'Y'}:
    sys.exit(0)
```

After:

```python
if '--yes' not in sys.argv and input('Continue and overwrite cached set hashes? [y/N] ') not in {'y', 'Y'}:
    sys.exit(0)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utilities/capture_set_values_for_cache.py` around lines 30 - 32, The
script currently hardcodes t = 'Y' which forces an automated "yes" and
immediately overwrites cached fixtures; change the flow in the top-level
confirmation block (the variable t and the if t not in {'y','Y'}: sys.exit(0)
check) to require an explicit opt-in: read a prompt from input() unless a CLI
flag like --yes is present (parse args or check os.environ), and only set t =
'Y' when that flag is supplied; ensure the existing sys.exit(0) behavior remains
for non-consent cases and keep the set {'y','Y'} check for user confirmation.
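One way to sketch the requested opt-in guard; the function name, signature, and prompt text below are illustrative, not from the PR:

```python
def confirmed(argv, ask=input):
    """Return True only when the user explicitly opted in.

    Hypothetical sketch: a --yes flag makes the run non-interactive;
    otherwise the user must answer the prompt with 'y' or 'Y'.
    """
    if '--yes' in argv:
        return True
    return ask('Continue and overwrite cached set hashes? [y/N] ') in {'y', 'Y'}


# Non-interactive run (e.g. CI would pass sys.argv[1:] containing --yes):
print(confirmed(['--yes']))  # True
# Interactive refusal (prompt stubbed out for the example):
print(confirmed([], ask=lambda _: 'n'))  # False
```

Injecting the prompt function as a parameter keeps the guard testable without patching builtins, while the real script would call `sys.exit(0)` when `confirmed` returns False.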
This PR significantly reduces the Git footprint by addressing the massive `{}_sets.json` arrays in tests/testing_data. Currently, small valid set changes can cause ~800 KB files to flood pull requests with noisy, unreviewable Git diffs.
This update switches the validation paradigm from directly comparing lists of thousands of elements to comparing deterministic SHA-256 cryptographic hashes of the subsets.
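The new comparison flow can be sketched as follows; this uses a simplified `hash_set` and toy set contents for illustration, while the real code lives in tests/test_set_consistency.py and the capture utility:

```python
import hashlib
import json


def hash_set(s):
    # Simplified variant: always stringify elements before sorting
    elements = sorted(str(e) for e in s)
    return hashlib.sha256(json.dumps(elements).encode('utf-8')).hexdigest()


# What the JSON fixture now stores: one 64-char hex digest per set,
# instead of the full (potentially thousands-long) element list
cached = {'tech_all': hash_set({'coal', 'solar', 'wind'})}
# What the built model reports at test time
model = {'tech_all': {'wind', 'solar', 'coal'}}

for name, elements in model.items():
    assert hash_set(elements) == cached[name], f'set {name} changed'
print('all sets match')
```

A changed set still fails the test (the digests diverge), but the fixture diff shrinks from hundreds of kilobytes of elements to one replaced 64-character line per set.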
Pros:
Cons:
Updating Test Data Workflow: No workflow mechanisms change for maintainers. If they make changes to logic that causes test failures:
uv run python tests/utilities/capture_set_values_for_cache.py

Summary by CodeRabbit