Add bloom filter folding to automatically size SBBF filters#9628

Open
adriangb wants to merge 15 commits into apache:main from pydantic:bloom-filter-folding

Conversation

@adriangb (Contributor) commented Mar 30, 2026

Summary

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

Changes

  • BloomFilterProperties.ndv changed from u64 to Option<u64> — when None (new default), the filter is sized based on max_row_group_row_count; when Some(n), the explicit NDV is used
  • DEFAULT_BLOOM_FILTER_NDV redefined to DEFAULT_MAX_ROW_GROUP_ROW_COUNT as u64 (was hardcoded 1_000_000)
  • Added Sbbf::fold_to_target_fpp() and supporting methods (fold_once, estimated_fpp_after_fold, num_blocks) with comprehensive documentation
  • flush_bloom_filter() in both ColumnValueEncoderImpl and ByteArrayEncoder now folds the filter before returning it
  • New create_bloom_filter() helper in encoder.rs centralizes bloom filter construction logic
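
To make the flush-time behavior concrete, here is a minimal sketch of how a fold-to-target loop could work: keep halving while the estimated FPP after the next fold would still meet the target. The signature and the FPP-estimate callback are illustrative stand-ins, not the PR's actual `Sbbf` API:

```rust
// Fold a filter down as far as possible while the estimated FPP after the
// next fold would still stay at or below `target_fpp`.
// `fpp_after_fold(n)` stands in for an estimate of the FPP at `n` blocks.
fn fold_to_target_fpp(
    mut num_blocks: usize,
    target_fpp: f64,
    fpp_after_fold: impl Fn(usize) -> f64,
) -> usize {
    while num_blocks >= 2 && fpp_after_fold(num_blocks / 2) <= target_fpp {
        num_blocks /= 2;
    }
    num_blocks
}

fn main() {
    // toy estimate: below 8 blocks the filter would be too dense
    let est = |n: usize| if n >= 8 { 0.001 } else { 0.5 };
    assert_eq!(fold_to_target_fpp(64, 0.01, est), 8);
    // target already exceeded before folding: size is left unchanged
    assert_eq!(fold_to_target_fpp(4, 0.01, est), 4);
}
```

Because folding only ever ORs blocks together, the loop can stop at any point and still contain every inserted value; only the false-positive rate changes.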

How folding works

The SBBF fold operation merges adjacent block pairs (block[2i] | block[2i+1]) via bitwise OR, halving the filter size. This differs from standard Bloom filter folding (which merges halves at distance m/2) because SBBF uses multiplicative hashing for block selection:

block_index = ((hash >> 32) * num_blocks) >> 32

When num_blocks is halved, the new index becomes floor(original_index / 2), so adjacent blocks map to the same position.
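
This halving property can be checked directly. A minimal sketch (the function name and hash constants here are illustrative, not the PR's code):

```rust
// SBBF block selection via multiplicative hashing: take the top 32 bits of
// the hash and scale them into [0, num_blocks).
fn block_index(hash: u64, num_blocks: u64) -> u64 {
    ((hash >> 32) * num_blocks) >> 32
}

fn main() {
    let num_blocks = 64u64;
    let mut hash: u64 = 0x9E37_79B9_7F4A_7C15; // arbitrary seed
    for _ in 0..10_000 {
        // cheap pseudo-random walk over hash values
        hash = hash.wrapping_mul(0x2545_F491_4F6C_DD1D).rotate_left(17);
        let full = block_index(hash, num_blocks);
        let half = block_index(hash, num_blocks / 2);
        // halving the block count maps each value to floor(original / 2),
        // so blocks 2i and 2i+1 collapse onto block i
        assert_eq!(half, full / 2);
    }
}
```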

FPP is estimated per-block as avg(block_fill^8) since SBBF membership checks are localized to a single 256-bit block.
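
A sketch of that estimate, assuming a filter represented as 256-bit blocks of eight `u32` words (the representation and function name are illustrative):

```rust
// Estimate the false-positive probability of a split block Bloom filter:
// a lookup probes 8 bits within one 256-bit block, so per block the FPP is
// roughly (fraction of set bits)^8, averaged across all blocks.
fn estimated_fpp(blocks: &[[u32; 8]]) -> f64 {
    let total: f64 = blocks
        .iter()
        .map(|block| {
            let set_bits: u32 = block.iter().map(|w| w.count_ones()).sum();
            (f64::from(set_bits) / 256.0).powi(8)
        })
        .sum();
    total / blocks.len() as f64
}

fn main() {
    assert_eq!(estimated_fpp(&[[0u32; 8]; 4]), 0.0); // empty filter never matches
    assert!((estimated_fpp(&[[u32::MAX; 8]; 4]) - 1.0).abs() < 1e-12); // saturated
    // each 0xAAAA_AAAA word has 16 of 32 bits set, so the block is half full
    let half_full = [[0xAAAA_AAAAu32; 8]];
    assert!((estimated_fpp(&half_full) - 0.5f64.powi(8)).abs() < 1e-12);
}
```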

References

Sailhan & Stehr, "Folding and Unfolding Bloom Filters", IEEE iThings 2012.

Liang, "Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore"

Breaking changes

  • BloomFilterProperties.ndv: u64 → Option<u64> (direct struct construction must be updated)

Test plan

  • All existing bloom filter unit tests pass
  • All existing integration tests (sync + async reader roundtrips) pass
  • New unit tests: fold correctness, no false negatives after folding, FPP target respected, minimum size guard
  • New unit tests: folded filter is bit-identical to a fresh filter of the same size (proves correctness via two lemmas about SBBF hashing)
  • New unit tests: multi-step folding, folded FPP matches fresh FPP empirically, fold size matches optimal fixed-size filter
  • New integration test: i32_column_bloom_filter_fixed_ndv — roundtrip with both overestimated and underestimated NDV
  • Full cargo test -p parquet passes

🤖 Generated with Claude Code

Instead of requiring users to guess NDV (number of distinct values) upfront,
bloom filters now support a folding mode: allocate a conservatively large
filter (sized for worst-case NDV = max row group rows), insert all values
during writing, then fold down at flush time to meet a target FPP.

When NDV is not explicitly set (the new default), folding mode activates
automatically. Setting NDV explicitly preserves the existing fixed-size
behavior for backward compatibility.

Key changes:
- BloomFilterProperties.ndv is now Option<u64> (None = folding mode)
- Added BloomFilterProperties.max_bytes for explicit initial size control
- Default FPP changed from 0.05 to 0.01
- Added Sbbf::fold_to_target_fpp() which merges adjacent block pairs
- Both ColumnValueEncoderImpl and ByteArrayEncoder fold at flush time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 30, 2026
adriangb and others added 5 commits March 30, 2026 06:41
- Revert DEFAULT_BLOOM_FILTER_FPP back to 0.05 (no behavior change)
- Add comprehensive docstrings on Sbbf, fold_once, estimated_fpp_after_fold,
  and fold_to_target_fpp explaining the mathematical basis, SBBF adaptation
  (adjacent pairs vs halves), FPP estimation, and correctness guarantees
- Add citation to Sailhan & Stehr "Folding and Unfolding Bloom Filters"
  (IEEE iThings 2012, doi:10.1109/GreenCom.2012.16)
- Keep module-level docs short, pointing to struct/method docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@etseidl (Contributor) commented Mar 30, 2026

Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

@adriangb (Contributor, Author) commented Mar 30, 2026

@viirya or @jimexist since you've worked on our bloom filters before, any interest in reviewing?

@adriangb (Contributor, Author)

Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

My motivation is that, looking at our data, this is a consistent problem: we have high-cardinality data (trace ids) that, when packed into 1M-row row groups, saturates the bloom filters (making them useless) but also wastes a ton of space in small files. While looking for a solution I came across this neat trick.

I don't know if other Parquet implementations use this, but TimescaleDB does (linked above).

Co-authored-by: emkornfield <emkornfield@gmail.com>
assert!(
len >= 2,
"Cannot fold a bloom filter with fewer than 2 blocks"
);
Member

assert!(len % 2 == 0)?

I think fold_once can only work if len is not odd.

@emkornfield (Contributor) commented Mar 30, 2026

I think it should work fine with odd values, as long as we are sure the last value doesn't cause an out-of-bounds index (i.e. the last block is not modified in the odd case). But I think we probably truncate too much for odd values.
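
For concreteness, a hypothetical adjacent-pair fold could look like this (a sketch only, requiring an even block count as discussed above; the PR's actual fold_once may differ):

```rust
// Merge adjacent block pairs with bitwise OR, halving the filter.
// An odd block count would leave the last block without a partner,
// which is the out-of-bounds / over-truncation concern above.
fn fold_once(blocks: &mut Vec<[u32; 8]>) {
    assert!(blocks.len() >= 2, "cannot fold fewer than 2 blocks");
    assert!(blocks.len() % 2 == 0, "adjacent-pair fold needs an even block count");
    let half = blocks.len() / 2;
    for i in 0..half {
        let mut merged = [0u32; 8];
        for w in 0..8 {
            merged[w] = blocks[2 * i][w] | blocks[2 * i + 1][w];
        }
        blocks[i] = merged;
    }
    blocks.truncate(half);
}

fn main() {
    let mut blocks = vec![[0b01u32; 8], [0b10u32; 8], [0b100u32; 8], [0b100u32; 8]];
    fold_once(&mut blocks);
    // every bit set in either member of a pair survives in the merged block,
    // so no false negatives are introduced
    assert_eq!(blocks, vec![[0b11u32; 8], [0b100u32; 8]]);
}
```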

let block_fill = set_bits as f64 / 256.0;
total_fpp += block_fill.powi(8);
}
total_fpp / half as f64
Contributor

Why is the cast needed here? Can it be avoided by setting the type explicitly on total_fpp?

///
/// ## Why adjacent pairs (not halves)?
///
/// Standard Bloom filter folding merges the two halves (`B[i] | B[i + m/2]`) because
Contributor

nit: as an explanation it might pay to reverse this; I'm not sure readers would commonly be aware of bloom filter folding. It might be better to explain the halves approach first and then indicate why this differs from the linked paper.

.bloom_filter_properties(descr.path())
.map(|props| Sbbf::new_with_ndv_fpp(props.ndv, props.fpp))
.transpose()?;
let (bloom_filter, bloom_filter_target_fpp) =
Member

Is the bloom filter creation logic the same as encoder.rs? Maybe we can extract fn create_bloom_filter?

@viirya (Member) left a comment

Do we have e2e tests that cover this folding mode behavior already?

@adriangb (Contributor, Author)

Do we have e2e tests that cover this folding mode behavior already?

I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

adriangb and others added 2 commits March 30, 2026 17:31
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya (Member) commented Mar 31, 2026

Do we have e2e tests that cover this folding mode behavior already?

I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

Okay. Actually, after this PR the existing integration roundtrip tests will cover the folding path automatically, because they don't set NDV. That means the old fixed-size behavior will no longer be covered by these roundtrip tests, so it seems we should add roundtrip tests for fixed-size mode.

Arrow writer roundtrip tests are in parquet/src/arrow/arrow_writer/mod.rs, like i32_column_bloom_filter, i32_column_bloom_filter_at_end, etc.

Arrow reader roundtrip tests, like test_get_row_group_column_bloom_filter_with_length in parquet/tests/arrow_reader/bloom_filter/sync.rs, only call set_bloom_filter_enabled(true).

@adriangb (Contributor, Author)

I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).

@viirya (Member) commented Mar 31, 2026

I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).

Yea, I think we can deprecate the old behavior and maybe remove it after a few releases.

@adriangb (Contributor, Author)

Yea, I think we can deprecate the old behavior and maybe remove it after a few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

@viirya (Member) commented Mar 31, 2026

Yea, I think we can deprecate the old behavior and maybe remove it after a few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

I think we can do it in this PR.

@wgtmac (Member) commented Mar 31, 2026

Coming from the dev list. The parquet-java implementation tried to optimize the disk size by creating multiple bloom filter writers with different NDVs and choosing the best in the end. The approach in this PR looks more elegant and worth porting to other implementations.

@adriangb (Contributor, Author)

Yea, I think we can deprecate the old behavior and maybe remove it after a few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

I think we can do it in this PR.

If we want to deprecate the existing NDV, I think we're better off re-interpreting it to mean "maximum NDV" or "initial NDV". That way existing users who set the NDV also benefit from folding. This means there will be no way to disable folding, but I don't see any reason anyone would want that beyond requiring a fixed-size bloom filter; in that case, relying on a combination of fpp + ndv to produce a fixed size was probably a bad choice to begin with (I don't think we made any such API promise), and they should open an issue requesting an explicit API for it.

Thus the only changes vs. main now are:

  1. Adding the folding on write.
  2. Default max ndv is derived from the max rows per row group instead of being hardcoded.

@adriangb adriangb requested review from emkornfield and viirya March 31, 2026 20:43
(SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
);

// NDV smaller than actual distinct values — tests the underestimate path
Member

The array has only 7 distinct values, so "NDV smaller than actual distinct values" seems incorrect?
