
Add per-table byte size metrics (lean_table_bytes) #244

Open
pablodeymo wants to merge 4 commits into main from add-table-bytes-metrics

Conversation


@pablodeymo pablodeymo commented Mar 18, 2026

Summary

  • Expose estimated byte size of each storage table as a Prometheus gauge (lean_table_bytes{table="..."}).
  • Uses RocksDB's estimate-live-data-size property — an O(1) metadata lookup per column family, no scanning.
  • Provides visibility into storage growth for improving pruning decisions.

How it works

  • StorageBackend trait gains estimate_table_bytes(table) with a default returning 0.
  • RocksDBBackend implements it via property_int_value_cf("rocksdb.estimate-live-data-size").
  • Store::estimate_table_bytes() passes through to the backend.
  • Prometheus lean_table_bytes IntGaugeVec with a table label, updated after each block is processed.

No startup scans, no atomic counters, no insert/prune instrumentation needed.
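The trait-plus-default shape described above can be sketched as follows. The `Table` variants and `InMemoryBackend` name are illustrative, and the RocksDB override is shown only as a comment since this sketch has no database dependency:

```rust
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Table {
    States,
    BlockHeaders,
}

trait StorageBackend {
    /// Estimated live bytes for one table. The default of 0 keeps
    /// non-RocksDB backends (e.g. in-memory) source-compatible.
    fn estimate_table_bytes(&self, _table: Table) -> u64 {
        0
    }
}

struct InMemoryBackend;
impl StorageBackend for InMemoryBackend {} // inherits the 0 default

// A RocksDB backend would override it roughly like this (rust-rocksdb):
//   fn estimate_table_bytes(&self, table: Table) -> u64 {
//       self.db
//           .cf_handle(cf_name(table))
//           .and_then(|cf| {
//               self.db
//                   .property_int_value_cf(cf, "rocksdb.estimate-live-data-size")
//                   .ok()
//                   .flatten()
//           })
//           .unwrap_or(0)
//   }

fn main() {
    let backend = InMemoryBackend;
    println!("{}", backend.estimate_table_bytes(Table::States)); // prints 0
}
```

The default return of 0 is what lets the trait change land without touching every backend.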

Tables tracked

| Label | Description |
| --- | --- |
| states | Beacon states by root (largest — XMSS signatures) |
| block_headers | Block headers by root |
| block_bodies | Block bodies by root |
| block_signatures | Block signatures with attestation |
| gossip_signatures | Individual validator gossip signatures |
| attestation_data_by_root | Content-addressed attestation data |
| live_chain | Fork choice slot index |
| metadata | Store metadata (head, config, checkpoints) |

Files changed

  • crates/storage/src/api/traits.rs — estimate_table_bytes on StorageBackend trait
  • crates/storage/src/backend/rocksdb.rs — RocksDB implementation via property_int_value_cf
  • crates/storage/src/api/tables.rs — Table::name() for metric labels
  • crates/storage/src/store.rs — estimate_table_bytes() pass-through
  • crates/blockchain/src/metrics.rs — lean_table_bytes IntGaugeVec
  • crates/blockchain/src/lib.rs — emit table bytes after each block processed

How to Test

  • Run a local devnet and query GET :5054/metrics
  • Verify lean_table_bytes{table="states"}, lean_table_bytes{table="block_headers"}, etc. are present
  • Values reflect RocksDB's estimate of live data per column family
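To eyeball the scrape, the gauge lines can be filtered with grep. The payload below is an illustrative sample, not captured from a real node; against a live devnet the same filter would be fed from `curl -s http://localhost:5054/metrics`:

```shell
# Illustrative metrics payload (values made up)
metrics='lean_table_bytes{table="states"} 10485760
lean_table_bytes{table="block_headers"} 524288
some_other_metric 1'

# Count the per-table gauge lines
printf '%s\n' "$metrics" | grep -c '^lean_table_bytes'   # prints 2
```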

Track the byte size (key + value) of each storage table as a live
Prometheus gauge. This provides visibility into storage growth and
is a prerequisite for improving pruning decisions.

The approach uses one AtomicU64 per table in the Store struct:
- On startup, a full scan seeds the counters from existing data.
- On insert, the SSZ-encoded key+value size is added.
- On prune/delete, the size of removed entries is subtracted.

Exposed as lean_table_bytes{table="states"}, lean_table_bytes{table="block_headers"}, etc.
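The counter scheme from that commit can be sketched like this. All names are hypothetical stand-ins for the real `Store` fields, and `TABLE_COUNT` is shrunk to 2 for brevity:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const TABLE_COUNT: usize = 2; // illustrative; the PR tracks 8 tables

// One counter per table, adjusted on insert and prune as described above.
struct TableBytes {
    counters: [AtomicU64; TABLE_COUNT],
}

impl TableBytes {
    fn new() -> Self {
        Self { counters: [AtomicU64::new(0), AtomicU64::new(0)] }
    }

    // On insert: add the encoded key+value size.
    fn add(&self, table: usize, key_len: usize, value_len: usize) {
        self.counters[table].fetch_add((key_len + value_len) as u64, Ordering::Relaxed);
    }

    // On prune/delete: subtract, saturating instead of wrapping in case
    // the accounting ever drifts below zero.
    fn sub(&self, table: usize, key_len: usize, value_len: usize) {
        let bytes = (key_len + value_len) as u64;
        self.counters[table]
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| {
                Some(v.saturating_sub(bytes))
            })
            .ok();
    }

    fn get(&self, table: usize) -> u64 {
        self.counters[table].load(Ordering::Relaxed)
    }
}

fn main() {
    let tb = TableBytes::new();
    tb.add(0, 32, 968); // insert: 32-byte key + 968-byte value
    tb.sub(0, 32, 468); // prune a smaller entry
    println!("{}", tb.get(0)); // prints 500
}
```

As the reviews below point out, this design only stays correct if every overwrite and every metadata write adjusts the counters by a delta, which is the main source of the bugs flagged.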
@github-actions

🤖 Codex Code Review

  1. The new byte accounting is incorrect for overwrites, and these code paths do overwrite existing keys. Both backends implement put_batch as an upsert, not an append, so blindly fetch_add-ing written bytes will overcount whenever a key is rewritten. The clearest case is the documented pending-block flow: insert_pending_block is explicitly followed by an idempotent overwrite in insert_signed_block (crates/storage/src/store.rs:801), but both paths add header/body/signature bytes again at crates/storage/src/store.rs:789, :814, :815, and :816. The overwrite semantics are visible in crates/storage/src/backend/in_memory.rs:99 and crates/storage/src/backend/rocksdb.rs:112. This will make lean_table_bytes drift upward over time. The fix should compute a byte delta against any existing value, not just add the new serialized size.

  2. Metadata is included in the startup scan and exported as a metric, but it is never kept in sync afterward. scan_table_bytes counts all tables, including Metadata, and process_block exports every table. But both set_metadata and the batched metadata writes in update_checkpoints bypass the new counters entirely. Since head, justified/finalized checkpoints, time, and safe_target all live in Metadata, lean_table_bytes{table="metadata"} becomes stale almost immediately after startup.

  3. The metric update happens before deferred pruning, so exported table sizes can stay stale until another block arrives. process_block updates the gauges at crates/blockchain/src/lib.rs:288, but the caller prunes old states/blocks later in on_block. Those prune paths do adjust the in-memory counters in storage, e.g. crates/storage/src/store.rs:685 and crates/storage/src/store.rs:759, but the Prometheus gauge is not refreshed afterward. If the node processes a block and then goes idle, the exported table sizes remain wrong.

No consensus-layer correctness, SSZ, or signature-validation regressions stood out here; the PR is operational/metrics-focused. Main gap is correctness of the new accounting itself, and I would want tests covering overwrite/idempotent writes, metadata updates, and post-prune metric refresh before merging.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

🤖 Kimi Code Review

The PR adds per-table storage byte metrics to track database growth. Overall the approach is sound, but there are several correctness and performance issues to address.

Critical Issues

1. Incorrect byte accounting on gossip signature upserts (crates/storage/src/store.rs, lines 1178–1199)
The insert_gossip_signature method checks if a key exists to avoid inflating the entry counter, but it fails to adjust the byte counter when updating an existing entry. If the signature size changes (e.g., different XMSS scheme or serialization), the metric will drift.

Fix by calculating the delta:

// new_bytes = (encoded_key.len() + new value's len()) as u64, computed above
if let Some(old_val) = old_value_len {
    let old_bytes = (encoded_key.len() + old_val.len()) as u64;
    self.sub_table_bytes(Table::GossipSignatures, old_bytes);
    self.add_table_bytes(Table::GossipSignatures, new_bytes);
} else {
    self.gossip_signatures_count.fetch_add(1, Ordering::Relaxed);
    self.add_table_bytes(Table::GossipSignatures, new_bytes);
}

2. Unhandled overwrites in other insert methods (insert_state, insert_attestation_data_by_root, insert_attestation_data_batch)
These methods unconditionally add bytes without checking for existing entries. While block roots and attestation roots should be unique in correct operation, defensive programming suggests either:

  • Checking for existing entries and adjusting the delta (safer)
  • Documenting the uniqueness assumption so the unconditional add is clearly intentional

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review



Review: Add per-table byte size metrics (lean_table_bytes)

The overall approach is sound — AtomicU64 per table seeded at startup and adjusted on every mutation is the right pattern for this. Most of the implementation is clean and correct. A few issues to address.


Bug: Double-counting for blocks going through the two-phase insert flow

crates/storage/src/store.rs:783–791 and :801–817

The doc comment on insert_pending_block explicitly says:

"When the block is later processed via insert_signed_block, the same keys are overwritten (idempotent)."

Both methods call write_signed_block for the same root and both increment the byte counters. The upsert means disk bytes don't grow on the second write, but the counters do — so block_headers, block_bodies, and block_signatures are double-counted for every block that flows through gossip receipt → validation. This will cause the gauges to drift upward indefinitely.

Fix: in insert_signed_block, skip the add_table_bytes calls for the three block tables if the header already exists (i.e., the block came through insert_pending_block first). Alternatively, subtract the existing size before adding the new size.
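The "subtract the existing size before adding the new size" variant could look roughly like this. It is a self-contained sketch with a HashMap standing in for the backend, and `upsert_tracked` is a hypothetical name, not code from the PR:

```rust
use std::collections::HashMap;

// Upsert that keeps a byte counter consistent under overwrites: the
// counter moves by (new size - old size), not by the new size alone.
fn upsert_tracked(
    map: &mut HashMap<Vec<u8>, Vec<u8>>,
    bytes: &mut u64,
    key: Vec<u8>,
    value: Vec<u8>,
) {
    let new = (key.len() + value.len()) as u64;
    let old = map
        .get(&key)
        .map(|v| (key.len() + v.len()) as u64)
        .unwrap_or(0);
    *bytes = bytes.saturating_sub(old) + new;
    map.insert(key, value);
}

fn main() {
    let mut map = HashMap::new();
    let mut bytes = 0u64;
    upsert_tracked(&mut map, &mut bytes, b"root".to_vec(), vec![0u8; 96]); // 4 + 96
    // Idempotent overwrite (the pending-block flow): counter must not grow.
    upsert_tracked(&mut map, &mut bytes, b"root".to_vec(), vec![0u8; 96]);
    println!("{bytes}"); // prints 100
}
```

With delta accounting, the gossip-receipt → validation double write becomes a no-op for the counter instead of a permanent upward drift.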


Bug: Metadata table not tracked for writes

crates/storage/src/store.rs:404–410

set_metadata writes to Table::Metadata but never calls add_table_bytes. This is called on every set_time(), set_head(), update_checkpoints(), etc. The startup scan seeds the correct initial value, but every subsequent write is untracked. The gauge will be stale immediately after the first block is processed.

Since metadata keys are fixed-size and values are small, the drift won't be large, but it will diverge from reality. set_metadata should call add_table_bytes for the written bytes (and subtract the old value's size if overwriting an existing key).


Potential underflow wrap in sub_table_bytes

crates/storage/src/store.rs (new sub_table_bytes helper)

AtomicU64::fetch_sub wraps on underflow. If the counter ever goes below zero (due to the double-counting bug above or any future inconsistency), it wraps to near u64::MAX. The downstream set(bytes as i64) cast would then produce a large negative number in Prometheus, which is harder to notice than an obviously wrong positive value. This is unlikely in steady state but worth hardening with a saturating subtraction:

fn sub_table_bytes(&self, table: Table, bytes: u64) {
    self.table_bytes[table.index()].fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| {
        Some(v.saturating_sub(bytes))
    }).ok();
}

Minor: Misleading variable name in insert_gossip_signature

crates/storage/src/store.rs (upsert check)

let old_value_len = self
    .backend
    .begin_read()
    .expect("read view")
    .get(Table::GossipSignatures, &encoded_key)
    .expect("get");
let is_new = old_value_len.is_none();

old_value_len holds the full value bytes (Option<Vec<u8>>), not a length. The name implies intent to use the length (subtract old, add new) but only .is_none() is checked. Rename to existing or old_value, or inline the check as it was before.

On the same note: for upsert updates to an existing gossip signature, the counter is not adjusted. XMSS signatures are fixed-size so this is safe in practice, but it's fragile — a comment noting the fixed-size assumption would prevent a future regression.


Minor: TABLE_COUNT and Table::index() are manually synced

Adding a new Table variant requires updating TABLE_COUNT, ALL_TABLES, Table::index(), and Table::name() — four places. A mismatch in TABLE_COUNT would be caught at compile time (array size mismatch), but a mismatch in Table::index() vs ALL_TABLES ordering would silently assign bytes to the wrong table. Consider a compile-time or test assertion:

#[cfg(test)]
#[test]
fn table_index_matches_all_tables_order() {
    for (i, table) in ALL_TABLES.iter().enumerate() {
        assert_eq!(table.index(), i, "index() out of sync with ALL_TABLES for {:?}", table);
    }
}

Good

  • The BlockWriteBytes struct cleanly returns per-table write sizes from write_signed_block without further coupling.
  • prune_by_slot now returns (count, bytes) and handles the subtraction internally — callers that don't need the byte count use _ idiomatically.
  • scan_table_bytes correctly ignores the Metadata table's dynamic nature because it seeds all 8 tables; the startup numbers are correct.
  • Ordering::Relaxed is appropriate for metrics gauges where slight staleness is acceptable.

Automated review by Claude (Anthropic) · sonnet · custom prompt


greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR adds per-table Prometheus gauges (lean_table_bytes{table="..."}) to expose estimated live byte sizes for all 8 RocksDB column families. It avoids manual byte-counting entirely by delegating to RocksDB's built-in rocksdb.estimate-live-data-size property — an O(1) metadata lookup — queried once per block processed. The implementation is clean and avoids the wrapping/drift issues that plague manually-tracked counters.

  • Table::name() introduced for human-readable Prometheus label strings (exhaustive match, compiler-enforced).
  • StorageBackend::estimate_table_bytes() trait method with a default 0 return keeps in-memory backends source-compatible.
  • RocksDBBackend implements it via property_int_value_cf, falling back to 0 gracefully if a CF handle is missing.
  • Store::estimate_table_bytes() is a thin pass-through with no added state.
  • update_table_bytes in metrics.rs uses the same LazyLock<IntGaugeVec> pattern as all other metrics in the file; the only minor inconsistency is bytes as i64 rather than the saturating .try_into().unwrap_or(i64::MAX) used by peer metrics.

Confidence Score: 5/5

  • This PR is safe to merge; it adds read-only observability with no writes, no new state, and no impact on consensus logic.
  • The approach is minimal and correct: it delegates to a well-known O(1) RocksDB property, uses a compiler-enforced exhaustive match for table names, and introduces zero manual counter management. The single flagged issue (bytes as i64 vs try_into) is a cosmetic inconsistency with negligible real-world impact.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/blockchain/src/metrics.rs | Adds update_table_bytes function using a LazyLock<IntGaugeVec> pattern consistent with the rest of the file. Minor: bytes as i64 cast should use try_into().unwrap_or(i64::MAX) to match the saturating-conversion pattern used by other metrics in the same file. |
| crates/storage/src/api/tables.rs | Adds Table::name() returning human-readable metric label strings. Exhaustive match ensures the compiler will catch any missing variant. Strings are identical to cf_name() in rocksdb.rs but serve a distinct purpose (metric labels vs stable CF names), so the duplication is intentional. |
| crates/storage/src/api/traits.rs | Adds estimate_table_bytes to StorageBackend with a sensible default of 0, keeping in-memory backends compatible without changes. Clean, minimal, non-breaking addition. |
| crates/storage/src/backend/rocksdb.rs | Implements estimate_table_bytes via property_int_value_cf("rocksdb.estimate-live-data-size") — an O(1) metadata-only lookup. Gracefully falls back to 0 if the CF handle is not found. Correct and idiomatic. |
| crates/storage/src/store.rs | Minimal pass-through estimate_table_bytes delegating directly to the backend. No state management needed since the approach relies on RocksDB's native property rather than manual counters. |
| crates/blockchain/src/lib.rs | Iterates over all 8 tables and emits lean_table_bytes after each block is processed. The 8 O(1) RocksDB property reads per block add negligible overhead. Straightforward integration. |
| crates/storage/src/lib.rs | Adds ALL_TABLES to the crate's public API surface so the blockchain crate can iterate over all tables. One-line change, correct. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant BC as BlockChainServer
    participant Store as Store
    participant Backend as RocksDBBackend
    participant RDB as RocksDB

    BC->>BC: process_block(signed_block)
    loop for each table in ALL_TABLES
        BC->>Store: estimate_table_bytes(table)
        Store->>Backend: estimate_table_bytes(table)
        Backend->>RDB: cf_handle(cf_name(table))
        RDB-->>Backend: ColumnFamily
        Backend->>RDB: property_int_value_cf(cf, "rocksdb.estimate-live-data-size")
        RDB-->>Backend: u64 bytes
        Backend-->>Store: u64 bytes
        Store-->>BC: u64 bytes
        BC->>BC: metrics::update_table_bytes(table.name(), bytes)
    end
```
Inline review comment
Path: crates/blockchain/src/metrics.rs
Line: 323

Comment:
**Silent negative metric for very large table sizes**

`bytes as i64` is a bitwise cast — for values above `i64::MAX` (~9.2 EB) it silently produces a **negative** Prometheus gauge value rather than panicking or clamping. While astronomically unlikely today, it is inconsistent with the pattern used elsewhere in this file (e.g. `update_validators_count` uses `.try_into().unwrap()`, `set_attestation_committee_count` uses `.try_into().unwrap_or_default()`). A negative byte-size metric would be confusing and hard to detect.

Prefer a saturating conversion to be consistent and safe:

```suggestion
        .set(bytes.try_into().unwrap_or(i64::MAX));
```


Last reviewed commit: "Use RocksDB estimate..."

- Extract sum_entry_bytes() helper to deduplicate byte-summing in
  prune_old_states, prune_old_blocks, and delete_gossip_signatures
- Merge count_gossip_signatures into scan_tables to avoid scanning
  the GossipSignatures table twice at startup
- Remove dead (usize, u64) return from prune_by_slot — callers
  never used the bytes value since sub_table_bytes is called inside
- Fix double-counting bug: insert_pending_block no longer tracks
  bytes since the same keys are overwritten by insert_signed_block
- Revert unnecessary old_value_len rename in insert_gossip_signature
Replace the AtomicU64 bookkeeping approach with a direct query to
RocksDB's per-column-family property. This eliminates:
- Startup table scan (scan_tables)
- Byte tracking on every insert and prune
- BlockWriteBytes struct, sum_entry_bytes helper, add/sub_table_bytes

The StorageBackend trait gains estimate_table_bytes(table) with a
default returning 0. RocksDB implements it via property_int_value_cf
on the "rocksdb.estimate-live-data-size" property — an O(1) metadata
lookup, no scanning required.
rocksdb.estimate-live-data-size only counts flushed SST files,
so the metric stays at 0 until RocksDB triggers a compaction.
Add rocksdb.cur-size-all-mem-tables to capture in-memory data too.
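Summing the two properties can be sketched as a small helper. Here `read_property` is a stand-in for rust-rocksdb's `property_int_value_cf` (which returns `Result<Option<u64>>`), already simplified to an `Option<u64>` so the sketch runs without a database:

```rust
// Combine on-disk (SST) and in-memory (memtable) estimates for one
// column family. `read_property` stands in for the real RocksDB getter.
fn table_bytes_estimate(read_property: impl Fn(&str) -> Option<u64>) -> u64 {
    let sst = read_property("rocksdb.estimate-live-data-size").unwrap_or(0);
    let mem = read_property("rocksdb.cur-size-all-mem-tables").unwrap_or(0);
    sst.saturating_add(mem)
}

fn main() {
    // Stub: a freshly started node with nothing flushed yet — the SST
    // estimate is 0 but the memtables already hold data.
    let unflushed = |prop: &str| match prop {
        "rocksdb.cur-size-all-mem-tables" => Some(4096),
        _ => Some(0),
    };
    println!("{}", table_bytes_estimate(unflushed)); // prints 4096
}
```

This keeps the metric nonzero between startup and the first flush, which is exactly the gap the commit message describes.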