
perf: add direct-mapped node cache to BTreeMap #416

Open
sasa-tomic wants to merge 55 commits into main from perf/direct-mapped-node-cache

Conversation

@sasa-tomic (Member) commented Mar 18, 2026

Summary

  • Add a 32-slot direct-mapped node cache to BTreeMap, modeled after CPU caches: O(1) lookup via (address / page_size) % 32, collision = eviction (no LRU tracking)
  • Read paths (get, contains_key, first/last_key_value) use a take+return pattern to avoid re-loading hot upper-tree nodes from stable memory
  • Write paths invalidate affected cache slots in save_node, deallocate_node, merge, and clear_new
  • Switch get() from destructive extract_entry_at (swap_remove) to non-destructive node.value() (borrows via OnceCell)
  • Remove now-unused extract_entry_at method

This subsumes all four previous caching approaches (root-only, LRU+clone, LRU+Rc, page cache) into a single design that:

  • Has ~5 instructions overhead per cache lookup (vs ~330 for the Rc LRU's linear scan)
  • Stores Node<K> directly (no Rc, no Clone, no heap allocation per cache entry)
  • Uses cache.get_mut() on write paths (zero RefCell overhead)

Expected improvement: ~15-20% for random reads, ~65% for hot-key workloads, ~0% overhead for writes.
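The slot lookup described above is small enough to sketch; `PAGE_SIZE` and the slot count here are illustrative constants, not necessarily the PR's exact values:

```rust
// Minimal sketch of the direct-mapped slot index described above.
// PAGE_SIZE and NUM_SLOTS are assumptions for illustration.
const PAGE_SIZE: u64 = 4096;
const NUM_SLOTS: u64 = 32;

/// Map a node's stable-memory address to a cache slot in O(1),
/// like an index into a direct-mapped CPU cache. Two addresses
/// NUM_SLOTS pages apart collide: collision = eviction, no LRU.
fn slot_index(address: u64) -> usize {
    ((address / PAGE_SIZE) % NUM_SLOTS) as usize
}

fn main() {
    // Adjacent pages land in adjacent slots.
    assert_eq!(slot_index(PAGE_SIZE), 1);
    // 32 pages apart: same slot, so the newcomer evicts the occupant.
    assert_eq!(slot_index(0), slot_index(32 * PAGE_SIZE));
    println!("slot of page 5: {}", slot_index(5 * PAGE_SIZE));
}
```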

Add a 32-slot direct-mapped node cache to BTreeMap that avoids
re-loading hot nodes from stable memory. Modeled after CPU caches:
O(1) lookup via (address / page_size) % 32, collision = eviction.

Read paths (get, contains_key, first/last_key_value) use a
take+return pattern to borrow nodes from the cache without
RefCell lifetime issues. Write paths (insert, remove, split,
merge) invalidate affected cache slots.

Key changes:
- Switch get() from destructive extract_entry_at to node.value()
- Remove unused extract_entry_at method
- Change traverse() closure from Fn(&mut Node) to Fn(&Node)
- Invalidate cache in save_node, deallocate_node, merge, clear_new

Expected improvement: ~15-20% for random reads, ~65% for hot-key
workloads, ~0% overhead for writes (cache.get_mut() bypasses RefCell).
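The take+return pattern from the commit message can be sketched as follows; the types and method names here are illustrative stand-ins, not the PR's exact API:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the take+return read path described above:
// the reader *removes* the node from the cache, so no borrow of the
// cache is held while the node is used, then puts it back afterwards.
struct NodeCache {
    slots: HashMap<u64, String>, // String stands in for Node<K>
}

impl NodeCache {
    /// Remove and return the cached node for `address`, if present.
    fn take(&mut self, address: u64) -> Option<String> {
        self.slots.remove(&address)
    }
    /// Return a node to the cache after the read path is done with it.
    fn put(&mut self, address: u64, node: String) {
        self.slots.insert(address, node);
    }
}

fn main() {
    let mut cache = NodeCache { slots: HashMap::new() };
    cache.put(42, "hot upper-tree node".to_string());

    // take: cache no longer owns the node, so no borrow is held...
    let node = cache.take(42).expect("node was cached");
    // ...the read path uses it freely...
    assert_eq!(node, "hot upper-tree node");
    // ...and returns it when finished.
    cache.put(42, node);
    assert!(cache.take(42).is_some());
}
```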
@sasa-tomic sasa-tomic requested a review from a team as a code owner March 18, 2026 17:46
@sasa-tomic sasa-tomic marked this pull request as draft March 18, 2026 17:52
@github-actions bot commented Mar 19, 2026

canbench 🏋 (dir: ./benchmarks/btreeset) a910cb9 2026-03-31 14:48:48 UTC

./benchmarks/btreeset/canbench_results.yml is up to date
📦 canbench_results_btreeset.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   No significant changes 👍
    counts:   [total 100 | regressed 0 | improved 0 | new 0 | unchanged 100]
    change:   [max +1.06M | p75 0 | median 0 | p25 0 | min 0]
    change %: [max +0.20% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 100 | regressed 0 | improved 0 | new 0 | unchanged 100]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 100 | regressed 0 | improved 0 | new 0 | unchanged 100]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------
CSV results saved to canbench_results.csv

@github-actions bot commented Mar 19, 2026

canbench 🏋 (dir: ./benchmarks/nns) a910cb9 2026-03-31 14:48:49 UTC

./benchmarks/nns/canbench_results.yml is up to date
📦 canbench_results_nns.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   Improvements detected 🟢
    counts:   [total 16 | regressed 0 | improved 1 | new 0 | unchanged 15]
    change:   [max +35.76M | p75 +289 | median -1.50K | p25 -413.07K | min -82.85M]
    change %: [max +0.40% | p75 0.00% | median -0.05% | p25 -0.25% | min -2.75%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 16 | regressed 0 | improved 0 | new 0 | unchanged 16]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 16 | regressed 0 | improved 0 | new 0 | unchanged 16]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------

Only significant changes:
| status | name                              | calls |   ins |  ins Δ% | HI |  HI Δ% | SMI |  SMI Δ% |
|--------|-----------------------------------|-------|-------|---------|----|--------|-----|---------|
|   -    | vote_cascading_stable_chain_10k_5 |       | 2.92B |  -2.75% |  5 |  0.00% |   0 |   0.00% |

ins = instructions, HI = heap_increase, SMI = stable_memory_increase, Δ% = percent change

---------------------------------------------------
CSV results saved to canbench_results.csv

@github-actions bot commented Mar 19, 2026

canbench 🏋 (dir: ./benchmarks/vec) a910cb9 2026-03-31 14:48:35 UTC

./benchmarks/vec/canbench_results.yml is up to date
📦 canbench_results_vec.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   No significant changes 👍
    counts:   [total 16 | regressed 0 | improved 0 | new 0 | unchanged 16]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 16 | regressed 0 | improved 0 | new 0 | unchanged 16]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 16 | regressed 0 | improved 0 | new 0 | unchanged 16]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------
CSV results saved to canbench_results.csv

@github-actions bot commented Mar 19, 2026

canbench 🏋 (dir: ./benchmarks/memory_manager) a910cb9 2026-03-31 14:48:32 UTC

./benchmarks/memory_manager/canbench_results.yml is up to date
📦 canbench_results_memory-manager.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   No significant changes 👍
    counts:   [total 3 | regressed 0 | improved 0 | new 0 | unchanged 3]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 3 | regressed 0 | improved 0 | new 0 | unchanged 3]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 3 | regressed 0 | improved 0 | new 0 | unchanged 3]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------
CSV results saved to canbench_results.csv

@github-actions bot commented Mar 20, 2026

canbench 🏋 (dir: ./benchmarks/io_chunks) a910cb9 2026-03-31 14:49:22 UTC

./benchmarks/io_chunks/canbench_results.yml is up to date
📦 canbench_results_io_chunks.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   Improvements detected 🟢
    counts:   [total 18 | regressed 0 | improved 2 | new 0 | unchanged 16]
    change:   [max +87.27M | p75 0 | median 0 | p25 -176 | min -23.13B]
    change %: [max +0.73% | p75 0.00% | median 0.00% | p25 -0.00% | min -59.18%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 18 | regressed 0 | improved 0 | new 0 | unchanged 18]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 18 | regressed 0 | improved 0 | new 0 | unchanged 18]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------

Only significant changes:
| status | name                    | calls |     ins |  ins Δ% | HI |  HI Δ% | SMI |  SMI Δ% |
|--------|-------------------------|-------|---------|---------|----|--------|-----|---------|
|   -    | read_chunks_btreemap_1m |       |  17.81B | -56.50% |  0 |  0.00% |   0 |   0.00% |
|   -    | read_chunks_btreemap_1k |       | 203.38M | -59.18% |  0 |  0.00% |   0 |   0.00% |

ins = instructions, HI = heap_increase, SMI = stable_memory_increase, Δ% = percent change

---------------------------------------------------
CSV results saved to canbench_results.csv

@sasa-tomic sasa-tomic marked this pull request as ready for review March 20, 2026 11:23
@github-actions bot commented Mar 20, 2026

canbench 🏋 (dir: ./benchmarks/btreemap) a910cb9 2026-03-31 14:49:55 UTC

./benchmarks/btreemap/canbench_results.yml is up to date
📦 canbench_results_btreemap.csv available in artifacts

---------------------------------------------------

Summary:
  instructions:
    status:   Improvements detected 🟢
    counts:   [total 229 | regressed 0 | improved 100 | new 0 | unchanged 129]
    change:   [max +62.94M | p75 0 | median -1.62M | p25 -92.98M | min -1.48B]
    change %: [max +1.17% | p75 0.00% | median -0.21% | p25 -25.74% | min -91.42%]

  heap_increase:
    status:   No significant changes 👍
    counts:   [total 229 | regressed 0 | improved 0 | new 0 | unchanged 229]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

  stable_memory_increase:
    status:   No significant changes 👍
    counts:   [total 229 | regressed 0 | improved 0 | new 0 | unchanged 229]
    change:   [max 0 | p75 0 | median 0 | p25 0 | min 0]
    change %: [max 0.00% | p75 0.00% | median 0.00% | p25 0.00% | min 0.00%]

---------------------------------------------------

Only significant changes:
| status | name                                      | calls |     ins |  ins Δ% | HI |  HI Δ% | SMI |  SMI Δ% |
|--------|-------------------------------------------|-------|---------|---------|----|--------|-----|---------|
|   -    | btreemap_v2_pop_first_vec_32_128          |       |   1.00B |  -6.23% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_vec_32_vec128       |       |   1.00B |  -6.23% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_blob_256_128        |       |   2.56B |  -6.31% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_vec_32_128           |       | 978.64M |  -6.40% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_vec_32_vec128        |       | 978.64M |  -6.40% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_blob_256_128         |       |   2.47B |  -6.47% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_blob_32_1024         |       | 976.88M |  -7.08% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_blob_32_1024        |       |   1.01B |  -7.10% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_blob_32_128         |       | 763.85M | -10.21% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_blob_32_128          |       | 732.59M | -10.27% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_u64_u64              |       | 585.12M | -10.59% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_blob_32_0           |       | 673.85M | -10.76% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_u64_u64             |       | 604.78M | -10.84% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_blob_8_128          |       | 530.70M | -11.22% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_blob_32_0            |       | 644.48M | -11.35% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_principal            |       | 691.07M | -11.88% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_last_blob_8_128           |       | 520.66M | -11.99% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_pop_first_principal           |       | 705.52M | -12.27% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_blob_32_0            |       | 286.88M | -12.95% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_blob_32_1024         |       | 280.85M | -13.87% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_u64_blob8                 |       | 193.98M | -14.34% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_32_0                 |       | 288.59M | -14.48% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_100k_u64_u64              |       |   2.32B | -16.10% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_32_1024              |       | 292.71M | -16.49% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_vec_128_128               |       | 479.29M | -16.79% |  0 |  0.00% |   0 |   0.00% |
|  ...   | ... 50 rows omitted ...                   |       |         |         |    |        |     |         |
|   -    | btreemap_v2_get_blob_1024_128             |       |   2.97B | -33.30% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_256_128              |       | 903.19M | -33.93% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_4_128                |       | 168.23M | -34.00% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_zipf_10k_u64_u64          |       | 140.00M | -37.01% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_blob_32_1024  |       | 255.02M | -42.74% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_blob_256_128  |       | 352.58M | -46.79% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_vec_4_128                 |       | 208.49M | -49.16% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_blob_32_128    |       | 122.35M | -54.16% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_blob_32_1024   |       | 168.95M | -56.65% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_zipf_heavy_10k_u64_u64    |       |  87.48M | -59.26% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_blob_32_128   |       | 106.21M | -61.42% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_u64_u64       |       |  84.89M | -63.27% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_blob_32_0     |       | 100.51M | -65.11% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_vec_32_128     |       |  73.27M | -66.86% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_vec_32_vec128  |       |  73.27M | -66.86% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_u64_u64        |       |  81.20M | -66.90% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_blob_256_128   |       | 150.71M | -76.51% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_blob_8_128    |       |  55.70M | -80.92% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_vec_32_128    |       |  38.66M | -82.98% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_vec_32_vec128 |       |  38.66M | -82.98% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_blob_8_128     |       |  52.86M | -83.54% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_blob_32_0      |       |  32.54M | -85.27% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_last_key_value_principal      |       |  32.06M | -85.28% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_10mib_values         |       |  18.48M | -87.00% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_first_key_value_principal     |       |  31.52M | -91.42% |  0 |  0.00% |   0 |   0.00% |

ins = instructions, HI = heap_increase, SMI = stable_memory_increase, Δ% = percent change

---------------------------------------------------
CSV results saved to canbench_results.csv

src/btreemap.rs Outdated
Comment on lines +169 to +172
if !self.is_enabled() {
self.misses += 1;
return None;
}

Just an idea, feel free to ignore: rather than disable the cache (and incur the cost of a branch), might it be preferable to have a cache of size 1?

(As a side note, I would actually be curious to see the difference in benchmark scores from adding this and the stats. Maybe it's all a storm in a teacup. But maybe not.)

Member Author

A single-entry cache would be strictly more computationally expensive, and is unlikely to give any benefit since we'd keep overwriting it all the time.
The size-0 execution path will be correctly predicted every time and costs ~0 cycles in practice.
A size-1 cache, on the other hand, would still allocate a Node on the heap and would collide on every operation (every node maps to slot 0), producing worse miss behaviour than even a small real cache.

So I'd recommend we go with size-0 by default, in the next version. And then we can turn it on to size-32 or larger by default once we get some feedback from production runs.

Contributor

Btw, I now see in the benchmarks with the default 0-size cache some increase in instruction cost, up to +8%.
Would it be possible to add some kind of short circuit so no extra work is done when the cache size is zero?


btw I now see in the benchmarks with default 0-size cache size some increase in instruction cost, up to +8%.

This is what I was referring to. Sure, a size-1 cache is going to be more expensive than a size-0 cache. But having the option of a size-0 cache makes all other cache sizes more expensive, and the size-0 cache is more expensive than not having this feature at all.

So I guess it's a trade-off between optimizing for no caching vs. optimizing for caching. I was going for the latter, since caching appears to make a positive change in performance, but we could also stick with no caching as the default.

Member Author

By inlining one function, we're now roughly in the negative range (marginally improved performance) even with the cache disabled.

Contributor

re: size 1 cache

In the current implementation, the direct-mapped cache does not prioritise top-level nodes over lower-level nodes, meaning that a cache of size 1 will always be overwritten on each tree traversal and will always have 0 hits, which is a pure waste of cycles.

Moreover, any cache smaller than the tree height will have 0 hits. So the rough range of cache sizes that makes the cache practically usable:

  • lower bound: at least the tree height, maybe x2 to account for collisions
  • upper bound: any cache size that reaches a 90-95% hit ratio; above that the ROI is too small

(Optional exploration idea) Based on that, one interesting idea to explore (not now, in future PRs) is to also store in the cache each node's height (or level) and prioritize nodes closer to the root over lower-level nodes. Or, if we decide to use a set-associative cache, store cache lines of different levels with prioritization by level.

Contributor

Added storing and checking depth in the node cache; it showed a good performance boost:

Only significant changes:
| status | name                                            | calls |     ins |  ins Δ% | HI |  HI Δ% | SMI |  SMI Δ% |
|--------|-------------------------------------------------|-------|---------|---------|----|--------|-----|---------|
|   -    | btreemap_v2_get_vec_32_128_cached_32entry       |       | 257.72M | -13.20% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_vec_32_128_cached_32entry  |       | 251.19M | -13.50% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_u64_u64_cached_32entry          |       | 140.77M | -15.96% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_u64_u64_cached_32entry     |       | 136.11M | -16.42% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_32_128_cached_32entry      |       | 206.55M | -17.49% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_blob_32_128_cached_32entry |       | 201.95M | -17.79% |  0 |  0.00% |   0 |   0.00% |

ins = instructions, HI = heap_increase, SMI = stable_memory_increase, Δ% = percent change

Member Author

Agreed — with depth-aware eviction the default now makes more sense as non-zero. Changed default to 16 slots, which covers the top 2 tree levels. The NodeCache::take/put/invalidate methods now early-return when disabled (0 slots), so there's no branch cost in the btreemap code itself.
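A minimal sketch of that self-guarding pattern, with assumed names and a plain Vec standing in for the real slot array:

```rust
// Sketch (assumed names) of the pattern described above: NodeCache
// methods early-return when the cache has 0 slots, so BTreeMap call
// sites need no is_enabled() checks of their own.
struct NodeCache {
    slots: Vec<Option<u64>>, // u64 stands in for a cached Node<K>
}

impl NodeCache {
    fn is_enabled(&self) -> bool {
        !self.slots.is_empty()
    }

    fn take(&mut self, address: u64) -> Option<u64> {
        if !self.is_enabled() {
            return None; // disabled: one predictable branch, nothing else
        }
        let idx = (address as usize) % self.slots.len();
        self.slots[idx].take()
    }

    fn put(&mut self, address: u64, node: u64) {
        if !self.is_enabled() {
            return; // disabled: drop the node instead of caching it
        }
        let idx = (address as usize) % self.slots.len();
        self.slots[idx] = Some(node);
    }
}

fn main() {
    // Disabled cache (0 slots): every call is a cheap no-op.
    let mut off = NodeCache { slots: vec![] };
    off.put(7, 700);
    assert!(off.take(7).is_none());

    // Enabled cache (16 slots, the new default): put then take hits.
    let mut on = NodeCache { slots: vec![None; 16] };
    on.put(7, 700);
    assert_eq!(on.take(7), Some(700));
}
```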

src/btreemap.rs Outdated
Comment on lines +486 to +488
pub fn cache_stats(&self) -> CacheStats {
self.cache.borrow().stats()
}

Might be useful to return the cache size in bytes. The end user might have a hard time figuring out the average node size or even just the page size.

Contributor

There have been some requests from users to add methods to stable structures that report memory usage: heap usage, stable memory allocated, and stable memory actually used. If that is implemented, it also covers any cache, so maybe the cache size should not be reported separately. Also, if the user configures the cache size, it's expected to be full.

I'm fine if this PR does not implement reporting the cache size.

Member Author

Reporting the cache size in bytes is easy and cheap, so I just added it.

Member Author

It wasn't so easy after all; I'll look for another way to do it.

Contributor

I added a MemSize trait implementation to this branch.
It calculates heap usage with (in my opinion) an acceptable tradeoff between cost and precision.
Please take a look and let me know what you think. It's OK to revert my changes if they don't work.

@maksymar (Contributor) commented Mar 29, 2026

I implemented an accurate memory calculation for heap usage and compared it against the approximate one: if value sizes were small, the difference was x2-x4, but if the values were big (e.g. a u32 key with a 900-byte blob value) it went to x40 and above.

I also removed the whole mem_size implementation to see how many extra instructions it adds; it was between +12% and +24%, which is pretty significant.

Only significant changes:
| status | name                                            | calls |     ins |  ins Δ% | HI |  HI Δ% | SMI |  SMI Δ% |
|--------|-------------------------------------------------|-------|---------|---------|----|--------|-----|---------|
|   -    | btreemap_v2_get_vec_32_128_cached_32entry       |       | 296.90M | -12.42% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_vec_32_128_cached_32entry  |       | 290.40M | -12.67% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_blob_32_128_cached_32entry      |       | 250.32M | -13.54% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_blob_32_128_cached_32entry |       | 245.66M | -13.76% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_get_u64_u64_cached_32entry          |       | 167.51M | -24.02% |  0 |  0.00% |   0 |   0.00% |
|   -    | btreemap_v2_contains_u64_u64_cached_32entry     |       | 162.86M | -24.54% |  0 |  0.00% |   0 |   0.00% |

ins = instructions, HI = heap_increase, SMI = stable_memory_increase, Δ% = percent change

Now I've changed my mind: even though the approximate value is not exactly accurate, it is still a possible upper bound the user should take into consideration (because in some cases values can get cached).

I am fine with going with the less accurate but much simpler and faster solution. Sorry it took some time debating and coding this topic; we can always add a more accurate calculation later if we find a better way to do it, no need to get blocked on it now.


Just a question for my own illumination:

I also removed the whole mem_size implementation to see how much extra instructions it will provide and it was between +12% and +24% which is pretty significant.

Is this on common workloads (i.e. do we call mem_size() frequently) or is this only the performance of computing the exact usage vs a (potentially quite bad) approximation?

Contributor

It's mostly because computing the exact usage turned out to be expensive, especially for keys: at the btreemap level they are used in their original type, so the only way to get a key's size is to serialize it via to_bytes().len().
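A hypothetical illustration of that tradeoff; `to_bytes` here is a stand-in for the real Storable encoding, not the library's actual signature:

```rust
// Hypothetical sketch of the exact-vs-approximate sizing tradeoff
// discussed above. The exact path serializes every key just to
// measure it (one allocation per key); the approximate path does no
// serialization at all.
fn to_bytes(key: &u64) -> Vec<u8> {
    key.to_le_bytes().to_vec() // stand-in for the Storable encoding
}

/// Exact: serialize every key -- the expensive path.
fn exact_keys_size(keys: &[u64]) -> usize {
    keys.iter().map(|k| to_bytes(k).len()).sum()
}

/// Approximate: a flat per-key estimate, no serialization.
fn approx_keys_size(keys: &[u64], estimate_per_key: usize) -> usize {
    keys.len() * estimate_per_key
}

fn main() {
    let keys: Vec<u64> = (0..1000).collect();
    assert_eq!(exact_keys_size(&keys), 8000); // 1000 keys x 8 bytes
    assert_eq!(approx_keys_size(&keys, 8), 8000);
}
```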

Member Author

Acknowledged. node_cache_size_bytes_approx is still available as a rough estimate. The exact-size MemSize approach was too expensive as you found (+12-24% instructions), so the approximate method stays as an order-of-magnitude guide.

@maksymar (Contributor)

FYI: I'll resolve merge conflicts.

@sasa-tomic (Member Author) left a comment

Addressing open review comments with code changes and replies.

…ard NULL

- Change DEFAULT_NODE_CACHE_NUM_SLOTS from 0 to 16 (covers top 2 tree levels)
- Remove redundant cache_num_slots field from BTreeMap; NodeCache methods
  now self-guard via is_enabled() and callers no longer need checks
- Add NULL address guard in NodeCache::take to prevent false cache hits
@maksymar (Contributor) left a comment

Great results! 🎉 Thank you!

// compete on pure recency: a non-root node can evict any other
// non-root node. This lets hot leaves and working-set interior
// nodes stay cached while stale upper-level nodes get displaced.
if slot.node.is_none() || depth == 0 || slot.depth > 0 {
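The quoted condition can be exercised in a small standalone sketch; the Slot type and field names here are assumptions for illustration:

```rust
// Hedged sketch of the depth-aware insertion rule quoted above;
// Slot and its fields are illustrative, not the PR's exact types.
struct Slot {
    node: Option<&'static str>, // stand-in for the cached Node<K>
    depth: u8,                  // 0 = root, larger = deeper in the tree
}

/// Decide whether a node at `depth` may overwrite what `slot` holds:
/// empty slots always accept; the root (depth 0) evicts anything;
/// non-root nodes may only evict other non-root nodes, so they
/// compete on pure recency without displacing the root.
fn may_evict(slot: &Slot, depth: u8) -> bool {
    slot.node.is_none() || depth == 0 || slot.depth > 0
}

fn main() {
    let root_slot = Slot { node: Some("root"), depth: 0 };
    let leaf_slot = Slot { node: Some("leaf"), depth: 3 };

    assert!(may_evict(&leaf_slot, 2));  // interior node evicts a leaf
    assert!(!may_evict(&root_slot, 3)); // a leaf cannot displace the root
    assert!(may_evict(&root_slot, 0));  // a root can replace the root
}
```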
@maksymar (Contributor) Mar 31, 2026

FYI: I was thinking about this idea of always protecting nodes with lower depth, and realized it may not work very well for bigger caches, because it will always populate the top of the tree while ignoring hot paths.

So instead I think it would be reasonable to protect only a few top levels (0-1) and let nodes on other levels compete freely. It should also work with the default size of 16.

@maksymar (Contributor) commented Mar 31, 2026

[DONE]
I would suggest waiting to merge this PR until I merge another PR that improves the benchmarks.
Currently our benchmarks do not cover potential cache cases, so I would like to remove some excessive benchmarks and add cache-specific use cases, so that we can also see the difference on this PR.

@maksymar (Contributor)

The chart shows how increasing node cache size reduces instruction count (relative to no cache) across operation categories.

  • first/last & pop plateau almost immediately because they need at most tree-height nodes in the cache
  • get and contains keep improving steadily with the size of the cache
  • the remaining categories are not currently supported by caching, but probably will be in the future

From this data I'm happy to see that the default of 16 is already good enough while being reasonably small; at the same time, bigger caches will provide more value.

[chart: chart_by_category]
