Add Iterator support for BinningAutoBatcher #544
craigxchen wants to merge 11 commits into TorchSim:main from
Conversation
BinningAutoBatcher.load_states now accepts Iterator[T] in addition to Sequence[T] and SimState. When given an iterator/generator, it pulls sample_size states at a time, bins them, yields the bins, then automatically pulls the next chunk — enabling processing of large datasets that don't fit in memory at once. Closes TorchSim#275 https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
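The chunked-pull behavior described above can be sketched in isolation. This is an illustrative sketch, not TorchSim's actual internals: `pull_chunks` and its signature are hypothetical names standing in for the logic that draws `sample_size` states at a time from an iterator.

```python
from collections.abc import Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def pull_chunks(states: Iterator[T], sample_size: int) -> Iterator[list[T]]:
    """Pull up to `sample_size` states at a time from an iterator.

    Hypothetical sketch of the chunked-pull pattern: each chunk can be
    binned and yielded before the next chunk is drawn, so the full
    dataset never needs to be materialized in memory.
    """
    while chunk := list(islice(states, sample_size)):
        yield chunk


# e.g. a generator over a large dataset that is never fully materialized
chunks = list(pull_chunks(iter(range(7)), sample_size=3))
```

The final chunk may be smaller than `sample_size`; here `range(7)` with `sample_size=3` yields chunks of 3, 3, and 1 states.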
It's a bit difficult to read but it seems correct. My thoughts on this:
To make it clearer: the dataset can either pre-compute all metrics and perform optimal batch construction, or compute them on the fly and perform greedy batch construction, which I think is fine in most cases.
Tests two claims from review of the autobatching streaming PR:
1. Batching throughput saturates past a threshold
2. Optimal bin-packing isn't much better than greedy in practice

Results on CPU+LJ show greedy matches optimal for realistic distributions; need GPU+MLIP runs for definitive numbers.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
I think the Dataset approach makes sense as a longer-term solution, but that's probably a much larger refactor in scope. I ran a quick test on CPU and it confirms your suspicion that optimal packing doesn't really help much. I'll make that change for now since it simplifies things, and maybe we can save the Dataset refactor for a different PR?
Replaces to_constant_volume_bins optimal bin-packing with greedy first-fit in both eager and streaming paths, per review feedback on TorchSim#275.

Rationale (see examples/benchmarking/batching_strategy.py for data):
- Greedy and optimal produce identical bin counts on realistic distributions
- Even on adversarial distributions, greedy was slightly faster in wall time
- Removes the arbitrary sample_size parameter (no longer needed)
- Simplifies the streaming implementation to mirror InFlightAutoBatcher

Changes:
- Remove sample_size, _bin_and_prepare, _load_next_chunk
- Add _load_eager (for SimState/Sequence) and _load_streaming (for Iterator)
- Greedy packing preserves input order (nicer for debugging)
- _all_index_bins folded into self.index_bins (populated incrementally in the streaming path)
- to_constant_volume_bins retained as a standalone utility

Test: streaming test renamed from _streaming_multiple_chunks to _streaming_multiple_batches and updated to not depend on sample_size.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
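Greedy first-fit is simple enough to sketch in a few lines. This is an illustrative, self-contained version (TorchSim packs states by atom count, not bare integers, and the function name is hypothetical):

```python
def first_fit(sizes: list[int], capacity: int) -> list[list[int]]:
    """Greedy first-fit bin packing: place each item into the first bin
    with room for it, opening a new bin if none fits.

    Processes items in input order, so the resulting bins preserve the
    original ordering of the dataset (nicer for debugging, as the commit
    message above notes).
    """
    bins: list[list[int]] = []  # item sizes assigned to each bin
    loads: list[int] = []       # running total per bin
    for size in sizes:
        for i, load in enumerate(loads):
            if load + size <= capacity:
                bins[i].append(size)
                loads[i] += size
                break
        else:  # no existing bin had room: open a new one
            bins.append([size])
            loads.append(size)
    return bins


packed = first_fit([5, 4, 3, 2, 1], capacity=6)
```

Unlike global optimal packing, first-fit never revisits earlier decisions, which is what makes it compatible with streaming: each state can be placed as soon as it is drawn from the iterator.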
On CPU it certainly changes nothing, because the CPU can't batch operations. The GPU can, but throughput is maximized relatively quickly, in the sense that inference does not need the full extent of your VRAM. On an H200, you can run around 50_000 atoms (depends on the MLIP, of course), but you probably hit 100% GPU utilization with barely 5_000 atoms.
Waiting for other contributors to share their opinion on this. I'm fine with this change for now, and yes, the Dataset belongs in another PR, although I don't think that's a massive amount of work. But the plan around it needs to be carefully set up; Claude can do the implementation.
orionarcher left a comment
I have mixed feelings about excising the bin packing from the BinningAutoBatcher. It's a nice feature that in the best case can offer significant packing efficiencies. That said, I agree with @thomasloux that optimal batching isn't a crazy speedup. IMO the best-case scenario is that bin packing reduces the number of batches that must run, for example from 4 poorly packed to 3 well packed, which could be a significant speedup. I don't think that the streaming advantage justifies removing bin packing, even though I floated that idea in #275.
I think a better solution would be to sample n_systems states, pack them, run them, then sample another n systems. n can be settable with a high default value, essentially preserving the current default behavior.
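The buffered compromise proposed above could look roughly like this sketch. All names here are hypothetical, and a simple first-fit stands in for whatever packing strategy runs on the bounded buffer:

```python
from collections.abc import Callable, Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def buffered_batches(
    states: Iterable[T],
    n_systems: int,
    capacity: int,
    atom_count: Callable[[T], int],
) -> Iterator[list[T]]:
    """Sketch of the buffered approach: draw up to `n_systems` states,
    bin-pack within that buffer, yield the resulting bins, then refill.

    With a large default `n_systems`, the whole dataset fits in one
    buffer and this degenerates to the current eager behavior.
    """
    it = iter(states)
    while buffer := list(islice(it, n_systems)):
        # any packing strategy can run on the bounded buffer; here a
        # simple first-fit stands in for optimal packing
        bins: list[list[T]] = []
        loads: list[int] = []
        for s in buffer:
            for i, load in enumerate(loads):
                if load + atom_count(s) <= capacity:
                    bins[i].append(s)
                    loads[i] += atom_count(s)
                    break
            else:
                bins.append([s])
                loads.append(atom_count(s))
        yield from bins


# toy example: states are just atom counts here
batches = list(
    buffered_batches([3, 3, 3, 3], n_systems=2, capacity=6, atom_count=lambda s: s)
)
```

The trade-off is that packing quality is bounded by the buffer: items in different buffers can never share a bin, so a small `n_systems` loses some of the global-packing advantage while capping peak memory.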
Also, I haven't dug deeply into the code but it strikes me as fairly verbose for a part of the package that is so central and highly scrutinized. I have misgivings about injecting vibe coded solutions into the package core without careful line by line review. (though very supportive of vibe coding in general)
```python
def main() -> None:
    """Run all three benchmark experiments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device",
        default=None,
        help="Override device (default: cuda if available else cpu)",
```
I hesitate to include this example script if it's never actually been run on GPU. If it's just added code that's never been operationalized I don't know that it belongs in the package. IMO the examples should have been at some point very useful to merit inclusion. I'd advocate leaving this script out.
```python
        f"{t_grd / n_sys * 1000:>10.3f}"
    )
    delta_pct = (t_grd - t_opt) / t_opt * 100
    print(f"\nGreedy is {delta_pct:+.1f}% slower than optimal (negative = faster).")
```
For example, some of the "conclusions" are already baked in
Good catch, thank you.
This was exactly what @craigxchen implemented first. I'm not a fan, because I don't think it is crucial. My own use case is running 1000 crystals with approximately 10 atoms each on an H200 with 160 GB of VRAM. In this case, even an outlier system with 1000 atoms is not a problem in terms of optimal packing, because there are so many small systems. But overall I'm fine with both implementations.
I'll defer to you guys on the details since I haven't run enough simulations myself to get a good feel for whether I prefer the first or second approach. Regardless, let me clean up the vibe-code a bit and we can evaluate from there.
So right now the iterator path does greedy batching, but for List/Sequence inputs I've reverted back to the original global bin packing. The batching logic is split into eager vs. streaming helper functions, so if we want to make adjustments later it should be straightforward.
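The eager/streaming split can be illustrated with a type dispatch like the following sketch. The helper names `_load_eager` and `_load_streaming` come from the commit message above; everything else is hypothetical, and the SimState branch is omitted for brevity:

```python
from collections.abc import Iterator, Sequence


def pick_load_path(states) -> str:
    """Sketch of the dispatch described above: Sequence inputs take the
    eager (global bin-packing) path, bare iterators/generators take the
    streaming (greedy) path. Returns the helper name for illustration."""
    if isinstance(states, Sequence):
        return "_load_eager"      # full list available: pack globally
    if isinstance(states, Iterator):
        return "_load_streaming"  # consume lazily: greedy per chunk
    raise TypeError(f"unsupported input type: {type(states).__name__}")


eager = pick_load_path([1, 2, 3])          # a list is a Sequence
streaming = pick_load_path(iter([1, 2]))   # an iterator is not
```

Note the order of the checks matters only for types that are both; in practice a list hits the `Sequence` branch and a generator hits the `Iterator` branch, which matches the eager/streaming split described above.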
I did have a WIP agent session looking at #275. What I did was define an ABC for all autobatchers and then introduce the greedy version as a new class, leaving the existing binned autobatcher mostly unchanged. The principal reason I didn't open the PR was that, even though the iterator feels nice, all the tensors still end up just sitting on the GPU, and I wasn't sure there was an actual memory benefit. I was doing pure greedy, but I think this partial greedy with the buffer is better.
Closes #275
New Features:
Implementation:
Testing:
Added comprehensive tests for iterator/generator support:
test_binning_auto_batcher_with_iterator() - Basic iterator input
test_binning_auto_batcher_with_generator() - Generator function input
test_binning_auto_batcher_streaming_multiple_chunks() - Multi-chunk streaming with small sample_size
test_binning_auto_batcher_empty_iterator() - Error handling for empty iterators
All new tests verify correct batching, state counts, and restore_original_order() functionality across streaming chunks.
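The invariant that restore_original_order() must preserve across streaming chunks can be sketched as follows. This is an illustrative model of what the tests verify, not the actual implementation; the per-bin index lists mirror the self.index_bins bookkeeping mentioned in the commit message:

```python
def restore_original_order(result_bins, index_bins):
    """Sketch: un-bin per-batch results back into input order using the
    recorded original indices for each bin. Works regardless of how
    states were shuffled into bins across streaming chunks."""
    by_index = {}
    for results, indices in zip(result_bins, index_bins):
        for result, idx in zip(results, indices):
            by_index[idx] = result
    return [by_index[i] for i in sorted(by_index)]


# states 0..3 were packed out of order into two bins
restored = restore_original_order(
    result_bins=[["b", "d"], ["a", "c"]],
    index_bins=[[1, 3], [0, 2]],
)
```

Whatever binning strategy runs (global, greedy, or chunked), as long as the original index of each state is recorded when it is binned, the caller gets results back in submission order.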