Add Iterator support for BinningAutoBatcher #544
craigxchen wants to merge 11 commits into TorchSim:main from
Conversation
BinningAutoBatcher.load_states now accepts Iterator[T] in addition to Sequence[T] and SimState. When given an iterator/generator, it pulls sample_size states at a time, bins them, yields the bins, then automatically pulls the next chunk — enabling processing of large datasets that don't fit in memory at once. Closes TorchSim#275 https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
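The chunked-pull behavior described above can be sketched in isolation. This is an illustrative sketch, not TorchSim's actual internals: `pull_chunks` and its signature are hypothetical names standing in for the logic that draws `sample_size` states at a time from an iterator.

```python
from collections.abc import Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def pull_chunks(states: Iterator[T], sample_size: int) -> Iterator[list[T]]:
    """Pull up to `sample_size` states at a time from an iterator.

    Hypothetical sketch of the chunked-pull pattern: each chunk can be
    binned and yielded before the next chunk is drawn, so the full
    dataset never needs to be materialized in memory.
    """
    while chunk := list(islice(states, sample_size)):
        yield chunk


# e.g. a generator over a large dataset that is never fully materialized
chunks = list(pull_chunks(iter(range(7)), sample_size=3))
```

The final chunk may be smaller than `sample_size`; here `range(7)` with `sample_size=3` yields chunks of 3, 3, and 1 states.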
It's a bit difficult to read but it seems correct. My thoughts on this:
To make it clearer: the dataset can either pre-compute all metrics and perform optimal batch construction, or compute them on the fly and perform greedy batch construction, which I think is fine in most cases.
Tests two claims from review of the autobatching streaming PR:
1. Batching throughput saturates past a threshold
2. Optimal bin-packing isn't much better than greedy in practice

Results on CPU+LJ show greedy matches optimal for realistic distributions; need GPU+MLIP runs for definitive numbers.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
I think the Dataset approach makes sense as a longer-term solution, but that's probably a much larger refactor in scope. I ran a quick test on CPU and it confirms your suspicion that optimal packing doesn't really help much. I'll make that change for now since it simplifies things, and maybe we can save the Dataset refactor for a different PR?
Replaces to_constant_volume_bins optimal bin-packing with greedy first-fit in both eager and streaming paths, per review feedback on TorchSim#275.

Rationale (see examples/benchmarking/batching_strategy.py for data):
- Greedy and optimal produce identical bin counts on realistic distributions
- Even on adversarial distributions, greedy was slightly faster in wall time
- Removes the arbitrary sample_size parameter (no longer needed)
- Simplifies the streaming implementation to mirror InFlightAutoBatcher

Changes:
- Remove sample_size, _bin_and_prepare, _load_next_chunk
- Add _load_eager (for SimState/Sequence) and _load_streaming (for Iterator)
- Greedy packing preserves input order (nicer for debugging)
- _all_index_bins folded into self.index_bins (populated incrementally in the streaming path)
- to_constant_volume_bins retained as a standalone utility

Test: streaming test renamed from _streaming_multiple_chunks to _streaming_multiple_batches and updated to not depend on sample_size.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
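Greedy first-fit is simple enough to sketch in a few lines. This is an illustrative, self-contained version (TorchSim packs states by atom count, not bare integers, and the function name is hypothetical):

```python
def first_fit(sizes: list[int], capacity: int) -> list[list[int]]:
    """Greedy first-fit bin packing: place each item into the first bin
    with room for it, opening a new bin if none fits.

    Processes items in input order, so the resulting bins preserve the
    original ordering of the dataset (nicer for debugging, as the commit
    message above notes).
    """
    bins: list[list[int]] = []  # item sizes assigned to each bin
    loads: list[int] = []       # running total per bin
    for size in sizes:
        for i, load in enumerate(loads):
            if load + size <= capacity:
                bins[i].append(size)
                loads[i] += size
                break
        else:  # no existing bin had room: open a new one
            bins.append([size])
            loads.append(size)
    return bins


packed = first_fit([5, 4, 3, 2, 1], capacity=6)
```

Unlike global optimal packing, first-fit never revisits earlier decisions, which is what makes it compatible with streaming: each state can be placed as soon as it is drawn from the iterator.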
On CPU it certainly changes nothing, because the CPU can't batch operations. The GPU can, but throughput is maximized relatively quickly, in the sense that inference does not need the full extent of your VRAM. On an H200, you can run around 50_000 atoms (depends on the MLIP, of course), but you probably hit 100% GPU utilization with barely 5_000 atoms.
Waiting for other contributors to share their opinion on this. I'm fine with this change for now, and yes, the Dataset belongs in another PR, although I don't think that's a massive amount of work. But the plan around it needs to be carefully set up; Claude can do the implementation.
orionarcher left a comment
I have mixed feelings about excising the bin packing from the BinningAutoBatcher. It's a nice feature that in the best case can offer significant packing efficiencies. That said, I agree with @thomasloux that optimal batching isn't a crazy speedup. IMO the best-case scenario is that bin packing reduces the number of batches that must run, for example from 4 poorly packed to 3 well packed, which could be a significant speedup. I don't think that the streaming advantage justifies removing bin packing, even though I floated that idea in #275.
I think a better solution would be to sample n_systems states, pack them, run them, then sample another n systems. n can be settable with a high default value, essentially preserving the current default behavior.
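The buffered compromise proposed above could look roughly like this sketch. All names here are hypothetical, and a simple first-fit stands in for whatever packing strategy runs on the bounded buffer:

```python
from collections.abc import Callable, Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def buffered_batches(
    states: Iterable[T],
    n_systems: int,
    capacity: int,
    atom_count: Callable[[T], int],
) -> Iterator[list[T]]:
    """Sketch of the buffered approach: draw up to `n_systems` states,
    bin-pack within that buffer, yield the resulting bins, then refill.

    With a large default `n_systems`, the whole dataset fits in one
    buffer and this degenerates to the current eager behavior.
    """
    it = iter(states)
    while buffer := list(islice(it, n_systems)):
        # any packing strategy can run on the bounded buffer; here a
        # simple first-fit stands in for optimal packing
        bins: list[list[T]] = []
        loads: list[int] = []
        for s in buffer:
            for i, load in enumerate(loads):
                if load + atom_count(s) <= capacity:
                    bins[i].append(s)
                    loads[i] += atom_count(s)
                    break
            else:
                bins.append([s])
                loads.append(atom_count(s))
        yield from bins


# toy example: states are just atom counts here
batches = list(
    buffered_batches([3, 3, 3, 3], n_systems=2, capacity=6, atom_count=lambda s: s)
)
```

The trade-off is that packing quality is bounded by the buffer: items in different buffers can never share a bin, so a small `n_systems` loses some of the global-packing advantage while capping peak memory.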
Also, I haven't dug deeply into the code but it strikes me as fairly verbose for a part of the package that is so central and highly scrutinized. I have misgivings about injecting vibe coded solutions into the package core without careful line by line review. (though very supportive of vibe coding in general)
```python
def main() -> None:
    """Run all three benchmark experiments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device",
        default=None,
        help="Override device (default: cuda if available else cpu)",
```
I hesitate to include this example script if it's never actually been run on GPU. If it's just added code that's never been operationalized I don't know that it belongs in the package. IMO the examples should have been at some point very useful to merit inclusion. I'd advocate leaving this script out.
```python
        f"{t_grd / n_sys * 1000:>10.3f}"
    )
    delta_pct = (t_grd - t_opt) / t_opt * 100
    print(f"\nGreedy is {delta_pct:+.1f}% slower than optimal (negative = faster).")
```
For example, some of the "conclusions" are already baked in
Good catch, thank you.
This was exactly what @craigxchen implemented first. I'm not a fan, because I don't think it is crucial. My own use case is running 1000 crystals with approximately 10 atoms each on an H200 with 160 GB of VRAM. In this case, even an outlier system with 1000 atoms is not a problem in terms of optimal packing, because there are so many small systems. But overall I'm fine with both implementations.
I'll defer to you guys on the details since I haven't run enough simulations myself to get a good feel for whether I prefer the first or second approach. Regardless, let me clean up the vibe-code a bit and we can evaluate from there.
So right now the iterator path does greedy batching, but for List/Sequence inputs I've reverted back to the original global bin packing. The batching logic is split into eager vs. streaming helper functions, so if we want to make adjustments later it should be straightforward.
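The eager/streaming split can be illustrated with a type dispatch like the following sketch. The helper names `_load_eager` and `_load_streaming` come from the commit message above; everything else is hypothetical, and the SimState branch is omitted for brevity:

```python
from collections.abc import Iterator, Sequence


def pick_load_path(states) -> str:
    """Sketch of the dispatch described above: Sequence inputs take the
    eager (global bin-packing) path, bare iterators/generators take the
    streaming (greedy) path. Returns the helper name for illustration."""
    if isinstance(states, Sequence):
        return "_load_eager"      # full list available: pack globally
    if isinstance(states, Iterator):
        return "_load_streaming"  # consume lazily: greedy per chunk
    raise TypeError(f"unsupported input type: {type(states).__name__}")


eager = pick_load_path([1, 2, 3])          # a list is a Sequence
streaming = pick_load_path(iter([1, 2]))   # an iterator is not
```

Note the order of the checks matters only for types that are both; in practice a list hits the `Sequence` branch and a generator hits the `Iterator` branch, which matches the eager/streaming split described above.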
I did have a WIP agent session looking at #275. What I did was define an ABC for all autobatchers and then introduce the greedy version as a new class, leaving the existing binned autobatcher mostly unchanged. The principal reason I didn't open the PR was that, even though the iterator feels nice, all the tensors still end up just sitting on the GPU, and I wasn't sure there was an actual memory benefit. I was doing pure greedy, but I think this partial greedy with the buffer is better.
Closes #275
New Features:
Implementation:
Testing:
Added comprehensive tests for iterator/generator support:
test_binning_auto_batcher_with_iterator() - Basic iterator input
test_binning_auto_batcher_with_generator() - Generator function input
test_binning_auto_batcher_streaming_multiple_chunks() - Multi-chunk streaming with small sample_size
test_binning_auto_batcher_empty_iterator() - Error handling for empty iterators
All new tests verify correct batching, state counts, and restore_original_order() functionality across streaming chunks.
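The invariant that restore_original_order() must preserve across streaming chunks can be sketched as follows. This is an illustrative model of what the tests verify, not the actual implementation; the per-bin index lists mirror the self.index_bins bookkeeping mentioned in the commit message:

```python
def restore_original_order(result_bins, index_bins):
    """Sketch: un-bin per-batch results back into input order using the
    recorded original indices for each bin. Works regardless of how
    states were shuffled into bins across streaming chunks."""
    by_index = {}
    for results, indices in zip(result_bins, index_bins):
        for result, idx in zip(results, indices):
            by_index[idx] = result
    return [by_index[i] for i in sorted(by_index)]


# states 0..3 were packed out of order into two bins
restored = restore_original_order(
    result_bins=[["b", "d"], ["a", "c"]],
    index_bins=[[1, 3], [0, 2]],
)
```

Whatever binning strategy runs (global, greedy, or chunked), as long as the original index of each state is recorded when it is binned, the caller gets results back in submission order.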