Add Iterator support for BinningAutoBatcher #544

Open
craigxchen wants to merge 11 commits into TorchSim:main from craigxchen:claude/fix-issue-275-cMO0x

Conversation

@craigxchen
Contributor

Closes #275

New Features:

  • load_states() now accepts Iterator[T] in addition to SimState and Sequence[T]
  • New sample_size parameter (default: 100) controls how many states are pulled from an iterator at a time
  • States are binned incrementally as chunks are loaded, enabling processing of datasets larger than available memory
  • next_batch() automatically loads the next chunk when current bins are exhausted
  • restore_original_order() works correctly across multiple streaming chunks via global index tracking
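For illustration, the chunk-by-chunk consumption described above can be sketched generically (`pull_chunks` is a hypothetical helper, not the actual TorchSim API):

```python
from itertools import islice

def pull_chunks(states, sample_size=100):
    """Lazily pull up to sample_size states at a time from any iterable.

    Mimics how a streaming path might consume an Iterator[T] chunk by
    chunk without materializing the whole dataset in memory.
    """
    it = iter(states)
    while True:
        chunk = list(islice(it, sample_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would then be binned before the next one is pulled, so peak memory is bounded by `sample_size` rather than the dataset size.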

Implementation:

  • Extracted core binning logic into _bin_and_prepare() method used by both eager and streaming paths
  • Added _load_next_chunk() method to pull and bin the next batch of states from iterator
  • Added _states_iterator, _all_index_bins, and _global_index_offset attributes to track streaming state
  • Global indices are computed by offsetting local bin indices by the cumulative number of states processed
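The global-index offsetting in the last bullet can be sketched as follows (a simplified illustration; names are hypothetical, not the PR's actual internals):

```python
def to_global_bins(local_index_bins, offset):
    """Shift a chunk's local bin indices by the number of states already
    consumed from the iterator, so indices remain unique across chunks."""
    return [[offset + i for i in local_bin] for local_bin in local_index_bins]

# After binning a chunk of k states, the caller would advance: offset += k
```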

Testing
Added comprehensive tests for iterator/generator support:

test_binning_auto_batcher_with_iterator() - Basic iterator input
test_binning_auto_batcher_with_generator() - Generator function input
test_binning_auto_batcher_streaming_multiple_chunks() - Multi-chunk streaming with small sample_size
test_binning_auto_batcher_empty_iterator() - Error handling for empty iterators
All new tests verify correct batching, state counts, and restore_original_order() functionality across streaming chunks.
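The `restore_original_order()` invariant those tests check can be expressed generically (a sketch of the property, not TorchSim's implementation):

```python
def restore_order(binned_results, index_bins):
    """Flatten per-bin results back into original input order using the
    (global) indices recorded for each bin."""
    total = sum(len(b) for b in index_bins)
    out = [None] * total
    for results, indices in zip(binned_results, index_bins):
        for result, idx in zip(results, indices):
            out[idx] = result
    return out
```

With correct global indices, this recovers the input order no matter how states were shuffled into bins across chunks.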

claude added 3 commits April 13, 2026 20:00
BinningAutoBatcher.load_states now accepts Iterator[T] in addition to
Sequence[T] and SimState. When given an iterator/generator, it pulls
sample_size states at a time, bins them, yields the bins, then
automatically pulls the next chunk — enabling processing of large
datasets that don't fit in memory at once.

Closes TorchSim#275

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
@thomasloux
Collaborator

It's a bit difficult to read but it seems correct.

My thoughts on this:

  • Batching is important, but only up to a threshold, after which I don't believe you gain much from batching even more systems. So the optimal packing you get from to_constant_volume_bins is probably not super useful.
  • If the previous statement is true, I would rather push for a greedy packing like InFlightAutoBatcher's. But this means different implementations for Sequence[SimState] and Iterator[SimState].
  • Also, because you can't access an Iterator[SimState] in whatever order you want, this data structure is not well suited to the optimization to_constant_volume_bins performs.
  • For optimal performance, I would push for TorchSim to support a Dataset like PyTorch's: implement __len__ and __getitem__. It could be reused in all autobatchers. I would further suggest a preparation step that computes the metric on all systems, so to_constant_volume_bins can run efficiently (this depends heavily on the metric: getting the number of atoms without loading a complete structure is easy; the density of neighbors isn't). In the DataLoader perspective, batch construction (and potentially computing the metric, or constructing the batch in greedy mode) could be done asynchronously on CPU: the GPU runs the relaxation or MD while the CPU prepares the next batch. If a user provides a SimState or a list of SimStates, this input is handed to the dataset/dataloader. Otherwise, we specify very clearly and provide a standard dataset implementation, for instance reading CIF files from a directory, that users can extend quite easily (fetching from Materials Project or their own database). This would probably give a cleaner and unified backend for autobatching.
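The PyTorch-style dataset proposed above, with the metric precomputed during a preparation step, might look like this minimal sketch (class and attribute names are hypothetical):

```python
class ListDataset:
    """Minimal map-style dataset: __len__ plus __getitem__, as in PyTorch.

    `metric` stands in for a cheap per-system quantity (e.g. atom count)
    precomputed up front so bin packing can run efficiently.
    """

    def __init__(self, systems, metric):
        self.systems = list(systems)
        # Preparation step: compute the packing metric for all systems.
        self.metrics = [metric(s) for s in self.systems]

    def __len__(self):
        return len(self.systems)

    def __getitem__(self, idx):
        return self.systems[idx]
```

A real implementation would load structures lazily (e.g. CIF files from a directory) rather than holding them all in a list.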

@thomasloux
Collaborator

To make it clearer, the dataset can either pre-compute all metrics and perform optimal batch construction, or compute it on the fly and make greedy batch construction, which I think is fine in most cases.

Tests two claims from review of autobatching streaming PR:
  1. Batching throughput saturates past a threshold
  2. Optimal bin-packing isn't much better than greedy in practice

Results on CPU+LJ show greedy matches optimal for realistic distributions;
need GPU+MLIP runs for definitive numbers.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
@craigxchen
Contributor Author

I think the Dataset approach makes sense as a longer-term solution, but that's probably a much larger refactor. I ran a quick test on CPU and it confirms your suspicion that optimal packing doesn't really help much. I'll make that change for now since it simplifies things, and maybe we can save the Dataset refactor for a different PR?

Replaces to_constant_volume_bins optimal bin-packing with greedy first-fit
in both eager and streaming paths, per review feedback on TorchSim#275.

Rationale (see examples/benchmarking/batching_strategy.py for data):
  - Greedy and optimal produce identical bin counts on realistic distributions
  - Even on adversarial distributions, greedy was slightly faster in wall time
  - Removes the arbitrary sample_size parameter (no longer needed)
  - Simplifies the streaming implementation to mirror InFlightAutoBatcher

Changes:
  - Remove sample_size, _bin_and_prepare, _load_next_chunk
  - Add _load_eager (for SimState/Sequence) and _load_streaming (for Iterator)
  - Greedy packing preserves input order (nicer for debugging)
  - _all_index_bins folded into self.index_bins (populated incrementally
    in streaming path)
  - to_constant_volume_bins retained as a standalone utility

Test: streaming test renamed from _streaming_multiple_chunks to
_streaming_multiple_batches and updated to not depend on sample_size.

https://claude.ai/code/session_01P1qrZooaYiFMibUMWc7asC
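The greedy first-fit strategy the commit above adopts can be sketched as follows (an illustrative version, not the PR's actual code; note that, as the commit says, input order is preserved within bins):

```python
def greedy_first_fit(sizes, capacity):
    """First-fit greedy packing: each item goes into the first bin with
    room, opening a new bin otherwise. Returns bins as index lists."""
    bins, loads = [], []
    for idx, size in enumerate(sizes):
        for b, load in enumerate(loads):
            if load + size <= capacity:
                bins[b].append(idx)
                loads[b] += size
                break
        else:
            # No existing bin had room: open a new one.
            bins.append([idx])
            loads.append(size)
    return bins
```

Unlike optimal bin packing, this is a single pass and needs no look-ahead, which is why it works on an iterator.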
@thomasloux
Collaborator

On CPU, for sure it does not change anything, because the CPU can't batch operations. The GPU can, but throughput is maximized relatively quickly, in the sense that inference doesn't need the full length of your VRAM. On an H200 you can run around 50,000 atoms (depending on the MLIP, of course), but you probably hit 100% GPU utilization with barely 5,000 atoms.

@thomasloux
Collaborator

Waiting for other contributors to share their opinion on this. I'm fine with this change for now, and yes, the dataset belongs in another PR, although I don't think that's a massive amount of work. But the plan around it needs to be carefully set up; Claude can do the implementation.

@thomasloux thomasloux requested review from CompRhys and orionarcher and removed request for orionarcher April 14, 2026 15:46
Collaborator

@orionarcher orionarcher left a comment


I have mixed feelings about excising the bin packing from the BinningAutobatcher. It's a nice feature that in the best case can offer significant packing efficiencies. That said, I agree with @thomasloux that optimal batching isn't a crazy speedup. IMO the best-case scenario is that bin packing reduces the number of batches that must run, for example from 4 poorly packed to 3 well packed, which could be a significant speedup. I don't think that the streaming advantage justifies removing bin packing, even though I floated that idea in #275.

I think a better solution would be to sample n_systems, pack them, run them, then sample another n systems. n can be a settable number with the default value high, thus essentially preserving the current default behavior.
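The buffer-then-pack loop suggested above could look like this hypothetical sketch, where `pack` stands in for the bin-packing step (e.g. to_constant_volume_bins) applied to each sampled buffer:

```python
from itertools import islice

def buffered_batches(systems, n_systems, pack):
    """Sample n_systems at a time, pack that buffer into batches,
    yield the batches, then sample the next buffer."""
    it = iter(systems)
    while True:
        buffer = list(islice(it, n_systems))
        if not buffer:
            return
        yield from pack(buffer)
```

With a high default `n_systems`, a finite input is packed in one buffer, essentially preserving the current default behavior.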

Also, I haven't dug deeply into the code, but it strikes me as fairly verbose for a part of the package that is so central and highly scrutinized. I have misgivings about injecting vibe-coded solutions into the package core without careful line-by-line review (though I'm very supportive of vibe coding in general).

Comment on lines +301 to +308

def main() -> None:
    """Run all three benchmark experiments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device",
        default=None,
        help="Override device (default: cuda if available else cpu)",
Collaborator


I hesitate to include this example script if it's never actually been run on GPU. If it's just added code that's never been operationalized I don't know that it belongs in the package. IMO the examples should have been at some point very useful to merit inclusion. I'd advocate leaving this script out.

f"{t_grd / n_sys * 1000:>10.3f}"
)
delta_pct = (t_grd - t_opt) / t_opt * 100
print(f"\nGreedy is {delta_pct:+.1f}% slower than optimal (negative = faster).")
Collaborator


For example, some of the "conclusions" are already baked in

Contributor Author


Good catch, thank you.

@thomasloux
Collaborator

I think a better solution would be to sample n_systems, pack them, run them, then sample another n systems. n can be a settable number with the default value high, thus essentially preserving the current default behavior.

This was exactly what @craigxchen implemented first. I'm not a fan, because I don't think it is crucial. My own use case is running ~1000 crystals of ~10 atoms each on an H200 with 160 GB of VRAM. In that case, even outlier systems with 1000 atoms are not a problem in terms of optimal packing, because there are so many small systems.

But overall I'm fine with both implementations.

@craigxchen
Contributor Author

I'll defer to you guys on the details since I haven't run enough simulations myself to get a good feel for whether I prefer the first or second approach. Regardless let me clean up the vibe-code a bit and we can evaluate from there.

@craigxchen
Contributor Author

So right now the iterator path does greedy batching, but for List/Sequence inputs I've reverted back to the original global bin packing. The batching logic is split into eager vs. streaming helper functions, so if we want to make adjustments later it should be straightforward.

@CompRhys
Member

I did have a WIP agent session looking at #275. What I did was define an ABC for all autobatchers and then introduce the greedy version as a new class, leaving the existing binned autobatcher mostly unchanged. The principal reason I didn't open the PR was that, even though it feels nice with the iterator, all the tensors still end up just sitting on the GPU, and I wasn't sure there was an actual memory benefit.

I was doing pure greedy but I think this partial greedy with the buffer is better

@curtischong
Collaborator

curtischong commented Apr 16, 2026

I think the reason people are hesitant to merge this is that it adds a lot of code to an already hard-to-read function. This is on us: the original function wasn't very clean. I'm willing to fork your commits and try to see if I can improve the function later. Sorry, I didn't realize at the time this could be taken the wrong way; I wanted to be supportive and help get it past the line, which is what I thought was the main blocker.


Development

Successfully merging this pull request may close these issues.

Make BinningAutoBatcher support streams of systems

6 participants