ld4 is not emitted for int32 with vectorization factor 4 (#8819)

If `check(arm32 ? "vld4.32" : "ld4", 4, ld(in_f32, 4));` is added to [simd_op_check_arm.cpp](https://github.com/halide/Halide/blob/6d9ab55367f58a63aa196e0fcd24d8138fc70af6/test/correctness/simd_op_check_arm.cpp#L330), the generated code contains 8 scalar load instructions, although one might expect a single `ld4` instruction.
This seems to be expected behavior, as suggested by this comment in [stage_strided_loads.cpp](https://github.com/halide/Halide/blob/6d9ab55367f58a63aa196e0fcd24d8138fc70af6/test/correctness/stage_strided_loads.cpp#L171): "Strides up to the vector size are worth densifying. After that, it's better to just gather."

What exactly prevents `ld4` from being emitted in this case? Is there a way to make it happen?
The same applies to `ld2` for int64 with vectorization factor 2.
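
To make the expectation concrete, here is a minimal plain C++ sketch of the two load forms for a stride of 4 at vectorization factor 4. It is only a model, not Halide code or actual codegen output: `strided_gather`, `deinterleave4`, `Vec4`, `kLanes`, and `kStride` are hypothetical names introduced for illustration, and `deinterleave4` only mimics the de-interleaving effect of a 4-element structure load such as `ld4`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for this sketch: 4 lanes of int32, stride 4.
constexpr std::size_t kLanes = 4;
constexpr std::size_t kStride = 4;

using Vec4 = std::array<std::int32_t, kLanes>;

// Gather form: one scalar load per lane.
// Touches src[base + 0], src[base + 4], src[base + 8], src[base + 12].
Vec4 strided_gather(const std::vector<std::int32_t> &src, std::size_t base) {
    Vec4 out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        out[lane] = src[base + lane * kStride];
    }
    return out;
}

// Model of a 4-element structure load: reads 16 contiguous elements
// starting at base and de-interleaves them into four vectors, so that
// vector k holds src[base + k], src[base + k + 4], src[base + k + 8],
// src[base + k + 12].
std::array<Vec4, kStride> deinterleave4(const std::vector<std::int32_t> &src,
                                        std::size_t base) {
    std::array<Vec4, kStride> regs{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        for (std::size_t k = 0; k < kStride; ++k) {
            regs[k][lane] = src[base + lane * kStride + k];
        }
    }
    return regs;
}
```

In this model, `deinterleave4(src, 0)[0]` holds the same values as `strided_gather(src, 0)`. One visible difference is that the structure-load form reads 16 contiguous elements, while the gather form reads nothing past `src[base + 12]`. Whether that difference is related to the behavior described above is not something this sketch establishes.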

