If check(arm32 ? "vld4.32" : "ld4", 4, ld(in_f32, 4)); is added to simd_op_check_arm.cpp, the generated code contains 8 scalar load instructions, although one might expect a single ld4 instruction.
This seems to be expected behavior, judging by the comment in staged_strided_loads.cpp: "Strides up to the vector size are worth densifying. After that, it's better to just gather."
What exactly prevents this case from being lowered to ld4? Is there a way to make it happen?
The same applies to ld2 for int64 with a vectorization factor of 2.
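
To make the question concrete, here is a minimal plain C++ sketch of the two strategies the quoted comment distinguishes, for a 4-lane float load with stride 4. It does not use Halide, and the helper names load_gather and load_dense_then_deinterleave are hypothetical, chosen only for illustration. It shows the semantics only and makes no claim about what the compiler actually emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: "gather" strategy.
// One scalar load per output lane, reading base[0], base[stride], ...
std::array<float, 4> load_gather(const float *base, std::size_t stride) {
    assert(stride >= 1);
    std::array<float, 4> out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        out[lane] = base[lane * stride];
    }
    return out;
}

// Hypothetical helper: "densify" strategy for stride 4.
// Reads 16 contiguous floats and deinterleaves them into 4 vectors,
// which is the data movement an AArch64 ld4 of four 4-lane registers
// performs. Only the first deinterleaved vector is returned here.
// Note that this touches base[0..15], whereas the gather only touches
// base[0], base[4], base[8] and base[12].
std::array<float, 4> load_dense_then_deinterleave(const float *base) {
    std::array<std::array<float, 4>, 4> regs{};
    for (std::size_t i = 0; i < 16; ++i) {
        // Element i goes to register (i % 4), lane (i / 4).
        regs[i % 4][i / 4] = base[i];
    }
    return regs[0];
}
```

Both helpers return the same four values when the buffer holds at least 16 floats. The dense version reads 3 elements past the last one the gather needs, so it is only valid when those extra elements are known to be readable.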