Add SIMD impls of lines_fwd and lines_bwd for LoongArch #539
lhecker merged 1 commit into microsoft:main from heiher:loong-simd-lines
src/lib.rs
Outdated
    maybe_uninit_slice,
    maybe_uninit_uninit_array_transpose
)]
#![cfg_attr(target_arch = "loongarch64", feature(stdarch_loongarch))]
The `stdarch_loongarch_feature_detection` and `loongarch_target_feature` feature gates have been stabilized in Rust 1.89.0. However, since the current stable compiler version is 1.88.0, these gates still need to be enabled explicitly. So it still makes sense to include them, even though this project declares nightly in its rust-toolchain.toml.
Benchmark results on LA664:
- LASX
```
simd/lines_fwd/1 time: [2.4046 ns 2.4076 ns 2.4118 ns]
thrpt: [395.42 MiB/s 396.10 MiB/s 396.60 MiB/s]
change:
time: [+20.046% +20.332% +20.762%] (p = 0.00 < 0.05)
thrpt: [−17.193% −16.896% −16.699%]
Performance has regressed.
simd/lines_fwd/8 time: [8.4050 ns 8.4114 ns 8.4192 ns]
thrpt: [906.19 MiB/s 907.03 MiB/s 907.72 MiB/s]
change:
time: [+4.9243% +5.0308% +5.1538%] (p = 0.00 < 0.05)
thrpt: [−4.9012% −4.7898% −4.6932%]
Performance has regressed.
simd/lines_fwd/128 time: [35.622 ns 35.650 ns 35.685 ns]
thrpt: [3.3406 GiB/s 3.3439 GiB/s 3.3465 GiB/s]
change:
time: [−66.111% −65.957% −65.864%] (p = 0.00 < 0.05)
thrpt: [+192.94% +193.74% +195.08%]
Performance has improved.
simd/lines_fwd/1024 time: [53.349 ns 53.400 ns 53.457 ns]
thrpt: [17.840 GiB/s 17.859 GiB/s 17.876 GiB/s]
change:
time: [−93.548% −93.540% −93.533%] (p = 0.00 < 0.05)
thrpt: [+1446.4% +1448.1% +1449.8%]
Performance has improved.
simd/lines_fwd/131072 time: [3.0780 µs 3.0815 µs 3.0866 µs]
thrpt: [39.549 GiB/s 39.613 GiB/s 39.659 GiB/s]
change:
time: [−97.069% −97.065% −97.060%] (p = 0.00 < 0.05)
thrpt: [+3301.2% +3307.0% +3311.8%]
Performance has improved.
simd/lines_fwd/134217728
time: [4.5887 ms 4.5919 ms 4.5958 ms]
thrpt: [27.199 GiB/s 27.222 GiB/s 27.241 GiB/s]
change:
time: [−95.733% −95.729% −95.725%] (p = 0.00 < 0.05)
thrpt: [+2239.3% +2241.5% +2243.5%]
Performance has improved.
```
- LSX
```
simd/lines_fwd/1 time: [6.4032 ns 6.4068 ns 6.4116 ns]
thrpt: [148.74 MiB/s 148.85 MiB/s 148.94 MiB/s]
change:
time: [+219.68% +219.98% +220.24%] (p = 0.00 < 0.05)
thrpt: [−68.773% −68.748% −68.719%]
Performance has regressed.
simd/lines_fwd/8 time: [12.406 ns 12.413 ns 12.422 ns]
thrpt: [614.20 MiB/s 614.63 MiB/s 614.96 MiB/s]
change:
time: [+54.884% +55.133% +55.502%] (p = 0.00 < 0.05)
thrpt: [−35.692% −35.539% −35.436%]
Performance has regressed.
simd/lines_fwd/128 time: [24.412 ns 24.427 ns 24.448 ns]
thrpt: [4.8761 GiB/s 4.8801 GiB/s 4.8832 GiB/s]
change:
time: [−76.775% −76.669% −76.607%] (p = 0.00 < 0.05)
thrpt: [+327.48% +328.62% +330.58%]
Performance has improved.
simd/lines_fwd/1024 time: [49.467 ns 49.530 ns 49.599 ns]
thrpt: [19.228 GiB/s 19.255 GiB/s 19.279 GiB/s]
change:
time: [−94.014% −94.006% −93.998%] (p = 0.00 < 0.05)
thrpt: [+1566.2% +1568.3% +1570.4%]
Performance has improved.
simd/lines_fwd/131072 time: [4.5825 µs 4.5858 µs 4.5900 µs]
thrpt: [26.595 GiB/s 26.619 GiB/s 26.638 GiB/s]
change:
time: [−95.639% −95.632% −95.624%] (p = 0.00 < 0.05)
thrpt: [+2185.1% +2189.3% +2192.9%]
Performance has improved.
simd/lines_fwd/134217728
time: [5.4066 ms 5.4103 ms 5.4151 ms]
thrpt: [23.084 GiB/s 23.104 GiB/s 23.120 GiB/s]
change:
time: [−94.972% −94.968% −94.963%] (p = 0.00 < 0.05)
thrpt: [+1885.3% +1887.3% +1889.0%]
Performance has improved.
```
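As a reading aid for the numbers above: the `thrpt` figures follow from the input size in the benchmark name divided by the measured time, and the throughput change follows from the time change. The sketch below reproduces that arithmetic for the LASX `simd/lines_fwd/131072` row; the function names are hypothetical helpers for illustration, not part of the project or of the benchmark harness.

```rust
/// Throughput in GiB/s for `bytes` processed in `seconds`.
/// (Hypothetical helper, used only to illustrate the arithmetic.)
fn throughput_gib_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / seconds / (1u64 << 30) as f64
}

/// Relative throughput change implied by a relative time change.
/// Both values are fractions, e.g. -0.97065 for -97.065%.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

fn main() {
    // LASX, simd/lines_fwd/131072: 131072 bytes in about 3.0815 µs.
    let gib_s = throughput_gib_per_s(131_072, 3.0815e-6);
    println!("throughput: {gib_s:.2} GiB/s");

    // The same row reports a time change of -97.065%.
    let change = throughput_change(-0.97065);
    println!("throughput change: {:+.0}%", change * 100.0);
}
```

Under these assumptions the results land close to the reported 39.613 GiB/s and +3307.0%; small differences come from rounding in the printed figures.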
    let func = if is_loongarch_feature_detected!("lasx") {
        lines_bwd_lasx
    } else if is_loongarch_feature_detected!("lsx") {
        lines_bwd_lsx
Is there value in supporting both? AVX2 is supported by most x64 devices, so that's the only thing we support. I'd prefer if we could do the same with LSX and only support LASX. This would reduce our long-term maintenance burden. But I don't know how widespread LASX support is...
LASX is not always available alongside LSX. For example, both LSX and LASX are present on the LA464 and LA664 cores, but the LA364E only supports LSX and lacks LASX. I'm happy to help maintain and keep this part in good shape.
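The dispatch order discussed in this thread (prefer LASX, fall back to LSX, then to a portable path) can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `count_newlines_*`, `CountFn`, and `select_impl` are hypothetical names, the two "SIMD" functions are placeholders that simply call the scalar version, and it assumes a toolchain where `std::arch::is_loongarch_feature_detected!` is usable without a feature gate (Rust 1.89 or later, per the comment above).

```rust
/// Signature shared by every implementation in this sketch (hypothetical).
type CountFn = fn(&[u8]) -> usize;

/// Portable fallback: count b'\n' bytes one at a time.
fn count_newlines_scalar(haystack: &[u8]) -> usize {
    haystack.iter().filter(|&&b| b == b'\n').count()
}

// Placeholders standing in for real 256-bit (LASX) and 128-bit (LSX)
// kernels. They only delegate to the scalar code so the sketch stays short.
#[cfg(target_arch = "loongarch64")]
fn count_newlines_lasx(haystack: &[u8]) -> usize {
    count_newlines_scalar(haystack)
}

#[cfg(target_arch = "loongarch64")]
fn count_newlines_lsx(haystack: &[u8]) -> usize {
    count_newlines_scalar(haystack)
}

/// Pick the widest implementation the running CPU supports.
/// LA464/LA664 would take the LASX branch; an LSX-only core such as
/// the LA364E would take the LSX branch.
fn select_impl() -> (&'static str, CountFn) {
    #[cfg(target_arch = "loongarch64")]
    {
        if std::arch::is_loongarch_feature_detected!("lasx") {
            return ("lasx", count_newlines_lasx as CountFn);
        }
        if std::arch::is_loongarch_feature_detected!("lsx") {
            return ("lsx", count_newlines_lsx as CountFn);
        }
    }
    // Other architectures, or LoongArch cores without either extension.
    ("scalar", count_newlines_scalar as CountFn)
}

fn main() {
    let (name, func) = select_impl();
    let n = func(b"a\nb\nc");
    println!("{name}: {n} newlines");
}
```

Keeping the LSX branch costs one extra kernel to maintain, but without it an LSX-only core would drop straight to the scalar path.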
Thank you for your work on this! This is highly appreciated. 🙂
Benchmark results on LA664: