Add SIMD impl of memset for LoongArch #547

Merged
lhecker merged 1 commit into microsoft:main from heiher:loong-simd-memset
Jul 2, 2025
heiher (Contributor) commented Jun 30, 2025

Benchmark results on LA664:

  • LASX
simd/memset<u32>/8      time:   [2.5735 ns 2.6308 ns 2.6957 ns]
                        thrpt:  [2.7639 GiB/s 2.8321 GiB/s 2.8951 GiB/s]
                 change:
                        time:   [−14.812% −11.240% −7.7107%] (p = 0.00 < 0.05)
                        thrpt:  [+8.3549% +12.664% +17.387%]
                        Performance has improved.

simd/memset<u32>/136    time:   [8.0049 ns 8.0098 ns 8.0159 ns]
                        thrpt:  [15.801 GiB/s 15.813 GiB/s 15.823 GiB/s]
                 change:
                        time:   [−51.251% −51.202% −51.145%] (p = 0.00 < 0.05)
                        thrpt:  [+104.69% +104.92% +105.13%]
                        Performance has improved.

simd/memset<u32>/1024   time:   [12.407 ns 12.414 ns 12.422 ns]
                        thrpt:  [76.770 GiB/s 76.824 GiB/s 76.866 GiB/s]
                 change:
                        time:   [−88.281% −88.262% −88.249%] (p = 0.00 < 0.05)
                        thrpt:  [+750.97% +751.94% +753.34%]
                        Performance has improved.

simd/memset<u32>/131072 time:   [2.4655 µs 2.4668 µs 2.4685 µs]
                        thrpt:  [49.450 GiB/s 49.485 GiB/s 49.512 GiB/s]
                 change:
                        time:   [−81.223% −81.209% −81.195%] (p = 0.00 < 0.05)
                        thrpt:  [+431.76% +432.17% +432.57%]
                        Performance has improved.

simd/memset<u32>/134217728
                        time:   [4.4058 ms 4.4173 ms 4.4313 ms]
                        thrpt:  [28.208 GiB/s 28.298 GiB/s 28.372 GiB/s]
                 change:
                        time:   [−67.246% −67.154% −67.062%] (p = 0.00 < 0.05)
                        thrpt:  [+203.60% +204.45% +205.30%]
                        Performance has improved.

simd/memset<u8>/8       time:   [3.2015 ns 3.2029 ns 3.2050 ns]
                        thrpt:  [2.3247 GiB/s 2.3262 GiB/s 2.3272 GiB/s]
                 change:
                        time:   [−0.0718% +0.0012% +0.0858%] (p = 0.97 > 0.05)
                        thrpt:  [−0.0857% −0.0012% +0.0719%]
                        No change in performance detected.

simd/memset<u8>/136     time:   [3.6125 ns 3.6174 ns 3.6229 ns]
                        thrpt:  [34.961 GiB/s 35.014 GiB/s 35.062 GiB/s]
                 change:
                        time:   [−0.6087% −0.1314% +0.2680%] (p = 0.58 > 0.05)
                        thrpt:  [−0.2673% +0.1316% +0.6124%]
                        No change in performance detected.

simd/memset<u8>/1024    time:   [11.341 ns 11.346 ns 11.353 ns]
                        thrpt:  [84.002 GiB/s 84.055 GiB/s 84.092 GiB/s]
                 change:
                        time:   [−0.1288% −0.0636% +0.0134%] (p = 0.06 > 0.05)
                        thrpt:  [−0.0134% +0.0636% +0.1290%]
                        No change in performance detected.

simd/memset<u8>/131072  time:   [2.4705 µs 2.4717 µs 2.4733 µs]
                        thrpt:  [49.354 GiB/s 49.388 GiB/s 49.411 GiB/s]
                 change:
                        time:   [−0.2564% −0.0972% +0.0179%] (p = 0.20 > 0.05)
                        thrpt:  [−0.0179% +0.0973% +0.2570%]
                        No change in performance detected.

simd/memset<u8>/134217728
                        time:   [4.4030 ms 4.4104 ms 4.4190 ms]
                        thrpt:  [28.287 GiB/s 28.342 GiB/s 28.390 GiB/s]
                 change:
                        time:   [−0.0583% +0.1614% +0.3954%] (p = 0.17 > 0.05)
                        thrpt:  [−0.3938% −0.1611% +0.0583%]
                        No change in performance detected.
  • LSX
simd/memset<u32>/8      time:   [2.3534 ns 2.4342 ns 2.5242 ns]
                        thrpt:  [2.9516 GiB/s 3.0608 GiB/s 3.1658 GiB/s]
                 change:
                        time:   [−19.116% −15.130% −11.603%] (p = 0.00 < 0.05)
                        thrpt:  [+13.126% +17.828% +23.633%]
                        Performance has improved.

simd/memset<u32>/136    time:   [7.0382 ns 7.0426 ns 7.0480 ns]
                        thrpt:  [17.971 GiB/s 17.985 GiB/s 17.996 GiB/s]
                 change:
                        time:   [−57.141% −57.110% −57.079%] (p = 0.00 < 0.05)
                        thrpt:  [+132.99% +133.15% +133.33%]
                        Performance has improved.

simd/memset<u32>/1024   time:   [30.019 ns 30.037 ns 30.060 ns]
                        thrpt:  [31.725 GiB/s 31.750 GiB/s 31.769 GiB/s]
                 change:
                        time:   [−71.652% −71.606% −71.575%] (p = 0.00 < 0.05)
                        thrpt:  [+251.80% +252.19% +252.76%]
                        Performance has improved.

simd/memset<u32>/131072 time:   [3.2877 µs 3.2897 µs 3.2923 µs]
                        thrpt:  [37.077 GiB/s 37.107 GiB/s 37.130 GiB/s]
                 change:
                        time:   [−74.960% −74.930% −74.885%] (p = 0.00 < 0.05)
                        thrpt:  [+298.17% +298.89% +299.36%]
                        Performance has improved.

simd/memset<u32>/134217728
                        time:   [4.4147 ms 4.4179 ms 4.4218 ms]
                        thrpt:  [28.269 GiB/s 28.294 GiB/s 28.314 GiB/s]
                 change:
                        time:   [−67.181% −67.149% −67.114%] (p = 0.00 < 0.05)
                        thrpt:  [+204.08% +204.41% +204.71%]
                        Performance has improved.

simd/memset<u8>/8       time:   [3.2016 ns 3.2031 ns 3.2052 ns]
                        thrpt:  [2.3245 GiB/s 2.3260 GiB/s 2.3271 GiB/s]
                 change:
                        time:   [−0.0666% +0.0051% +0.0807%] (p = 0.90 > 0.05)
                        thrpt:  [−0.0807% −0.0051% +0.0667%]
                        No change in performance detected.

simd/memset<u8>/136     time:   [3.6111 ns 3.6152 ns 3.6197 ns]
                        thrpt:  [34.991 GiB/s 35.035 GiB/s 35.075 GiB/s]
                 change:
                        time:   [−0.7119% −0.2467% +0.1774%] (p = 0.29 > 0.05)
                        thrpt:  [−0.1771% +0.2473% +0.7170%]
                        No change in performance detected.

simd/memset<u8>/1024    time:   [11.342 ns 11.349 ns 11.358 ns]
                        thrpt:  [83.962 GiB/s 84.030 GiB/s 84.082 GiB/s]
                 change:
                        time:   [−0.1126% −0.0288% +0.0686%] (p = 0.59 > 0.05)
                        thrpt:  [−0.0685% +0.0288% +0.1127%]
                        No change in performance detected.

simd/memset<u8>/131072  time:   [2.4707 µs 2.4723 µs 2.4742 µs]
                        thrpt:  [49.337 GiB/s 49.376 GiB/s 49.407 GiB/s]
                 change:
                        time:   [−0.2277% −0.0571% +0.0834%] (p = 0.52 > 0.05)
                        thrpt:  [−0.0833% +0.0571% +0.2282%]
                        No change in performance detected.

simd/memset<u8>/134217728
                        time:   [4.3991 ms 4.4057 ms 4.4138 ms]
                        thrpt:  [28.320 GiB/s 28.372 GiB/s 28.415 GiB/s]
                 change:
                        time:   [−0.1523% +0.0554% +0.2680%] (p = 0.62 > 0.05)
                        thrpt:  [−0.2673% −0.0554% +0.1525%]
                        No change in performance detected.
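
For readers cross-checking the figures above, here is a minimal sketch of how the `thrpt` and `change` lines relate to the `time` lines. It assumes the number after the slash in each benchmark name is the buffer size in bytes, which is consistent with the reported values but is not stated in the PR. The helper names `throughput_gib_s` and `thrpt_change_pct` are hypothetical and are not part of the benchmark harness.

```rust
/// Bytes per GiB (2^30), the unit used in the `thrpt` lines.
const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Throughput in GiB/s for `bytes` processed in `nanos` nanoseconds.
/// Hypothetical helper for illustration only.
fn throughput_gib_s(bytes: f64, nanos: f64) -> f64 {
    bytes / (nanos * 1e-9) / GIB
}

/// Relative throughput change (percent) implied by a relative time
/// change (percent), since throughput is inversely proportional to time.
fn thrpt_change_pct(time_change_pct: f64) -> f64 {
    (1.0 / (1.0 + time_change_pct / 100.0) - 1.0) * 100.0
}

fn main() {
    // LASX, simd/memset<u32>/1024: median time 12.414 ns,
    // median time change -88.262% (values copied from the output above).
    let thrpt = throughput_gib_s(1024.0, 12.414);
    let change = thrpt_change_pct(-88.262);
    println!("throughput: {thrpt:.2} GiB/s");
    println!("throughput change: {change:+.2}%");
}
```

With the LASX `memset<u32>/1024` medians this gives roughly 76.8 GiB/s and about +752%, in line with the reported `thrpt` values. Small differences in the last digits are expected because the reported figures are separately rounded estimates.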

@lhecker lhecker enabled auto-merge (squash) July 2, 2025 15:55
@lhecker lhecker merged commit e9ad756 into microsoft:main Jul 2, 2025
3 checks passed
@heiher heiher deleted the loong-simd-memset branch July 2, 2025 16:02
Lou32Verbose pushed a commit to Lou32Verbose/edit that referenced this pull request Jan 11, 2026