
RISC-V: Optimize decompression throughput by mirroring AVX fast-path for RVV short memcpy +15% #235

Open
anthony-zy wants to merge 1 commit into google:main from anthony-zy:refact_memcopy64

@anthony-zy
Contributor

PR Title

Optimize RVV memcpy path to mirror AVX fast-path for short copies


Summary

Refactors the RISC-V Vector (RVV) acceleration path in the short memcpy helper to mirror the existing AVX implementation. Instead of a generic vsetvl loop, this version performs a fixed 32-byte vector copy (with an optional second 32-byte segment), matching the profile-driven assumption that nearly all copies are short.

This eliminates loop overhead in the hot path and aligns RVV behavior with the well-tuned AVX code path.

Performance

Tested with lzbench on Spacemit(R) X60 (RISC-V, RVV 1.0):

  • Before: 269 MB/s
  • After: 310 MB/s
  • Improvement: +15%

Implementation Details

  • Uses __riscv_vsetvl_e8m2(32) to configure a 32-byte vector operation, matching the kShortMemCopy constant used in the AVX path.
  • Performs an unconditional first 32-byte load/store.
  • A second 32-byte segment is conditionally executed (with SNAPPY_PREDICT_FALSE) only when size > kShortMemCopy, since profiling shows long copies are rare.
  • Note: with e8m1 and VLEN < 256 (e.g., VLEN=128, where vl=16), each 32-byte segment would need two operations; with e8m2, VLMAX = 2 × VLEN / 8, so a single operation covers 32 bytes whenever VLEN >= 128.
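
For reviewers who want the shape of the change without opening the diff, the points above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `ShortCopySketch` and `SKETCH_PREDICT_FALSE` are hypothetical names standing in for the short memcpy helper and `SNAPPY_PREDICT_FALSE`, the RVV branch assumes VLEN >= 128 so that `vl` comes back as 32, and the `memmove` fallback exists only so the sketch also builds on non-RVV targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

// Hypothetical stand-in for Snappy's SNAPPY_PREDICT_FALSE branch hint.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_PREDICT_FALSE(x) (x)
#endif

constexpr std::size_t kShortMemCopy = 32;

// Copies `size` (<= 64) bytes from src to dst. It always moves a full
// 32-byte segment, and a second one only when size > 32, so both buffers
// must have at least 32 (or 64) accessible bytes.
inline void ShortCopySketch(char* dst, const char* src, std::size_t size) {
  assert(size <= 2 * kShortMemCopy);
#if defined(__riscv_v_intrinsic)
  const std::uint8_t* s = reinterpret_cast<const std::uint8_t*>(src);
  std::uint8_t* d = reinterpret_cast<std::uint8_t*>(dst);
  // Assumption: VLEN >= 128, so e8m2 yields vl == 32 here.
  const std::size_t vl = __riscv_vsetvl_e8m2(kShortMemCopy);
  // Unconditional first 32-byte load/store.
  vuint8m2_t data = __riscv_vle8_v_u8m2(s, vl);
  __riscv_vse8_v_u8m2(d, data, vl);
  // Second segment only for the rare long copy.
  if (SKETCH_PREDICT_FALSE(size > kShortMemCopy)) {
    data = __riscv_vle8_v_u8m2(s + kShortMemCopy, vl);
    __riscv_vse8_v_u8m2(d + kShortMemCopy, data, vl);
  }
#else
  // Portable fallback with the same two-segment structure.
  std::memmove(dst, src, kShortMemCopy);
  if (SKETCH_PREDICT_FALSE(size > kShortMemCopy)) {
    std::memmove(dst + kShortMemCopy, src + kShortMemCopy, kShortMemCopy);
  }
#endif
}
```

Note that the sketch over-copies by design: a 10-byte request still writes 32 bytes, which is only safe when the caller guarantees that much slack in both buffers.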

Verification

  • Environment: Spacemit(R) X60 (RISC-V, RVV 1.0)
  • Benchmark tool: lzbench
  • Compiled with -march=rv64gcv
  • Passed all existing Snappy unit tests.
