arXiv:2605.108601 PaperLens breakdown

Closer in the Gap: Towards Portable Performance on RISC-V Vector Processors

This paper investigates the performance of RISC-V Vector Extension (RVV) 1.0 hardware, focusing on compiler support and performance monitoring. It uses assembly microbenchmarks to identify performance bottlenecks such as predication overhead and stride loads, and evaluates GCC 15 and LLVM 21 autovectorization across HPC and ML applications. GCC is generally superior, except on SGEMM/DGEMM, where LLVM excels due to aggressive instruction reduction. The study also examines RVV support for complex applications such as Google's Qsim, highlighting compiler immaturity for intricate memory access patterns.

Key Takeaways

RVV 1.0 compiler support and performance monitoring are still evolving.

Predication overhead and stride loads are significant performance challenges not fully addressed by current compiler cost models.

GCC 15 generally outperforms LLVM 21 in autovectorization for HPC/ML proxy applications.

LLVM 21 shows superior performance in SGEMM/DGEMM due to aggressive instruction reduction.

Default LMUL selection in compilers is often close to optimal.

Current RVV compilers struggle with complex memory access patterns, as seen in quantum simulator Qsim.
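The LMUL takeaway above is easier to read with the underlying arithmetic in view. For integer LMUL, the RVV 1.0 specification defines the maximum vector length in elements as VLMAX = LMUL * VLEN / SEW, and grouping registers leaves 32 / LMUL register groups available. The sketch below only evaluates those formulas. The function names are hypothetical, the VLEN of 256 bits used in the usage note is an assumed example rather than the hardware studied in the paper, and fractional LMUL is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX in elements for integer LMUL: LMUL * VLEN / SEW.
 * vlen_bits: vector register width (implementation-defined, assumed here).
 * sew_bits:  selected element width, e.g. 32 for float. */
size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits > 0 && vlen_bits % sew_bits == 0);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return lmul * vlen_bits / sew_bits;
}

/* Strip-mined loop iterations needed to cover n elements,
 * assuming each iteration processes at most VLMAX elements. */
size_t strip_count(size_t n, size_t vlmax_elems)
{
    assert(vlmax_elems > 0);
    return (n + vlmax_elems - 1) / vlmax_elems;
}

/* Register groups left when the 32 vector registers are grouped by LMUL. */
size_t register_groups(size_t lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return 32 / lmul;
}
```

With the assumed VLEN of 256 bits and 32-bit elements, `vlmax(256, 32, 4)` gives 32 elements per iteration, so a 1000-element loop needs `strip_count(1000, 32)` = 32 iterations. A larger LMUL shortens the loop but leaves fewer register groups, which is the trade-off a compiler's LMUL choice has to balance.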

Core Concepts

RISC-V Vector Extension (RVV)

RVV is RISC-V's vector instruction set extension for data-parallel processing. How much of its performance applications actually see depends heavily on compiler support.

Autovectorization

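Autovectorization is the compiler transformation that turns scalar loops in high-level code into vector instructions without source changes, bridging the gap between that code and high-performance vector hardware. Its effectiveness varies with the compiler and the workload.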
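As an illustration, the loop below is the kind of scalar C code an autovectorizer targets. It is a minimal sketch, not code from the paper: the function name `saxpy` is hypothetical, and the flags mentioned in the comment (`-O3 -march=rv64gcv`) are an assumed typical build, not the paper's configuration.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * Unit-stride accesses, no dependence between iterations, and `restrict`
 * telling the compiler that x and y do not alias: properties that make a
 * loop a straightforward autovectorization candidate.
 * Assumed build: a RISC-V GCC or Clang invoked with -O3 -march=rv64gcv. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether a given compiler vectorizes this loop, and with which instructions, is best confirmed by inspecting the generated assembly rather than assumed.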

Predication Overhead

Predication applies a vector operation only to the elements selected by a mask, which is how conditional code is vectorized. It can introduce performance overhead if compilers or hardware do not manage it carefully.
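A small scalar loop shows where predication comes from. In this sketch, which is an illustration and not one of the paper's microbenchmarks, the condition on `x[i]` becomes a per-element mask when the loop is vectorized. The function name `add_positive` and the returned count of active elements are hypothetical additions for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Add x[i] to y[i] only where x[i] is positive.
 * A vectorizer has to turn the branch into a mask: in RVV the mask is
 * held in register v0 and inactive elements of y are left unchanged.
 * Returns the number of active (mask = 1) elements. */
size_t add_positive(size_t n, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    size_t active = 0;
    for (size_t i = 0; i < n; ++i) {
        int m = x[i] > 0.0f;      /* per-element mask bit */
        if (m) {
            y[i] += x[i];         /* executed only for active elements */
        }
        active += (size_t)m;
    }
    return active;
}
```

The vector unit still issues the masked instruction for the whole vector, so when few elements are active, much of that work is discarded. That is one plausible source of the overhead described above.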

Stride Load

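A stride load reads elements separated by a fixed distance in memory rather than from consecutive addresses. The pattern is common, but it can become a performance bottleneck when hardware and compilers handle it inefficiently, often because of poor cache behavior.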
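A typical source of stride loads is reading one column of a row-major matrix. The sketch below is an assumed example, not taken from the paper, and the function names are hypothetical. Each iteration reads an element `cols * sizeof(float)` bytes after the previous one, while the writes to `dst` are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Copy column `col` of a row-major rows x cols matrix into dst.
 * Reads of m are strided; writes to dst are unit-stride.
 * A vectorizer may map the reads to RVV strided loads (for example
 * vlse32.v); whether it does depends on the compiler and its cost model. */
void copy_column(float *restrict dst, const float *restrict m,
                 size_t rows, size_t cols, size_t col)
{
    assert(dst != NULL && m != NULL && col < cols);
    for (size_t r = 0; r < rows; ++r) {
        dst[r] = m[r * cols + col];
    }
}

/* Distance in bytes between consecutive elements of one column. */
size_t column_stride_bytes(size_t cols)
{
    return cols * sizeof(float);
}
```

When the stride reaches or exceeds the cache line size, every element of the column comes from a different cache line, which matches the poor cache behavior mentioned above.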

Why It Matters

This research bears directly on the viability and performance of RISC-V in high-demand computing fields. By identifying compiler deficiencies and hardware bottlenecks, it can guide the development of more efficient compilers and potentially future RVV hardware revisions, helping RISC-V compete with established architectures in HPC and AI.

Optimizing scientific simulations (e.g., weather modeling, molecular dynamics) on RISC-V.

Accelerating machine learning inference and training on RISC-V-based edge devices and data centers.

Improving the performance of quantum simulators and other complex physics applications.

Guiding compiler developers (GCC, LLVM) in prioritizing RVV-specific optimizations.

Informing hardware architects on areas for future RVV microarchitecture improvements.