CUDA has been ported: is a RISC-V-based GPU now in play?

Release time: 2022-03-17  Author source: Slkor

RISC-V has been one of the hottest topics in computing because this instruction set architecture (ISA) allows for extensive customization and is easy to understand, on top of being open source and license-free. There's even a project to design a general-purpose GPU based on the RISC-V ISA, and now we're witnessing the porting of Nvidia's CUDA software library to the Vortex RISC-V GPGPU platform.

Nvidia's CUDA (Compute Unified Device Architecture) is a computing platform and application programming interface (API) that runs on Nvidia's line of graphics cards. When an application is written with CUDA support, its code is GPU-accelerated as soon as the system detects a CUDA-capable GPU.

Now, researchers have investigated a way to enable CUDA software toolkit support on a RISC-V GPGPU project called Vortex. The Vortex RISC-V GPGPU aims to provide a full-system RISC-V GPU based on the RV32IMF ISA. Its 32-bit cores can scale from 1-core to 32-core GPU designs. It supports the OpenCL 1.2 API, and it now also supports some CUDA operations.

The researchers explain: "...In this project, we propose and build a pipeline to support end-to-end CUDA migration: the pipeline accepts CUDA source code as input and executes it on the extended RISC-V GPU architecture. Our pipeline consists of several steps: translate CUDA source code to NVVM IR, convert NVVM IR to SPIR-V IR, forward SPIR-V IR to POCL to get a RISC-V binary file, and finally execute the binary file on the extended RISC-V GPU architecture."

The process is visualized in the image above, showing all the steps needed to make it work. In simple terms, CUDA source code is translated into an intermediate representation (IR) format called NVVM IR, which is based on the open-source LLVM IR. It is then converted to Standard Portable Intermediate Representation (SPIR-V) IR, which is forwarded to POCL, a portable open-source implementation of the OpenCL standard. Since Vortex supports OpenCL, it receives code in a form it supports and can execute it without problems.
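The chain of representations described above can be written down as data. This is only a sketch for keeping the stage order straight: the stage names come from the article, while the `PIPELINE` list and the `trace` helper are hypothetical and are not part of Vortex, POCL, or any CUDA tooling.

```python
# Stage names follow the pipeline described in the article. The list layout
# and the trace() helper are illustrative assumptions, not real tooling.
PIPELINE = [
    ("CUDA source", "NVVM IR"),
    ("NVVM IR", "SPIR-V IR"),
    ("SPIR-V IR", "RISC-V binary"),          # this step is performed by POCL
    ("RISC-V binary", "execution on Vortex"),
]


def trace(start):
    """Follow the stages from `start` and return every representation visited."""
    path = [start]
    for src, dst in PIPELINE:
        if path[-1] == src:
            path.append(dst)
    return path


if __name__ == "__main__":
    print(" -> ".join(trace("CUDA source")))
```

Starting the trace at a later stage, such as "SPIR-V IR", shows that everything from that point on is the ordinary OpenCL path that Vortex already supports.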

For more details on this complex process, see the original article. The researchers deserve credit for their efforts to make CUDA run on RISC-V GPGPUs. While this is only a small step for now, it could be the beginning of an era where RISC-V is used to accelerate computing applications, much like Nvidia's GPU lineup today.

Extended reading: Can RISC-V change the GPU?

Can RISC-V handle GPU work? That is a work in progress, with the goal of creating a small, area-efficient design with custom programmability and scalability.
Anyone who has studied GPU architecture knows that it is a SIMD arrangement of vector processors. It's an ultra-efficient parallel processor that's been used for everything from running simulations and great games to giving robots AI and helping smart people manipulate the stock market. It even checks my grammar as I write this.
But the GPU field has been a proprietary one, with its inner workings kept as the IP and closely guarded know-how of developers such as AMD, Intel, and Nvidia. What if there were a new, open set of instructions designed for 3D graphics and media processing? Well, there may be.
The new instructions are being built on the RISC-V base vector instruction set. They will add support for new graphics-specific data types as layered extensions, in the spirit of the core RISC-V ISA. Vector, transcendental math, pixel and texture, and Z/frame buffer operations are supported. It could be a fused CPU-GPU ISA. The Libre-RISC 3D group calls it RV64X (Figure 1) because the instructions will be 64 bits long (32 bits won't be enough to support a robust ISA).

Figure 1. The RV64X graphics processor includes multiple DSPs in addition to dedicated texture units and function blocks.

The group stated that its motivation and goal is to create a small, efficient design with custom programmability and extensibility. It should offer low-cost IP ownership and development, and is not meant to compete with commercial products. It can be implemented on FPGA and ASIC targets and is free and open source. The initial design targets low-power microcontrollers. It will be Khronos Vulkan compatible and will support other APIs (OpenGL, DirectX, etc.).

GPU + RISC-V

The target hardware will have a GPU functional unit and a RISC-V core. The combination appears as a single processor whose 64-bit instructions are encoded as scalar instructions. The point is that the compiler generates SIMD instructions from prefixed scalar opcodes. Other features include variable issue; a predicated SIMD backend; branch tracking; precise exceptions; and a vector frontend. Designs will include a 16-bit fixed-point version and a 32-bit floating-point version. The former is well suited to FPGA implementation.
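To make the 16-bit fixed-point option concrete, here is a minimal sketch of fixed-point arithmetic. The article does not say how the 16 bits are split, so the Q8.8 layout (8 integer bits, 8 fractional bits) and all function names below are assumptions chosen purely for illustration.

```python
FRAC_BITS = 8             # assumption: Q8.8 layout; the article names no format
SCALE = 1 << FRAC_BITS    # 256 steps per unit
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(raw):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, raw))


def to_fixed(x):
    """Encode a float as a signed 16-bit Q8.8 value, saturating on overflow."""
    return saturate(int(round(x * SCALE)))


def to_float(q):
    """Decode a Q8.8 value back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two Q8.8 values.

    The intermediate product has 16 fractional bits, so it is shifted right
    by FRAC_BITS to return to Q8.8. The shift rounds toward negative infinity.
    """
    return saturate((a * b) >> FRAC_BITS)


if __name__ == "__main__":
    product = fixed_mul(to_fixed(1.5), to_fixed(2.0))
    print(product, to_float(product))
```

Integer-only arithmetic of this kind needs no floating-point hardware, which is one common reason fixed-point designs map well onto FPGAs.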
The team said: "There is no need to use an RPC/IPC calling mechanism to send 3D API calls from unused CPU memory space to GPU memory space and vice versa."
The advantage of the "fused" CPU-GPU ISA approach is that a standard graphics pipeline can be implemented in microcode and custom shaders can be supported. Ray-tracing extensions could even be included.
The design will use the Vblock format (from the Libre GPU effort):

It's kind of like VLIW (but not really).
Instruction blocks are prefixed with register tags that provide additional context for the scalar instructions within that block.
Sub-blocks include vector length, swizzling, vector/width overrides, and predication.
All this is added to scalar opcodes!
There are no vector opcodes (nor need any).
In a vector context, it goes like this: if a scalar opcode uses a register, and that register is listed in the vector context, then vector mode will be activated.
Activation causes a hardware-level for-loop to issue multiple consecutive scalar operations (instead of just one).
Implementers are free to implement loops in any way they want - SIMD, multi-issue, single-execute; pretty much anything.
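The vector-context rule in the list above can be modeled in a few lines. This is a simplified software sketch, not the Vblock specification: the function name, the use of a plain set as the vector context, and the choice to step tagged registers to the next register number on each iteration are all assumptions made for illustration.

```python
def execute_add(regs, rd, rs1, rs2, vector_ctx, vl):
    """Run one scalar ADD, expanding it into a loop if any operand is tagged.

    regs       -- list modeling the register file
    vector_ctx -- set of register numbers tagged as vectors (hypothetical model)
    vl         -- vector length used when vector mode is activated
    Returns the list of scalar operations actually issued.
    """
    def is_vec(r):
        return r in vector_ctx

    # Vector mode activates only if the opcode touches a tagged register.
    count = vl if any(is_vec(r) for r in (rd, rs1, rs2)) else 1

    issued = []
    for i in range(count):
        # Tagged registers advance each iteration; untagged ones stay fixed.
        d = rd + i if is_vec(rd) else rd
        a = rs1 + i if is_vec(rs1) else rs1
        b = rs2 + i if is_vec(rs2) else rs2
        regs[d] = regs[a] + regs[b]
        issued.append(("add", d, a, b))
    return issued


if __name__ == "__main__":
    regs = [0] * 32
    regs[8:12] = [1, 2, 3, 4]   # a four-element vector held in r8..r11
    regs[16] = 10               # a scalar held in r16
    ops = execute_add(regs, rd=4, rs1=8, rs2=16, vector_ctx={4, 8}, vl=4)
    print(ops)
    print(regs[4:8])
```

The same `add` opcode yields one operation when no operand is tagged and `vl` operations when one is, which is why no separate vector opcodes are needed. A real implementation could run the loop as SIMD lanes or as multiple issued instructions rather than sequentially.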

RV32-V handles vector operations on vectors of 2 to 4 elements, with 8, 16 or 32 bits per element. There will also be dedicated instructions for the general 3D graphics rendering pipeline, covering 64-bit and 128-bit fixed and floating-point XYZW points; 8-, 16-, 24- and 32-bit RGBA pixels; UVW texels with 8 or 16 bits per component; and light and material settings (Ia, ka, Id, kd, Is, ks, etc.).
The attribute vector is represented as a 4×4 matrix. The system will natively support 2×2 and 3×3 matrices. Vector support may also be suitable for numerical simulations using 8-bit integer data types common in AI and machine learning applications.
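As a reminder of why 4×4 matrices and XYZW points go together, the sketch below applies a 4×4 matrix to a four-component point in plain Python. It says nothing about how RV64X encodes these types: the row-major layout and the `transform` and `translation` helpers are assumptions for illustration only.

```python
def transform(matrix, point):
    """Multiply a 4x4 row-major matrix by an (x, y, z, w) column vector."""
    return tuple(sum(row[k] * point[k] for k in range(4)) for row in matrix)


def translation(tx, ty, tz):
    """Build a 4x4 matrix that translates a point by (tx, ty, tz)."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]


if __name__ == "__main__":
    move = translation(2, 3, 4)
    print(transform(move, (1, 1, 1, 1)))   # a position (w = 1) is moved
    print(transform(move, (1, 1, 1, 0)))   # a direction (w = 0) is unchanged
```

The fourth component lets one matrix multiply express translation as well as rotation and scaling, which is why homogeneous XYZW coordinates are standard in 3D rendering pipelines.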
Custom rasterizers for primitives such as splines, SubDiv surfaces and patches can be included in the design. This approach also allows the inclusion of custom pipeline stages, custom geometry/pixel/framebuffer stages, custom tessellators, and custom instancing operations.

RV64X

The RV64X reference implementation includes:

Instruction/Data SRAM cache (32 kB)
Microcode SRAM (8 kB)
Dual function instruction decoder (hardwired for RV32V and X; microcoded instruction decoder for custom ISA)
Quad vector ALU (32 bits per ALU, fixed/float)
136-bit register file (1k elements)
Special function unit
Texture unit
Configurable local framebuffer

RV64X is a scalable architecture (Figure 2). Its fusion method is new, as is the use of configurable registers for custom data types. User-defined SRAM-based microcode can be used to implement extensions such as custom rasterizer stages, ray tracing, machine vision and machine learning. A single design can be applied to a stand-alone graphics microcontroller or a multi-core solution with scalable shader units.

Figure 2. The RV64X can scale from a simple low-end design (left) to a multi-core solution (right).
Graphics extensions to RISC-V address the burdens of scalability and multi-language support. This enables higher-level use cases that lead to more innovation.

What's next

The RV64X specification is still in early development and subject to change. A discussion forum is being established. The immediate goal is to build an example implementation using an instruction set simulator. This will be followed by an FPGA implementation that uses open-source IP and custom IP designed as an open-source project.

Disclaimer: This article is reproduced from "Semiconductor Industry Observation". It represents only the author's personal opinion, not the opinion of Sac Micro or the industry, and is reprinted for sharing only. To support the protection of intellectual property rights, please indicate the original source and author when reprinting. If there is any infringement, please contact us to have it removed.
