Accelerator Design

Let's first dive into the SNAX accelerator wrapper, which is the yellow box from the main figure in Architectural Overview. The figure below shows the same shell with more signal details:

[Figure: SNAX accelerator wrapper with signal details]

We labeled a few important details about the shell:

(1) - The entire SNAX accelerator wrapper encapsulates the streamers, the CSR manager, and the accelerator data path.

(2) - The CSR manager handles the CSR read and write transactions between the Snitch core and the accelerator data path. Towards the Snitch core side, CSR transactions are handled with decoupled interfaces (csr_req_* and csr_rsp_*) of requests and responses. Towards the accelerator side, the read-write registers (register_rw_set) use a decoupled interface (register_rw_valid and register_rw_ready) while the read-only registers (register_ro_set) use a direct mapping. There are more details in CSR Manager Design.

(3) - The streamer provides flexible data access from the L1 TCDM to the accelerator. It serves as an intermediate interface between the TCDM interconnect and the accelerator. On the interconnect side, the streamer controls the TCDM request and response (tcdm_req and tcdm_rsp) interfaces towards the memory. On the accelerator side, the streamer has its own decoupled data interfaces (acc2stream_* and stream2acc_*). The accelerator-to-streamer ports (acc2stream_*) are write-only, while the streamer-to-accelerator ports (stream2acc_*) are read-only. More details are in Streamer Design.

(4) - The accelerator data path is the focus of this section. In our example, we will use a very simple ALU datapath with basic operations only. The SNAX ALU is already built for you. You can find it under the ./hw/snax_alu/ directory. Take time to check the simple design.

SNAX ALU Datapath

The figure below shows the SNAX ALU datapath in more detail:

[Figure: SNAX ALU datapath]

You can find all accelerator files under the ./hw/snax_alu/src/ directory. The main files are:

  • snax_alu_pe.sv as the SNAX ALU Processing Element (PE)
  • snax_alu_csr.sv as the main control-status-register (CSR) component
  • snax_alu_shell_wrapper.sv as the top-level shell wrapper encapsulating the SNAX PEs, SNAX CSRs, and glue logic to interface between the CSR manager and the SNAX streamer.

Again, we label points of interest with numbers. We discuss them bottom-up to see how the accelerator is built:

(1) SNAX ALU Processing Elements (PE)

The snax_alu_pe is the computing unit of the accelerator. The PE can do addition (+), subtraction (-), multiplication (X), and bit-wise XOR (^). Each processing element takes two data inputs, a and b. Each input has a parameterized data width, DataWidth, which defaults to 64. The output c is 2*DataWidth bits wide to accommodate the multiplication result. The other operations leave the upper bits at 0.

(2) The inputs and outputs of the PE have a simple decoupled interface (valid-ready protocol). Each port consists of a single data channel. The valid signal of each input comes from the streamers and is high when the data is valid. The ready signal of the inputs depends on the busy status register. When the valid signals of inputs a and b are both high, the PE combinationally asserts the valid signal of output c. The ready signal of c comes from the streamer when it is ready to write data into the TCDM memory. The entire PE is fully combinational.
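
To make the PE behavior concrete, here is a small Python behavioral sketch (not the actual RTL in snax_alu_pe.sv). The function names pe_compute and pe_handshake are hypothetical. The mode encoding follows the mode CSR described in this section, and the sketch assumes that add and subtract wrap at DataWidth bits so that the upper output bits stay 0, as the text suggests.

```python
DATA_WIDTH = 64  # default DataWidth from the text

# Mode encoding as listed for the `mode` CSR in this section.
MODE_ADD, MODE_SUB, MODE_MUL, MODE_XOR = 0, 1, 2, 3


def pe_compute(mode, a, b, data_width=DATA_WIDTH):
    """Return output c (at most 2*data_width bits) for inputs a and b."""
    mask = (1 << data_width) - 1
    a &= mask
    b &= mask
    if mode == MODE_ADD:
        return (a + b) & mask  # assumption: wraps, upper bits stay 0
    if mode == MODE_SUB:
        return (a - b) & mask  # assumption: wraps, upper bits stay 0
    if mode == MODE_MUL:
        return a * b           # full product needs up to 2*data_width bits
    if mode == MODE_XOR:
        return a ^ b
    raise ValueError(f"unsupported mode: {mode}")


def pe_handshake(a_valid, b_valid, busy):
    """Combinational handshake outputs of one PE (hypothetical helper).

    c_valid is high only when both inputs are valid; the input ready
    signals simply follow the busy status.
    """
    c_valid = bool(a_valid and b_valid)
    in_ready = bool(busy)
    return c_valid, in_ready
```

For example, pe_compute(MODE_MUL, 3, 5) returns 15, and multiplying two all-ones 64-bit inputs still fits in the 128-bit output.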

(3) SNAX ALU CSR Register Set

The snax_alu_csr is a control and status register set with signals to modify the operation of the ALU PEs. It also contains a busy status signal and a simple performance counter. The table below shows the register set with addresses, type of register (RW for read-write and RO for read-only), and functional descriptions.

| register name | register addr | type | description |
|---------------|---------------|------|-------------|
| mode          | 0             | RW   | Operating modes: 0 - add, 1 - sub, 2 - mul, 3 - XOR |
| length        | 1             | RW   | Number of elements to process |
| start         | 2             | RW   | Set 1 to LSB only to start the accelerator |
| busy          | 3             | RO   | Busy status. 1 - busy, 0 - idle |
| perf. counter | 4             | RO   | Performance counter indicating number of cycles |

RW registers can be read or written from the Snitch core's perspective. From the accelerator's perspective, the values of these registers are input signals, which can be used for configuration and for start signals that go to the main data path.

RO registers are read-only from the Snitch core's perspective. From the accelerator's perspective, the values of these registers are output signals. These are mostly used for monitoring purposes like status or performance counters.

The mode signal is broadcast to all PEs to configure the operation that each PE performs. The busy signal acts as an active state and is also broadcast to all PEs. When it is high, the PEs set their input ready signals high to allow data to stream continuously.

From the outside, a CSR manager (in this case our SNAX CSR manager) handles the read-and-write transactions from and to the accelerator's CSR register set. The snax_alu_csr also uses a decoupled interface but with all RW channels linked to the accelerator. The RO channels are wired directly without any decoupled interface. It is up to the accelerator designer to handle these operations.
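
The register set above can be sketched as a small Python behavioral model. This is only an illustration under stated assumptions, not the logic of snax_alu_csr.sv. The class name AluCsrModel and its methods are hypothetical. The sketch assumes that busy goes high when start is written with LSB 1 and a non-zero length, that busy clears after length output transfers, and that the performance counter counts cycles while busy and resets on start. None of these details are stated in this section.

```python
class AluCsrModel:
    """Hypothetical behavioral model of the SNAX ALU register set."""

    # Register addresses from the table above
    MODE, LENGTH, START, BUSY, PERF_COUNTER = 0, 1, 2, 3, 4

    def __init__(self):
        self.rw = [0, 0, 0]      # mode, length, start
        self.busy = 0
        self.perf_counter = 0
        self._remaining = 0      # internal: transfers left (assumption)

    def write_rw(self, mode, length, start):
        """One accepted valid-ready transfer carrying all RW registers."""
        self.rw = [mode, length, start]
        # Assumption: only the LSB of start launches the accelerator.
        if (start & 1) and not self.busy and length > 0:
            self.busy = 1
            self._remaining = length
            self.perf_counter = 0

    def tick(self, transfer_done):
        """Advance one clock cycle; transfer_done marks an output handshake."""
        if not self.busy:
            return
        self.perf_counter += 1
        if transfer_done:
            self._remaining -= 1
            if self._remaining == 0:
                self.busy = 0

    def read_ro(self):
        """RO registers are wired out directly: [busy, perf_counter]."""
        return [self.busy, self.perf_counter]
```

A typical use is to call write_rw once to configure and start, then call tick every cycle and poll read_ro until busy returns to 0.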

(4) SNAX ALU Shell Wrapper

The snax_alu_shell_wrapper is the main wrapper encapsulating the processing elements, the CSR register set (snax_alu_csr), and the glue logic to connect to the CSR manager and the streamers. The top-level shell has configurable parameters tabulated below:

| parameter    | description                             | default value |
|--------------|-----------------------------------------|---------------|
| RegRWCount   | Number of RW registers                  | 3             |
| RegROCount   | Number of RO registers                  | 2             |
| NumPE        | Number of parallel PEs                  | 4             |
| DataWidth    | Input data width of each PE             | 64            |
| OutDataWidth | Output data width of each PE            | DataWidth*2   |
| RegDataWidth | Data width of each register             | 32            |
| RegAddrwidth | Address width for selecting a register  | 32            |

For the CSR side, we have the RW and RO ports directly connecting the snax_alu_csr to the CSR manager. The RW ports have a decoupled interface to properly handle the register writes. The RO ports are directly wired to the CSR manager. The shell module's ports for the CSR are:

//-------------------------------
// CSR manager ports
//-------------------------------
input  logic [RegRWCount-1:0][RegDataWidth-1:0] csr_reg_set_i,
input  logic                                    csr_reg_set_valid_i,
output logic                                    csr_reg_set_ready_o,
output logic [RegROCount-1:0][RegDataWidth-1:0] csr_reg_ro_set_o

To visualize this better, note that the CSR register ports are packed signals. Referring to the SNAX ALU's register table above, we can "unpack" the RW register port to see:

mode   = csr_reg_set_i[0];
length = csr_reg_set_i[1];
start  = csr_reg_set_i[2];

The same concept goes for the RO register port:

csr_reg_ro_set_o[0] = busy;
csr_reg_ro_set_o[1] = perf_counter;
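
As a rough illustration of what "packed" means here, the following Python sketch treats the RW port as one flat bit vector. In a SystemVerilog packed array declared as [RegRWCount-1:0][RegDataWidth-1:0], element 0 occupies the least significant RegDataWidth bits. The helper names pack_regs and unpack_regs are hypothetical.

```python
REG_DATA_WIDTH = 32  # default RegDataWidth from the parameter table


def pack_regs(regs, width=REG_DATA_WIDTH):
    """Pack a list of register values into one flat integer.

    regs[0] lands in the lowest `width` bits, regs[1] in the next, etc.
    """
    mask = (1 << width) - 1
    flat = 0
    for index, value in enumerate(regs):
        flat |= (value & mask) << (index * width)
    return flat


def unpack_regs(flat, count, width=REG_DATA_WIDTH):
    """Split a flat integer back into `count` register values."""
    mask = (1 << width) - 1
    return [(flat >> (index * width)) & mask for index in range(count)]


# Example: mode = 2 (mul), length = 16, start = 1
flat_rw = pack_regs([2, 16, 1])
mode, length, start = unpack_regs(flat_rw, count=3)
```

Here mode, length, and start come back out in the same index order as in the register table.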

(5) For the PEs that connect to an external data streamer, the PE data signals (both input and output) are concatenated together and use decoupled interfaces. The module ports are:

//-------------------------------
// Accelerator ports
//-------------------------------
// Note, we maintained the form of these signals
// just to comply with the top-level wrapper

// Ports from accelerator to streamer
output logic [(NumPE*OutDataWidth)-1:0] acc2stream_0_data_o,
output logic acc2stream_0_valid_o,
input  logic acc2stream_0_ready_i,

// Ports from streamer to accelerator
input  logic [(NumPE*DataWidth)-1:0] stream2acc_0_data_i,
input  logic stream2acc_0_valid_i,
output logic stream2acc_0_ready_o,

input  logic [(NumPE*DataWidth)-1:0] stream2acc_1_data_i,
input  logic stream2acc_1_valid_i,
output logic stream2acc_1_ready_o,

Where stream2acc and acc2stream indicate the input and output ports respectively. You could also treat stream2acc as read ports of the streamer and acc2stream as write ports to the streamer.

These ports are concatenated signals of the PEs. For example, suppose we have NumPE=4 PEs and each PE processes DataWidth bits per input port (2*DataWidth for the output). The SNAX streamer then uses a data width of 4*DataWidth for each of the inputs A (stream2acc_0) and B (stream2acc_1). We split A and B contiguously into the PE ports a and b, respectively. These are annotated with (2) and (5) in the figure. The output C is a concatenation of the c ports of all PEs.

Any user attaching their accelerator to the SNAX platform must create their own shell wrapper with the correct CSR manager and streamer interfaces. This shell should serve as an example of how to attach the interfaces.
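
The splitting and concatenation can be sketched in Python as follows. This is a behavioral illustration only. It assumes that PE i takes bits [i*DataWidth, (i+1)*DataWidth) of each input word, which is one natural reading of "contiguously" but is not confirmed by this section. The helper names are hypothetical, and the per-lane operation is passed in as a plain function.

```python
def split_lanes(word, num_pe, width):
    """Split a concatenated word into num_pe lanes of `width` bits each."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(num_pe)]


def concat_lanes(lanes, width):
    """Concatenate lanes into one word, lane 0 in the lowest bits."""
    mask = (1 << width) - 1
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & mask) << (i * width)
    return word


def alu_array_step(op, a_word, b_word, num_pe=4, data_width=64):
    """Apply `op` lane by lane and return the concatenated output word C."""
    out_width = 2 * data_width  # OutDataWidth default
    a_lanes = split_lanes(a_word, num_pe, data_width)
    b_lanes = split_lanes(b_word, num_pe, data_width)
    c_lanes = [op(a, b) for a, b in zip(a_lanes, b_lanes)]
    return concat_lanes(c_lanes, out_width)


# Example: four multiplications performed in parallel
A = concat_lanes([1, 2, 3, 4], 64)   # stream2acc_0 data
B = concat_lanes([5, 6, 7, 8], 64)   # stream2acc_1 data
C = alu_array_step(lambda a, b: a * b, A, B)
c_lanes = split_lanes(C, 4, 128)     # acc2stream_0 data, per PE
```

With these inputs, each entry of c_lanes is the product of the matching a and b lanes.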

What is a Decoupled Interface Anyway?

The decoupled interface uses two signals: valid and ready with the following rules.

  • The initiator asserts valid. The assertion of valid must not depend on ready. However, ready may depend on valid.
  • Once valid has been asserted, valid and all data must remain stable until the transaction completes.
  • The receiver asserts ready whenever it is ready to receive the transaction.
  • When both valid and ready are high the transaction is successful.

This interface is common in the AXI protocol. You can find more details here.
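
The rules above can be illustrated with a tiny cycle-by-cycle Python simulation. It is a sketch of the protocol only, not of any SNAX module, and the function name simulate_handshake is hypothetical. The initiator keeps valid high and holds its data stable until the receiver's ready is seen high in the same cycle.

```python
def simulate_handshake(items, ready_pattern):
    """Simulate an initiator sending `items` over a valid-ready channel.

    ready_pattern gives the receiver's ready signal for each cycle.
    Returns (received, trace), where trace holds one
    (cycle, valid, ready, data) tuple per simulated cycle.
    """
    received = []
    trace = []
    index = 0
    for cycle, ready in enumerate(ready_pattern):
        valid = index < len(items)           # valid never depends on ready
        data = items[index] if valid else None
        trace.append((cycle, valid, ready, data))
        if valid and ready:                  # transaction succeeds
            received.append(data)
            index += 1                       # move on to the next item
        # otherwise the same data is presented again next cycle
    return received, trace


received, trace = simulate_handshake(
    items=[10, 20, 30],
    ready_pattern=[False, True, True, False, True],
)
```

In the trace, the first item stays on the data lines during the stalled cycle 0 and is only transferred in cycle 1, when ready goes high.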

Some Exercises!!

For the CSR interface connecting to the CSR manager, what are the important ports?
We have the RW ports: `csr_reg_set_i`, `csr_reg_set_valid_i`, and `csr_reg_set_ready_o`. Then, we have the RO port: `csr_reg_ro_set_o`. The RW ports use a valid-ready handshake, while the RO port is wired directly.

What are the important ports of the accelerator <-> streamer interfaces?
We have reader ports tagged with `stream2acc` names and writer ports tagged with `acc2stream` names. They also use a valid-ready handshake.

What is the decoupled interface?
The decoupled interface is a valid-ready protocol. A transaction is only successful when both valid and ready are high.

Adding Your Accelerator to the Configuration File

You can add your accelerator configurations in a configuration file. The SNAX ALU configuration snax-alu.hjson has core templates which configure the Snitch core and how it connects to an accelerator. The snax-alu core template is:

// SNAX Accelerator Core Templates
snax_alu_core_template: {
    isa: "rv32imafd",
    xssr: true,
    xfrep: true,
    xdma: false,
    xf16: true,
    xf16alt: true,
    xf8: true,
    xf8alt: true,
    xfdotp: true,
    xfvec: true,
    snax_acc_cfg: {
        snax_acc_name: "snax_alu",
        bender_target: ["snax_alu"],
        snax_tcdm_ports: 16,
        snax_num_rw_csr: 3,
        snax_num_ro_csr: 2,
        snax_streamer_cfg: {$ref: "#/snax_alu_csr_streamer_template" }
    },
    num_int_outstanding_loads: 1,
    num_int_outstanding_mem: 4,
    num_fp_outstanding_loads: 4,
    num_fp_outstanding_mem: 4,
    num_sequencer_instructions: 16,
    num_dtlb_entries: 1,
    num_itlb_entries: 1,
    // Enable division/square root unit
    // Xdiv_sqrt: true,
},

The options before snax_acc_cfg pertain to the Snitch core configuration, in particular which ISA to use and which additional features it includes. You would usually keep these at their defaults.

The snax_acc_cfg contains the configurations for the accelerator. The configuration definitions are:

  • snax_acc_name: The name appended to the different wrappers discussed in the Building the System section.
  • bender_target: The bender target name that you will use later in the Building the System section.
  • snax_tcdm_ports: The number of tightly coupled data memory (TCDM) ports that your accelerator needs.
  • snax_num_rw_csr: The number of read-write (RW) registers your accelerator has. This affects the connection ports of the CSR manager. More details are in SNAX CSR Manager.
  • snax_num_ro_csr: The number of read-only (RO) registers your accelerator has. This affects the connection ports of the CSR manager. More details are in SNAX CSR Manager.
  • snax_streamer_cfg: Contains the settings for your streamer. More details are in SNAX Streamer.
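
One way to sanity-check these values against the shell parameters is sketched below in Python. The dictionaries mirror values from this section. The 64-bit TCDM port width and the idea that the port count equals the total streamer data width divided by that port width are assumptions that are not stated here, although they do reproduce the value 16 used in the template. The helper names are hypothetical.

```python
# Values copied from the snax_acc_cfg block of the core template
snax_acc_cfg = {
    "snax_acc_name": "snax_alu",
    "snax_tcdm_ports": 16,
    "snax_num_rw_csr": 3,
    "snax_num_ro_csr": 2,
}

# Default shell parameters from the parameter table
shell_params = {
    "RegRWCount": 3,
    "RegROCount": 2,
    "NumPE": 4,
    "DataWidth": 64,
    "OutDataWidth": 128,
}

TCDM_PORT_WIDTH = 64  # assumption: bits per TCDM port


def tcdm_ports_needed(params, port_width=TCDM_PORT_WIDTH):
    """Estimate TCDM ports from two input streams and one output stream."""
    in_bits = 2 * params["NumPE"] * params["DataWidth"]
    out_bits = params["NumPE"] * params["OutDataWidth"]
    return -(-(in_bits + out_bits) // port_width)  # ceiling division


def check_config(cfg, params):
    """Return a list of mismatches between the config and the shell."""
    problems = []
    if cfg["snax_num_rw_csr"] != params["RegRWCount"]:
        problems.append("snax_num_rw_csr does not match RegRWCount")
    if cfg["snax_num_ro_csr"] != params["RegROCount"]:
        problems.append("snax_num_ro_csr does not match RegROCount")
    if cfg["snax_tcdm_ports"] != tcdm_ports_needed(params):
        problems.append("snax_tcdm_ports does not match the streamer widths")
    return problems


problems = check_config(snax_acc_cfg, shell_params)
```

An empty problems list means the template and the default shell parameters agree under these assumptions.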

Note

At the top of the configuration file, you will also see the cluster bender target name: bender_target: ["snax_alu_cluster"]. You need to set this at the top of your own configuration file too, so that your cluster has its own unique name and the generated bender targets match.

You can find more details in the Hardware Schema file.