Accelerator Design
Let's first dive into the SNAX accelerator wrapper which is the yellow box from the main figure in Architectural Overview. The figure below is the same shell but with more signal details:
We labeled a few important details about the shell:
(1) - The entire SNAX accelerator wrapper encapsulates the streamers, the CSR manager, and the accelerator data path.
(2) - The CSR manager handles the CSR read and write transactions between the Snitch core and the accelerator data path. Towards the Snitch core side, CSR transactions are handled with decoupled request and response interfaces (`csr_req_*` and `csr_rsp_*`). Towards the accelerator side, the read-write registers (`register_rw_set`) use a decoupled interface (`register_rw_valid` and `register_rw_ready`) while the read-only registers (`register_ro_set`) use a direct mapping. There are more details in CSR Manager Design.
(3) - The streamer provides flexible data access from the L1 TCDM to the accelerator. It serves as an intermediate interface between the TCDM interconnect and the accelerator. On the interconnect side, the streamer controls the TCDM request and response interfaces (`tcdm_req` and `tcdm_rsp`) towards the memory. On the accelerator side, the streamer has its own decoupled data interfaces (`acc2stream_*` and `stream2acc_*`). The ports from the accelerator to the streamer are write-only, while the ports from the streamer to the accelerator are read-only. More details are in Streamer Design.
(4) - The accelerator data path is the focus of this section. In our example, we will use a very simple ALU datapath with basic operations only. The SNAX ALU is already built for you. You can find it under the `./hw/snax_alu/` directory. Take time to check the simple design.
SNAX ALU Datapath
The figure below shows the SNAX ALU datapath in more detail:
You can find all accelerator files under the `./hw/snax_alu/src/` directory. The main files are:
- `snax_alu_pe.sv` as the SNAX ALU Processing Element (PE)
- `snax_alu_csr.sv` as the main control-status-register (CSR) component
- `snax_alu_shell_wrapper.sv` as the top-level shell wrapper encapsulating the SNAX PEs, SNAX CSRs, and glue logic to interface between the CSR manager and the SNAX streamer.
Again, we label points of interest with numbers. We facilitate the discussion in a bottom-up approach to see how the accelerator is built:
(1) SNAX ALU Processing Elements (PE)
The `snax_alu_pe` is the computing unit of the accelerator. The PE can do addition (+), subtraction (-), multiplication (X), and a bit-wise XOR (^). Each processing element takes in two data inputs, `a` and `b`. Each input has a parameterized data width, `DataWidth`, which defaults to `DataWidth=64`. The output is `c`, with a data width of `2*DataWidth` to accommodate the multiplication. The other operations leave the upper bits at 0.
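The PE behaviour described above can be sketched as a small software model. This is a minimal Python sketch, not the RTL in `snax_alu_pe.sv`: the names `Mode` and `alu_pe` are hypothetical, the mode numbering is borrowed from the CSR register table in this section, and the wrap-around of add and subtract results to `DataWidth` bits is an assumption made so that the upper bits stay at 0.

```python
from enum import IntEnum

DATA_WIDTH = 64  # default DataWidth from the text


class Mode(IntEnum):
    """Operating modes, numbered as in the CSR table (assumed to match the PE)."""
    ADD = 0
    SUB = 1
    MUL = 2
    XOR = 3


def alu_pe(a: int, b: int, mode: int, data_width: int = DATA_WIDTH) -> int:
    """Behavioural model of one PE: two unsigned inputs, one double-width output."""
    in_mask = (1 << data_width) - 1
    a &= in_mask
    b &= in_mask
    if mode == Mode.MUL:
        # Only multiplication can fill the upper half of the output c.
        return a * b
    if mode == Mode.ADD:
        result = a + b
    elif mode == Mode.SUB:
        result = a - b
    elif mode == Mode.XOR:
        result = a ^ b
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: non-multiply results wrap to data_width bits,
    # so the upper half of the output stays at 0.
    return result & in_mask
```

For example, `alu_pe(3, 4, Mode.ADD)` gives 7, while multiplying two full-width operands produces a result that needs all `2*DataWidth` output bits.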
(2) The inputs and outputs of the PE have a simple decoupled interface (valid-ready protocol). Each port consists of a single `data` channel. The valid signals of the inputs come from the streamers when the data is valid. The input ready signals depend on the busy status register. When the valid signals of inputs `a` and `b` are both high, the PE combinationally sets the valid signal of output `c`. The ready signal of `c` comes from the streamer when it is ready to write data into the TCDM memory. The entire PE is fully combinational.
(3) SNAX ALU CSR Register Set
The `snax_alu_csr` is a control and status register set with signals to modify the operation of the ALU PEs. It also contains a busy status signal and a simple performance counter. The table below shows the register set with addresses, type of register (`RW` for read-write and `RO` for read-only), and functional descriptions.
register name | register addr | type | description |
---|---|---|---|
mode | 0 | RW | Operating modes: 0 - add, 1 - sub, 2 - mul, 3 - XOR |
length | 1 | RW | Number of elements to process |
start | 2 | RW | Set 1 to LSB only to start the accelerator |
busy | 3 | RO | Busy status. 1 - busy, 0 - idle |
perf. counter | 4 | RO | Performance counter indicating number of cycles |
RW registers can be read or written from the Snitch core's perspective. The values of these registers are input signals from the accelerator's perspective, which can be used for configurations and start signals that go to the main data path.
RO registers are read-only from the Snitch core's perspective. The values of these registers are output signals from the accelerator's perspective. These are mostly used for monitoring purposes like status or performance counters.
The mode signal is broadcast to all PEs to configure the kernel that each PE processes. The busy signal acts as an active state and is also broadcast to all PEs. If it is high, the PEs set their input ready signals high to allow data to stream continuously.
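To make the RW/RO split concrete, here is a minimal Python sketch of the register set as seen from both sides. The class `CsrSetModel` and its method names are hypothetical and only illustrate the access rules. Raising an error on a core write to an RO register is a modelling choice, since this section does not describe how the hardware responds in that case.

```python
REG_DATA_WIDTH = 32  # RegDataWidth default from the shell parameter table

# name -> (address, type), copied from the register table
REGISTER_MAP = {
    "mode": (0, "RW"),
    "length": (1, "RW"),
    "start": (2, "RW"),
    "busy": (3, "RO"),
    "perf_counter": (4, "RO"),
}


class CsrSetModel:
    """Toy model of the register set; not the snax_alu_csr implementation."""

    def __init__(self):
        self._type_at = {addr: kind for addr, kind in REGISTER_MAP.values()}
        self._value_at = {addr: 0 for addr in self._type_at}
        self._mask = (1 << REG_DATA_WIDTH) - 1

    def core_write(self, addr, value):
        """Write from the Snitch core side; only RW registers accept it."""
        if self._type_at[addr] != "RW":
            # Modelling choice only: hardware behaviour is not specified here.
            raise PermissionError(f"register {addr} is read-only for the core")
        self._value_at[addr] = value & self._mask

    def core_read(self, addr):
        """Read from the Snitch core side; allowed for RW and RO registers."""
        return self._value_at[addr]

    def accelerator_set(self, name, value):
        """Update from the accelerator side; only RO registers are driven by it."""
        addr, kind = REGISTER_MAP[name]
        if kind != "RO":
            raise ValueError(f"{name} is driven by the core, not the accelerator")
        self._value_at[addr] = value & self._mask
```

A core would typically write `mode` and `length`, set `start`, and then poll `busy` at address 3 until the accelerator clears it.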
From the outside, a CSR manager (in this case our SNAX CSR manager) handles the read-and-write transactions from and to the accelerator's CSR register set. The `snax_alu_csr` also uses a decoupled interface, with all RW channels linked to the accelerator. The RO channels are wired directly without any decoupled interface. It is up to the accelerator designer to handle these operations.
(4) SNAX ALU Shell Wrapper
The `snax_alu_shell_wrapper` is the main wrapper encapsulating the processing elements, the CSR register set (`snax_alu_csr`), and the glue logic to connect to the CSR manager and the streamers. The top-level shell has configurable parameters tabulated below:
parameter | description | default values |
---|---|---|
RegRWCount | Number of RW registers | 3 |
RegROCount | Number of RO registers | 2 |
NumPE | Number of parallel PEs | 4 |
DataWidth | Input data width of each PE | 64 |
OutDataWidth | Output data width of each PE | DataWidth*2 |
RegDataWidth | Data width of each register | 32 |
RegAddrwidth | Address width for selecting a register | 32 |
For the CSR side, we have the RW and RO ports directly connecting the `snax_alu_csr` to the CSR manager. The RW ports have a decoupled interface to properly handle the register writes. The RO ports are directly wired to the CSR manager. The shell module's ports for the CSR are:
//-------------------------------
// CSR manager ports
//-------------------------------
input logic [RegRWCount-1:0][RegDataWidth-1:0] csr_reg_set_i,
input logic csr_reg_set_valid_i,
output logic csr_reg_set_ready_o,
output logic [RegROCount-1:0][RegDataWidth-1:0] csr_reg_ro_set_o
To visualize this better, note that the CSR register ports are packed signals. Referring to the SNAX ALU register table above, we can "unpack" them as follows:
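// RW registers: driven by the CSR manager, read by the accelerator
mode   = csr_reg_set_i[0];
length = csr_reg_set_i[1];
start  = csr_reg_set_i[2];
// RO registers: driven by the accelerator, read by the CSR manager
csr_reg_ro_set_o[0] = busy;
csr_reg_ro_set_o[1] = perf_counter;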
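If the packed port is viewed as one flat bit vector, element 0 of a SystemVerilog packed array sits in the least-significant bits. The following Python sketch shows the same indexing with plain integers. The helper names `pack_regs` and `unpack_regs` are hypothetical and exist only for illustration.

```python
REG_DATA_WIDTH = 32  # RegDataWidth default from the parameter table


def pack_regs(values, width=REG_DATA_WIDTH):
    """Concatenate register values into one integer, element 0 in the low bits."""
    mask = (1 << width) - 1
    packed = 0
    for index, value in enumerate(values):
        packed |= (value & mask) << (index * width)
    return packed


def unpack_regs(packed, count, width=REG_DATA_WIDTH):
    """Slice a flat integer back into `count` registers of `width` bits each."""
    mask = (1 << width) - 1
    return [(packed >> (index * width)) & mask for index in range(count)]


# RW set in register-table order: mode = 2 (mul), length = 16, start = 1
rw_set = pack_regs([2, 16, 1])
mode, length, start = unpack_regs(rw_set, count=3)
```

With `RegRWCount=3` and `RegDataWidth=32`, the whole RW set is a 96-bit vector in which `start` occupies bits 95 down to 64.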
(5) For the PEs that connect to an external data streamer, the PE data signals (both input and output) are concatenated together, and each data channel uses a decoupled interface. The module ports are:
//-------------------------------
// Accelerator ports
//-------------------------------
// Note, we maintained the form of these signals
// just to comply with the top-level wrapper
// Ports from accelerator to streamer
output logic [(NumPE*OutDataWidth)-1:0] acc2stream_0_data_o,
output logic acc2stream_0_valid_o,
input logic acc2stream_0_ready_i,
// Ports from streamer to accelerator
input logic [(NumPE*DataWidth)-1:0] stream2acc_0_data_i,
input logic stream2acc_0_valid_i,
output logic stream2acc_0_ready_o,
input logic [(NumPE*DataWidth)-1:0] stream2acc_1_data_i,
input logic stream2acc_1_valid_i,
output logic stream2acc_1_ready_o,
Here, `stream2acc` and `acc2stream` indicate the input and output ports, respectively. You could also treat `stream2acc` as read ports of the streamer and `acc2stream` as write ports to the streamer.
These ports are concatenated signals of the PEs. For example, consider the case where we have `NumPE=4` PEs and each PE processes `DataWidth` bits per input port (`2*DataWidth` for the output). The SNAX streamer then uses a data width of `4*DataWidth` for both inputs `A` (`stream2acc_0`) and `B` (`stream2acc_1`). We split `A` and `B` contiguously into ports `a` and `b`, respectively. These are annotated with (2) and (5) in the figure. The output `C` is a concatenation of each `c` port.
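The split-and-concatenate step can be sketched in Python using multiplication as the per-PE operation. This is an illustrative model rather than the shell's RTL: the function names are hypothetical, and the assumption that PE 0 takes the least-significant lane is not stated in this section.

```python
NUM_PE = 4
DATA_WIDTH = 64
OUT_DATA_WIDTH = 2 * DATA_WIDTH


def split_lanes(word, num_lanes, lane_width):
    """Split a wide word into contiguous lanes (assumed: lane 0 is the low bits)."""
    mask = (1 << lane_width) - 1
    return [(word >> (i * lane_width)) & mask for i in range(num_lanes)]


def concat_lanes(lanes, lane_width):
    """Concatenate lanes back into one wide word, lane 0 in the low bits."""
    mask = (1 << lane_width) - 1
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & mask) << (i * lane_width)
    return word


def shell_multiply(a_word, b_word):
    """Model the shell: split A and B, multiply per PE, concatenate the outputs."""
    a_lanes = split_lanes(a_word, NUM_PE, DATA_WIDTH)
    b_lanes = split_lanes(b_word, NUM_PE, DATA_WIDTH)
    c_lanes = [a * b for a, b in zip(a_lanes, b_lanes)]
    # Each output lane is twice as wide as an input lane.
    return concat_lanes(c_lanes, OUT_DATA_WIDTH)


a_word = concat_lanes([1, 2, 3, 4], DATA_WIDTH)
b_word = concat_lanes([5, 6, 7, 8], DATA_WIDTH)
c_lanes = split_lanes(shell_multiply(a_word, b_word), NUM_PE, OUT_DATA_WIDTH)
```

Here `A` and `B` are 256-bit words and `C` is a 512-bit word, matching the `NumPE*DataWidth` and `NumPE*OutDataWidth` port widths in the module ports.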
Any user attaching their accelerator to the SNAX platform must create their own shell wrapper with the correct CSR manager and streamer interfaces. This shell should serve as an example of how to attach the interfaces.
What is a Decoupled Interface Anyway?
The decoupled interface uses two signals, `valid` and `ready`, with the following rules:
- The initiator asserts `valid`. The assertion of `valid` must not depend on `ready`. However, `ready` may depend on `valid`.
- Once `valid` has been asserted, all data must remain stable.
- The receiver asserts `ready` whenever it is ready to receive the transaction.
- When both `valid` and `ready` are high, the transaction is successful.
This interface is common in the AXI protocol. You can find more details here.
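The handshake rules can be illustrated with a small cycle-by-cycle model. This is a minimal Python sketch with a hypothetical function name, `run_producer`. The receiver's `ready` is supplied as a per-cycle trace, the producer raises `valid` whenever it has data, and it holds that data stable until the transfer completes.

```python
def run_producer(items, ready_trace):
    """Simulate a valid-ready link and return the (cycle, data) transfers.

    items:       data the initiator wants to send, in order
    ready_trace: the receiver's ready signal for each simulated cycle
    """
    pending = list(items)
    transfers = []
    for cycle, ready in enumerate(ready_trace):
        # valid depends only on having data, never on ready.
        valid = bool(pending)
        # While valid is high and not yet accepted, the same data is offered.
        data = pending[0] if valid else None
        if valid and ready:
            # Both high in the same cycle: the transaction is successful.
            transfers.append((cycle, data))
            pending.pop(0)
    return transfers


# The receiver stalls in cycles 0, 2, and 3.
log = run_producer([10, 20, 30], ready_trace=[0, 1, 0, 0, 1, 1])
```

In this run the first item waits one cycle and the second waits two, but no item is dropped or reordered because the producer never advances without seeing `ready`.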
Some Exercises!!
For the CSR interface connecting to the CSR manager, what are the important ports?
We have the RW request and response ports: `csr_reg_set_i`, `csr_reg_set_valid_i`, and `csr_reg_set_ready_o`. Then, we have the RO port: `csr_reg_ro_set_o`. These use the valid-ready protocol, except for the RO port.
What are the important ports of the accelerator <-> streamer interfaces?
We have reader ports tagged with `stream2acc` names and writer ports tagged with `acc2stream` names. They also use the valid-ready protocol.
What is the decoupled interface?
The decoupled interface uses a valid-ready protocol. A transaction is only successful when both valid and ready are high.
Adding Your Accelerator to the Configuration File
You can add your accelerator configurations in a configuration file. The SNAX ALU configuration `snax-alu.hjson` has core templates which configure the Snitch core and how it connects to an accelerator. The `snax-alu` core template is:
// SNAX Accelerator Core Templates
snax_alu_core_template: {
isa: "rv32imafd",
xssr: true,
xfrep: true,
xdma: false,
xf16: true,
xf16alt: true,
xf8: true,
xf8alt: true,
xfdotp: true,
xfvec: true,
snax_acc_cfg: {
snax_acc_name: "snax_alu",
bender_target: ["snax_alu"],
snax_tcdm_ports: 16,
snax_num_rw_csr: 3,
snax_num_ro_csr: 2,
snax_streamer_cfg: {$ref: "#/snax_alu_csr_streamer_template" }
},
num_int_outstanding_loads: 1,
num_int_outstanding_mem: 4,
num_fp_outstanding_loads: 4,
num_fp_outstanding_mem: 4,
num_sequencer_instructions: 16,
num_dtlb_entries: 1,
num_itlb_entries: 1,
// Enable division/square root unit
// Xdiv_sqrt: true,
},
The settings outside `snax_acc_cfg` pertain to the Snitch core configuration, particularly which ISA to use and which additional features it includes. You would usually keep these at their defaults.
The `snax_acc_cfg` contains the configurations for the accelerator. The configuration definitions are:
- `snax_acc_name`: The name appended to the different wrappers discussed in the Building the System section.
- `bender_target`: The bender target name that you will use later in the Building the System section.
- `snax_tcdm_ports`: The number of tightly coupled data memory (TCDM) ports that your accelerator needs.
- `snax_num_rw_csr`: The number of read-write (RW) registers your accelerator has. This affects the connection ports of the CSR manager. More details are in SNAX CSR Manager.
- `snax_num_ro_csr`: The number of read-only (RO) registers your accelerator has. This affects the connection ports of the CSR manager. More details are in SNAX CSR Manager.
- `snax_streamer_cfg`: Contains the settings for your streamer. More details are in SNAX Streamer.
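As a sanity check, the numeric settings can be compared against the hardware described earlier. The Python sketch below is one plausible way to do that, not an official SNAX tool. The function names are hypothetical, and the 64-bit TCDM port width is an assumption that this section does not state. Under that assumption, two 256-bit read streams and one 512-bit write stream need 4 + 4 + 8 = 16 ports, which agrees with `snax_tcdm_ports: 16`.

```python
# Values taken from the shell parameter table and the core template.
NUM_PE = 4
DATA_WIDTH = 64
OUT_DATA_WIDTH = 2 * DATA_WIDTH
CONFIG = {"snax_tcdm_ports": 16, "snax_num_rw_csr": 3, "snax_num_ro_csr": 2}

# Register types in table order: mode, length, start, busy, perf. counter.
REGISTER_TYPES = ["RW", "RW", "RW", "RO", "RO"]

# Assumption (not stated in this section): each TCDM port is 64 bits wide.
TCDM_PORT_WIDTH = 64


def tcdm_ports_needed(stream_widths, port_width=TCDM_PORT_WIDTH):
    """Ports needed if every stream is covered by whole TCDM ports."""
    return sum((width + port_width - 1) // port_width for width in stream_widths)


def check_config(config):
    """Return the config keys that disagree with the hardware description."""
    streams = [NUM_PE * DATA_WIDTH, NUM_PE * DATA_WIDTH, NUM_PE * OUT_DATA_WIDTH]
    expected = {
        "snax_tcdm_ports": tcdm_ports_needed(streams),
        "snax_num_rw_csr": REGISTER_TYPES.count("RW"),
        "snax_num_ro_csr": REGISTER_TYPES.count("RO"),
    }
    return [key for key, value in expected.items() if config.get(key) != value]


mismatches = check_config(CONFIG)
```

An empty `mismatches` list means the three counts line up. If you change `NumPE` or add registers to your own accelerator, the corresponding configuration values would need to change with them.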
Note
At the top of the configuration file, you will also see the cluster bender target name:
bender_target: ["snax_alu_cluster"],
You need to set this at the top as well so that your cluster has its own unique name and the generated bender targets match.
You can find more details in the Hardware Schema file.