Summarizing the Steps
Congratulations! You're now an expert in using the SNAX shell. Let's review the major steps you worked on.
1 - Building your accelerator.
- We first built the toy SNAX ALU accelerator.
- The important component is to have an accelerator shell that complies with the interface of our system.
2 - Configuring the CSR manager for your accelerator.
- Specifying the planned number of read-write (RW) and read-only (RO) registers.
3 - Configuring the streamer for your accelerator.
- Specifying the number of read and write ports of your accelerator.
- Configuring the data widths, FIFO depths, even the spatial parallelism considerations, etc.
4 - Configuring the system according to your needs.
- Attaching the accelerator configurations.
- Specifying the memory sizes.
5 - Setting up the file list and makefiles.
- Modifying the
Bender.yml
to add the file list. - Modifying
Makefiles
6 - Programming your system.
- Making the C code.
- Making the data generation.
- Building the software.
- Run the RISC-V binary program
7 - Profiling your system.
- Using waveforms.
- Using stack tracing with
spike
.
Summarizing Commands
For generating the RTL files you do:
make CFG_OVERRIDE=cfg/snax-alu.hjson rtl-gen
For building hardware you do:
cd target/snitch_cluster
make CFG_OVERRIDE=cfg/snax-alu.hjson bin/snitch_cluster.vlt
For building software you do:
make CFG_OVERRIDE=cfg/snax-alu.hjson SELECT_RUNTIME=rtl-generic SELECT_TOOLCHAIN=llvm-generic sw
You Need an Exercise to Get Strong!!!
With everything you've learned, let's do a simple exercise for a new accelerator! The figure below shows the accelerator data path of interest. This accelerator is built for you already but you need to integrate it into the SNAX system.
You can get the snax_exercise
RTL files from the snax-exercise-tutorial
branch.
git checkout snax-exercise-tutorial
Accelerator Datapath Specifications
There are specific features for this accelerator:
-
Let's name the accelerator
snax_exercise
-
Name your accelerator configuration file as
snax_exercise.hjson
-
The accelerator processes the kernel fully combinationally. Written in pseudo-code as:
O = 0;
parfor (int i = 0; i < 8; i++):
O += A[i]*B[i]
O += bias
-
Take note that the
parfor
indicates that the MAC needs to be processed spatially in 8 parallel ports. -
The accelerator is an oversimplified MAC with a bias addition that takes in 16 inputs, 8 parallel ports for input A, and 8 parallel ports for input B.
-
Each input port takes in 64 bits of data. Each input port for A and B is multiplied together and accumulated at the end. This should be done fully-combinationally.
-
We also add a feature that adds a 64-bit fixed bias.
-
Since the bias is 64 bits wide, you need to set the upper and lower bits of the bias through CSR registers.
-
The output produces a single 128-bit output.
-
For the inputs and outputs, make sure to comply with the decoupled interface.
-
The CSR registers it has are tabulated below:
register name | register offset | type | description |
---|---|---|---|
upper bias | 0 | RW | Upper 32 bits of the bias. |
lower bias | 1 | RW | Lower 32 bits of the bias. |
num. of iter. | 2 | RW | Number of iterations to process. |
accelerator start | 3 | RW | Set 1 to LSB only to start the accelerator |
busy | 4 | RO | Busy status. 1 - busy, 0 - idle |
perf. counter | 5 | RO | Performance counter indicating number of cycles |
-
There is a
snax_exercise_top.sv
which is the top module already of the data path shown above. -
The control side signals are unpacked registers but with the labels per port. They should correspond to the register table above.
//-------------------------------
// Register RW from CSR manager
//-------------------------------
input logic [RegDataWidth-1:0] csr_rw_reg_upper_i,
input logic [RegDataWidth-1:0] csr_rw_reg_lower_i,
input logic [RegDataWidth-1:0] csr_rw_reg_len_i,
input logic [RegDataWidth-1:0] csr_rw_reg_start_i,
input logic csr_rw_reg_valid_i,
output logic csr_rw_reg_ready_o,
//-------------------------------
// Register RO to CSR manager
//-------------------------------
output logic [RegDataWidth-1:0] csr_ro_reg_busy_o,
output logic [RegDataWidth-1:0] csr_ro_reg_perf_count_o,
- The data ports are packed signals. Observe that we have the 8 input ports of size
DataWidth
. We have only 1 output port of size2*DataWidth
.
//-------------------------------
// Data path IO
//-------------------------------
input logic [7:0][DataWidth-1:0] a_i,
input logic a_valid_i,
output logic a_ready_o,
input logic [7:0][DataWidth-1:0] b_i,
input logic b_valid_i,
output logic b_ready_o,
output logic [2*DataWidth-1:0] out_o,
output logic out_valid_o,
input logic out_ready_i
CSR Manager and Streamer Specifications
For the CSR manager, you just need to ensure that the register configurations match that of the accelerator's register specs.
For the streamer, you have the following specifications:
-
You need to feed 512 bits for each input port A and B. Then you split them inside the accelerator's data path, just like in the
snax_alu
example. -
This necessitates 16 TCDM ports for all the inputs.
-
The output is just a single 128-bit port output and hence leads to 2 TCDM ports only.
-
This leads to a total of 18 TCDM ports for the streamer.
-
Use a depth of 16 for all FIFOs.
-
Let's keep 1 loop dimension only.
Your Goal
Since the accelerator is already prepared for you, your goals are to:
- Create the
snax_exercise_shell_wrapper.sv
that complies with the interface of the generatedsnax_exericse_wrapper.sv
. - Modify the necessary RTL setups:
snax_exercise.hjson
,Bender.yml
, and theMakefile
for handling HW builds. - Plan and create a simple C-code test, have your own data generation, and modify the necessary
Makefiles
for handling SW builds. - Run your code and see if it works!
- Have fun!