======
Stages
======

This document explains the concept of stages within the Stream framework. It describes the implemented stages and explains how to create your own.

Introduction
============

Stages make it easy to adapt the functionality of Stream in a modular way. Which stages are used, and the order in which they execute, determine what a run of the framework does. The sequence of stages is defined in the main file. An example follows:

.. code-block:: python

    mainstage = MainStage(
        [  # Initializes the MainStage as entry point
            AcceleratorParserStage,  # Parses the accelerator
            StreamONNXModelParserStage,  # Parses the ONNX Model into the workload
            LayerSplittingStage,  # Split the workload
            StreamONNXModelParserStage,  # Parses the potentially split ONNX model into the workload
            GenerateCNWorkloadHybridStage,  # Generate fine-grained CN workload graph
            IntraCoreMappingStage,  # Find the optimal CME for each valid layer-core assignment
            InterCoreMappingStage,  # Find the optimal layer-core assignment for the entire workload
        ],
        accelerator=accelerator,  # required by AcceleratorParserStage
        workload_path=workload_path,  # required by ModelParserStage
        mapping_path=mapping_path,  # required by ModelParserStage
        loma_lpf_limit=6,  # required by LomaEngine
        nb_ga_individuals=32,  # number of individuals in each genetic algorithm generation
        nb_ga_generations=100,  # number of genetic algorithm generations
        node_hw_performances_path=node_hw_performances_path,  # saved node_hw_performances to skip re-computation
        plot_hof=True,  # Save schedule and memory usage plot of each individual in the Genetic Algorithm hall of fame
        plot_file_name=plot_file_name,
        plot_full_schedule=plot_full_schedule,
        plot_data_transfer=plot_data_transfer,
        cn_define_mode=CN_define_mode,
        hint_loops=hint_loops,
        scheduler_candidate_selection="memory",
        operands_to_prefetch=[],
        split_onnx_model_path=split_onnx_model_path,
        split_W_double_buffered=split_W_double_buffered,
    )

    # Launch the MainStage
    scme, _ = mainstage.run()  # Run the MainStage
    scme = scme[0]  # Select one of the returned cost models for later inspection

Implemented stages
==================

This section is still being updated. For a missing description, please look at the stage requirements in ``__init__.py`` and the stage implementations in the ``stages`` folder.

.. _custom-stages-label:

The following stages are implemented in Stream:

CustomSpatialMappingGeneratorStage
----------------------------------

Stage that finds spatial mappings given an accelerator, a core allocation, the interconnection pattern on the allocated core, and a layer. The spatial mappings are derived from the interconnection pattern present on the core. The served dimensions of the innermost memory level are used, as this is how the memories connect to the operational array.

GenerateCNWorkloadHybridStage
-----------------------------

Stage that transforms the layer-by-layer workload into a finer-grained CN workload graph. Multiple modes are available through the ``cn_define_mode`` parameter in conjunction with the ``hint_loops`` parameter:

1. ``hint_loops`` specifies the outer-cn loops based on which the layer will be split.
2. ``hint_loops`` specifies the inner-cn loops. The outer-cn loops are all remaining loops.
3. ``hint_loops`` specifies a nested list of loops. ``layer_cutoffs`` specifies until which layer index each list of outer-cn loops is applicable.
4. ``hint_loops`` specifies the outer-cn loops. ``split_W_percentage`` specifies the maximal percentage the constant operands may occupy on the respective memories in the cores they can be allocated to. If multiple cores have a different constant operand memory capacity, the capacity is taken to be the smallest.
   If a layer has a larger footprint, it will be split in terms of output channels by appending the ``K`` loops to the ``hint_loops``.

InterCoreMappingStage
---------------------

Stage that finds the best inter-core mapping using a genetic algorithm. It receives the ``node_hw_performances`` from the ``IntraCoreMappingStage``, which contain the best CME for each node on each of its valid core allocations, and uses them to initialize the genetic algorithm.

IntraCoreMappingStage
---------------------

Stage that finds the optimal ZigZag CME for each valid node-core allocation. The results are saved to a dictionary which is passed to the subsequent stages. The ``loop_ranges`` attribute of each CN determines the unique nodes to be evaluated. If two nodes differ in ``loop_ranges`` in a dimension that is relevant for the constant operands of the node, e.g. the ``K`` loop in a traditional convolutional layer, the node is assigned to a different group which will be allocated separately in the ``InterCoreMappingStage``.

ONNXModelParserStage
--------------------

Stage that parses the input workload residing in ``workload_path``. The "workload" dict is converted to a NetworkX graph.

Besides these stages, the implemented stages from the ZigZag framework can be used as well.

Creating your custom stage
==========================

Suppose you are not interested in saving the CME with minimal energy but want to select on another metric provided by the CME, or you want to define a new temporal mapping generator stage. In either case you can create a custom stage. The easiest way is to copy an existing stage class definition and modify it according to your intended behaviour.


To guarantee correctness, the following aspects have to be taken into account when creating a custom stage:

* It must inherit from the abstract ``Stage`` class.
* It must create its ``substage`` by calling the first element of the list of callables, with the remaining list as the first argument and ``**kwargs`` as the second argument. These kwargs can be updated to change e.g. the accelerator, spatial mapping, temporal mapping, etc.
* It must iterate over the ``(CME, extra_info)`` tuples yielded by the ``substage.run()`` call in a for loop.
* If the stage is a reduction (e.g. the ``MinimalLatencyStage``), the processing happens inside the for loop that iterates over the returned ``(CME, extra_info)`` tuples, and its ``yield`` statement must be placed outside that loop.
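
The rules above can be illustrated with a minimal, self-contained sketch. It does not import Stream: the ``Stage`` base class below is a stand-in whose constructor signature is an assumption derived from the rules above, and ``DummyLeafStage``, ``MinimalMetricStage``, the ``latencies`` and ``metric`` kwargs, and the dict-based "CMEs" are all hypothetical names introduced only for illustration.

.. code-block:: python

    from abc import ABC, abstractmethod
    from typing import Any, Callable, Generator, List, Tuple


    class Stage(ABC):
        """Stand-in for the framework's abstract Stage class (assumed interface)."""

        def __init__(self, list_of_callables: List[Callable], **kwargs: Any):
            self.list_of_callables = list_of_callables
            self.kwargs = kwargs

        @abstractmethod
        def run(self) -> Generator[Tuple[Any, Any], None, None]:
            ...


    class DummyLeafStage(Stage):
        """Hypothetical leaf stage that yields fake (CME, extra_info) tuples."""

        def run(self):
            for latency in self.kwargs["latencies"]:
                yield {"latency": latency}, None


    class MinimalMetricStage(Stage):
        """Hypothetical reduction stage: keeps the CME with the smallest metric value."""

        def run(self):
            # The first callable becomes the substage; it receives the
            # remaining callables and the (possibly updated) kwargs.
            substage = self.list_of_callables[0](self.list_of_callables[1:], **self.kwargs)
            metric = self.kwargs.get("metric", "latency")
            best_cme = None
            # Processing happens inside the loop over the substage results.
            for cme, extra_info in substage.run():
                if best_cme is None or cme[metric] < best_cme[metric]:
                    best_cme = cme
            # Reduction: a single yield, placed outside the for loop.
            yield best_cme, None


    stage = MinimalMetricStage([DummyLeafStage], latencies=[30, 10, 20])
    results = list(stage.run())

Here ``results`` contains a single ``(CME, extra_info)`` tuple, because the reduction yields only once after the loop has finished. A stage that is not a reduction would instead ``yield`` inside the for loop, once per tuple it receives.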