
Multiple launches, herd with single core #627

Open
hunhoffe opened this issue Jun 27, 2024 · 9 comments
Labels
question Further information is requested

Comments

@hunhoffe
Contributor

I'm working on an example where I have a 2D matrix of input data, which I break into four data tiles, and then I attempt to process one data tile per compute tile in a variety of ways using AIR constructs. I am sanity-checking my programs by making each compute core add a unique tile_num to each value in the data tile it modifies, so I can reassure myself that the compute tile I think is doing the work is actually the compute tile doing the work.
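The sanity-check scheme described above can be modelled host-side in plain Python (the shapes and grid here are illustrative, not taken from the actual example):

```python
# Hypothetical host-side model of the sanity check: a 2D matrix is split
# into a 2x2 grid of tiles, and each "compute tile" adds a unique
# tile_num to every value in its tile, so the output reveals which tile
# touched which region.
def tile_and_mark(matrix, grid_rows=2, grid_cols=2):
    rows, cols = len(matrix), len(matrix[0])
    th, tw = rows // grid_rows, cols // grid_cols
    out = [row[:] for row in matrix]
    for i in range(grid_rows):
        for j in range(grid_cols):
            tile_num = i * grid_cols + j  # unique per compute tile
            for r in range(i * th, (i + 1) * th):
                for c in range(j * tw, (j + 1) * tw):
                    out[r][c] += tile_num
    return out

# Starting from an all-zero matrix, each quadrant of the result should
# hold exactly its tile_num.
marked = tile_and_mark([[0] * 4 for _ in range(4)])
```

This is only a reference model for checking hardware results against, not AIR code.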

I am trying to compose an example of this scenario that uses four launches, where the herd size is 1x1. My first attempt is here; it has a while loop within the herd because I understand the kernel will be persistent across launches.

Even with that persistence, I'd like to somehow parameterize the herd with the launch indices so I can calculate a unique tile_num per launch. Is this something that is possible to do? If not, how do I reassure myself that one data tile is being processed per launch?

@hunhoffe hunhoffe added the question Further information is requested label Jun 27, 2024
@erwei-xilinx
Collaborator

We do not have an example in AIR today that parametrises a herd using launch induction variables, or any runtime scalar parameter passed in from the host. For now, launch induction variables only parametrise the SHIM DMA BDs for streaming the correct data into the herd.

Would be really cool to have an example in AIR which can lower any runtime parameters into the herd as MLIR-AIE RTPs.

@hunhoffe
Contributor Author

hunhoffe commented Jul 1, 2024

I'm trying my hand at this to see if I can figure it out.

My example is here: https://github.com/Xilinx/mlir-air/blob/1bbc92a4b1dcaa66cbab50dc494af53bec1d8472/programming_examples/matrix_scalar_add/multi_launch_channel/multi_launch_channel.py

The initial AIR MLIR looks somewhat reasonable (to my novice eye):

#map = affine_map<()[s0] -> (s0 * 16)>
#map1 = affine_map<()[s0, s1] -> (s0 + s1)>
module {
  air.channel @ChanIn []
  air.channel @ChanOut []
  func.func @copy(%arg0: memref<32x16xi32>, %arg1: memref<32x16xi32>) {
    %c2 = arith.constant 2 : index
    %c2_0 = arith.constant 2 : index
    air.launch (%arg2, %arg3) in (%arg4=%c2, %arg5=%c2_0) args(%arg6=%arg0, %arg7=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      %0 = affine.apply #map()[%arg2]
      %1 = affine.apply #map()[%arg3]
      %2 = affine.apply #map1()[%0, %arg3]
      %3 = arith.index_cast %2 : index to i32
      %c8 = arith.constant 8 : index
      %c16 = arith.constant 16 : index
      %c32 = arith.constant 32 : index
      %c1 = arith.constant 1 : index
      air.channel.put  @ChanIn[] (%arg6[%0, %1] [%c8, %c16] [%c32, %c1]) : (memref<32x16xi32>)
      %c8_1 = arith.constant 8 : index
      %c16_2 = arith.constant 16 : index
      %c32_3 = arith.constant 32 : index
      %c1_4 = arith.constant 1 : index
      air.channel.get  @ChanOut[] (%arg7[%0, %1] [%c8_1, %c16_2] [%c32_3, %c1_4]) : (memref<32x16xi32>)
      air.segment @seg  args(%arg8=%3) : i32 {
        %c1_5 = arith.constant 1 : index
        %c1_6 = arith.constant 1 : index
        air.herd @xaddherd  tile (%arg9, %arg10) in (%arg11=%c1_5, %arg12=%c1_6) args(%arg13=%arg8) : i32 {
          %alloc = memref.alloc() : memref<16x8xi32, 2 : i32>
          %alloc_7 = memref.alloc() : memref<16x8xi32, 2 : i32>
          air.channel.get  @ChanIn[] (%alloc[] [] []) : (memref<16x8xi32, 2 : i32>)
          %c0 = arith.constant 0 : index
          %c8_8 = arith.constant 8 : index
          %c1_9 = arith.constant 1 : index
          scf.for %arg14 = %c0 to %c8_8 step %c1_9 {
            %c0_10 = arith.constant 0 : index
            %c16_11 = arith.constant 16 : index
            %c1_12 = arith.constant 1 : index
            scf.for %arg15 = %c0_10 to %c16_11 step %c1_12 {
              %4 = memref.load %alloc[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
              %5 = arith.addi %4, %arg13 : i32
              memref.store %5, %alloc_7[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
            }
          }
          air.channel.put  @ChanOut[] (%alloc_7[] [] []) : (memref<16x8xi32, 2 : i32>)
          memref.dealloc %alloc : memref<16x8xi32, 2 : i32>
          memref.dealloc %alloc_7 : memref<16x8xi32, 2 : i32>
        }
      }
    }
    return
  }
}
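For reference, the affine maps at the top of the module can be evaluated in plain Python to see which tile_num each of the four launches computes (this just mirrors #map and #map1 above):

```python
# Mirror of the index arithmetic in the launch body of the IR above:
#   #map  = affine_map<()[s0] -> (s0 * 16)>
#   #map1 = affine_map<()[s0, s1] -> (s0 + s1)>
def launch_params(i, j):
    row_offset = i * 16        # %0 = affine.apply #map()[%arg2]
    col_offset = j * 16        # %1 = affine.apply #map()[%arg3]
    tile_num = row_offset + j  # %2 = affine.apply #map1()[%0, %arg3]
    return row_offset, col_offset, tile_num

# One entry per launch index in the 2x2 launch grid.
nums = {(i, j): launch_params(i, j)[2] for i in range(2) for j in range(2)}
```

Evaluated this way, the four launches get distinct tile_num values, so the arithmetic itself looks consistent with the intent of a unique tile_num per launch.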

However, my current compilation error is:

mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel$ make clean && make
rm -rf build __pycache__
mkdir -p build
cd build &&  python3 mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py
Traceback (most recent call last):
  File "mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py", line 35, in <module>
    test_main(build_module, verbose=args.verbose)
  File "mlir-air/programming_examples/matrix_scalar_add/common.py", line 54, in test_main
    addone = backend.compile_and_load(mlir_module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 166, in compile_and_load
    c = self.compile(module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 89, in compile
    aircc.run(air_module, aircc_options)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 414, in run
    run_passes(
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 112, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module.operation)
air._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: "-":21:9: branch has 0 operands for successor #0, but target block has 1
 note: "-":21:9: see current operation: "cf.br"()[^bb2] : () -> ()
make: *** [Makefile:7: run] Error 1

I haven't pinned down exactly at what stage things seem to go wrong, but I thought I'd record my progress here!

@hunhoffe
Contributor Author

hunhoffe commented Jul 1, 2024

@erwei-xilinx Right now I'm trying to percolate the value from the launch through the segment to the herd using parameters.

An alternative option I could think of would be to write the value to memory somewhere in the launch and use a channel to percolate it through to the herd.

Do you have any insight as to which strategy seems most reasonable?

@erwei-xilinx
Collaborator

Here's one example from mlir-aie which passes runtime parameters from host to the aie design: https://github.com/Xilinx/mlir-aie/blob/a764c8b7a6f944a5a892491ed781c81f618f1437/programming_examples/ml/conv2d_fused_relu/aie2.py#L232

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

@fifield
Collaborator

fifield commented Jul 2, 2024

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

Maybe you missed it, but there is work in progress here: #585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

@erwei-xilinx
Collaborator

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

Maybe you missed it, but there is work in progress here: #585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

Oh yeah indeed I missed that PR. It would be great if we could use that feature to generate runtime parametrisable designs.

The air->airrt flow currently has three implementations to deal with aie1, aie2, and aie2 with objectFifo, respectively. Would be super useful to unify them.

@hunhoffe
Contributor Author

hunhoffe commented Jul 2, 2024

I tried unrolling the launches (so I'm manually creating four 1x1 launches in a Python for-loop). However, I get a segfault when I attempt to compile. I was wondering if anyone has a few minutes to check the new code (it is here) to see if it is a reasonable way to use the AIR abstractions (at which point I'll try to see how far I get in debugging), or if I should redesign it?
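The unrolling structure can be sketched in plain Python; this only shows the per-launch bookkeeping, and the launch/offset names here are hypothetical stand-ins rather than actual AIR builder calls:

```python
# Sketch of manually unrolling a 2x2 launch grid into four 1x1 launches.
# Because unrolling happens at metaprogram time, each launch's offsets
# and tile_num are baked in as constants, so no runtime parameter ever
# needs to reach the herd.
TILE_ROWS, TILE_COLS = 16, 8  # assumed 16x8 data tiles of a 32x16 matrix

launches = []
for i in range(2):
    for j in range(2):
        launches.append({
            "row_offset": i * TILE_ROWS,
            "col_offset": j * TILE_COLS,
            "tile_num": i * 2 + j,  # constant, unique per launch
        })
# Each dict would then drive one 1x1 launch/segment/herd emission.
```

The trade-off of this approach is code size: the design is replicated once per launch instead of being parameterized at runtime.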

Something that I'm also a bit unclear on is how the original arguments to the module are sent/received when there are multiple launches. If someone could shed some light on that process, I'd appreciate it!

@erwei-xilinx
Collaborator

Yeah, lowering a design with multiple air.launches is not yet well exercised.

@hunhoffe
Contributor Author

Ah, so this issue wasn't actually fixed. I fixed the multi-launch example so it fails correctly in this PR here: #673

@hunhoffe hunhoffe reopened this Jul 30, 2024