
Multiple launches, herd with single core #627

Open
hunhoffe opened this issue Jun 27, 2024 · 9 comments
Labels
question Further information is requested

Comments

@hunhoffe
Contributor

I'm working on an example where I have a 2D matrix of input data, which I break into four data tiles, and then I attempt to process one data tile per compute tile in a variety of ways using AIR constructs. I am sanity-checking my programs by making each compute core add a unique tile_num to each value in the data tile it modifies, so I can reassure myself that the compute tile I think is doing the work is actually the compute tile doing the work.
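The sanity-check scheme described above can be modelled host-side in plain Python (the shapes and grid here are illustrative, not taken from the actual example):

```python
# Hypothetical host-side model of the sanity check: a 2D matrix is split
# into a 2x2 grid of tiles, and each "compute tile" adds a unique
# tile_num to every value in its tile, so the output reveals which tile
# touched which region.
def tile_and_mark(matrix, grid_rows=2, grid_cols=2):
    rows, cols = len(matrix), len(matrix[0])
    th, tw = rows // grid_rows, cols // grid_cols
    out = [row[:] for row in matrix]
    for i in range(grid_rows):
        for j in range(grid_cols):
            tile_num = i * grid_cols + j  # unique per compute tile
            for r in range(i * th, (i + 1) * th):
                for c in range(j * tw, (j + 1) * tw):
                    out[r][c] += tile_num
    return out

# Starting from an all-zero matrix, each quadrant of the result should
# hold exactly its tile_num.
marked = tile_and_mark([[0] * 4 for _ in range(4)])
```

This is only a reference model for checking hardware results against, not AIR code.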

I am trying to compose an example of this scenario that uses four launches, where the herd size is 1x1. My first attempt is here; it has a while loop within the herd because I understand the kernel will be persistent across launches.

Even with that persistence, I'd like to somehow parameterize the herd with the launch indices so I can calculate a unique tile_num per launch. Is this something that is possible to do? If not, how do I reassure myself that one data tile is being processed per launch?

@hunhoffe hunhoffe added the question Further information is requested label Jun 27, 2024
@erwei-xilinx
Collaborator

We do not have an example in AIR today that parametrises a herd using launch induction variables, or any runtime scalar parameter passed in from the host. For now, launch induction variables only parametrise the SHIM DMA BDs for streaming the correct data into the herd.

Would be really cool to have an example in AIR which can lower any runtime parameters into the herd as MLIR-AIE RTPs.

@hunhoffe
Contributor Author

hunhoffe commented Jul 1, 2024

I'm trying my hand at this to see if I can figure it out.

My example is here: https://github.com/Xilinx/mlir-air/blob/1bbc92a4b1dcaa66cbab50dc494af53bec1d8472/programming_examples/matrix_scalar_add/multi_launch_channel/multi_launch_channel.py

The initial AIR MLIR looks somewhat reasonable (to my novice eye):

#map = affine_map<()[s0] -> (s0 * 16)>
#map1 = affine_map<()[s0, s1] -> (s0 + s1)>
module {
  air.channel @ChanIn []
  air.channel @ChanOut []
  func.func @copy(%arg0: memref<32x16xi32>, %arg1: memref<32x16xi32>) {
    %c2 = arith.constant 2 : index
    %c2_0 = arith.constant 2 : index
    air.launch (%arg2, %arg3) in (%arg4=%c2, %arg5=%c2_0) args(%arg6=%arg0, %arg7=%arg1) : memref<32x16xi32>, memref<32x16xi32> {
      %0 = affine.apply #map()[%arg2]
      %1 = affine.apply #map()[%arg3]
      %2 = affine.apply #map1()[%0, %arg3]
      %3 = arith.index_cast %2 : index to i32
      %c8 = arith.constant 8 : index
      %c16 = arith.constant 16 : index
      %c32 = arith.constant 32 : index
      %c1 = arith.constant 1 : index
      air.channel.put  @ChanIn[] (%arg6[%0, %1] [%c8, %c16] [%c32, %c1]) : (memref<32x16xi32>)
      %c8_1 = arith.constant 8 : index
      %c16_2 = arith.constant 16 : index
      %c32_3 = arith.constant 32 : index
      %c1_4 = arith.constant 1 : index
      air.channel.get  @ChanOut[] (%arg7[%0, %1] [%c8_1, %c16_2] [%c32_3, %c1_4]) : (memref<32x16xi32>)
      air.segment @seg  args(%arg8=%3) : i32 {
        %c1_5 = arith.constant 1 : index
        %c1_6 = arith.constant 1 : index
        air.herd @xaddherd  tile (%arg9, %arg10) in (%arg11=%c1_5, %arg12=%c1_6) args(%arg13=%arg8) : i32 {
          %alloc = memref.alloc() : memref<16x8xi32, 2 : i32>
          %alloc_7 = memref.alloc() : memref<16x8xi32, 2 : i32>
          air.channel.get  @ChanIn[] (%alloc[] [] []) : (memref<16x8xi32, 2 : i32>)
          %c0 = arith.constant 0 : index
          %c8_8 = arith.constant 8 : index
          %c1_9 = arith.constant 1 : index
          scf.for %arg14 = %c0 to %c8_8 step %c1_9 {
            %c0_10 = arith.constant 0 : index
            %c16_11 = arith.constant 16 : index
            %c1_12 = arith.constant 1 : index
            scf.for %arg15 = %c0_10 to %c16_11 step %c1_12 {
              %4 = memref.load %alloc[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
              %5 = arith.addi %4, %arg13 : i32
              memref.store %5, %alloc_7[%arg15, %arg14] : memref<16x8xi32, 2 : i32>
            }
          }
          air.channel.put  @ChanOut[] (%alloc_7[] [] []) : (memref<16x8xi32, 2 : i32>)
          memref.dealloc %alloc : memref<16x8xi32, 2 : i32>
          memref.dealloc %alloc_7 : memref<16x8xi32, 2 : i32>
        }
      }
    }
    return
  }
}
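For reference, the affine maps at the top of the module can be evaluated in plain Python to see which tile_num each of the four launches computes (this just mirrors #map and #map1 above):

```python
# Mirror of the index arithmetic in the launch body of the IR above:
#   #map  = affine_map<()[s0] -> (s0 * 16)>
#   #map1 = affine_map<()[s0, s1] -> (s0 + s1)>
def launch_params(i, j):
    row_offset = i * 16        # %0 = affine.apply #map()[%arg2]
    col_offset = j * 16        # %1 = affine.apply #map()[%arg3]
    tile_num = row_offset + j  # %2 = affine.apply #map1()[%0, %arg3]
    return row_offset, col_offset, tile_num

# One entry per launch index in the 2x2 launch grid.
nums = {(i, j): launch_params(i, j)[2] for i in range(2) for j in range(2)}
```

Evaluated this way, the four launches get distinct tile_num values, so the arithmetic itself looks consistent with the intent of a unique tile_num per launch.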

However, my current compilation error is:

mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel$ make clean && make
rm -rf build __pycache__
mkdir -p build
cd build &&  python3 mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py
Traceback (most recent call last):
  File "mlir-air/programming_examples/matrix_scalar_add/multi_launch_channel/run.py", line 35, in <module>
    test_main(build_module, verbose=args.verbose)
  File "mlir-air/programming_examples/matrix_scalar_add/common.py", line 54, in test_main
    addone = backend.compile_and_load(mlir_module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 166, in compile_and_load
    c = self.compile(module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 89, in compile
    aircc.run(air_module, aircc_options)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 414, in run
    run_passes(
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 112, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module.operation)
air._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: "-":21:9: branch has 0 operands for successor #0, but target block has 1
 note: "-":21:9: see current operation: "cf.br"()[^bb2] : () -> ()
make: *** [Makefile:7: run] Error 1

I haven't pinned down exactly at what stage things seem to go wrong, but I thought I'd record my progress here!

@hunhoffe
Contributor Author

hunhoffe commented Jul 1, 2024

@erwei-xilinx Right now I'm trying to percolate the value from the launch through the segment to the herd using parameters.

An alternative option I could think of would be to write the value to memory somewhere in the launch and use a channel to percolate it through to the herd.

Do you have any insight as to which strategy seems most reasonable?

@erwei-xilinx
Collaborator

Here's one example from mlir-aie which passes runtime parameters from host to the aie design: https://github.com/Xilinx/mlir-aie/blob/a764c8b7a6f944a5a892491ed781c81f618f1437/programming_examples/ml/conv2d_fused_relu/aie2.py#L232

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

@fifield
Collaborator

fifield commented Jul 2, 2024

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

Maybe you missed it, but there is work in progress here: #585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

@erwei-xilinx
Collaborator

Currently AIR doesn't have any compilation pass that can lower to that op, but I would imagine that's the way to pass the parameter from host into the design.

Maybe you missed it, but there is work in progress here: #585, but it is a side project so isn't advancing very quickly. It still requires some cleanup of how the air->airrt->npu lowering works. i.e. we need to spend time uncutting some corners.

Oh yeah indeed I missed that PR. It would be great if we could use that feature to generate runtime parametrisable designs.

The air->airrt flow currently has three implementations to deal with aie1, aie2, and aie2 with objectFifo, respectively. Would be super useful to unify them.

@hunhoffe
Contributor Author

hunhoffe commented Jul 2, 2024

I tried unrolling the launches (so I'm manually creating four 1x1 launches in a Python for-loop). However, I get a segfault when I attempt to compile. I was wondering if anyone has a few minutes to check the new code (it is here) to see if it is a reasonable way to use the AIR abstractions (at which point I'll try to see how far I get in debugging), or if I should redesign it?
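The unrolling structure can be sketched in plain Python; this only shows the per-launch bookkeeping, and the launch/offset names here are hypothetical stand-ins rather than actual AIR builder calls:

```python
# Sketch of manually unrolling a 2x2 launch grid into four 1x1 launches.
# Because unrolling happens at metaprogram time, each launch's offsets
# and tile_num are baked in as constants, so no runtime parameter ever
# needs to reach the herd.
TILE_ROWS, TILE_COLS = 16, 8  # assumed 16x8 data tiles of a 32x16 matrix

launches = []
for i in range(2):
    for j in range(2):
        launches.append({
            "row_offset": i * TILE_ROWS,
            "col_offset": j * TILE_COLS,
            "tile_num": i * 2 + j,  # constant, unique per launch
        })
# Each dict would then drive one 1x1 launch/segment/herd emission.
```

The trade-off of this approach is code size: the design is replicated once per launch instead of being parameterized at runtime.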

Something that I'm also a bit unclear on is how the original arguments to the module are sent/received when there are multiple launches. If someone could shed some light on that process, I'd appreciate it!

@erwei-xilinx
Collaborator

Yeah, lowering a design with multiple air.launches is not yet well exercised.

@hunhoffe
Contributor Author

Ah, so this issue wasn't actually fixed. I fixed the multi-launch example so it fails correctly in this PR here: #673

@hunhoffe hunhoffe reopened this Jul 30, 2024