Some questions about the number of Compute tiles? #1964

ngdymx · 2024-12-10T16:39:20Z

Hi team,

I read the tutorial and understood that the NPU device present in Hawk Point (e.g., 8040HS) SoCs has 5 columns and 6 rows, as shown below:

However, I cannot control the compute tiles (CTs) in the leftmost column. I want to confirm whether that column is really inaccessible. Is there any way to access and control it? Thanks a lot!

Additionally, the following is noted about device partitions:

jgmelber · 2024-12-10T16:48:00Z

Hi! If you use the device type npu1, you should be able to access the whole 5-column device. Remember that in that case you will need to start your shimDMAs at column 1, as there is no shimTile in the leftmost column.
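
To make the layout in this reply concrete, here is a minimal pure-Python sketch of the tile grid. It is not the mlir-aie API: `tile_kind` and `shim_columns` are hypothetical helpers, the 5x6 size comes from the question, the missing shim tile in column 0 comes from the reply above, and the roles of rows 1 to 5 (memory tiles in row 1, compute tiles in rows 2 to 5) are assumptions about the npu1 layout.

```python
# Illustrative model of the npu1 tile grid (hypothetical helpers, not the mlir-aie API).
NUM_COLS = 5  # columns 0..4, as stated in the question
NUM_ROWS = 6  # rows 0..5, as stated in the question


def tile_kind(col: int, row: int) -> str:
    """Return the assumed kind of tile at (col, row), or "none" if no tile is usable."""
    if not (0 <= col < NUM_COLS and 0 <= row < NUM_ROWS):
        raise ValueError(f"tile({col}, {row}) is outside the {NUM_COLS}x{NUM_ROWS} grid")
    if row == 0:
        # Per the reply above: there is no shim tile in the leftmost column.
        return "none" if col == 0 else "shim"
    if row == 1:
        return "mem"  # assumption: row 1 holds memory tiles
    return "compute"  # assumption: rows 2..5 hold compute tiles


def shim_columns() -> list[int]:
    """Columns whose row-0 tile can host a shim DMA."""
    return [c for c in range(NUM_COLS) if tile_kind(c, 0) == "shim"]


print(shim_columns())  # [1, 2, 3, 4]
```

Under these assumptions, a compute tile in column 0 (for example `tile(0, 2)`) exists, but its data has to enter and leave the array through a shim tile in column 1 or later.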

ngdymx · 2024-12-10T17:01:11Z

Hi @jgmelber,

Thanks.

I tried it, but got the error shown in the following figure.

Could you help me with it? Is there any example I can follow? Thanks a lot!

jgmelber · 2024-12-10T23:07:32Z

@ngdymx can you share the entire design file, or at least the @sequence?

ngdymx · 2024-12-11T16:24:21Z

Hi @jgmelber,

Sorry for the late reply. I wrote a passthrough kernel to test it. It works well when I set dev to AIEDevice.npu1_1col. When I change dev to AIEDevice.npu1, I get the error I reported yesterday. I need to rewrite the sequence function but don't know how to do it. Could you help me with it? Thanks!

Case 1:

dev = AIEDevice.npu1_1col
/*************************/
ShimTile = tile(0, 0)
ComputeTile = tile(0, 2)

# To/from AIE-array data movement
@runtime_sequence(tensor_ty, tensor_ty)
def sequence(A, C):
  npu_dma_memcpy_nd(metadata=of_in, bd_id=1, mem=A, sizes=[1, 1, 1, N], issue_token=True)
  npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N], issue_token=True)
  dma_wait(of_in, of_out)

The following is the result:

Case 2:
Then I changed only dev to AIEDevice.npu1 and updated the corresponding tiles.

dev = AIEDevice.npu1
/*************************/
# Tile declarations
ShimTile = tile(1, 0)
ComputeTile = tile(1, 2)

The error I get:
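
The only difference between the two cases is the column of the shim and compute tiles. A minimal sketch of that mapping follows, assuming a hypothetical helper `tile_coords` and plain string device names rather than `AIEDevice` members. It captures only the coordinate change; per the later replies in this thread, the npu1 variant still failed until fixes landed in the repository.

```python
# Hypothetical helper; device names are plain strings here, not AIEDevice members.
TILE_COLUMN = {
    "npu1_1col": 0,  # Case 1: tile(0, 0) and tile(0, 2)
    "npu1": 1,       # Case 2: tile(1, 0) and tile(1, 2)
}


def tile_coords(device_name: str) -> dict[str, tuple[int, int]]:
    """Return (col, row) pairs for the shim and compute tiles of the passthrough design."""
    try:
        col = TILE_COLUMN[device_name]
    except KeyError:
        raise ValueError(f"unknown device: {device_name!r}") from None
    # Row 0 is the shim row and row 2 is the compute row used in both cases.
    return {"shim": (col, 0), "compute": (col, 2)}


for name in TILE_COLUMN:
    print(name, tile_coords(name))
```

In the design file, the returned pairs would be passed to `tile(col, row)` in place of the hard-coded coordinates.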

Here is the project:
Kernel code:

#include <aie_api/aie.hpp>

template <typename T_in, typename T_out>
void passthrough_aie(const T_in *__restrict in0, T_out *__restrict out, const int N) {
  for (int i = 0; i < N; i++) {
    out[i] = in0[i];
  }
}

extern "C" {

void passThrough(const int32_t *__restrict in0, int32_t *__restrict out, const int N) {
  passthrough_aie<int32_t, int32_t>(in0, out, N);
}

}

Then the working design code (the Python script that generates the MLIR):

import numpy as np

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.helpers.dialects.ext.scf import _for as range_
from aie.extras.context import mlir_mod_ctx

import sys

N = 64
dev = AIEDevice.npu1_1col

def pass_loops():
    with mlir_mod_ctx() as ctx:

        @device(dev)
        def device_body():
            tensor_ty = np.ndarray[(N,), np.dtype[np.int32]]

            # Tile declarations
            ShimTile = tile(0, 0)
            ComputeTile = tile(0, 2)

            # AIE-array data movement with object fifos
            of_in = object_fifo("in", ShimTile, ComputeTile, 2, tensor_ty)
            of_out = object_fifo("out", ComputeTile, ShimTile, 2, tensor_ty)

            # AIE Core Function declarations
            passthrough = external_func(
                "passThrough", inputs=[tensor_ty, tensor_ty, np.int32]
            )

            # Set up compute tiles
            @core(ComputeTile, "passThrough.o")
            def core_body():
                for _ in range_(sys.maxsize):
                    elemIn = of_in.acquire(ObjectFifoPort.Consume, 1)
                    elemOut = of_out.acquire(ObjectFifoPort.Produce, 1)
                    passthrough(elemIn, elemOut, N)
                    of_out.release(ObjectFifoPort.Produce, 1)
                    of_in.release(ObjectFifoPort.Consume, 1)

            # To/from AIE-array data movement
            @runtime_sequence(tensor_ty, tensor_ty)
            def sequence(A, C):
                npu_dma_memcpy_nd(
                    metadata=of_in, bd_id=1, mem=A, sizes=[1, 1, 1, N], issue_token=True
                )
                npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N], issue_token=True)
                dma_wait(of_in, of_out)

    print(ctx.module)


pass_loops()

The host code:

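#include <boost/program_options.hpp>
#include <cstdint>
#include <cstdio>  // printf
#include <cstring> // memcpy, memset
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>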

#include "test_utils.h"
#include "xrt/xrt_bo.h"

#ifndef DATATYPES_USING_DEFINED
#define DATATYPES_USING_DEFINED
// ------------------------------------------------------
// Configure this to match your buffer data type
// ------------------------------------------------------
using DATATYPE = std::int32_t;
#endif

#define PASSTHROUGH_SIZE 64
namespace po = boost::program_options;

int main(int argc, const char *argv[]) {

  // Program arguments parsing
  po::options_description desc("Allowed options");
  po::variables_map vm;
  test_utils::add_default_options(desc);

  test_utils::parse_options(argc, argv, desc, vm);
  int verbosity = vm["verbosity"].as<int>();
  int trace_size = vm["trace_sz"].as<int>();

  std::cout << std::endl << "Running...\n";

  // Load instruction sequence
  std::vector<uint32_t> instr_v =
      test_utils::load_instr_sequence(vm["instr"].as<std::string>());

  if (verbosity >= 1)
    std::cout << "Sequence instr count: " << instr_v.size() << "\n";

  // Start the XRT context and load the kernel
  xrt::device device;
  xrt::kernel kernel;

  test_utils::init_xrt_load_kernel(device, kernel, verbosity,
                                   vm["xclbin"].as<std::string>(),
                                   vm["kernel"].as<std::string>());

  // set up the buffer objects
  auto bo_instr = xrt::bo(device, instr_v.size() * sizeof(int),
                          XCL_BO_FLAGS_CACHEABLE, kernel.group_id(1));
  auto bo_inA = xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE),
                        XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
  auto bo_out =
      xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size,
              XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4));

  if (verbosity >= 1)
    std::cout << "Writing data into buffer objects.\n";

  // Copy instruction stream to xrt buffer object
  void *bufInstr = bo_instr.map<void *>();
  memcpy(bufInstr, instr_v.data(), instr_v.size() * sizeof(int));

  // Initialize buffer bo_inA
  DATATYPE *bufInA = bo_inA.map<DATATYPE *>();
  printf("Input:\n");
  for (int i = 0; i < PASSTHROUGH_SIZE; i++){
    bufInA[i] = i;
  }
  for (int i = 0; i < PASSTHROUGH_SIZE; i++){
    printf("%d ", bufInA[i]);
  }

  // Zero out buffer bo_out
  DATATYPE *bufOut = bo_out.map<DATATYPE *>();
  memset(bufOut, 0, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size);

  // sync host to device memories
  bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_inA.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_out.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  printf("\n");
  // Execute the kernel and wait to finish
  std::cout << "Running Kernel.\n";
  unsigned int opcode = 3;
  auto run = kernel(opcode, bo_instr, instr_v.size(), bo_inA, bo_out);
  run.wait();

  // Sync device to host memories
  bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

  // Compare out to in
  int errors = 0;
  for (int i = 0; i < PASSTHROUGH_SIZE; i++) {
    if (bufOut[i] != bufInA[i])
      errors++;
  }

  printf("Out:\n");
  for (int i = 0; i < PASSTHROUGH_SIZE; i++) {
    printf("%d ", bufOut[i]);
  }

  if (trace_size > 0) {
    test_utils::write_out_trace(((char *)bufOut) +
                                    (PASSTHROUGH_SIZE * sizeof(DATATYPE)),
                                trace_size, vm["trace_file"].as<std::string>());
  }

  // Print Pass/Fail result of our test
  if (!errors) {
    std::cout << std::endl << "PASS!" << std::endl << std::endl;
    return 0;
  } else {
    std::cout << std::endl
              << errors << " mismatches." << std::endl
              << std::endl;
    std::cout << std::endl << "fail." << std::endl << std::endl;
    return 1;
  }
}

jgmelber · 2024-12-11T17:42:18Z

Thanks for the code, we are looking into it.

CC @AndraBisca @pvasireddy-amd

ngdymx · 2024-12-11T17:43:31Z

Thanks!

AndraBisca · 2024-12-22T12:36:05Z

Hello! Just to follow up on this: after some investigating it turned out that there were two related issues here. One PR has been merged into the main branch, the other is well on its way but requires some last cleanups, which will happen after the holidays. Once that PR is in, your design should work!

ngdymx · 2024-12-22T16:20:55Z

Hi,

Great, thank you!
Merry Christmas!

AndraBisca self-assigned this Dec 12, 2024

pvasireddy-amd self-assigned this Dec 12, 2024

pvasireddy-amd mentioned this issue Dec 13, 2024

Remove (X, Y) coordinates from NpuDmaMemcpyNdOp #1971

Draft

jgmelber added the "triaged" label (This has been looked at and triaged) Dec 17, 2024
