Some questions about the number of Compute tiles? #1964

Open
ngdymx opened this issue Dec 10, 2024 · 8 comments

@ngdymx

ngdymx commented Dec 10, 2024

Hi team,

I read the tutorial and understood that the device present in HawkPoint (e.g., 8040HS) SoCs has 5 columns and 6 rows, as shown below:

image

However, I cannot control the compute tiles (CTs) in the leftmost column. I want to confirm whether that column is inaccessible. Is there any way to access and control it? Thanks a lot!

Additionally, the following is noted about device partitions:

image

@jgmelber
Collaborator

Hi! If you use the device type npu1 you should be able to access the whole 5 column device. Remember that you will need to start your shimDMAs at column 1 in that case as there is not a shimTile in the leftmost column.
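
To make the layout in this reply concrete, here is a rough sketch in plain Python (it does not import the aie package). The 5x6 size comes from the question and the missing shim tile in column 0 comes from the reply above. The row roles (row 0 shim, row 1 memory tile, rows 2 to 5 compute) are an assumption, and `tile_kind` and `shim_dma_columns` are hypothetical names used only for illustration.

```python
# Plain-Python model of the tile grid discussed above; not the aie API.
NUM_COLS = 5  # from the question: 5 columns
NUM_ROWS = 6  # from the question: 6 rows


def tile_kind(col: int, row: int) -> str:
    """Classify a tile coordinate.

    Assumption: row 0 holds shim tiles, row 1 memory tiles,
    and rows 2-5 compute tiles.
    """
    if not (0 <= col < NUM_COLS and 0 <= row < NUM_ROWS):
        raise ValueError(f"tile ({col}, {row}) is outside the array")
    if row == 0:
        # Per the reply above, the leftmost column has no shim tile
        # that can host a shim DMA.
        return "none" if col == 0 else "shim"
    if row == 1:
        return "mem"
    return "compute"


def shim_dma_columns() -> list:
    """Columns in which a shim DMA can be placed under this model."""
    return [c for c in range(NUM_COLS) if tile_kind(c, 0) == "shim"]


if __name__ == "__main__":
    print(shim_dma_columns())
    print(tile_kind(0, 2))
```

Under this model the compute tiles in column 0 are still addressable, but their data has to enter and leave the array through a shim tile in column 1 or later.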

@ngdymx
Author

ngdymx commented Dec 10, 2024

Hi @jgmelber,

Thanks.

I tried it, but got the error shown in the following figure.

image

Could you help me with it? Is there any example I can follow? Thanks a lot!

image

@jgmelber
Collaborator

@ngdymx can you share the entire design file, or at least the @sequence?

@ngdymx
Author

ngdymx commented Dec 11, 2024

Hi @jgmelber,

Sorry for the late reply. I wrote a passthrough kernel to test it. It works well when I set dev to AIEDevice.npu1_1col. When I change dev to AIEDevice.npu1, I get the error I reported yesterday. I need to rewrite the sequence function but don't know how to do it. Could you help me with it? Thanks!

Case 1:

dev = AIEDevice.npu1_1col
# *************************
ShimTile = tile(0, 0)
ComputeTile = tile(0, 2)

# To/from AIE-array data movement
@runtime_sequence(tensor_ty, tensor_ty)
def sequence(A, C):
  npu_dma_memcpy_nd(metadata=of_in, bd_id=1, mem=A, sizes=[1, 1, 1, N], issue_token=True)
  npu_dma_memcpy_nd(metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N], issue_token=True)
  dma_wait(of_in, of_out)

The following is the result:
image

Case 2:
Then I changed only dev to AIEDevice.npu1 and updated the corresponding tiles.

dev = AIEDevice.npu1
# *************************
# Tile declarations
ShimTile = tile(1, 0)
ComputeTile = tile(1, 2)

The error I get:
image
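
The only differences between Case 1 and Case 2 are the device and the column index of the two tiles. One way to keep them in sync is a small lookup, sketched here in plain Python (it does not import the aie package). `FIRST_COLUMN` and `passthrough_tiles` are hypothetical names; the device names and coordinates are the ones from the two cases above. This only captures the coordinate change and is not a fix for the reported error.

```python
# Hypothetical helper: plain Python, independent of the aie package.
# Device names and tile coordinates are taken from Case 1 and Case 2 above.
FIRST_COLUMN = {
    "npu1_1col": 0,  # Case 1: tiles are placed in column 0
    "npu1": 1,       # Case 2: tiles are placed in column 1
}

SHIM_ROW = 0     # row used for ShimTile in both cases
COMPUTE_ROW = 2  # row used for ComputeTile in both cases


def passthrough_tiles(device_name: str) -> dict:
    """Return the (col, row) coordinates of the shim and compute tiles."""
    try:
        col = FIRST_COLUMN[device_name]
    except KeyError:
        raise ValueError(f"unknown device: {device_name!r}") from None
    return {"shim": (col, SHIM_ROW), "compute": (col, COMPUTE_ROW)}


if __name__ == "__main__":
    for name in FIRST_COLUMN:
        print(name, passthrough_tiles(name))
```

In the design itself, the returned pairs would be passed to tile(col, row) in place of the hard-coded coordinates.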

Here is the project:
Kernel code:

#include <aie_api/aie.hpp>

template <typename T_in, typename T_out>
void passthrough_aie(const T_in *__restrict in0, T_out *__restrict out, const int N) {
  for (int i = 0; i < N; i++) {
    out[i] = in0[i];
  }
}

extern "C" {

void passThrough(const int32_t *__restrict in0, int32_t *__restrict out, const int N) {
  passthrough_aie<int32_t, int32_t>(in0, out, N);
}

} // extern "C"

Then the working design (the Python script that generates the MLIR):

import numpy as np

from aie.dialects.aie import *
from aie.dialects.aiex import *
from aie.helpers.dialects.ext.scf import _for as range_
from aie.extras.context import mlir_mod_ctx

import sys

N = 64
dev = AIEDevice.npu1_1col

def pass_loops():
    with mlir_mod_ctx() as ctx:

        @device(dev)
        def device_body():
            tensor_ty = np.ndarray[(N,), np.dtype[np.int32]]

            # Tile declarations
            ShimTile = tile(0, 0)
            ComputeTile = tile(0, 2)

            # AIE-array data movement with object fifos
            of_in = object_fifo("in", ShimTile, ComputeTile, 2, tensor_ty)
            of_out = object_fifo("out", ComputeTile, ShimTile, 2, tensor_ty)

            # AIE Core Function declarations
            passthrough = external_func(
                "passThrough", inputs=[tensor_ty, tensor_ty, np.int32]
            )

            # Set up compute tiles
            @core(ComputeTile, "passThrough.o")
            def core_body():
                for _ in range_(sys.maxsize):
                    elemIn = of_in.acquire(ObjectFifoPort.Consume, 1)
                    elemOut = of_out.acquire(ObjectFifoPort.Produce, 1)
                    passthrough(elemIn, elemOut, N)
                    of_out.release(ObjectFifoPort.Produce, 1)
                    of_in.release(ObjectFifoPort.Consume, 1)

            # To/from AIE-array data movement
            @runtime_sequence(tensor_ty, tensor_ty)
            def sequence(A, C):
                npu_dma_memcpy_nd(
                    metadata=of_in, bd_id=1, mem=A, sizes=[1, 1, 1, N], issue_token=True
                )
                npu_dma_memcpy_nd(
                    metadata=of_out, bd_id=0, mem=C, sizes=[1, 1, 1, N], issue_token=True
                )
                dma_wait(of_in, of_out)

    print(ctx.module)


pass_loops()

The host code:

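#include <boost/program_options.hpp>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

#include "test_utils.h"
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"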

#ifndef DATATYPES_USING_DEFINED
#define DATATYPES_USING_DEFINED
// ------------------------------------------------------
// Configure this to match your buffer data type
// ------------------------------------------------------
using DATATYPE = std::int32_t;
#endif

#define PASSTHROUGH_SIZE 64
namespace po = boost::program_options;

int main(int argc, const char *argv[]) {

  // Program arguments parsing
  po::options_description desc("Allowed options");
  po::variables_map vm;
  test_utils::add_default_options(desc);

  test_utils::parse_options(argc, argv, desc, vm);
  int verbosity = vm["verbosity"].as<int>();
  int trace_size = vm["trace_sz"].as<int>();

  std::cout << std::endl << "Running...\n";

  // Load instruction sequence
  std::vector<uint32_t> instr_v =
      test_utils::load_instr_sequence(vm["instr"].as<std::string>());

  if (verbosity >= 1)
    std::cout << "Sequence instr count: " << instr_v.size() << "\n";

  // Start the XRT context and load the kernel
  xrt::device device;
  xrt::kernel kernel;

  test_utils::init_xrt_load_kernel(device, kernel, verbosity,
                                   vm["xclbin"].as<std::string>(),
                                   vm["kernel"].as<std::string>());

  // set up the buffer objects
  auto bo_instr = xrt::bo(device, instr_v.size() * sizeof(int),
                          XCL_BO_FLAGS_CACHEABLE, kernel.group_id(1));
  auto bo_inA = xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE),
                        XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
  auto bo_out =
      xrt::bo(device, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size,
              XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4));

  if (verbosity >= 1)
    std::cout << "Writing data into buffer objects.\n";

  // Copy instruction stream to xrt buffer object
  void *bufInstr = bo_instr.map<void *>();
  memcpy(bufInstr, instr_v.data(), instr_v.size() * sizeof(int));

  // Initialize buffer bo_inA
  DATATYPE *bufInA = bo_inA.map<DATATYPE *>();
  printf("Input:\n");
  for (int i = 0; i < PASSTHROUGH_SIZE; i++){
    bufInA[i] = i;
  }
  for (int i = 0; i < PASSTHROUGH_SIZE; i++){
    printf("%d", bufInA[i]);
  }

  // Zero out buffer bo_out
  DATATYPE *bufOut = bo_out.map<DATATYPE *>();
  memset(bufOut, 0, PASSTHROUGH_SIZE * sizeof(DATATYPE) + trace_size);

  // sync host to device memories
  bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_inA.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_out.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  printf("\n");
  // Execute the kernel and wait to finish
  std::cout << "Running Kernel.\n";
  unsigned int opcode = 3;
  auto run = kernel(opcode, bo_instr, instr_v.size(), bo_inA, bo_out);
  run.wait();

  // Sync device to host memories
  bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

  // Compare out to in
  int errors = 0;
  for (int i = 0; i < PASSTHROUGH_SIZE; i++) {
    if (bufOut[i] != bufInA[i])
      errors++;
  }

  printf("Out:\n");
  for (int i = 0; i < PASSTHROUGH_SIZE; i++) {
    printf("%d", bufOut[i]);
  }

  if (trace_size > 0) {
    test_utils::write_out_trace(((char *)bufOut) +
                                    (PASSTHROUGH_SIZE * sizeof(DATATYPE)),
                                trace_size, vm["trace_file"].as<std::string>());
  }

  // Print Pass/Fail result of our test
  if (!errors) {
    std::cout << std::endl << "PASS!" << std::endl << std::endl;
    return 0;
  } else {
    std::cout << std::endl
              << errors << " mismatches." << std::endl
              << std::endl;
    std::cout << std::endl << "fail." << std::endl << std::endl;
    return 1;
  }
}
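
For readers following the buffer arithmetic in the host code, here is a small sketch in plain Python, using only the standard library. It mirrors how the output buffer is sized (data first, trace bytes appended after it) and how mismatches are counted. `output_layout` and `count_mismatches` are hypothetical names for illustration; the constants come from the host code above.

```python
import struct

PASSTHROUGH_SIZE = 64                 # matches #define PASSTHROUGH_SIZE 64
ELEMENT_BYTES = struct.calcsize("<i") # 32-bit integer, like std::int32_t


def output_layout(trace_size: int) -> tuple:
    """Return (data_bytes, trace_offset, total_bytes) for the output buffer.

    The host code allocates data_bytes + trace_size bytes and writes the
    trace starting right after the data, so trace_offset == data_bytes.
    """
    data_bytes = PASSTHROUGH_SIZE * ELEMENT_BYTES
    return data_bytes, data_bytes, data_bytes + trace_size


def count_mismatches(expected, actual) -> int:
    """Count positions where the output differs from the input."""
    return sum(1 for e, a in zip(expected, actual) if e != a)


if __name__ == "__main__":
    print(output_layout(0))
    data = list(range(PASSTHROUGH_SIZE))
    print(count_mismatches(data, data))
```

With a trace size of 0, the output buffer is exactly as large as the input buffer, which is why the comparison loop in the host code only walks the first PASSTHROUGH_SIZE elements.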

@jgmelber
Collaborator

Thanks for the code, we are looking into it.

CC @AndraBisca @pvasireddy-amd

@ngdymx
Author

ngdymx commented Dec 11, 2024

Thanks!

@AndraBisca AndraBisca self-assigned this Dec 12, 2024
@pvasireddy-amd pvasireddy-amd self-assigned this Dec 12, 2024
@jgmelber jgmelber added the triaged This has been looked at and triaged label Dec 17, 2024
@AndraBisca
Collaborator

Hello! Just to follow up on this: after some investigating it turned out that there were two related issues here. One PR has been merged into the main branch, the other is well on its way but requires some last cleanups, which will happen after the holidays. Once that PR is in, your design should work!

@ngdymx
Author

ngdymx commented Dec 22, 2024

Hi,

Great, thank you!
Merry Christmas!
