Skip to content

Exp callback 0.12.0 #87

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
265 changes: 265 additions & 0 deletions sdk/include/pti/pti_callback.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
//==============================================================

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, what is described in the header file seems fine. I have a few questions though:

  • How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
  • There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
  • Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
  • There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).

A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmellorcrummey, thank you very much for your review! we will work to address your comments in the next couple of days and push the update.
Let me start from the end:

A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?

Today's approach that PTI initializes lazily on the first call to it. So far it is working.. But we have not tested on OpenMP programs yet. We will look into it.
if you think that lazy initialization would not work - please, describe why.

There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).

Make sense. Will discuss in the team and, probably, will rename this one.

Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?

Yes. So far we have exactly this synchronous implementation in mind. (even removed thread_id from all Callback structures)

There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?

Such callback make sense. Let us look into it.

How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?

We have not put yet a lot of thoughts into how HW counters would "cooperate" with Callback APIs. The thoughts were that we will make the first experimental version of Callback and on the way - will think about how make them work together. Does it make sense?

Do we understand correctly that your primary interest GPU EU Stall sampling and not time-based streaming counters?

Copy link

@jmellorcrummey jmellorcrummey Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today's approach that PTI initializes lazily on the first call to it. So far it is working.. But we have not tested on OpenMP programs yet. We will look into it. if you think that lazy initialization would not work - please, describe why.

I am not concerned about initialization of the PTI-GPU monitoring interface itself. I am concerned about initialization of a tool using PTI-GPU. When launching an application with HPCToolkit, before we collect ANY performance data, we initialize HPCToolkit's measurement subsystem for all of the desired data collection.

In working with AMD, we helped them design two APIs to support initialization of client tools. We would like essentially the same capabilities in PTI-GPU.

The implementation using such primitives is a bit more subtle than it appears at first glance. Basically, the initialization for HPCToolkit and PTI-GPU need to rendezvous and both initialize at the same time.

Do we understand correctly that your primary interest GPU EU Stall sampling and not time-based streaming counters?

Our current interests are threefold

  • we want to collect information about GPU operations using the PTI-View activity trace
  • we want to collect GPU EU stall samples
  • we want to collect hardware counter metrics for individual kernels (not using the streaming sampling)

In the future, we expect that we would like to collect time-based streaming metrics as well if we are not collecting HW counter metrics for individual kernels.

// Copyright (C) Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#ifndef INCLUDE_PTI_CALLBACK_H_
#define INCLUDE_PTI_CALLBACK_H_

#include <stddef.h>
#include <stdint.h>

#include "pti/pti.h"
#include "pti/pti_driver_levelzero_api_ids.h"
#include "pti/pti_export.h"
#include "pti/pti_view.h"

/**
* This file contains the draft definitions of the callback API.
* It is subject of discussion and changes.
* As soon as interested parties agreed on the basic terms -
* the plan is to do POC implementation of the API.
*
* The tool that using PTI will be able to subscribe to the driver and PTI events.
* Driver or PTI functions other than
*
* const char* ptiResultTypeToString(pti_result result_value);
* pti_result ptiViewGetApiIdName(pti_api_group_id type, uint32_t unique_id, const char** name);
* pti_result ptiCallbackGetExternalCorrelationIds(...)
* < other to be indicated there >
*
* must not be called from the callbacks.
*
* All data passed into Callback functions are valid only during the callback.
* No assumptions should be made if there data would be valid after the callback.
*/

/* clang-format off */
#if defined(__cplusplus)
extern "C" {
#endif

/**
* Callback domains (that are not INTERNAL) inform user about important events about
* GPU context creating, GPU operation launch, GPU host synchronization.
* These events delivered through the callbacks of a number of driver APIs.
*
* Each domain corresponds to a group of APIs and will be extended if new APIs are added
* to the driver. This way - it will be transparent for a user and
* do not require to setup callback to individual APIs.
*
* However, if there is a need, a user can subscribe to individual APIs if needed,
* via ptiCallbackEnableDriverAPI function.
*
* The subscriptions via Domain API and via individual API are treated as different subscriptions and
* so a user might have two callbacks for the same API - one for the domain and one for the individual API.
* The callback will be called for the domain and for the individual API.
*/
typedef enum _pti_callback_domain {
PTI_DOMAIN_UNNAMED = 0, // Callback created for specific API ID
PTI_DOMAIN_DRIVER_CONTEXT_CREATE = 1,
PTI_DOMAIN_DRIVER_GPU_OPERATION_LAUNCH = 2,
PTI_DOMAIN_DRIVER_HOST_SYNCHRONIZATION = 3,
// below domains to inform user about PTI internal events
PTI_DOMAIN_INTERNAL_THREADS = 20,
PTI_DOMAIN_INTERNAL_EVENT = 21,

PTI_DOMAIN_MAX = 0x7fffffff
} pti_callback_domain;

typedef enum _pti_callback_site {
PTI_SITE_INVALID = 0,
PTI_SITE_DRIVER_API_ENTER = 1,
PTI_SITE_DRIVER_API_EXIT = 2,
PTI_SITE_INTERNAL_THREAD_START = 3,
PTI_SITE_INTERNAL_THREAD_END = 4,
PTI_SITE_INTERNAL_EVENT = 5,

PTI_SITE_MAX = 0x7fffffff
} pti_callback_site;

typedef enum _pti_backend_submit_type {
PTI_BACKEND_SUBMIT_TYPE_INVALID = 0,
PTI_BACKEND_SUBMIT_TYPE_IMMEDIATE = 1,
PTI_BACKEND_SUBMIT_TYPE_QUEUE = 2,

PTI_BACKEND_SUBMIT_TYPE_MAX = 0x7fffffff
} pti_backend_submit_type;

/**
* INTERNAL DOMAINs Callback inform user about PTI internal operations and could be used to
* characterize internal overhead of the collection or data processing, or some non-standard situations.
*
* Most of PTI work happens in the application threads in the callbacks to runtime and driver functions.
* But in some cases PTI creates its own threads to do some work:
* as of today - to deliver data to the user and to collect GPU data hadware metrics.
*/
typedef enum _pti_thread_purpose {
PTI_THREAD_PURPOSE_INVALID = 0,
PTI_THREAD_PURPOSE_SERVICE = 1,
PTI_THREAD_PURPOSE_DATA_DELIVERY = 2,
PTI_THREAD_PURPOSE_DATA_COLLECTION = 3, // for example, thread that collects GPU HW metrics

PTI_THREAD_PURPOSE_MAX = 0x7fffffff
} pti_thread_purpose;

/**
* A user can subscribe to notifications about non-standard situation from PTI
* when it collects or processes the data
*/
typedef enum _pti_internal_event_type {
PTI_INTERNAL_EVENT_TYPE_INFO = 0,
PTI_INTERNAL_EVENT_TYPE_WARNING = 1, // one or few records data inconsistences, or other

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use some info about how it would be used.

Knowing that something is inconsistent but not what doesn't seem very useful.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As PTI depends on many lower level components.. there could be unexpected situations - due to either bugs (in PTI and others) or some missing "synchronizations" between components.

This might result in two major issues

  1. PTI crashes application
  2. PTI delivers unreliable data
    We have to absolutely avoid 1). As well as much as possible avoid 2) - if PTI knows that something went wrong and it effects validity of data - it can report about it.
    We don't want user to look into unreliable data but we better mark them.

Underneath PTI traces a lot of "paths" - some paths might be better tested/ simpler/ more frequently used than the others.
So when data on "frequent" and "reliable" paths are OK, some other (not tested well) might experience issues.

But we don't want to throw all collected data - but rather mark those that PTI knows that are invalid and also provide context where they were generated.
So it will be WARNING. And message will have details.

The recent example: in some situations expected events for memory operations are not signaled - so their Timestamps are not available. I think that in this situation - we can use this mechanism and notify a user (a tool). Of course, if it wants (subscribes).

Or it could be that something went completely out of control - say, PTI is starting GPU Sampling data collection, trying to open a temp file but there is no permission or other issue...
That could be CRITICAL. PTI shuts down its all operations. Each PTI call would return error (hopefully descriptive enough). But CRITICAL event will have a message with more details as well some info on the context.

Does it make any sense?

// collection is safe to continue
PTI_INTERNAL_EVENT_TYPE_CRITICAL = 2, // critical error after which further collected data are invalid

PTI_INTERNAL_EVENT_TYPE_MAX = 0x7fffffff
} pti_internal_event_type;

typedef enum _pti_operation_kind {
PTI_OPERATION_KIND_INVALID = 0,
PTI_OPERATION_KIND_KERNEL = 1,
PTI_OPERATION_KIND_MEMCPY = 2,
PTI_OPERATION_KIND_MEM_FILL = 3,

PTI_OPERATION_KIND_MAX = 0x7fffffff
} pti_operation_kind;

typedef struct _pti_gpu_op_launch_detail {
pti_operation_kind _operation_kind; // kind of the operation: kernel, mem op
uint32_t _correlation_id; // correlation IDs of operation append
const char* _kernel_name; // type of the memcpy operation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field is _kernel_name and the comment says type of memcpy operation. These seem inconsistent.

Rather than a kernel name, I would prefer something like a kernel handle that can be more readily correlated with machine code.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning L0 kernel handle would be OK.
Would the tool then try to get the kernel binary?
I will look how to do it in a safe way.

The comment // type of memcpy.. is definitely a typo - thanks for spotting it!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want a callback when the binary was loaded and unloaded.

At the time a kernel is launched, I only want to look up the kernel handle and associate the launch with the kernel in the binary, turning it into a (load module id, offset) pair. Post-mortem, HPCToolkit uses a recorded copy of the binary to interpret all of the measurement data.

} pti_gpu_op_launch_detail;

typedef struct _pti_apicall_callback_data {
pti_callback_domain _domain; // domain of the callback
pti_backend_submit_type _submit_type; // Immediate command list, Command Queue execute,..
pti_callback_site _site; // ENTER or EXIT

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the name phase rather than site. I realize that NVIDIA uses site. AMD uses phase. This is more descriptive since unless the enter/exit callbacks occur in different calling contexts in user code.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense - will use the name phase.

Question - do you suggest to keep both ENTER/EXIT phase-s ?
Asking because some internal feedback was to simplify the APIs setting callbacks..E.g.

pti_result PTI_EXPORT
ptiCallbackEnableDomain(pti_callback_subscriber subscriber,
                        pti_callback_domain  domain,
                        uint32_t enter_cb,
                        uint32_t exit_cb);

avoiding passing phases enter_cb and exit_cb ..
and probably having by default only one Enter phase enabled..
and having may be separate API for each phase..
?

uint32_t _return_code; // will be valid only for L0 API EXIT, for others will be nullptr
uint32_t _correlation_id; // ID that corresponds to the same call reported by View API records
uint32_t _operation_count; // number of operations submitted to the GPU,
// non-zero only for DOMAIN_DRIVER_GPU_OPERATION_LAUNCH,
// or API related to GPU operation submission
pti_gpu_op_launch_detail* _operation_details; // not Null only for DOMAIN_DRIVER_GPU_OPERATION_LAUNCH,
// or API related to GPU operation submission.
} pti_apicall_callback_data;

typedef struct _pti_internal_callback_data {
pti_callback_domain _domain; // domain of the callback
pti_callback_site _site; // THREAD START/END or INTERNAL EVENT

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer phase rather than site, as described in prior comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will make phase.

uint32_t _detail; // depending on the domain should be casted/interpreted
// as a purpose of an internal PTI thread or
// pti_internal_event_type
const char* _message; // explains details
} pti_internal_callback_data;

typedef void (*pti_callback_function)(
pti_callback_domain domain,
pti_api_group_id driver_api_group_id, // driver API group ID, keep it to distinguish between L0 and OpenCL
// although the current implementation is only for L0
uint32_t driver_api_id,
pti_backend_ctx_t backend_context, // Driver (L0) level context handle
void* cb_data, // depending on the domain it should be type-casted to the pointer
// to either pti_apicall_callback_data or pti_internal_callback_data
void* user_data);

typedef void* pti_callback_subscriber; // Subscriber handle, current implementation supports only one subscriber
// per process, but it is not a limitation of the API

/**
* @brief Initiate Callback subscriber
*
* @param subscriber - subscriber handle
* @param callback - pointer to the callback function
* @param user_data - user data to be passed to the callback function
* @return pti_result
*/
pti_result PTI_EXPORT
ptiCallbackSubscribe(pti_callback_subscriber* subscriber,
pti_callback_function callback,
void* user_data);

/**
* @brief Unsubscribe Callback subscriber, this will disable all callbacks and
* destroy subscriber handle
*/
pti_result PTI_EXPORT
ptiCallbackUnsubscribe(pti_callback_subscriber subscriber);

/**
* @brief Enables callbacks on specific domain
*
* @param subscriber - subscriber handle
* @param domain - domain to enable
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites
* @param exit_cb - indicates if callback called on exit/end: 0-no, 1-yes; used only for domains with 2 sites
* @return pti_result
*/
pti_result PTI_EXPORT
ptiCallbackEnableDomain(pti_callback_subscriber subscriber,
pti_callback_domain domain,
uint32_t enter_cb,
uint32_t exit_cb);

/**
* @brief Disables callbacks on specific domain
*/
pti_result PTI_EXPORT
ptiCallbackDisableDomain(pti_callback_subscriber subscriber,
pti_callback_domain domain);

/**
* @brief Disables all callbacks for the subscriber, all domains, APIs
*/
pti_result PTI_EXPORT
ptiCallbackDisableAll(pti_callback_subscriber subscriber);

/**
* @brief Enables callbacks on specific driver API
*
* @param subscriber - subscriber handle
* @param driver_api_group_id - driver API group ID, for now only PTI_API_GROUP_LEVELZERO supported
* @param driver_api_id - driver API ID
* @param enter_cb - indicate if callback called on enter/start: 0-no, 1-yes; used only for domains with 2 sites
* @param exit_cb - indicates if callback called on exit/end: 0-no, 1-yes; used only for domains with 2 sites
* @return pti_result
*/
pti_result PTI_EXPORT
ptiCallbackEnableDriverAPI(pti_callback_subscriber subscriber,
pti_api_group_id driver_api_group_id,
uint32_t driver_api_id,
uint32_t enter_cb,
uint32_t exit_cb);

/**
* @brief Disables callbacks on specific driver API
*/
pti_result PTI_EXPORT
ptiCallbackDisableDriverAPI(pti_callback_subscriber subscriber,
pti_api_group_id driver_api_group_id,
uint32_t driver_api_id);

/**
* @brief Returns external correlation IDs that are on top of the stack at the Callback point
*
* Usage: 1 - pass external_kind and external_id as nullptr to get number of IDs available
* on the top of stack at the Callback point
* 2 - call this API again where number is number of IDs to return,
* external_kind and external_id should have enough memory allocated to hold number of IDs
* that will be copied to that memory
*
* @param[in,out] number - pointer to retrieve the number of external correlation IDs on the stack
* or to ask for the number of IDs to return
* @param[in] external_kind - pointer to store kinds of the external correlation IDs
* @param[in] external_id - pointer to store external correlation IDs
* @return pti_result
*/
pti_result PTI_EXPORT
ptiCallbackGetExternalCorrelationIds(uint32_t* number,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to retrieve correlation ids set by the runtime system? If so, what is external about these ids?

From HPCToolkit's perspective, I want to be able to specify a 64-bit correlation id that represents a (thread id, sequence number) pair that I can use to direct monitoring information back to the initiating thread. I am not sure what your vision is for how tools will work with correlation ids using this API.

Both NVIDIA and AMD APIs provide the ability to either push/pop a correlation id for an operation before/after launch, or receive a singleton callback where a tool can specify an external correlation id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also external IDs that user can set up and with their help make a correlation of calling context with low level APIs and GPU operations. Please, see the API below.
It looked useful if in the Callback PTI can report external IDs from the top of a stack.
So that this could immediately correlate low-level API/operation with the calling context.
Of course it means that a tool is doing some instrumentation of a traced application.
if such mechanism is not required at the moment - we can postpone exposing/implementing it.

/**
* @brief External correlation kinds
*/
typedef enum _pti_view_external_kind {
PTI_VIEW_EXTERNAL_KIND_INVALID = 0, //!< Invalid external kind
PTI_VIEW_EXTERNAL_KIND_UNKNOWN = 1, //!< Unknown external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_0 = 2, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 = 3, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_2 = 4, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_3 = 5, //!< Custom external kind
} pti_view_external_kind;

* @brief External Correlation View record type
*/
typedef struct pti_view_record_external_correlation {
pti_view_record_base _view_kind; //!< Base record
uint32_t _correlation_id; //!< ID that correlates this record with records
//!< of other Views
uint64_t _external_id; //!< ID provided by user, marking an external
//!< to PTI operation
pti_view_external_kind _external_kind;
} pti_view_record_external_correlation;

* @brief Pushes ExternelCorrelationId kind and id for generation of external correlation records
*
* @param external_kind
* @param external_id
* @return pti_result
*/
pti_result PTI_EXPORT
ptiViewPushExternalCorrelationId(pti_view_external_kind external_kind, uint64_t external_id);
/**
* @brief Pops ExternelCorrelationId kind and id for generation of external correlation records
*
* @param external_kind
* @param external_id
* @return pti_result
*/
pti_result PTI_EXPORT
ptiViewPopExternalCorrelationId(pti_view_external_kind external_kind, uint64_t* p_external_id);

pti_view_external_kind* external_kind,
uint64_t* external_id);
#if defined(__cplusplus)
}
#endif

#endif // INCLUDE_PTI_CALLBACK_H_
1 change: 1 addition & 0 deletions sdk/src/pti_view.cc
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
#include <map>

#include "internal_helper.h"
#include "pti/pti_callback.h"
#include "tracing_cb_api.gen"
#include "view_handler.h"

Expand Down