@@ -616,7 +616,7 @@ \section{Process and Job Monitoring}
 job. Various watchdog methods have been developed for detecting this situation, including requiring a periodic ``heartbeat''
 from the application and monitoring a specified file for changes in size and/or modification time.
 
-The following \acp{API} allow applications to request monitoring, directing what is to be monitored, the frequency of the associated check, whether or not the application is to be notified (via the event notification subsystem) of stall detection, and other characteristics of the operation.
+The following \acp{API} allow applications to request monitoring, directing what is to be monitored, the frequency of the associated check, whether or not the application is to be notified (via the event notification subsystem) of stall detection, and other characteristics of the operation. In addition, statistics on the resource usage at the individual process level and/or overall node level (including usage information on disks and/or network interfaces) can be measured and periodically reported back to the requestor.
@@ -697,22 +713,27 @@
 \item\refattr{PMIX_RANGE} Non-default range to be used when generating the associated event for this monitoring action.
+\item\refattr{PMIX_MONITOR_LOCAL_ONLY}
 \end{itemize}
 \optattrend
 
 %%%%
 \descr
 
-Request that application processes and/or files be monitored via several possible methods.
-For example, that the server monitor a given process for periodic heartbeats as an indication that the process has not become ``wedged''.
-When a monitor detects the specified alarm condition, it will generate an event notification using the provided error code and passing along any available relevant information.
-It is up to the caller to register a corresponding event handler.
+This \ac{API} can be used for two purposes:
+
+\begin{itemize}
+\item Request that application processes and/or files be monitored for activity via several possible methods. For example, that the server monitor a given process for periodic heartbeats as an indication that the process has not become ``wedged''. When a monitor detects the specified alarm condition, it will generate an event notification using the provided error code and passing along any available relevant information. It is up to the caller to register a corresponding event handler.
+\item Report resource usage statistics for processes and/or nodes, including disk and network interfaces attached to nodes. This can be done on a per-request basis, or periodically updated on a time interval specified by the \refattr{PMIX_MONITOR_RESOURCE_RATE} attribute.
+\end{itemize}
 
 The \refarg{monitor} argument is an attribute indicating the type of monitor being requested.
-For example, \refattr{PMIX_MONITOR_FILE_CHANGES} to indicate that the requestor is asking that a file be monitored.
+For example, \refattr{PMIX_MONITOR_FILE_CHANGES} to indicate that the requestor is asking that a file be monitored, or \refattr{PMIX_MONITOR_PROC_RESOURCE_USAGE} to obtain a report on process resource usage.
 
-The \refarg{error} argument is the status code to be used when generating an event notification alerting that the monitor has been triggered.
-The range of the notification defaults to \refconst{PMIX_RANGE_NAMESPACE}.
-This can be changed by providing a \refattr{PMIX_RANGE} directive.
+The \refarg{error} argument is the status code to be used when generating an event notification alerting that the monitor has been triggered or to receive a periodic resource usage update.
+The range of the notification defaults to \refconst{PMIX_RANGE_NAMESPACE} for alarm events, and to \refconst{PMIX_RANGE_CUSTOM} for resource usage updates to ensure delivery solely to the requesting process. This can be changed by providing a \refattr{PMIX_RANGE} directive.
 
-The \refarg{directives} argument characterizes the monitoring request (e.g., monitor file size) and frequency of checking to be done
+The \refarg{directives} argument characterizes the monitoring request (e.g., monitor file size or specific resource usage metrics to be measured) and frequency of checking to be done.
 
-The returned \refarg{status} indicates whether or not the request was granted, and information as to the reason for any denial of the request shall be returned in the \refarg{results} array.
+The returned \refarg{status} indicates whether or not the request was granted, and information as to the reason for any denial of the request shall be returned in the \refarg{results} array. If the request was successful, then any measured values will be returned in the \refarg{results}.
+
+\adviceuserstart
+A \refconst{PMIX_SUCCESS} return status only indicates that no error was encountered when executing the request and does not guarantee a non-\code{NULL} value for \refarg{results} when requesting resource usage statistics. For example, a request for resource usage by processes on a specified node that is not currently executing any user-level processes will return success to indicate that the request was executed without error, but a \code{NULL} \refarg{results} array because no statistics were returned.
@@ -756,71 +778,8 @@
-\item\refconst{PMIX_OPERATION_SUCCEEDED}, indicating that the request was immediately processed and returned \textit{success} - the \refarg{cbfunc} will \textit{not} be called
+\item\refconst{PMIX_OPERATION_SUCCEEDED}, indicating that the request was immediately processed and returned \textit{success} - the \refarg{cbfunc} will \textit{not} be called, and no resource usage results will be returned by the \ac{API} itself.
 \end{itemize}
 \returnend
 
-\optattrstart
-The following attributes may be implemented by a \ac{PMIx} library or by the host environment. If an attribute is supported by the \ac{PMIx} server library, then the library must not pass the supported attributes to the host environment unless the requested action involves other nodes. In addition, the library is \textit{required} to add the \refAttributeItem{PMIX_USERID} and the \refAttributeItem{PMIX_GRPID} attributes of the requesting process to the directives array when it passes actions to its host.
-
-The \refarg{monitor} argument may contain any of the following actions:
-
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_CANCEL}
-\item\refattr{PMIX_MONITOR_HEARTBEAT} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_HEARTBEAT_TIME}
-\item\refattr{PMIX_MONITOR_HEARTBEAT_DROPS}
-\end{itemize}
-\item\refattr{PMIX_SEND_HEARTBEAT}
-\item\refattr{PMIX_MONITOR_FILE_CHANGES} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_FILE_CHECK_TIME}
-\item\refattr{PMIX_MONITOR_FILE_DROPS}
-\item\refattr{PMIX_MONITOR_TARGET_FILES}
-\item\refattr{PMIX_MONITOR_TARGET_NODES}. Monitor the given files on the specified nodes, where present.
-\item\refattr{PMIX_MONITOR_TARGET_NODEIDS}. Monitor the given files on the specified nodes, where present.
-\end{itemize}
-\item\refattr{PMIX_MONITOR_PROC_RESOURCE_USAGE} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_RESOURCE_RATE}
-\item\refattr{PMIX_MONITOR_TARGET_PROCS}
-\item\refattr{PMIX_MONITOR_TARGET_PIDS}
-\item\refattr{PMIX_MONITOR_TARGET_NODES}. All processes on the specified nodes are to be monitored.
-\item\refattr{PMIX_MONITOR_TARGET_NODEIDS}. All processes on the specified nodes are to be monitored.
-\end{itemize}
-\item\refattr{PMIX_MONITOR_NODE_RESOURCE_USAGE} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_RESOURCE_RATE}
-\item\refattr{PMIX_MONITOR_TARGET_NODES}
-\item\refattr{PMIX_MONITOR_TARGET_NODEIDS}
-\item\refattr{PMIX_MONITOR_TARGET_PROCS}. Monitor the nodes where the specified processes are located.
-\end{itemize}
-\item\refattr{PMIX_MONITOR_DISK_RESOURCE_USAGE} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_RESOURCE_RATE}
-\item\refattr{PMIX_MONITOR_TARGET_DISKS}
-\item\refattr{PMIX_MONITOR_TARGET_NODES}
-\item\refattr{PMIX_MONITOR_TARGET_NODEIDS}
-\item\refattr{PMIX_MONITOR_TARGET_PROCS}. Monitor the nodes where the specified processes are located.
-\end{itemize}
-\item\refattr{PMIX_MONITOR_NETWORK_RESOURCE_USAGE} The associated \refarg{directives} array may include any of the following:
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_RESOURCE_RATE}
-\item\refattr{PMIX_MONITOR_TARGET_NETS}
-\item\refattr{PMIX_MONITOR_TARGET_NODES}
-\item\refattr{PMIX_MONITOR_TARGET_NODEIDS}
-\item\refattr{PMIX_MONITOR_TARGET_PROCS}. Monitor the nodes where the specified processes are located.
-\end{itemize}
-\end{itemize}
-
-In addition to action-specific directives, the \refarg{directives} array may include:
-
-\begin{itemize}
-\item\refattr{PMIX_MONITOR_ID}
-\item\refattr{PMIX_MONITOR_APP_CONTROL}
-\item\refattr{PMIX_RANGE} Non-default range to be used when generating the associated event for this monitoring action.
-\end{itemize}
-\optattrend
-
 %%%%
 \descr
 
-Non-blocking form of the \refapi{PMIx_Process_monitor} \ac{API}. The \refarg{cbfunc} function provides a \refarg{status} to indicate whether or not the request was granted, and to provide some information as to the reason for any denial in the \refapi{pmix_info_cbfunc_t} array of \refstruct{pmix_info_t} structures.
+Non-blocking form of the \refapi{PMIx_Process_monitor} \ac{API}. The \refarg{cbfunc} function provides a \refarg{status} to indicate whether or not the request was granted, and to provide some information as to the reason for any denial in the \refapi{pmix_info_cbfunc_t} array of \refstruct{pmix_info_t} structures. Any resource usage data generated by the function will be returned there as well.
@@ -1340,6 +1300,5 @@
 implementations and \acp{OS}, depending upon availability and access restrictions, and the
 provided list of requested values. Except for
 the node ID as the first element, ordering of information in the array is arbitrary.
-}
 
 Optional information that may be included (see \href{https://www.kernel.org/doc/html/latest/filesystems/proc.html#kernel-data}{KERNEL} and \href{https://www.kernel.org/doc/html/latest/filesystems/proc.html#meminfo}{MEMINFO} for a detailed description of the following fields):
@@ -1381,4 +1340,4 @@
 files which have been mmapped, such as libraries. Note that some kernel configurations might consider all pages part of a larger allocation (e.g., THP) as “mapped”, as soon as a single page is mapped. In MBytes
-}
-\item\refattr{PMIX_DISK_RESOURCE_USAGE} One for each disk attached to the node, if requested.
-\item\refattr{PMIX_NETWORK_RESOURCE_USAGE} One for each network interface on the node, if requested.
+}
+\item\refattr{PMIX_DISK_RESOURCE_USAGE} One for each disk attached to the node, if requested.
+\item\refattr{PMIX_NETWORK_RESOURCE_USAGE} One for each network interface on the node, if requested.