gpu-workloads

Ping me at [email protected] for help with the scripts.

Two scripts are described here: the first monitors GPU data, and the second captures network data across the cluster.

Clone the scripts
- git clone https://github.com/slakumalla-intel/monitor-gpus.git
- cd monitor-gpus; chmod +x *.sh
Monitor GPUs
- open monitor_gaudi3.sh and set the time period over which to collect data
- run the monitor: ./monitor_gaudi3.sh (press Ctrl+C to stop it once your test has completed)
- the script creates 3 output files in CSV format:
  - gpu_monitor_basiclog.csv : captures power, temperature, utilization, and memory-related data
  - gpu_monitor_ecclog.csv : captures the ECC error data
  - gpu_monitor_pcielog.csv : captures the PCIe link speed data
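
Once the monitor has been stopped, a quick sanity check is to confirm that each CSV actually received samples. The sketch below counts data rows per file. It assumes the first line of each CSV is a header row, which this README does not state. `csv_data_rows` and `summarize_monitor_output` are hypothetical helper names, not part of the repo.

```shell
#!/usr/bin/env bash

# Print the number of data rows in a CSV file.
# Assumption: the first line is a header row (not confirmed by the README).
csv_data_rows() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo 0
        return 0
    fi
    # NR > 1 skips the assumed header; "n + 0" prints 0 for empty files.
    awk 'NR > 1 { n++ } END { print n + 0 }' "$file"
}

# Report the row count for each of the three monitor output files.
summarize_monitor_output() {
    local dir="${1:-.}"
    local name
    for name in gpu_monitor_basiclog.csv gpu_monitor_ecclog.csv gpu_monitor_pcielog.csv; do
        echo "$name: $(csv_data_rows "$dir/$name") data rows"
    done
}
```

After sourcing these functions, `summarize_monitor_output .` run from the monitor-gpus directory prints one line per CSV. A count of 0 suggests the monitor was stopped before it wrote any samples.
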
Network ports and data monitoring
We can capture the data either on the local node or across the cluster.

a. Local node:
- capture the local network data and check whether any ports are in the down state: ./g3_get_node_port_status.sh | grep -i down
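
For use in a job script, the local check can be turned into a pass/fail gate. This is a minimal sketch with a hypothetical helper name, `report_down_ports`. Like the grep in this README, it assumes the word "down" appears in the status output only for ports that are actually down.

```shell
#!/usr/bin/env bash

# Read port status text on stdin. If any line contains "down"
# (case-insensitive), print those lines and return 1; otherwise return 0.
report_down_ports() {
    local down
    # grep exits non-zero when nothing matches, so "|| true" keeps this
    # safe when the caller runs under "set -e".
    down=$(grep -i 'down' || true)
    if [ -n "$down" ]; then
        echo "Ports reported down:" >&2
        echo "$down"
        return 1
    fi
    echo "No ports reported down."
    return 0
}
```

A possible use is `./g3_get_node_port_status.sh | report_down_ports || exit 1`, which stops the calling script before a workload starts on a node with a down link.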

b. Cluster nodes:
- open hosts.txt and update it with the list of nodes of interest
- set up passwordless network access to the cluster nodes: ./pass_setup.sh <username> (ex: ./pass_setup.sh slakumal); enter your password when prompted during the execution of this script
```
# capture the network port status and stats data across the cluster
./cluster_status.sh <username>
# list any ports reported as down
grep -i down network_port_status.log
```
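
Before running the cluster capture, it can help to confirm that passwordless access works for every node. The sketch below assumes that access is over SSH and that hosts.txt holds one host name per line; neither is stated in this README. `list_hosts` and `check_passwordless_ssh` are hypothetical helpers, not scripts from the repo.

```shell
#!/usr/bin/env bash

# Print the host names from a hosts file, skipping blank lines and
# lines starting with "#". Assumption: one host per line.
list_hosts() {
    local hosts_file="${1:-hosts.txt}"
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hosts_file"
}

# Try a non-interactive SSH login to every host and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
    local user="$1"
    local hosts_file="${2:-hosts.txt}"
    local host failed=0
    for host in $(list_hosts "$hosts_file"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: passwordless SSH failed"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_passwordless_ssh slakumal hosts.txt` prints one line per node and returns non-zero if any node still needs a password, in which case pass_setup.sh can be run again.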
c. Output data files to debug:
- network_external_link_stats.log : captures Gaudi external NIC stats (across all external ports of nodes listed in hosts.txt)
- network_internal_link_stats.log : captures Gaudi internal port stats (across all internal ports of nodes listed in hosts.txt)
- network_port_status.log : captures Gaudi internal and external port status (across all ports of nodes listed in hosts.txt)
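
A missing or empty log usually means the capture did not reach the nodes, so it is worth checking for all three files before debugging from them. The sketch below is a hypothetical helper, `check_cluster_logs`. It assumes the logs are written to the directory the scripts were run from.

```shell
#!/usr/bin/env bash

# Print the names of expected log files that are missing or empty in the
# given directory; return 1 if any are, 0 otherwise.
check_cluster_logs() {
    local dir="${1:-.}"
    local name missing=0
    for name in network_external_link_stats.log \
                network_internal_link_stats.log \
                network_port_status.log; do
        # -s is true only when the file exists and has a size above zero.
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_cluster_logs .` after ./cluster_status.sh prints nothing and returns 0 when all three logs contain data.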
