CLI Utility for managing your cluster.
Usage: mgmt [OPTIONS] COMMAND [ARGS]...
Commands:
  clusters - Commands to manage clusters
  configurations - Commands to manage configurations
  database - Commands to do in the database
  fabrics - Commands to display fabrics
  login - Commands to manage login nodes
  network - Network block commands
  nodes - Commands to manage nodes
  recommendations - Commands to show recommendations about the cluster
  services - Commands to manage services
Commands to manage clusters.
Usage: mgmt clusters [OPTIONS] COMMAND [ARGS]...
Commands:
  add - Add nodes to clusters or memory fabrics
  create - Create a new cluster
  delete - Delete a cluster with name
  list - List all clusters in tabular or JSON format
Create a new cluster.
Usage: mgmt clusters create [OPTIONS]
Options:
  --count INTEGER - Number of nodes to add [required]
  --cluster TEXT - Specify the name of the cluster [required]
  --instancetype TEXT - Specify the instance type of the cluster [required]
  --names TEXT - Comma-separated list of host names
  --fabric TEXT - OCID of the memory fabric to add the nodes to, for BM.GPU.GB200.4 nodes
  --memorycluster TEXT - Name used for the memory cluster fabric. Defaults to cluster_xxxxx, where xxxxx is the last 5 characters of the fabric OCID
Examples:
# Create a standard compute cluster
mgmt clusters create --count 3 --cluster mycluster --instancetype BM.Standard.E3.128
# Create a GPU cluster with memory fabric
mgmt clusters create --count 2 --cluster mycluster --instancetype BM.GPU.GB200.4 --fabric ocid1.fabric.oc1..xxxx --names node01,node02
Add compute nodes to a cluster.
Usage: mgmt clusters add node [OPTIONS]
Options:
  --count INTEGER - Number of nodes to add [required]
  --cluster TEXT - Name of the cluster
  --names TEXT - Comma-separated list of host names
  --memorycluster TEXT - Name of the memory cluster (alternative to --cluster)
Example:
mgmt clusters add node --count 2 --cluster mycluster
Add nodes to a memory fabric.
Usage: mgmt clusters add memory-fabric [OPTIONS]
Options:
  --count INTEGER - Number of nodes to add [required]
  --cluster TEXT - Name of the compute cluster [required]
  --fabric TEXT - OCID of the memory fabric [required]
  --memorycluster TEXT - Name for the memory cluster
  --instancetype TEXT - Instance type for the nodes [required]
Example:
mgmt clusters add memory-fabric --count 1 --cluster mycluster --fabric ocid1.fabric.oc1..xxxx --instancetype BM.GPU.GB200.4
Delete a cluster with name.
Usage: mgmt clusters delete [OPTIONS]
Options:
  --cluster TEXT - Specify the name of the cluster
  --memory_cluster TEXT - Specify the name of the memory cluster (the compute cluster does not need to be specified)
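The two deletion modes above can be sketched as follows. This is only an illustration: mycluster and cluster_abcde are placeholder names, and run is a hypothetical dry-run helper (not part of mgmt) that prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Delete a compute cluster by name (placeholder name)
run mgmt clusters delete --cluster mycluster

# Delete a memory cluster by name; --cluster is not needed (placeholder name)
run mgmt clusters delete --memory_cluster cluster_abcde
```

Note that this command spells the option --memory_cluster with an underscore, unlike --memorycluster on the create and add commands.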
List all clusters in tabular or JSON format.
Usage: mgmt clusters list [OPTIONS]
Options:
  --format [tabular|json] - Output format [default: tabular]
Examples:
# List all clusters
mgmt clusters list
# List all clusters in JSON format
mgmt clusters list --format json
Commands to manage configurations.
Usage: mgmt configurations [OPTIONS] COMMAND [ARGS]...
Commands:
  create - Create Configuration
  delete - Delete Configuration
  get - Get information about the configuration
  list - List Configuration based on role, partition, or shape
  update - Update Configuration
Create Configurations from file.
Usage: mgmt configurations create from-file [OPTIONS]
Options:
  --file TEXT - Name of the JSON or YAML file [required]
Duplicate Configuration with new name.
Usage: mgmt configurations create from-existing [OPTIONS]
Options:
  --configuration TEXT - Name of the existing configuration to copy [required]
  --name TEXT - Name for the new configuration [required]
Delete Configuration.
Usage: mgmt configurations delete [OPTIONS]
Options:
  --configuration TEXT - Name of the configuration to delete [required]
Get information about the configuration.
Usage: mgmt configurations get [OPTIONS]
Options:
  --name TEXT - Name of the configuration to get [required]
List Configuration based on role, partition, or shape.
Usage: mgmt configurations list [OPTIONS]
Options:
  --format [tabular|json|yaml] - Output format [default: tabular]
  --output_file TEXT - Name of the output file
  --partition TEXT - Get all configurations in that defined partition
  --role [compute|login|all] - Get all configurations for compute or login [default: all]
  --shape TEXT - Get all configurations with a particular shape
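As one possible use of the filters above, the sketch below builds a list command per role and sends the result to a YAML file. The file names are placeholders, list_cmd is a hypothetical helper, and it is an assumption that --output_file combines with --format yaml in this way. The helper only prints the command.

```shell
# Hypothetical helper: print a "configurations list" command for one role,
# with YAML output directed to a file.
list_cmd() {
  role=$1
  outfile=$2
  # "set --" keeps every option and value as a separate word.
  set -- mgmt configurations list --role "$role" --format yaml --output_file "$outfile"
  printf '%s\n' "$*"   # replace this line with "$@" to execute instead of print
}

list_cmd compute compute-configs.yaml
list_cmd login login-configs.yaml
```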
Update Configuration.
Usage: mgmt configurations update [OPTIONS]
Options:
  --name TEXT - Name of the configuration to update [required]
  --fields TEXT - Comma-separated list of updates to apply. Example: "shape=VM.Standard.E5.Flex,instance_pool_ocpus=4" [required]
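A common sequence is to copy an existing configuration, change a few fields on the copy, and inspect the result. The sketch below shows that flow under stated assumptions: compute-default and compute-e5 are placeholder configuration names, and run is a hypothetical dry-run helper (not part of mgmt) that prints commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Duplicate an existing configuration under a new name
run mgmt configurations create from-existing --configuration compute-default --name compute-e5

# 2. Apply field updates to the copy (one quoted, comma-separated argument)
run mgmt configurations update --name compute-e5 --fields "shape=VM.Standard.E5.Flex,instance_pool_ocpus=4"

# 3. Inspect the updated configuration
run mgmt configurations get --name compute-e5
```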
Commands to do in the database.
Usage: mgmt database [OPTIONS] COMMAND [ARGS]...
Commands:
  add - Add specific node to the DB
  create - Create database/tables
  delete - Delete nodes from the DB
  export - Export database contents to a SQLite DB file
  scan-vcn - Scan the specified VCN CIDR to list nodes
  update - Update a field for a list of nodes
Add specific node to the DB.
Usage: mgmt database add [OPTIONS]
Options:
  --ip TEXT - IP address of the node [required]
  --hostname TEXT - Hostname of the node
  --ocid TEXT - OCID of the node
Create database/tables. Will not recreate tables that already exist.
Usage: mgmt database create [OPTIONS]
Delete nodes from the DB. This will not terminate the nodes.
Usage: mgmt database delete [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
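The add and delete commands above can be paired to register a node in the DB and later remove its record. In this sketch the IP address and hostname are placeholders, and run is a hypothetical dry-run helper (not part of mgmt) that prints commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Register a node in the DB; only --ip is required (placeholder values)
run mgmt database add --ip 10.0.0.12 --hostname node01

# Remove the DB record again; the node itself is not terminated
run mgmt database delete --nodes node01
```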
Export database contents to a SQLite DB file. May not work if Python was built without sqlite support.
Usage: mgmt database export [OPTIONS]
Options:
  --filename TEXT - SQLite filename. Must not already exist [default: export.sqlite]
  --use-base - Use embedded Base metadata when creating the target DB. This can serve as a very simple validation: if the source database schema doesn't match, an error may be raised
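Because the export target must not already exist, repeated exports need a fresh filename each time. The sketch below is one way to pick one. The next_free_name helper and the export-N.sqlite naming scheme are assumptions for illustration, not part of mgmt, and the final line only prints the command.

```shell
# Hypothetical helper: return "<base>.<ext>" if it is free, otherwise the
# first "<base>-N.<ext>" (N = 1, 2, ...) that does not exist yet.
next_free_name() {
  base=$1
  ext=$2
  n=0
  candidate="$base.$ext"
  while [ -e "$candidate" ]; do
    n=$((n + 1))
    candidate="$base-$n.$ext"
  done
  printf '%s\n' "$candidate"
}

target=$(next_free_name export sqlite)

# Print the export command for the chosen filename
printf '%s\n' "mgmt database export --filename $target"
```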
Scan the specified VCN CIDR to list nodes.
Usage: mgmt database scan-vcn [OPTIONS] CIDR
Options:
  --dns - Scan DNS
  --change_hostname - Change OCI hostname
Update a field for a list of nodes.
Usage: mgmt database update [OPTIONS] IDENTIFIERS
Options:
  --fields TEXT - Comma-separated list of updates to apply. Example: shape=VM.Standard.E5.Flex,instance_pool_ocpus=4 [required]
Commands to display fabrics.
Usage: mgmt fabrics [OPTIONS] COMMAND [ARGS]...
Commands:
  list - List all fabrics for nodes
List all fabrics for nodes.
Usage: mgmt fabrics list [OPTIONS]
Options:
  --full - Get full information about the node
Commands to manage login nodes.
Usage: mgmt login [OPTIONS] COMMAND [ARGS]...
Commands:
  create - Add login node to the cluster
  delete - Delete a login node
  list - List all login nodes
Add login node to the cluster.
Usage: mgmt login create [OPTIONS]
Options:
  --count INTEGER - Number of login nodes to add [required]
  --configuration TEXT - Specify the name of the login configuration [required]
  --names TEXT - Comma-separated list of host names [required]
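Since both --count and --names are required, one way to keep them in sync is to derive the count from the name list. The sketch below does that under two assumptions that the reference above does not state: that --count is expected to equal the number of names, and that login-config is the name of an existing login configuration. The count_names helper is hypothetical, and the final line only prints the command.

```shell
# Hypothetical helper: count the non-empty entries in a comma-separated list.
count_names() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

names="login01,login02"          # placeholder host names
count=$(count_names "$names")

# Print the resulting command
printf '%s\n' "mgmt login create --count $count --configuration login-config --names $names"
```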
Delete a login node.
Usage: mgmt login delete [OPTIONS]
Options:
  --hostname TEXT - Specify the name of the login node [required]
List all login nodes.
Usage: mgmt login list [OPTIONS]
Options:
  --format [tabular|json] - Output format [default: tabular]
Network block commands.
Usage: mgmt network [OPTIONS] COMMAND [ARGS]...
Commands:
  blocks - Commands to manage network blocks
  rails - Commands to manage rails
Get blocks by cluster.
Usage: mgmt network blocks list cluster [OPTIONS]
Options:
  --cluster TEXT - Name of the cluster [required]
Get rails by cluster.
Usage: mgmt network rails list cluster [OPTIONS]
Options:
  --cluster TEXT - Name of the cluster [required]
  --nodes / --no-nodes - Show nodes in the rail
Commands to manage nodes.
Usage: mgmt nodes [OPTIONS] COMMAND [ARGS]...
Commands:
  boot-volume-swap - Boot Volume Swap one or more nodes
  get - Get information about nodes
  healthchecks - Get healthcheck details of nodes
  list - List nodes with various filters and formats
  reboot - Reboot one or more nodes
  reconfigure - Rerun the cloud-init script on the nodes
  tag - Tag nodes as unhealthy
  tag-and-terminate - Tag and Terminate nodes
  terminate - Terminate nodes
Boot Volume Swap one or more nodes.
Usage: mgmt nodes boot-volume-swap [OPTIONS]
You must specify either --nodes or --fields to identify which nodes get a boot volume swap.
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
  --image TEXT - Specify the image for BVR
  --size INTEGER - Specify the size for BVR in GB
Examples:
# Boot Volume Swap by node names
mgmt nodes boot-volume-swap --nodes=node1,node2
# Boot Volume Swap by fields
mgmt nodes boot-volume-swap --fields=role=compute,status=running
# Boot Volume Swap image
mgmt nodes boot-volume-swap --nodes=node1 --image=ocid1.image.oc1..exampleuniqueid
# Boot Volume Swap BV size
mgmt nodes boot-volume-swap --nodes=node1 --size=100
Get information about nodes.
Usage: mgmt nodes get [OPTIONS] COMMAND [ARGS]...
Subcommands:
  any - Default: Get info by serial, IP, OCID, or hostname
  ids - Get information about a node by ID
  ips - Get information about a node by IP
  names - Get information about a node by host name
  serials - Get information about a node by serial number
Common Options:
  --format [node|csv|json] - Output format [default: node]
Get healthcheck details of node/s.
Usage: mgmt nodes healthchecks [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
  --type [all|passive|active|multi-node] - Type of healthcheck to run
  --exclude-node TEXT - Node to exclude from the multi-node healthcheck
  --reservation TEXT - Reservation name to use for the healthcheck when the nodes are in a reservation. InitialValidation is the reservation created for all new nodes
Examples:
# Get healthcheck details of a node
mgmt nodes healthchecks --nodes gpu-6175
List nodes with various filters and formats.
Usage: mgmt nodes list [OPTIONS]
Options:
  --one-line - Share the hostnames list in one line (or compact output with --json)
  --cluster TEXT - List nodes that are part of named cluster
  --memory-cluster TEXT - List nodes that are part of named memory cluster
  --style [lines|box|none] - Table style for tabular output [default: box]
  --format [tabular|node|csv|json] - Output format [default: tabular]
  --width INTEGER - Width of output [default: detect from terminal or COLUMNS env var]
  --columns TEXT - Comma-separated list of fields to display. Also accepts ALL, DEFAULT, SIMPLE (all single-line fields), HC (all healthcheck fields + simple fields), or LIST (to list field names and exit)
  --no-header - Do not include header in tabular/csv formats
  --fields TEXT - Comma-separated list of fields to filter on. Example: role=compute,status=running
Examples:
# List all nodes in a cluster
mgmt nodes list --cluster mycluster
# Lists all node hostnames in a boxed table format without headers, using a fixed width of 30
mgmt nodes list --columns hostname --style box --no-header --width 30
# Lists all compute nodes in a json format with all fields
mgmt nodes list --format json --columns all --fields role=compute
Reboot one or more nodes.
Usage: mgmt nodes reboot [OPTIONS]
You must specify either --nodes or --fields to identify which nodes to reboot.
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
  --soft - Perform a soft reboot (OS level) instead of a hard reset
Examples:
# Reboot by node names
mgmt nodes reboot --nodes=node1,node2
# Reboot by fields
mgmt nodes reboot --fields=role=compute,status=running
# Soft reboot
mgmt nodes reboot --nodes=node1 --soft
Rerun the cloud-init script on the nodes.
Usage: mgmt nodes reconfigure [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
  --action [compute|controller|all|custom|command] - What to reconfigure:
    compute - Rerun the cloud-init
    controller - Reconfigure the node on the controller (Slurm Topology and Prometheus targets)
    all - Reconfigure the node on the controller and the cloud-init
    custom - Reconfigure the node on the controller and the cloud-init
    command - Run a custom command on the nodes
  --command TEXT - Specify the command to run on the nodes. To be used with --action=command
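The --action and --command options above can be combined as in the sketch below. Hostnames and the uptime command are placeholders, and run is a hypothetical dry-run helper (not part of mgmt) that prints commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Rerun cloud-init on two nodes (placeholder hostnames)
run mgmt nodes reconfigure --nodes node1,node2 --action compute

# Run a custom command on every running compute node
run mgmt nodes reconfigure --fields role=compute,status=running --action command --command uptime
```

In dry-run mode the helper prints arguments joined by spaces, so a --command value that itself contains spaces is shown without its quotes even though it is passed as a single argument when executed.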
Tag nodes as unhealthy.
Usage: mgmt nodes tag [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
Tag and Terminate nodes.
Usage: mgmt nodes tag-and-terminate [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
Terminate nodes.
Usage: mgmt nodes terminate [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --fields TEXT - Fields to filter nodes (e.g., role=compute,status=running)
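Because --fields can match many nodes at once, it may be worth previewing the selection with nodes list before terminating. The sketch below reuses one filter for both steps. It assumes that nodes list and nodes terminate interpret --fields the same way, and run is a hypothetical dry-run helper (not part of mgmt) that prints commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

filter="role=compute,status=running"

# 1. Preview which nodes the filter matches
run mgmt nodes list --fields "$filter" --columns hostname

# 2. Terminate the same selection
run mgmt nodes terminate --fields "$filter"
```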
Commands to show recommendations about the cluster.
Usage: mgmt recommendations [OPTIONS] COMMAND [ARGS]...
Commands:
  list - List all the nodes with recommendations
  run - Run all the recommendations
List all the nodes with recommendations.
Usage: mgmt recommendations list [OPTIONS]
Options:
  --healthcheck - Only show the Healthcheck Recommendations
  --unreachable - Only show the unreachable nodes
  --unconfigured - Only show the nodes failing to start
  --unreachable_timeout INTEGER - Timeout in minutes before a node is considered unreachable
  --unconfigured_timeout INTEGER - Timeout in minutes before a node is considered unconfigured
Run all the recommendations.
Usage: mgmt recommendations run [OPTIONS]
Options:
  --nodes TEXT - Comma-separated list of nodes (IP addresses, hostnames, OCIDs, serials, or OCI names)
  --healthcheck
  --unreachable - Get full information about the nodes
  --unconfigured - Get full information about the nodes
  --unreachable_timeout INTEGER - Timeout in minutes before a node is considered unreachable
  --unconfigured_timeout INTEGER - Timeout in minutes before a node is considered unconfigured
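A cautious flow is to review the recommendations first and then run them for a chosen subset of nodes. The sketch below illustrates this. The 30-minute timeout and the hostnames are placeholders, and run is a hypothetical dry-run helper (not part of mgmt) that prints commands unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper (not part of mgmt): prints the command
# unless DRY_RUN=0 is set, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Show only nodes considered unreachable after 30 minutes
run mgmt recommendations list --unreachable --unreachable_timeout 30

# 2. Run the recommendations for selected nodes only (placeholder hostnames)
run mgmt recommendations run --nodes node1,node2
```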
Commands to manage services.
Usage: mgmt services [OPTIONS] COMMAND [ARGS]...
Commands:
  active-hc - Run active healthcheck
  all - Run full workflow: scan queue, update metadata, run ansible and update nodes in case of success
  ansible - Run Ansible to configure nodes
  init - Reconfigure the Slurm Config files on the controller
  multi-node-hc - Run active healthcheck
  scan-host-api - Scan Host API, update Health information and report number of available nodes in the dedicated pool
  scan-queue - Scan queue for new or removed nodes and update the DB
  update-metadata - Update metadata for all hosts in the DB
Run active healthcheck.
Usage: mgmt services active-hc [OPTIONS]
Run full workflow: scan queue, update metadata, run ansible and update nodes in case of success.
Usage: mgmt services all [OPTIONS]
Options:
  --http_port INTEGER - Specify HTTP Port
Run Ansible to configure nodes.
Usage: mgmt services ansible [OPTIONS]
Reconfigure the Slurm Config files on the controller, including topology.conf.
Usage: mgmt services init [OPTIONS]
Run active healthcheck.
Usage: mgmt services multi-node-hc [OPTIONS]
Scan Host API, update Health information and report number of available nodes in the dedicated pool.
Usage: mgmt services scan-host-api [OPTIONS]
Scan queue for new or removed nodes and update the DB.
Usage: mgmt services scan-queue [OPTIONS]
Update metadata for all hosts in the DB.
Usage: mgmt services update-metadata [OPTIONS]
Options:
  --nodes TEXT - Any of the hostname, OCID, IP, serial, or OCI name of the node
  --http_port INTEGER - Specify HTTP Port