
Nvidia DGX Server (REST)

Description

This connector monitors hardware for Nvidia DGX Servers.

This connector supersedes:

Tags: enterprise, hardware, nvidia

Target

Typical platform: Nvidia DGX

Operating system: Out-Of-Band

Prerequisites

Leverages: Nvidia DGX REST API

Technology and protocols: HTTP/REST

This connector is not available for the local host (it is applicable to remote hosts only).

Examples

CLI

metricshub HOSTNAME -t management -c +NvidiaDGXREST --https --http-port 443 -u USERNAME

metricshub.yaml

resourceGroups:
  <RESOURCE_GROUP>:
    resources:
      <HOSTNAME-ID>:
        attributes:
          host.name: <HOSTNAME> # Change with actual host name
          host.type: management
        connectors: [ +NvidiaDGXREST ] # Optional, to load only this connector
        protocols:
          http:
            https: true
            port: 443 # Change if the management interface uses a different HTTPS port
            username: <USERNAME> # Change with actual credentials
            password: <PASSWORD> # Encrypted using metricshub-encrypt

Connector Activation Criteria

The Nvidia DGX Server (REST) connector is activated automatically, and its status is reported as OK, if all of the criteria below are met:

  • The HTTP Request below to the managed host succeeds:
    • GET /redfish/v1/Systems
    • Request Header:
      ${file::httpHeader}
    • The response body contains: redfish (regex)
  • The HTTP Request below to the managed host succeeds:
    • GET /redfish/v1/Systems/DGX
    • Request Header:
      ${file::httpHeader}
    • The response body contains: redfish (regex)
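
To check by hand whether a host is likely to satisfy these criteria, the two requests can be reproduced outside MetricsHub. The following is a minimal sketch, not the connector's actual implementation. It assumes HTTP Basic authentication (the real content of ${file::httpHeader} is not shown on this page), treats any 2xx status as success, and applies the regex as a case-sensitive search. The names meets_activation_criteria and make_basic_auth_fetch are hypothetical helpers introduced for illustration.

```python
import base64
import re
import ssl
import urllib.request
from typing import Callable, Tuple

# The two paths and the "redfish" pattern come from the criteria above.
# Everything else (auth scheme, status handling, timeout) is an assumption.
CRITERIA_PATHS = ("/redfish/v1/Systems", "/redfish/v1/Systems/DGX")
EXPECTED_BODY = re.compile(r"redfish")

# A fetcher takes a URL path and returns (HTTP status, response body).
Fetch = Callable[[str], Tuple[int, str]]


def meets_activation_criteria(fetch: Fetch) -> bool:
    """Return True only if every request succeeds and its body matches."""
    for path in CRITERIA_PATHS:
        try:
            status, body = fetch(path)
        except OSError:
            # urllib raises URLError/HTTPError (both OSError subclasses)
            # on connection failures and on 4xx/5xx responses.
            return False
        if not 200 <= status < 300 or not EXPECTED_BODY.search(body):
            return False
    return True


def make_basic_auth_fetch(host: str, username: str, password: str,
                          port: int = 443, verify_tls: bool = True) -> Fetch:
    """Build a fetcher that sends HTTPS GET requests with Basic auth (assumed)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    context = ssl.create_default_context()
    if not verify_tls:
        # Management controllers often use self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    def fetch(path: str) -> Tuple[int, str]:
        request = urllib.request.Request(
            f"https://{host}:{port}{path}",
            headers={"Authorization": f"Basic {token}"},
        )
        with urllib.request.urlopen(request, context=context, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")

    return fetch
```

A call such as meets_activation_criteria(make_basic_auth_fetch("HOSTNAME", "USERNAME", "PASSWORD")) returns False as soon as either request fails or its body lacks the pattern. Passing the fetcher in as a parameter keeps the check itself free of network code, so it can be exercised with a stub.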

Metrics

Type, collected metrics, and specific attributes (for each type, the metrics are listed first, followed by the attributes):
cpu
  • hw.cpu.speed.limit{limit_type="max"}
  • hw.status{hw.type="cpu", state="degraded|failed|ok"}
  • hw.status{hw.type="cpu", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • model
  • name
  • vendor
enclosure
  • hw.enclosure.energy
  • hw.enclosure.power
  • hw.power.limit{hw.type="enclosure", limit_type="high.critical"}
  • hw.status{hw.type="enclosure", state="degraded|failed|ok"}
  • hw.status{hw.type="enclosure", state="present"}
  • id
  • model
  • name
  • serial_number
  • type
  • vendor
fan
  • hw.fan.speed
  • hw.fan.speed.limit{limit_type="low.critical"}
  • hw.fan.speed_ratio
  • hw.status{hw.type="fan", state="degraded|failed|ok"}
  • hw.status{hw.type="fan", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • name
gpu
  • hw.energy{hw.type="gpu"}
  • hw.gpu.speed
  • hw.gpu.speed.limit{limit_type="high.critical"}
  • hw.gpu.speed.limit{limit_type="high.degraded"}
  • hw.gpu.speed.limit{limit_type="low.degraded"}
  • hw.power.limit{hw.type="gpu", limit_type="high.critical"}
  • hw.power{hw.type="gpu"}
  • hw.status{hw.type="gpu", state="degraded|failed|ok"}
  • hw.status{hw.type="gpu", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • model
  • name
  • serial_number
  • vendor
memory
  • hw.memory.limit
  • hw.status{hw.type="memory", state="degraded|failed|ok"}
  • hw.status{hw.type="memory", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • model
  • name
  • serial_number
  • type
  • vendor
network
  • hw.network.up
  • hw.status{hw.type="network", state="degraded|failed|ok"}
  • hw.status{hw.type="network", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • name
  • physical_address
physical_disk
  • hw.physical_disk.size
  • hw.status{hw.type="physical_disk", state="degraded|failed|ok"}
  • hw.status{hw.type="physical_disk", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • model
  • name
  • vendor
power_supply
  • hw.power_supply.limit
  • hw.power_supply.power
  • hw.power_supply.utilization
  • hw.status{hw.type="power_supply", state="degraded|failed|ok"}
  • hw.status{hw.type="power_supply", state="present"}
  • hw.parent.id
  • hw.parent.type
  • id
  • model
  • name
  • power_supply_type
  • serial_number
  • vendor
temperature
  • hw.status{hw.type="temperature", state="degraded|failed|ok"}
  • hw.status{hw.type="temperature", state="present"}
  • hw.temperature
  • hw.temperature.limit{limit_type="high.critical"}
  • hw.temperature.limit{limit_type="high.degraded"}
  • hw.parent.id
  • hw.parent.type
  • id
  • name
voltage
  • hw.status{hw.type="voltage", state="degraded|failed|ok"}
  • hw.status{hw.type="voltage", state="present"}
  • hw.voltage
  • hw.parent.id
  • hw.parent.type
  • id
  • name
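
Several entries in the list above use a compact notation such as hw.status{hw.type="cpu", state="degraded|failed|ok"}. The sketch below shows one way to expand such an entry into one metric per attribute combination. It assumes that "|" separates alternative values of an attribute, which is an interpretation of the notation rather than something this page states. The function name expand_metric is hypothetical.

```python
import re
from itertools import product

# name, optionally followed by {key="value", ...}
ENTRY = re.compile(r'^(?P<name>[\w.]+)(?:\{(?P<attrs>.*)\})?$')
ATTR = re.compile(r'([\w.]+)="([^"]*)"')


def expand_metric(entry: str):
    """Expand one listed entry into (metric name, attributes) pairs.

    Assumption: "a|b|c" inside an attribute value means the metric is
    reported once for each of the values a, b and c.
    """
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized metric entry: {entry!r}")
    name = match.group("name")
    pairs = ATTR.findall(match.group("attrs") or "")
    keys = [key for key, _ in pairs]
    choices = [value.split("|") for _, value in pairs]
    # product() with no arguments yields a single empty combination,
    # so an entry without attributes expands to exactly one metric.
    return [(name, dict(zip(keys, combo))) for combo in product(*choices)]


for metric_name, attributes in expand_metric(
    'hw.status{hw.type="cpu", state="degraded|failed|ok"}'
):
    print(metric_name, attributes)
```

Under that assumption, the example entry expands to three hw.status series for the cpu type, one per state.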