Integrated Dell Remote Access Controller 9 User's Guide

GPU (Accelerators) Management

Dell PowerEdge servers are shipped with Graphics Processing Units (GPUs). GPU management enables you to view the GPUs connected to the system and to monitor their power, temperature, and thermal information.

The following are the GPU properties and the license details:

Table 1. GPU properties and the license details

GPU Properties                          License
----------------------------------------------------
Inventory
Board Part Number                       All Licenses
OEM Info                                All Licenses
Serial Number                           All Licenses
Marketing Name                          All Licenses
GPU Part Number                         All Licenses
Build Date                              All Licenses
Firmware Version                        All Licenses
GPU GUID                                All Licenses
PCI vendorid                            All Licenses
PCI deviceid                            All Licenses
PCI Subvendorid                         All Licenses
PCI Subdeviceid                         All Licenses
GPU Status                              All Licenses
GPU Health                              All Licenses

Thermal Metrics
Primary GPU temperature                 All Licenses
Secondary GPU temperature               All Licenses
Board temperature                       All Licenses
Memory temperature                      All Licenses
Minimum GPU HW Slowdown Temperature     Enterprise
GPU Shutdown Temperature                Enterprise
Maximum Memory Operating temperature    Enterprise
Maximum GPU Operating Temperature       Enterprise
Thermal Alert State                     Enterprise
Power Brake State                       Enterprise

Power Metrics
Power Consumption                       All Licenses
Power Supply Status                     Enterprise
Board Power Supply Status               Enterprise
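
The license column in Table 1 can be captured as a small lookup for scripts that need to know which properties require an Enterprise license. The following Python sketch is illustrative only: the names ENTERPRISE_ONLY and required_license are hypothetical, and the property strings simply mirror the table rows.

```python
# Properties that Table 1 lists with an Enterprise license.
# Every other property in the table is listed under "All Licenses".
ENTERPRISE_ONLY = {
    "Minimum GPU HW Slowdown Temperature",
    "GPU Shutdown Temperature",
    "Maximum Memory Operating temperature",
    "Maximum GPU Operating Temperature",
    "Thermal Alert State",
    "Power Brake State",
    "Power Supply Status",
    "Board Power Supply Status",
}


def required_license(gpu_property: str) -> str:
    """Return the license column value for a property named in Table 1.

    Assumption: the caller passes a property that appears in the table.
    Names that are not in ENTERPRISE_ONLY fall through to "All Licenses".
    """
    # Compare case-insensitively because the table capitalizes inconsistently.
    wanted = gpu_property.strip().lower()
    if wanted in {name.lower() for name in ENTERPRISE_ONLY}:
        return "Enterprise"
    return "All Licenses"


print(required_license("Power Brake State"))
print(required_license("Serial Number"))
```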
NOTE:
  • GPU properties are not listed for Embedded GPU cards, and the Status is marked as Unknown.
  • The operating temperature may be different for AMD-based systems.
  • The number of GPU entries per PCIe slot that is displayed in the host may differ from that in the iDRAC.
  • When a manual AC power cycle is required after performing any component or bundled firmware updates for GPUs or Power Distribution Board (PDB) CPLDs, the SUP0545 event is displayed in the Lifecycle (LC) logs. After this event, perform a manual AC or virtual AC power cycle to avoid any unexpected behavior in the server.
  • After a GPU firmware update that includes component firmware updates or bundled firmware updates, perform an AC power cycle or virtual AC power cycle to complete the update. Doing so avoids any unexpected behavior in iDRAC related to GPUs.
  • In Persistent Mode, during warm reboot, the GPU Power capping limit values may not be accurate.
  • The GPU Power Capping feature is not available in non-A2 GPU configurations.

The GPU must be in the ready state before the command can fetch the data. The GPUStatus field in Inventory shows whether the GPU is available and whether the GPU device is responding. If the GPU is ready, GPUStatus shows OK; otherwise, it shows Unavailable.
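
A script that consumes the inventory might check GPUStatus before reading any other field. The following is a minimal Python sketch, assuming that each inventory record has already been parsed into a dict. The record layout, the Slot key, and the function name ready_gpus are hypothetical; only the GPUStatus field and its OK and Unavailable values come from the description above.

```python
def ready_gpus(inventory):
    """Return only the records whose GPUStatus reports the GPU as ready."""
    return [gpu for gpu in inventory if gpu.get("GPUStatus") == "OK"]


# Hypothetical sample records; "Slot" is an illustrative key, not an iDRAC field.
sample = [
    {"Slot": "1", "GPUStatus": "OK"},
    {"Slot": "2", "GPUStatus": "Unavailable"},
]

for gpu in ready_gpus(sample):
    print("GPU in slot", gpu["Slot"], "is ready")
```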

The GPU offers multiple health parameters that can be retrieved through the SMBPB interface of the NVIDIA controllers. This feature is limited to NVIDIA cards. The following are the health parameters that are retrieved from the GPU device:

  • Power
  • Temperature
  • Thermal
NOTE: This feature is limited to NVIDIA cards. This information is not available for any other GPU that the server may support. The polling interval for the GPU cards over the PBI is 5 seconds.
NOTE: While updating the GPU firmware, avoid any USB or USB-NIC operations (such as connecting to USB management port, iDRAC quick sync operation, enabling or disabling USB-NIC port, or similar USB operations) in iDRAC. Any such operation during the firmware update can result in nondeterministic behavior and may also result in a firmware update failure.

After a warm reboot with Persistent mode disabled, the following behavior may be observed:

  • Power Consumption shows as N/A.
  • Power Cap limit shows the older inventory limit values.

The host system must have the NVIDIA driver installed and running for the Power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, Max memory operating temperature, and Max GPU operating temperature features to be available. These values are shown as N/A if the GPU driver is not installed.

In Linux, when the card is unused, the driver down-trains the card and unloads to save power. In such cases, the power consumption, GPU target temperature, Min GPU slowdown temperature, GPU shutdown temperature, max memory operating temperature, and max GPU operating temperature features are not available. Enable Persistent mode for the device to prevent the driver from unloading. You can do this with the nvidia-smi tool by running the command nvidia-smi -pm 1.
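
For hosts that are configured by script, the nvidia-smi -pm 1 command can be wrapped in a small helper. The following Python sketch is an illustration, not a Dell or NVIDIA tool: the function names are hypothetical, and it assumes that nvidia-smi is on the PATH and that the script runs with sufficient privileges (changing the persistence setting typically requires root).

```python
import subprocess


def persistence_mode_command(enable=True):
    """Build the nvidia-smi command that turns persistence mode on or off."""
    return ["nvidia-smi", "-pm", "1" if enable else "0"]


def set_persistence_mode(enable=True):
    """Run the command on the host and return the completed process.

    check=True raises subprocess.CalledProcessError if nvidia-smi exits
    with an error, for example when the driver is not loaded.
    """
    return subprocess.run(
        persistence_mode_command(enable),
        check=True,
        capture_output=True,
        text=True,
    )


# Show the command that would be run, without executing it.
print(" ".join(persistence_mode_command()))
```

Calling set_persistence_mode() on a host with the NVIDIA driver installed runs the same command as the one given above.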

You can generate GPU reports using Telemetry. For more information about telemetry features, see Telemetry Streaming.

NOTE: In RACADM, you may see dummy GPU entries with empty values. This can happen if the device is not ready to respond when iDRAC queries the GPU device for information. Perform an iDRAC racreset operation to resolve this issue.
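
A script that post-processes RACADM output could flag such dummy entries before deciding whether a reset is needed. The following Python sketch assumes the output has already been parsed into a dict that maps an entry name to its field values; that structure, the example keys, and the function name find_dummy_entries are all hypothetical.

```python
def find_dummy_entries(entries):
    """Return the names of GPU entries whose values are all empty."""
    dummy = []
    for name, fields in entries.items():
        # An entry counts as a dummy when every value is None or blank.
        if all(value is None or not str(value).strip() for value in fields.values()):
            dummy.append(name)
    return dummy


# Hypothetical parsed entries; the key names are illustrative only.
parsed = {
    "GPU-1": {"SerialNumber": "ABC123", "FirmwareVersion": "1.0"},
    "GPU-2": {"SerialNumber": "", "FirmwareVersion": " "},
}

print(find_dummy_entries(parsed))
```

If the returned list is not empty, the operator can run racadm racreset as described in the note and query the inventory again.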

Monitoring Processing Accelerators

Accelerator devices with the PCIe class Processing Accelerators need real-time temperature and sensor monitoring because they generate significant heat when in use.

Perform the following steps to get Processing Accelerators' inventory information:

  1. Power off the server.
  2. Install the accelerators on the riser card.
  3. Power on the server.
  4. Wait until POST is complete.
  5. Log in to the iDRAC UI.
  6. Go to System > Overview > Accelerators. You can see both GPU and Processing Accelerators sections.
  7. Expand the specific accelerator to see the following sensor information:
    • Power consumption
    • Temperature details
NOTE: Logical temperature sensors are not displayed in the iDRAC interfaces. Only physical temperature sensors are displayed.
NOTE: You must have iDRAC login privilege to access the accelerators' information.
NOTE: Power consumption sensors are available only for the supported accelerators and only with the Datacenter license.
NOTE: iDRAC interfaces may not display information from power and thermal sensors that depend on the host operating system (OS). In this case, install the GPU drivers (ROCm package) in the host operating system.
NOTE:
  • It is recommended to perform the A100 GPU CEC firmware update before updating the Accelerators firmware.
  • Do not perform the GPU CEC and Accelerators firmware updates simultaneously, because doing so can cause the updates to fail. If a firmware update fails, perform an AC or virtual AC power cycle. Doing so prevents a previous update failure from causing further update failures.
  • The firmware update for the HGX A100 8-GPU Baseboard FPGA may take between 60 and 90 minutes to complete.
  • HGX A100 8-GPU Baseboard FPGA and CEC DUP updates must not be triggered simultaneously. It is recommended to follow these steps:
    1. Update the CEC firmware.
    2. Perform a virtual AC or a manual AC cycle.
    3. Update the FPGA firmware.
    4. Perform another virtual AC or manual AC cycle.
  • To update the PDB CPLD from the operating system, start a cold reboot. After the update, a virtual AC cycle is performed.
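
The recommended CEC and FPGA sequence above can be expressed as a simple check for automation that plans firmware updates. The following Python sketch only validates a list of planned step names; it does not perform any update, and the step constants and the check_plan function are hypothetical names chosen for illustration.

```python
UPDATE_CEC = "update CEC firmware"
UPDATE_FPGA = "update FPGA firmware"
AC_CYCLE = "AC power cycle"  # virtual or manual

# The order recommended in the note above.
RECOMMENDED_ORDER = [UPDATE_CEC, AC_CYCLE, UPDATE_FPGA, AC_CYCLE]


def check_plan(steps):
    """Return a list of problems found in a planned sequence of steps."""
    problems = []
    for i, step in enumerate(steps):
        if step in (UPDATE_CEC, UPDATE_FPGA):
            next_step = steps[i + 1] if i + 1 < len(steps) else None
            if next_step != AC_CYCLE:
                problems.append(f"'{step}' is not followed by an AC power cycle")
    if UPDATE_CEC in steps and UPDATE_FPGA in steps:
        if steps.index(UPDATE_FPGA) < steps.index(UPDATE_CEC):
            problems.append("FPGA update is planned before the CEC update")
    return problems


print(check_plan(RECOMMENDED_ORDER))
```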
NOTE: Occasionally, the accelerators send 0 values for power consumption. PLDM therefore also reports 0 values, which are displayed in the UI. However, the values are automatically corrected in subsequent readings.
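
Clients that collect power readings over time may want to treat such 0 values as transient. The following Python sketch shows one possible client-side approach, which is an assumption and not iDRAC behavior: a 0 reading is replaced with the most recent non-zero reading, if one exists. The function name smooth_zero_readings is hypothetical.

```python
def smooth_zero_readings(samples):
    """Replace 0 readings with the most recent non-zero reading, if any."""
    cleaned = []
    last_good = None
    for value in samples:
        if value == 0 and last_good is not None:
            # Treat the 0 as transient and carry the last good value forward.
            cleaned.append(last_good)
        else:
            cleaned.append(value)
            if value != 0:
                last_good = value
    return cleaned


print(smooth_zero_readings([250, 0, 260]))
```

A leading 0 is kept as-is because there is no earlier reading to carry forward.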
NOTE: PCIe devices depend on their device drivers and firmware to respond to iDRAC requests. These devices log the LC message HWC9053 (communication lost with the device) when the required drivers and firmware are not loaded or when the server is in a pre-OS environment (UEFI shell and Lifecycle Controller page).