Prometheus Alertmanager

If your Prometheus server is configured to send alerts to Alertmanager, you need to configure alert rules to be notified when issues occur. To simplify this process, MetricsHub provides the following alert rules, which you can tailor to your specific needs:

Alert Rules: MetricsHub
When to Use: Always
Alerts Triggered When:
  • A host cannot be reached
  • A connector has failed
  • A protocol has failed
  • The MetricsHub Agent is not sending metrics

Alert Rules: Hardware
When to Use: When hardware monitoring is performed
Alerts Triggered When:
  • Battery charge is critically or abnormally low
  • Devices report high error rates (e.g. CPU, memory, disks, network)
  • Fan speed is too low
  • LUN has too few or no available paths
  • Network card error ratio is high
  • Physical disk endurance is low
  • Power supply usage is abnormally high
  • Temperature or voltage is out of range
  • A hardware device is missing, degraded, predicted to fail, or failing

Alert Rules: System
When to Use: When system monitoring is performed
Alerts Triggered When:
  • CPU usage, file system utilization, memory usage, or bandwidth usage is abnormally high
  • Too many network errors are detected
  • The page fault rate stays high over an extended period of time

Notes:

  • These alert rules are distinct from the internal alerts generated by MetricsHub and emitted as OpenTelemetry logs. The alert rules described on this page are managed exclusively by Prometheus Alertmanager.

  • To see alert descriptions, you must use the full Prometheus Alertmanager interface (usually available on port 9093). The simple web UI bundled with Prometheus does not display this additional alert information.

Alert Rules Thresholds

The alert rules rely on two types of thresholds:

  • Static thresholds: Used when the same threshold applies to all devices (e.g., battery charge). The alert rule compares the metric to a fixed, hardcoded value.
  • Dynamic thresholds: Used when thresholds vary across devices (e.g., temperature or fan speed). In this case, two additional metrics define the warning and critical thresholds. The alert rules compare the base metric to the corresponding threshold metrics.

Static Threshold Example

For the hw_battery_charge_ratio metric:

  • a warning alert is triggered when the battery charge is at or below 0.5 (50%)
  • a critical alert is triggered when the battery charge is below 0.3 (30%)
  • both warning and critical alerts are triggered when the value is below 0.3, since both conditions are then met.

- name: MetricsHub-Hardware-Battery-Charge
  rules:
    # Warning: charge is at or below 50%. The ">= 0" guard keeps only non-negative readings.
    - alert: MetricsHub-Hardware-Battery-Charge-Warning
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 50
      # The condition must hold for 5 minutes before the alert fires.
      for: 5m
      labels:
        severity: warning

    # Critical: charge is below 30%.
    - alert: MetricsHub-Hardware-Battery-Charge-Critical
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 < 30
      for: 5m
      labels:
        severity: critical
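
Because the static thresholds are hardcoded in the expression, tailoring a rule means editing its expr and for fields. The fragment below is only a sketch of such a change: the 60% threshold, the 10-minute delay, and the annotation wording are illustrative assumptions, not values shipped by MetricsHub. It is wrapped in the standard groups key so that it forms a complete rule file.

```yaml
groups:
  - name: MetricsHub-Hardware-Battery-Charge
    rules:
      - alert: MetricsHub-Hardware-Battery-Charge-Warning
        # Hypothetical tailoring: warn at 60% instead of 50%.
        expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 60
        # Hypothetical tailoring: wait 10 minutes instead of 5 before firing.
        for: 10m
        labels:
          severity: warning
        annotations:
          # Illustrative wording only. $value is the battery charge ratio
          # returned by the left-hand side of the expression.
          summary: Battery charge is low
          description: "Battery charge is {{ $value | humanizePercentage }}, at or below the 60% warning threshold."
```

The annotations are the kind of additional alert information that appears in the Alertmanager interface.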

Dynamic Threshold Example

For the hw_temperature_celsius metric:

  • a warning alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.degraded"}
  • a critical alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.critical"}

- name: Temperature
  rules:
    # ignoring(limit_type) matches each temperature series with its threshold
    # series, which carries an extra limit_type label.
    - alert: Temperature-High-Warning
      expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
      labels:
        severity: warning

    - alert: Temperature-High-Critical
      expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
      labels:
        severity: critical

The table below summarizes the metrics that should be compared to their corresponding dynamic threshold metrics:

Base Metric: rate(hw_errors_total[1h])
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_errors_limit{limit_type="degraded"}
  • ignoring(limit_type) hw_errors_limit{limit_type="critical"}

Base Metric: hw_fan_speed_rpm
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}
  • ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"}

Base Metric: hw_fan_speed_ratio
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.degraded"}
  • ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.critical"}

Base Metric: hw_lun_paths{type="available"}
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_lun_paths_limit{limit_type="low.degraded"}

Base Metric: hw_network_error_ratio
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_network_error_ratio_limit{limit_type="degraded"}
  • ignoring(limit_type) hw_network_error_ratio_limit{limit_type="critical"}

Base Metric: hw_other_device_uses
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_other_device_uses_limit{limit_type="degraded"}
  • ignoring(limit_type) hw_other_device_uses_limit{limit_type="critical"}

Base Metric: hw_other_device_value
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_other_device_value_limit{limit_type="degraded"}
  • ignoring(limit_type) hw_other_device_value_limit{limit_type="critical"}

Base Metric: hw_temperature_celsius
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
  • ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}

Base Metric: hw_voltage_volts
Dynamic Threshold Metrics:
  • ignoring(limit_type) hw_voltage_limit_volts{limit_type="low.critical"}
  • ignoring(limit_type) hw_voltage_limit_volts{limit_type="high.critical"}
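
The same pattern as the temperature example can be applied to any base metric listed above. The fragment below is a minimal sketch for fan speed, where the limits are low limits and the comparison is therefore reversed. The group name, the alert names, and the choice of <= (rather than <) are assumptions made for illustration. The rules shipped in metricshub-hardware-rules.yaml may be written differently.

```yaml
groups:
  # Hypothetical group and alert names, chosen for this sketch.
  - name: Fan-Speed
    rules:
      # Warning: fan speed is at or below the "low.degraded" limit.
      - alert: Fan-Speed-Low-Warning
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}
        labels:
          severity: warning

      # Critical: fan speed is at or below the "low.critical" limit.
      - alert: Fan-Speed-Low-Critical
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"}
        labels:
          severity: critical
```

A device that does not report a given limit metric has no threshold series to match, so the corresponding rule returns nothing for that device.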

Install

To activate the alert rules:

  1. Copy the required configuration files into your Prometheus installation folder:

    • config/metricshub-rules.yaml
    • config/metricshub-hardware-rules.yaml
    • config/metricshub-system-rules.yaml
  2. Declare them in the prometheus.yaml file:

    rule_files:
      - metricshub-rules.yaml
      - metricshub-hardware-rules.yaml
      - metricshub-system-rules.yaml
    
  3. Restart your Prometheus server to take the new rules into account.
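
For reference, the rule files only produce notifications if Prometheus also knows where Alertmanager is running. The fragment below sketches how the two settings might sit together in prometheus.yaml. The localhost:9093 target is an assumption based on the usual Alertmanager port and must be adapted to your environment.

```yaml
# Where Prometheus sends firing alerts (assumed Alertmanager address).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rule files copied from the MetricsHub config directory.
rule_files:
  - metricshub-rules.yaml
  - metricshub-hardware-rules.yaml
  - metricshub-system-rules.yaml
```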
