MetricsHub
MetricsHub Enterprise 2.1.00
Prometheus Alertmanager
If your Prometheus server is configured to send alerts to Alertmanager, you need to configure alert rules to be notified when issues occur. To simplify this process, MetricsHub provides the following alert rules, which you can tailor to your specific needs:
Alert Rules | When to Use | Alerts Triggered When |
---|---|---|
MetricsHub | Always | |
Hardware | When hardware monitoring is performed | |
System | When system monitoring is performed | |
Notes:
- These alert rules are distinct from the internal alerts generated by MetricsHub and emitted as OpenTelemetry logs. The alert rules described on this page are managed exclusively by Prometheus Alertmanager.
- To see alert descriptions, you must use the full Prometheus Alertmanager interface (usually available on port 9093). The simple web UI bundled with Prometheus does not display this additional alert information.
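For reference, the Prometheus side of this setup can be sketched as follows. This is a minimal fragment of the Prometheus configuration file, assuming Alertmanager runs on the same host and listens on its usual port 9093; the `localhost:9093` target is an illustrative assumption and should be replaced with the address of your own Alertmanager instance.

```yaml
# Sketch only: tells Prometheus where to send the alerts fired by its rules.
# The target below is an assumption (Alertmanager on the same host, port 9093).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```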
Alert Rules Thresholds
The alert rules rely on two types of thresholds:
- Static thresholds: Used when the same threshold applies to all devices (e.g., battery charge). The alert rule compares the metric to a fixed, hardcoded value.
- Dynamic thresholds: Used when thresholds vary across devices (e.g., temperature or fan speed). In this case, two additional metrics define the warning and critical thresholds. The alert rules compare the base metric to the corresponding threshold metrics.
Static Threshold Example
For the hw_battery_charge_ratio metric:
- a warning alert is triggered when the battery charge is at or below 0.5 (50%)
- a critical alert is triggered when the battery charge is below 0.3 (30%)
- both warning and critical alerts are triggered when the value is below 0.3, since both of the above conditions are met.
    - name: MetricsHub-Hardware-Battery-Charge
      rules:
        - alert: MetricsHub-Hardware-Battery-Charge-Warning
          expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 50
          for: 5m
          labels:
            severity: warning
        - alert: MetricsHub-Hardware-Battery-Charge-Critical
          expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 < 30
          for: 5m
          labels:
            severity: critical
Dynamic Threshold Example
For the hw_temperature_celsius metric:
- a warning alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.degraded"}
- a critical alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.critical"}
    - name: Temperature
      rules:
        - alert: Temperature-High-Warning
          expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
          labels:
            severity: warning
        - alert: Temperature-High-Critical
          expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
          labels:
            severity: critical
The table below summarizes the metrics that should be compared to their corresponding dynamic threshold metrics:
Base Metric | Dynamic Threshold Metrics |
---|---|
rate(hw_errors_total[1h]) | ignoring(limit_type) hw_errors_limit{limit_type="degraded"}; ignoring(limit_type) hw_errors_limit{limit_type="critical"} |
hw_fan_speed_rpm | ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}; ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"} |
hw_fan_speed_ratio | ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.degraded"}; ignoring(limit_type) hw_fan_speed_ratio_limit{limit_type="low.critical"} |
hw_lun_paths{type="available"} | ignoring(limit_type) hw_lun_paths_limit{limit_type="low.degraded"} |
hw_network_error_ratio | ignoring(limit_type) hw_network_error_ratio_limit{limit_type="degraded"}; ignoring(limit_type) hw_network_error_ratio_limit{limit_type="critical"} |
hw_other_device_uses | ignoring(limit_type) hw_other_device_uses_limit{limit_type="degraded"}; ignoring(limit_type) hw_other_device_uses_limit{limit_type="critical"} |
hw_other_device_value | ignoring(limit_type) hw_other_device_value_limit{limit_type="degraded"}; ignoring(limit_type) hw_other_device_value_limit{limit_type="critical"} |
hw_temperature_celsius | ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}; ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"} |
hw_voltage_volts | ignoring(limit_type) hw_voltage_limit_volts{limit_type="low.critical"}; ignoring(limit_type) hw_voltage_limit_volts{limit_type="high.critical"} |
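As an illustration of how a row of this table could translate into rules, here is a sketch for hw_fan_speed_rpm, modeled on the temperature example. The group name, alert names, and `for: 5m` duration are hypothetical, and the `<=` comparison is an assumption: for a "low" limit, the alert should presumably fire when the measured value falls to or below the threshold, the opposite direction of the "high" temperature limits. The rules actually shipped with MetricsHub may differ.

```yaml
groups:
  - name: Fan-Speed                     # hypothetical group name
    rules:
      - alert: Fan-Speed-Low-Warning    # hypothetical alert name
        # Assumption: a "low" limit is breached when the speed is at or below it.
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.degraded"}
        for: 5m                         # assumed duration, mirroring the battery example
        labels:
          severity: warning
      - alert: Fan-Speed-Low-Critical   # hypothetical alert name
        expr: hw_fan_speed_rpm <= ignoring(limit_type) hw_fan_speed_limit_rpm{limit_type="low.critical"}
        for: 5m
        labels:
          severity: critical
```

The `ignoring(limit_type)` modifier is what allows the comparison to match: the threshold series carries a limit_type label that the base metric lacks, so that label must be excluded when pairing the two series.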
Install
To activate the alert rules:
- Copy the required configuration files into your Prometheus installation folder:

      config/metricshub-rules.yaml
      config/metricshub-hardware-rules.yaml
      config/metricshub-system-rules.yaml

- Declare them in the prometheus.yaml file:

      rule_files:
        - metricshub-rules.yaml
        - metricshub-hardware-rules.yaml
        - metricshub-system-rules.yaml

- Restart your Prometheus server to take the new rules into account.