
How MetricsHub® Helped a Major Telco Achieve Unified Observability at Scale in Just 5 Months

Managing a vast bare-metal infrastructure of over 6,000 servers from three different vendors across tens of data centers is no small feat, especially when each vendor has different tools and notification processes. For this leading telecommunications provider, fragmented management put operations at risk, and the resulting downtime disrupted critical services and hurt customer satisfaction. They needed a reliable and scalable solution to “see what [was] going on with all [their] servers at a click of a button and have automated tickets with solid alerting.”

About the Telco Giant

This leading telecommunications company operates over 6,000 servers across multiple data centers in North America. For 25 years, their IT team faced significant challenges in monitoring their infrastructure to detect failures, troubleshoot issues, and ensure seamless service delivery to millions of customers. With new observability concepts emerging, the company needed to rethink their monitoring setup to meet modern demands: adopting open standards for interoperability, leveraging OpenTelemetry for unified data collection, and implementing vendor-agnostic tools for greater flexibility.

The Preexisting Landscape: Fragmented Tools, Delayed Resolutions

The IT infrastructure spanned servers from three major vendors—Dell, HPE, and Super Micro—each managed through separate vendor-specific tools. When issues arose, admins received basic email notifications sent to a Slack channel.

Figure 1 - Fragmented monitoring where vendor-specific tools manage servers separately and send Slack notifications, leading to delayed issue resolution.

These notifications lacked critical details, forcing admins to manually log into management consoles or search for information in the CMDB—significantly delaying resolution times. Additionally, vendor tools lacked essential integration for ticketing or dashboarding, leading to operational silos. This fragmented approach resulted in services being down for a total of seven days annually, putting customer satisfaction and SLAs at risk.

Figure 2 - A fragmented approach put customer satisfaction and SLAs at risk

Evaluating Their Options: MetricsHub®, Zabbix, Telegraf, and In-house Solution

To address these challenges, the company looked for a monitoring solution that covered four key areas: hardware monitoring, data organization, integration, and alerting. They shortlisted four candidates: MetricsHub®, Zabbix, Telegraf, and an in-house tool. Each candidate was rated against every critical requirement as supported out of the box, requiring custom development, or not supported at all. The result was the following feature evaluation matrix:

Rating legend: Out of the box, Requires development, Not supported

Solutions compared: MetricsHub®, Zabbix, Telegraf by InfluxDB, In-house Solution

Features evaluated:
Single pane of glass
Vendor agnostic
Commercial support / Consultancy
Automated provisioning from NetBox data through CI/CD pipeline
Data center floor map
Power consumption for each server
Redfish and SNMP Polling for server metrics
Alerting: 3rd party integration with PagerDuty, Slack, and Email
Automatic tickets: 3rd party integrations for creating and closing tickets with JIRA
Trends and metrics: History in a time-series based database
Alerting to set contact groups
Dashboards for hardware health, power, and cooling metrics
Multi-tenant for front-end access
Provides alerting for all base hardware metrics:
  • Power supply, health status and real-time power consumption
  • Fan status and health
  • Hardware, cabinet and room temperature
  • Disk and controller status
  • NIC status
  • Memory and CPU status
  • Firmware version
  • Etc.
Ability to organize assets in a web front-end by region > data center > room > cabinet hierarchy

The comparison between the four shortlisted solutions revealed the following insights:

  • Zabbix: A strong candidate for general monitoring, but hardware monitoring capabilities are limited, particularly in supporting protocols like Redfish or providing detailed energy usage reports.
  • Telegraf by InfluxDB: Excellent for alerting, but lacks critical features such as hardware monitoring, energy usage reports, and robust data organization.
  • The In-House Solution: This option offered very limited integrations, with no support for floor heatmaps or adequate hardware monitoring, and it would have required significant development effort to meet their specific needs.
  • MetricsHub®: Provided all required features except for multi-tenancy, which would require architectural adjustments. Its seamless integration with an observability platform and comprehensive hardware monitoring made it the clear choice.

Why MetricsHub® Won

MetricsHub® emerged as the only solution meeting all core requirements, including comprehensive hardware monitoring, seamless integration, and robust alerting.

Figure 3 - MetricsHub® fulfils all Telco infrastructure monitoring requirements

Its single-agent design and integration capabilities streamlined monitoring across more than 6,000 servers, covering the core requirements that the other candidates could not.

Figure 4 - One MetricsHub® agent collects from hundreds of systems, pushes to many backends

5 Months to Success: A Phased Implementation

The implementation of MetricsHub® was divided into three strategic phases to ensure success: proof of concept, automatic provisioning, and full deployment.

Phase 1: Proof of Concept

The proof-of-concept phase was critical for identifying system requirements, detecting potential defects, and improving performance. During this phase, MetricsHub’s consultants deployed MetricsHub®, Prometheus, and Grafana on a virtual machine to monitor more than 250 servers. When polling issues arose with a single instance of MetricsHub® managing over 200 servers, the team quickly addressed these challenges, refining the system for scalability and reliability.

Phase 2: Automatic Provisioning

With thousands of systems to configure, manual setup was not feasible. The customer’s team leveraged their existing infrastructure management tool, NetBox, to automate the process. By developing a custom script to extract data from NetBox, they automated the configuration of MetricsHub® instances. This approach streamlined the distribution and deployment process, significantly reducing manual effort and potential errors.
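The customer's actual script is not published, so the following is only a minimal sketch of the idea, under stated assumptions. It takes device records shaped loosely like inventory entries exported from NetBox and groups them by site into a monitoring configuration. The field names (name, address, vendor, site), the sample hosts, and the output layout are all hypothetical; they are not the real NetBox schema or the MetricsHub® configuration format. Redfish and SNMP are listed because the evaluation matrix names them as the polling protocols.

```python
import json
from collections import defaultdict

# Hypothetical inventory records. The field names are simplified
# assumptions, not the actual NetBox API schema.
SAMPLE_DEVICES = [
    {"name": "srv-001", "address": "10.0.1.11", "vendor": "Dell", "site": "dc-east"},
    {"name": "srv-002", "address": "10.0.1.12", "vendor": "HPE", "site": "dc-east"},
    {"name": "srv-101", "address": "10.1.1.11", "vendor": "Super Micro", "site": "dc-west"},
    {"name": "srv-102", "address": None, "vendor": "Dell", "site": "dc-west"},
]


def build_monitoring_config(devices):
    """Group devices by site and emit one entry per host that has an address.

    The returned layout is illustrative only; it is not the real
    MetricsHub configuration schema.
    """
    groups = defaultdict(dict)
    skipped = []
    for device in devices:
        if not device.get("address"):
            # Without a management address the host cannot be polled,
            # so report it instead of generating a broken entry.
            skipped.append(device["name"])
            continue
        groups[device["site"]][device["name"]] = {
            "host": device["address"],
            "vendor": device["vendor"],
            "protocols": ["redfish", "snmp"],
        }
    return {"groups": dict(groups), "skipped": skipped}


if __name__ == "__main__":
    config = build_monitoring_config(SAMPLE_DEVICES)
    print(json.dumps(config, indent=2, sort_keys=True))
```

Running a generator like this from a CI/CD pipeline means the monitoring configuration follows the inventory automatically, and hosts with incomplete records are reported rather than silently dropped.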

Phase 3: Full Deployment

In the final phase, the monitoring workload was optimized by distributing it across 70 instances of MetricsHub®, each deployed using Docker, Kubernetes, and OpenShift. With each instance capable of managing over 500 monitored systems, the architecture struck an ideal balance between performance and scalability.
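To put these figures in perspective, here is a rough sketch of how a host list could be split evenly across instances. It assumes exactly 6,000 hosts (the case study says "over 6,000") and a simple contiguous split. How the customer actually assigned servers to instances is not described, and the host names are hypothetical.

```python
def distribute(hosts, instance_count):
    """Split hosts into instance_count contiguous batches of near-equal size."""
    base, extra = divmod(len(hosts), instance_count)
    batches = []
    start = 0
    for index in range(instance_count):
        # The first `extra` batches take one additional host each.
        size = base + (1 if index < extra else 0)
        batches.append(hosts[start:start + size])
        start += size
    return batches


# Assumption: exactly 6,000 hosts with made-up names.
hosts = [f"srv-{n:04d}" for n in range(6000)]
batches = distribute(hosts, 70)
sizes = [len(batch) for batch in batches]
print(min(sizes), max(sizes))  # prints: 85 86
```

At roughly 86 hosts per instance, each instance runs far below the 500-system capacity quoted above, which leaves considerable headroom for growth or for the loss of an instance.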

The revamped architecture integrates modern solutions that seamlessly communicate with one another. MetricsHub collects metrics from the Dell, HPE, and Super Micro servers, while Prometheus handles metric storage and real-time alerting. Grafana provides intuitive dashboards for telemetry visualization, and BMC Helix leverages AIOps for advanced analysis. Together, these tools enable a streamlined, 1-click observability experience.

Figure 5 - Telco monitoring infrastructure including MetricsHub®, Prometheus, Grafana, and BMC Helix.

Results: Transforming Operations and Boosting Productivity

As a result of this transformation, the customer’s IT team saw a 70% improvement in productivity, driven by enhanced operational efficiency. Proactive monitoring significantly reduced annual downtime, ensuring compliance with SLAs and boosting customer satisfaction. Additionally, the unified visibility and automation provided by the new architecture minimized manual interventions, enabling the team to shift their focus toward more strategic initiatives.

Conclusion

In just five months, this telecom provider transformed their fragmented monitoring approach into a modern, scalable, and unified solution with MetricsHub. By replacing vendor-specific tools with a centralized observability framework, they eliminated inefficiencies, reduced operational risks, and improved service continuity. Now, real-time telemetry data collection and proactive alerting provide full visibility across their 6,000+ servers, enabling their IT team to focus on strategic operations instead of troubleshooting monitoring issues.

Ready to streamline your infrastructure monitoring like this telco? Contact us today to discover how MetricsHub® can transform your operations.
