Implement and manage observability solutions using Prometheus, Grafana, Loki, Tempo, and OpenTelemetry.
Create and maintain dashboards and alerts to monitor application health, performance, and availability.
Collaborate with application teams to onboard services into monitoring systems, instrumenting them for metrics, logs, and traces.
Resolve production monitoring issues, troubleshoot incidents, and help ensure system reliability.
Provide monitoring support for Kubernetes-based applications and cloud/microservices environments.
Correlate metrics, logs, and traces using trace IDs, labels, metadata, and timestamps for effective debugging.
Document monitoring standards, write runbooks, and help teams adopt observability tools independently.
Required Skills & Qualifications
7+ years of experience in monitoring, observability, or reliability engineering.
Strong hands-on expertise with Prometheus, Grafana, Loki, Tempo, and OpenTelemetry.
Proven experience in creating dashboards, setting alerts, and supporting application monitoring.
Experience in production issue resolution using observability data (metrics, logs, traces).
Solid understanding of Kubernetes monitoring and cloud-native environments.
Ability to collaborate with multiple teams and follow standardized onboarding processes.
Excellent problem-solving, debugging, and communication skills.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Project & Program Management
Role Category: Technology / IT
Role: Technology / IT - Other
Employment Type: Full time