Role Overview
We are looking for a highly skilled Site Reliability Engineer (SRE) to build, operate, and continuously improve highly available, scalable, and observable platforms running on bare-metal Kubernetes clusters and Google Kubernetes Engine (GKE), monitored with Datadog.
The ideal candidate brings deep Kubernetes expertise, strong cloud-native experience on GCP, and a passion for reliability, automation, and operational excellence. This role works closely with application, platform, and architecture teams to ensure production systems are resilient, secure, and performant at scale.
Experience range: 4 to 12 years
Locations: Chennai, Hyderabad, Noida and Gurgaon only
Mode of work: Work from office only
Key Responsibilities
Design, operate, and support Kubernetes platforms across bare-metal clusters and GKE
Ensure high availability, scalability, performance, and reliability of production systems
Implement and manage GitOps-based deployment workflows using tools like Argo CD
Build, maintain, and optimize CI/CD pipelines using tools such as GitHub Actions, Harness, CircleCI, or equivalent
Deploy and manage applications using Helm, including canary and progressive delivery strategies
Hands-on experience with cloud-based monitoring and observability tools such as Datadog
Implement comprehensive observability using Prometheus, Grafana, Loki, and Tempo
Proactively monitor systems, troubleshoot incidents, and perform root cause analysis (RCA)
Partner with development teams to improve service reliability, scalability, and operational maturity
Provision and manage cloud infrastructure on Google Cloud Platform (GCP)
Automate infrastructure and platform operations using Infrastructure as Code (IaC) and scripting
Drive continuous improvements in resilience, automation, and operational efficiency
Required Skills & Qualifications
Strong hands-on experience with Kubernetes architecture and administration
Experience managing both bare-metal Kubernetes clusters and Google Kubernetes Engine (GKE)
Solid understanding of Google Cloud Platform (GCP) services and networking concepts
Proven experience with GitOps practices and tools such as Argo CD
Proficiency with CI/CD tools (GitHub Actions, Harness, CircleCI, or similar)
Practical experience with:
Strong expertise in observability and monitoring:
Understanding of modern API technologies such as GraphQL
Familiarity with API management platforms (Apigee Edge, Apigee X)
Knowledge of CDN and edge services (e.g., Akamai)
Good to Have
Working knowledge of Java (Spring Boot) and/or Node.js frameworks
Understanding of microservices architecture and service-to-service communication
Experience with Ansible or similar configuration management tools
Exposure to hybrid or multi-cloud environments
Experience in performance tuning and cost optimization on GCP
Understanding of Kubernetes and cloud security best practices
SRE experience aligned with SLIs, SLOs, and error budgets

Keyskills: Kubernetes Cluster Springboot Java Site Reliability Engineering Datadog Gcp Cloud Platform Development Ansible Prometheus Helm Grafana
Hucon Solutions India Pvt. Ltd. Hucon Solutions is an integrated HR service provider for corporates all over India. We are backed by a good ERP and ample experience in HR and related activities. It has helped generate career opportunities for more than a million individuals in India. Hucon Solu...