Job Description
Cloud / SRE Engineer
Location
Hyderabad, India
Employment Type
Full-Time
Experience Level
5-9 Years
Position Overview
We are seeking a Cloud / SRE Engineer to design, operate, and continuously improve the reliability of a multi-cloud platform supporting AI-enabled applications across AWS and GCP.
This is a true build & run role, with end-to-end ownership of production systems, including infrastructure, reliability, deployments, and operational health.
This role is embedded within the engineering team and owns CI/CD pipelines and deployment practices, collaborating closely with developers to enable safe, scalable, and efficient delivery. The engineer helps define how services are deployed into production while ensuring systems meet enterprise standards for reliability, security, and operability.
Key Responsibilities
Multi-Cloud Infrastructure Engineering
- Design and operate infrastructure across:
- AWS (UI/application hosting and delivery)
- GCP (AI processing, backend services, and data workloads)
- Implement scalable, fault-tolerant, and secure cloud architectures
- Manage networking, IAM, security controls, and environment configurations
Reliability Engineering (SRE)
- Own system availability, reliability, and performance
- Define and maintain:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Error budgets
- Design systems for resiliency, redundancy, and recovery
- Continuously improve system stability based on production insights
Observability & Monitoring
- Design and implement:
- Metrics
- Logging
- Distributed tracing
- Build actionable alerting systems to detect and prevent production issues
- Create dashboards and visibility into system health and performance
- Ensure systems are instrumented for operability and supportability
Incident Management & Operational Support
- Participate in on-call rotations and incident response
- Perform root cause analysis (RCA) and drive resolution of systemic issues
- Develop and maintain runbooks and operational procedures
- Reduce operational toil through automation and improved design
CI/CD, Deployment Standards & Infrastructure Automation
- Design, build, and own CI/CD pipelines for:
- Application services
- APIs
- AI/agent workloads
- Define and enforce deployment standards, patterns, and best practices across the platform
- Ensure deployments are:
- Repeatable
- Secure
- Observable
- Aligned to enterprise governance standards
- Implement and maintain Infrastructure as Code (IaC) using Terraform or CloudFormation
- Automate provisioning, environment configuration, and deployment workflows
- Continuously improve pipelines to increase:
- Reliability
- Deployment safety
- Speed of delivery
Performance & Cost Optimization
- Optimize systems for:
- Performance
- Scalability
- Cost efficiency
- Monitor and manage cloud resource usage and consumption
Security & Governance
- Implement and enforce cloud security best practices:
- IAM and access control
- Secrets management
- Policy enforcement
- Ensure compliance with enterprise security standards
Collaboration & Engineering Enablement
- Partner closely with AI Platform Engineers to:
- Ensure systems are operable, observable, and scalable
- Enable consistent, production-ready deployment patterns
- Collaborate with development teams on:
- Pipeline design
- Deployment strategies
- Release processes
- Provide guidance and guardrails, enabling teams to deliver efficiently (not acting as a blocker, but as a platform enabler)
- Influence system design to improve:
- Reliability
- Operability
- Ease of deployment
Must Have
- 5+ years of cloud engineering, SRE, or infrastructure engineering experience
- Hands-on experience with both AWS and GCP (required)
- Experience supporting production systems, including:
- On-call rotations
- Incident response
- Operational ownership
- Strong experience with:
- Observability (metrics, logging, alerting)
- Monitoring system design and implementation
- Incident investigation and troubleshooting
- Experience with Infrastructure as Code:
- Terraform (preferred) or CloudFormation
- Experience designing and owning CI/CD pipelines and deployment processes, not just using them
- Experience defining or enforcing:
- Deployment standards
- Release practices
- Production readiness requirements
- Experience building or supporting CI/CD pipelines for cloud-native systems
- Strong understanding of:
- Cloud networking
- IAM and security best practices
- Distributed system behavior
Nice to Have
- Experience supporting AI/ML or data processing workloads
- Experience with container platforms
- Experience designing or supporting multi-region / highly available architectures
- Experience working with:
- Event-driven systems
- Streaming or data pipeline architectures
- Cloud certifications (AWS and/or GCP)
Accountability
- Own platform reliability and operational excellence across AWS and GCP
- Define and operate deployment pipelines and standards for the organization
- Ensure systems are production-ready, observable, and resilient
- Enable engineering teams to deliver safely, consistently, and at scale
Please share your resume to su*******************y@th********d.com
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: IT & Information Security
Role Category: IT Infrastructure Services
Role: Infrastructure Architect
Employment Type: Full-Time
Contact Details:
Company: The Hartford
Location(s): Hyderabad
Keyskills:
Reliability Engineering
AWS Cloud
IT Infrastructure Management
Terraform
GCP
IaC
Incident Management
Observability
Kubernetes