SRE Lead

  • Permanent
  • Full time
  • Remote

Job Description: Site Reliability Engineer (SRE)

Role Overview

We are seeking a Site Reliability Engineer (SRE) to work across multiple Agile Value Streams within a large insurance enterprise. The role is responsible for the reliability, scalability, and security of infrastructure, DevOps processes, and applications. The SRE will collaborate with development, infrastructure, and security teams to implement best practices in automation, monitoring, and incident management, and will contribute to DORA metrics and continuous improvement initiatives.

Key Responsibilities

1. Infrastructure & Cloud Reliability

  • Manage and optimize cloud and on-prem infrastructure to ensure high availability and performance.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC) tools like Terraform, ARM templates, or CloudFormation.
  • Implement best practices for Kubernetes (AKS, EKS, GKE), containerization (Docker), and serverless architectures.
  • Ensure capacity planning, performance tuning, and cost optimization for infrastructure services.

2. DevOps & Automation

  • Develop and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, or Jenkins.
  • Automate software deployments and infrastructure updates using scripting languages like PowerShell, Bash, or Python.
  • Manage and improve release management processes to enable smooth software delivery across Agile Value Streams.

3. Application Reliability & Performance Monitoring

  • Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for business-critical applications.
  • Implement and track DORA (DevOps Research and Assessment) metrics to measure software delivery and operational performance: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore (MTTR).
  • Use monitoring and APM (Application Performance Monitoring) tools such as New Relic, Datadog, Prometheus, or Azure Monitor for real-time observability.
  • Optimize system reliability by performing root cause analysis (RCA) and post-mortems for incidents.

4. Security & Compliance

  • Work closely with security teams to implement DevSecOps practices and ensure compliance with industry regulations.
  • Implement zero-trust security models, IAM best practices, and vulnerability scanning tools.
  • Ensure compliance with ISO 27001, SOC 2, and GDPR requirements.
  • Integrate security tools such as SonarQube and Aqua Security into CI/CD pipelines.

5. Incident & Problem Management

  • Develop automated runbooks for faster incident resolution and self-healing infrastructure.
  • Act as a technical escalation point for critical incidents, coordinating between development and infrastructure teams.
  • Establish and refine on-call rotation processes to ensure high system uptime and quick recovery.

6. Collaboration & Agile Ways of Working

  • Work closely with Scrum Masters, Product Owners, and Engineering Leads across Agile Value Streams to improve system reliability.
  • Conduct technical coaching and knowledge-sharing sessions for teams on SRE best practices.

Required Skills & Qualifications

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
  • Strong knowledge of Azure, AWS, or GCP with expertise in cloud-native infrastructure.
  • Hands-on experience with Kubernetes (AKS/EKS/GKE), Helm, and containerized deployments.
  • Proficiency in Terraform, Ansible, or other IaC tools for infrastructure automation.
  • Deep understanding of DORA metrics and the ability to implement monitoring solutions.
  • Experience with CI/CD pipeline design, release automation, and deployment strategies.
  • Strong background in networking, DNS, load balancing, and firewall configurations.
  • Knowledge of SRE principles such as error budgets, observability, and fault tolerance.
  • Hands-on experience with logging and monitoring tools (ELK stack, Prometheus, Grafana, Splunk, etc.).
  • Strong scripting ability in PowerShell, Bash, or Python.
  • Experience working in Agile and SAFe (Scaled Agile Framework) environments.
  • Familiarity with ITIL practices for incident, problem, and change management.

Preferred Skills

  • Experience with serverless computing, microservices architecture, and API Gateway solutions.
  • Certifications such as Azure Solutions Architect, AWS Certified DevOps Engineer, or Kubernetes CKA/CKAD.