SRE Lead

Permanent
Full time
Remote

Job Description: Site Reliability Engineer (SRE)

Role Overview

We are seeking a Site Reliability Engineer (SRE) to work across multiple Agile Value Streams within a large insurance enterprise. This role will be responsible for ensuring the reliability, scalability, and security of infrastructure, DevOps processes, and applications. The SRE will collaborate with development, infrastructure, and security teams to implement best practices in automation, monitoring, and incident management, while also contributing towards DORA metrics and continuous improvement initiatives.

Key Responsibilities

1. Infrastructure & Cloud Reliability

Manage and optimize cloud and on-prem infrastructure to ensure high availability and performance.
Automate infrastructure provisioning using Infrastructure as Code (IaC) tools like Terraform, ARM templates, or CloudFormation.
Implement best practices for Kubernetes (AKS, EKS, GKE), containerization (Docker), and serverless architectures.
Ensure capacity planning, performance tuning, and cost optimization for infrastructure services.

2. DevOps & Automation

Develop and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, or Jenkins.
Automate software deployments and infrastructure updates using scripting languages like PowerShell, Bash, or Python.
Manage and improve release management processes to enable smooth software delivery across Agile Value Streams.

3. Application Reliability & Performance Monitoring

Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for business-critical applications.
Implement DORA (DevOps Research and Assessment) metrics to measure software delivery and operational performance (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Recovery (MTTR)).
Utilize APM (Application Performance Monitoring) and observability tools like New Relic, Datadog, Prometheus, or Azure Monitor for real-time observability.
Optimize system reliability by performing root cause analysis (RCA) and post-mortems for incidents.

4. Security & Compliance

Work closely with security teams to implement DevSecOps practices and ensure compliance with industry regulations.
Implement zero-trust security models, IAM best practices, and vulnerability scanning tools.
Ensure compliance with ISO 27001, SOC 2, and GDPR requirements.
Integrate security tools such as SonarQube and Aqua Security into CI/CD pipelines.

5. Incident & Problem Management

Develop automated runbooks for faster incident resolution and self-healing infrastructure.
Act as a technical escalation point for critical incidents, coordinating between development and infrastructure teams.
Establish and refine on-call rotation processes to ensure high system uptime and quick recovery.

6. Collaboration & Agile Ways of Working

Work closely with Scrum Masters, Product Owners, and Engineering Leads across Agile Value Streams to improve system reliability.
Conduct technical coaching and knowledge-sharing sessions for teams on SRE best practices.

Required Skills & Qualifications

5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
Strong knowledge of Azure, AWS, or GCP with expertise in cloud-native infrastructure.
Hands-on experience with Kubernetes (AKS/EKS/GKE), Helm, and containerized deployments.
Proficiency in Terraform, Ansible, or other IaC tools for infrastructure automation.
Deep understanding of DORA metrics and ability to implement monitoring solutions.
Experience with CI/CD pipeline design, release automation, and deployment strategies.
Strong background in networking, DNS, load balancing, and firewall configurations.
Knowledge of SRE principles (error budgets, observability, and fault tolerance mechanisms).
Hands-on experience with logging and monitoring tools (ELK stack, Prometheus, Grafana, Splunk, etc.).
Strong scripting ability in PowerShell, Bash, or Python.
Experience working in Agile and SAFe (Scaled Agile Framework) environments.
Familiarity with ITIL practices for incident, problem, and change management.

Preferred Skills

Experience with serverless computing, microservices architecture, and API Gateway solutions.
Certifications such as Azure Solutions Architect, AWS Certified DevOps Engineer, or Kubernetes CKA/CKAD.