Job requirements and application form

Senior DevOps Engineer Team Lead
(552803)
Job description
We are seeking a Senior DevOps Engineer Team Lead to lead a small, hands-on DevOps team operating our Azure and Hetzner estates for both Predict⁺ and EI. You will own reliability, security (ISO27001), and cost efficiency; drive incident response and an on-call rotation; and stay deeply hands-on across Kubernetes, IaC, CI/CD, networking, and monitoring.
Key responsibilities
• Team leadership: roadmap ownership, mentoring, SLAs/SLOs, and postmortems.
• Kubernetes (self-hosted & AKS): upgrades, scaling, backup/restore/DR, and security hardening.
• Azure & colo integration: subscriptions, identity, networking, cost; connectivity with on‑prem colo (VPN/storage).
• IaC & automation: Terraform or Pulumi (plus Ansible) to standardize infra.
• CI/CD at scale: Jenkins / Azure DevOps / GitLab (build, tests, security scans, rollouts/rollbacks).
• Monitoring & incident response: Prometheus, Grafana, Zabbix; actionable alerts, SLO dashboards, runbooks, on‑call.
• Security & compliance: ISO27001 controls, secrets management, least privilege, image scanning, logging/audit.
Job requirements
Must‑have qualifications (Team Lead, 7+ years)
• 7+ years in DevOps/SRE with recent team‑lead responsibilities.
• Self‑hosted Kubernetes and Azure AKS (hands‑on).
• Terraform or Pulumi (Ansible for configuration management).
• CI/CD pipelines: Jenkins and Azure DevOps / GitLab.
• IT background with on‑prem colocation (servers/storage/networking, VPNs).
• Managing Azure environments (subscriptions, identity, networking, security, cost).
• Networking: TCP/IP, DNS, VPNs, load balancers, firewalls; strong troubleshooting.
• Monitoring with Prometheus, Grafana, Zabbix.
• Strong Linux fundamentals and scripting (Bash/Python).
Nice to have / Advantages
• Experience supporting globally distributed, customer‑facing production systems.
• TimescaleDB & ClickHouse operations (backup/restore, performance, retention).
• Keycloak/OIDC/SAML; SIEM/ELK/Wazuh; SAST/DAST.
• Container registries (Harbor/ACR) and SBOM practices.
• Cost optimization, multi‑cloud exposure, GPU workloads for AI/ML.