Engineering
Senior Site Reliability Engineer

We are looking for an experienced Senior Site Reliability Engineer to join our Engineering team. You will work closely with the team to build software systems and automation for the operational side of our platform. You will also be responsible for monitoring our systems and building alerts for the operational issues they can experience.
What you will do:
Reliability & Incident Management
- Define and enforce SLOs, SLIs, and error budgets for trading, payment, KYC, and notification services
- Own the on-call rotation structure, runbooks, and postmortem culture
- Lead incident response for P0/P1 issues; coordinate across backend, mobile, and compliance teams
- Reduce MTTR through better alerting in Coralogix, auto-remediation, and progressive delivery
- Drive blameless postmortems and ensure action items are tracked to closure in Linear
Infrastructure & Platform
- Own GKE clusters (dev, staging, production) and ArgoCD GitOps pipelines
- Harden the Caddy API gateway: routing, rate limiting, TLS termination, canary rollouts via weighted_round_robin
- Drive migration off legacy infrastructure: Docker Swarm retirement and Consul-based service discovery cleanup
- Lead capacity planning, GCP cost optimization, and multi-AZ / disaster recovery strategy
- Own IaC for all infrastructure (Terraform / Helm / Kustomize)
Observability & Developer Experience
- Expand Coralogix coverage across logs, traces, alerts, and dashboards; enforce structured logging standards
- Improve deploy workflows
- Maintain Telepresence setups, test clusters, and developer self-service tooling
- Close the feedback loop between alerts, Linear tickets, and engineering fixes
Security & Compliance
- Own the infrastructure for secrets management, key rotation, and production access controls (the platforms and workflows, with policy set by Security)
- Harden the CI/CD supply chain and container runtime posture — image provenance, signing, base image hygiene, build isolation
Data & Streaming
- Own Kafka topology hygiene; plan and execute staging/production isolation
- Ensure Debezium CDC pipeline reliability and lag monitoring
- Partner with backend on SQL migration safety (Bytebase, gh-ost) for online schema changes
- Define and enforce database operational standards (backups, replication, failover drills)
Mentorship & Culture
- Level up backend, mobile, and frontend engineers on operational thinking
- Establish and run quarterly DR / game-day exercises
- Contribute to engineering documentation, design reviews, and architectural decisions
What we are looking for:
Required Qualifications
- 7+ years of SRE / DevOps / Platform Engineering experience, with 3+ years leading reliability for production systems at scale
- Proven track record owning production systems in a financial services, fintech, or high-availability environment
- Experience leading P0/P1 incidents as incident commander and driving systemic improvements from postmortems
Required Technical Skills
- Kubernetes (expert) — GKE preferred; multi-cluster, workload identity, network policies, HPA/VPA, node pool design
- GitOps — deep experience with ArgoCD (or Flux) and Helm / Kustomize
- Kafka operations — brokers, consumer groups, partition rebalancing, lag monitoring; Redpanda or Confluent a plus
- Cloud platforms — GCP preferred (Artifact Registry, Cloud SQL, VPC, IAM, Cloud Logging); AWS / Azure transferable
- Observability — hands-on with Coralogix, Datadog, Grafana, or equivalent; SLO engineering and alert quality, not just dashboards
- Infrastructure as Code — Terraform or Pulumi; strong YAML / Helm templating
- Programming & scripting — Go, Python, or Bash to production quality
- CI/CD — GitHub Actions, container build pipelines, artifact promotion flows
- Networking — TCP/IP, DNS, TLS, load balancers, reverse proxies (Caddy, Envoy, nginx)
- Databases — PostgreSQL operations (replication, failover, connection pooling), Redis, online schema migrations (gh-ost, Bytebase, pt-online-schema-change)
Nice-to-Have Skills
- Track record using AI coding assistants to accelerate platform work — automating ops tasks, generating IaC, triaging incidents
- Crypto exchange / trading / low-latency systems experience (order flow, market data, wallet/custodian risk)
- Regulated environment experience (OJK, MAS, SEC, or similar financial regulators)
- Service mesh (Istio, Linkerd); eBPF tooling (Cilium, Pixie)
- Chaos engineering (Litmus, Chaos Mesh)
- FinOps / cloud cost optimization track record
- Legacy infrastructure retirement (Docker Swarm → K8s, monolith → microservices)
- Caddy, Consul, or similar service discovery