Site Reliability Engineer

I build digital experiences that feel alive, resilient, effortless, electric, inevitable

Eight years keeping production alive at scale: game servers at a million concurrent users, enterprise SaaS across three clouds. I think about infrastructure the way an artisan thinks about craft.

I believe infrastructure and aesthetics aren't opposites. The best digital experiences are engineered and designed with equal care. I build systems that are fast, beautiful, and resilient.

Principles

I've led high-performance teams and built engineering cultures that survive the environments most don't

Years of production experience under my belt, from AAA games at million-CCU scale to global marketplaces and enterprise SaaS

I'm a curious builder, always chasing crafts outside my own

Reliability Engineering, Distributed Systems, Incident Response, Chaos Engineering, Observability, Multi-Cloud Platforms, Production at Scale

Experience

Site Reliability Engineer

Semrush

Org-wide reliability tooling and AI-driven incident response.

Created organization-wide integrations with incident.io and IdP tools (Okta, Workday) across engineering teams.
Developed AI-powered solutions to streamline post-mortem and incident response processes, reducing manual overhead in reliability workflows.

Present

SRE Manager

AccelByte

Leading SRE for global game infrastructure across multiple clouds.

Led a team of 5 SREs on the AccelByte Multiplayer Servers team, ensuring 99.99% uptime on global game server infrastructure capable of handling up to 1 million concurrent users.
Built and deployed multi-cloud billing pipelines in GCP and Azure, improving billing accuracy and reducing reconciliation time by 80%.
Created chaos tests and disaster recovery drills for DB outages, AWS region outages, and Kubernetes cluster node failures.
Led load testing, monitoring, launch planning, and incident response for major game launches including Payday 3, achieving a peak CCU of 220k at launch.
Saved up to $20k in infrastructure costs by migrating certificate infrastructure from HashiCorp Vault to AWS Private CA (ACM PCA).
Mitigated 100% of DDoS-related disruptions post-launch by engineering a scalable protection solution.
Set up SLO dashboards, alerts, and runbooks for key services, reducing MTTR to under 10 minutes.

2022 — 2025

Site Reliability Engineer

Opswerks

CI automation and deploy guardrails for platform teams.

Added automated checks for Kustomize builds and deploys using GitHub Actions, significantly reducing misconfigurations and increasing deployment reliability.

2021 — 2022

Engineering Team Lead

Freelancer.com

Platform migration to Kubernetes at marketplace scale.

Led the migration of microservices from EC2 + Puppet to Kubernetes using Helm and ArgoCD.
Improved application error visibility and accountability by integrating logging analytics and Sentry.
Reduced AWS costs by up to 20% by migrating the API infrastructure from central load balancers to sidecar proxies.

2018 — 2021

Skills

Bash
Python
Go
TypeScript
Kubernetes
Docker
Terraform
Nomad
Consul
Helm
Flux
AWS
GCP
Azure
GitLab CI
Ansible
Puppet
Prometheus
Grafana
Sentry
Chaos Mesh
PagerDuty
OpenTelemetry
PostgreSQL
Redis
Kafka
Elasticsearch

Let's build something fast, amazing, incredible, bold, wild, relentless, unforgettable, legendary, unstoppable, together

gabrnavarro@gmail.com