Site Reliability Engineer

I build digital experiences that feel alive, resilient, effortless, electric, inevitable.

Kelly Navarro

Eight years keeping production alive at scale: game servers at a million concurrent users, enterprise SaaS across three clouds. I think about infrastructure the way an artisan thinks about craft.

I believe infrastructure and aesthetics aren't opposites. The best digital experiences are engineered and designed with equal care. I build systems that are fast, beautiful, and resilient.

Principles

01

I've led high-performance teams and built engineering cultures that hold up in environments where most don't

02

Eight years of production experience, from AAA games at million-CCU scale to global marketplaces and enterprise SaaS

03

I'm a curious builder, always chasing crafts outside my own

Reliability Engineering, Distributed Systems, Incident Response, Chaos Engineering, Observability, Multi-Cloud Platforms, Production at Scale

Experience

Site Reliability Engineer

Semrush

Org-wide reliability tooling and AI-driven incident response.

  • Created organization-wide integrations with incident.io and IdP tools (Okta, Workday) across engineering teams.
  • Developed AI-powered solutions to streamline post-mortem and incident response processes, reducing manual overhead in reliability workflows.
Present

SRE Manager

AccelByte

Leading SRE for global game infrastructure across multiple clouds.

  • Led a team of 5 SREs on the AccelByte Multiplayer Servers team, ensuring 99.99% uptime on global game server infrastructure capable of handling up to 1 million concurrent users.
  • Built and deployed multi-cloud billing pipelines in GCP and Azure, improving billing accuracy and reducing reconciliation time by 80%.
  • Created chaos tests and disaster recovery drills for DB outages, AWS region outages, and Kubernetes cluster node failures.
  • Led load testing, monitoring, launch planning, and incident response for major game launches including Payday 3, achieving a peak CCU of 220k at launch.
  • Saved up to $20k in infrastructure costs by migrating certificate infrastructure from HashiCorp Vault to AWS Private CA (ACM PCA).
  • Mitigated 100% of DDoS-related disruptions post-launch by engineering a scalable protection solution.
  • Set up SLO dashboards, alerts, and runbooks for key services, reducing MTTR to under 10 minutes.
2022 — 2025

Site Reliability Engineer

Opswerks

CI automation and deploy guardrails for platform teams.

  • Added automated checks for Kustomize builds and deploys using GitHub Actions, significantly reducing misconfigurations and increasing deployment reliability.
2021 — 2022

Engineering Team Lead

Freelancer.com

Platform migration to Kubernetes at marketplace scale.

  • Led the migration of microservices from EC2 + Puppet to Kubernetes using Helm and ArgoCD.
  • Improved application error visibility and accountability by integrating logging analytics and Sentry.
  • Reduced AWS costs by up to 20% by migrating the API infra from central load balancers to sidecar proxies.
2018 — 2021

Skills

  • Bash
  • Python
  • Go
  • TypeScript
  • Kubernetes
  • Docker
  • Terraform
  • Nomad
  • Consul
  • Helm
  • Flux
  • AWS
  • GCP
  • Azure
  • GitLab CI
  • Ansible
  • Puppet
  • Prometheus
  • Grafana
  • Sentry
  • Chaos Mesh
  • PagerDuty
  • OpenTelemetry
  • PostgreSQL
  • Redis
  • Kafka
  • Elasticsearch

Let's build something together: fast, amazing, incredible, bold, wild, relentless, unforgettable, legendary, unstoppable.

[email protected]