DevOps Engineer | AWS, Terraform, Kubernetes | Join my DevOps newsletter brankopetric.com/newsletter

Serbia
Joined August 2019
Breakdown of the AWS outage in simple words:
1. Sunday night, a DNS problem hit AWS: the DynamoDB endpoint's DNS records were lost.
2. This meant services couldn't find DynamoDB (a database that stores tons of data).
3. AWS fixed the DNS issue in about 3 hours.
4. But then EC2 (the system that creates virtual servers) broke, because it needs DynamoDB to work.
5. Then the system that checks whether network load balancers are healthy also failed.
6. This crashed Lambda, CloudWatch, SQS, and 75+ other services - everything that needed network connectivity.
7. This created a chain reaction: servers couldn't talk to each other, new servers couldn't start, everything got stuck.
8. AWS had to intentionally slow down EC2 launches and Lambda functions to prevent total collapse.
9. Recovery took 15+ hours as they fixed each broken service while clearing massive backlogs of stuck requests.
This outage impacted: Snapchat, Roblox, Fortnite, McDonald's app, Ring doorbells, banks, and 1,000+ more websites.
This all happened in one AWS region (us-east-1).
This is why multi-region architecture isn't optional anymore.
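To make that last point concrete, here's a rough sketch of one multi-region pattern: reads fall back to a replica region when the primary region's endpoint is unreachable. This is only an illustration, not AWS's recommended failover design; the table name "orders", the region pair, and the assumption of a Global Table replica are all hypothetical.

```python
# Hedged sketch: read from us-east-1, fall back to a Global Table replica in eu-west-1.
# Table name "orders" and the region pair are illustrative assumptions.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "eu-west-1"]  # primary first, replica second

def get_order(order_id: str) -> dict:
    last_error = None
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table("orders")
        try:
            return table.get_item(Key={"order_id": order_id}).get("Item", {})
        except (EndpointConnectionError, ClientError) as exc:
            last_error = exc  # endpoint unreachable or erroring, try the next region
    raise last_error

print(get_order("1234"))
```

The fallback only helps if the data is already replicated; client-side retries across regions are the easy half of multi-region, replication is the hard half.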
Understand ArgoCD in 60 seconds:
ArgoCD implements GitOps for Kubernetes deployments.
The problem it solves:
- kubectl apply from CI/CD is fragile
- No source of truth for what's running
- Manual syncing between Git and cluster
- Can't audit who deployed what
Key concepts:
- GitOps: Git is source of truth for cluster state
- Continuous sync: ArgoCD keeps cluster matching Git
- Rollback: Revert to any previous commit
When to use it:
- Running Kubernetes
- Want declarative deployments
- Need visibility into what's deployed
- Multiple environments to manage
When to skip it:
- Not using Kubernetes
- CI/CD pipeline works fine
- Team isn't ready for GitOps
Bottom line: ArgoCD makes your Git repo the control plane for Kubernetes. Your cluster becomes a reflection of your repo.
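If "continuous sync" feels abstract, here's a toy sketch of the idea: desired state comes from Git, live state comes from the cluster, and any drift triggers a sync. This is a conceptual illustration only, not ArgoCD's actual code; the state functions and resource names are placeholders.

```python
# Conceptual sketch of a GitOps reconcile loop (not ArgoCD's real implementation).
import time

def desired_state_from_git(repo_url: str, path: str) -> dict:
    """Placeholder: render manifests from a Git repo into resource -> spec."""
    return {"deployment/web": {"image": "web:v42", "replicas": 3}}

def live_state_from_cluster() -> dict:
    """Placeholder: read what is actually running in the cluster."""
    return {"deployment/web": {"image": "web:v41", "replicas": 3}}

def sync(resource: str, desired: dict) -> None:
    print(f"syncing {resource} -> {desired}")  # the "kubectl apply" step

while True:
    desired = desired_state_from_git("https://example.com/repo.git", "envs/prod")
    live = live_state_from_cluster()
    for resource, spec in desired.items():
        if live.get(resource) != spec:  # drift detected between Git and cluster
            sync(resource, spec)        # cluster converges back to what Git says
    time.sleep(180)                     # refresh periodically
```

Roll back by reverting the commit: the loop sees the old spec in Git and converges the cluster to it.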
We ran performance tests for weeks and everything looked great. Production launch day hit, and we had 5x the latency.
The difference? Garbage Collection.
Root cause:
- Our test load was flat; production load was spiky
- Spiky load made the JVM aggressively collect garbage, leading to long pauses
- The problem wasn't CPU or network; it was the runtime environment under non-linear load
Lesson: Performance testing without realistic, chaotic, production-like load patterns is worse than useless. It's misleading.
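One cheap way to get closer to reality is to drive a bursty arrival schedule instead of a flat one. A minimal sketch, assuming a hypothetical local endpoint; the burst shape is illustrative and a real load tool would fire these requests concurrently rather than serially.

```python
# Hedged sketch: a spiky request schedule instead of a flat one.
# TARGET, the base/spike rates, and the 10% burst chance are illustrative.
import random
import time
import urllib.request

TARGET = "http://localhost:8080/health"  # hypothetical endpoint

def spiky_rps(base: int = 50, spike: int = 500, spike_chance: float = 0.1) -> int:
    """Most seconds run at base load; occasionally burst to spike load."""
    return spike if random.random() < spike_chance else base

for _ in range(60):  # one minute of traffic
    rps = spiky_rps()
    start = time.monotonic()
    for _ in range(rps):  # toy serial client; real tools send these concurrently
        try:
            urllib.request.urlopen(TARGET, timeout=1).read()
        except Exception:
            pass  # errors are part of the signal in a load test
    time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))  # crude 1-second pacing
```

The point isn't this script; it's that the arrival pattern, not just the total rate, is what shakes out GC pauses.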
The cost of maintaining a poorly-documented internal tool always exceeds the one-time cost of buying or building a slightly better-documented external alternative. Internal tooling debt is real.
Every time you log into a production machine to make a manual change, you are creating a hidden, undocumented dependency. The system is no longer what your Infrastructure as Code says it is.
Understand Nginx in 60 seconds:
Nginx is a web server, reverse proxy, and load balancer.
The problem it solves:
- Need to serve static files fast
- Load balance traffic across multiple servers
- SSL termination without burdening app servers
- Caching at the edge
Key concepts:
- Reverse proxy: Sits in front of app servers
- Upstream: Backend servers Nginx forwards to
- Event-driven: Handles thousands of connections efficiently
When to use it:
- Serving static assets
- Load balancing to app servers
- SSL termination
- Rate limiting and caching layer
Nginx is boring, fast, and reliable.
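To make "reverse proxy" and "upstream" concrete, here's a toy sketch in Python - emphatically not how Nginx is implemented - that accepts requests on one port and round-robins them to a pool of hypothetical backends.

```python
# Toy reverse proxy: clients hit :8080, requests are forwarded to "upstream" backends.
# Backend addresses are hypothetical; GET-only to keep the sketch short.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAMS = itertools.cycle(["http://127.0.0.1:8081", "http://127.0.0.1:8082"])

class Proxy(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(UPSTREAMS)  # round-robin selection across the upstream pool
        with urllib.request.urlopen(backend + self.path) as resp:
            body = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# The proxy is the only thing clients talk to; backends stay private behind it.
HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

Nginx does this job with an event loop, connection reuse, caching, and TLS termination, which is why you use it instead of writing your own.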
At 1,000 requests per second, our PostgreSQL query planner was brilliant. At 10,000 rps, it started choosing wrong indexes randomly.
We didn't change the queries. We didn't change the schema.
What changed? Statistics.
- Query planner uses table statistics to choose indexes
- Statistics update frequency couldn't keep up with write volume
- Stale stats led to terrible query plans
- Increased autoanalyze frequency
- P95 latency dropped from 800ms back to 60ms
At scale, the meta-systems matter as much as the systems. Your database statistics are infrastructure too.
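If you want to check whether your stats are going stale, here's a rough sketch. The catalog views and the per-table setting are standard PostgreSQL; the DSN, the table name "orders", and the 1% threshold are illustrative assumptions.

```python
# Hedged sketch: spot tables whose statistics are stale, then make autoanalyze
# more aggressive for a hot table. "dbname=app" and "orders" are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # Rows modified since the last analyze is a decent staleness signal.
    cur.execute("""
        SELECT relname, n_mod_since_analyze, last_autoanalyze
        FROM pg_stat_user_tables
        ORDER BY n_mod_since_analyze DESC
        LIMIT 10
    """)
    for name, modified, last_run in cur.fetchall():
        print(name, modified, last_run)

    # Re-analyze the hot table after 1% of it changes instead of the default 10%.
    cur.execute("ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01)")
```

Watch the planner's estimated vs actual row counts in EXPLAIN ANALYZE; a big gap usually means the stats, not the query, are the problem.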
Hypothesis: Adding read replicas will reduce our database load.
Ran the experiment:
- Added 3 read replicas
- Split reads across them
- Monitored for a week
Results:
- Primary CPU: Still at 80%
- Replica CPU: 15%
Why? 95% of our load was writes. Read replicas don't help write-heavy workloads.
We needed connection pooling and query optimization instead. Fixed those. Primary CPU dropped to 30%. Didn't need the replicas.
Test your assumptions with actual metrics. Your mental model of where the bottleneck is might be completely wrong.
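The check is cheap to run before the experiment. A rough sketch of estimating the read/write mix from standard pg_stat_database counters; the DSN is a placeholder and the tuple counters are only a proxy for query mix, but they're enough to catch a 95%-writes workload.

```python
# Hedged sketch: estimate the read/write mix before buying read replicas.
# "dbname=app" is an illustrative DSN; counters are cumulative since last stats reset.
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT tup_returned + tup_fetched AS reads,
               tup_inserted + tup_updated + tup_deleted AS writes
        FROM pg_stat_database
        WHERE datname = current_database()
    """)
    reads, writes = cur.fetchone()
    total = reads + writes
    print(f"reads: {reads / total:.0%}, writes: {writes / total:.0%}")
    # A write-heavy ratio means replicas won't move primary CPU much.
```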
We chose event-driven architecture because it's 'loosely coupled and scalable.'
What we traded without realizing:
- Debugging became an archeological expedition through message queues
- Lost the ability to grep for 'where is this function called'
- Added eventual consistency everywhere
- Every new feature required understanding 6 different event handlers
- Onboarding time went from 2 weeks to 2 months
The architecture was technically sound. But it exceeded our team's cognitive capacity.
Loose coupling in code means tight coupling in your mental model. Choose the coupling you can afford.
We had a 'quick fix' for flapping health checks in our ECS cluster. Just increase the timeout from 5s to 30s. Ship it.
- Health checks stopped flapping
- Alerts went silent
- Team celebrated
- Three weeks later: 6-minute outage because dying containers stayed in rotation for 30 seconds
We masked the symptom instead of fixing why containers were slow to respond.
The real issue? Database connection pooling was exhausted. Fixed that, reverted the timeout.
Quick fixes aren't fixes. They're IOUs to future you.
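One way to keep the symptom visible instead of masking it: let the health endpoint actually touch the thing that's failing. A minimal sketch, assuming a psycopg2 pool and an illustrative port/DSN; this is an illustration of the idea, not our actual service.

```python
# Hedged sketch: a health endpoint that fails fast when the DB pool is exhausted,
# so the load balancer pulls the container instead of waiting out a long timeout.
from http.server import BaseHTTPRequestHandler, HTTPServer
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(minconn=1, maxconn=20, dsn="dbname=app")  # placeholder DSN

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            conn = pool.getconn()   # raises immediately when the pool is exhausted
            pool.putconn(conn)
            self.send_response(200)
        except Exception:
            self.send_response(503)  # tell the load balancer the truth
        self.end_headers()

HTTPServer(("0.0.0.0", 8081), Health).serve_forever()
```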
We had an outage that lasted 2 hours. The fix took 5 minutes. The rest of the time? We were looking in the wrong place.
Post-mortem revealed our actual problem:
- We had alerts for symptoms, not causes
- We knew the API was down
- We didn't know why
- So we investigated everything except the actual root cause
Spent the next week redesigning our alerting strategy.
- Added structured logging with correlation IDs
- Built dashboards that showed dependencies
- Created runbooks with decision trees
Next incident: 6 minutes to resolution.
The best post-mortem fix isn't preventing the failure. It's preventing the confusion.
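"Structured logging with correlation IDs" is a small amount of code. A minimal sketch using only the standard library; field names and the handler function are illustrative, not our production setup.

```python
# Hedged sketch: JSON logs that carry a per-request correlation ID, so every log
# line from one request can be grepped together during an incident.
import contextvars
import json
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request():
    correlation_id.set(str(uuid.uuid4()))  # one ID per request, attached to every line
    logging.info("payment accepted")

handle_request()
```

Pass the same ID downstream in a request header and the "which service actually failed" question becomes a search, not an investigation.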
We were paying $15K/month for an RDS Postgres instance. We had 5 read replicas.
Checked the metrics.
- Replica 1: 80% CPU
- Replicas 2-5: 2% CPU
Why? We added them during a scale event 2 weeks ago and never removed them.
Deleted 4 replicas. Nothing broke. Saved $9K/month.
The lesson isn't about over-provisioning. Everyone knows about that.
Infrastructure accumulates. Set reminders to audit what you're actually using.
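The audit itself is scriptable. A rough boto3 sketch that lists RDS read replicas with their average CPU over the last week so idle ones stand out; the region and the 7-day window are illustrative choices.

```python
# Hedged sketch: list RDS read replicas and their average CPU over the last 7 days.
from datetime import datetime, timedelta, timezone
import boto3

rds = boto3.client("rds", region_name="us-east-1")
cw = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

for db in rds.describe_db_instances()["DBInstances"]:
    if not db.get("ReadReplicaSourceDBInstanceIdentifier"):
        continue  # only read replicas have a source instance
    name = db["DBInstanceIdentifier"]
    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=3600,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    avg = sum(p["Average"] for p in points) / len(points) if points else 0.0
    print(f"{name}: {avg:.1f}% avg CPU over 7 days")
```

Run it on a schedule and the "we added it during a scale event and forgot" class of spend shows up on its own.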
Our API handled 500 requests per second with zero issues. Marketing ran a campaign. Traffic hit 5,000 rps.
What broke wasn't what we expected:
- Application servers? Fine.
- Database? Fine.
- Load balancer? Fine.
- Our rate limiter crashed because we stored rate limit counters in Redis with no memory limits.
Redis ran out of memory. Rate limiter failed open. Actual traffic hit the API unthrottled. Then everything crashed.
We optimized for scale but not for the failure modes of our safety mechanisms.
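The "failed open" part is a one-line decision. A minimal sketch of a counter-based limiter that rejects requests when Redis is unhealthy instead of silently letting everything through; the key naming, limit, and the fail-closed choice itself are illustrative (failing closed trades availability for protecting the backend).

```python
# Hedged sketch: fixed-window rate limiter that fails closed on Redis errors.
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection
LIMIT_PER_MINUTE = 100

def allow(client_id: str) -> bool:
    key = f"ratelimit:{client_id}"
    try:
        count = r.incr(key)          # one counter per client per window
        if count == 1:
            r.expire(key, 60)        # window resets after 60 seconds
        return count <= LIMIT_PER_MINUTE
    except redis.exceptions.RedisError:
        return False                 # fail closed: protect the backend, not the counter

print(allow("user-42"))
```

And give the counter store a maxmemory limit with an eviction policy, so it degrades before it dies.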
Using Kubernetes because Google uses Kubernetes is like using a forklift because Amazon uses forklifts. They have a warehouse. You have a garage.
Your monitoring should tell you what's broken before your users do. If customers report outages faster than your alerts, you're not monitoring. You're just collecting data.
We had 100% test coverage and still shipped a critical bug to production.
How?
- Our tests mocked every external dependency
- The bug was in how we integrated with a third-party API
- The mocks returned what we thought the API returned
- The actual API had changed its response format
- All tests passed because they tested our assumptions, not reality
The bug caused a 1-hour outage.
Added contract tests. Integration tests against staging APIs. Reduced unit test coverage to 70%.
100% test coverage measures quantity, not quality. Tests that mock reality don't test anything useful.
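A contract test can be as small as this sketch: call the real (staging) API and assert on the response shape your integration code depends on. The URL, fields, and allowed statuses here are hypothetical placeholders, not a real provider's contract.

```python
# Hedged sketch of a contract test against a staging API (all names illustrative).
import json
import urllib.request

STAGING_URL = "https://staging.api.example.com/v1/payments/123"  # hypothetical

def test_payment_contract():
    with urllib.request.urlopen(STAGING_URL, timeout=5) as resp:
        payload = json.loads(resp.read())
    # These assertions encode the shape our integration code assumes.
    assert isinstance(payload["id"], str)
    assert payload["status"] in {"pending", "settled", "failed"}
    assert isinstance(payload["amount_cents"], int)

test_payment_contract()
```

When the provider changes the format, this test fails in CI instead of the mocks lying to you in production.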
Every 'temporary' solution in production is permanent until it breaks. Then it becomes an emergency. There's no such thing as temporary infrastructure.
Stop calling everything 'tech debt.' Tech debt is intentional shortcuts with known future cost. What you have is probably just bad code.
Microservices gave us independent deployments. They also gave us 47 ways for the system to partially fail.
Before microservices:
- System was up or down
- Debugging was linear
- Outages were obvious
After microservices:
- Payments work but emails don't
- Search works but recommendations don't
- Everything is "degraded"
- Users report problems we can't reproduce
We traded simplicity for flexibility. Most days, I'm not sure we got the better end of that deal.
The question I wish I'd asked: Are we prepared to debug distributed systems, or just excited about deploying them independently?
We scaled our infrastructure to handle 10x traffic. Then realized our monitoring couldn't scale with it.
- 100 servers: Datadog cost $500/month
- 1,000 servers: Datadog quoted $18,000/month
- Our monitoring budget was $2,000/month
We had to choose:
- Scale with less visibility
- Don't scale and stay observable
- Build custom monitoring
We chose less visibility. Biggest mistake.
Two months later we had a 6-hour outage we couldn't debug because we'd disabled half our metrics to save money.
Observability isn't optional. If you can't afford to monitor it, you can't afford to run it.
Most technical debates are actually risk preference debates in disguise.
"Microservices vs monolith" > "Do you prefer deployment risk or operational complexity?"
"Strong typing vs dynamic typing" > "Do you prefer runtime surprises or development friction?"
"Cloud vs on-prem" > "Do you prefer cost unpredictability or capital investment?"
There are no right answers. Just different acceptable risks.
Stop debating which is "better" and start discussing which risks you can handle.