A story about why using managed services in the Cloud is turbo nice when you're a small team
Production went down. Load balancer health checks failing. All instances marked unhealthy. SSH'd into an instance: - App: running - CPU: 5% - Memory: 40% - Disk: 15% - Network: fine Manually hit health endpoint: curl localhost/health {"status": "ok"} Worked perfectly. Checked load balancer logs: - Health check URL: /health - Response: timeout - Instance marked: unhealthy The issue: - Health endpoint responded in 100ms locally - Load balancer timeout: 2 seconds - Should be plenty of time Then I noticed: Health check ran every 5 seconds. App logged every health check. To a file. That file grew to 47 GB. Every health check: 1. Opened 47 GB log file 2. Appended 1 line 3. Closed file 4. Took 3 seconds due to file size 5. Timed out Fix: Disabled health check logging. Response time: back to 100ms.

Oct 25, 2025 · 6:39 PM UTC

1