The Outage
At 3 AM, our load balancer started returning 502s. The application servers were healthy. The database was fine.
The Culprit
We had migrated to a new ingress provider but forgot to lower the TTL (Time To Live) on our DNS records before the switch.
Some ISPs' resolvers kept serving the cached old IP address for up to 24 hours, so their users were still being sent to the old address after the switch.
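To make the caching behaviour concrete, here is a toy model of a caching resolver. It is only a sketch, not a real DNS implementation: the CachingResolver class, the app.example.com name, and both IP addresses are hypothetical, and it assumes the old record carried a 24-hour (86400 second) TTL.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    ip: str
    ttl_seconds: int
    cached_at: float  # time the answer was cached, in seconds


class CachingResolver:
    """Toy stand-in for an ISP resolver (hypothetical, for illustration only)."""

    def __init__(self, authoritative):
        # authoritative: dict mapping name -> (ip, ttl_seconds)
        self.authoritative = authoritative
        self.cache = {}

    def resolve(self, name, now):
        entry = self.cache.get(name)
        # Serve from cache until the TTL that came with the answer runs out.
        if entry is not None and now - entry.cached_at < entry.ttl_seconds:
            return entry.ip
        # Cache miss or expired entry: ask the authoritative source again.
        ip, ttl = self.authoritative[name]
        self.cache[name] = CachedRecord(ip, ttl, now)
        return ip


# Assumed scenario: the old record has a 24-hour TTL and is cached at t=0.
zone = {"app.example.com": ("203.0.113.10", 86400)}
resolver = CachingResolver(zone)
resolver.resolve("app.example.com", now=0)

# The migration changes the record, but the resolver never notices
# until its cached copy expires.
zone["app.example.com"] = ("198.51.100.20", 60)
one_hour_later = resolver.resolve("app.example.com", now=3600)    # still the old IP
one_day_later = resolver.resolve("app.example.com", now=86400)    # now the new IP
```

The point of the sketch is that changing the record at the source does nothing for a resolver that already holds a copy. The TTL it cached with is the one that counts.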
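Lesson learned: Always lower your TTL to 60 seconds at least 48 hours before a major migration. The lead time has to be longer than the old TTL, so that every cached copy of the long-lived record has expired before you switch.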
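The timing rule can be checked with a little date arithmetic. The sketch below uses hypothetical helper names and made-up dates, and assumes the old record had a 24-hour TTL. It computes the earliest moment at which no resolver should still hold the long-lived record, and how much margin a planned cutover leaves.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """A resolver may have cached the old record just before the TTL was
    lowered, so wait at least one full old TTL after that point."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def safety_margin(ttl_lowered_at: datetime, old_ttl_seconds: int,
                  planned_cutover: datetime) -> timedelta:
    """Positive means the cutover falls after the old caches have expired."""
    return planned_cutover - earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds)


# Hypothetical dates: TTL lowered 48 hours before the planned switch.
lowered = datetime(2024, 1, 1, 3, 0)
cutover = lowered + timedelta(hours=48)
margin = safety_margin(lowered, old_ttl_seconds=86400, planned_cutover=cutover)
print(margin)  # 1 day, 0:00:00
```

With a 24-hour old TTL, the 48-hour rule leaves a full day of slack, which gives some room for resolvers that do not honor TTLs exactly.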