On Jan. 24, 2018, around 6 AM UTC, one of our Nginx gateways stopped responding but was not removed from the pool.
As a result, every request routed to that gateway failed with a connection timeout error. No exception was raised within the API itself, since those requests never reached the API servers. Around one third of the traffic directed to the API got timeout errors because of the issue.
After multiple rounds of checking network and DNS health, we found the culprit and promptly replaced it in the gateway pool. All systems went back to normal around 13:15.
We plan to implement an additional health check at the Nginx gateway level to avoid running into the same issue again.
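
For illustration only, here is a minimal Python sketch of what such a gateway-level health check could look like. It is not our actual implementation: the function names (http_probe, split_pool), the health URLs, the timeout and the number of attempts are all assumptions made for the example. The idea is to probe each gateway with a short timeout and flag any gateway that fails every attempt, so that it can be taken out of the pool instead of silently receiving traffic.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List, Tuple


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the gateway answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, DNS failures, timeouts and
        # urllib's URLError/HTTPError (non-2xx responses).
        return False


def split_pool(
    gateways: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Split gateways into (healthy, unhealthy).

    A gateway is healthy as soon as one probe succeeds, and unhealthy only
    if all `attempts` probes fail, which avoids evicting a gateway on a
    single transient error.
    """
    healthy: List[str] = []
    unhealthy: List[str] = []
    for gateway in gateways:
        if any(probe(gateway) for _ in range(attempts)):
            healthy.append(gateway)
        else:
            unhealthy.append(gateway)
    return healthy, unhealthy


if __name__ == "__main__":
    # Hypothetical health endpoints, for illustration only.
    pool = ["http://gw1.example/health", "http://gw2.example/health"]
    ok, down = split_pool(pool)
    print("healthy:", ok)
    print("to remove from pool:", down)
```

The probe is passed in as a parameter so the pool logic can be exercised without network access. A check like this would need to run on a schedule and feed whatever mechanism actually removes a gateway from the pool.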