Random 502 Bad Gateway

Hi,

We are encountering random Nginx 502 Bad Gateway errors from one of our services, a NestJS API that reads from and writes to a TSDB (QuestDB Cloud).
This service continuously receives large POST and GET requests from our data science team.
We didn’t find any errors in the service logs, so we looked further.

While analysing the logs, we discovered another Nginx error: client intended to send too large body.
We divided the size of these POST requests by 3 (increasing the number of requests), and this error disappeared. The number of 502 Bad Gateway errors decreased after this change, but they still happen.

We analysed all the logs and metrics we could find in our New Relic account.
We found a correlation between the POST requests ending in a 502 and an nginx-ingress error: recv() failed (104: Connection reset by peer) while reading response header from upstream. There is a perfect match between this error and every 502 Bad Gateway.

We found that this recv() failed error happens with GET requests too, but strangely all the GET requests seem to end with a 200 status, and the queried data is received.

I tried sending the same requests again and again: the same GET or POST request succeeds most of the time, but the recv() failed error happens at some point (with a 502 for the POST request).

We did our best to solve this problem by ourselves, but after many days we are still stuck.

Could you please help us?

Thanks in advance.

Mathieu

Environment: https://console.qovery.com/organization/55ee45eb-b3e8-4cdb-8479-9ffa2899b6d2/project/d0575a39-0386-4ec8-8f91-8d1c132c6f3d/environment/3c23caad-908b-4ba4-9d7b-6ae7b88b037f/services/general

Service ID: 629366c1-740e-42c4-9d92-a3326f443de7

Hello @Mathieu_Haage,

I will have a look.

1- Do you have any insights/metrics on your app’s response time for those 502 issues? I am wondering if it could be due to a timeout (30 seconds).

2- Do you have any clues/metrics on CPU/memory usage for your app? Do you see any spikes, especially in memory, when those 502s arise?

3- What QPS is your app usually handling?

Cheers

I also noticed you have only one replica on your service, any reason for that? A pod can be moved to another node, for example if the autoscaler wants to optimize your workload and scale down, which can lead to 502s while your service is restarted on another node. I strongly advise provisioning several replicas for this app.

Thank you for your quick reply.

Do you have any insights/metrics on your app’s response time for those 502 issues? I am wondering if it could be due to a timeout (30 seconds). What QPS is your app usually handling?

We have 3 endpoints used continuously, with these metrics from New Relic:
- GET /data/last: avg response time 3.6 s / 1 rpm
- POST /data: avg response time 1.16 s / 26 rpm
- GET /data: avg response time 662 ms / 15 rpm
We considered the possibility of a timeout, but the timeout settings (on Qovery and in our code) are way higher than these response times.

I’m waiting for the data science team’s answer about the response times they see when a 502 error is encountered. I’ll give you this info ASAP.

Do you have any clues/metrics on CPU/memory usage for your app? Do you see any spikes, especially in memory, when those 502s arise?

We don’t see anything in CPU or RAM that matches the 502 errors; we only have a match with the nginx-ingress logs.

I also noticed you have only one replica on your service, any reason for that?

We decided to start small and add replicas if needed. This environment is our preprod and we are only in a validation phase; our goal is to analyse the services’ behaviour and increase CPU/RAM/replicas if needed.
We didn’t realize a service could be moved. I tried adding a replica, but I didn’t notice an improvement; the 502 still appeared a few times per hour.

Average is evil :slight_smile:

Do you have the 99th percentile for response time?

Oops, sorry, I didn’t know… :sweat_smile: I’m just starting with monitoring and New Relic.

  • GET /data/last → 99th percentile response time: 6.57 s
  • POST /data → 99th percentile response time: 1.98 s
  • GET /data → 99th percentile response time: 4.69 s

Hi,

As I said earlier, on Friday I tried adding 1 replica, so I had 2-2 replicas (min-max), and I didn’t see much improvement.

But today I retried with 3-3 replicas and we can see a big improvement. I haven’t seen any recv() failed error since 8:56 UTC, the moment I increased the number of replicas.

Is there a way to analyse and understand the source of this error?
Can the error still be encountered with 2 replicas if the service is moved to another node for optimization?

I’m still waiting for the 502 response times from the data science team.

Hey @Mathieu_Haage !

I will have a closer look, but if nginx returns a 502, it should tell you where the error is coming from (your app or nginx directly). You can display nginx logs in the live logs; if a 502 arises, you should be able to dig deeper.

Also, you can increase the max CPU/RAM given to your nginx controllers via the cluster advanced settings, but I don’t see any restarts on your nginx controllers, so the default resources should be OK.

Cheers

Thank you for this new information.

I started analysing the Nginx logs last week by enabling them in your UI; now I’m doing it in New Relic, which seems easier to me.
I compared the logs to be sure I wasn’t missing any messages. I confirm that I only see the recv() failed error followed by the 502 at the exact same time. I can’t find any more information in the logs for now.

We are continuing the tests with 3 instances; then I’ll try reducing the number of instances and increasing the resources in the Nginx parameters.
I understand that this won’t solve the problem, but I’m curious to see if there’s any improvement.

After these tests, I suggested that the data science team try reducing the request size again, to see if smaller requests suit our system better, even if there are more of them.

Hi,

I have a few more details after testing several things:

  • Response times for the 502 errors on the POST request: 622 ms, 678 ms, 465 ms, 927 ms, 497 ms, 901 ms, 675 ms (with bodies around 3 MB).
  • We added a request_id parameter to trace requests easily: we can see that when the 502 occurs, the request never reaches the API; a retry mechanism then executes the same query 10 seconds later, and this time it succeeds (the pattern is sketched just after this list).
  • Test with 3 replicas: things improved, and at first I thought it had solved the problem, but the 502 Bad Gateway error came back. It’s a random error, which sometimes gives false hope when there is no error for 15 or 30 minutes. But it always comes back.
  • Increased Nginx resources: no improvement.
  • I switched the healthcheck from TCP to HTTP: I had forgotten this parameter, and I saw in another discussion that someone solved a similar error by making this change. No improvement.
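
For context, the retry pattern on the data science side looks roughly like this (an illustrative sketch only, assuming python-requests as the logs suggest; the host, origin and delay values are placeholders, not our real ones):

import random
import time

import requests

def post_with_retry(payload, max_attempts=3, retry_delay=10):
    # request_id lets us follow the same logical request through nginx and API logs
    request_id = random.randint(0, 10**16)
    for attempt in range(max_attempts):
        resp = requests.post(
            "https://preprod.metrics-api.example.com/data",  # placeholder host
            params={"request_id": request_id, "origin": "script_example"},
            json=payload,
            timeout=30,
        )
        if resp.status_code != 502:
            return resp
        # 502 from nginx: the request never reached the API, so resend the exact same request
        time.sleep(retry_delay)
    return resp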

Thanks for that information. Did you find full nginx logs showing those 502s?
How many QPS are sent to this service?
What’s the percentage of 502 vs 2xx requests?
Does it happen after specific actions such as deployments?

Cheers

  • QPS: it’s so small that I thought it would be more useful per minute: around 40 queries per minute.
  • Full Nginx logs surrounding the 502 (recv() failed error in red, then the 502).
  • It happens randomly; no particular event has been correlated with the error.
  • There is a warning about body buffering. I tried increasing the body buffer parameter up to 300 MB, but it didn’t change anything. This warning happens a lot and isn’t a problem for other requests.

After a closer look, it seems you have a lot of messages like this:

2024/03/26 12:05:11 [warn] 348#348: *937022 a client request body is buffered to a temporary file /tmp/nginx/client-body/0000012134, client: **.**.112.184, server: preprod.metrics-api.***.com, request: "POST /data?request_id=2488970314211455&origin=script_tph_unbiased&start_time=2024-03-26T12%3A03%3A00.000Z&end_time=2024-03-26T12%3A04%3A00.000Z HTTP/1.1", host: "preprod.metrics-api.***.com"

It seems that for some of them, your app returns HTTP 502:

**.**.112.184 - - [26/Mar/2024:09:01:02 +0000] "POST /data?request_id=9884629743835268&origin=script_tph_unbiased&start_time=2024-03-26T08%3A59%3A00.000Z&end_time=2024-03-26T09%3A00%3A00.000Z HTTP/1.1" 502 150 "-" "python-requests/2.31.0" 3378778 0.474 [zd0575a39-z3c23caad-app-z629366c1-metrics-api-5000] [] **.**.18.128:5000 0 0.355 502 a1641812231d350e723bf2f52d61616f
**.**.112.184 - - [26/Mar/2024:09:21:04 +0000] "POST /data?request_id=5212105373695189&origin=script_tph_unbiased&start_time=2024-03-26T09%3A17%3A00.000Z&end_time=2024-03-26T09%3A19%3A00.000Z HTTP/1.1" 502 150 "-" "python-requests/2.31.0" 3420321 0.393 [zd0575a39-z3c23caad-app-z629366c1-metrics-api-5000] [] **.**.18.128:5000 0 0.276 502 fcccb0caf1f21406824e8a27031869d5
**.**.112.184 - - [26/Mar/2024:10:18:15 +0000] "POST /data?request_id=666537369708680&origin=script_rain_detection&start_time=2024-03-26T10%3A13%3A00.000Z&end_time=2024-03-26T10%3A18%3A00.000Z HTTP/1.1" 502 150 "-" "python-requests/2.31.0" 283615 0.373 [zd0575a39-z3c23caad-app-z629366c1-metrics-api-5000] [] **.**.18.128:5000 0 0.350 502 c0650e98ab3a61b2c591292311b6a1ab

You also have things like:

2024/03/26 10:18:15 [error] 164#164: *108859 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: **.**.112.184, server: preprod.metrics-api.*.*.com, request: "POST /data?request_id=666537369708680&origin=script_rain_detection&start_time=2024-03-26T10%3A13%3A00.000Z&end_time=2024-03-26T10%3A18%3A00.000Z HTTP/1.1", upstream: "http://100.64.18.128:5000/data?request_id=666537369708680&origin=script_rain_detection&start_time=2024-03-26T10%3A13%3A00.000Z&end_time=2024-03-26T10%3A18%3A00.000Z", host: "preprod.metrics-api.*.*.com"
2024/03/26 10:23:12 [error] 164#164: *114073 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: **.**.112.184, server: preprod.metrics-api.*.*.com, request: "GET /data?request_id=9166882577039716&origin=script_rain_detection&fields=real_length&start_time=2024-03-25T10%3A23%3A00.000Z&end_time=2024-03-26T00%3A16%3A00.000Z&stations_codes=GEO-00009&stations_codes=GEO-00017&stations_codes=GEO-00020&stations_codes=GEO-00024&stations_codes=GEO-00025&stations_codes=GEO-00026&stations_codes=GEO-00029&stations_codes=GEO-00003&stations_codes=GEO-00006&stations_codes=GEO-00010&stations_codes=GEO-00013&stations_codes=GEO-00014&stations_codes=GEO-00015&stations_codes=GEO-00001&stations_codes=GEO-00002&stations_codes=GEO-00004&stations_codes=GEO-00005&stations_codes=GEO-00008&stations_codes=GEO-00011&stations_codes=GEO-00021&stations_codes=GEO-00022&stations_codes=GEO-00023&stations_codes=GEO-00027&stations_codes=GEO-00028&stations_codes=GEO-00012&sciences_environment=preproduction&format=wide_array&metrics_type=raw HTTP/1.1", upstream: "http://100.64.18.128:5000/data?request_id=9166882577039716&origin=script_rain_detection&fields=real_length&start_time=2024-03-25T10%3A23%3A00.000Z&end_time=2024-03-26T00%3A16%3A00.000Z&stations_codes=GEO-00009&stations_codes=GEO-00017&stations_codes=GEO-00020&stations_codes=GEO-00024&stations_codes=GEO-00025&stations_codes=GEO-00026&stations_codes=GEO-00029&stations_codes=GEO-00003&stations_codes=GEO-00006&stations_codes=GEO-00010&stations_codes=GEO-00013&stations_codes=GEO-00014&stations_codes=GEO-00015&stations_codes=GEO-00001&stations_codes=GEO-00002&stations_codes=GEO-00004&stations_codes=GEO-00005&stations_codes=GEO-00008&stations_codes=GEO-00011&stations_codes=GEO-00021&stations_codes=GEO-00022&stations_codes=GEO-00023&stations_codes=GEO-00027&stations_codes=GEO-00028&stations_codes=GEO-00012&sciences_environment=preproduction&format=wide_array&metrics_type=raw", host: "preprod.metrics-api.*.*"
2024/03/26 11:23:00 [error] 554#554: *177068 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: **.**.112.184, server: preprod.metrics-api.*.*.com, request: "GET /data?request_id=1440249459496182&origin=script_raster_th&fields=humidity_unbiased&start_time=2024-03-26T11%3A15%3A00.000Z&end_time=2024-03-26T11%3A19%3A00.000Z&stations_codes=GEO-00001&stations_codes=GEO-00005&stations_codes=GEO-00008&stations_codes=GEO-00012&stations_codes=GEO-00014&stations_codes=GEO-00015&stations_codes=GEO-00017&stations_codes=GEO-00021&stations_codes=GEO-00023&stations_codes=GEO-00024&stations_codes=GEO-00025&stations_codes=GEO-00026&stations_codes=GEO-00028&stations_codes=GEO-00029&sciences_environment=preproduction&format=wide_array&metrics_type=raw HTTP/1.1", upstream: "http://**.**.20.235:5000/data?request_id=1440249459496182&origin=script_raster_th&fields=humidity_unbiased&start_time=2024-03-26T11%3A15%3A00.000Z&end_time=2024-03-26T11%3A19%3A00.000Z&stations_codes=GEO-00001&stations_codes=GEO-00005&stations_codes=GEO-00008&stations_codes=GEO-00012&stations_codes=GEO-00014&stations_codes=GEO-00015&stations_codes=GEO-00017&stations_codes=GEO-00021&stations_codes=GEO-00023&stations_codes=GEO-00024&stations_codes=GEO-00025&stations_codes=GEO-00026&stations_codes=GEO-00028&stations_codes=GEO-00029&sciences_environment=preproduction&format=wide_array&metrics_type=raw", host: "preprod.metrics-api.*.*.com"

Those are more complicated to diagnose: it might be nginx cutting off responses because the response headers are too big, or maybe your app is cutting the connection.

Can you disable those two settings (turning off buffering)? We should see more clearly then.

Last, for the recv() failed errors, can you provide a request generating such an error? You can try sending it via a port forward directly to your app to see what happens (taking nginx out of the picture).

Cheers

Those are more complicated to diagnose

I totally agree with you, this is one of the most complex errors I’ve had to debug. :face_with_spiral_eyes:

I mentioned the warning in my previous message (we sent our messages at the same time). With a body size around 3 MB and proxy_body_size_mb set to 100 MB by default, or even increased to 300 MB, I don’t understand why I get this warning.
We have a perfect match between the recv() failed error and the 502 error, but the warning also appears a lot on requests that work.

I disabled the proxy buffering parameters; the deployment time was 13:34 UTC. After reading a lot about these parameters, I had hesitated to do it myself a few days ago, but I wasn’t sure what the outcome would be.

I’m sorry, I don’t understand how to send the request via a port forward directly to the app. :pensive:

Is it the CLI port-forwarding feature you’re talking about? (I’m just discovering it in the docs.)

For the port forward, you can use our Qovery CLI and have a look at this blog post. Basically, you will be able to forward local HTTP requests directly to your service running on your cluster, without hitting NGINX.

So if you can find a request generating a 502 with recv() failed, you can just send it locally and it will be forwarded to your service, so you get a sense of how your app behaves with it.
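
Something like this, for example (just a rough sketch, assuming the port forward is already running and listening on localhost:8000; replace the port and the query parameters with one of your real failing requests):

import requests

LOCAL_BASE = "http://localhost:8000"  # local port opened by the Qovery CLI port forward

# Replay the same request many times directly against the forwarded port,
# bypassing nginx entirely, and watch for anything other than a 200.
for i in range(200):
    resp = requests.get(
        f"{LOCAL_BASE}/data",
        params={
            "request_id": f"portforward-test-{i}",
            "fields": "real_length",
            "start_time": "2024-03-25T10:23:00.000Z",
            "end_time": "2024-03-26T00:16:00.000Z",
            "stations_codes": ["GEO-00009", "GEO-00017"],
        },
        timeout=60,
    )
    print(i, resp.status_code, len(resp.content))

If your app answers 200 every single time through the tunnel, that points back to something between nginx and the pod rather than the app itself.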

Note: do you see the same behavior in production? If not, maybe you have a debug flag in your app adding debug info to the headers, leading to big headers?

Thank you for your help, I’ll try that port-forwarding feature.
As a reminder, I already know that the request never reaches our app when the 502 occurs; the request is stopped at the Nginx level.
Then, when the exact same request is retried 10 seconds later, it reaches the app and everything works as expected.

I checked the logs since buffering was disabled: all the warnings about buffering disappeared, but the recv() failed errors are still there.

Sadly, production is not ready for the comparison; we wanted to validate that everything was fine on preprod before going further. All went well, until this 502.
We decided this morning to start the same data pipelines in production as soon as possible, to be able to compare the 2 environments.

Keep me posted, I am curious :slight_smile:
Good luck with testing

Thanks for the port-forward feature, it’s so useful!

I’m preparing a long day of testing. I’ll come back to you when I have results.

I tried to think of other possible sources of a recv() failed error, so I searched the New Relic logs for “dns” + “error” and was surprised to find so many error logs from an external-dns container.

{
  "cluster_name": "qovery-develop",
  "container_name": "external-dns",
  "message": "time=\"2024-03-27T09:50:19Z\" level=error msg=\"Unable to patch zone. Status: 400 Bad Request, Body: \"",
  "namespace_name": "kube-system",
  "newrelic.logPattern": "nr.DID_NOT_MATCH",
  "newrelic.source": "api.logs",
  "plugin.source": "k8s-fb-1.17.3",
  "pod_name": "externaldns-external-dns-66b754b98b-85s66",
  "stream": "stderr",
  "timestamp": 1711533019970
}

Could a DNS problem be the source of our issues?

I tried replacing our custom domain with the default Qovery domain, just in case. It seemed to work perfectly at the beginning, then the error appeared again. :pensive:

Hi @bchastanier,

I’ll try to summarize the results of our tests:

  • Using your port-forwarding feature, I tried to simulate the server load with curl and bash scripts: no error occurred, but it’s really hard to simulate the real load from my computer, as there is a lot of dynamic data handled by the data science team. It could be one more clue indicating that the problem happens around nginx-ingress. :thinking:

  • I cloned the service experiencing this error and asked the data science team to send requests to both services at the same time, to check whether the error would happen on both simultaneously.
    We started with only one script (GET + POST requests): no error.
    Then more scripts were executed (GET + POST): errors started to happen, sooner than I was expecting.
    This test gave us one more clue: the error happened sometimes on one service, sometimes on the other, but never on both services at the same time. Totally random, and no particular event seems to be the source of this error.

  • Finally, we stopped all the scripts except the first (still no error visible), and I used Locust to do load testing: I used only one GET request, changing only the parameters: one huge request containing a long list of weather sensors, and another with only one sensor as a parameter (a minimal version of the locustfile is sketched after this list).
    This test gave us one more clue: the request size and the size of the data requested don’t change the number of error occurrences.
    With both the small and the huge request, I had 12 to 13 errors over 25 minutes at 9 to 10 QPS.
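
The locustfile looked roughly like this (a simplified sketch with both variants as tasks in one file; the host and query parameters are placeholders, not our real ones):

from locust import HttpUser, task, between

# Long list of sensors vs a single sensor, to compare request sizes
LONG_STATION_LIST = [f"GEO-{i:05d}" for i in range(1, 30)]

class MetricsApiUser(HttpUser):
    host = "https://preprod.metrics-api.example.com"  # placeholder host
    wait_time = between(0.05, 0.2)  # short pause between requests

    @task
    def get_many_sensors(self):
        self.client.get(
            "/data",
            params={
                "fields": "humidity_unbiased",
                "start_time": "2024-03-26T11:15:00.000Z",
                "end_time": "2024-03-26T11:19:00.000Z",
                "stations_codes": LONG_STATION_LIST,  # the huge request
                "format": "wide_array",
                "metrics_type": "raw",
            },
            name="GET /data (many sensors)",
        )

    @task
    def get_single_sensor(self):
        self.client.get(
            "/data",
            params={
                "fields": "humidity_unbiased",
                "start_time": "2024-03-26T11:15:00.000Z",
                "end_time": "2024-03-26T11:19:00.000Z",
                "stations_codes": ["GEO-00001"],  # the single-sensor request
                "format": "wide_array",
                "metrics_type": "raw",
            },
            name="GET /data (single sensor)",
        )

Run with locust -f locustfile.py and adjust the number of users until you reach the 9 to 10 QPS from the test; both variants triggered the recv() failed error at roughly the same rate.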

Perhaps this information will make you think of something I don’t know, as I’m a beginner with k8s.

One last thing: while searching the web for a solution to this problem, I found this answer to a serverfault.com question. Could it be related?

If you use the nginx-ingress reverse proxy, it may be the cause. Try to check the ConfigMap for the proxy-next-upstream setting, and extend it to handle the http_502 case.

Also enable retry-non-idempotent if the requests that get 502s are POST, LOCK or PATCH, if it is safe for your app to do so.

My guess: when the backend pod reaches its load limit (or during pod recycling), it “rejects” new requests, and the nginx reverse proxy is more sensitive to this rejection. Without nginx-ingress, k8s handles the load balancing and can handle everything better, queuing instead of rejecting.

kubernetes - k8s nginx ingress returns randomly 502 error on load - Server Fault

Thanks for your help.