Random 502 Bad Gateway

Hey @Mathieu_Haage,

Thanks for the heads up.
It’s very interesting :slight_smile:

To check whether this fixes the issue, we can manually edit the nginx ConfigMap on your cluster and set these two fields:

retry-non-idempotent: "true"
proxy-next-upstream: "error timeout http_502"

Once these are set on your cluster, you shouldn’t redeploy the cluster, otherwise the redeploy will erase those settings.

We can let it run for a couple of hours or days so you can check whether it solves your issue.

If it works, we can then add those settings to the product directly so you can customize them.

How does it sound?

Cheers

Thank you for this suggestion, it sounds great!
I can’t wait to see if these settings solve the problem. :crossed_fingers:
Letting it run for a couple of days is more appropriate, as some scripts only run once a day.

If these parameters solve the problem, it would indeed be very useful to be able to customize them.

OK! Just let me know when you want me to override those settings on your staging cluster.

As soon as possible. :slight_smile:

I just found more precise information (easier now that I know what I’m looking for):

Starting in nginx 1.9.13, non-idempotent requests (POST, LOCK, PATCH) are not retried by default.

Now I understand why the GET requests respond with a 200 and return data when the recv() failed happens, but POST requests respond with a 502.
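To make that reasoning concrete, here is a minimal sketch of the retry rule in Python. It is a simplified model, not nginx's actual implementation: the function name `would_retry` and its parameters are hypothetical, the method list follows the nginx `proxy_next_upstream` documentation, and the `retry_non_idempotent` flag is assumed to correspond to the `retry-non-idempotent: "true"` setting above.

```python
from typing import Collection

# Methods nginx treats as non-idempotent when deciding whether to retry
# (per the proxy_next_upstream documentation).
NON_IDEMPOTENT_METHODS = {"POST", "LOCK", "PATCH"}


def would_retry(
    method: str,
    failure: str,
    next_upstream: Collection[str],
    retry_non_idempotent: bool = False,
) -> bool:
    """Simplified model: may a failed request be passed to another upstream?

    Assumes the request was already sent to the first upstream and that
    another upstream server is available to try.
    """
    if failure not in next_upstream:
        # The failure type is not listed in proxy_next_upstream: no retry.
        return False
    if method.upper() in NON_IDEMPOTENT_METHODS and not retry_non_idempotent:
        # Non-idempotent requests are not retried unless explicitly allowed.
        return False
    return True


# "recv() failed" counts as an "error" condition in this model.
default_conditions = {"error", "timeout"}
print(would_retry("GET", "error", default_conditions))
print(would_retry("POST", "error", default_conditions))
print(would_retry("POST", "error", default_conditions, retry_non_idempotent=True))
```

Under this model the GET is retried on another upstream and ends up as a 200, while the POST is not retried and the failure surfaces to the client as a 502, unless retrying non-idempotent requests is enabled.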

This seems to be the best lead in a week; at last it makes sense. :crossed_fingers:

:crossed_fingers:

I’ve just updated your staging cluster’s nginx config with the values mentioned above.

I also locked your cluster, so no cluster updates (whether initiated by Qovery or by you) will be possible during the test. We will remove the lock afterwards.

Let me know how it goes :slight_smile:

Thank you. Now we wait. :crossed_fingers:
I’ll keep you up to date.

Hi @bchastanier,

Sadly the new settings didn’t change anything. So you can revert to the previous config now.

I’ll be off for 2 weeks, so I won’t work too much on this problem. We have a retry mechanism that handles the error for now.
I’ll pick it up when I get back.
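For readers wondering what such a client-side workaround might look like, here is a minimal sketch in Python. It is not the actual mechanism used in this thread: `call_with_retry`, `send`, and the backoff values are all hypothetical, and retrying a POST this way is only safe if the operation is idempotent on the application side.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() and retry with exponential backoff while it returns 502.

    send is a hypothetical callable that performs the request and returns
    the HTTP status code. At most `attempts` calls are made in total.
    """
    status = send()
    for attempt in range(1, attempts):
        if status != 502:
            break
        # Wait 0.2s, 0.4s, ... before the next try.
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status


# Usage with a fake backend that fails twice, then succeeds.
# The sleep function is replaced so the example does not actually wait.
fake_responses = iter([502, 502, 200])
waited = []
final_status = call_with_retry(lambda: next(fake_responses), sleep=waited.append)
print(final_status, waited)
```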

While I’m away I’ll try to figure out what could be causing the error “recv() failed - connection reset by peer”. Maybe there’s a Node/NestJS quirk I don’t know about.

Once again, a big thank you for your help.

Cheers

OK!

Yes, it now seems that the issue comes from your application.

Again, let me know once you’ve found the solution :slight_smile:
