Transport endpoint is not connected. #1

Open
ClaireJuil opened this issue Sep 25, 2023 · 2 comments

@ClaireJuil

Occasionally an application becomes unavailable because of a problem with a job that uses persistent data, for example a database.
In the Nomad UI, the database job appears to be running and healthy, but its logs show:
PANIC: could not open file "/var/lib/postgresql/data/global/pg_control": Transport endpoint is not connected.
The Kadalu jobs and the Nomad volume both report an OK state.

Restarting only the database job is not always enough; it seems that the Kadalu jobs (and all jobs using persistent data) must be restarted as well.
When restarting a job with persistence, the error can occur again:
failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = Unknown desc = Exception calling application: [Errno 107] Transport endpoint is not connected: '/mnt/PROD/subvol'
Once the Kadalu and application jobs have restarted successfully, no data is lost, but depending on the node, the Kadalu logs show:
DEBUG [nodeserver - 150:NodeUnpublishVolume] - Received the unmount request volume=keycloak-db
even though the database has restarted and volume=keycloak-db is mounted.

Our environment:
AlmaLinux 8.7
GlusterFS 8.6
Nomad v1.4.3
Kadalu v1.0.0

Please help me to find a solution to this problem.

@leelavg
Owner

leelavg commented Sep 25, 2023

DEBUG [nodeserver - 150:NodeUnpublishVolume] - Received the unmount request volume=keycloak-db

  • I recently fixed this in kadalu/kadalu#1008, "Fix NodeUnpublishVolume RPC call" (see the linked issue for others hitting this in Nomad); the fix is awaiting release.
  • I can see only one edge case where this might affect you, i.e., Nomad assigning the same alloc ID after a restart. However, I don't think that will ever be the case, so you can disregard these log lines.

Transport Endpoint, (ENOTCONN)

  • This usually happens when the nodeplugin jobs are restarted: Gluster connectivity is lost, and when that happens the app jobs need to be restarted.

  • On a lighter note, I've been unwell for a couple of days and couldn't think deeply about anything; once I'm back to health, I'll take another look.
  • In the meantime, the info above should unblock you.
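
To make the ENOTCONN case above concrete, here is a minimal Python sketch of how a stale mount could be detected from a node. It is only an illustration, not part of Kadalu or Nomad: the helper name `is_stale_mount` is hypothetical, and the path is simply the one from the error message in the report. It assumes that a `stat` on a mount whose Gluster client has gone away fails with errno ENOTCONN (107 on Linux), as the reported error suggests.

```python
import errno
import os


def is_stale_mount(path):
    """Return True if accessing `path` fails with ENOTCONN.

    Hypothetical helper for illustration; not a Kadalu or Nomad API.
    """
    try:
        os.stat(path)
    except OSError as exc:
        # ENOTCONN is "Transport endpoint is not connected".
        return exc.errno == errno.ENOTCONN
    return False


if __name__ == "__main__":
    # Path copied from the error message in the report; adjust for your setup.
    mount_path = "/mnt/PROD/subvol"
    if is_stale_mount(mount_path):
        print(f"{mount_path}: transport endpoint is not connected")
    else:
        print(f"{mount_path}: no ENOTCONN detected")
```

A check like this only reports the symptom. Per the note above, the jobs using the volume would still need to be restarted once the nodeplugin is back.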

@ClaireJuil
Author

Thank you for your prompt response.
I hope you get better quickly.

Understood regarding the 'unmount_volume' trace when restarting jobs.
Do you have a solution for my initial problem, 'Transport endpoint is not connected'?
Someone else describes this error here:
https://discuss.hashicorp.com/t/transport-endpoint-is-not-connected/14712
