# Troubleshooting Orangelight
- Check the Ex Libris status page. We are on Alma NA05.
- Check Honeybadger for recent errors.
  - Solr errors: see Solr Debugging below.
  - Postgres errors: see Postgres Debugging below.
  - No errors: see Web Infrastructure Debugging below.
- Viewing the load balancer can help to diagnose issues.
- SSH into one of the catalog boxes as `pulsys` and run `sudo passenger-status`.
  - Check the number of requests in the queue.
  - In a DoS scenario, the number of requests will be at the maximum, 100.
  - You can see the specific requests in the queue with `sudo passenger-status --show=requests`.
  - You can clear the queue with `sudo systemctl restart nginx`, but it will likely fill up again.
- To diagnose which requests are filling up the queue, check Datadog:
  - The logs page can tell you if a single IP address is responsible for all the long-running requests (a command-line alternative is sketched after this list).
  - The APM page can show you which long-running requests are causing the holdup.
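
If Datadog is unavailable, you can get a rough picture directly on a catalog box by counting requests per client IP in the nginx access log. A minimal sketch, assuming the default log path `/var/log/nginx/access.log` (an assumption; our nginx config may log elsewhere):

```
# Count requests per client IP, busiest first (top 10)
ssh pulsys@catalog1 "sudo awk '{print \$1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head"
```

A single IP with an outsized count at the top of this list is a good candidate for the source of the queue buildup.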
## Solr Debugging

This is the same as the Solr debugging found on the Bibdata Troubleshooting page.
## Postgres Debugging

Check whether other machines that use this postgres cluster are also broken. Those are: https://bibdata.princeton.edu, https://abid.princeton.edu, and https://oawaiver.princeton.edu/
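A quick way to check all three from your own machine; a minimal sketch using curl (a 200 means the service is answering; a 5xx or a timeout suggests it is down too):

```
# Print the HTTP status code for each service
for url in https://bibdata.princeton.edu https://abid.princeton.edu https://oawaiver.princeton.edu/; do
  curl -s -o /dev/null -m 10 -w "%{http_code} %{url_effective}\n" "$url"
done
```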
- If they aren't, then log on to the catalog machines `catalog1`, `catalog2`, and `catalog3` and restart nginx, like so (a loop over all three boxes is sketched below):

  ```
  ssh pulsys@catalog1 sudo service nginx restart
  ```

  This would be a very unlikely scenario, and may need more in-depth troubleshooting.
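  To restart all three in one pass, a minimal sketch assuming the same `pulsys` SSH access as above:

  ```
  # Restart nginx on every catalog box
  for host in catalog1 catalog2 catalog3; do
    ssh pulsys@$host sudo service nginx restart
  done
  ```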
- If other services ARE down and your errors say that the app can't connect to postgres, then postgres may be down. Check the logs to see if you're seeing anything like disk space errors:

  ```
  ssh pulsys@lib-postgres-prod1 sudo tail -n 5000 /var/log/postgresql/postgresql-13-main.log
  ```
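  If the log does show disk space errors, confirm with a quick disk usage check; a minimal sketch, pointing `df` at the data directory used in the restart command below:

  ```
  # Show free space on the filesystem holding the postgres data directory
  ssh pulsys@lib-postgres-prod1 df -h /var/lib/postgresql
  ```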
  Assuming postgres has just somehow broken, SSH into `lib-postgres-prod1` and restart postgres:

  ```
  ssh pulsys@lib-postgres-prod1 sudo -u postgres /usr/lib/postgresql/13/bin/pg_ctl -D /var/lib/postgresql/13/main restart
  ```

  If this does not resolve it, you may have to reboot the server. Be ready to contact Operations if it does not come back up within the next 15 minutes.

  ```
  sudo /sbin/reboot
  ```
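  After the restart (or reboot), you can confirm postgres is accepting connections again; a minimal sketch, assuming the standard postgres 13 tools installed alongside `pg_ctl`:

  ```
  # Prints "accepting connections" once the server is healthy
  ssh pulsys@lib-postgres-prod1 /usr/lib/postgresql/13/bin/pg_isready
  ```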
  This scenario is also very unlikely.
## Web Infrastructure Debugging

If you're not getting Honeybadger errors, it means the Rails application isn't erroring. Either the load balancer has detected the site is unhealthy, or nginx has gone down on the boxes.
- Check the Rails logs, see if any requests are failing: Link to Logs
- If requests are failing and you aren't getting Honeybadger errors, there's probably something wrong with the boxes: disk space, read-only file systems, or similar. Operations will probably need to fix these issues.
- If there are no requests failing, or no requests coming through at all, nginx may be broken. Check the passenger log on `catalog1`, `catalog2`, and `catalog3` for errors:

  ```
  ssh pulsys@catalog1 sudo tail -n 1000 /var/log/nginx/error.log
  ```
- If you find errors, restart nginx on these boxes with `sudo service nginx restart`. It may take some time for the load balancer to recognize these boxes are healthy again (a quick status check across all three is sketched below).
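
To quickly confirm whether nginx is actually running on each box, a minimal sketch, assuming the same `pulsys` SSH access as above:

```
# Prints "active" per box if nginx is running, "inactive"/"failed" otherwise
for host in catalog1 catalog2 catalog3; do
  echo -n "$host: "
  ssh pulsys@$host sudo systemctl is-active nginx
done
```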
See the Bibdata instructions as they are the same.