Troubleshooting Bibdata
- Check Honeybadger for recent errors.
- Solr Errors: See Solr Debugging
- Postgres Errors: See Postgres Debugging
- No Errors: See Web Infrastructure Debugging
- Viewing the Load Balancer can help to diagnose issues
## Solr Debugging

If you're getting Solr errors, other applications are probably down as well. Use the following steps to determine which Solr server is having issues, then restart the Solr instance on that box.
- Check the Solr Health Dashboard. Find the boxes with very high heap memory usage: the straight lines at the top are max heap and the moving lines at the bottom are current heap, and current heap should stay far below max. This happens fairly often, and the affected server will need to be restarted. (A command-line check is sketched after this list.)
- Check the Solr monitor on Datadog. If one of the hosts is showing red, note which server it is; it will need to be restarted.
- Check the Solr console. You can open the Solr console from your local machine using Capistrano:
  cd <local pul_solr directory>
  bundle exec cap solr8-production solr:console
  If the graphs on the right are full, or there are lots of errors in the log, also look at the Cloud graph (under Cloud in the left-hand menu) for red shards. Make note of the machine that has red shards (see the sketch after this list for a command-line check), then restart the Solr service on that machine. It may take a minute or two to stop.
- SSH into those boxes and restart Solr, e.g. (lib-solr-prod4):
  ssh pulsys@lib-solr-prod4 sudo service solr restart
- If restarting via the service doesn't succeed after a minute or two, ctrl+c out of that command, run ps aux | grep solr to find the Solr process ID, and then run:
  kill -9 <solr-pid>
  sudo service solr restart
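If you prefer to check from the command line, the sketch below is one way to spot a struggling node. It assumes the default Solr port 8983 and that you run it on (or tunnel to) one of the Solr hosts; adjust host names and ports to match the cluster.

```
# Run on (or tunnel to) one of the Solr hosts; adjust host/port as needed.
SOLR_URL="http://localhost:8983/solr"

# JVM heap for this node: compare memory.heap.used against memory.heap.max.
curl -s "$SOLR_URL/admin/metrics?group=jvm&prefix=memory.heap" | python3 -m json.tool

# Cluster state: replicas reported as "down" or "recovering" point at the sick box.
# If anything other than "active" shows up, inspect the full JSON for its node_name.
curl -s "$SOLR_URL/admin/collections?action=CLUSTERSTATUS&wt=json" \
  | grep -o '"state":"[a-z]*"' | sort | uniq -c
```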
## Postgres Debugging

If you're getting Postgres errors, first ensure that other machines that use this Postgres cluster are also broken. Those are: https://catalog.princeton.edu, https://abid.princeton.edu, and https://oawaiver.princeton.edu/
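One quick way to confirm whether those applications are also affected (a sketch, assuming they're reachable from your workstation and that anything other than a 200 or a redirect means trouble):

```
# Check the other applications that share this Postgres cluster.
for url in https://catalog.princeton.edu https://abid.princeton.edu https://oawaiver.princeton.edu/; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
  echo "$url -> HTTP $code"
done
```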
- If they aren't, log on to the bibdata machines bibdata-alma1 and bibdata-alma2 and restart nginx, like so:
  ssh pulsys@bibdata-alma1 sudo service nginx restart
  This would be a very unlikely scenario and may need more in-depth troubleshooting.
- If other services ARE down and your errors say that they can't connect to Postgres, then Postgres may be down. Check the logs to see if you're seeing anything like disk space errors:
  ssh pulsys@lib-postgres-prod1 sudo tail -n 5000 /var/log/postgresql/postgresql-13-main.log
  Assuming Postgres has just somehow broken, SSH into lib-postgres-prod1 and restart Postgres:
  ssh pulsys@lib-postgres-prod1 sudo -u postgres /usr/lib/postgresql/13/bin/pg_ctl -D /var/lib/postgresql/13/main restart
  If this does not resolve it, you may have to reboot the server. Be ready to contact Operations if it does not come back up within the next 15 minutes:
  sudo /sbin/reboot
  This scenario is also very unlikely.
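Before (or after) restarting, a few quick checks can confirm whether Postgres is actually down or the disk is simply full. This is a sketch assuming the standard Debian/Ubuntu Postgres 13 layout shown above; run it on lib-postgres-prod1.

```
# Is Postgres accepting connections on this host?
sudo -u postgres pg_isready

# Is the filesystem holding the cluster data full?
df -h /var/lib/postgresql

# Status of the Postgres 13 service (Debian/Ubuntu unit name).
sudo systemctl status postgresql@13-main
```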
## Web Infrastructure Debugging

If you're not getting Honeybadger errors, then the Rails application isn't erroring. Either the load balancer has detected that the site is unhealthy, or nginx has gone down on the boxes.
- Check the Rails logs and see if any requests are failing: Link to Logs
- If requests are failing and you aren't getting Honeybadger errors, there's probably something wrong with the boxes: disk space, read-only file systems, or similar (see the checks sketched after this list). Operations will probably need to fix these issues.
- If no requests are failing, or no requests are coming through at all, nginx may be broken. Check the Passenger log on bibdata-alma1 and bibdata-alma2 for errors:
  ssh pulsys@bibdata-alma1 sudo tail -n 1000 /var/log/nginx/error.log
- If you find errors, restart nginx on these boxes:
  sudo service nginx restart
  It may take some time for the load balancer to recognize that these boxes are healthy again.
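To narrow down "something wrong with the boxes", the checks below are a sketch of what to look at on bibdata-alma1 and bibdata-alma2; it assumes systemd is in use and that Passenger's passenger-status CLI is installed.

```
# Disk space -- a full disk is a common culprit.
df -h

# Any filesystems that have been remounted read-only?
awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts

# Is nginx running, and is Passenger serving the application?
sudo systemctl status nginx
sudo passenger-status
```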
## Viewing the Load Balancer

To check the load balancer:
ssh -L 8080:localhost:8080 pulsys@lib-adc2
ip a
If you see inet 128.112.203.146 in the eno1 interface list, you are on the correct machine; otherwise exit and SSH into lib-adc1. Then go to the dashboard to view the state of the load balancer.
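Putting that together, the sketch below assumes the virtual IP and interface name above; it checks whether the host you're on holds the virtual IP and notes where the dashboard should appear.

```
# From your workstation: open a tunnel to the load balancer.
ssh -L 8080:localhost:8080 pulsys@lib-adc2

# On lib-adc2 (or lib-adc1 if this prints nothing): does this host hold the virtual IP?
ip a show eno1 | grep 128.112.203.146

# If the IP is present you are on the active machine. The tunnel forwards port 8080,
# so the dashboard should be reachable from your browser at http://localhost:8080
# (assuming the dashboard listens on port 8080 on the adc host).
```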