Troubleshooting Bibdata
- Check Honeybadger for recent errors.
- Solr Errors: See Solr Debugging
- Postgres Errors: See Postgres Debugging
- No Errors: See Web Infrastructure Debugging
- Viewing the Load Balancer can help to diagnose issues
## Solr Debugging

If you're getting Solr errors, other applications are probably down as well. Use the following steps to determine which Solr server is having issues, then restart the Solr instance on that box.
- Check the Solr Health Dashboard. Find the boxes with very high heap memory usage: the straight lines at the top are max heap and the moving lines at the bottom are current heap, and current heap should stay far below max. This happens fairly often, and the affected server will need to be restarted. (A command-line check is sketched after this list.)
- Check the Solr monitor on Datadog. If one of the hosts is showing red, note which server it is; it will need to be restarted.
- Check the Solr console. You can open the Solr console from your local machine using Capistrano:
  cd <local pul_solr directory>
  bundle exec cap solr8-production solr:console
  If the graphs on the right are full, or there are lots of errors in the log, also look at the Cloud graph (under Cloud in the left-hand menu) for red shards. Make note of the machine that has red shards (see the sketch after this list for a command-line check), then restart the Solr service on that machine. It may take a minute or two to stop.
- SSH into those boxes and restart Solr, e.g. (lib-solr-prod4):
  ssh pulsys@lib-solr-prod4 sudo service solr restart
- If restarting via the service doesn't succeed after a minute or two, ctrl+c out of that command, run ps aux | grep solr to find the Solr process ID, and then run:
  kill -9 <solr-pid>
  sudo service solr restart
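If you prefer to check from the command line, the sketch below is one way to spot a struggling node. It assumes the default Solr port 8983 and that you run it on (or tunnel to) one of the Solr hosts; adjust host names and ports to match the cluster.

```
# Run on (or tunnel to) one of the Solr hosts; adjust host/port as needed.
SOLR_URL="http://localhost:8983/solr"

# JVM heap for this node: compare memory.heap.used against memory.heap.max.
curl -s "$SOLR_URL/admin/metrics?group=jvm&prefix=memory.heap" | python3 -m json.tool

# Cluster state: replicas reported as "down" or "recovering" point at the sick box.
# If anything other than "active" shows up, inspect the full JSON for its node_name.
curl -s "$SOLR_URL/admin/collections?action=CLUSTERSTATUS&wt=json" \
  | grep -o '"state":"[a-z]*"' | sort | uniq -c
```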
## Postgres Debugging

If you're getting Postgres errors, first ensure that other machines that use this Postgres cluster are also broken. Those are: https://catalog.princeton.edu, https://abid.princeton.edu, and https://oawaiver.princeton.edu/
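One quick way to confirm whether those applications are also affected (a sketch, assuming they're reachable from your workstation and that anything other than a 200 or a redirect means trouble):

```
# Check the other applications that share this Postgres cluster.
for url in https://catalog.princeton.edu https://abid.princeton.edu https://oawaiver.princeton.edu/; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
  echo "$url -> HTTP $code"
done
```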
- If they aren't, log on to the bibdata machines bibdata-alma1 and bibdata-alma2 and restart nginx, like so:
  ssh pulsys@bibdata-alma1 sudo service nginx restart
  This would be a very unlikely scenario and may need more in-depth troubleshooting.
- If other services ARE down and your errors say that they can't connect to Postgres, then Postgres may be down. Check the logs to see if you're seeing anything like disk space errors:
  ssh pulsys@lib-postgres-prod1 sudo tail -n 5000 /var/log/postgresql/postgresql-13-main.log
  Assuming Postgres has just somehow broken, SSH into lib-postgres-prod1 and restart Postgres:
  ssh pulsys@lib-postgres-prod1 sudo -u postgres /usr/lib/postgresql/13/bin/pg_ctl -D /var/lib/postgresql/13/main restart
  If this does not resolve it, you may have to reboot the server. Be ready to contact Operations if it does not come back up within the next 15 minutes:
  sudo /sbin/reboot
  This scenario is also very unlikely.
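Before (or after) restarting, a few quick checks can confirm whether Postgres is actually down or the disk is simply full. This is a sketch assuming the standard Debian/Ubuntu Postgres 13 layout shown above; run it on lib-postgres-prod1.

```
# Is Postgres accepting connections on this host?
sudo -u postgres pg_isready

# Is the filesystem holding the cluster data full?
df -h /var/lib/postgresql

# Status of the Postgres 13 service (Debian/Ubuntu unit name).
sudo systemctl status postgresql@13-main
```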
## Web Infrastructure Debugging

If you're not getting Honeybadger errors, then the Rails application isn't erroring. Either the load balancer has detected that the site is unhealthy, or nginx has gone down on the boxes.
- Check the Rails logs and see if any requests are failing: Link to Logs
- If requests are failing and you aren't getting Honeybadger errors, there's probably something wrong with the boxes: disk space, read-only file systems, or similar (see the checks sketched after this list). Operations will probably need to fix these issues.
- If no requests are failing, or no requests are coming through at all, nginx may be broken. Check the Passenger log on bibdata-alma1 and bibdata-alma2 for errors:
  ssh pulsys@bibdata-alma1 sudo tail -n 1000 /var/log/nginx/error.log
- If you find errors, restart nginx on these boxes:
  sudo service nginx restart
  It may take some time for the load balancer to recognize that these boxes are healthy again.
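To narrow down "something wrong with the boxes", the checks below are a sketch of what to look at on bibdata-alma1 and bibdata-alma2; it assumes systemd is in use and that Passenger's passenger-status CLI is installed.

```
# Disk space -- a full disk is a common culprit.
df -h

# Any filesystems that have been remounted read-only?
awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts

# Is nginx running, and is Passenger serving the application?
sudo systemctl status nginx
sudo passenger-status
```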
## Viewing the Load Balancer

To check the load balancer:
ssh -L 8080:localhost:8080 pulsys@lib-adc2
ip a
If you see inet 128.112.203.146 in the eno1 interface list, you are on the correct machine; otherwise exit and SSH into lib-adc1. Then go to the dashboard to view the state of the load balancer.
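Putting that together, the sketch below assumes the virtual IP and interface name above; it checks whether the host you're on holds the virtual IP and notes where the dashboard should appear.

```
# From your workstation: open a tunnel to the load balancer.
ssh -L 8080:localhost:8080 pulsys@lib-adc2

# On lib-adc2 (or lib-adc1 if this prints nothing): does this host hold the virtual IP?
ip a show eno1 | grep 128.112.203.146

# If the IP is present you are on the active machine. The tunnel forwards port 8080,
# so the dashboard should be reachable from your browser at http://localhost:8080
# (assuming the dashboard listens on port 8080 on the adc host).
```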