
Troubleshooting Orangelight


Resolution

  1. Check the Ex Libris status page. We are on Alma NA05.
  2. Check Honeybadger for recent errors.
    1. Solr Errors: See Solr Debugging
    2. Postgres Errors: See Postgres Debugging
    3. No Errors: See Web Infrastructure Debugging
  3. Viewing the Load Balancer can help diagnose issues (see Viewing The Load Balancer below).

Checking Passenger status

  1. SSH into one of the catalog boxes as pulsys
  2. sudo passenger-status
  3. Check the number of requests in the queue.
    1. In a DoS scenario, the number of requests will be at the maximum, 100.
    2. You can see the specific requests in the queue with sudo passenger-status --show=requests
    3. You can clear the queue with sudo systemctl restart nginx, but it will likely fill up again.
    4. To diagnose which requests are filling up the queue, check Datadog:
      1. The logs page can tell you if a single IP address is responsible for all the long-running requests.
      2. The APM page can show you which long-running requests are causing the holdup.
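
Putting those steps together, a typical session looks like the following (a sketch using catalog1 as an example; any catalog box works, and the restart is only needed if you decide to clear the queue):

    ssh pulsys@catalog1
    sudo passenger-status                    # check queue depth
    sudo passenger-status --show=requests    # list the queued requests
    sudo systemctl restart nginx             # only if you need to clear the queue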

Solr Debugging

This is the same as the Solr debugging found on the Bibdata Troubleshooting page.

Postgres Debugging

Check whether other machines that use this Postgres cluster are also broken.

Those are: https://bibdata.princeton.edu, https://abid.princeton.edu, and https://oawaiver.princeton.edu/
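
One quick way to check those services is to request each one and look at the HTTP status line (a minimal sketch using curl; any HTTP client will do):

    curl -sI https://bibdata.princeton.edu | head -n 1
    curl -sI https://abid.princeton.edu | head -n 1
    curl -sI https://oawaiver.princeton.edu/ | head -n 1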

  1. If they aren't, then log on to the catalog machines catalog1, catalog2 and catalog3 and restart nginx, like so:

    ssh pulsys@catalog1
    sudo service nginx restart
    

    This would be a very unlikely scenario and may need more in-depth troubleshooting.

  2. If other services ARE down and your errors say the application can't connect to Postgres, then Postgres may be down.

    Check the logs to see if you're seeing anything like disk space errors (a quick disk-usage check is sketched after this list):

    ssh pulsys@lib-postgres-prod1
    sudo tail -n 5000 /var/log/postgresql/postgresql-13-main.log
    

    Assuming Postgres has simply gotten into a bad state, SSH into lib-postgres-prod1 and restart Postgres.

    ssh pulsys@lib-postgres-prod1
    sudo -u postgres /usr/lib/postgresql/13/bin/pg_ctl -D /var/lib/postgresql/13/main restart
    

    If this does not resolve it, you may have to reboot the server. Be ready to contact Operations if it does not come back up within 15 minutes.

    sudo /sbin/reboot

    This scenario is also very unlikely.
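
If the Postgres log check above points at disk space, a quick usage check on the database host can confirm it (a minimal sketch; the paths are the Debian defaults used in the commands above):

    ssh pulsys@lib-postgres-prod1
    df -h /var/lib/postgresql /var/log/postgresql   # data and log filesystems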

Web Infrastructure Debugging

If you're not getting Honeybadger errors, the Rails application isn't erroring. Either the load balancer has detected that the site is unhealthy, or nginx has gone down on the boxes.

  1. Check the Rails logs to see if any requests are failing: Link to Logs

  2. If requests are failing and you aren't getting Honeybadger errors, there's probably something wrong with the boxes: disk space, read-only file systems, or similar. Operations will probably need to fix these issues.

  3. If no requests are failing, or no requests are coming through at all, nginx may be broken. Check the Passenger log (written to the nginx error log) on catalog1, catalog2, and catalog3 for errors.

    ssh pulsys@catalog1
    sudo tail -n 1000 /var/log/nginx/error.log
    
  4. If you find errors, restart nginx on these boxes: sudo service nginx restart. It may take some time for the load balancer to recognize that these boxes are healthy again.
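
If the error log in step 3 is empty but the site is still unreachable, it can also help to confirm whether nginx is running at all before restarting it (a sketch; systemctl is assumed to be available, as in the queue-clearing command above):

    ssh pulsys@catalog1
    sudo systemctl status nginx   # shows whether the service is active or failed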

Viewing The Load Balancer

See the Bibdata instructions as they are the same.