Disaster Recovery & Immediate Response
The Emergency Response Guide for OpenLibrary.org first-responders.
See also the Open Library Performance Monitoring Playbook
- 1. Report outage on #openlibrary and #ops on Slack, follow the escalation guide
- 2. ❗ Search previous post mortem reports for insights and solutions to common issues
- 3. Check the public monitoring dashboards and internal:
- 4. If the bare-metal machine is hanging, contact #ops on slack or manually restart baremetal
- 5. If there's a fiber outage and openlibrary.org's servers don't resolve (even to Sorry service), ask in the internal slack channels #openlibrary or #ops for openlibrary.org to be temporarily pointed to an active "Sorry Server"
- 6. Create a new postmortem issue and proceed to this guide:
Before continuing, check our Performance Monitoring Guide Post-mortems to see if this is a known / already solved problem.
- Is CPU load high on web nodes and/or is there a spike in # of transactions?
- Are ol-mem* slow to ssh into? We may want to /etc/init.d/memcached restart or even manually restart bare-metal if ssh hangs for more than 3 minutes
- Does homepage cache look weird?
Staff, refer also to the Wayback Team's Playbook
- There is an admin dashboard for blocking certain terms from appearing on Open Library: https://openlibrary.org/admin/spamword
- You can also block & revert changes per specific accounts via https://openlibrary.org/admin/people
- If an edit to a page contains any of the spam words, or the user's email is from one of the blacklisted domains, the edit won't be accepted. New registrations with emails from those domains are also not accepted.
Once services return, please make sure all services are running and that VMs are ssh'able (this can probably be a script).
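The ssh check mentioned above could be scripted along these lines. This is a minimal sketch, not an existing Open Library tool: the function name `check_ssh_hosts` is hypothetical, and the hosts passed to it should come from the current inventory (the names in the example are simply ones mentioned in this guide).

```shell
# Sketch: report which hosts accept an SSH connection.
# check_ssh_hosts is a hypothetical helper, not part of olsystem.
check_ssh_hosts() {
  local host failed=0
  for host in "$@"; do
    # BatchMode avoids interactive prompts; ConnectTimeout bounds a hanging host.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true >/dev/null 2>&1; then
      echo "OK $host"
    else
      echo "UNREACHABLE $host"
      failed=1
    fi
  done
  # Non-zero exit status if any host could not be reached.
  return "$failed"
}

# Example (host names taken from this guide, adjust as needed):
# check_ssh_hosts ol-www1 ol-web1 ol-web2 ol-home0 ol-solr1 ol-db1
```

Any host reported as UNREACHABLE is a candidate for the restart steps that follow.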
If a machine is up but not reachable, manually restart baremetal.
If a machine is up and reachable but services are not running, check docker ps on the host.
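As a rough aid for reading docker ps output, the sketch below prints which expected containers are not currently running. The helper name `missing_containers` is hypothetical, and the list of expected containers per host is an assumption you need to supply; the example uses the solr container names that appear in this guide.

```shell
# Sketch: print each expected container name that `docker ps` does not list.
# missing_containers is a hypothetical helper; expected names are supplied by the caller.
missing_containers() {
  local running name
  # Names of running containers, one per line.
  running=$(docker ps --format '{{.Names}}') || return 2
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string.
    printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
  done
}

# Example, run on ol-solr1:
# missing_containers solr_builder_solr_1 solr_builder_haproxy_1
```

No output means every listed container is running.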
You can restart solr via docker as:
ssh -A ol-solr1
docker restart solr_builder_solr_1 solr_builder_haproxy_1
There are a few servers which we expect to fill up. ol-db1/2 and ol-covers0/1 are candidates because their job is to store temporary or long term data. ol-home0 is another server, which generates data dumps, aggregates partner data, and generates sitemaps. These three groups of servers likely need a manual investigation when nagios reports their space is low.
The following will prune unattached images which were created more than 1 week ago (168h):
# to prune the build cache
docker builder prune
# prune unused images created more than 1 week ago
docker image prune -a --filter "until=168h"
Caution
When docker prune is being run, unfortunately the rest of docker typically becomes unresponsive; see this issue. When this happens, do not try and restart the server with ganeti. When you do, not only will docker still be unresponsive until the prune finishes, but additionally all docker containers that were running will stop and be unreachable.
Even with this being the case, a very common cause of disk fill is our docker images which have not been pruned during our deploy process. These can add up to many GB over time. Run docker image ls for a listing of images registered in docker to see if any of them can be pruned or deleted.
Docker logs can take up a ton of space. @cdrini mentions one solution is truncating the docker logs, shown here for the container with ID d12b...:
# see the sizes of a bunch of things on the VM
sudo df -h
# truncate the docker logs for container d12b518475e1
truncate -s 0 $(docker inspect --format='{{.LogPath}}' d12b518475e1)
Symptom: sudo df -h shows a bunch of 100% or 99%. Testing deploys might fail on occasion.
Containers and images can stick around on our dev server causing it to fill up. To free up space:
- Confirm with folks on slack, #team-abc, that there are no stopped containers that people care about. There shouldn't be. There is some risk of data loss if someone has made modifications to the file system inside a now stopped container. That is why we confirm!
- Run docker container prune
- Run docker image prune -a. This will remove all images not used by a container; all images should have Dockerfiles somewhere, so there's little risk of data loss. But it might be annoying because someone will have to rebuild a docker image they might care about and have to find the Dockerfile!
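To spot the disk-full symptom before and after pruning, df output can be filtered by a usage threshold. The following is a sketch only: `full_filesystems` is a hypothetical helper, the 90% default is an assumption, and it expects `df -P` (POSIX format) output on stdin with mount points that contain no spaces.

```shell
# Sketch: list mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` output on stdin; full_filesystems is a hypothetical helper.
full_filesystems() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5              # Capacity column, e.g. "99%"
    sub(/%/, "", use)     # strip the percent sign
    if (use + 0 >= t + 0) print $6, $5   # mount point, capacity
  }'
}

# Example:
# df -P | full_filesystems 95
```

Running it once before the prune and once after gives a quick check that space was actually reclaimed.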
There is a possibility supervisor can get confused (perhaps related to permissions/chown) and, instead of rotating logs, will start writing to /var/log/openlibrary/upstart.log until /dev/vda1 (or wherever root / is mounted) runs out of space. The solution is to restart "supervisor" (not openlibrary via supervisorctl but supervisor itself) on the afflicted node (e.g. ol-web4 in this example):
sudo service supervisor restart
If successful, you should see a new openlibrary.log with an update time more recent than upstart.log. Once you've confirmed this, you can truncate the erroneously inflated upstart.log to free up disk space:
sudo truncate --size 0 /var/log/openlibrary/upstart.log
After truncating, you'll want to restart openlibrary, e.g.
ssh ol-web4 sudo supervisorctl restart openlibrary
Sometimes an error occurs while compiling the homepage and an empty body is cached: https://github.com/internetarchive/openlibrary/issues/6646
Solution: You can hit this URL to clear the homepage memcache entry: https://openlibrary.org/admin/inspect/memcache?keys=home.homepage.en.pd-&action=delete . Note the .pd in the key. Remove it if you want to clear the cache for non-print-disabled users.
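The two URL variants can be written down as a small helper so nobody has to edit the key by hand during an incident. This is only a sketch: `homepage_cache_delete_url` is a hypothetical name, and the non-print-disabled key is an assumption derived by removing `.pd` from the key above, as the note describes. The URL still has to be opened in a browser session that is signed in as an admin.

```shell
# Sketch: print the memcache-delete URL for the English homepage cache entry.
# homepage_cache_delete_url is a hypothetical helper.
# Assumption: the non-print-disabled key is the print-disabled key with ".pd" removed.
homepage_cache_delete_url() {
  local audience="${1:-pd}" key
  if [ "$audience" = "pd" ]; then
    key="home.homepage.en.pd-"
  else
    key="home.homepage.en-"
  fi
  echo "https://openlibrary.org/admin/inspect/memcache?keys=${key}&action=delete"
}

# Example:
# homepage_cache_delete_url pd    # print-disabled users
# homepage_cache_delete_url all   # everyone else
```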
- If solr-updater or import-bot or deploy issue, or infobase (API), check ol-home
- If lending information e.g. books appear as available on OL when they are waitlisted on IA, this is a freak incident w/ memcached and we'll need to ssh into each memcached (ol-mem*) and sudo service memcached restart
- If there's an issue with ssl termination, static assets, connecting to the website, check ol-www1 (which is where all traffic enters and goes into haproxy -- which also lives on this machine). Another case is abuse, which is documented in the troubleshooting guide (usually haproxy limits or banning via nginx /opt/openlibrary/olsystem/etc/nginx/deny.conf)
- If there's a database problem, sorry (ol-db0 primary, ol-db1 replication, ol-backup1)
- If we're seeing ol-web1 and ol-web2 offline, it may be network, upstream, DNS, or a breaking dependency, CHECK NAGIOS + alert #ops + #openlibrary. Check the logs in /var/log/openlibrary/ (esp. upstart.log)
- If you notice a disk filling up rapidly or almost out of space... CREATE A BASILISK FILE (an empty 2GB placeholder dd'd file that we can delete and have the ability to ls, etc)
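A placeholder file like the one described in the last item could be created as follows. This is a sketch under assumptions: `make_placeholder` is a hypothetical helper, the example path is made up, and the 2048 MB default mirrors the 2GB figure above. The file is filled with zeros via dd so the blocks are really allocated and deleting it frees that much space.

```shell
# Sketch: create a zero-filled placeholder file that can be deleted in a
# disk-full emergency to regain enough room to run ls, rm, etc.
# make_placeholder is a hypothetical helper; path and size are assumptions.
make_placeholder() {
  local path="$1" size_mb="${2:-2048}"
  if [ -e "$path" ]; then
    echo "refusing to overwrite $path" >&2
    return 1
  fi
  # bs is given in bytes (1 MiB) so the command works with GNU and BSD dd.
  dd if=/dev/zero of="$path" bs=1048576 count="$size_mb" 2>/dev/null
}

# Example (hypothetical path):
# make_placeholder /var/tmp/placeholder.bin
```

The file has to exist before the disk fills up, so it only helps if it is created while there is still space to spare.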