Disaster Recovery & Immediate Response

Disaster Recovery

The Emergency Response Guide for OpenLibrary.org first-responders.

See also the Open Library Performance Monitoring Playbook

Responding to an Outage

1. Report the outage in #openlibrary and #ops on Slack and follow the escalation guide
2. ❗ Search previous post mortem reports for insights and solutions to common issues
3. Check the public and internal monitoring dashboards:
- NAGIOS
- HAProxy
4. If the bare-metal machine is hanging, contact #ops on Slack or manually restart baremetal
5. If there's a fiber outage and openlibrary.org's servers don't resolve (even to the Sorry service), ask in the internal Slack channels #openlibrary or #ops for openlibrary.org to be temporarily pointed to an active "Sorry Server"
6. Create a new postmortem issue and proceed to this guide:

Diagnostics Guide

Before continuing, check our Performance Monitoring Guide and post-mortems to see if this is a known / already solved problem.

Is CPU load high on web nodes and/or is there a spike in # of transactions?
- Handling Abuse & DDOS (Denial of Service Attack)
Are ol-mem* slow to ssh into? We may want to run /etc/init.d/memcached restart, or even manually restart the bare-metal machine if ssh hangs for more than 3 minutes
Does homepage cache look weird?
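The "hangs for more than 3 minutes" rule above can be checked mechanically instead of by watching a terminal. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the machine you are probing from; the function name `probe_with_deadline` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run a command with a deadline and report whether it hung.
# Relies on `timeout` from GNU coreutils, which exits with status 124
# when the deadline expires.

probe_with_deadline() {
    local seconds="$1"
    shift
    local status=0
    # Discard the command's own output; only its exit status matters here.
    timeout "$seconds" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "failed"
    fi
}
```

For example, `probe_with_deadline 180 ssh -o BatchMode=yes ol-mem0 true` would print `hung` if ssh has not returned within 3 minutes. The host name `ol-mem0` is an assumption standing in for whichever ol-mem* node is being checked.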

Spam

Staff, refer also to the Wayback Team's Playbook

There is an admin dashboard for blocking certain terms from appearing on Open Library: https://openlibrary.org/admin/spamword
You can also block & revert changes per specific accounts via https://openlibrary.org/admin/people

If an edit to a page contains any of the spam words, or the user's email is from one of the blacklisted domains, the edit won't be accepted. New registrations with emails from those domains are also not accepted.

Power Outages at Data Center

Once services return, please make sure all services are running and that VMs are ssh'able (this can probably be a script).

If a machine is up but not reachable, manually restart baremetal. If a machine is up and reachable but services are not running, check docker ps on the host.
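Since the reachability check "can probably be a script", here is one possible sketch. It only tests whether each host accepts an ssh connection; the `SSH_CMD` variable, the function name, and the 10-second connect timeout are assumptions for illustration, not existing Open Library tooling.

```shell
#!/usr/bin/env bash
# Sketch: report which hosts accept ssh after an outage.

# Command used to reach a host. Overridable so the loop can be dry-run
# without touching real machines.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=10}"

check_hosts() {
    local failed=0 host
    for host in "$@"; do
        # `true` is the cheapest remote command: success means ssh works.
        # $SSH_CMD is intentionally unquoted so its options split into words.
        if $SSH_CMD "$host" true </dev/null >/dev/null 2>&1; then
            echo "OK          $host"
        else
            echo "UNREACHABLE $host"
            failed=1
        fi
    done
    # Non-zero exit status when at least one host could not be reached.
    return "$failed"
}
```

A run such as `check_hosts ol-www1 ol-web1 ol-web2 ol-solr1` prints one line per host. For any host reported as OK, `ssh <host> docker ps` then shows whether its services are running.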

Solr Search Issues

You can restart solr via docker as:

ssh -A ol-solr1
docker restart solr_builder_solr_1 solr_builder_haproxy_1

Out of Space

Cleanup Deploys

There are a few servers which we expect to fill up. ol-db1/2 and ol-covers0/1 are candidates because their job is to store temporary or long-term data. ol-home0 is another; it generates data dumps, aggregates partner data, and generates sitemaps. These servers likely need a manual investigation when Nagios reports that their space is low.

The following prunes the build cache and then any unused images which were created more than 1 week ago (168h):

# to prune the build cache
docker builder prune

# prune unused images created more than 1 week ago
docker image prune -a --filter "until=168h"

Caution

When docker prune is being run, unfortunately the rest of docker typically becomes unresponsive; see this issue. When this happens, do not try to restart the server with ganeti. If you do, not only will docker still be unresponsive until the prune finishes, but additionally all docker containers that were running will stop and be unreachable.
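Because a prune can make docker unresponsive, it may be worth confirming that the disk is actually close to full before starting one. The sketch below reads the usage percentage from `df -P`; the 90% default threshold and the names `usage_pct`, `needs_prune`, and `DF_CMD` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a filesystem is full enough to justify a prune.

# Overridable so the parsing can be exercised without reading a real disk.
DF_CMD="${DF_CMD:-df -P}"

usage_pct() {
    # `df -P` prints one header line, then one line per filesystem.
    # Column 5 is the capacity, e.g. "87%"; strip the percent sign.
    $DF_CMD "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

needs_prune() {
    local path="$1" threshold="${2:-90}"
    local pct
    pct="$(usage_pct "$path")"
    # Succeeds only when usage was readable and is at or above the threshold.
    [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]
}
```

One way to use it: `needs_prune / 90 && docker image prune -a --filter "until=168h"`, which runs the prune only when the root filesystem is at least 90% full.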

Docker images

Even with this being the case, a very common cause of disk fill is Docker images which have not been pruned during our deploy process. These can add up to many GB over time. Run docker image ls for a listing of the images registered in Docker to see if any of them can be pruned or deleted.

Docker Logs

Docker logs can take up a ton of space. @cdrini mentions one solution is to truncate the log file of the offending container (here, the container with ID d12b...):

# see the sizes of a bunch of things on the VM
sudo df -h

# truncate the log of the container with ID d12b518475e1
truncate -s 0 $(docker inspect --format='{{.LogPath}}' d12b518475e1)
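To find out which container's log is worth truncating, the same `docker inspect` lookup can be applied to every container. This is a rough sketch, assuming the containers use a logging driver that writes a file (otherwise `LogPath` is empty and the container is skipped); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list Docker container log files, largest first.
# Run as root: log files under Docker's data directory are usually
# not readable by other users.

container_log_sizes() {
    local id path size
    for id in $(docker ps -aq); do
        path="$(docker inspect --format='{{.LogPath}}' "$id")"
        if [ -f "$path" ]; then
            size="$(wc -c < "$path")"
            # Columns: size in bytes, container ID, log path.
            printf '%d\t%s\t%s\n' "$size" "$id" "$path"
        fi
    done | sort -rn
}
```

The first line of the output identifies the container with the largest log, whose ID can then be passed to the truncate command above.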

ol-dev1 out of storage

Symptom: sudo df -h shows a bunch of 100% or 99%. Testing deploys might fail on occasion.

Containers and images can stick around on our dev server causing it to fill up. To free up space:

1. Confirm with folks on Slack, #team-abc, that there are no stopped containers that people care about. There shouldn't be. There is some risk of data loss if someone has made modifications to the file system inside a now-stopped container. That is why we confirm!
2. Run docker container prune
3. Run docker image prune. This removes dangling (untagged) images; add -a to remove every image not used by a container. All images should have Dockerfiles somewhere, so there's little risk of data loss. But it might be annoying because someone will have to rebuild a Docker image they care about and will have to find the Dockerfile!

upstart.log

There is a possibility supervisor can get confused (perhaps related to permissions/chown) and, instead of rotating logs, will start writing to /var/log/openlibrary/upstart.log until /dev/vda1 (or wherever root / is mounted) runs out of space. The solution is to restart "supervisor" (not openlibrary via supervisorctl, but supervisor itself) on the afflicted node (e.g. ol-web4 in this example):

sudo service supervisor restart

If successful, you should see a new openlibrary.log with an update time more recent than upstart.log. Once you've confirmed this, you can truncate the erroneously inflated upstart.log to free up disk space:

sudo truncate --size 0 /var/log/openlibrary/upstart.log

After truncating, you'll want to restart openlibrary, e.g.

ssh ol-web4 sudo supervisorctl restart openlibrary
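The "confirm openlibrary.log is newer, then truncate" step can be guarded so that upstart.log is never emptied while supervisor is still writing to it. A minimal sketch, assuming both logs live in the same directory as described above; the function name is hypothetical, and the function must run with enough privileges to write the log (for example from a root shell).

```shell
#!/usr/bin/env bash
# Sketch: truncate upstart.log only after log rotation has recovered.

truncate_stale_upstart_log() {
    local log_dir="${1:-/var/log/openlibrary}"
    local current="$log_dir/openlibrary.log"
    local stale="$log_dir/upstart.log"
    # -nt is true when the first file was modified more recently.
    if [ -f "$stale" ] && [ "$current" -nt "$stale" ]; then
        truncate -s 0 "$stale"
        echo "truncated $stale"
    else
        echo "skipped: $current is not newer than $stale" >&2
        return 1
    fi
}
```

If the function reports `skipped`, supervisor is probably still writing to upstart.log and should be restarted before trying again.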

Homepage Errors

Sometimes an error occurs while compiling the homepage and an empty body is cached: https://github.com/internetarchive/openlibrary/issues/6646

Solution: Visit this URL to clear the homepage memcache entry: https://openlibrary.org/admin/inspect/memcache?keys=home.homepage.en.pd-&action=delete . Note the .pd in the key; remove it to clear the cache for non-print-disabled users.
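To avoid hand-editing the key under pressure, the two variants of the URL can be generated. This sketch only prints the URL; the non-print-disabled key `home.homepage.en-` is derived by dropping `.pd` as described above, and the function name and its `pd` / `non-pd` arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the admin URL that deletes the cached homepage entry.

homepage_cache_delete_url() {
    local audience="${1:-pd}"
    local key
    case "$audience" in
        pd)     key="home.homepage.en.pd-" ;;   # print-disabled users
        non-pd) key="home.homepage.en-" ;;      # everyone else
        *)
            echo "usage: homepage_cache_delete_url [pd|non-pd]" >&2
            return 2
            ;;
    esac
    echo "https://openlibrary.org/admin/inspect/memcache?keys=${key}&action=delete"
}
```

Open the printed URL in a browser session that is logged in with admin rights.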

Notes

If there is an issue with solr-updater, import-bot, deploys, or infobase (the API), check ol-home
If lending information is wrong (e.g. books appear as available on OL when they are waitlisted on IA), this is a freak incident with memcached and we'll need to ssh into each memcached node (ol-mem*) and run sudo service memcached restart
If there's an issue with ssl termination, static assets, or connecting to the website, check ol-www1 (which is where all traffic enters and goes into haproxy, which also lives on this machine). Another case is abuse, which is documented in the troubleshooting guide (usually handled via haproxy limits or banning via nginx in /opt/openlibrary/olsystem/etc/nginx/deny.conf)
If there's a database problem, sorry (ol-db0 primary, ol-db1 replication, ol-backup1)
If we're seeing ol-web1 and ol-web2 offline, it may be network, upstream, DNS, or a breaking dependency: CHECK NAGIOS + alert #ops + #openlibrary. Check the logs in /var/log/openlibrary/ (esp. upstart.log)
If you notice a disk filling up rapidly or almost out of space, CREATE A BASILISK FILE (an empty 2GB placeholder file created with dd, which we can later delete to regain the ability to ls, etc.)
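The placeholder file mentioned in the last note could be created along these lines. This is a sketch under assumptions: the function name and the example path are hypothetical, and the size is expressed in MiB so that the default of 2048 gives the 2GB mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: reserve disk space with a zero-filled placeholder file that can
# be deleted later to get some headroom back on a full disk.

create_placeholder() {
    local path="$1" size_mb="${2:-2048}"
    if [ -e "$path" ]; then
        echo "refusing to overwrite $path" >&2
        return 1
    fi
    # 1048576 bytes = 1 MiB per block, so count is the size in MiB.
    # dd writes real zero blocks, so the space is actually reserved.
    dd if=/dev/zero of="$path" bs=1048576 count="$size_mb" 2>/dev/null
}
```

For example, `create_placeholder /basilisk.placeholder` writes a 2GB file (the path is illustrative). When the disk later fills up, removing that file with `rm` frees enough space to investigate.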

Is the server having trouble after rebooting?
Is OpenLibrary getting slammed with traffic, crawlers, or bad actors?
Is Search Overloading archive.org elastic search upstream?
