# TODO

## Restore

Problem: stored database logic.

Examples: routines (procedures and functions), triggers & views.

These database objects are not replicated, so the restore Job will not restore them properly. We would have to run the MySQL restore on every single Pod startup; however, restoring everything over and over again is risky.

Potential solutions:

- Split up database objects into separate SQL backup files (see the sketch below)
- Consider restoring database objects in a separate Job at the end, using a LoadBalancer
- Consider replicating non-ndb-binlogs between MySQL servers of the same cluster
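
A minimal sketch of the first option, assuming `mysqldump` is available in the backup container; this flag combination is a common recipe for dumping only stored programs, without schema or data:

```bash
#!/usr/bin/env bash
# Sketch: dump stored logic (routines, events, triggers) into its own SQL
# file, separate from the schema/data backup. DB_NAME and the output path
# are placeholders, not names from this repo.
DB_NAME="$1"

# --no-create-info/--no-data skip tables entirely; --routines/--events add
# stored programs; --triggers keeps trigger definitions attached.
mysqldump --no-create-info --no-data \
          --routines --events --triggers \
          "$DB_NAME" > "stored-logic-${DB_NAME}.sql"
```

The restore Job (or a per-Pod init step) could then replay only this file on startup, instead of re-running the full restore.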

## Backups

- Remove `DROP TABLE` from backups (?)
- Test with OVH object storage
- Consider writing the backup epoch to the backup's metadata.json (currently not being used)
  - Possible to output the epoch as part of the Backup Job? (see the sketch below)
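
A possible way to capture the epoch, assuming the Backup Job drives the backup via `ndb_mgm` and that the `START BACKUP` output contains a `StopGCP` line (worth verifying against the actual output):

```bash
# Sketch: run the backup, scrape the stop GCP/epoch from ndb_mgm's output,
# and write it into metadata.json. The output format is an assumption.
OUTPUT=$(ndb_mgm -e "START BACKUP WAIT COMPLETED")
EPOCH=$(echo "$OUTPUT" | grep -o "StopGCP: [0-9]*" | grep -o "[0-9]*")
echo "{\"backup_epoch\": $EPOCH}" > metadata.json
```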

## Global Replication

### Next steps

1. Add a test for fail-overs
2. Allow only writing certain databases to the binlog (see the sketch after this list)
3. Create a StatefulSet per cluster we are replicating from
4. Look into implementing multiple replica appliers
   1. Replication channel cutover only happens if the replica applier server dies
   2. Have multiple replica appliers running but use a Lease/Mutex, so only one replicates
5. Consider placing binlog servers behind a LoadBalancer (might not work)
6. Fix the error with multiple binlog files at startup
7. Add an API for users (e.g. hopsworksroot)
8. Consider disabling the running of user-defined init scripts (and relying on replication instead)
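
A minimal sketch for step 2, using `--binlog-do-db` (one flag per database) at mysqld startup; the database names are placeholders, and the usual caveats around cross-database statements and `binlog-do-db` apply:

```bash
# Sketch: only write selected databases to the binlog. Equivalent entries
# can go under [mysqld] in my.cnf. Database names are placeholders.
mysqld --defaults-file="$RONDB_DATA_DIR/my.cnf" \
       --log-bin \
       --binlog-do-db=hopsworks \
       --binlog-do-db=metastore
```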

### Query heartbeat table

Run this SQL from the replica applier:

```sql
-- Liveness probe on replica applier can check how long ago the HB was run
-- Time stays relative to the binlog server
SELECT id,
       TIMESTAMPDIFF(SECOND, updated_at, NOW()) AS seconds_since_update
FROM your_table_name;
```
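
A sketch of how the replica applier's liveness probe could consume this, assuming the placeholder table name above and an arbitrary staleness threshold:

```bash
# Sketch: fail the liveness probe if any heartbeat row is older than
# MAX_LAG_SECONDS. Threshold and table name are placeholders.
MAX_LAG_SECONDS=30
LAG=$(mysql -N -e \
  "SELECT MAX(TIMESTAMPDIFF(SECOND, updated_at, NOW())) FROM your_table_name;")
if [ "$LAG" -gt "$MAX_LAG_SECONDS" ]; then
  echo "heartbeat is $LAG seconds old - failing liveness probe"
  exit 1
fi
```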

### Set up active-active without conflict detection

One may want active-active replication for quicker fail-overs. However, without conflict detection in place, there is a risk of write conflicts.

- Writes can be disallowed on the MySQLds with (see the sketch below):
  - `SET GLOBAL read_only = ON;` and
  - `GRANT SUPER, REPLICATION SLAVE ON *.* TO 'replication_user'@'host';`
  - This works because `read_only` does not apply to accounts with the SUPER privilege, so the replication user can still apply writes.
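
A sketch of the passive-side setup, combining the two statements above; the key point is that `read_only` exempts SUPER accounts as long as `super_read_only` stays OFF. Credentials and host are placeholders:

```bash
# Sketch: block writes from ordinary clients on the passive side while the
# replica applier (holding SUPER) can still apply changes.
mysql -uroot -p"$MYSQL_ROOT_PASSWORD" <<'SQL'
SET GLOBAL read_only = ON;
-- Must stay OFF, otherwise even SUPER accounts cannot write:
SET GLOBAL super_read_only = OFF;
GRANT SUPER, REPLICATION SLAVE ON *.* TO 'replication_user'@'host';
SQL
```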

### Avoiding running out of binlogs

- Spin up another binlog server (do this early enough)
- DON'T KILL the primary binlog server
  - Just wait for the command to remove binlogs

### Error handling

Handle:

- Binlogs running out
  - Idea: try finding `LOST_EVENTS` programmatically with the `mysqlbinlog` program (see the sketch below)
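
A sketch of the programmatic check, grepping `mysqlbinlog` output for the incident event; the binlog path is taken from the debugging commands further below and may differ per deployment:

```bash
# Sketch: exit non-zero if any binlog contains a LOST_EVENTS (incident)
# event, i.e. a gap that replicas cannot recover from by replaying binlogs.
if mysqlbinlog /srv/hops/mysql-cluster/mysql/binlog.* 2>/dev/null \
     | grep -q "LOST_EVENTS"; then
  echo "LOST_EVENTS found - binlog gap, replicas need re-provisioning"
  exit 1
fi
```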

### Purging binlogs

How do we signal the binlog servers to purge their binlogs?

Idea:

- The primary cluster always knows about the secondary cluster
  - Otherwise, if it doesn't know about the secondary, it becomes difficult to decide when to purge its binlogs (especially if the secondary isn't even actively replicating)

Idea for active-active:

- Create an entry in the heartbeat table PER server_id. Each MySQL replication server can then write its applied epoch into this HB table. Every MySQLd can then regularly look into this table and thereby decide which binlogs it can purge (see the sketch below).
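
A sketch of that purge decision, assuming a heartbeat table `heartbeat.replica_epochs` with columns `server_id` and `applied_epoch` (both placeholders):

```bash
# Sketch: purge only binlogs that every replica has already applied.
# 1. Lowest epoch applied by any replica:
MIN_EPOCH=$(mysql -N -e \
  "SELECT MIN(applied_epoch) FROM heartbeat.replica_epochs;")

# 2. The binlog file that contains this epoch:
PURGE_TO=$(mysql -N -e \
  "SELECT SUBSTRING_INDEX(File, '/', -1) FROM mysql.ndb_binlog_index
   WHERE epoch <= $MIN_EPOCH ORDER BY epoch DESC LIMIT 1;")

# 3. Remove everything strictly before that file (the file itself is kept):
mysql -e "PURGE BINARY LOGS TO '$PURGE_TO';"
```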

### Debugging binlog files

Commands:

- `mysqlbinlog mysql-cluster/mysql-binlogs/binlog.0000* | grep "LOST"`
- `mysql -uroot -p$MYSQL_ROOT_PASSWORD`
- `SELECT * FROM mysql.ndb_binlog_index;`
- `SHOW MASTER STATUS;`
- `ls -l /srv/hops/mysql-cluster/mysql/binlog*`
- `PURGE BINARY LOGS TO 'binlog.000003';`
- `SHOW BINARY LOGS;`
- `SELECT @file:=SUBSTRING_INDEX(next_file, '/', -1), @pos:=next_position FROM mysql.ndb_binlog_index ORDER BY epoch DESC LIMIT 1;` (see the sketch below for using these values)
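
The last command's file/position pair is what a replica needs to (re)point at the binlog server; a sketch of wiring the two together, with host and user as placeholders and the older `CHANGE MASTER` syntax to match the commands above:

```bash
# Sketch: read the latest file/position from ndb_binlog_index and point a
# replica there.
read -r FILE POS < <(mysql -N -e \
  "SELECT SUBSTRING_INDEX(next_file, '/', -1), next_position
   FROM mysql.ndb_binlog_index ORDER BY epoch DESC LIMIT 1;")

mysql -e "CHANGE MASTER TO
            MASTER_HOST='binlog-server-host',
            MASTER_USER='replication_user',
            MASTER_LOG_FILE='$FILE',
            MASTER_LOG_POS=$POS;
          START SLAVE;"
```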

## Other

- Operator: add REDO log & UNDO log usage to the CRD status
- Create more values.yaml files for production settings
- Make data node memory changeable
- Add YCSB to the benchmark options
- Figure out how to run `SELECT count(*) FROM ndbinfo.nodes;` as a MySQL readiness check (see the sketch after this list)
  - Using `--defaults-file=$RONDB_DATA_DIR/my.cnf` and `GRANT SELECT ON ndbinfo.nodes TO 'hopsworks'@'%';` does not work
  - Error: `ERROR 1356 (HY000): View 'ndbinfo.nodes' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them`
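
A sketch of how the readiness probe itself could be wired once the privilege issue is solved (here sidestepping it by using the root credentials from my.cnf, which may or may not be acceptable):

```bash
# Sketch: readiness probe command for the MySQLd container. Exits non-zero
# (not ready) unless the ndbinfo query succeeds.
mysql --defaults-file="$RONDB_DATA_DIR/my.cnf" -N \
  -e "SELECT count(*) FROM ndbinfo.nodes;" > /dev/null
```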

Kubernetes Jobs to add:

- Increasing logfile group sizes (see the sketch below)
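
A sketch of what such a Job could execute: NDB logfile groups are not resized in place, but total UNDO capacity can be grown by adding a file. Group and file names are placeholders:

```bash
# Sketch: grow a logfile group by adding an undo file to it.
mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e \
  "ALTER LOGFILE GROUP lg_1
     ADD UNDOFILE 'undo_2.log'
     INITIAL_SIZE 512M
     ENGINE NDBCLUSTER;"
```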