flux-content: support new checkpoints list command #6798

Open: chu11 wants to merge 6 commits into master from issue6629_list_checkpoints

Conversation

Member

@chu11 chu11 commented May 6, 2025

Built on top of #6772. So I'll list this as WIP for the time being.

Problem: There is currently no way to get multiple checkpoints from the content modules.

Support an optional "index" key when getting content checkpoints. If the index is not available, return ENOENT to the caller. Support a new "flux-content checkpoints" command.

Member Author

chu11 commented Jun 5, 2025

re-pushed, rebasing on top of #6772

@chu11 chu11 force-pushed the issue6629_list_checkpoints branch 4 times, most recently from 4ed972b to 9bb06d9 Compare June 11, 2025 22:51
@chu11 chu11 force-pushed the issue6629_list_checkpoints branch 3 times, most recently from 110b841 to d53113d Compare June 25, 2025 19:16
@chu11 chu11 force-pushed the issue6629_list_checkpoints branch from d53113d to 95f0f74 Compare July 1, 2025 18:40
@chu11 chu11 changed the title WIP: flux-content: support new checkpoints list command flux-content: support new checkpoints list command Jul 1, 2025
@chu11 chu11 force-pushed the issue6629_list_checkpoints branch from 95f0f74 to 8934f84 Compare July 1, 2025 20:44
Member Author

chu11 commented Jul 1, 2025

removed WIP, now that #6772 has been merged

Member

garlick commented Jul 9, 2025

Would it be useful to add an option to flux content checkpoints for human readable output? I'm picturing a human trying to decide how far back in time to roll back and not making much sense of floating point time values. Perhaps they could be shown in tabular form with sequence, date, and blobref columns, with the date converted to ISO 8601?

Also is the index field in the RPC justified? Seems like we could just return all checkpoints and let the client filter. Even if there are hundreds, it won't be that much data.

Member Author

chu11 commented Jul 9, 2025

> Would it be useful to add an option to flux content checkpoints for human readable output? I'm picturing a human trying to decide how far back in time to roll back and not making much sense of floating point time values. Perhaps they could be shown in tabular form with sequence, date, and blobref columns, with the date converted to ISO 8601?

Good point. I originally wrote this back when "any checkpoint format" was something we were still allowing / supporting, but we've moved on from that notion.

> Also is the index field in the RPC justified? Seems like we could just return all checkpoints and let the client filter. Even if there are hundreds, it won't be that much data.

Hmmm, I didn't think of it that way. I think the index in the RPC was predominantly to maintain backwards compatibility with the current RPC. If we want to return all checkpoints, we'd have to alter the protocol in some way. Perhaps it returns an array of checkpoints. Or perhaps we can stream all of them back?

@chu11 chu11 force-pushed the issue6629_list_checkpoints branch from 8934f84 to f6a2bff Compare July 10, 2025 16:38
Member Author

chu11 commented Jul 10, 2025

> Hmmm, I didn't think of it that way. I think the index in the RPC was predominantly to maintain backwards compatibility with the current RPC. If we want to return all checkpoints, we'd have to alter the protocol in some way. Perhaps it returns an array of checkpoints. Or perhaps we can stream all of them back?

I had forgotten we have a checkpoint version number, so I guess it wouldn't be horrible to up the version. We could return an array of checkpoints vs. a single one.

But the big negative is the libkvs kvs_checkpoint API. We could keep it the way it is (i.e. deal w/ array index 0) but that doesn't feel right (hypothetically could deprecate it). Will ponder a bit on this.

@chu11 chu11 force-pushed the issue6629_list_checkpoints branch from f6a2bff to 79daf5b Compare July 10, 2025 23:40
Member Author

chu11 commented Jul 10, 2025

Re-pushed, going with this default output. It's admittedly hand-formatted. I wonder whether rewriting flux content in Python would let us reuse the output format support found in the other tools, but that is perhaps for another day.

>flux content checkpoints
Index      Sequence   Time                 Rootref
0          5          2025-07-10T16:39:39Z sha1-238260b968b162927aa7542491f68efdf217d86c
1          4          2025-07-10T16:39:38Z sha1-238260b968b162927aa7542491f68efdf217d86c
2          1          2025-07-10T16:39:33Z sha1-1bb67ca407fc7ca54fd1dbe5e14f40403125bb84

I ended up keeping the index in the RPC. It just felt like too much churn to change the RPC and the APIs around it, for this admittedly small need.
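
For illustration, the Time column in the sample output above could be derived from the floating-point checkpoint timestamp roughly as follows. This is a minimal sketch, not the code in this PR: `format_timestamp_iso()` and `format_checkpoint_row()` are hypothetical helper names, and the column widths (10, 10, 20) are inferred from the sample output rather than taken from the source.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not part of flux-core): format a floating-point
 * UNIX timestamp as an ISO 8601 UTC string such as 2025-07-10T16:39:39Z.
 * Fractional seconds are truncated.  Returns 0 on success, -1 on error.
 */
int format_timestamp_iso (double timestamp, char *buf, size_t len)
{
    time_t t = (time_t)timestamp;
    struct tm tm;

    if (!gmtime_r (&t, &tm))
        return -1;
    /* strftime() returns 0 if the result does not fit in buf */
    if (strftime (buf, len, "%Y-%m-%dT%H:%M:%SZ", &tm) == 0)
        return -1;
    return 0;
}

/* Hypothetical helper: format one table row.  The column widths are an
 * assumption inferred from the sample output above.
 * Returns 0 on success, -1 on error or truncation.
 */
int format_checkpoint_row (char *buf,
                           size_t len,
                           int index,
                           int sequence,
                           double timestamp,
                           const char *rootref)
{
    char timebuf[32];
    int n;

    if (format_timestamp_iso (timestamp, timebuf, sizeof (timebuf)) < 0)
        return -1;
    n = snprintf (buf,
                  len,
                  "%-10d %-10d %-20s %s",
                  index,
                  sequence,
                  timebuf,
                  rootref);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

A caller would invoke `format_checkpoint_row()` once per checkpoint entry and print each resulting line after the header row.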

Member

garlick commented Jul 14, 2025

Nice improvement on the command output!

> I ended up keeping the index in the RPC. It just felt like too much churn to change the RPC and the APIs around it, for this admittedly small need.

I guess I wasn't thrilled with the SQL query change that I had to look up (because I am a complete SQL noob), given that the index probably isn't necessary given the expected small size of the query result. Easier to just filter it on the client side IMHO. We own the clients and servers as well as the convenience API (internal only), so I think we could change it all we want. But if you feel strongly, this is OK I guess. We can always fix it later.

Member Author

chu11 commented Jul 14, 2025

> I guess I wasn't thrilled with the SQL query change that I had to look up (because I am a complete SQL noob), given that the index probably isn't necessary given the expected small size of the query result. Easier to just filter it on the client side IMHO. We own the clients and servers as well as the convenience API (internal only), so I think we could change it all we want. But if you feel strongly, this is OK I guess. We can always fix it later.

Don't feel strongly. I struggled the most with how to convert these API functions

int kvs_checkpoint_lookup_get_rootref (flux_future_t *f, const char **rootref);
int kvs_checkpoint_lookup_get_timestamp (flux_future_t *f, double *timestamp);
int kvs_checkpoint_lookup_get_sequence (flux_future_t *f, int *sequence);

i.e. they only return one checkpoint entry's info.

Hmmmm, pondering this for a second, perhaps we could adjust the API to be similar to the libeventlog API's eventlog_decode() (get array of stuff) and eventlog_entry_parse() (parse an entry). Underneath the covers it can still handle the different versions of the checkpoint. Lemme try that.

Member

garlick commented Jul 15, 2025

Sounds good. Just a convenience library, so whatever is going to work out for the localized use cases is fine IMHO.

@chu11 chu11 force-pushed the issue6629_list_checkpoints branch 2 times, most recently from 43ccae7 to 89efe1b Compare July 16, 2025 23:06
Member Author

chu11 commented Jul 16, 2025

re-pushed, going with a revamped libkvs api that looks like this now

/* return array of checkpoints */
int kvs_checkpoint_lookup_get_checkpoints (flux_future_t *f,
                                           const json_t **checkpoints);

int kvs_checkpoint_parse_rootref (json_t *checkpoint, const char **rootref);

/* sets timestamp to 0 if unavailable
 */
int kvs_checkpoint_parse_timestamp (json_t *checkpoint, double *timestamp);

/* sets sequence to 0 if unavailable
 */
int kvs_checkpoint_parse_sequence (json_t *checkpoint, int *sequence);

content-files and content-sqlite now return all checkpoints in a JSON array.

As expected it does lead to a healthy amount of "churn". Less in the code than I expected, but a lot more in the tests than I expected (a lot of tests call checkpoint-get and parse the response via jq).

codecov bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 83.76068% with 19 lines in your changes missing coverage. Please review.

Project coverage is 83.90%. Comparing base (7986127) to head (89efe1b).
Report is 1 commits behind head on master.

Files with missing lines                      Patch %   Lines
src/cmd/builtin/content.c                     81.48%    10 Missing ⚠️
src/modules/content-sqlite/content-sqlite.c   55.55%    8 Missing ⚠️
src/cmd/builtin/fsck.c                        87.50%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6798   +/-   ##
=======================================
  Coverage   83.90%   83.90%           
=======================================
  Files         540      540           
  Lines       90539    90616   +77     
=======================================
+ Hits        75963    76028   +65     
- Misses      14576    14588   +12     
Files with missing lines                      Coverage Δ
src/cmd/builtin/dump.c                        88.29% <100.00%> (+0.16%) ⬆️
src/cmd/builtin/restore.c                     90.52% <ø> (ø)
src/common/libkvs/kvs_checkpoint.c            89.39% <100.00%> (+0.68%) ⬆️
src/modules/content-files/content-files.c     73.88% <ø> (ø)
src/modules/kvs/kvs.c                         74.43% <100.00%> (+0.09%) ⬆️
src/modules/kvs/kvstxn.c                      80.26% <ø> (ø)
src/cmd/builtin/fsck.c                        75.51% <87.50%> (+0.51%) ⬆️
src/modules/content-sqlite/content-sqlite.c   70.94% <55.55%> (-1.18%) ⬇️
src/cmd/builtin/content.c                     84.61% <81.48%> (-1.48%) ⬇️

... and 7 files with indirect coverage changes

Member Author

chu11 commented Jul 17, 2025

Hit this build failure. It's from a very new test added in #6911, so just documenting it in case this is a racy pattern (could hypothetically just be a timeout).

expecting success: 
  	$waitfile -t 60 -p kvs.lookup trace3.out
  
  waitfile: trace3.out: 60.001s: Timeout after 60s
  -rw-r--r-- 1 runner runner 227 Jul 16 23:26 trace3.out
  [Jul16 23:26]  resource tx > attr.get [32]
  [  +4.301247]  resource tx > groups.get [25]
  [  +4.303290]  resource tx > groups.get [25]
  [  +4.303333]  resource tx > module.status [13]
  [  +4.303368]  resource tx > kvs.commit [239]
  not ok 28 - kvs.lookup request was captured

Member

@garlick garlick left a comment

LGTM! I had a couple of minor suggestions that you can ignore if you want.

Comment on lines +125 to 106
    if (json_unpack (checkpoint,
                     "{s:i s:s}",
                     "version", &version,
                     "rootref", &tmp_rootref) < 0)
        return -1;
Member

Should this set errno=EPROTO on error?

Comment on lines +149 to 130
    if (json_unpack (checkpoint,
                     "{s:i s?f}",
                     "version", &version,
                     "timestamp", &ts) < 0)
        return -1;
Member

set errno?

Comment on lines +171 to 152
    if (json_unpack (checkpoint,
                     "{s:i s?i}",
                     "version", &version,
                     "sequence", &seq) < 0)
        return -1;
Member

set errno

Comment on lines +40 to +41
int kvs_checkpoint_lookup_get_checkpoints (flux_future_t *f,
                                           const json_t **checkpoints);
Member

Maybe this one could just be called kvs_checkpoint_lookup_get() since it returns the whole enchilada?

Comment on lines 83 to 93
    if (flux_rpc_get_unpack (f,
                             "{s:o}",
                             "value", &o) < 0)
        goto error;
Member

Should we maybe check that o is an array and that it contains at least one entry, then fail with EPROTO if not?

Some users access element 0 without checking if it is NULL.
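
The check suggested above can be sketched as follows. This is only an illustration of the pattern under stated assumptions: `struct checkpoint`, `struct checkpoint_list`, and `checkpoint_list_first()` are hypothetical stand-ins so that the example is self-contained. In flux-core the response value is a jansson `json_t`, where the equivalent test would presumably use `json_is_array()` and `json_array_size()`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one decoded checkpoint entry.  In flux-core
 * the entries are jansson json_t objects; a plain struct keeps this
 * sketch self-contained.
 */
struct checkpoint {
    const char *rootref;
    double timestamp;
    int sequence;
};

/* Hypothetical stand-in for the decoded "value" array of a response. */
struct checkpoint_list {
    const struct checkpoint *entries;
    size_t count;
};

/* Return the entry at index 0, or NULL with errno set.
 * A missing or empty array is treated as a protocol error (EPROTO),
 * so callers never dereference a nonexistent element 0.
 */
const struct checkpoint *checkpoint_list_first (const struct checkpoint_list *list)
{
    if (!list) {
        errno = EINVAL;
        return NULL;
    }
    if (!list->entries || list->count == 0) {
        errno = EPROTO;
        return NULL;
    }
    return &list->entries[0];
}
```

Doing the validation once, where the response is decoded, means individual callers do not each need their own NULL check before using the first entry.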

Comment on lines -137 to +152
- checkpoint_get | jq -r .value | jq -r .rootref >rootref.out &&
+ checkpoint_get | jq -r .value[0] | jq -r .rootref >rootref.out &&
Member

Maybe for later cleanup, but these jq invocations could easily be combined, e.g.

checkpoint_get | jq -r .value[0].rootref

Two potential gotchas with the way these tests are structured:

  • I think without -e, jq returns success even if the requested key/array entry is not found
  • A pipeline succeeds if the last command in it succeeds, even if an earlier command failed (unless pipefail is set)

Not saying these tests are broken but they might be a little fragile as is.

chu11 added 6 commits July 18, 2025 10:00

Commit 1:
Problem: Several functions in the checkpoint API still take a
key parameter, even though the key is no longer used.

Remove the key input parameter in kvs_checkpoint_commit() and
kvs_checkpoint_lookup().  Update all callers accordingly.

Commit 2:
Problem: The internal KVS checkpoint API only supports the retrieval
of a single checkpoint.  However, some content backing modules may
store multiple checkpoints.

Update the KVS checkpoint API to support an array of checkpoints
to be returned and parsing functions to parse the individual
entries in the array.  Note that the content backing modules do not
yet support returning multiple entries.  This update is in preparation
for that future change.

Update all callers accordingly.

Commit 3:
Problem: There is currently no way to get multiple checkpoints
from the content modules.

Update checkpoint lookup to return all checkpoints in a json array.

Update all clients to handle new protocol response.

Commit 4:
Problem: There is no way for a user to list the checkpoints that
are available for recovering from.

Support a new "flux content checkpoints" command that will list all
of the available checkpoints for the currently configured content
backing store.

Commit 5:
Problem: There is no documentation for the new flux content checkpoints
command.

Add it to the flux-content(1) manpage.

Commit 6:
Problem: There are no tests for the new flux content checkpoints
command.

Add coverage in all content backing store tests.

@chu11 chu11 force-pushed the issue6629_list_checkpoints branch from 89efe1b to feeadef Compare July 18, 2025 17:10