FRC: Piece Multihash CID #759

aschmahmann · 2023-07-26T18:14:10Z

aschmahmann
Jul 26, 2023

The current Piece CID semantics are not particularly friendly for use as a piece identifier. In particular, the size tends to be needed along with the root of the piece tree, which is why they are almost always passed around together.

This FRC proposes an alternative CID representation that can be used to describe pieces/CommPs.

There were a few different ways to do this, and after some discussions with people like @ribasushi and @Kubuxu we ended up here. However, I've heard some alternatives proposed, and starting some threads here seems like the standard FRC approach.

aschmahmann · 2023-07-26T18:27:23Z

aschmahmann
Jul 26, 2023
Author

What data should be considered the "base data" that gets fed into the hash function:

1. Hash does the FR32 padding (current approach): this requires that the data fed into the hash function be some number of bytes N = 2^i * 127/128, where i is any integer ≥ 7 and ≤ 255. This matches how the Piece Gateway FRC sends the bytes without the FR32 padding, which means the consumer would reapply the padding to validate that the data matches the expected hash. That seems to support this approach.
2. Hash does not do the FR32 padding: this requires that the data fed into the hash function be some number of bytes N = 2^i.
3. Hash does the zero padding of the data to get to N = 2^i * 127/128 and then the FR32 padding.
   - Note: This would require the digest in the multihash to contain not just the tree height but enough information to compute the pre-padded size as well (e.g. a varint of the size of the data or, more efficiently, a varint of the amount of padding that was needed).

Option 1 seems the simplest and has the easiest compatibility story which is why it was chosen for the initial proposals and implementations.
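As a rough illustration of the size constraint in options 1 and 2, the Python sketch below checks whether an unpadded length has the form N = 2^i * 127/128 and computes the matching fr32-extended size. The function names are made up for this example and do not come from any Filecoin library.

```python
def is_option1_size(n: int) -> bool:
    """True if n (bytes, before fr32) equals 2**i * 127/128 for 7 <= i <= 255."""
    if n <= 0 or n % 127 != 0:
        return False
    quotient = n // 127                    # must be 2**(i - 7)
    if quotient & (quotient - 1):          # not a power of two
        return False
    exponent = quotient.bit_length() - 1 + 7
    return 7 <= exponent <= 255


def fr32_extended_size(n: int) -> int:
    """Size after fr32 extension (option 2's N = 2**i) for a valid option 1 size."""
    if not is_option1_size(n):
        raise ValueError("length is not of the form 2**i * 127/128")
    return n // 127 * 128


for size in (127, 639, 1016, 1024):
    print(size, is_option1_size(size))
```

Under option 3 the hasher itself would accept a length such as 639 and zero-pad it up to the next valid size; under option 1 the caller has to do that first.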

cc @ribasushi @Gozala since I think you had some opinions here

8 replies

Stebalien Jul 27, 2023

I don't think that matters. The data is guaranteed to be stored regardless, the extra length is just some extra metadata useful to clients.

ribasushi Jul 27, 2023

This matters in the context I am in though ( various storage brokers, trusting exclusively the current chain state ).

To be fair, if we accept arbitrary sizes in CIDs, I can simply shift the limitation to the broker itself, via something like "while CID X is valid, it can not be accepted as there can not exist a corresponding chain state with this value: please come back with a power-of-two size".

So... yeah, not feeling that strongly either way, but the ick factor is there.

Stebalien Jul 27, 2023

> This matters in the context I am in though ( various storage brokers, trusting exclusively the current chain state ).

I'm not so sure.

At least in terms of piece CIDs, the client/storage provider have to agree on the CID so the storage provider can't lie here.
Regardless, the CID is technically correct. Those extra zeros are sealed.

aschmahmann Aug 25, 2023
Author

A version of option 3 was added in #808 (along with a basic implementation in filecoin-project/go-fil-commcid#6). Reviews would be appreciated, along with a decision on whether we're happy with this approach 🙏.

aschmahmann Aug 25, 2023
Author

I'd like to flesh out some of @ribasushi's concerns here and see if they still stand and/or if I'm misunderstanding.

Say we have two v2 piece CIDs of the option 3 variety, Pad and NoPad, where Pad is NoPad with enough zeros added to the end of the data such that after fr32 extension (a word I'm stealing from @ribasushi to avoid using "padding" to refer to both fr32 and the zeros appended to the data) the size is 2^i.

Using the current version of #808 and two examples of a 1024 byte fr32 extended + padded binary tree:

NoPad: 127 x 0 | 127 x 1 | 127 x 2 | 127 x 3 | 508 x 0
Pad : 127 x 0 | 127 x 1 | 127 x 2 | 127 x 3 | 131 x 0

Both Pad and NoPad have the same v1 piece CID of baga6ea4seaqn42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa. However, they have different v2 piece CIDs:

NoPad: bafkzcibcauan42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa
Pad: bafkzcibdax4qfxticxolgseegik2stpfgkkuwyf6kufex3doorkvmzpjuxwe4dz4
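To see where the two v2 CIDs differ, here is a minimal Python sketch that decodes them using only the standard library. The field layout it assumes inside the multihash digest (one height byte, then a varint padding count, then the 32-byte root) is inferred from these two examples and the #808 draft at the time, so treat it as an assumption and not as the spec. `read_uvarint`, `cid_to_bytes` and `parse_v2_piece_cid` are hypothetical helper names.

```python
import base64

NOPAD = "bafkzcibcauan42av3szurbbscwuu3zjssvfwbpsvbjf6y3tukvlgl2nf5rha6pa"
PAD = "bafkzcibdax4qfxticxolgseegik2stpfgkkuwyf6kufex3doorkvmzpjuxwe4dz4"


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def cid_to_bytes(cid: str) -> bytes:
    """Strip the multibase prefix 'b' (lowercase base32) and decode the rest."""
    if not cid.startswith("b"):
        raise ValueError("expected a base32 multibase CID starting with 'b'")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)          # base64.b32decode needs padding
    return base64.b32decode(body)


def parse_v2_piece_cid(cid: str) -> dict:
    raw = cid_to_bytes(cid)
    version, pos = read_uvarint(raw, 0)
    codec, pos = read_uvarint(raw, pos)
    mh_code, pos = read_uvarint(raw, pos)
    mh_len, pos = read_uvarint(raw, pos)
    digest = raw[pos:pos + mh_len]
    # Assumed digest layout: height byte | varint padding | 32-byte root.
    height = digest[0]
    padding, root_start = read_uvarint(digest, 1)
    return {
        "version": version,
        "codec": codec,
        "mh_code": mh_code,
        "height": height,
        "padding": padding,
        "root": digest[root_start:],
    }


for name, cid in (("NoPad", NOPAD), ("Pad", PAD)):
    info = parse_v2_piece_cid(cid)
    print(name, info["height"], info["padding"], info["root"].hex())
```

Under that reading both CIDs carry height 5 and the same 32-byte root, and they differ only in the padding field: 0 for NoPad and 377 for Pad, which is 1016 - 639, the number of zero bytes needed to fill out the 639-byte example.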

IIUC the chain guarantees are:

NoPad:
- The bytes identified by the tuple (v1 piece CID, tree height) are stored by a given SP and if you ask for all the bytes back (with the fr32 extension removed) and run it through the fr32-commp hasher again you will get the same result.
- Because NoPad has no padding the chain also guarantees that the (option 3) v2 piece CID is what the SP was instructed to store and prove.
Pad:
- The bytes identified by the tuple (v1 piece CID, tree height) are stored by a given SP and if you ask for all the bytes back (with the fr32 extension removed) and run it through the fr32-commp hasher again you will get the same result.
- Because Pad has padding AND there is no information tracking "data size" (as opposed to padded size) used by the proofs (or even just on the chain itself to be verified outside of the consensus proofs), there is no way to prove how many trailing zeros the storer requested the SP to store vs. how many the SP needed to pad with to make Filecoin proofs happy.

This is a distinction, it's true. However, it doesn't seem to me like this matters in any meaningful way. For example:

Say the storer meant to store NoPad with the SP and some retriever found the CID for Pad and ended up asking for Pad from the SP. Does that matter? Sure they're getting extra zeros now, but that's the fault of the retriever and whoever gave them that CID. Maybe that was even intentional. The SP has the data for both Pad and NoPad and is continuously proving it has those bytes.
- As long as the SP has enough smarts to understand that both Pad and NoPad can be served from the same data this should be fine (e.g. the smarts could be something like using the NoPad multihash as the data identifier and, when requests for Pad come in, deterministically converting them to NoPad and then serving only the range of NoPad specified in the Pad CID).
- The SP could also not have any smarts here and just say "I don't serve v2 piece CIDs with padding", and as long as they get that error the client can use range requests and figure it out on their end. My 2c is that SPs should have the smarts here; they're fairly minimal and unlock some value, like being compatible with the IPFS HTTP Gateway Spec.
Say the storer asked some delegate to please get Pad on chain and then wants to verify it. The storer can deterministically convert Pad into NoPad and then verify that NoPad is on chain (since the NoPad is the only stuff on chain).

All this is to say that AFAICT the loss of chain provability here is minimal. Although to be fair I've thought about the Filecoin specifics here much less than both of you, so if I'm missing something here it'd be great to know (and have documented for when other folks ask in the event something comes up that causes us to prefer Option 1 to Option 3).

aschmahmann · 2023-07-26T18:34:38Z

aschmahmann
Jul 26, 2023
Author

How would we like the community to refer to this new variety of Piece CID (and the multihash)? I've noticed that in https://github.com/filecoin-project/go-fil-commcid there's a lot of emphasis on the current commitments being v1 since they may change in the future. Ideally it would be unambiguous that the commitments aren't being versioned just the CID representation.

A couple of options are:

v1 and v2 Piece CIDs, correct but perhaps ambiguous if/when a new commitment comes along
Piece Multihash CIDs, correct and unlikely to change when new commitments come along due to being more accurate. People might prefer the v1/v2 shorthand though

Both are used to some extent in the current FRC document. If people have preferences or suggestions that'd be helpful.

Note: the name of the multihash is also debatable, if people are interested in that. I'll raise a PR in the multicodec repo reserving the number and giving some context and it might be better discussed there multiformats/multicodec#331

0 replies

Stebalien · 2023-07-26T18:56:08Z

Stebalien
Jul 26, 2023

Note: there are actually 2 sizes:

1. The size of the unpadded data.
2. The final size of the padded piece.

For security, I'm pretty sure 1 should actually be part of the hash function inputs. I.e., <unpadded-length-prefix><padded-piece-data>.

2 replies

aschmahmann Jul 26, 2023
Author

As I understand it we cannot change the data that gets hashed into the inner digest without breaking compatibility everywhere (i.e. the digest of the sha2-256-trunc254-padded multihash used in v1 piece CIDs). However, we can stuff some data into the digest of the new multihash we're creating (currently called fr32-sha2-256-trunc254-padded-binary-tree).

At the moment we're just prefixing with tree height and asserting that it MUST be a specific size (N = 2^i * 127/128 where N and i are positive integers). This is option 1 here #759 (comment).

If you want the hash function to do zero padding (not the fr32 padding) for you and to support arbitrary sizes of data then yeah you need the size embedded in the multihash digest. That's option 3 in that linked discussion.

Stebalien Jul 26, 2023

> As I understand it we cannot change the data that gets hashed into the inner digest without breaking compatibility everywhere (i.e. the digest of the sha2-256-trunc254-padded multihash used in v1 piece CIDs).

We can, but it wouldn't be retroactive and probably a lot of work.

But really.... I think I'm wrong about the security here. As long as the length is stored in the hash digest, it should be fine (client won't sign the deal if the length isn't correct).

> If you want the hash function to do zero padding (not the fr32 padding) for you and to support arbitrary sizes of data then yeah you need the size embedded in the multihash digest. That's option 3 in that linked discussion.

I'd really like to be able to treat this as a black box where possible: data in -> hash out.

aschmahmann · 2023-07-26T19:28:45Z

aschmahmann
Jul 26, 2023
Author

Should we do this for CommR in addition to CommP? It seems like it, given that there are places where people are using CommR in the same way they are using CommP, and in the same way others in the IPFS ecosystem are using other merkle-tree hashes like Blake3.

@Kubuxu I recall you had some concerns about this

0 replies

davidd8 · 2023-08-18T21:22:59Z

davidd8
Aug 18, 2023

Following up here to start moving this FRC along, can we start resolving some of these threads? Specifically,

Which implementation option for the FRC - can we confirm it is Option 1?
The naming of these CIDs - my vote is for Piece CID v2, but not a big deal either way.
Extending this FRC to include CommR, or create a separate FRC? I'd encourage focusing on CommP, go through the process, and have a fast-follow FRC for CommR. Open to other opinions.

If we can resolve these questions by 8/25, I'd suggest moving the FRC into Last Call status.

0 replies

rvagg · 2023-09-19T05:16:37Z

rvagg
Sep 19, 2023
Maintainer

During the original process of coming up with this CommP CID thing, two sides (crypto and multiformats/ipld) met together and had to try to understand each other's worlds, and we did a pretty poor job of it I think; partly because of the rush but I'm certain also partly due to personal limitations (on my own part).

I'd invite the interested reader to review some of the original discussions, starting in this thread, maybe even with this comment by Juan: multiformats/multicodec#161 (comment)

> The internal hashing of all the relevant trees is -- by definition -- a proper merkle tree. One can either (a) treat all nodes in the tree as ipld objects (my preferred approach), or (b) elect to treat the whole intermediate tree as an implicit special purpose object/format for applications like filecoin.

We opted for, and are living in a world of (a), yet it doesn't quite make sense, because yes you can theoretically navigate through a CommP binary merkle as 64-byte IPLD blocks, pointing you to 2 more IPLD blocks, but you don't know when to stop because we never put any indication of height || length into the picture. So we made a garbage CID that's only really good as an ASCII representable thing that you know points to some piece in Filecoin (i.e. the kind of signalling we now discourage when people register codecs in the multicodec table!); that's it. It always felt like a bit of a dirty compromise and I'm glad that it's being revisited.

We should have gone for option (b), and just call it a hash function, and make it so by a reasonable definition of "hash function". Then assigning a multihash code for it is completely reasonable too. Then you can do whatever you like with the codec bit of the CID, including car (0x0202). The internals of the hash function are not particularly interesting or relevant to outsiders, it just happens to use a binary merkle across a zero padded, fr32 extended (good word!) form of the input data—many (most?) hash functions do weird stuff like this, we just made a new one that happens to work with the filecoin consensus mechanisms really well.

I think #759 gets us to that place. Even SHA-2 pads input data to a multiple of 512 bits; we just happen to pad to a power of 2 after fr32. That's an internal implementation detail of the hash function. Adding the height or length of the input gives us the uniqueness we need to make it a secure hash function.
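A small Python sketch of why the bare root is not enough on its own: a binary merkle root does not say whether its children are leaves or inner nodes, so two inputs of different length can share a root. The node function below (SHA-256 with the top two bits of the last byte cleared) is my reading of "sha2-256-trunc254" and should be taken as an assumption; `trunc254_node` and `merkle_root` are names invented for this example.

```python
import hashlib

NODE_SIZE = 32


def trunc254_node(left: bytes, right: bytes) -> bytes:
    """Hash two 32-byte children; clear the top two bits of the last byte."""
    digest = bytearray(hashlib.sha256(left + right).digest())
    digest[31] &= 0x3F
    return bytes(digest)


def merkle_root(blob: bytes) -> bytes:
    """Root of a binary tree over 32-byte leaves (leaf count must be 2**k)."""
    if not blob or len(blob) % NODE_SIZE:
        raise ValueError("input must be a non-empty multiple of 32 bytes")
    level = [blob[i:i + NODE_SIZE] for i in range(0, len(blob), NODE_SIZE)]
    if len(level) & (len(level) - 1):
        raise ValueError("leaf count must be a power of two")
    while len(level) > 1:
        level = [trunc254_node(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


# Four arbitrary leaves (already fr32-shaped: top two bits of byte 31 are 0).
a, b, c, d = (bytes([n]) * NODE_SIZE for n in range(1, 5))
four_leaves = a + b + c + d                                # 128 bytes, height 2
two_leaves = trunc254_node(a, b) + trunc254_node(c, d)     # 64 bytes, height 1

print(merkle_root(four_leaves) == merkle_root(two_leaves))
```

The two inputs differ in both content and length yet produce the same root, which is the ambiguity that carrying the height (or length) alongside the root removes.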

#808 is a bit tricky though and I think I agree with some of @ribasushi's points there. Mainly because I want this to be a hash function—which are not reversible things—a hash digest usually gives you no information other than the uniqueness of the input. #808 is introducing a bit more information into the hash digest that has nothing to do with cryptographic security of the hash function.

From #808:

> enables piece CIDs that refer to arbitrarily sized data rather than just data of size N = 2^i * 127/128 where N and i are integers

I think that #759 gives us this, doesn't it? It just happens to pad the "arbitrarily sized data" internally. SHA-2 requires inputs be padded to 512 bit chunks, yet it can still refer to arbitrarily sized data. I think this same point applies to the other reasoning items in #808. If I ask a Kubo node for bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze, I'm not giving it any information about the size of the data, just that there's a multihash in there that it should be able to cross-reference in its database with some bytes that match it.

It seems that the extensions in #808 are addressing the fact that we happen to also store the data in power-of-2 sized chunks; but if we treat that simply as an implementation detail of Filecoin, that just happens to align with an implementation detail of the hash function, then that all goes away and being able to reverse-engineer the actual stored data out of the Filecoin "piece" is a problem for Filecoin implementations. Perhaps this belongs on chain for some reason, but couldn't we just introduce a new integer for that? The sad fact that we have ZeroLengthSectionAsEOF as an option for go-car probably means we should have put the length somewhere handy, but that place probably isn't in the digest itself.

Client: here's my chunk of data to store, the CommP I calculated is 0xdeadbeef
Server: great, I got your chunk of data and agree that the CommP is 0xdeadbeef, I'll store it
Client: can I have my chunk of data back now? the one with the CommP of 0xdeadbeef
Server: yep, I know where it's stored too! I put it in a sector after I fr32 extended it and happened to leave a bunch of zeros after it too (coincidentally, this is also what I did to calculate CommP!) and then did a bunch of other stuff to it, but I can undo all of that and give it back to you ... or maybe I have one of the original versions of it laying around here somewhere I can just give you, I'll look it up in my database ...

(Conceptually of course; but replace CommP with SHA2-256 and it roughly still works).

1 reply

aschmahmann Sep 19, 2023
Author

> Even SHA-2 pads input data to a multiple of 512 bits; we just happen to pad to a power of 2 after fr32. That's an internal implementation detail of the hash function. Adding the height or length of the input gives us the uniqueness we need to make it a secure hash function.

While SHA-2 is padded, per section 5.1.1 of NIST FIPS 180-4 the padding is both mandatory and unique because it includes the length of the message. This means there is no (computationally feasible) way, given some data D and the hash of that data H, to produce a different D2 such that SHA256(D2) == H. However, if you allow the CommP multihash to operate on arbitrarily sized bytes and pad them, then there are obvious ways to get the same hash from different bytes, since hash({1,2,3}) == hash({1,2,3,0}) == hash({1,2,3,0,0})... This is not safe and fails to be a cryptographic hash function by way of failing second-preimage-resistance.
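The point can be reproduced with a toy example. The Python sketch below is not CommP: it uses plain SHA-256 over zero-padded input as a stand-in, with a 127-byte block size chosen only to echo the thread, and both function names are hypothetical. It shows that silent zero padding makes trailing zeros invisible, and that recording the amount of padding next to the digest (the idea behind #808) tells the inputs apart again.

```python
import hashlib


def zero_pad_hash(data: bytes, block: int = 127) -> bytes:
    """Toy hash: zero-pad to a multiple of `block`, then SHA-256. Not CommP."""
    padded = data + b"\x00" * (-len(data) % block)
    return hashlib.sha256(padded).digest()


def padding_tagged_hash(data: bytes, block: int = 127) -> tuple[int, bytes]:
    """Same toy hash, but the padding amount travels with the digest."""
    return (-len(data) % block, zero_pad_hash(data, block))


d1 = bytes([1, 2, 3])
d2 = bytes([1, 2, 3, 0])

print(zero_pad_hash(d1) == zero_pad_hash(d2))              # same digest
print(padding_tagged_hash(d1) == padding_tagged_hash(d2))  # distinguishable
```

Anyone holding `d1` can trivially produce `d2` with the same untagged digest, which is exactly the second-preimage failure described above.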

If Filecoin mainnet had started out producing padding information the way SHA-2 does, then this might've been fine, but that's not the world we're in because, as mentioned, the group designing CommP was not really taking into account how it might be used outside of the Filecoin proofs needed for consensus.

This means that, given that CIDs require cryptographically secure hash functions to identify (and verify) the content, we are left with:

A hash function that only works on data of a fixed size: FRC: Piece Multihash CID #758
A hash function that works on data of arbitrary size: frc(0069): change v2 piece multihashes to enable arbitrarily sized data #808

> #808 is introducing a bit more information into the hash digest that has nothing to do with cryptographic security of the hash function.

As above, #808 is introducing the ability to have a CommP hash function that works on data of arbitrary size, while still being cryptographically secure. That added feature (supporting arbitrary size) comes at the cost of the bit more information pasted into the digest because, unfortunately, that data is not available in the tree itself.

While I don't feel super strongly about #758 vs #808 my inclination is that what we're going for is minimizing the likelihood of people shooting themselves in the foot. For example:

Today: Use v1 CommP CIDs -> evidence of people shooting themselves in the foot with misuse by using them alone for validation when they need at least the tree height
FRC: Piece Multihash CID #758: If people think it can be used with arbitrarily sized data, or try and manipulate it into doing that we could end up with problems (e.g. a non-cryptographic hash function being used as a cryptographic hash function)
frc(0069): change v2 piece multihashes to enable arbitrarily sized data #808: Unless the padding info to create these kinds of CIDs makes it on chain, I could see people getting confused by the ability to have N piece CIDs that refer to data + (between 0 and N-1) zeros. The implementation is also a little more complex.
- Not worse than FRC: Piece Multihash CID #758, but an FYI that since the CIDs can be longer in the cases where there is padding they will (frequently?) be long enough that even with base36 encoding it wouldn't fit in a domain-name label, so if that's something you needed as is the case for some applications loaded via public IPFS gateways that would be a problem in need of addressing - although the situation is no worse than for sha2-512.

So I think the question is given those tradeoffs what do people want feature-wise from these CIDs, and what do we feel would put users on a strong footing where they are less likely to misuse the primitive?

aschmahmann · 2023-10-10T21:49:34Z

aschmahmann
Oct 10, 2023
Author

Aside from anyone wanting to change the naming to something shorter (suggestions absolutely welcome) AFAICT the remaining issue to resolve is the size of the smallest amount of data that does not need padding. AFAICT the main options are 127 bytes (smallest amount where you can do the fr32 extension of 2 bits every 254 bits and get an integer number of bytes) or 31 bytes (the CommP tree node size after adding 1 byte of zeros for the fr32 extension).

Wrote up how each might work/changes to support them.

Pad up to 127
- Requires changing "padding data it must be less than the size of the underlying data" to be "padding data it must be less than the size of the underlying data for data at least 64 bytes. Smaller data must be padded up to 127 bytes"
Pad up to 31 (for height 0) and 63 (for height 1)
- Requires either some text around how to handle these OR (perhaps equivalently) redefining fr32 padding from "data must be a multiple of 127 bytes. add 2 zero bits every 254 bits" to "data must be a multiple of 1 byte. add 2 zero bits every 254 bits, and if after padding you don't have a complete byte add enough zero bits so you do". Resulting conversions from the binary tree leaves to the data bytes are:
- Height = 0
  - Data = the padded 32 bytes - 1 - padding
  - Data is size 0 to 31 bytes
  - Should be able to assert that all the padding bytes are 0
- Height = 1
  - Data =
    - Remove the last two bits of byte 32 and shift bytes 33-64 back two bits (making the last 2 bits of byte 64 zero)
    - Return the first 63 - padding bytes
  - Data is of size 32-63 bytes
  - Should be able to assert that all the padding bytes are 0
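For reference, the existing rule ("data must be a multiple of 127 bytes, add 2 zero bits every 254 bits") can be sketched in a few lines of Python. The bit order used here, where each 127-byte chunk is read as a little-endian integer and each 254-bit field is written as a 32-byte little-endian value with its top two bits zero, is my understanding of how Filecoin implementations lay out the bits and should be read as an assumption. `fr32_extend` and `fr32_strip` are names invented for this example.

```python
MASK_254 = (1 << 254) - 1


def fr32_extend(data: bytes) -> bytes:
    """Expand each 127-byte chunk to 128 bytes: 2 zero bits after every 254."""
    if len(data) % 127:
        raise ValueError("data must be a multiple of 127 bytes")
    out = bytearray()
    for off in range(0, len(data), 127):
        chunk = int.from_bytes(data[off:off + 127], "little")
        for k in range(4):                      # 4 fields of 254 bits each
            field = (chunk >> (254 * k)) & MASK_254
            out += field.to_bytes(32, "little")
    return bytes(out)


def fr32_strip(extended: bytes) -> bytes:
    """Inverse of fr32_extend; rejects input whose two pad bits are not zero."""
    if len(extended) % 128:
        raise ValueError("extended data must be a multiple of 128 bytes")
    out = bytearray()
    for off in range(0, len(extended), 128):
        chunk = 0
        for k in range(4):
            start = off + 32 * k
            field = int.from_bytes(extended[start:start + 32], "little")
            if field >> 254:
                raise ValueError("top two bits of a 32-byte field must be zero")
            chunk |= field << (254 * k)
        out += chunk.to_bytes(127, "little")
    return bytes(out)


sample = bytes(range(127))
assert fr32_strip(fr32_extend(sample)) == sample
```

The "pad up to 31 / 63" option would need this rule loosened to accept lengths that are not multiples of 127, which is part of why the 127-byte minimum is the simpler choice.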

2 replies

ribasushi Oct 10, 2023

Very strong vote for the first option:

Always pad to 127/128 minimum
Therefore Height < 2 is invalid
Can remove the "64" language: it is cleaner to just say: payload must be of size 127*(2^N), and will be zero-padded if necessary
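A minimal Python sketch of the rule as stated here, assuming the tree height counts levels above 32-byte leaves (so a 127-byte payload, which is 128 bytes after fr32, has height 2). `pad_plan` is a hypothetical helper, not part of any existing library.

```python
def pad_plan(data_len: int) -> tuple[int, int, int]:
    """Return (payload_size, zero_padding, tree_height) for data_len bytes.

    payload_size is the smallest 127 * 2**N that is >= data_len.
    """
    if data_len <= 0:
        raise ValueError("data must be at least 1 byte")
    chunks = -(-data_len // 127)               # ceil(data_len / 127)
    pow2 = 1 << (chunks - 1).bit_length()      # next power of two >= chunks
    payload = 127 * pow2
    height = pow2.bit_length() - 1 + 2         # N + 2; 128 bytes = 4 leaves
    return payload, payload - data_len, height


for n in (1, 127, 128, 639, 1016):
    print(n, pad_plan(n))
```

With this rule the smallest possible tree has height 2, and the 639-byte example from earlier in the thread pads to 1016 bytes with 377 zeros at height 5.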

Gozala Oct 11, 2023

The 127/128 is also my favorite given how much simpler it is than others
