FRC: Piece Multihash CID #759
-
What data should be considered the "base data" that gets fed into the hash function:
Option 1 seems the simplest and has the easiest compatibility story, which is why it was chosen for the initial proposals and implementations. cc @ribasushi @Gozala since I think you had some opinions here.
-
How would we like the community to refer to this new variety of Piece CID (and the multihash)? I've noticed that in https://github.com/filecoin-project/go-fil-commcid there's a lot of emphasis on the current commitments being A couple are:
Both are used to some extent in the current FRC document. If people have preferences or suggestions, that would be helpful. Note: the name of the multihash is also debatable, if people are interested in that. I'll raise a PR in the multicodec repo reserving the number and giving some context, and it might be better discussed there: multiformats/multicodec#331
-
Note: there are actually 2 sizes:
For security, I'm pretty sure 1 should actually be part of the hash function inputs. I.e.,
-
Should we do this for CommR in addition to CommP? It seems like it, given that there are places where people are using CommR in the same way they are using CommP, and in the same way others in the IPFS ecosystem are using other Merkle-tree hashes like BLAKE3. @Kubuxu, I recall you had some concerns about this.
-
Following up here to start moving this FRC along: can we start resolving some of these threads? Specifically,
If we can resolve these questions by 8/25, I'd suggest moving the FRC into
-
During the original process of coming up with this CommP CID thing, two sides (crypto and multiformats/ipld) met together and had to try to understand each other's worlds, and I think we did a pretty poor job of it; partly because of the rush, but I'm certain also partly due to personal limitations (on my own part). I'd invite the interested reader to review some of the original discussions, starting in this thread, maybe even with this comment by Juan: multiformats/multicodec#161 (comment)
We opted for, and are living in, a world of (a), yet it doesn't quite make sense: yes, you can theoretically navigate through a CommP binary Merkle tree as 64-byte IPLD blocks, each pointing you to 2 more IPLD blocks, but you don't know when to stop, because we never put any indication of height or length into the picture. So we made a garbage CID that's only really good as an ASCII-representable thing that you know points to some piece in Filecoin (i.e. the kind of signalling we now discourage when people register codecs in the multicodec table!); that's it. It always felt like a bit of a dirty compromise and I'm glad that it's being revisited.
We should have gone for option (b): just call it a hash function, and make it one by a reasonable definition of "hash function". Then assigning a multihash code for it is completely reasonable too. Then you can do whatever you like with the codec bit of the CID, including
I think #759 gets us to that place. Even SHA-2 pads input data to a multiple of 512 bits; we just happen to pad to a power of 2 after fr32. That's an internal implementation detail of the hash function. Adding the height or length of the input gives us the uniqueness we need to make it a secure hash function.
#808 is a bit tricky though, and I think I agree with some of @ribasushi's points there. Mainly because I want this to be a hash function, and hash functions are not reversible: a hash digest usually gives you no information other than the uniqueness of the input. #808 introduces a bit more information into the hash digest that has nothing to do with the cryptographic security of the hash function. From #808:
I think that #759 gives us this, doesn't it? It just happens to pad the "arbitrarily sized data" internally. SHA-2 requires inputs to be padded to 512-bit chunks, yet it can still refer to arbitrarily sized data. I think this same point applies to the other reasoning items in #808. If I ask a Kubo node for
It seems that the extensions in #808 are addressing the fact that we happen to also store the data in power-of-2-sized chunks. But if we treat that simply as an implementation detail of Filecoin that just happens to align with an implementation detail of the hash function, then that all goes away, and being able to reverse-engineer the actual stored data out of the Filecoin "piece" becomes a problem for Filecoin implementations. Perhaps this belongs on chain for some reason, but couldn't we just introduce a new integer for that? The sad fact that we have
(Conceptually, of course; but replace CommP with SHA2-256 and it roughly still works.)
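To make the "height or length" point concrete, here is a minimal Python sketch of why a bare binary Merkle root is ambiguous. It is only an illustration under loud assumptions: it uses plain SHA-256 over 32-byte nodes, skips fr32 and the 254-bit truncation entirely, and `node_hash` and `merkle_root` are hypothetical names, not anything from go-fil-commcid or the FRC.

```python
import hashlib

NODE_SIZE = 32  # bytes per tree node (assumption: mirrors the 32-byte CommP node size)


def node_hash(left: bytes, right: bytes) -> bytes:
    """Combine two child nodes into a parent node.

    Assumption: plain SHA-256 stands in for the real, truncated CommP hash.
    """
    return hashlib.sha256(left + right).digest()


def merkle_root(data: bytes) -> tuple:
    """Return (root, height) of a zero-padded binary Merkle tree over data."""
    # Number of 32-byte leaves needed, rounded up, with a minimum of 2.
    n_leaves = max(2, -(-len(data) // NODE_SIZE))
    # Round the leaf count up to a power of two.
    size = 1
    while size < n_leaves:
        size *= 2
    # Zero-pad the input so it fills every leaf.
    padded = data.ljust(size * NODE_SIZE, b"\x00")
    level = [padded[i:i + NODE_SIZE] for i in range(0, len(padded), NODE_SIZE)]
    height = 0
    # Hash adjacent pairs until a single root remains.
    while len(level) > 1:
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        height += 1
    return level[0], height
```

With this toy tree, a 128-byte input (4 leaves) and the 64-byte input made of its two inner nodes produce the same root at different heights, and an input and the same input with a trailing zero byte produce the same root at the same height. The root alone cannot tell these apart, which is why the height or length has to be carried alongside it.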
-
Aside from anyone wanting to change the naming to something shorter (suggestions absolutely welcome), AFAICT the remaining issue to resolve is the size of the smallest amount of data that does not need padding. AFAICT the main options are 127 bytes (the smallest amount where you can do the fr32 extension of 2 bits every 254 bits and get an integer number of bytes) or 31 bytes (the CommP tree node size after adding 1 byte of zeros for the fr32 extension). I wrote up how each might work and the changes needed to support them.
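For readers who want to check the arithmetic behind the 127-byte and 31-byte options, here is a rough Python sketch of the fr32 expansion. It assumes a little-endian bit layout in which every 254 bits of input are followed by 2 zero bits, and `fr32_pad` is a hypothetical helper written for this illustration, not an existing library function.

```python
CHUNK_BITS = 254   # payload bits per field element
SLOT_BITS = 256    # bits per element after the 2-bit extension


def fr32_pad(data: bytes) -> bytes:
    """Expand data so that every 254 input bits occupy a 256-bit slot.

    Sketch only: accepts inputs whose length is a multiple of 127 bytes,
    and assumes little-endian bit ordering.
    """
    if len(data) % 127 != 0:
        raise ValueError("sketch only handles multiples of 127 bytes")
    bits = int.from_bytes(data, "little")
    n_chunks = len(data) * 8 // CHUNK_BITS
    mask = (1 << CHUNK_BITS) - 1
    out = 0
    for i in range(n_chunks):
        # Take the i-th 254-bit chunk and place it at the start of the
        # i-th 256-bit slot, leaving the top 2 bits of the slot zero.
        chunk = (bits >> (CHUNK_BITS * i)) & mask
        out |= chunk << (SLOT_BITS * i)
    return out.to_bytes(n_chunks * SLOT_BITS // 8, "little")
```

127 bytes is 1016 bits, which is exactly 4 chunks of 254 bits, so it expands to 4 slots of 256 bits, i.e. 128 bytes. 31 bytes is 248 bits, which fits inside a single 254-bit chunk, so appending one zero byte gives a valid 32-byte node without any bit shifting.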
-
The current Piece CID semantics are not particularly friendly for use as a piece identifier. In particular, the size tends to be needed along with the root of the piece tree, which is why they are almost always passed around together.
This FRC proposes an alternative CID representation that can be used to describe pieces/CommPs.
There were a few different ways to do this, and after some discussions with people like @ribasushi and @Kubuxu, we ended up here. However, I've heard some alternatives proposed, and starting some threads here seems like the standard FRC approach.