41 commits
8842176
Create ipip-0000.md
mishmosh Apr 3, 2025
4ba68f0
Update and rename ipip-0000.md to ipip-0499.md
mishmosh Apr 3, 2025
6cc64cb
add extra attributes proposed in review
lidel Apr 15, 2025
d8b8389
incorporate kubo#10774
lidel Apr 15, 2025
600d1fc
Merge branch 'main' into patch-1
bumblefudge May 5, 2025
595588c
Update src/ipips/ipip-0499.md
2color Aug 12, 2025
41f9b86
add daniel as editor
2color Aug 12, 2025
229988f
edit summary and motivation
2color Aug 12, 2025
f37e610
edit summary
2color Aug 12, 2025
7a12f0a
edit parameters and design
2color Aug 12, 2025
ff69e56
edit user benefit and compatibility
2color Aug 12, 2025
09baf68
refine parameters and introduce a named profile
2color Aug 12, 2025
cffade8
Apply suggestions from code review
2color Aug 20, 2025
0402c84
edit based on hector's feedback
2color Aug 20, 2025
ec07e30
Apply suggestions from code review
2color Aug 20, 2025
f454912
add multibase encoding
2color Aug 20, 2025
9c621ba
address feedback from rvagg
2color Aug 20, 2025
c109c1a
Update ipip-0499.md
mishmosh Nov 15, 2025
383f9e3
Update src/ipips/ipip-0499.md
mishmosh Nov 20, 2025
e564968
Update src/ipips/ipip-0499.md
mishmosh Nov 20, 2025
bbd547f
Update src/ipips/ipip-0499.md
lidel Nov 20, 2025
70514b9
fix typo (the the)
mishmosh Nov 21, 2025
89c9c62
Merge branch 'main' into patch-1
lidel Dec 12, 2025
92352d7
feat(ipip-0499): add chunking algorithm and align profile tables
lidel Dec 12, 2025
9d0d415
fix(ipip-0499): correct kubo legacy profile
lidel Dec 12, 2025
a3dc7e2
fix(ipip-0499): document legacy profile filtering behavior
lidel Dec 13, 2025
94a1b79
fix(ipip-0499): note that legacy table includes non-UnixFS implementa…
lidel Dec 13, 2025
7a8d6ab
feat(ipip-0499): add implementation versions to legacy profiles table
lidel Dec 13, 2025
a3044d6
fix(ipip-0499): update HAMTDirectory threshold and clean up parameters
lidel Dec 13, 2025
5b19f2b
feat(ipip-0499): document symlink handling in profiles
lidel Dec 13, 2025
3a092a4
fix(ipip-0499): clarify HAMTDirectory threshold calculation methods
lidel Dec 13, 2025
123be3d
fix(ipip-0499): update metadata and add contributors
lidel Dec 13, 2025
263892a
feat(ipip-0499): document HAMTDirectory threshold estimation methods
lidel Jan 13, 2026
e2f95dd
chore: bump spec-generator to 1.7.0
lidel Jan 13, 2026
b832bcc
feat(ipip-0499): restructure document and rename profiles
lidel Jan 14, 2026
d7e81d7
feat(ipip-0499): document efficiency benefits of modern profile param…
lidel Jan 14, 2026
26162e2
Merge branch 'main' into patch-1
lidel Jan 16, 2026
37132f1
feat(ipip-0499): add singularity to divergence table
lidel Jan 16, 2026
0188e10
feat(ipip-0499): add test fixtures section with deterministic CIDs
lidel Jan 24, 2026
273a2d3
fix(ipip-0499): update test fixtures with chunk threshold vectors
lidel Jan 27, 2026
c2af85f
refactor(ipip-0499): restructure Motivation section
lidel Jan 28, 2026
169 changes: 92 additions & 77 deletions src/ipips/ipip-0499.md
@@ -1,6 +1,6 @@
---
title: 'IPIP-0499: UnixFS CID Profiles'
date: 2025-12-13
date: 2026-01-14
ipip: proposal
editors:
- name: Michelle Lee
@@ -54,110 +54,125 @@ This proposal introduces **configuration profiles** for CIDs that represent file

## Motivation

UnixFS CIDs are currently non-deterministic. The same file or directory can produce different CIDs across implementations, because parameters like chunk size, DAG width, and layout vary between implementations. Often, these parameters are not even configurable by users.
While CIDs and UnixFS DAGs are cryptographically verifiable, the same file or directory can produce different CIDs across UnixFS implementations, because DAG construction parameters like chunk size, DAG width, and layout vary between tools. Often, these parameters are not even configurable by users.

This creates three problems:
This creates two problems:

- **Verification difficulty:** The same content produces different CIDs across tools, making content verification unreliable.
- **Additional overhead:** Users must store and transfer UnixFS merkle proofs to verify CIDs, adding storage overhead, network bandwidth, and complexity.
- **Broken expectations:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs behave unpredictably.
- **Broken hash semantics:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs depend on DAG construction parameters. Simple CID comparison leads to false negatives.
- **Verification overhead:** Without knowing the original parameters, users must retrieve and compare entire DAGs to verify content, adding storage, bandwidth, and complexity.

Configuration profiles solve this by explicitly defining all parameters that affect CID generation. This preserves UnixFS flexibility (users can still choose parameters) while enabling deterministic results.
A potential solution is to define configuration profiles: well-known parameter presets that implementations can adopt when common conventions for DAG creation are desired.

## Detailed design

We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations.
See related discussion at <https://discuss.ipfs.tech/t/should-we-profile-cids/18507>

### UnixFS parameters

The following UnixFS parameters were identified as factors that affect the resulting CID:
The following [UnixFS](https://specs.ipfs.tech/unixfs/) parameters were identified as factors that affect the resulting CID:

1. CID version, e.g. CIDv0 or CIDv1
1. Multibase encoding for the CID, e.g. `base32`
1. Hash function used for all nodes in the DAG, e.g. `sha2-256`
1. UnixFS file chunking algorithm
1. UnixFS file chunk size or target (if required by the chunking algorithm)
1. UnixFS DAG layout, e.g. `balanced`, `trickle`
1. UnixFS file chunking algorithm and chunk size (e.g., fixed-size chunks of 256KiB)
1. UnixFS DAG layout:
- `balanced`: builds a balanced tree where all leaf nodes are at the same depth. Optimized for random access, seeking, and range requests within files (e.g., video).
- `trickle`: builds a tree optimized for streaming, where data can be consumed before the entire file is available. Useful for logs and other append-only data structures where random access is not important.
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` fanout, i.e. the number of bits determines the fanout of the `HAMTDirectory` (default bitwidth is 8 == 256 leaves).
1. `HAMTDirectory` threshold: max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [PBNode](https://ipld.io/specs/codecs/dag-pb/spec/) size.
1. Leaf Envelope: either `dag-pb` or `raw`
1. [HAMTDirectory](https://specs.ipfs.tech/unixfs/#dag-pb-hamtdirectory) fanout: the branching factor at each level of the HAMT tree (e.g., 256 leaves).
1. [HAMTDirectory threshold](https://specs.ipfs.tech/unixfs/#when-to-use-hamt-sharding): max `Directory` size before converting to `HAMTDirectory`, based on `PBNode.Links` count or estimated serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) size:
- `links-count`: `PBNode.Links` length (child count). Simple but ignores varying entry sizes.
- `links-bytes`: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead.
- `block-bytes`: full serialized dag-pb node size. Most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`.
1. Leaves: either [dag-pb wrapped](https://specs.ipfs.tech/unixfs/#dag-pb-node) or [raw](https://specs.ipfs.tech/unixfs/#raw-node)
1. Whether empty directories are included in the DAG. Some implementations may apply filtering.
1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering.
1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with a link to the file.
1. Presence and accurate setting of `Tsize`.
1. Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target).

The [UnixFS spec](https://specs.ipfs.tech/unixfs/) defines Type=4 for symlinks with target path stored in the Data field.

## CID profiles

To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY` or `YYYY-MM`.
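
The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the proposal: it assumes the date suffix is separated from the prefix by a hyphen and that `MM` is a two-digit month, neither of which is stated above. `is_valid_profile_name` is a hypothetical helper.

```python
import re

# Assumption: the date suffix follows a hyphen, and MM is 01-12.
_PROFILE_SUFFIX = re.compile(r"-(\d{4})(-(0[1-9]|1[0-2]))?$")


def is_valid_profile_name(name: str) -> bool:
    """Return True if the name ends in -YYYY or -YYYY-MM."""
    return _PROFILE_SUFFIX.search(name) is not None


for candidate in ("unixfs-2025", "dasl-2025-12", "unixfs", "unixfs-2025-13"):
    print(candidate, is_valid_profile_name(candidate))
```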

The initial profile in the series, **`unixfs-2025`**, captures the baseline default parameters used by multiple implementations as of November 2025.

| Parameter | `unixfs-2025` |
| ----------------------------- | ------------------------------------------------------- |
| CID version | CIDv1 |
| Hash function | sha2-256 |
| Chunking algorithm | fixed-size |
| Max chunk size | 1MiB |
| DAG layout | balanced |
| DAG width (children per node) | 1024 |
| `HAMTDirectory` fanout | 256 blocks |
| `HAMTDirectory` threshold | 256KiB (block-bytes) |
| Leaves | raw |
| Empty directories | included (opt-out) |
| Hidden entities | opt-in |
| Symlinks | preserved |

### HAMTDirectory threshold

This IPIP recognizes and documents the divergence across the ecosystem: the decision of when to switch to a HAMT-sharded directory can be based on child link count (naive), or on one of several serialized size estimation methods that keep directory blocks under a byte limit.

Methods:

- **`links-count`**: `PBNode.Links` length (child count). Simple but ignores varying entry sizes.

- **`links-bytes`**: sum of `PBNode.Links[].Name` and `PBNode.Links[].Hash` byte lengths. Underestimates actual size by ignoring UnixFS Data, Tsize, and protobuf overhead.

- **`block-bytes`**: full serialized [dag-pb](https://ipld.io/specs/codecs/dag-pb/spec/) node size. Recommended: most accurate, accounts for varint `Tsize` and optional metadata such as `mode` or `mtime`.
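
The three estimation methods can be compared side by side. This is a rough sketch under stated assumptions, not any implementation's actual code: the `Link` type and function names are hypothetical, every link is assumed to carry `Hash`, `Name`, and `Tsize`, and the size arithmetic follows the dag-pb protobuf layout (`PBLink`: Hash=1, Name=2, Tsize=3; `PBNode`: Data=1, Links=2), where each field tag fits in one byte.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for one PBNode.Links entry."""
    name: str
    cid: bytes   # binary CID stored in PBLink.Hash
    tsize: int   # cumulative size of the linked sub-DAG


def varint_len(n: int) -> int:
    """Number of bytes an unsigned protobuf varint occupies."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def links_count(links: list[Link]) -> int:
    # links-count: number of children only
    return len(links)


def links_bytes(links: list[Link]) -> int:
    # links-bytes: sum of Name and Hash byte lengths, nothing else
    return sum(len(l.name.encode("utf-8")) + len(l.cid) for l in links)


def block_bytes(links: list[Link], data: bytes) -> int:
    # block-bytes: size of the serialized dag-pb node
    total = 0
    for l in links:
        name = l.name.encode("utf-8")
        body = (
            1 + varint_len(len(l.cid)) + len(l.cid)   # Hash: tag + length + bytes
            + 1 + varint_len(len(name)) + len(name)   # Name: tag + length + bytes
            + 1 + varint_len(l.tsize)                 # Tsize: tag + varint
        )
        total += 1 + varint_len(body) + body          # Links entry: tag + length + body
    if data:
        total += 1 + varint_len(len(data)) + len(data)  # Data: tag + length + bytes
    return total
```

For two links named `a.txt` and `b.txt`, each with a 36-byte CID and a `Tsize` of 1000, under a plain directory node (`Data` = `0x08 0x01`), this sketch gives 82 for `links-bytes` and 104 for `block-bytes`, which illustrates how `links-bytes` underestimates the serialized size.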

## Legacy profiles

We also define a series of **legacy profiles**, used by various implementations as of November 2025:

| | `kubo-legacy-2025` (v0.39) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` |
| ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- |
| Based on | kubo v0.39 (`legacy-cid-v0`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | kubo v0.39 (`test-cid-v1`) | kubo v0.39 (`test-cid-v1-wide`) | 2025-12 |
| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 |
| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 |
| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | not specified |
| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | not specified |
| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified |
| DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified |
| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified |
| `HAMTDirectory` threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | **1MiB** (links-bytes) | not specified |
| Leaves | dag-pb | raw | raw | raw | raw | not specified |
| Empty directories | included | included | excluded | included | included | not specified |
| Hidden entities | opt-in | opt-in | opt-in | opt-in | opt-in | not specified |
| Symlinks | preserved | followed | followed | preserved | preserved | not specified |
1. Presence and accurate setting of `Tsize` (correct UnixFS has `Tsize` of child sub-DAGs).
1. [Symlink](https://specs.ipfs.tech/unixfs/#dag-pb-symlink) handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target).
1. [Mode](https://specs.ipfs.tech/unixfs/#mode-field): optional POSIX file permissions.
1. [Mtime](https://specs.ipfs.tech/unixfs/#mtime-field): optional modification timestamp.
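
To illustrate how a single parameter from the list changes the result, the sketch below computes CIDv1 strings for raw leaves (sha2-256, `base32`) of the same data under two fixed chunk sizes. It is a simplified illustration, not a UnixFS importer: it only produces leaf CIDs and does not build the parent `File` nodes, and `cidv1_raw` and `leaf_cids` are hypothetical helper names.

```python
import base64
import hashlib


def cidv1_raw(block: bytes) -> str:
    """CIDv1 string (base32) for a raw block hashed with sha2-256."""
    digest = hashlib.sha256(block).digest()
    # 0x01 = CIDv1, 0x55 = raw codec, 0x12 = sha2-256, 0x20 = 32-byte digest
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # multibase base32: lowercase RFC 4648 alphabet, no padding, "b" prefix
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def leaf_cids(data: bytes, chunk_size: int) -> list[str]:
    """Split data into fixed-size chunks and return one raw-leaf CID per chunk.

    Simplification: empty input yields no leaves here.
    """
    return [cidv1_raw(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


KIB = 1024
data = bytes(2 * 1024 * KIB)          # 2 MiB of zeros
small = leaf_cids(data, 256 * KIB)    # 256KiB chunks
large = leaf_cids(data, 1024 * KIB)   # 1MiB chunks
print(len(small), len(large))
print(small[0] != large[0])
```

Because the leaves differ, every parent node that links to them differs too, so the root CID changes even though the file bytes are identical.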

### Divergence in current implementations
Member

@obo20 any chance someone from the Pinata team can provide what defaults you use currently, so we can list your "CID profile" next to the Storacha one? Understanding divergence would be very useful.

Member

@acejam any chance someone from the Filebase team can provide what defaults you use currently when user data is onboarded and chunked into UnixFS? (Or are you using the defaults from a specific Kubo version behind the scenes?)

We would like to document your "CID profile" next to other popular ones in the ecosystem.

@lidel, I can't speak for them directly, but I recently added Filebase as a storage support option.

Through their IPFS RPC API endpoint, adding a file:

  • wrap with directory param defaults to false
  • cid version defaults to 0
  • Multihash: sha2-256
  • Chunksize: 256KiB

Not able to confirm for the remaining params, but it's a start!

When using their S3-compatible API endpoint, it uses the same defaults with no option to override (at least I never figured out how).

Member

I've included Singularity as an example showing that even among different "balanced" layout implementations we can see different variants that affect CID determinism for large files.

  • documented balanced and balanced-packed DAG layout variants
  • noted implicit boxo defaults for HAMT parameters that the project seems to be using
  • assumed it uses rclone's defaults for hidden files and symlinks

cc data-preservation-programs/singularity#525 @SethDocherty @parkan @2color to proofread if the "singularity" column here reflects reality or if there is more nuance to what Singularity does

Contributor Author

@mishmosh mishmosh Jan 16, 2026

Nice, thanks.

@lidel,

Nice addition with the balanced-packed section. What you've documented there sounds correct to me. I also believe one of the design reasons for Singularity going with a balanced-packed approach is to pack chunked content as best as possible to fit into Filecoin Sectors. Up to you, but it's some additional design context to consider if it would be worth noting

The table details for Singularity look correct, too.

I'll defer to @parkan, @2color for confirmation.


We analyzed the default settings across the most popular UnixFS implementations in the ecosystem. The table below documents the divergence that prevents deterministic CID generation today:

| Parameter | kubo (CIDv0) | helia | storacha | kubo (CIDv1) | dasl |
| ----------------------------- | ------------------------ | -------------------- | ------------------ | ----------------------------- | ------------ |
| Based on | v0.39 (`unixfs-v0-2015`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | v0.39 (`test-cid-v1` profile) | spec 2025-12 |
| CID version | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 |
| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 |
| Chunking algorithm | fixed-size | fixed-size | fixed-size | fixed-size | N/A |
| Max chunk size | 256KiB | 1MiB | 1MiB | 1MiB | N/A |
| DAG layout | balanced | balanced | balanced | balanced | N/A |
| DAG width (children per node) | 174 | 1024 | 1024 | 174 | N/A |
| HAMTDirectory fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | N/A |
| HAMTDirectory threshold | 256KiB (links-bytes) | 256KiB (links-bytes) | 1000 (links-count) | 256KiB (links-bytes) | N/A |
| Leaves | dag-pb | raw | raw | raw | N/A |
| Empty directories | included | included | excluded | included | N/A |
| Hidden entities | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | excluded (opt-in) | N/A |
| Symlinks | preserved | followed | followed | preserved | N/A |
| Mode (permissions) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A |
| Mtime (modification time) | excluded (opt-in) | excluded (opt-in) | not supported | excluded (opt-in) | N/A |
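
The divergence can also be listed programmatically. The sketch below transcribes two columns of the table above into dictionaries and reports the parameters on which they disagree. The dictionary layout and the `differing_parameters` helper are illustrative assumptions, not a format defined by this IPIP.

```python
# Values transcribed from the "kubo (CIDv0)" and "helia" columns above.
KUBO_CIDV0 = {
    "CID version": "CIDv0",
    "Hash function": "sha2-256",
    "Chunking algorithm": "fixed-size",
    "Max chunk size": "256KiB",
    "DAG layout": "balanced",
    "DAG width": 174,
    "HAMTDirectory fanout": "256 blocks",
    "HAMTDirectory threshold": "256KiB (links-bytes)",
    "Leaves": "dag-pb",
    "Empty directories": "included",
    "Hidden entities": "excluded (opt-in)",
    "Symlinks": "preserved",
    "Mode": "excluded (opt-in)",
    "Mtime": "excluded (opt-in)",
}

# Start from the kubo column and override only what the helia column changes.
HELIA = {
    **KUBO_CIDV0,
    "CID version": "CIDv1",
    "Max chunk size": "1MiB",
    "DAG width": 1024,
    "Leaves": "raw",
    "Symlinks": "followed",
}


def differing_parameters(a: dict, b: dict) -> list[str]:
    """Return the sorted names of parameters whose values differ."""
    return sorted(key for key in a if a[key] != b.get(key))


print(differing_parameters(KUBO_CIDV0, HELIA))
```

Any one of the reported parameters is enough to make the two tools produce different CIDs for the same input.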

**Terminology:**

- `included`: Always included in the DAG (no option to exclude)
- `excluded`: Always excluded from the DAG (no option to include)
- `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia)
- `opt-out`: Included by default; implementations provide a flag to exclude
- `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved.
- `followed`: Symlinks dereferenced and treated as target files/directories

See related discussion at <https://discuss.ipfs.tech/t/should-we-profile-cids/18507>
## Detailed design

We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations.

### The `unixfs-v1-2025` modern profile

Based on the research above, we define **`unixfs-v1-2025`** as an opinionated profile for implementations that want to adopt deterministic CID generation for UnixFS DAGs with CIDv1.

| Parameter | `unixfs-v1-2025` |
| ----------------------------- | -------------------- |
| CID version | CIDv1 |
| Hash function | sha2-256 |
| Chunking algorithm | fixed-size |
| Max chunk size | 1MiB |
| DAG layout | balanced |
| DAG width (children per node) | 1024 |
| HAMTDirectory fanout | 256 blocks |
| HAMTDirectory threshold | 256KiB (block-bytes) |

Member

I don't think Helia can currently do block-bytes. How much of a problem is this for making this the "modern" profile? Or does that just mean there's work to do.

https://github.com/ipfs/helia/blob/005c2a7a5e45349398cf750fd73f3c47591bb00a/packages/unixfs/src/commands/utils/is-over-shard-threshold.ts#L34-L45

Member

Oh, I see from your table that links-bytes is the most common across the implementations, so why not just use it? It's not precise, but it means that you don't need to change implementations as much to conform.

Member

@rvagg it just means there's work to do to implement block-bytes, but we will do it in Helia for parity with Kubo.

I hear you - standardizing on links-bytes sounds pragmatic at first. But the more I dug into this, the more I realized implementations aren't actually aligned on anything that feels solid:

  • Go used `>=` despite comments claiming `>` (just fixed in ipfs/boxo@6707376), while JS uses `>` (imo correctly)
  • but then JS hardcodes CID byte lengths assuming sha2-256 (link). Using a longer hash function would produce an incorrect estimate. Yes, this doesn't impact the default today, but it's not a solid foundation for a modern standard, and it would backfire in the future when we switch away from sha2-256

Both implementations need code changes to align on any method. There's no free lunch here.

Given that we're already introducing unixfs-v1-2025 with intentional breaking changes (1MiB chunks), this is our window to get it right and avoid ambiguity.

The goal of IPIP-499 is to create a sensible modern profile that is implementable without poking your eyes out. I feel it needs block-bytes: it is unambiguous, precise, hash-agnostic, and future-proof. Legacy behavior stays in unixfs-v0-2015 for anyone who needs it, and modern software can just choose to not implement it.

(This is my "strong belief weakly held", I'm happy to be convinced otherwise :))

Member

reasonable, as long as there is people-budget to do all the work on this 👍

Member

FYSA JS work tracked in ipfs/helia#941

| Leaves | raw |
| Empty directories | included (opt-out) |
| Hidden entities | excluded (opt-in) |
| Symlinks | preserved |
| Mode (permissions) | excluded (opt-in) |
| Mtime (modification time) | excluded (opt-in) |

### The `unixfs-v0-2015` legacy profile

This profile documents the default UnixFS DAG construction parameters used by Kubo through version 0.39 when producing CIDv0. It is provided for users who depend on CIDv0 identifiers generated by Kubo and need to reproduce them with other implementations, or verify content against existing CIDv0 references. The year 2015 in the name indicates that most of these parameters were chosen a decade ago, when the initial go-ipfs alpha was implemented, and the defaults have not been contested since.

| Parameter | `unixfs-v0-2015` |
| ----------------------------- | -------------------- |
| CID version | CIDv0 |
| Hash function | sha2-256 |
| Chunking algorithm | fixed-size |
| Max chunk size | 256KiB |
| DAG layout | balanced |
| DAG width (children per node) | 174 |
| HAMTDirectory fanout | 256 blocks |
| HAMTDirectory threshold | 256KiB (links-bytes) |
| Leaves | dag-pb |
| Empty directories | included |
| Hidden entities | excluded (opt-in) |
| Symlinks | preserved |
| Mode (permissions) | excluded (opt-in) |
| Mtime (modification time) | excluded (opt-in) |
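
The effect of the chunk size and DAG width rows in the two profile tables can be worked through numerically. The sketch below is a simplified model, assuming a balanced layout that fills each `File` node up to the DAG width before adding another level. `FileLayout` and `dag_shape` are hypothetical names, and only the two parameters needed for the arithmetic are modeled.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


@dataclass(frozen=True)
class FileLayout:
    chunk_size: int  # max bytes per leaf
    dag_width: int   # max links per File node


UNIXFS_V1_2025 = FileLayout(chunk_size=1 * MIB, dag_width=1024)
UNIXFS_V0_2015 = FileLayout(chunk_size=256 * KIB, dag_width=174)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def dag_shape(file_size: int, layout: FileLayout) -> tuple[int, int]:
    """Return (leaf count, levels of File nodes above the leaves)."""
    leaves = max(1, ceil_div(file_size, layout.chunk_size))
    levels, nodes = 0, leaves
    while nodes > 1:
        # Each level groups up to dag_width children under one parent.
        nodes = ceil_div(nodes, layout.dag_width)
        levels += 1
    return leaves, levels


print(dag_shape(1 * GIB, UNIXFS_V1_2025))
print(dag_shape(1 * GIB, UNIXFS_V0_2015))
```

Under this model a 1 GiB file becomes 1024 leaves under a single root with `unixfs-v1-2025`, but 4096 leaves and two levels of `File` nodes with `unixfs-v0-2015`, so the two profiles cannot share a root CID for such a file.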

### User benefit

Profiles provide 3 key advantages for working with content-addressed data:

1. **Predictable, deterministic behavior:** Profiles restore the expected property of content addressing: identical input data always produces identical CIDs, regardless of which implementation generates them.
1. **Predictable, deterministic behavior:** Profiles restore intuitive hash-like behavior: identical input data always produces identical CIDs, regardless of which implementation generates them.

2. **Lightweight verification:** Users can verify content without needing to rely on additional merkle proofs or CAR files.

@@ -167,7 +182,7 @@ Profiles provide 3 key advantages for working with content-addressed data:

UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [UnixFS specification](https://specs.ipfs.tech/unixfs/).

To generate CIDs in compliance with this IPIP, implementations MUST support the `unixfs-2025` profile. Legacy profiles are provided for historical context and MAY be supported for backward compatibility.
To generate CIDs in compliance with this IPIP, implementations MUST support the `unixfs-v1-2025` profile. The `unixfs-v0-2015` profile is provided for backward compatibility and MAY be supported by implementations that need to produce CIDs matching historical Kubo output.

Implementations SHOULD allow users to inspect default values and adjust configuration options related to CID generation.

4 changes: 3 additions & 1 deletion src/unixfs.md
@@ -501,7 +501,9 @@ node exceeds a size threshold between 256 KiB and 1 MiB. This threshold:

See [Block Size Considerations](#block-size-considerations) for details on block size limits and conventions.

For standardized threshold estimation methods that enable deterministic CID generation, see [IPIP-499: UnixFS CID Profiles](../ipips/ipip-0499.md).
:::note
For standardized threshold estimation methods that enable deterministic CID generation, see [IPIP-499: UnixFS CID Profiles](https://specs.ipfs.tech/ipips/ipip-0499/).
:::

### `dag-pb` `Symlink`
