Commit 51d8c7a

Update docs (#143)
* feat: update general docs for matching current impl
* feat: add service overviews for File Retriever, DAG Indexer, and Object Mapping Indexer in documentation
* docs: add object mapping indexer service overview to DAG Indexer documentation
* docs: clarify the meaning of archived files in README
* docs: update DAG Indexer documentation to include support for `system.remark_with_event` extrinsics
1 parent 6508452 commit 51d8c7a

File tree

4 files changed

+195
-7
lines changed

README.md

Lines changed: 34 additions & 7 deletions
@@ -4,13 +4,40 @@ This service is used for retrieving already archived files from the Autonomy's D
44

55
## FAQs
66

7-
- **What means a file to be archived?:** When a file is uploaded to Autonomy’s network as a series of transactions, it is not immediately distributed as segmented pieces from which farmers generate storage proofs. There's some variable delay between the transaction submission and the file availability that depends on the network congestion among other factors.
7+
- **What does it mean for a file to be archived?** When a file is uploaded to the Autonomys network as a series of transactions, it is not immediately distributed as the segmented pieces from which farmers generate storage proofs. A file counts as archived only once that distribution has happened. The delay between transaction submission and file availability varies with network congestion, among other factors.
88

99
## How it works
1010

11-
When a user requests the download of a file, the file retriever asks the object mapping indexer the position of the CID in the DSN and uses the subspace gateway to download it. For big files if there are more objects to be retrieved this process is repeated until all the object are downloaded.
12-
13-
Meanwhile, the object mapping indexer is constantly listening to the Autonomy's network for keeping track of the position of uploaded objects in the DSN.
14-
15-
![File Download Diagram](https://github.com/autonomys/auto-files-gateway/blob/main/.github/diagrams/file-download.png)
16-
11+
The flow below illustrates how file retrieval is performed in **Files Gateway**:
12+
13+
- When a **user requests a specific content identifier (CID)**, the **File Retriever** coordinates the process by querying various indexers to obtain both metadata and storage location details.
14+
- The **Object Indexer** and **DAG Indexer** maintain a constantly updated map of where data resides across the network and how it is structured within content-addressable graphs.
15+
- Once the necessary information is gathered, the **Gateway** and participating **Nodes** handle the actual data transfer, retrieving the requested file from decentralized storage and delivering it back to the user.
16+
- This coordinated interaction ensures that even though data is stored in a distributed manner, retrieval remains seamless and transparent to the end user.
17+
18+
```mermaid
19+
sequenceDiagram
20+
participant User
21+
participant FileRetriever as FILE-RETRIEVER
22+
participant ObjectIndexer as ObjectIndexer
23+
participant DAGIndexer as DAGIndexer
24+
participant Gateway as GATEWAY
25+
participant Node as NODE
26+
27+
User->>FileRetriever: 1. Asks for a CID
28+
FileRetriever->>DAGIndexer: 2. Requests the list of nodes needed<br/> for retrieving the requested CID (possibly a partial retrieval)
29+
FileRetriever->>ObjectIndexer: 3. Requests the position of the object on the DSN
30+
Note over ObjectIndexer: Listens to every object upload,<br/>saving its position in the DSN<br/>for use during file download.
31+
Note over DAGIndexer: Listens to every "System.remark" extrinsic,<br/> decodes it, and saves its IPLD metadata.
32+
FileRetriever->>Gateway: 4. Fetches the object using<br/>the Subspace Gateway
33+
Node->>ObjectIndexer: Sends object mappings
34+
Node->>DAGIndexer: Sends blocks
35+
Gateway->>Node: Retrieves object metadata from the node
36+
```
37+
38+
For a more detailed overview of each service, see:
39+
40+
- [File Retriever](docs/file-retriever.md)
41+
- [DAG Indexer](docs/dag-indexer.md)
42+
- [Object Mapping Indexer](docs/object-mapping-indexer.md)
43+
- [Subspace Node & Gateway](https://github.com/autonomys/subspace)

docs/dag-indexer.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
### DAG Indexer — Service Overview
2+
3+
SubQuery-based indexer that listens to `system.remark` and `system.remark_with_event` extrinsics and indexes DAG-PB nodes produced by `@autonomys/auto-dag-data`. Persists node metadata, links, sizes, and upload options for downstream services like File Retriever.
4+
5+
### Flow
6+
7+
- SubQuery runtime connects to Subspace network (`wss://rpc.mainnet.subspace.foundation/ws`).
8+
- For each successful `system.remark` or `system.remark_with_event` extrinsic, decode DAG-PB, compute CID and BLAKE3, extract links and IPLD metadata.
9+
- Save entity `Node` with block/extrinsic identifiers and metadata.
10+
11+
### Data Model (GraphQL schema)
12+
13+
- `Node` fields: `cid`, `type`, `linkDepth`, `name`, `blockHeight`, `blockHash`, `extrinsicId`, `extrinsicHash`, `indexInBlock`, `links[]`, `size`, `blake3Hash`, `timestamp`, `uploadOptions`.
14+
- `uploadOptions` includes optional compression and encryption options.
15+
16+
### Mapping Logic (handleCall)
17+
18+
- Skip failed extrinsics.
19+
- Decode remark hex to PBNode; derive `cid`, `links`, `blake3Hash`.
20+
- Decode `IPLDNodeData` for `type`, `size`, `name`, `linkDepth`, `uploadOptions`.
21+
- Persist as `Node` entity with composite indexes on `(blockHeight,indexInBlock)` and `(timestamp,indexInBlock)` for efficient ordering.
22+
23+
### Configuration (project.yaml)
24+
25+
- `dataSources[0]` kind: `substrate/Runtime` with handler `handleCall` and filter `{ module: system, method: remark }` or `{ module: system, method: remarkWithEvent }`.
26+
- Network `chainId` and endpoint set to Subspace Mainnet.
27+
- Runner `@subql/node` `>=5.2.9` with `unsafe: true`, `historical: false`.
28+
29+
### Consumers
30+
31+
- File Retriever queries DAG Indexer storage to get:
32+
  - Node metadata for `/files/:cid/metadata` and validation.
33+
  - Sorted chunks for multi-node files.
34+
- Object Mapping Indexer `/ipld-nodes` service.
35+
36+
### Notes
37+
38+
- Only nodes with `node.Data` are saved; others are ignored with a warning.
39+
- `size` defaults to `0` if missing in metadata.
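
The mapping rules and notes above can be illustrated with a minimal TypeScript sketch. It is not the indexer's actual `handleCall`: the `ExtrinsicInfo`, `DecodedNode`, and `NodeRecord` shapes and the `toNodeRecord` and `byBlockOrder` helpers are hypothetical names introduced here, and the DAG-PB decoding step is assumed to have already happened.

```typescript
// Hypothetical shapes: only the field names come from the documented `Node` entity.
type ExtrinsicInfo = { success: boolean; blockHeight: number; indexInBlock: number };
type DecodedNode = {
  cid: string;
  blake3Hash: string;
  links: string[];
  // Stands in for the decoded `node.Data` (IPLDNodeData); undefined when absent.
  data?: { type: string; name?: string; size?: number; linkDepth: number };
};
type NodeRecord = {
  cid: string;
  blake3Hash: string;
  links: string[];
  type: string;
  name?: string;
  size: number;
  linkDepth: number;
  blockHeight: number;
  indexInBlock: number;
};

// Applies the documented rules: skip failed extrinsics, ignore nodes without
// data (with a warning), and default `size` to 0 when it is missing.
function toNodeRecord(ext: ExtrinsicInfo, node: DecodedNode): NodeRecord | null {
  if (!ext.success) return null;
  if (!node.data) {
    console.warn(`Ignoring node ${node.cid}: no data`);
    return null;
  }
  return {
    cid: node.cid,
    blake3Hash: node.blake3Hash,
    links: node.links,
    type: node.data.type,
    name: node.data.name,
    size: node.data.size ?? 0,
    linkDepth: node.data.linkDepth,
    blockHeight: ext.blockHeight,
    indexInBlock: ext.indexInBlock,
  };
}

// Comparator mirroring the `(blockHeight, indexInBlock)` composite ordering.
function byBlockOrder(a: NodeRecord, b: NodeRecord): number {
  return a.blockHeight - b.blockHeight || a.indexInBlock - b.indexInBlock;
}
```

Sorting an array of records with `records.sort(byBlockOrder)` yields chain order, which is the ordering a consumer would likely want when reassembling a multi-node file.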

docs/file-retriever.md

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
### File Retriever — Service Overview
2+
3+
The File Retriever serves content from Autonomys DSN by composing IPLD DAG-PB chunks into a file stream. It validates access, enforces moderation, consults the DAG Indexer for metadata and chunk ordering, and fetches raw objects via Subspace Gateway using Object Mapping Indexer hints.
4+
5+
### Flow
6+
7+
- Request arrives at `GET /files/:cid` with API auth.
8+
- Validate CID and optional byte range; check moderation ban list.
9+
- Fetch node metadata from DAG Indexer to derive size, filename, mime/encoding.
10+
- Resolve object mappings (piece index/offset) through Object Mapping Indexer and batch-fetch nodes from Subspace Gateway.
11+
- Stream file (DFS over chunks) or partial range; cache full responses.
12+
13+
### Endpoints
14+
15+
- `GET /health` — 200 OK.
16+
- `GET /files/:cid/metadata` — Returns IPLD metadata for the node.
17+
- `GET /files/:cid/status` → `{ isCached: boolean }`.
18+
- `GET /files/:cid` — Streams the file.
19+
  - Query: `raw=true` disables `Content-Encoding`; `originControl=no-cache` or header `x-origin-control: no-cache` bypasses cache; HTTP Range supported via `Range` header.
20+
- `GET /files/:cid/partial?chunk=N` — Returns raw bytes for chunk N; 204 if missing.
21+
- `POST /moderation/:cid/ban` — Ban a CID; `POST /moderation/:cid/unban` — Unban; `GET /moderation/:cid/status` → `{ isBanned }`; `GET /moderation/banned?page&limit` — paginated list.
22+
- `GET /nodes/:cid` — Raw DAG-PB node bytes (decoded to PBNode); `GET /nodes/:cid/ipld` — Decoded IPLD node.
23+
24+
All `/files/*` and `/moderation/*` endpoints require API auth.
25+
26+
### Authentication
27+
28+
- Header: `Authorization: Bearer <API_SECRET>` or query: `?api_key=<API_SECRET>`.
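
As a rough illustration of how a client might combine the endpoint options and authentication described above, here is a minimal TypeScript sketch. `buildFileRequest` and its option names are hypothetical and not part of the service; only the path, the `raw` query parameter, and the `Authorization`, `x-origin-control`, and `Range` headers come from this document.

```typescript
// Hypothetical client-side helper; not part of the File Retriever codebase.
type FileRequestOptions = {
  raw?: boolean; // adds `raw=true` to the query string
  bypassCache?: boolean; // adds `x-origin-control: no-cache`
  range?: { start: number; end: number }; // inclusive byte range
};

type PreparedRequest = { url: string; headers: Record<string, string> };

function buildFileRequest(
  baseUrl: string,
  cid: string,
  apiSecret: string,
  options: FileRequestOptions = {},
): PreparedRequest {
  const url = new URL(`/files/${encodeURIComponent(cid)}`, baseUrl);
  if (options.raw) url.searchParams.set("raw", "true");

  // Bearer auth via header; the `?api_key=` query form is the documented alternative.
  const headers: Record<string, string> = { Authorization: `Bearer ${apiSecret}` };
  if (options.bypassCache) headers["x-origin-control"] = "no-cache";
  if (options.range) {
    headers["Range"] = `bytes=${options.range.start}-${options.range.end}`;
  }
  return { url: url.toString(), headers };
}
```

The result can be passed to any HTTP client, for example `fetch(req.url, { headers: req.headers })`. Using the header form keeps the secret out of URLs that might end up in logs.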
29+
30+
### Caching
31+
32+
- Two layers:
33+
  - Node cache: stores fetched PBNodes for batching and later migration to file cache.
34+
  - File cache: stores complete file streams keyed by CID. Partial responses are not cached.
35+
- Controls: set query param `originControl=no-cache` or header `x-origin-control: no-cache` to bypass cache.
36+
37+
### Partial Content
38+
39+
- Accepts HTTP Range. Determines affected chunks and slices stream precisely with `Content-Range` and `206` responses.
40+
- For `/files/:cid/partial?chunk=N`, fetches the exact chunk by index.
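
To make "determines affected chunks and slices stream precisely" concrete, the following TypeScript sketch maps an inclusive byte range onto per-chunk slices. It assumes the chunk sizes are already known in file order (for example from DAG Indexer metadata); `chunksForRange` and `contentRange` are illustrative names, not the service's own functions.

```typescript
// `from` is inclusive and `to` is exclusive, both relative to the chunk's first byte.
type ChunkSlice = { chunk: number; from: number; to: number };

// Returns the chunks overlapped by the inclusive byte range [start, end],
// together with the part of each chunk that falls inside the range.
function chunksForRange(chunkSizes: number[], start: number, end: number): ChunkSlice[] {
  const slices: ChunkSlice[] = [];
  let offset = 0; // absolute offset of the current chunk's first byte
  chunkSizes.forEach((size, index) => {
    const chunkStart = offset;
    const chunkEnd = offset + size - 1; // inclusive
    if (size > 0 && chunkEnd >= start && chunkStart <= end) {
      slices.push({
        chunk: index,
        from: Math.max(start, chunkStart) - chunkStart,
        to: Math.min(end, chunkEnd) - chunkStart + 1,
      });
    }
    offset += size;
  });
  return slices;
}

// Formats the `Content-Range` value sent with a 206 response.
function contentRange(start: number, end: number, total: number): string {
  return `bytes ${start}-${end}/${total}`;
}
```

For three 10-byte chunks and the range `5-14`, this yields the last five bytes of chunk 0 and the first five bytes of chunk 1, so chunk 2 never needs to be fetched.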
41+
42+
### Dependencies
43+
44+
- DAG Indexer: node metadata and chunk ordering.
45+
- Object Mapping Indexer: piece index/offset to batch fetch objects within the same piece.
46+
- Subspace Gateway: `subspace_fetchObject` JSON-RPC to retrieve encoded nodes.
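
The phrase "batch fetch objects within the same piece" suggests grouping mappings by `pieceIndex` before calling the gateway. The TypeScript sketch below shows one way to do that under stated assumptions: the `ObjectMapping` shape and the `batchByPiece` helper are hypothetical, and the File Retriever's real batching strategy may differ.

```typescript
// Hypothetical shape based on the `[hash, pieceIndex, pieceOffset]` tuples in these docs.
type ObjectMapping = { hash: string; pieceIndex: number; pieceOffset: number };

// Groups mappings that live in the same piece, orders each group by offset,
// and caps every batch at `maxPerFetch` objects (compare `MAX_OBJECTS_PER_FETCH`).
function batchByPiece(mappings: ObjectMapping[], maxPerFetch: number): ObjectMapping[][] {
  if (maxPerFetch < 1) throw new RangeError("maxPerFetch must be at least 1");

  const byPiece = new Map<number, ObjectMapping[]>();
  for (const mapping of mappings) {
    const group = byPiece.get(mapping.pieceIndex) ?? [];
    group.push(mapping);
    byPiece.set(mapping.pieceIndex, group);
  }

  const batches: ObjectMapping[][] = [];
  byPiece.forEach((group) => {
    group.sort((a, b) => a.pieceOffset - b.pieceOffset);
    for (let i = 0; i < group.length; i += maxPerFetch) {
      batches.push(group.slice(i, i + maxPerFetch));
    }
  });
  return batches;
}
```

Each resulting batch could then be sent as one gateway request, so a piece is touched once per batch rather than once per object.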
47+
48+
### Key Environment Variables
49+
50+
- `API_SECRET` (required): API token.
51+
- `SUBSPACE_GATEWAY_URLS` (required): comma-separated gateway URLs.
52+
- `FILE_RETRIEVER_PORT` (default: 8090)
53+
- `CORS_ORIGIN` (default: `*`)
54+
- `OBJECT_MAPPING_INDEXER_URL` (required)
55+
- `DATABASE_URL` (required)
56+
- Caching: `CACHE_DIR` (default: `./.cache`), `CACHE_MAX_SIZE` (default: 10GiB), `CACHE_TTL` (default: 24h)
57+
- Object fetching: `MAX_OBJECTS_PER_FETCH` (default: 100), `MAX_SIMULTANEOUS_FETCHES` (default: 10), `FETCH_TIMEOUT` (default: 180000 ms)
58+
- Monitoring (optional): `VICTORIA_ACTIVE`, `VICTORIA_ENDPOINT`, `VICTORIA_USERNAME`, `VICTORIA_PASSWORD`, `METRIC_ENVIRONMENT_TAG`
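
A small TypeScript sketch of how these variables could be read with the documented defaults follows. It is an assumption-laden illustration, not the service's configuration module: `loadConfig`, `RetrieverConfig`, and the helper functions are hypothetical, and the cache and monitoring variables are left out because their value formats are not specified here.

```typescript
type Env = Record<string, string | undefined>;

type RetrieverConfig = {
  apiSecret: string;
  gatewayUrls: string[];
  objectMappingIndexerUrl: string;
  databaseUrl: string;
  port: number;
  corsOrigin: string;
  maxObjectsPerFetch: number;
  maxSimultaneousFetches: number;
  fetchTimeoutMs: number;
};

// Fails fast when a variable documented as required is missing or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

// Parses an integer variable, falling back to the documented default.
function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = parseInt(raw, 10);
  if (isNaN(parsed)) throw new Error(`Invalid integer for ${name}: ${raw}`);
  return parsed;
}

function loadConfig(env: Env): RetrieverConfig {
  return {
    apiSecret: requireVar(env, "API_SECRET"),
    gatewayUrls: requireVar(env, "SUBSPACE_GATEWAY_URLS")
      .split(",")
      .map((url) => url.trim())
      .filter((url) => url.length > 0),
    objectMappingIndexerUrl: requireVar(env, "OBJECT_MAPPING_INDEXER_URL"),
    databaseUrl: requireVar(env, "DATABASE_URL"),
    port: intVar(env, "FILE_RETRIEVER_PORT", 8090),
    corsOrigin: env["CORS_ORIGIN"] ?? "*",
    maxObjectsPerFetch: intVar(env, "MAX_OBJECTS_PER_FETCH", 100),
    maxSimultaneousFetches: intVar(env, "MAX_SIMULTANEOUS_FETCHES", 10),
    fetchTimeoutMs: intVar(env, "FETCH_TIMEOUT", 180000),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so a missing required variable stops the service before it accepts requests.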
59+
60+
### Headers and Responses
61+
62+
- Sets `Content-Type` from filename when deflate-compressed and not encrypted.
63+
- `Content-Disposition: filename="<name>"` when provided.
64+
- `Accept-Ranges: bytes` advertised; `Content-Encoding: deflate` when applicable and not `raw`.
65+
- `x-file-origin: cache|gateway` indicates source.
66+
67+
### Notes
68+
69+
- Only `MetadataType.File` nodes are streamable; otherwise 400.
70+
- Moderation returns 451 for banned files.

docs/object-mapping-indexer.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
### Object Mapping Indexer — Service Overview
2+
3+
Indexes and serves DSN object mappings that locate objects within archived pieces. Provides HTTP APIs for lookups and a JSON-RPC/WebSocket interface for realtime and recovery distribution of mappings to consumers (e.g., File Retriever).
4+
5+
### Flow
6+
7+
- Listener subscribes to Subspace RPC (`subspace_subscribeObjectMappings`).
8+
- On notification, persist `[hash, pieceIndex, pieceOffset]` with `blockNumber`.
9+
- HTTP API exposes lookups by hash, CID, block number, and piece index.
10+
- RPC server streams mappings to subscribers in batches and supports catch-up (recovery) from a given `pieceIndex`.
11+
12+
### HTTP Endpoints
13+
14+
- `GET /health` — 200 OK.
15+
- `GET /objects/:hash` — Returns `GlobalObjectMapping` for a BLAKE3 `hash`.
16+
- `GET /objects/by-cid/:cid` — Returns `[hash, pieceIndex, pieceOffset]` for a CID.
17+
- `GET /objects/by-block/:blockNumber` — Returns array of `[hash, pieceIndex, pieceOffset]`.
18+
- `GET /objects/by-piece-index/:pieceIndex` — Returns array of `[hash, pieceIndex, pieceOffset]`.
19+
- `GET /ipld-nodes/by-block-height-range?fromBlock&toBlock` — Returns DAG nodes indexed within the range (size-limited by config).
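
Because the block-height range endpoint is size-limited, a client scanning a long span of blocks has to split it into smaller windows. The TypeScript sketch below shows one way to do that. It assumes both bounds are inclusive and that the limit matches `MAX_BLOCK_HEIGHT_RANGE` (default 1000); neither detail is confirmed by this document, and `ipldNodeRangeUrls` is a hypothetical helper.

```typescript
// Builds one request URL per window of at most `maxRange` blocks.
// Assumption: `fromBlock` and `toBlock` are both inclusive on the server side.
function ipldNodeRangeUrls(
  baseUrl: string,
  fromBlock: number,
  toBlock: number,
  maxRange: number = 1000,
): string[] {
  if (fromBlock > toBlock) throw new RangeError("fromBlock must not exceed toBlock");
  if (maxRange < 1) throw new RangeError("maxRange must be at least 1");

  const urls: string[] = [];
  for (let start = fromBlock; start <= toBlock; start += maxRange) {
    const end = Math.min(start + maxRange - 1, toBlock);
    const url = new URL("/ipld-nodes/by-block-height-range", baseUrl);
    url.searchParams.set("fromBlock", String(start));
    url.searchParams.set("toBlock", String(end));
    urls.push(url.toString());
  }
  return urls;
}
```

Keeping each window one block short of the limit means the requests stay valid whether the server counts blocks or compares the difference between the bounds.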
20+
21+
### RPC (JSON-RPC over WebSocket)
22+
23+
- `subscribe_object_mappings` → `{ subscriptionId }`, then the server emits `object_mapping_list([...])` batches.
24+
- `unsubscribe_object_mappings({ subscriptionId })` → `{ success }`.
25+
- `subscribe_recover_object_mappings({ pieceIndex, step })` → `{ subscriptionId }` for bounded catch-up; auto-switches to realtime when caught up.
26+
- `unsubscribe_recover_object_mappings({ subscriptionId })` → `{ success }`.
27+
- `get_object_mappings({ hashes })` → ordered array of mappings matching input hash order.
28+
29+
Batch size and pacing are bounded (see `MAX_OBJECTS_PER_MESSAGE` and `TIME_BETWEEN_MESSAGES` below) to avoid overwhelming clients; only mappings from archived segments are distributed.
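
For illustration, the recovery subscription could be driven from a client roughly as in the TypeScript sketch below. The method names and the `{ pieceIndex, step }` parameters come from the list above; the JSON-RPC 2.0 envelope follows the public specification. The helper names, the string types used for `hash` and `subscriptionId`, and the idea of tracking a resume point on the client are assumptions.

```typescript
type JsonRpcRequest = { jsonrpc: "2.0"; id: number; method: string; params?: unknown };

// Assumed tuple layout, mirroring `[hash, pieceIndex, pieceOffset]` from the HTTP API.
type MappingTuple = [hash: string, pieceIndex: number, pieceOffset: number];

// Request that starts a bounded catch-up from `pieceIndex`, `step` pieces at a time.
function recoverSubscribeRequest(id: number, pieceIndex: number, step: number): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "subscribe_recover_object_mappings",
    params: { pieceIndex, step },
  };
}

// Request that cancels a recovery subscription.
function recoverUnsubscribeRequest(id: number, subscriptionId: string): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "unsubscribe_recover_object_mappings",
    params: { subscriptionId },
  };
}

// Tracks the highest piece index seen so far, so that a reconnecting client
// knows where to resume recovery from.
function latestPieceIndex(lastSeen: number, batch: MappingTuple[]): number {
  return batch.reduce((max, mapping) => Math.max(max, mapping[1]), lastSeen);
}
```

A client would serialize these with `JSON.stringify` and send them over the WebSocket connection, updating its resume point with `latestPieceIndex` for every `object_mapping_list` batch it receives.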
30+
31+
### Persistence
32+
33+
- Stores mappings with `hash`, `pieceIndex`, `pieceOffset`, `blockNumber`.
34+
- Lookup helpers provide ordering and ranges for efficient streaming.
35+
36+
### Key Environment Variables
37+
38+
- `OBJECT_MAPPING_INDEXER_PORT` (default: 3000)
39+
- `REQUEST_SIZE_LIMIT` (default: `200mb`)
40+
- `CORS_ALLOW_ORIGINS`
41+
- `NODE_RPC_URL` (required)
42+
- `RECOVERY_INTERVAL` (ms, default: 1000)
43+
- `MAX_RECOVERY_STEP` (default: 1000)
44+
- `LOG_LEVEL` (default: `debug` in dev, `info` in prod)
45+
- `DATABASE_URL` (required)
46+
- Distribution: `MAX_OBJECTS_PER_MESSAGE` (default: 1000), `TIME_BETWEEN_MESSAGES` (ms, default: 1000)
47+
- IPLD Nodes API: `MAX_BLOCK_HEIGHT_RANGE` (default: 1000)
48+
49+
### Notes
50+
51+
- Uses BLAKE3 hash derived from CID when querying by CID.
52+
- Recovery subscription respects latest archived segment boundaries.
