Update docs (#143)

clostao · web-flow · commit 51d8c7a5c046 · 2025-10-23T19:02:56.000+02:00
* feat: update general docs to match the current implementation

* feat: add service overviews for File Retriever, DAG Indexer, and Object Mapping Indexer in documentation

* docs: add object mapping indexer service overview to DAG Indexer documentation

* docs: clarify the meaning of archived files in README

* docs: update DAG Indexer documentation to include support for `system.remark_with_event` extrinsics
diff --git a/README.md b/README.md
@@ -4,13 +4,40 @@ This service is used for retrieving already archived files from the Autonomy's D
 
 ## FAQs
 
-- **What means a file to be archived?:** When a file is uploaded to Autonomy’s network as a series of transactions, it is not immediately distributed as segmented pieces from which farmers generate storage proofs. There's some variable delay between the transaction submission and the file availability that depends on the network congestion among other factors.
+- **What does it mean for a file to be archived?** When a file is uploaded to the Autonomys Network as a series of transactions, it is not immediately distributed as the segmented pieces from which farmers generate storage proofs. There is a variable delay between transaction submission and file availability, which depends on network congestion among other factors. A file counts as archived once that distribution has taken place.
 
 ## How it works?
 
-When a user requests the download of a file, the file retriever asks the object mapping indexer the position of the CID in the DSN and uses the subspace gateway to download it. For big files if there are more objects to be retrieved this process is repeated until all the object are downloaded.
-
-Meanwhile, the object mapping indexer is constantly listening to the Autonomy's network for keeping track of the position of uploaded objects in the DSN.
-
-![File Download Diagram](https://github.com/autonomys/auto-files-gateway/blob/main/.github/diagrams/file-download.png)
-
+The following flow illustrates how file retrieval is performed in **Files Gateway**:
+
+- When a **user requests a specific content identifier (CID)**, the **File Retriever** coordinates the process by querying various indexers to obtain both metadata and storage location details.
+- The **Object Mapping Indexer** and **DAG Indexer** maintain a constantly updated map of where data resides across the network and how it is structured within content-addressable graphs.
+- Once the necessary information is gathered, the **Gateway** and participating **Nodes** manage the actual data transfer, retrieving the requested file from the decentralized storage and delivering it back to the user efficiently and securely.
+- This coordinated interaction ensures that even though data is stored in a distributed manner, retrieval remains seamless and transparent to the end user.
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant FileRetriever as FILE-RETRIEVER
+    participant ObjectIndexer as ObjectIndexer
+    participant DAGIndexer as DAGIndexer
+    participant Gateway as GATEWAY
+    participant Node as NODE
+
+    User->>FileRetriever: 1. Asks for a CID
+    FileRetriever->>DAGIndexer: 2. Requests the list of nodes needed<br/> for retrieving the requested CID (possibly a partial retrieval)
+    FileRetriever->>ObjectIndexer: 3. Requests the position of the object on the DSN
+    Note over ObjectIndexer: Listens to every object upload,<br/>saving its position in the DSN<br/>for use during file download.
+    Note over DAGIndexer: Listens to every "System.remark" extrinsic,<br/> then decodes and saves its IPLD metadata.
+    FileRetriever->>Gateway: 4. Fetches the object using<br/>the Subspace Gateway
+    Node->>ObjectIndexer: Sends object mappings
+    Node->>DAGIndexer: Sends blocks
+    Gateway->>Node: Retrieve object metadata from the node
+```
+
+For a more detailed overview of each service, check out:
+
+- [File Retriever](docs/file-retriever.md)
+- [DAG Indexer](docs/dag-indexer.md)
+- [Object Mapping Indexer](docs/object-mapping-indexer.md)
+- [Subspace Node & Gateway](https://github.com/autonomys/subspace)
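The numbered steps in the README sequence diagram above can be sketched as a small orchestration function. This is a minimal sketch, not the File Retriever's actual code: the three interfaces, their method names, and `retrieveFile` are hypothetical, and the fetched objects are treated as raw chunk bytes, whereas the real service decodes DAG-PB nodes before streaming.

```typescript
// Hypothetical interfaces: names and shapes are assumptions for illustration only.
interface ObjectPosition {
  pieceIndex: number;
  pieceOffset: number;
}

interface DagIndexer {
  // Step 2: ordered CIDs of the nodes needed to rebuild the requested CID.
  listNodeCids(cid: string): Promise<string[]>;
}

interface ObjectMappingIndexer {
  // Step 3: position of an object on the DSN.
  locate(cid: string): Promise<ObjectPosition>;
}

interface Gateway {
  // Step 4: fetch the object stored at a given position.
  fetchObject(position: ObjectPosition): Promise<Uint8Array>;
}

async function retrieveFile(
  cid: string,
  dag: DagIndexer,
  mappings: ObjectMappingIndexer,
  gateway: Gateway,
): Promise<Uint8Array> {
  const nodeCids = await dag.listNodeCids(cid);
  const parts: Uint8Array[] = [];
  for (const nodeCid of nodeCids) {
    const position = await mappings.locate(nodeCid);
    parts.push(await gateway.fetchObject(position));
  }
  // Concatenate the fetched parts in the order given by the DAG Indexer.
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const file = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    file.set(part, offset);
    offset += part.length;
  }
  return file;
}
```

Keeping the three collaborators behind interfaces means the ordering logic can be exercised with in-memory stubs, without a running node or gateway.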
diff --git a/docs/dag-indexer.md b/docs/dag-indexer.md
@@ -0,0 +1,39 @@
+### DAG Indexer — Service Overview
+
+SubQuery-based indexer that listens to `system.remark` and `system.remark_with_event` extrinsics and indexes DAG-PB nodes produced by `@autonomys/auto-dag-data`. Persists node metadata, links, sizes, and upload options for downstream services like File Retriever.
+
+### Flow
+
+- SubQuery runtime connects to Subspace network (`wss://rpc.mainnet.subspace.foundation/ws`).
+- For each successful `system.remark` or `system.remark_with_event` extrinsic, decode the DAG-PB node, compute its CID and BLAKE3 hash, and extract links and IPLD metadata.
+- Save entity `Node` with block/extrinsic identifiers and metadata.
+
+### Data Model (GraphQL schema)
+
+- `Node` fields: `cid`, `type`, `linkDepth`, `name`, `blockHeight`, `blockHash`, `extrinsicId`, `extrinsicHash`, `indexInBlock`, `links[]`, `size`, `blake3Hash`, `timestamp`, `uploadOptions`.
+- `uploadOptions` includes optional compression and encryption options.
+
+### Mapping Logic (handleCall)
+
+- Skip failed extrinsics.
+- Decode remark hex to PBNode; derive `cid`, `links`, `blake3Hash`.
+- Decode `IPLDNodeData` for `type`, `size`, `name`, `linkDepth`, `uploadOptions`.
+- Persist as `Node` entity with composite indexes on `(blockHeight,indexInBlock)` and `(timestamp,indexInBlock)` for efficient ordering.
+
+### Configuration (project.yaml)
+
+- `dataSources[0]` kind: `substrate/Runtime` with handler `handleCall` and filter `{ module: system, method: remark }` or `{ module: system, method: remarkWithEvent }`.
+- Network `chainId` and endpoint set to Subspace Mainnet.
+- Runner `@subql/node` `>=5.2.9` with `unsafe: true`, `historical: false`.
+
+### Consumers
+
+- File Retriever queries DAG Indexer storage to get:
+  - Node metadata for `/files/:cid/metadata` and validation.
+  - Sorted chunks for multi-node files.
+- Object Mapping Indexer reads DAG Indexer storage to serve its `/ipld-nodes` endpoints.
+
+### Notes
+
+- Only nodes with `node.Data` are saved; others are ignored with a warning.
+- `size` defaults to `0` if missing in metadata.
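The mapping rules listed in `docs/dag-indexer.md` above (skip failed extrinsics, ignore nodes without `Data`, default `size` to `0`, order by block height and index in block) can be sketched as follows. All types and function names here are hypothetical stand-ins, not the SubQuery entity or the `@autonomys/auto-dag-data` types, and the DAG-PB decoding step is assumed to have already happened.

```typescript
// Hypothetical shapes for illustration only.
interface DecodedNode {
  cid: string;
  links: string[];
  // Decoded IPLD metadata; absent when the PBNode carries no Data.
  data?: { type: string; size?: number; linkDepth: number };
}

interface ExtrinsicContext {
  success: boolean;
  blockHeight: number;
  indexInBlock: number;
}

interface NodeEntity {
  cid: string;
  type: string;
  size: number;
  linkDepth: number;
  links: string[];
  blockHeight: number;
  indexInBlock: number;
}

function toNodeEntity(ctx: ExtrinsicContext, node: DecodedNode): NodeEntity | null {
  // Failed extrinsics are skipped.
  if (!ctx.success) return null;
  // Nodes without Data are ignored with a warning.
  if (!node.data) {
    console.warn(`Skipping ${node.cid}: node has no Data`);
    return null;
  }
  return {
    cid: node.cid,
    type: node.data.type,
    size: node.data.size ?? 0, // size defaults to 0 when missing
    linkDepth: node.data.linkDepth,
    links: node.links,
    blockHeight: ctx.blockHeight,
    indexInBlock: ctx.indexInBlock,
  };
}

// Comparator matching the (blockHeight, indexInBlock) composite ordering.
function compareChainOrder(a: NodeEntity, b: NodeEntity): number {
  return a.blockHeight - b.blockHeight || a.indexInBlock - b.indexInBlock;
}
```

Returning `null` for skipped nodes keeps the handler's persistence step a simple "save if not null" check.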
diff --git a/docs/file-retriever.md b/docs/file-retriever.md
@@ -0,0 +1,70 @@
+### File Retriever — Service Overview
+
+The File Retriever serves content from Autonomys DSN by composing IPLD DAG-PB chunks into a file stream. It validates access, enforces moderation, consults the DAG Indexer for metadata and chunk ordering, and fetches raw objects via Subspace Gateway using Object Mapping Indexer hints.
+
+### Flow
+
+- Request arrives at `GET /files/:cid` with API auth.
+- Validate CID and optional byte range; check moderation ban list.
+- Fetch node metadata from DAG Indexer to derive size, filename, mime/encoding.
+- Resolve object mappings (piece index/offset) through Object Mapping Indexer and batch-fetch nodes from Subspace Gateway.
+- Stream file (DFS over chunks) or partial range; cache full responses.
+
+### Endpoints
+
+- `GET /health` — 200 OK.
+- `GET /files/:cid/metadata` — Returns IPLD metadata for the node.
+- `GET /files/:cid/status` — `{ isCached: boolean }`.
+- `GET /files/:cid` — Streams the file.
+  - Query: `raw=true` disables `Content-Encoding`; `originControl=no-cache` or header `x-origin-control: no-cache` bypasses cache; HTTP Range supported via `Range` header.
+- `GET /files/:cid/partial?chunk=N` — Returns raw bytes for chunk N; 204 if missing.
+- `POST /moderation/:cid/ban` — Ban a CID; `POST /moderation/:cid/unban` — Unban; `GET /moderation/:cid/status` — `{ isBanned }`; `GET /moderation/banned?page&limit` — paginated list.
+- `GET /nodes/:cid` — Raw DAG-PB node bytes (decoded to PBNode); `GET /nodes/:cid/ipld` — Decoded IPLD node.
+
+All `/files/*` and `/moderation/*` endpoints require API auth.
+
+### Authentication
+
+- Header: `Authorization: Bearer <API_SECRET>` or query: `?api_key=<API_SECRET>`.
+
+### Caching
+
+- Two layers:
+  - Node cache: stores fetched PBNodes for batching and later migration to file cache.
+  - File cache: stores complete file streams keyed by CID. Partial responses are not cached.
+- Controls: set query param `originControl=no-cache` or header `x-origin-control: no-cache` to bypass cache.
+
+### Partial Content
+
+- Accepts HTTP Range. Determines affected chunks and slices stream precisely with `Content-Range` and `206` responses.
+- For `/files/:cid/partial?chunk=N`, fetches the exact chunk by index.
+
+### Dependencies
+
+- DAG Indexer: node metadata and chunk ordering.
+- Object Mapping Indexer: piece index/offset to batch fetch objects within the same piece.
+- Subspace Gateway: `subspace_fetchObject` JSON-RPC to retrieve encoded nodes.
+
+### Key Environment Variables
+
+- `API_SECRET` (required): API token.
+- `SUBSPACE_GATEWAY_URLS` (required): comma-separated gateway URLs.
+- `FILE_RETRIEVER_PORT` (default: 8090)
+- `CORS_ORIGIN` (default: `*`)
+- `OBJECT_MAPPING_INDEXER_URL` (required)
+- `DATABASE_URL` (required)
+- Caching: `CACHE_DIR` (default: `./.cache`), `CACHE_MAX_SIZE` (default: 10GiB), `CACHE_TTL` (default: 24h)
+- Object fetching: `MAX_OBJECTS_PER_FETCH` (default: 100), `MAX_SIMULTANEOUS_FETCHES` (default: 10), `FETCH_TIMEOUT` (ms, default: 180000)
+- Monitoring (optional): `VICTORIA_ACTIVE`, `VICTORIA_ENDPOINT`, `VICTORIA_USERNAME`, `VICTORIA_PASSWORD`, `METRIC_ENVIRONMENT_TAG`
+
+### Headers and Responses
+
+- Sets `Content-Type` from filename when deflate-compressed and not encrypted.
+- `Content-Disposition: filename="<name>"` when provided.
+- `Accept-Ranges: bytes` advertised; `Content-Encoding: deflate` when applicable and not `raw`.
+- `x-file-origin: cache|gateway` indicates source.
+
+### Notes
+
+- Only `MetadataType.File` nodes are streamable; otherwise 400.
+- Moderation returns 451 for banned files.
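The "Partial Content" behaviour described in `docs/file-retriever.md` above (work out which chunks a byte range touches, then slice them) can be illustrated with a small pure function. This is a sketch under assumptions: `sliceChunksForRange` and `contentRange` are hypothetical names, chunk sizes are assumed to be known up front in file order, and the range is assumed to be already validated against the file size.

```typescript
// One slice per affected chunk; `from`/`to` are offsets within that chunk (`to` exclusive).
interface ChunkSlice {
  chunkIndex: number;
  from: number;
  to: number;
}

// `start` and `end` are inclusive byte positions, as in an HTTP Range header.
function sliceChunksForRange(chunkSizes: number[], start: number, end: number): ChunkSlice[] {
  const slices: ChunkSlice[] = [];
  let chunkStart = 0;
  chunkSizes.forEach((size, chunkIndex) => {
    const chunkEnd = chunkStart + size; // exclusive
    // Overlap between this chunk and the requested range [start, end + 1).
    const from = Math.max(start, chunkStart);
    const to = Math.min(end + 1, chunkEnd);
    if (from < to) {
      slices.push({ chunkIndex, from: from - chunkStart, to: to - chunkStart });
    }
    chunkStart = chunkEnd;
  });
  return slices;
}

// Value for the Content-Range header of a 206 response.
function contentRange(start: number, end: number, totalSize: number): string {
  return `bytes ${start}-${end}/${totalSize}`;
}
```

For three 4-byte chunks and `Range: bytes=2-9`, the function selects the tail of chunk 0, all of chunk 1, and the first two bytes of chunk 2, so only those chunks need to be fetched.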
diff --git a/docs/object-mapping-indexer.md b/docs/object-mapping-indexer.md
@@ -0,0 +1,52 @@
+### Object Mapping Indexer — Service Overview
+
+Indexes and serves DSN object mappings that locate objects within archived pieces. Provides HTTP APIs for lookups and a JSON-RPC/WebSocket interface for realtime and recovery distribution of mappings to consumers (e.g., File Retriever).
+
+### Flow
+
+- Listener subscribes to Subspace RPC (`subspace_subscribeObjectMappings`).
+- On notification, persist `[hash, pieceIndex, pieceOffset]` with `blockNumber`.
+- HTTP API exposes lookups by hash, CID, block number, and piece index.
+- RPC server streams mappings to subscribers in batches and supports catch-up (recovery) from a given `pieceIndex`.
+
+### HTTP Endpoints
+
+- `GET /health` — 200 OK.
+- `GET /objects/:hash` — Returns `GlobalObjectMapping` for a BLAKE3 `hash`.
+- `GET /objects/by-cid/:cid` — Returns `[hash, pieceIndex, pieceOffset]` for a CID.
+- `GET /objects/by-block/:blockNumber` — Returns array of `[hash, pieceIndex, pieceOffset]`.
+- `GET /objects/by-piece-index/:pieceIndex` — Returns array of `[hash, pieceIndex, pieceOffset]`.
+- `GET /ipld-nodes/by-block-height-range?fromBlock&toBlock` — Returns DAG nodes indexed within the range (size-limited by config).
+
+### RPC (JSON-RPC over WebSocket)
+
+- `subscribe_object_mappings` → `{ subscriptionId }` then server emits `object_mapping_list([...])` batches.
+- `unsubscribe_object_mappings({ subscriptionId })` → `{ success }`.
+- `subscribe_recover_object_mappings({ pieceIndex, step })` → `{ subscriptionId }` for bounded catch-up; auto-switches to realtime when caught up.
+- `unsubscribe_recover_object_mappings({ subscriptionId })` → `{ success }`.
+- `get_object_mappings({ hashes })` → ordered array of mappings matching input hash order.
+
+Batching and throttling are controlled to avoid overwhelming clients; only archived segments are distributed.
+
+### Persistence
+
+- Stores mappings with `hash`, `pieceIndex`, `pieceOffset`, `blockNumber`.
+- Lookup helpers provide ordering and ranges for efficient streaming.
+
+### Key Environment Variables
+
+- `OBJECT_MAPPING_INDEXER_PORT` (default: 3000)
+- `REQUEST_SIZE_LIMIT` (default: `200mb`)
+- `CORS_ALLOW_ORIGINS`
+- `NODE_RPC_URL` (required)
+- `RECOVERY_INTERVAL` (ms, default: 1000)
+- `MAX_RECOVERY_STEP` (default: 1000)
+- `LOG_LEVEL` (default: `debug` in dev, `info` in prod)
+- `DATABASE_URL` (required)
+- Distribution: `MAX_OBJECTS_PER_MESSAGE` (default: 1000), `TIME_BETWEEN_MESSAGES` (ms, default: 1000)
+- IPLD Nodes API: `MAX_BLOCK_HEIGHT_RANGE` (default: 1000)
+
+### Notes
+
+- Uses BLAKE3 hash derived from CID when querying by CID.
+- Recovery subscription respects latest archived segment boundaries.
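Two behaviours from `docs/object-mapping-indexer.md` above lend themselves to a short illustration: `get_object_mappings` returning results in input hash order, and subscriptions delivering mappings in bounded batches. The sketch below is not the service's code. The function names are hypothetical, and dropping hashes with no stored mapping is an assumption, since the document does not say how missing hashes are handled.

```typescript
// [hash, pieceIndex, pieceOffset], as described in the Flow section.
type ObjectMapping = [string, number, number];

// Return stored mappings in the same order as the requested hashes.
// Assumption: hashes without a stored mapping are omitted from the result.
function orderByInputHashes(hashes: string[], stored: ObjectMapping[]): ObjectMapping[] {
  const byHash = new Map<string, ObjectMapping>();
  for (const mapping of stored) {
    byHash.set(mapping[0], mapping);
  }
  const ordered: ObjectMapping[] = [];
  for (const hash of hashes) {
    const mapping = byHash.get(hash);
    if (mapping !== undefined) ordered.push(mapping);
  }
  return ordered;
}

// Split mappings into messages of at most `maxPerMessage` entries
// (the role played by MAX_OBJECTS_PER_MESSAGE).
function toBatches(mappings: ObjectMapping[], maxPerMessage: number): ObjectMapping[][] {
  if (!Number.isInteger(maxPerMessage) || maxPerMessage <= 0) {
    throw new Error("maxPerMessage must be a positive integer");
  }
  const batches: ObjectMapping[][] = [];
  for (let i = 0; i < mappings.length; i += maxPerMessage) {
    batches.push(mappings.slice(i, i + maxPerMessage));
  }
  return batches;
}
```

A sender would then emit one batch per `TIME_BETWEEN_MESSAGES` interval; that timing loop is left out here to keep the example deterministic.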