Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create the bootstrap spec #6

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
4 changes: 4 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
[workspace]
members = [
"crates/api",
"crates/boot_srv",
]
resolver = "2"

Expand Down
13 changes: 13 additions & 0 deletions crates/boot_srv/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
[package]
name = "kitsune2_boot_srv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boot sounds like it's for startup, I'd prefer to spell this one out :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just go with bs? /jk

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: edc2546

version = "0.0.1-alpha"
description = "p2p / dht communication framework api"
license = "Apache-2.0"
homepage = "https://github.com/holochain/kitsune2"
documentation = "https://docs.rs/kitsune2_boot_srv"
authors = ["Holochain Core Dev Team <[email protected]>"]
keywords = ["holochain", "holo", "p2p", "dht", "networking"]
categories = ["network-programming"]
edition = "2021"

[dependencies]
150 changes: 150 additions & 0 deletions crates/boot_srv/src/lib.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
#![deny(missing_docs)]
//! Kitsune2 boot server is an HTTP REST server for handling bootstrapping
//! discovery of peer network reachability in p2p applications.
//!
//! Despite being in the kitsune2 repo, `boot_srv` and `boot_cli` do not depend
//! on any Kitsune2 crates. This is to ensure the bootstrapping functionality
//! is well-defined, self-contained, easily testable in isolation, and
//! usable for projects that don't choose to make use of Kitsune2 itself.
//!
//! That being said, the boot server and client are designed to transfer the
//! `AgentInfoSigned` data type defined in the `kitsune2_api` crate. Since
//! the canonical encoding of that type is JSON, we just redefine a subset
//! of the schema here, and only validate the parts required for bootstrapping.
//!
//! For additional details, please see the [spec].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also include a section about recommended client implementation. I.e. expiration times, retry, scheduling of updating infos

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: d49c23e


/// This is a documentation module containing the kitsune2_boot spec.
///
/// #### 1. Types
///
/// - `Base64Agent` - base64UrlNoPad safe encoded string agent id.
mattyg marked this conversation as resolved.
Show resolved Hide resolved
/// - `Base64Space` - base64UrlNoPad safe encoded string space id.
/// - `Base64Sig` - base64UrlNoPad safe encoded string crypto signature.
/// - `Json` - string containing json that can be decoded.
/// - `I64` - string contaning an i64 number.
neonphog marked this conversation as resolved.
Show resolved Hide resolved
///
/// ```text
/// AgentInfoSigned = { "agentInfo": Json, "signature": Base64Sig }
/// AgentInfo = {
/// "agent": Base64Agent,
/// "space": Base64Space,
/// "createdAt": I64,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific with this type? local time vs UTC time is so easy to get tripped up on

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: dd48b48

/// "expiresAt": I64,
/// "isTombstone": boolean
/// }
/// ```
///
/// Any other properties on these objects will be ignored and pass through.
///
/// #### 2. REST API
///
/// ##### 2.1. In Brief
///
/// ```text
/// ErrResponse = { "error": string }
/// OkResponse = {}
/// ListResponse = [ AgentInfoSigned, .. ]
/// ```
///
/// - `PUT /boot/Base64Space/Base64Agent`
mattyg marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it clearer that these are path parameters? {Base64Space} or <Base64Space>?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done: 4b12a8b

/// - Request Body: `AgentInfoSigned`
/// - Response Body: `OkResponse | ErrResponse`
/// - `GET /boot/Base64Space`
/// - Response Body: `ListResponse | ErrResponse`
/// - `GET /`
/// - Response Body: `OkResponse | ErrResponse`
///
/// ##### 2.2. Publishing info to the boot server.
///
/// A `PUT` on `/boot/Base64Space/Base64Agent` with an `AgentInfoSigned`
/// json object as the request body.
///
/// - The server MUST reject the request if the body is > 1024 bytes.
jost-s marked this conversation as resolved.
Show resolved Hide resolved
/// - The server MUST reject the request if `createdAt` is not within
/// 3 minutes in either direction of the server time.
/// - The server MUST reject the request if `expiresAt` is in the past.
Copy link
Member

@mattyg mattyg Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must reject if expiresAt <= createdAt

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: a2338e6

Copy link
Member

@mattyg mattyg Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the rationale is for the client requesting an expiration time? Do we want to put some configurable constraints on the expiration timespan? Too short could be a form of spam. Too long could cause some agents to outlive others in the store.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added the reject if more than 30 minutes: 84799bc

The clients can already spam the server by sending a new createdAt every millisecond or whatever, so that side will have to be covered by the rate limiting.

/// - The server MUST reject the request if `signature` is invalid vs
/// the `agentInfo`.
/// - If `isTombstone` is `true`, the server MUST delete any existing
/// info being held.
Copy link
Member

@mattyg mattyg Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must delete all AgentInfo records that meet all these criteria:

  • The 'space' field matches Base64Space from the request URL
  • The 'agent' field matches Base64Agent from the request URL
  • The 'createdAt' timestamp is older than or equal to the timestamp in the request body"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right to me. A tombstone to me is a marker that something has been deleted. So I'd expect people querying the bootstrap server to see this agent as having removed themselves. If we just want ourselves gone, that should be DELETE verb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not working as a tombstone on the bootstrap server... but it is working as a tombstone within the kitsune peer gossip, so let's just reuse that already written functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// - If `isTombstone` is `false`, the server MAY begin storing the info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find it clearer to split create and delete into different endpoints. Then we could also drop the expiresAt field for delete requests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tombstone field is require for the functionality on the kitsune side. Since in the active network they cannot just delete an agent info--they might get the old one gossiped back to them again, so they need tombstone records. And having them be gossiped just the same as other infos saves complexity. Since we need that anyways, I don't want to implement a different method for the bootstrap server.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm asking for something similar above. I think the bootstrap API is different to what Kitsune does internally, I'd be fine with seeing a piece of code that sends a delete to the bootstrap server rather than a PUT with a delete marker

Actually, I'd be happier :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the verifiability aspect. We cannot let just anybody delete any record, so we'd have to come up with some alternate payload for the clients to sign anyways... Can we just use the existing functionality?

/// See section 3. on storage strategies below.
///
/// ##### 2.3. Listing data stored on the boot server.
///
/// A `GET` on `/boot/Base64Space`.
///
/// - The server MUST respond with a complete list of stored infos.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that include expired info that hasn't been deleted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's up to the server implementation. In my imaginings there will be a periodic task cleaning up expired entries, and it would also adjust the space whenever a PUT happened. So there is a chance some expired entries will be gossiped, but only just, and kitsune will be tolerant to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to include GET parameters to support pagination such as "count" and "offset"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding complexity like this, as well as needing to manage our server memory usage is the reason for limiting the infos per space. A max of 32k (plus a little overhead for commas and square brackets) will be easy to send in a single response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the infos per space is configurable, what if another server has a very high limit and sends much more in a response?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assert that allowing so many bootstrap agents within a space that they cannot be sent in a single response is YAGNI. We could put documentation around the config explaining that the max response size will be MAX_INFOS * MAX_INFO_BYTES and to consider that before setting. Or error starting the server if that response size is beyond some threshold.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, it will simplify things like hcterm and general debugging where you have to keep calling "give me some more please" and hope to build a full set for display :)

Was there some kind of security reason not to do this in the first place? Or was it just that Kitsune wasn't supposed to call it like this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy bootstrap server holds all agents within the space. We didn't ever want to send down the potentially millions of entries we thought holochain might scale to, so we had the server randomize the list and send you only a handful.

/// - If there are no infos stored at this space, the server MUST return
/// an empty list (`[]`).
/// - The only reason a server MAY return an `ErrResponse` for this request
/// is in the case of internal server error.
///
/// ##### 2.4. Health check.
///
/// A `GET` on `/`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a prefix /health for the api and leave / available for an fancy health monitor html page?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 5e4fc87

///
/// - The server, in general, SHOULD return `OkResponse` to this request.
/// - The server MAY return `ErrResponse` for internal errors or some other
/// inability to continue serving correctly.
///
/// #### 3. Storage Strategies
///
/// ##### 3.1. The Future
///
/// It is the intention to someday in the future to add a "trusted" strategy,
neonphog marked this conversation as resolved.
Show resolved Hide resolved
/// that will be triggerd via a new api, perhaps `/registerTrust/Base64Space`.
///
/// Even when that API is implemented, however, the default strategy defined
/// next will be used on all spaces that have not been registered with a
/// different strategy. Therefore, we can proceed for now without consideration
/// for any other future strategies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the list of returned agent infos be segmented into two lists by trusted/authoritative and self-published?

And for now the trusted list could be empty, and the self-published list works as being described in this doc?

Copy link
Member

@mattyg mattyg Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're trying to avoid placeholders in the codebase as they add noise and friction for development.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is that spaces will either use the "default" storage strategy defined here, or the "trusted" strategy to be defined in the future. Once we have the ability, if you register a space as trusted, it will then only accept trusted infos and only return those trusted infos.

///
/// ##### 3.2. The "Default" Storage Strategy
///
/// Assumptions:
///
/// - We don't want to store unbounded count infos in each space.
/// - We like storing long-running reliable nodes.
/// - We don't want to ONLY store long-running nodes to avoid eclipse scenarios.
///
/// The solution put forth here is to allocate half our storage space to
/// long-running nodes, and the other half to whoever has most recently
/// published an info. The strategy for accomplishing this is defined here:
///
/// Consider "the store" acting as a stack. It can be implemented in any manner,
jost-s marked this conversation as resolved.
Show resolved Hide resolved
/// but the following strategy assumes new entries are added to the end,
/// and when any entries are removed, the indexes of any items after them
/// will be decrimented.
///
/// - The server SHOULD provide a configurable per-space max info count.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it also provide a max expiration timespan?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the complexity of the server needing to tell the clients what the "configured" expiration time is so they know how often to refresh / what they can set it to... Let's start with just a hard-coded maximum of 30 minutes, and see if there is need for anything else.

mattyg marked this conversation as resolved.
Show resolved Hide resolved
/// For the duration of this document, that value will be called MAX_INFOS.
/// It is recommended to default that value to `32`.
mattyg marked this conversation as resolved.
Show resolved Hide resolved
/// - The server SHOULD delete expired infos periodically.
/// - If the server is already storing an agent, and a `PUT` with a newer
jost-s marked this conversation as resolved.
Show resolved Hide resolved
/// `createdAt` arrives, the existing entry MUST be replaced.
/// - If the store count is < MAX_INFOS, the server MUST push the info
/// onto the stack.
/// - If the store count is >= MAX_INFOS, the server MUST delete the entry
/// at MAX_INFOS / 2 and then push the info onto the stack.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_INFOS / 2 is the first of the recently put agent infos?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's the MAX_INFOS / 2th most recently pushed info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why that and not the oldest one for example?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's described in the first paragraphs of section 3.2, basically the first half of the stack is allocated to known reliable nodes (because they've been online without expiring for the longest). And the second half of the stack is allocated to an ever changing list of the most recent nodes to publish their agent infos (to make it less possible for the older peers to collude on an eclipse attack).

The "delete the middle one" algorithm was my attempt at the simplest design of this.

///
/// #### 4. Rate Limiting
///
/// Info count within a space is limited by the storage strategy above. But
/// space count within a server is unbounded. While servers will likely want
/// to limit request counts in general, we are also going to have to take some
/// special care with PUT requests that result in creation of a new space that
/// was not previously tracked in order to mitigate attacks.
///
/// - A server SHOULD delete any spaces that no longer contain agents.
/// - A server SHOULD decide a max count of spaces it intends to support
/// (based on available memory or disk space for storing agents within those
/// spaces), and then return error 429 beyond that.
/// - A server MAY make the above max space count configurable.
/// - A server SHOULD track the IP addresses of clients that make PUT requests
/// which result in space creation, and error with 429 if that frequency
/// is beyond a limit.
/// - A server MAY make this limit configurable.
#[cfg(doc)]
pub mod spec {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also have a /metrics path which behaves similar to the current metrics system (showing number of space and active nodes in each space (or maybe number of trusted and self-published nodes in each space)) for a block of time, except this time have the results be REVERSE chronological showing the most recent time blocks first?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually metrics are a server-specific implementation detail. I'm happy to include a metrics api with the server.

@artbrock - do you think a metrics api needs to be included in the spec? Indicating that any k2 bootstrap server implementation anybody made would respond with the exact same metrics?