
Conversation

@alexlovelltroy
Member

This pull request establishes the handlers and storage functions that allow cloud-init to function as a BSS server in simple cases. The most important changes are grouped by theme below.

Storage Backend Enhancements:

  • Introduced a new quackstore backend for persistent storage, configurable via the --storage-backend flag; the default remains memstore. Added support for specifying a database file path for quackstore via the --db-path flag. (cmd/cloud-init-server/main.go)
  • Updated imports to use pkg/smdclient instead of internal/smdclient for better modularity. (cmd/cloud-init-server/handlers.go, cmd/cloud-init-server/main.go, and other handler files)
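The backend selection described above can be sketched as a small factory. This is a minimal illustration: the `Store` interface, constructor names, and backend types here are stand-ins, not the project's actual `cistore`/`quackstore` APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a stand-in for the interface both backends satisfy.
type Store interface{ Name() string }

type memStore struct{}

func (memStore) Name() string { return "mem" }

type quackStore struct{ path string }

func (q quackStore) Name() string { return "quack:" + q.path }

// newStore picks a backend from the --storage-backend flag value,
// mirroring the selection logic described above (names are illustrative).
func newStore(backend, dbPath string) (Store, error) {
	switch backend {
	case "mem":
		return memStore{}, nil
	case "quack":
		return quackStore{path: dbPath}, nil
	default:
		return nil, errors.New("unknown storage backend: " + backend)
	}
}

func main() {
	s, err := newStore("quack", "netboot.db")
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Name()) // quack:netboot.db
}
```

Keeping the selection behind a single factory means adding a third backend later touches only this switch, not the rest of main().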

Boot Parameter Management:

  • Added a new pkg/bss package to handle boot parameters (BootParams and BootParamsV1 structs) with JSON parsing, validation, and merging capabilities. (pkg/bss/bootparams.go)
  • Implemented a utility function to generate iPXE boot scripts based on boot parameters, including support for root filesystem and cloud-init configurations. (pkg/bss/bootscript.go)

Code Quality Enhancements:

  • Improved the generateInstanceID function in quackstore to generate unique instance IDs using random hexadecimal values. (internal/quackstore/quackstore.go)
  • Added new dependencies to go.mod for UUID generation and YAML parsing. (go.mod)
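The instance-ID scheme mentioned above (an "i-" prefix plus six random hexadecimal characters) can be sketched as follows; the function name matches the description, but the body is illustrative.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateInstanceID returns an ID of the form "i-XXXXXX": three random
// bytes rendered as six hex characters, per the description above.
func generateInstanceID() (string, error) {
	b := make([]byte, 3)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return "i-" + hex.EncodeToString(b), nil
}

func main() {
	id, err := generateInstanceID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // e.g. "i-a3f20b"
}
```

Using crypto/rand rather than math/rand avoids collisions from a poorly seeded generator, at the cost of a possible (rare) read error.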

These changes enhance the flexibility, reliability, and test coverage of the cloud-init-server project, paving the way for improved storage options and boot parameter handling.

alexlovelltroy and others added 19 commits March 24, 2025 14:01
- Added QuackStore implementation for managing groups, instances, and cluster defaults using Quack for persistence.
- Introduced unit tests for QuackStore to ensure functionality and data integrity.
- Updated go.mod to include new dependencies and replaced the quack module path.
- Enhanced CloudConfigFile JSON marshaling and unmarshaling for better encoding handling.
…handlers

- Added BootParams struct to define boot parameters including kernel, initrd, root filesystem, and cloud-init server configurations.
- Implemented GenerateBootScript function to create iPXE boot scripts based on BootParams.
- Created BootParamsHandler for handling HTTP requests to create, update, retrieve, and generate boot scripts.
- Introduced MemoryStore for in-memory storage of boot parameters with versioning.
- Added comprehensive unit tests for BootParams, boot script generation, and HTTP handlers to ensure functionality and reliability.
- Updated go.mod to include the new dependency for UUID generation.
- Refactored import paths to use the new smdclient package structure.
- Introduced BootParamsV1 struct for backward compatibility with the old API specification.
- Enhanced boot parameters validation and merging logic.
- Implemented new HTTP handlers for creating and managing boot parameters.
- Added comprehensive unit tests for the new features and refactored existing tests for better coverage.
- Added QuackStore implementation for managing groups, instances, and cluster defaults using Quack for persistence.
- Introduced unit tests for QuackStore to ensure functionality and data integrity.
- Updated go.mod to include new dependencies and replaced the quack module path.
- Enhanced CloudConfigFile JSON marshaling and unmarshaling for better encoding handling.
- Introduced a new test file for MemStore to validate its functionality.
- Integrated standard test suite from cistore testing package to ensure comprehensive testing.
- Introduced a new command-line flag for selecting the storage backend (mem or quack).
- Default storage backend set to memstore, with a new option for quackstore using a specified database path.
- Updated the datastore initialization logic to handle both memstore and quackstore, including error handling for quackstore initialization.
- Removed base64 encoding handling from UnmarshalJSON for clarity.
- Added MarshalJSON implementation to ensure proper JSON serialization of CloudConfigFile.
- Added functionality to generate a unique instance ID in the format "i-XXXXXX" using a random 6-digit hexadecimal string.
- Updated the comment to clarify the instance ID generation process.
- Added QuackStore implementation for managing groups, instances, and cluster defaults using Quack for persistence.
- Introduced unit tests for QuackStore to ensure functionality and data integrity.
- Updated go.mod to include new dependencies and replaced the quack module path.
- Enhanced CloudConfigFile JSON marshaling and unmarshaling for better encoding handling.
- Introduced new BSS API endpoints for managing boot parameters and generating boot scripts.
- Added handlers for creating, updating, and retrieving boot parameters.
- Implemented routing for BSS endpoints in the main application.
- Updated go.mod to include new dependencies for BSS functionality.
- Enhanced documentation for the new API endpoints with Swagger annotations.
@synackd
Contributor

synackd commented Apr 24, 2025

I think that this is meant to be a solution to BSS's current lack of group name support when using the Postgres backend, but I'm a bit worried that by adding BSS functionality to cloud-init, we are stepping towards distributed monoliths instead of microservices, which I believe is the same reason we removed discovery from SMD.

Is there an important use case for having cloud-init also provide boot parameter management and script generation? Just trying to understand why we're duplicating BSS functionality here versus refactoring BSS.

@alexlovelltroy alexlovelltroy changed the title Alovelltroy/bss in cloud init Add a subset of API-compatible endpoints to establish the feasibility of switching away from upstream BSS without losing functionality Apr 24, 2025
@alexlovelltroy alexlovelltroy marked this pull request as draft April 24, 2025 19:39
@davidallendj
Contributor

I agree with @synackd here. I believe we had this discussion before, and IIRC we were firmly against duplicating the same functionality across multiple microservices. Is the intent here to eventually phase out BSS entirely? Are we not packing too many features into cloud-init alone? I'm wondering if adding this will move cloud-init in the same direction as SMD.

@alexlovelltroy
Member Author

This would replace BSS which has proven particularly hostile to refactoring and unit testing.

@alexlovelltroy
Member Author

I have confined the BSS portions to a pkg/ so we can decide later whether to maintain a separate BSS or to integrate cloud-init and BSS.

@rustydb

rustydb commented Apr 29, 2025

qq, are the functions of BSS that OpenCHAMI was leveraging confined to just bootparameters and bootscripts?

@alexlovelltroy
Member Author

@rustydb

There are four endpoints described in the swaggerfile. At this point, OpenCHAMI is only leveraging the two concerned with bootparameters and bootscripts. It didn't look to us like there was added value in the dumpstate or hosts endpoints.

  1. /boot/v1/bootscript
  2. /boot/v1/bootparameters
  3. /boot/v1/hosts
  4. /boot/v1/dumpstate

I'm not closed to the idea that we might leverage more of the API at some point. We simply haven't needed them. Are we missing something?

@synackd
Contributor

synackd commented Apr 30, 2025

I think reimplementing an API-compliant version of BSS that allows the functionality that we want to implement is probably the way to go, but I'm wondering if cloud-init is the right place for it.

Are we sure we want to combine boot parameter handling and post-boot configuration into a single scope?

@rustydb

rustydb commented Apr 30, 2025

@alexlovelltroy

@rustydb

There are four endpoints described in the swaggerfile. At this point, OpenCHAMI is only leveraging the two concerned with bootparameters and bootscripts. It didn't look to us like there was added value in the dumpstate or hosts endpoints.

Makes sense. FWIW I'm looking for any gaps, anything that BSS does today that isn't covered here and whether that matters.

That really just leaves the pre-signed URLs (for private S3 buckets) and spire-tokens, which I don't see mentioned in the PR. I assume those aren't important to OpenCHAMI either.

@alexlovelltroy
Member Author

We're not using pre-signed URLs or spire tokens at this point in OpenCHAMI. I'm not convinced that either adds security given the unencrypted/unauthenticated nature of iPXE. However, if we choose to go down this path, we may need to open both assertions up for discussion.

Pre-signed urls aren't necessary for LANL at this time because none of our system images, kernels, or initrd contain any secrets and are on segregated networks. If OpenCHAMI needs to support authenticated access to S3 resources, we'll need to update this project.

I'll defer to @yogi4 who has worked a lot more with spire, but I think we'd prefer not to embed the spire join token in the kernel commandline and instead leverage the tpm attestation process to initialize spire.

@alexlovelltroy
Member Author

Looks like there is a /boot/v1/endpoint-history endpoint that will provide information about the last time each node retrieved its bootscript. That sounds pretty useful.

@Masber

Masber commented May 1, 2025

Just sharing my personal opinion about cloud-init and BSS:

These services serve different purposes and operate at different stages. BSS handles boot-time configuration (setting kernel parameters, delivering the initrd, etc.), essentially shaping how the system starts. cloud-init, by contrast, handles runtime configuration once the OS is up (e.g., user setup, networking, mounting filesystems, package installs).

I agree with @synackd’s point, but I’d also emphasize that cloud-init isn't a given (not every OCHAMI user will use it). Runtime config is complex and often handled differently across environments (cloud-init, Ansible, custom scripts, overlayfs, etc.).

@alexlovelltroy
Member Author

I agree that we don't want to force people into using cloud-init if they don't want to.

I'm updating this branch/PR to isolate the BSS API in a separate executable with a separate build. We can review it once that's functional.

…rameters

- Introduced a new netboot server in cmd/netboot-server/main.go.
- Updated .goreleaser.yaml to include netboot-server build configuration.
- Added API endpoints for creating, retrieving, and updating boot parameters in docs/swagger.yaml and docs/docs.go.
- Defined BootParams schema in swagger documentation and Go code.
Contributor

@rainest rainest left a comment

This could basically just stuff BSS under a router subsection, correct? As-is it has multiple BSS endpoints under the root, but I think if we group those and adhere to code organization constraints, keeping these in a single repo/binary is fine.

Most everything this adds resides under pkg/bss. The only parts that need to interact with both are the main/cmd and the router, and we can design those such that they keep the components separate, and that we could, if desired, split them into actual separate repos later.

We'd need to be careful about how we code shared components, since it wouldn't be enforced by split repos, but it's doable. There are some bits in the PR as written that don't keep that separation clean (see the in-context comments).

I think you're also blocked from just ignoring the cloud-init bit: it looks like the cluster defaults initialization creates de facto required configuration specific to cloud-init, and ideally we wouldn't have that.

Arguments for keeping them together:

  • These are similar enough in use and needs: they're both basically infrequent human write, frequent machine read config stores that communicate with SMD a bit, and they both need to be accessible to machines before those machines are fully configured.
  • Hey, it's an excuse to get away from the BSS codebase! If a simpler version of that is sufficient, may as well.

If we do want to separate it, we should have a separate repo rather than a separate binary: pinning them to the same module versions and needing multiple Containerfiles is annoying. SMDClient is probably a reasonable module-only repo, and not re-implementing that across services makes sense anyway.

Comment on lines 215 to 218
if bssEnabled {
	bssStorage := bss.NewMemoryStore()
	initBssRouter(router, ciHandler, bssStorage)
}
Contributor

This should use the same pattern as the above routers, where it mounts a subrouter to some path within main(). The subrouter definition can still live in a function, but the prefix it lives under should be under the control of main().

})
}

func initBssRouter(router chi.Router, handler *CiHandler, bssStorage bss.Store) {
Contributor

Accepting a CiHandler here binds BSS logic to CI-specific types; we should avoid that.

AFAICT the struct only serves to aggregate several common constructor arguments into one, and BSS only uses the SMDClient field. We should just pass the SMDClient alone; BSS will never have a need for the CI cluster name or CI store.

We may want to refactor that type into WithSMDClient(), WithCIStore(), etc. functions on the types that use them, but that's probably better handled elsewhere.

Member Author

@alexlovelltroy alexlovelltroy May 6, 2025

I completely agree.

Based on other feedback, I'm working on a separate binary to serve the BSS portions. That implementation could be moved to a separate repo in the future as you suggest, but we'd need to do some careful planning to prevent a tightly coupled distributed monolith from reappearing.

The new signature is at:

func initBssRouter(router chi.Router, sm smdclient.SMDClientInterface, bssStorage bss.Store) {

Cleaning up that struct is on my mental todo list for cloud-init as well.

alexlovelltroy and others added 3 commits May 19, 2025 13:38
- Added QuackStore for persistent storage of boot parameters, supporting versioning and group templates.
- Implemented API endpoints for managing boot parameters, including creation, retrieval, updating, and versioning.
- Introduced a version command to display build information and server version.
- Updated .goreleaser.yaml to support new archive formats (zip and tar.gz).
- Enhanced documentation with new fields and API specifications in swagger and Go code.
- Added tests for QuackStore functionality to ensure data integrity and operations.
Contributor

@rainest rainest left a comment

Comments on 8c4c2b8, mostly regarding some organizational changes to better allow alternate data stores, and avoiding bash for integration tests.

	insecure       bool
	fakeSMDEnabled bool
	storageBackend string
	dbPath         string
Contributor

We should look at turning the server into a struct that main() instantiates from the args rather than using package vars.

For dbPath in particular, that struct should have a quackConfig field, whose value is a struct with Quack-specific parameters.
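A sketch of the suggested refactor, with illustrative field names: main() parses flags and builds a server struct, and Quack-specific settings live in their own nested config rather than as package variables.

```go
package main

import "fmt"

// quackConfig groups Quack-specific parameters, as suggested above.
type quackConfig struct {
	DBPath string
}

// server replaces the package-level flag variables with a struct that
// main() instantiates from the parsed args (field names are illustrative).
type server struct {
	Insecure       bool
	FakeSMDEnabled bool
	StorageBackend string
	Quack          quackConfig
}

// newServer would be called by main() after flag parsing.
func newServer(insecure, fakeSMD bool, backend, dbPath string) server {
	return server{
		Insecure:       insecure,
		FakeSMDEnabled: fakeSMD,
		StorageBackend: backend,
		Quack:          quackConfig{DBPath: dbPath},
	}
}

func main() {
	s := newServer(false, true, "quack", "netboot.db")
	fmt.Println(s.StorageBackend, s.Quack.DBPath) // quack netboot.db
}
```

Besides making the server testable without touching globals, this keeps backend-specific knobs scoped: a hypothetical third backend would add its own nested config struct instead of more package vars.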

serveCmd.Flags().BoolVar(&insecure, "insecure", os.Getenv("INSECURE") == "true", "Disable TLS verification (default from INSECURE env)")
serveCmd.Flags().BoolVar(&fakeSMDEnabled, "fake-smd", os.Getenv("CLOUD_INIT_SMD_SIMULATOR") == "true", "Enable FakeSMDClient simulation (default from CLOUD_INIT_SMD_SIMULATOR env)")
serveCmd.Flags().StringVar(&storageBackend, "storage-backend", "mem", "Storage backend to use (mem or quack)")
serveCmd.Flags().StringVar(&dbPath, "db-path", "netboot.db", "Path to the database file for quackstore backend")
Contributor

Similarly, storage backend-specific flags should use some common prefix to scope them, so quack-db-path probably.

Str("certPath", certPath).
Bool("insecure", insecure).
Str("storageBackend", storageBackend).
Str("dbPath", dbPath).
Contributor

For this to leave room for other stores, we could either:

  • Add a func (s *Store) Describe() string "identify yourself" method to bss.Store, and then have Str("storage", bssStorage.Describe())
  • Add a log argument to NewQuackStore() and have it log whatever setup info it chooses.
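The first option can be sketched as follows; the Describe() method name comes from the suggestion above, while the backend types and returned strings are illustrative.

```go
package main

import "fmt"

// Store extends the bss.Store idea with a Describe method so logging
// stays backend-agnostic: the server logs whatever the store reports
// about itself instead of knowing backend-specific fields like dbPath.
type Store interface {
	Describe() string
}

type memoryStore struct{}

func (memoryStore) Describe() string { return "mem" }

type quackStore struct{ dbPath string }

func (q quackStore) Describe() string { return "quack (" + q.dbPath + ")" }

func main() {
	var s Store = quackStore{dbPath: "netboot.db"}
	// The server can now log Str("storage", s.Describe()) without
	// caring which backend is behind the interface.
	fmt.Println(s.Describe()) // quack (netboot.db)
}
```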


info := VersionInfo()
uptime := time.Since(startTime).String()
memstats := &runtime.MemStats{}
Contributor

If we're going to include info other than the version, we may want to consider an entire diagnostic child router, with separate /info/version and /info/memstats endpoints. I don't know how many additional diagnostic endpoints we'd add, if any, but if it's more than two, they probably shouldn't live directly at the root.
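If a /info/memstats endpoint were added, it might expose a small, stable subset of runtime.MemStats rather than the whole struct; the field selection here is purely illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// memStatsSummary returns a small, JSON-friendly view of the runtime
// memory statistics; which fields to expose is an open design choice.
func memStatsSummary() map[string]uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return map[string]uint64{
		"alloc_bytes": m.Alloc,        // bytes of allocated heap objects
		"sys_bytes":   m.Sys,          // total bytes obtained from the OS
		"num_gc":      uint64(m.NumGC), // completed GC cycles
	}
}

func main() {
	out, err := json.Marshal(memStatsSummary())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```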

@@ -0,0 +1,97 @@
#!/bin/bash
Contributor

It looks like this functionally provides an integration test? It's not invoked by anything, but does perform tests.

I'd strongly recommend using Go to handle integration tests as well. If the server's changed into a struct, a TestMain() can instantiate it and start serving. That does skip testing the actual CLI, but that's usually simple enough to not warrant much concern.

Aside from avoiding the joys of lengthy bash scripts, that avoids hidden test dependencies (this introduces one on HTTPie), lets us use a debugger for both the tests and the tested instance, and allows collection of coverage and pass/fail data in the same format as other tests.

@alexlovelltroy
Member Author

waiting to rebase on top of the quackstore PR (#68) once that merges.
