feat: persist sandbox information locally in orchestrator #376

tychoish · 2025-03-05T15:58:50Z

This sets up a SQLite database in the orchestrator to track the
information about sandboxes in the orchestrator rather than the API
server, in support of rolling deployments.

API server(s) can't be responsible for this information anymore
because they might restart (because of deploys), there might be more
than one of them at a time (for redundancy, high avalibility, or as a
side effect of deployment.), and we don't want to add a load-baring
system of record for information about the sandboxes that the API
servers can access. Letting orchestrators be the system of record
makes sense because they already have the information, and the
lifecylce of a sandbox is (at the moment) tied to the lifecycle of the
orchestrator.

There are many implementation possibilities, but I/we went with SQLite
because:

it's well understood and battletested. It's also fast for this
kind of workload and we can avoid writing filters/queries in Go, and
just use SQL.
can be used with the DB tooling we already have. (n.b. this is the
first time I've really used this ORM tool, and its pretty nice.)
writing data to disk means that we can be less worried about a large
number of short running sandboxes filling up memory, and we can be
less aggressive about removing data because there's (likely) plenty
of disk space. We can rely on SQLite's caching mechanism (rather
than Go or our own implementation) to keep or release data from
memory.
because (at the moment) orchestrators never restart or are
redeployed, we don't have to worry about schema or data migration:
realistically every time the orchestrator starts, the database will
be empty. In the future when we might be able to add
if the API servers' view of what's running on the orchestrator is no
longer strictly consistent (because there might be many of them,
they might restart, etc.) then we need to keep a record of not just
what is running but what has run recently so we can make sure to
bill correctly and so we can distinguish between "this sandbox
doesn't exist anywhere" and "this sandbox used to exist."
embedded in this implementation are version numbers for both
sandboxes (as they change) and a global version number for all the
data in the database/orchestrator. The idea here is that if we
increment these numbers correctly when modifying the data in SQLite,
we can provide an interface that the API servers can use to
efficently determine if their cache is out of date.
- because we store the global version number in the sandbox record
  you can get all of the sandboxes that have been modified or
  created since your last view.
- you can compare integers per-record or per orchestrator, to figure
  out if your data is stale rather than needing a more complex
  algorithim.
I did attempt to implement this using an in-memory cache rather than
SQLite, which I think would be possible, but our concurrent map is
sharded (to prevent lock contention for modification-heavy
worklods,) and getting the version numbers (plus the extra level of
shard-versioning,) makes things much more complicated from a code
perspective and it's my assessment that SQLite will scale better,
require less code to write, and be easier to develop code against today,
in addition to being something we'll want in the future.

This isn't quite done. Remaining work includes:

We/I need to rehome the APIs to use data from the database rather
than from the cache.
We/I probably need to cache a version of the sandbox structure (with
more information,) in the database. (possibly binary protobuf?) to
support the APIs
Testing, of course. At this point the PR doesn't change the behavior
because the old data storage/cache is still the system of record.
I would like more feedback on this implementation or the use of the
ORM system.

jakubno

I am really sorry, I didn't mentioned it, but we are moving away from entgo, could you please use sqlc.

Here is the setup I currently have.

version: "2"
sql:
  - engine: "postgresql"
    queries: "db/queries"
    schema: "db/migrations"
    gen:
      go:
        emit_pointers_for_null_types: true
        package: "database"
        out: "packages/shared/pkg/database"
        sql_package: "pgx/v5"
        overrides:
          - db_type: "uuid"
            go_type:
              import: "github.com/google/uuid"
              type: "UUID"
          - db_type: "uuid"
            nullable: true
            go_type:
              import: "github.com/google/uuid"
              type: "UUID"
              pointer: true

          - db_type: "pg_catalog.numeric"
            go_type: "github.com/shopspring/decimal.Decimal"
          - db_type: "pg_catalog.numeric"
            nullable: true
            go_type: "*github.com/shopspring/decimal.Decimal"

          - db_type: "timestamptz"
            go_type: "time.Time"
          - db_type: "timestamptz"
            go_type:
              import: "time"
              type: "Time"
              pointer: true
            nullable: true

…rrently-cached-by-the-api-server-about-e2b-1394

(cherry picked from commit 2bc8680)

(cherry picked from commit efefde4)

(cherry picked from commit e0bdcf8)

…rrently-cached-by-the-api-server-about-e2b-1394

dobrac

Looks good, one thing though, if you include already a way how to query the current state/data, you should be able to write (at least) integration tests (which are now merged) to verify the functionality/behavior

(here is a PR adding the orchestrator client to the integration tests: #403)

dobrac · 2025-03-07T15:21:45Z

packages/orchestrator/internal/db/db.go

+	// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models"
+	// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models/sandbox"


Suggested change

// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models"

// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models/sandbox"

dobrac · 2025-03-07T15:23:15Z

packages/orchestrator/internal/db/db.go

+
+func (db *DB) CreateSandbox(ctx context.Context, params database.CreateSandboxParams) error {
+	return db.WithTx(ctx, func(ctx context.Context, op *database.Queries) error {
+		if _, err := op.IncGlobalVersion(ctx); err != nil {


for the versioning, have you considered using timestamp, if so, what was the reason why not use that instead?

dobrac · 2025-03-07T15:25:15Z

packages/orchestrator/internal/sandbox/checks.go

+	var request *http.Request
+
+	request, err = http.NewRequestWithContext(ctx, "GET", address, nil)


Suggested change

var request *http.Request

request, err = http.NewRequestWithContext(ctx, "GET", address, nil)

request, err := http.NewRequestWithContext(ctx, "GET", address, nil)

dobrac · 2025-03-07T15:25:38Z

packages/orchestrator/internal/server/main.go

@@ -106,6 +112,20 @@ func New(ctx context.Context, port uint) (*Service, error) {
 		}
 	}

+	// BLOCK: setup database


can we use function instead

dobrac · 2025-03-07T15:26:52Z

packages/orchestrator/internal/server/main.go

@@ -106,6 +112,20 @@ func New(ctx context.Context, port uint) (*Service, error) {
 		}
 	}

+	// BLOCK: setup database
+	{
+		const dbConnStr = "file:./db/sandboxes.db?_journal_mode=wal"


maybe hoisting this to the top file const or something similar?

dobrac · 2025-03-07T15:34:32Z

packages/orchestrator/internal/server/sandboxes.go

@@ -66,25 +67,36 @@ func (s *server) Create(ctxConn context.Context, req *orchestrator.SandboxCreate
 		errMsg := fmt.Errorf("failed to cleanup sandbox: %w", errors.Join(err, context.Cause(ctx), cleanupErr))
 		telemetry.ReportCriticalError(ctx, errMsg)

-		return nil, status.New(codes.Internal, errMsg.Error()).Err()
+		return nil, status.New(codes.Internal, err.Error()).Err()


why the errMsg to err change?

ValentaTomas · 2025-03-21T01:21:16Z

What is the blocker for merging this?

feat: persist sandboxes locally in orchestrator

6d3fa6a

tychoish requested review from ValentaTomas and jakubno as code owners March 5, 2025 15:58

jakubno requested changes Mar 6, 2025

View reviewed changes

Merge remote-tracking branch 'origin/main' into track-all-the-data-cu…

f554858

…rrently-cached-by-the-api-server-about-e2b-1394

jakubno added the improvement Improvement for current functionality label Mar 6, 2025

jakubno assigned tychoish and jakubno Mar 6, 2025

tychoish added 4 commits March 7, 2025 08:34

chore: switch to sqlc

07426af

(cherry picked from commit 2bc8680)

feat: track state machine for of orchestrator status

ac9cedc

(cherry picked from commit efefde4)

fix: remove unneeded imports

d17e5d4

(cherry picked from commit e0bdcf8)

Merge remote-tracking branch 'origin/main' into track-all-the-data-cu…

bf4e94d

…rrently-cached-by-the-api-server-about-e2b-1394

dobrac reviewed Mar 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: persist sandbox information locally in orchestrator #376

feat: persist sandbox information locally in orchestrator #376

tychoish commented Mar 5, 2025

jakubno left a comment

dobrac left a comment •

edited

Loading

dobrac Mar 7, 2025

dobrac Mar 7, 2025

dobrac Mar 7, 2025

dobrac Mar 7, 2025

dobrac Mar 7, 2025

dobrac Mar 7, 2025

ValentaTomas commented Mar 21, 2025

		// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models"
		// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models/sandbox"

		var request *http.Request

		request, err = http.NewRequestWithContext(ctx, "GET", address, nil)

feat: persist sandbox information locally in orchestrator #376

Are you sure you want to change the base?

feat: persist sandbox information locally in orchestrator #376

Conversation

tychoish commented Mar 5, 2025

jakubno left a comment

Choose a reason for hiding this comment

dobrac left a comment • edited Loading

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

dobrac Mar 7, 2025

Choose a reason for hiding this comment

ValentaTomas commented Mar 21, 2025

dobrac left a comment •

edited

Loading