-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: persist sandbox information locally in orchestrator #376
base: main
Are you sure you want to change the base?
feat: persist sandbox information locally in orchestrator #376
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really sorry, I didn't mentioned it, but we are moving away from entgo, could you please use sqlc
.
Here is the setup I currently have.
version: "2"
sql:
- engine: "postgresql"
queries: "db/queries"
schema: "db/migrations"
gen:
go:
emit_pointers_for_null_types: true
package: "database"
out: "packages/shared/pkg/database"
sql_package: "pgx/v5"
overrides:
- db_type: "uuid"
go_type:
import: "github.com/google/uuid"
type: "UUID"
- db_type: "uuid"
nullable: true
go_type:
import: "github.com/google/uuid"
type: "UUID"
pointer: true
- db_type: "pg_catalog.numeric"
go_type: "github.com/shopspring/decimal.Decimal"
- db_type: "pg_catalog.numeric"
nullable: true
go_type: "*github.com/shopspring/decimal.Decimal"
- db_type: "timestamptz"
go_type: "time.Time"
- db_type: "timestamptz"
go_type:
import: "time"
type: "Time"
pointer: true
nullable: true
…rrently-cached-by-the-api-server-about-e2b-1394
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, one thing though, if you include already a way how to query the current state/data, you should be able to write (at least) integration tests (which are now merged) to verify the functionality/behavior
(here is a PR adding the orchestrator client to the integration tests: #403)
// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models" | ||
// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models/sandbox" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models" | |
// "github.com/e2b-dev/infra/packages/orchestrator/internal/pkg/models/sandbox" |
|
||
func (db *DB) CreateSandbox(ctx context.Context, params database.CreateSandboxParams) error { | ||
return db.WithTx(ctx, func(ctx context.Context, op *database.Queries) error { | ||
if _, err := op.IncGlobalVersion(ctx); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the versioning, have you considered using timestamp, if so, what was the reason why not use that instead?
var request *http.Request | ||
|
||
request, err = http.NewRequestWithContext(ctx, "GET", address, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var request *http.Request | |
request, err = http.NewRequestWithContext(ctx, "GET", address, nil) | |
request, err := http.NewRequestWithContext(ctx, "GET", address, nil) |
@@ -106,6 +112,20 @@ func New(ctx context.Context, port uint) (*Service, error) { | |||
} | |||
} | |||
|
|||
// BLOCK: setup database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use function instead
@@ -106,6 +112,20 @@ func New(ctx context.Context, port uint) (*Service, error) { | |||
} | |||
} | |||
|
|||
// BLOCK: setup database | |||
{ | |||
const dbConnStr = "file:./db/sandboxes.db?_journal_mode=wal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe hoisting this to the top file const or something similar?
@@ -66,25 +67,36 @@ func (s *server) Create(ctxConn context.Context, req *orchestrator.SandboxCreate | |||
errMsg := fmt.Errorf("failed to cleanup sandbox: %w", errors.Join(err, context.Cause(ctx), cleanupErr)) | |||
telemetry.ReportCriticalError(ctx, errMsg) | |||
|
|||
return nil, status.New(codes.Internal, errMsg.Error()).Err() | |||
return nil, status.New(codes.Internal, err.Error()).Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the errMsg to err change?
What is the blocker for merging this? |
This sets up a SQLite database in the orchestrator to track the
information about sandboxes in the orchestrator rather than the API
server, in support of rolling deployments.
API server(s) can't be responsible for this information anymore
because they might restart (because of deploys), there might be more
than one of them at a time (for redundancy, high avalibility, or as a
side effect of deployment.), and we don't want to add a load-baring
system of record for information about the sandboxes that the API
servers can access. Letting orchestrators be the system of record
makes sense because they already have the information, and the
lifecylce of a sandbox is (at the moment) tied to the lifecycle of the
orchestrator.
There are many implementation possibilities, but I/we went with SQLite
because:
it's well understood and battletested. It's also fast for this
kind of workload and we can avoid writing filters/queries in Go, and
just use SQL.
can be used with the DB tooling we already have. (n.b. this is the
first time I've really used this ORM tool, and its pretty nice.)
writing data to disk means that we can be less worried about a large
number of short running sandboxes filling up memory, and we can be
less aggressive about removing data because there's (likely) plenty
of disk space. We can rely on SQLite's caching mechanism (rather
than Go or our own implementation) to keep or release data from
memory.
because (at the moment) orchestrators never restart or are
redeployed, we don't have to worry about schema or data migration:
realistically every time the orchestrator starts, the database will
be empty. In the future when we might be able to add
if the API servers' view of what's running on the orchestrator is no
longer strictly consistent (because there might be many of them,
they might restart, etc.) then we need to keep a record of not just
what is running but what has run recently so we can make sure to
bill correctly and so we can distinguish between "this sandbox
doesn't exist anywhere" and "this sandbox used to exist."
embedded in this implementation are version numbers for both
sandboxes (as they change) and a global version number for all the
data in the database/orchestrator. The idea here is that if we
increment these numbers correctly when modifying the data in SQLite,
we can provide an interface that the API servers can use to
efficently determine if their cache is out of date.
because we store the global version number in the sandbox record
you can get all of the sandboxes that have been modified or
created since your last view.
you can compare integers per-record or per orchestrator, to figure
out if your data is stale rather than needing a more complex
algorithim.
I did attempt to implement this using an in-memory cache rather than
SQLite, which I think would be possible, but our concurrent map is
sharded (to prevent lock contention for modification-heavy
worklods,) and getting the version numbers (plus the extra level of
shard-versioning,) makes things much more complicated from a code
perspective and it's my assessment that SQLite will scale better,
require less code to write, and be easier to develop code against today,
in addition to being something we'll want in the future.
This isn't quite done. Remaining work includes:
We/I need to rehome the APIs to use data from the database rather
than from the cache.
We/I probably need to cache a version of the sandbox structure (with
more information,) in the database. (possibly binary protobuf?) to
support the APIs
Testing, of course. At this point the PR doesn't change the behavior
because the old data storage/cache is still the system of record.
I would like more feedback on this implementation or the use of the
ORM system.