All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Device Flow Authentication (kamu CLI `0.235.0`)
- Pinned `aws-sdk-s3` crate to the version before breaking changes
- Updated all minor versions of other crates
- Denormalization: `DatasetEntry` now contains a copy of the owner's account name for faster dataset handle resolution without an extra round trip to the database
- Sped up listing of account flow runs
- Forced warm-up of the dataset IDs listing cache on startup
- Reverted `sqlx` upgrade that was breaking the image build
- Outbox: added a new consumer metadata parameter, `initial_consumer_boundary`, which allows a new consumer to skip the backlog and start from the latest message instead of processing all of them
- S3 `get_stored_dataset_by_id` operation takes advantage of the in-memory dataset listing cache
- Automatic indexing of key dataset blocks in the database for quicker navigation:
  - all previously stored datasets are indexed at startup
  - new changes to datasets are indexed incrementally, whenever HEAD advances
- The metadata chain visiting algorithm can now use key blocks cached in the database to iterate over key blocks efficiently when data events are not needed
- E2E, `kamu-node-e2e-repo-tests`: removed a `kamu-api-server` dependency that did not cause the `kamu-api-server` binary to be rebuilt
- Upgraded to latest version of `dill=0.13`
- New `semantic_search_threshold_score` value in the search configuration, which is used in `UiConfiguration`
- New `engine.datafusionEmbedded` config section allows passing custom DataFusion settings when the engine is used in ingest, batch query, and compaction contexts (see the sketch after this list)
- GQL: `Datasets::role()`: returns the current user's role in relation to the dataset
- GQL: `DatasetsMut::create_empty()` & `DatasetsMut::create_from_snapshot()`: alias validation in multi-tenant mode
- GQL: `DatasetsMut::create_empty()` & `DatasetsMut::create_from_snapshot()`: `dataset_visibility` is now mandatory
- `kamu push`/`kamu pull` commands with the `--force` flag no longer allow overwriting of the seed block
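A minimal sketch of what the `engine.datafusionEmbedded` section could look like, assuming per-context groups of DataFusion settings; the nesting and context key names are assumptions, though the `datafusion.execution.*` options are real DataFusion configuration names:

```yaml
# Hypothetical layout; exact nesting and context names may differ.
engine:
  datafusionEmbedded:
    ingest:
      "datafusion.execution.target_partitions": "4"
    batchQuery:
      "datafusion.execution.batch_size": "8192"
    compaction:
      "datafusion.execution.target_partitions": "1"
```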
- Multiple performance improvements in batch queries to avoid unnecessary metadata scanning.
- Corrected API version
- Trigger dependent dataset flows on HTTP `/ingest` and on smart transfer protocol dataset push
- `DatasetSummary` files replaced with `DatasetStatistics` stored in the database and updated synchronously with HEAD reference updates
  - Statistics are automatically pre-computed for all existing datasets on first use
- `DatasetHandle` and `DatasetEntry` now contain a dataset kind marker
- Provenance service and pull request planner fetch dependencies from the graph
- Implemented caching layer for `DatasetEntry` within the currently open transaction
- Private Datasets: sharing access (kamu CLI `0.230.0`)
- Introduced `debug` CLI command group for operational helpers
- New `debug semsearch-reindex` CLI command that allows recreating the embeddings in the vector repository
- Semantic search (see the config sketch after this list):
  - More configuration options for indexing, allowing datasets without a description or without data to be skipped
  - Overfetch is now configurable
  - The service will make repeated queries to the vector store to fill the requested results page size
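A hedged sketch of how these semantic-search knobs might sit in the server config; every key name below is illustrative rather than the actual schema:

```yaml
# Illustrative keys only; consult the actual configuration schema for exact names.
searchService:
  indexer:
    skipDatasetsWithNoDescription: true   # skip datasets lacking a description
    skipDatasetsWithNoData: true          # skip datasets lacking data
  overfetchFactor: 2                      # fetch extra candidates per results page
uiConfiguration:
  semanticSearchThresholdScore: 0.5       # minimum score for a result to be shown
```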
- Flow: Updated the `BatchingRule` trigger to accept 0 for both properties (`min_records_to_await` and `max_batching_interval`), enabling dependency flow execution even when no data is added to the root dataset
- HTTP & GQL API: Fixed internal error when a query contains an unknown column
- E2E: running tests also for S3 repositories
- DB-backed dataset references: they are now stored in the database, supporting transactional updates
- Ensured short transaction length in ingest & transform updates and compaction tasks.
- Dataset Reference indexing to build the initial state of the dataset references.
- Implemented in-memory caching for dataset references that works within the current transaction
- Replaced default GraphQL playground with the better-maintained `graphiql` (old playground is still available)
- Improved API server web console looks
- Upgraded to `datafusion v46` (#1146)
- Dependency graph updates are improved for transactional correctness
- Extracted ODF dataset builders for LFS and S3 to allow for custom implementations.
- Flow progress notifier is now more resilient to deleted datasets
- Use actual `base_url` in catalog configuration instead of the default one
- REST API: `GET /datasets/{id}` returns account data as it should
- If dataset creation is interrupted before a dataset entry is written, such a dataset is ignored and may be overwritten
- Flow agent throttling value is now presented in seconds
- Prometheus metrics: S3 (kamu CLI `0.226.5`)
- New `FlowSystemConfig` structure in `CLIConfig`, which allows configuring the `flow_agent` and `task_agent` services with the options `awaiting_step_secs` and `mandatory_throttling_period_secs`
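A minimal sketch of such a configuration, assuming camelCase YAML keys derived from the option names in this entry; the exact casing and nesting in `CLIConfig` may differ:

```yaml
# Hypothetical sketch; verify against the actual CLIConfig schema.
flowSystem:
  flowAgent:
    awaitingStepSecs: 1                 # polling step while awaiting work
    mandatoryThrottlingPeriodSecs: 60   # minimum delay between flow runs
  taskAgent:
    awaitingStepSecs: 1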
- GQL: New natural language search API - to use this feature you'll need to configure the OpenAI API key and a Qdrant vector database connection
- Fixed regression with substitution of an incorrect `ServerUrlConfig` component
- Upgrade kamu-cli version to `0.226.4`
- OData API: fixed crash when accessing a private dataset (kamu CLI `0.225.0`)
- Restored OData API tolerance to trailing slashes (kamu CLI `0.223.1`)
- Added access token notifier registration
- New access token notifier that sends an email to the user each time a new access token is created (kamu CLI `0.222.0`)
- Upgrade kamu-cli version to `0.223.0`
- Upgraded to `datafusion v45`
- Upgraded to `axum v0.8`
- Private Datasets: hot-fixes (kamu CLI `0.221.1`)
- Integration of email gateway (Postmark):
  - Defined `EmailSender` crate and prototyped a `Postmark`-based implementation
  - Config support for gateway settings (API key, sender address & name)
  - Applied the `askama` templating engine and defined a base HTML-rich template for all emails
  - Emails are fire-and-forget / best effort
  - First emails implemented: account registration, flow failed
- GQL support to query and update email on the currently logged-in account
- Emails are mandatory for Kamu accounts now:
  - predefined users need to specify an email in config
  - predefined users are auto-synced at startup in case they existed before
  - GitHub users are queried for their primary verified email, even if it is not public
  - migration code for existing users in the database
- Corrected access rights checks for transport protocols (SiTP, SmTP)
- Core changes from the Private Datasets epic (kamu CLI `0.220.0`), Vol. 2
- Toolchain updated to `nightly-2024-12-26`
- Core changes from the Private Datasets epic (kamu CLI `0.219.1`)
- Telemetry-driven fixes in flow listings (kamu CLI `0.217.3`)
- Batched loading of flows and tasks (kamu CLI `0.217.2`)
- Extend database config (kamu CLI `0.217.1`)
- Add missing `RemoteStatusServiceImpl` service to catalog
- Env var and flow API changes (kamu CLI `0.217.0`)
- Flight SQL authentication (see kamu-data/kamu-cli#1012)
- `/verify` endpoint hot fix (kamu CLI `0.215.1`)
- Flow configuration separation (kamu CLI `0.215.0`)
- Improved FlightSQL session state management (kamu CLI `0.214.0`)
- Regression in FlightSQL interface related to database-backed `QueryService`
- Less aggressive telemetry for key dataset services, like ingestion (kamu CLI `0.213.1`)
- Eliminated regression crash on metadata queries
- Upgrade kamu-cli version to `0.213.0`
- Upgrade to `datafusion v43`
- Upgrade to `alloy v0.6`
- Planners and executors in key dataset manipulation services
- Environment variables are automatically deleted if the dataset they refer to is deleted.
- Upgrade kamu-cli version to `0.211.0`:
  - Dataset dependency graph is now backed with a database, removing the need for dependency scanning at startup
- Upgrade kamu-cli version to `0.210.0`:
  - Improved OpenAPI integration
  - Replaced Swagger with Scalar for presenting the OpenAPI spec
  - `kamu-api-server`: error if specialized config is not found
  - Separated runtime and dynamic UI configuration (such as feature flags)
- Upgrade kamu-cli version to `0.208.1` (minor updates in data image)
- Introduced `DatasetRegistry` abstraction, encapsulating listing and resolution of datasets (kamu-cli version `0.208.0`):
  - Registry is backed by database-stored dataset entries, which are automatically maintained
  - Scope for `DatasetRepository` is now limited to supporting `DatasetRegistry` and the in-memory dataset dependency graph
  - New concept of `ResolvedDataset`: a wrapper around `Arc<dyn Dataset>`, aware of dataset identity
  - Query and Dataset Search functions now consider only the datasets accessible to the current user
  - Core services now explicitly separate planning (transactional) and execution (non-transactional) processing phases
  - Similar decomposition introduced in task system execution logic
  - Batched form for dataset authorization checks
  - Ensured correct transactionality for dataset lookup and authorization checks all over the code base
  - Passing multi/single tenancy as an enum configuration instead of a boolean
  - Renamed outbox "durability" term to "delivery mechanism" to clarify the design intent
- Upgrade kamu-cli version to `0.207.3` (Outbox versions)
- Upgrade kamu-cli version to `0.207.1`
- Correct image version
- Upgrade kamu-cli version to `0.207.0`
- Upgrade kamu-cli version to `0.206.5`
- Upgrade kamu-cli version to `0.206.3`:
  - GraphQL: Removed deprecated `JSON_LD` in favor of `ND_JSON` in `DataBatchFormat`
  - GraphQL: In `DataBatchFormat`, introduced the `JSON_AOS` format to replace the now-deprecated `JSON`, in an effort to harmonize format names with the REST API
  - GraphQL: Fixed invalid JSON encoding in `PARQUET_JSON` schema format when column names contain special characters
- Improved telemetry for the dataset entry indexing process
- Corrected recent migration related to outbox consumption of old dataset events
- Upgrade kamu-cli version to `0.206.1`:
  - `DatasetEntryIndexer`: guarantee startup after `OutboxExecutor` for a more predictable initialization
  - Add `DatasetEntry`'s re-indexing migration
- Introduced OpenAPI spec generation:
  - `/openapi.json` endpoint now returns the generated spec
  - `/swagger` endpoint serves an embedded Swagger UI for viewing the spec directly in the running server
  - OpenAPI schema is available in the repo at `resources/openapi.json`, beside its multi-tenant version
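For example, the generated spec can be inspected from a running server like this (host and port are illustrative):

```bash
# Fetch the generated OpenAPI spec and print its info section
curl -s http://localhost:8080/openapi.json | jq '.info'
# The embedded Swagger UI is served at http://localhost:8080/swagger
```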
- Added endpoint to read a recently uploaded file (`GET /platform/file/upload/{upload_token}`)
- Upgrade kamu-cli version to `0.205.0`:
  - Simplified organization of startup initialization code over different components
  - Postgres implementation for dataset entry and account Re-BAC repositories
  - `DatasetEntry` integration that will allow us to build dataset indexing
  - Added REST API endpoints:
    - `GET /info`
    - `GET /accounts/me`
    - `GET /datasets/:id`
- Upgrade kamu-cli version to `0.203.1`:
  - Added database migration & scripting to create an application user with restricted permissions
  - Support `List` and `Struct` arrow types in `json` and `json-aoa` encodings
- Upgrade kamu-cli version to `0.202.0`:
  - Major dependency upgrades:
    - DataFusion 42
    - HTTP stack v1
    - Axum 0.7
    - latest AWS SDK
    - latest versions of all remaining libs we depend on
  - Outbox refactoring towards true parallelism via Tokio spawned tasks instead of futures
- Re-enabled missing optional features for eth, ftp, mqtt ingest and JSON SQL extensions
- Failed flows should still propagate `finishedAt` time
- Eliminated `span.enter`, replaced with instrumentation everywhere
- REST API: New `/verify` endpoint allows verification of query commitment
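A hedged sketch of how verification might be exercised, assuming the `/query` endpoint can return a proof-carrying response that is then submitted back for verification; the exact request fields are assumptions, not the documented API:

```bash
# Request a query result with a proof attached (field names are illustrative)
curl -s -X POST http://localhost:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT 1", "include": ["proof"]}' > response.json

# Submit the proof-carrying response for verification
curl -s -X POST http://localhost:8080/verify \
  -H 'Content-Type: application/json' \
  -d @response.json
```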
- Upgrade kamu-cli version to `0.201.0`:
  - Outbox main loop was revised to minimize the number of transactions
  - Detecting concurrent modifications in flow and task event stores
  - Improved and cleaned handling of flow abortions at different stages of processing
  - Revised implementation of flow scheduling to avoid an in-memory time wheel
- Added application name prefix to Prometheus metrics
- API Server now exposes Prometheus metrics
- FlightSQL tracing
- Oracle Provider Prometheus metrics names changed to conform to the convention
- Oracle Provider: Updated to use the V2 `/query` REST API
- Oracle Provider: Added ability to scan back only a certain interval of past blocks
- Oracle Provider: Added ability to ignore requests by ID and from certain consumers
- Identity config registration bug that prevented response signing from working
- REST API: The `/query` endpoint now supports response proofs via reproducibility and signing (#816)
- REST API: New `/{dataset}/metadata` endpoint for retrieving schema, description, attachments, etc. (#816)
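For illustration, fetching metadata for a dataset might look like this; the host, port, and dataset alias (reusing the `account/covid.cases` example from elsewhere in this changelog) are assumptions:

```bash
# Retrieve dataset metadata (schema, description, attachments, ...)
curl -s "http://localhost:8080/account/covid.cases/metadata"
```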
- Upgrade kamu-cli version to `0.199.2`
- Hot fixes in persistent Tasks & Flows
- Upgrade kamu-cli version to `0.199.1`
- Persistent Tasks & Flows
- Database schema breaking changes
- Get Data Panel: use SmTP for pull & push links
- GQL API method `setConfigCompaction` allows setting `metadataOnly` configuration for both root and derived datasets
- GQL API `triggerFlow` allows triggering a `HARD_COMPACTION` flow in `metadataOnly` mode for both root and derived datasets
- Critical errors were not logged due to the logging guard being destroyed before the call to tracing
- Upgrade kamu-cli version to `0.198.2`
- ReBAC: in-memory & SQLite components
- Smart Transfer Protocol: breaking changes
- Upgrade kamu-cli version to `0.198.0` (address RUSTSEC-2024-0363)
- Add missing `ResetService` dependency
- Upgrade kamu-cli version to `0.197.0`
- Missing initialization issue for outbox processor
- Upgrade kamu-cli version to 0.195.1 (DataFusion 41, Messaging outbox)
- Upgrade kamu-cli version to 0.194.0 and add `DatasetKeyValueSysEnv` service if an encryption key was not provided
- Upgrade kamu-cli version to 0.191.5 and add initialization of the new `DatasetKeyValueService` in catalog
- Exposed new `engine`, `source`, and `protocol` sections in the `api-server` config (#109)
- Dropped "bunyan" log format in favor of standard `tracing` JSON logs (#106)
- The `oracle-provider` now exposes Prometheus metrics via the `/system/metrics` endpoint (#106)
- All apps now support exporting traces via the Open Telemetry protocol (#106)
- The `api-server` now supports graceful shutdown (#106)
- All apps now support the `/system/health?type={liveness,readiness,startup}` health check endpoint using Kubernetes probe semantics (#106)
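Probing a running instance could look like this (host and port are illustrative; the endpoint and probe types come from the entry above):

```bash
# Kubernetes-style readiness probe; also accepts type=liveness or type=startup
curl -s "http://localhost:8080/system/health?type=readiness"
```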
- Make dataset env vars encryption key optional
- Upgraded to kamu `0.191.4`
- Upgraded to kamu `0.191.3`
- Integrated the `DatasetEnvVars` service that allows configuring custom variables and secrets to be used during data ingestion
- Upgraded to a new `rustc` version and some dependencies
- Upgraded to kamu `0.191.2`
- Regression where the oracle provider stopped respecting the `block_stride` config
- Upgraded to kamu `0.189.7`, which fixes the operation of SmTP along with database transactions
- Integrating modes of RDS password access
- Upgraded to kamu `0.188.3`, which fixes the file ingestion feature
- Upgraded to kamu `0.188.1`, which includes a fix for transactions getting "stuck" in data queries
- Fixed invalid REST response decoding by `oracle-provider`
- Fixed invalid REST request encoding by `oracle-provider`
- Upgraded to kamu `0.188.1`
- Improve `oracle-provider`:
  - Dataset identity support
  - SQL errors and missing dataset handling
  - Reproducibility state support
- Upgraded to kamu `0.188.0`
- Oracle provider was migrated from the deprecated `ethers` to the `alloy` crate
- Upgraded to kamu `0.186.0`
- Upgraded `kamu` from `0.181.1` to `0.185.1` (changelog)
- Hotfix: upgrade to Kamu CLI v0.181.1 (dealing with unresolved accounts)
- HTTP API: add `/platform/login` handler to enable GitHub authorization inside Jupyter Notebook
- Fix startup: correct config parameter name (`jwt_token` -> `jwt_secret`)
- Upgraded `kamu` from `0.177.0` to `0.180.0` (changelog)
- Read settings from config file, absorbing:
  - the `--repo-url` CLI argument
  - environment variables used for configuration
- Introduced new `kamu-oracle-provider` component, which can fulfil data requests from any EVM-compatible blockchain, working in conjunction with `OdfOracle` contracts defined in the `kamu-contracts` repository
- Upgraded `kamu` from `0.176.3` to `0.177.0` (changelog)
- CI improvements:
  - use `cargo-udeps` to prevent unused dependencies from creeping in
  - use `cargo-binstall` to speed up CI jobs
- Missing compacting service dependency
- Synchronized with latest `kamu-cli` `v0.176.3`
- Fixed startup failure caused by a missing DI dependency
- The `/ingest` REST API endpoint also supports event time hints via the `odf-event-time` header
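A sketch of pushing records with an event time hint; the header name comes from the entry above, while the dataset path, port, and payload format are assumptions:

```bash
# Push ingest with an explicit event time hint for records lacking one
curl -s -X POST "http://localhost:8080/account/covid.cases/ingest" \
  -H 'odf-event-time: 2024-01-01T00:00:00Z' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @records.ndjson
```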
- Removed `paused` from `setConfigCompacting` mutation
- Extended GraphQL `FlowDescriptionDatasetHardCompacting` empty result with a resulting message
- GraphQL Dataset Endpoints object: fixed the query endpoint
- OData API now supports querying by collection ID/key (e.g. `account/covid.cases(123)`)
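Using the example from the entry above, addressing a single entity by key might look like this; the OData service root and port are assumptions:

```bash
# Fetch one entity from the collection by its key
curl -s "http://localhost:8080/odata/account/covid.cases(123)"
```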
- Fixed all pedantic lint warnings
- Fixed CI build
- Updated to `kamu v0.171.2` to correct the CLI push command in the Data access panel
- Updated to `kamu v0.171.1` to correct the web link in the Data access panel
- Updated to `kamu v0.171.0` to put in place endpoints for the Data Access panel
- Enable local FS object store for push ingest to work
- Made number of runtime threads configurable
- Incorporate FlightSQL performance fixes in `kamu v0.168.0`
- Incorporate FlightSQL location bugfix in `kamu-adapter-flight-sql v0.167.2`
- Incorporate dataset creation handle bugfix in `kamu-core v0.167.1`
- Changed config env var prefix to `KAMU_API_SERVER_CONFIG_` to avoid collisions with Kubernetes automatic variables
- Support for metadata object caching on local file system (e.g. to avoid too many calls to S3 repo)
- Support for caching the list of datasets in a remote repo (e.g. to avoid very expensive S3 bucket prefix listing calls)
- OData adapter will ignore fields with unsupported data types instead of crashing
- Experimental support for OData protocol
- Updated to `kamu v0.165.0` to bring in the latest flow system demo version
- Updated to `kamu v0.164.0` to bring in new REST data endpoints
- Introduced a `ghcr.io/kamu-data/kamu-api-server:latest-with-data-mt` image with a multi-tenant workspace
- Updated to `kamu v0.162.1` to bring in more verbose logging of the JWT token rejection reason
- Startup crash in Flow Service, which started to require an admin token to operate
- Updated to `kamu v0.162.0`
- Upgraded Rust toolchain and minor dependencies
- Synced with `kamu` `v0.158.0`
- Upgraded to major changes in ODF and `kamu`
- Push ingest API
- Introduced a config file that allows configuring the list of supported auth providers
- FlightSQL endpoint
- Integrated multi-tenancy support: authentication & authorization for public datasets
- Keeping a CHANGELOG
- Integrated latest core with engine I/O strategies - this allows `api-server` to run ingest/transform tasks for datasets located in S3 (currently by downloading necessary inputs locally)