
feature: added support as source for ducklake #295

Merged: nbesimi merged 4 commits into develop from feature/ducklake-integration on Feb 24, 2026

Conversation

Femi3211 (Collaborator) commented Jan 30, 2026

Summary by CodeRabbit

  • New Features

    • Added DuckLake database source support with new configurable connection fields (database path, data path, metadata DB, and S3 credentials).
    • Environment-variable substitution now applied for DuckLake and S3 connection settings.
  • Improvements

    • Schema extraction and validation for DuckLake with safe fallbacks, dynamic extractor loading, and optimized selection path for faster initialization.
  • Tests

  • Two integration test classes are now skipped during test runs.

Femi3211 requested a review from nbesimi on January 30, 2026 at 14:18
coderabbitai bot commented Jan 30, 2026

📝 Walkthrough


Adds DuckLake support: six new Connection fields (three DuckLake paths, three S3 creds), config-time substitutions for those fields, a new DuckLakeGenerator that uses DuckDB+DuckLake to extract metadata, factory routing to select DuckLakeGenerator when dbType is "ducklake", and two integration test classes are suppressed.

Changes

  • Connection model (common/src/main/java/com/adaptivescale/rosetta/common/models/input/Connection.java): Added DuckLake fields duckdbDatabasePath, ducklakeDataPath, ducklakeMetadataDb and S3 credential fields s3Region, s3AccessKeyId, s3SecretAccessKey, with getters and setters.
  • Config substitution (cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java): Applies StringSubstitutor-based substitutions to the DuckLake and S3 credential fields during config processing when they are non-null.
  • DuckLake generator (source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java): New DuckLakeGenerator that validates the config, builds the DuckDB JDBC URL and session, installs and loads the DuckLake extension, attaches and configures the catalog (DATA_PATH, METADATA_DB), applies S3 credentials if present, dynamically loads extractors with fallbacks, falls back to information_schema on extraction errors, filters internal metadata tables, and assembles and returns a Database; includes generate and validate.
  • Factory routing (source/src/main/java/com/adataptivescale/rosetta/source/core/SourceGeneratorFactory.java): Detects dbType == "ducklake" (case-insensitive) and returns a DuckLakeGenerator early, bypassing standard extractor initialization.
  • Test suppression (cli/src/test/java/integration/DB2IntegrationTest.java, cli/src/test/java/integration/RedshiftDDLIntegrationTest.java): Added class-level @Ignore and @Disabled annotations to skip these integration tests; the Redshift test also removes two timezone columns from a DDL string.
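The Connection changes above can be pictured with a minimal stand-in class. This is a hypothetical sketch, not the project's real model (which lives in common/src/main/java/com/adaptivescale/rosetta/common/models/input/Connection.java); it only mirrors the fields and accessor pattern the summary lists.

```java
// Hypothetical stand-in for the extended Connection model; field names match
// the PR summary, everything else is illustrative.
public class ConnectionSketch {
    private String dbType;
    // New DuckLake path fields
    private String duckdbDatabasePath;   // local DuckDB database file
    private String ducklakeDataPath;     // where DuckLake stores data files (may be s3://...)
    private String ducklakeMetadataDb;   // DuckLake metadata database
    // New S3 credential fields
    private String s3Region;
    private String s3AccessKeyId;
    private String s3SecretAccessKey;

    public String getDbType() { return dbType; }
    public void setDbType(String v) { dbType = v; }
    public String getDucklakeDataPath() { return ducklakeDataPath; }
    public void setDucklakeDataPath(String v) { ducklakeDataPath = v; }
    // remaining getters/setters follow the same pattern

    public static void main(String[] args) {
        ConnectionSketch c = new ConnectionSketch();
        c.setDbType("ducklake");
        c.setDucklakeDataPath("s3://my-bucket/lake/");
        System.out.println(c.getDbType() + " " + c.getDucklakeDataPath());
    }
}
```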

Sequence Diagram

sequenceDiagram
    participant Config as Configuration
    participant Factory as SourceGeneratorFactory
    participant DLGen as DuckLakeGenerator
    participant JDBC as DuckDB JDBC
    participant DL as DuckLake Catalog
    participant S3 as S3 (optional)
    participant Extractors as Table/View/Column Extractors
    participant DB as Database Object

    Config->>Factory: sourceGenerator(connection, driverProvider)
    Factory->>Factory: if dbType == "ducklake"
    Factory->>DLGen: instantiate DuckLakeGenerator
    Factory->>DLGen: generate(connection)

    DLGen->>DLGen: validate ducklakeDataPath & metadata path
    DLGen->>JDBC: open DuckDB JDBC connection
    DLGen->>JDBC: install/load DuckLake
    DLGen->>DL: attach catalog (DATA_PATH, METADATA_DB)
    DLGen->>S3: apply S3 credentials if configured
    DLGen->>DL: select catalog/schema

    DLGen->>Extractors: load table/view/column extractors (with fallbacks)
    DLGen->>JDBC: query DuckLake metadata for tables (fallback to information_schema)
    DLGen->>JDBC: query views and columns
    DLGen->>DLGen: filter internal DuckLake metadata tables

    DLGen->>DB: assemble Database(name, tables, views, type)
    DLGen->>Factory: return Database
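The routing step at the top of the diagram can be sketched as below. Class and method names here are illustrative stand-ins, not the project's exact API; the real check lives in SourceGeneratorFactory.

```java
// Hypothetical sketch of the factory short-circuit for DuckLake.
public class SourceGeneratorFactorySketch {
    public static String selectGenerator(String dbType) {
        if ("ducklake".equalsIgnoreCase(dbType)) {
            // Return the DuckLake-specific generator early,
            // bypassing standard extractor initialization.
            return "DuckLakeGenerator";
        }
        return "DefaultGenerator";
    }

    public static void main(String[] args) {
        System.out.println(selectGenerator("DuckLake"));  // case-insensitive match
        System.out.println(selectGenerator("postgres"));
    }
}
```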

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through configs with a curious grin,
DuckLake paths tucked in where the datasets begin,
Catalogs attach, S3 secrets snug and neat,
Extractors hum — tables and columns meet!
A rabbit cheers this metadata treat.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 9.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "feature: added support as source for ducklake" directly and clearly summarizes the main change: adding DuckLake as a data source, which is reflected throughout the changeset (Connection model extensions, the DuckLakeGenerator implementation, and integration in SourceGeneratorFactory).





coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 243-248: The SQL string in DuckLakeGenerator directly interpolates
parquetFile.getAbsolutePath() into createTableSql which allows file path
injection via single quotes; sanitize the path before interpolation by escaping
single quotes (e.g., replace "'" with "''") and then use the escaped value when
building createTableSql (references: DuckLakeGenerator, variables parquetFile,
tableName, createStmt, createTableSql) so the generated CREATE TABLE ...
read_parquet('...') literal cannot break SQL syntax.
- Around line 300-344: The executeDebugSQL method currently logs full SQL and
result rows via log.info; change it to avoid leaking sensitive data by (1)
updating the method signature of executeDebugSQL (and any callers) to accept a
boolean flag like enableResultLogging (or use an existing debug flag on
Connection), (2) change the SQL statement log to a non-default level (TRACE or
DEBUG) using log.trace(...) and only emit it when tracing is enabled, and (3)
wrap the result-row logging block (the loop that builds rows and the log.info
calls) in a conditional that checks enableResultLogging before logging row
contents; keep setupDuckLake/connect logic but ensure no sensitive output is
logged when the flag is false (optionally move this utility to a test/debug-only
module if desired).
- Around line 190-208: The ATTACH/USE SQL builds raw SQL from user inputs
(catalogName, rosettaConnection.getDucklakeDataPath()) in DuckLakeGenerator
which allows SQL injection; validate and sanitize these inputs before use:
enforce a strict regex (e.g. only [A-Za-z0-9_]+) for catalogName and reject or
normalize any value that doesn't match, and validate ducklakeDataPath to ensure
it is a safe filesystem path (no semicolons, quotes, or SQL metacharacters) or
escape/quote it correctly; then update the attachSql construction and the
useStmt.execute call (the String.format building attachSql and the "USE " +
catalogName call) to use the validated/escaped values only, failing fast with a
clear exception if validation fails.
- Around line 353-356: The SQL built in importCsvToDuckLake interpolates
tableName and csvFilePath directly causing SQL injection; update
importCsvToDuckLake to validate and safely quote/escape inputs: validate
tableName against a strict identifier regex (e.g. [A-Za-z_][A-Za-z0-9_]* ) and
reject or sanitize otherwise, quote the identifier properly (escape any internal
quotes) rather than raw string concat, and treat csvFilePath as a data parameter
or at minimum escape/quote it (escape single quotes) before inserting into the
SQL string; ensure similar protection for catalogName if it can be controlled
externally and use prepared statements or the DB driver's identifier-quoting
helper where available.
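The three injection findings above share the same two remedies: escape single quotes inside SQL string literals, and restrict identifiers to a strict pattern before concatenating them. A minimal sketch of both, with illustrative helper names (not the project's actual methods):

```java
// Hypothetical helpers for the fixes suggested above.
public class SqlSanitizeSketch {
    // Escape single quotes so a path/value can sit safely inside a SQL string literal.
    public static String escapeSqlLiteral(String raw) {
        return raw.replace("'", "''");
    }

    // Accept only plain identifiers (letters, digits, underscore); fail fast otherwise.
    public static String requireIdentifier(String name) {
        if (name == null || !name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Invalid SQL identifier: " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        String path = "/tmp/o'brien/data.parquet";
        String sql = "CREATE TABLE " + requireIdentifier("my_table")
                + " AS SELECT * FROM read_parquet('" + escapeSqlLiteral(path) + "')";
        System.out.println(sql);
    }
}
```

Prepared statements cannot parameterize identifiers or ATTACH options in most drivers, which is why validation plus escaping is the usual fallback here.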
🧹 Nitpick comments (4)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (4)

60-65: Unchecked casts could mask type mismatches.

The casts on lines 60 and 65 assume that tableExtractor.extract() and viewExtractor.extract() return Collection<Table> and Collection<View> respectively. While this aligns with the expected interface contracts, consider using generics on the TableExtractor and ViewExtractor interfaces to make this type-safe, or add explicit type checking.


104-123: Consider pattern-based filtering for metadata tables.

The hardcoded list of DuckLake metadata tables may become stale as DuckLake evolves. Consider using a prefix-based filter (e.g., tables starting with ducklake_) which would be more maintainable.

♻️ Alternative: Prefix-based filtering
     private Collection<Table> filterDuckLakeMetadataTables(Collection<Table> allTables) {
-        Set<String> metadataTableNames = Set.of(
-            "ducklake_column", "ducklake_column_tag", ...
-        );
-
         Collection<Table> userTables = new ArrayList<>();
         for (Table table : allTables) {
-            if (!metadataTableNames.contains(table.getName())) {
+            if (!table.getName().startsWith("ducklake_")) {
                 userTables.add(table);
             }
         }
         return userTables;
     }

43-77: Consider try-with-resources for cleaner resource management.

The current try-finally pattern works but could be simplified using try-with-resources for java.sql.Connection.

♻️ Refactor to try-with-resources
         Driver driver = driverProvider.getDriver(tempConnection);
         Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
-        java.sql.Connection connect = driver.connect(duckdbUrl, properties);
-
-        try {
+        try (java.sql.Connection connect = driver.connect(duckdbUrl, properties)) {
             String attachedCatalogAlias = setupDuckLake(connect, connection);
             // ... rest of method ...
             return database;
-        } finally {
-            connect.close();
         }

80-102: Code duplication with generate() method.

The validate() method duplicates the connection setup and DuckLake initialization logic from generate(). Consider extracting the common setup into a private helper method to improve maintainability.

♻️ Extract common setup logic
private java.sql.Connection createDuckLakeConnection(Connection connection) throws Exception {
    if (connection.getDucklakeDataPath() == null || connection.getDucklakeDataPath().trim().isEmpty()) {
        throw new IllegalArgumentException("ducklakeDataPath is required for DuckLake connections");
    }
    
    String duckdbUrl = buildDuckDbUrl(connection);
    Connection tempConnection = new Connection();
    tempConnection.setUrl(duckdbUrl);
    tempConnection.setDbType("duckdb");
    Driver driver = driverProvider.getDriver(tempConnection);
    Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
    return driver.connect(duckdbUrl, properties);
}

Then use in both generate() and validate().

coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 330-361: The public static helper executeDebugSQL in
DuckLakeGenerator exposes an arbitrary-SQL execution surface; remove or relocate
it out of production code by either (1) moving the method into a test utility
class (e.g., create a DuckLakeTestUtils and place executeDebugSQL there for
test-only use), or (2) eliminate it entirely if not required, or (3) at minimum
reduce its visibility (make it private/package-private) and annotate with
`@VisibleForTesting` so it cannot be called from production code; update any
callers to reference the new test utility or remove them accordingly and ensure
the DuckLakeGenerator class no longer exposes executeDebugSQL publicly.
- Around line 115-126: openDuckDbConnection creates a new Connection (temp) but
never copies authentication from the original, so JDBCUtils.setJDBCAuth(temp)
yields empty creds; fix by propagating credentials from the original Connection
into temp (or bypass temp and build Properties from original) before calling
JDBCUtils.setJDBCAuth. Update openDuckDbConnection to copy userName and password
(and any other auth fields) from the original parameter to temp prior to
driverProvider.getDriver / JDBCUtils.setJDBCAuth so the resulting Properties
include the source credentials.
- Around line 248-265: listTablesFallback creates Table instances with only the
name set, causing downstream column extraction to fail because table.getSchema()
and table.getType() are null; update listTablesFallback (the method) to populate
the Table.schema and Table.type fields as you build each Table: set schema from
the query parameter or from the result set column (e.g., result column
"table_schema") and set type to the table_type value (or hardcode "BASE
TABLE"/use rs.getString("table_type")), so that ColumnsExtractor.extract() can
call getMetaData().getColumns(databaseName, table.getSchema(), ...) correctly.
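The fallback-table finding above can be addressed roughly as follows. Table here is a minimal stand-in for the project's model class, and the string-array rows stand in for a JDBC ResultSet; this is a sketch of the shape of the fix, not the real method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of listTablesFallback populating schema and type.
public class TableFallbackSketch {
    public static class Table {
        public String name, schema, type;
    }

    // Simulates building Table rows from an information_schema-style result:
    // each row is {table_name, table_schema, table_type}.
    public static List<Table> listTablesFallback(String[][] rows) {
        List<Table> out = new ArrayList<>();
        for (String[] row : rows) {
            Table t = new Table();
            t.name = row[0];
            t.schema = row[1];                                // was previously left null
            t.type = row[2] != null ? row[2] : "BASE TABLE";  // was previously left null
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Table> tables = listTablesFallback(new String[][] { {"orders", "main", null} });
        System.out.println(tables.get(0).schema + " " + tables.get(0).type);
    }
}
```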
🧹 Nitpick comments (3)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (3)

288-328: Three near-identical extractor loader methods — consider extracting a generic helper.

loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor share the same structure: call ModuleLoader, check isEmpty, reflectively instantiate, catch the same exceptions. A single generic helper (parameterized on the extractor type) would eliminate the duplication.


267-286: Hardcoded metadata table list is fragile — consider prefix-based filtering.

If DuckLake adds new internal tables in future versions, this static set will miss them. A prefix-based filter (name.startsWith("ducklake_")) would be more resilient, assuming user tables don't use that prefix.


182-184: Silently catching all exceptions from INSTALL ducklake could mask real failures.

While the subsequent LOAD ducklake would likely fail if installation truly failed, consider at least logging the caught exception at DEBUG level for troubleshooting.

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (4)

168-207: Silent swallow of INSTALL ducklake failure may hide real errors.

Line 178 catches and ignores all SQLException from INSTALL ducklake. This is presumably to handle the "already installed" case, but it also hides genuine failures (e.g., network errors downloading the extension, permission issues). Consider checking the error message — similar to how line 198 checks for "already exists" — to only ignore expected errors.

Proposed fix
-            try { stmt.execute("INSTALL ducklake"); } catch (SQLException ignored) {}
+            try {
+                stmt.execute("INSTALL ducklake");
+            } catch (SQLException e) {
+                String msg = e.getMessage() == null ? "" : e.getMessage();
+                if (!msg.contains("already installed") && !msg.contains("is already installed")) {
+                    log.warn("INSTALL ducklake failed (non-fatal): {}", msg);
+                }
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 168 - 207, The INSTALL ducklake call in setupDuckLake currently
swallows all SQLExceptions; update the try/catch around stmt.execute("INSTALL
ducklake") to inspect the SQLException message (similar to the attach block) and
only ignore known benign messages (e.g., "already installed", "extension already
exists" or whatever DuckLake returns), but rethrow for other errors; locate the
stmt.execute("INSTALL ducklake") invocation inside setupDuckLake and replace the
broad catch with logic that checks e.getMessage() and suppresses only expected
"already installed" text, otherwise rethrow the exception.

304-344: Three nearly identical extractor-loading methods can be consolidated.

loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor share the same ModuleLoader → fallback → reflective-instantiation pattern. A generic helper would eliminate the duplication.

♻️ Sketch of a generic helper
@SuppressWarnings("unchecked")
private <T> T loadExtractor(Connection connection, String pkg,
                            RosettaModuleTypes moduleType,
                            java.util.function.Supplier<T> fallback,
                            Class<?>... ctorArgTypes) {
    Optional<Class<?>> mod = ModuleLoader.loadModuleByAnnotationClassValues(
            pkg, moduleType, connection.getDbType());
    if (mod.isEmpty()) {
        log.warn("{} extractor not found for DuckDB, falling back to default.", moduleType);
        return fallback.get();
    }
    try {
        if (ctorArgTypes.length == 0) {
            return (T) mod.get().getDeclaredConstructor().newInstance();
        }
        return (T) mod.get().getDeclaredConstructor(ctorArgTypes).newInstance(connection);
    } catch (ReflectiveOperationException e) {
        throw new RuntimeException("Failed to instantiate " + moduleType + " extractor", e);
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 304 - 344, Consolidate the duplicated reflection/fallback logic in
loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor
into a single generic helper (e.g., loadExtractor) that takes Connection,
package name, RosettaModuleTypes, a Supplier fallback, and optional ctor arg
types/values; have the helper call
ModuleLoader.loadModuleByAnnotationClassValues, log the fallback with
moduleType, and perform reflective instantiation (no-arg or with Connection)
catching ReflectiveOperationException and rethrowing a RuntimeException; then
replace the three methods to call this helper (passing
DefaultTablesExtractor::new, DefaultViewExtractor::new, and () -> new
ColumnsExtractor(connection) as fallbacks and appropriate ctor arg types) so all
reflection/fallback logic is centralized.

183-185: Same silent-swallow pattern for INSTALL httpfs.

Same concern as above for INSTALL ducklake. Line 184 silently catches all SQLException from INSTALL httpfs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 183 - 185, The try/catch around stmt.execute("INSTALL httpfs") in
DuckLakeGenerator where dataPath.startsWith("s3://") is currently swallowing
SQLException; replace the empty catch with proper handling: catch the
SQLException and log the error (include the exception message/stacktrace) using
the class logger or error reporting used elsewhere in this class, and keep
execution flow consistent with how INSTALL ducklake is handled so the failure is
visible for debugging while allowing the subsequent stmt.execute("LOAD httpfs")
behavior to proceed or fail explicitly.

238-261: Reliance on DuckLake internal metadata schema (__ducklake_metadata_<catalog>) is fragile.

The metadata catalog naming convention __ducklake_metadata_<catalog> and the internal table names (ducklake_table, ducklake_schema) are implementation details of DuckLake. If these change in a future DuckLake version, this query will silently fail (or throw). The fallback paths mitigate this partially, but only if the exception-handling gap above is fixed.

Consider adding a comment documenting which DuckLake version this was tested against, so future maintainers know when to verify compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 238 - 261, The code in listTablesFromDuckLakeMetadata relies on the
internal metadata catalog name constructed in metadataCatalog and on internal
tables ducklake_table and ducklake_schema which are fragile across DuckLake
versions; update the method by (1) adding a clear comment above
listTablesFromDuckLakeMetadata stating the DuckLake version(s) this logic was
tested against and that these internal names are implementation details, and (2)
wrap the metadata-query block (the PreparedStatement/ResultSet usage that
queries ducklake_table and ducklake_schema) in a safe SQLException catch that
logs a clear warning including the exception and falls back gracefully (so
callers can try alternate discovery paths) instead of letting the exception
propagate silently. Ensure the comment and the try/catch reference the function
name listTablesFromDuckLakeMetadata and the
metadataCatalog/ducklake_table/ducklake_schema identifiers for easy location.
cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java (1)

42-64: Misleading comment: this is config-parameter cross-referencing, not environment variable substitution.

Environment variables are already resolved in processEnvParameters (line 29–33). The StringSubstitutor on line 39 is built from the connection's own field map, so this block performs intra-connection field substitution (e.g., ${databaseName} inside ducklakeDataPath). The comment on line 43 should be updated to avoid confusion.

Also, the repeated null-check-substitute-set pattern is getting verbose. Consider extracting a small helper to reduce duplication.

♻️ Suggested refactor
+    private static String substituteIfPresent(StringSubstitutor sub, String value) {
+        return value != null ? sub.replace(value) : null;
+    }
+
     private Config processConfigParameters(String configContent) throws IOException {
         Config config = new ObjectMapper(new YAMLFactory()).readValue(configContent, Config.class);
         for (Connection connection : config.getConnections()) {
             Map<String, String> configParameters = connection.toMap();
             StringSubstitutor stringSubstitutor = new StringSubstitutor(configParameters, "${", "}");
             String processedUrl = stringSubstitutor.replace(connection.getUrl());
             connection.setUrl(processedUrl);
 
-            // Process DuckLake-specific fields for environment variable substitution
-            if (connection.getDuckdbDatabasePath() != null) {
-                String processedDuckdbPath = stringSubstitutor.replace(connection.getDuckdbDatabasePath());
-                connection.setDuckdbDatabasePath(processedDuckdbPath);
-            }
-            if (connection.getDucklakeDataPath() != null) {
-                String processedDataPath = stringSubstitutor.replace(connection.getDucklakeDataPath());
-                connection.setDucklakeDataPath(processedDataPath);
-            }
-            if (connection.getDucklakeMetadataDb() != null) {
-                String processedMetadataDb = stringSubstitutor.replace(connection.getDucklakeMetadataDb());
-                connection.setDucklakeMetadataDb(processedMetadataDb);
-            }
-            if (connection.getS3Region() != null) {
-                connection.setS3Region(stringSubstitutor.replace(connection.getS3Region()));
-            }
-            if (connection.getS3AccessKeyId() != null) {
-                connection.setS3AccessKeyId(stringSubstitutor.replace(connection.getS3AccessKeyId()));
-            }
-            if (connection.getS3SecretAccessKey() != null) {
-                connection.setS3SecretAccessKey(stringSubstitutor.replace(connection.getS3SecretAccessKey()));
-            }
+            // Process DuckLake-specific fields for config parameter cross-referencing
+            connection.setDuckdbDatabasePath(substituteIfPresent(stringSubstitutor, connection.getDuckdbDatabasePath()));
+            connection.setDucklakeDataPath(substituteIfPresent(stringSubstitutor, connection.getDucklakeDataPath()));
+            connection.setDucklakeMetadataDb(substituteIfPresent(stringSubstitutor, connection.getDucklakeMetadataDb()));
+            connection.setS3Region(substituteIfPresent(stringSubstitutor, connection.getS3Region()));
+            connection.setS3AccessKeyId(substituteIfPresent(stringSubstitutor, connection.getS3AccessKeyId()));
+            connection.setS3SecretAccessKey(substituteIfPresent(stringSubstitutor, connection.getS3SecretAccessKey()));
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java`
around lines 42 - 64, Update the misleading comment and reduce duplication:
change the comment above the block in ConfigYmlConverter to state that this
performs intra-connection field substitution using the connection's own
StringSubstitutor (not environment-variable resolution which is handled by
processEnvParameters), and refactor the repeated null-check/substitute/set
pattern by extracting a small helper method (e.g.,
applySubstitutionIfNotNull(StringSubstitutor, Function/Getter like
connection::getDucklakeDataPath, BiConsumer/Setter like
connection::setDucklakeDataPath) or a private utility that accepts the value,
performs stringSubstitutor.replace if non-null and returns the replaced value)
and use that helper for fields referenced
(getDuckdbDatabasePath/getDucklakeDataPath/getDucklakeMetadataDb/getS3Region/getS3AccessKeyId/getS3SecretAccessKey)
to eliminate boilerplate while keeping semantics identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 209-222: applyS3Credentials currently injects secrets into SQL
literals via stmt.execute("SET s3_access_key_id=...") and stmt.execute("SET
s3_secret_access_key=..."), which can expose credentials in JDBC/SQL logs;
change the implementation to avoid embedding secrets in SQL: set S3 credentials
on the connection/config level (e.g., use Connection.setClientInfo or pass
s3_access_key_id/s3_secret_access_key as connection properties when creating the
Connection) or populate AWS env vars before creating the Connection instead of
calling stmt.execute with secret strings; update applyS3Credentials to use
rosetta connection-level configuration APIs (replace the stmt.execute lines) and
ensure any SQL statement logging is disabled or masked for these settings.
- Around line 49-60: The call to listTablesFromDuckLakeMetadata in
DuckLakeGenerator can throw and currently bypasses all fallbacks; wrap that
initial call in a try-catch around listTablesFromDuckLakeMetadata(jdbc, catalog,
duckdbConnection.getSchemaName()) and on any Exception treat it the same as an
empty result by logging the exception (use log.warn with the exception) and then
attempt the same fallback sequence: call
tableExtractor.extract(duckdbConnection, jdbc) and if that returns empty call
listTablesFallback(jdbc, catalog, duckdbConnection.getSchemaName()); ensure the
existing catch block behavior and logging remains for errors raised by
tableExtractor.extract as well.
- Around line 246-255: The listTablesFromDuckLakeMetadata path in
DuckLakeGenerator populates Table.name and Table.schema but never sets
Table.type, causing inconsistency with listTablesFallback which calls
t.setType("BASE TABLE"); update listTablesFromDuckLakeMetadata to set the type
on each Table (e.g., t.setType("BASE TABLE") or the appropriate JDBC type
constant) before adding to out so downstream code observing table.getType() sees
the same value as the fallback path.
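The first inline comment above amounts to a try-catch cascade over three discovery paths. A minimal sketch of that control flow, with Suppliers standing in for listTablesFromDuckLakeMetadata, tableExtractor.extract, and listTablesFallback (the real methods take a JDBC connection and return the project's Table model):

```java
import java.util.Collection;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the fallback cascade suggested above.
public class FallbackCascadeSketch {
    public static Collection<String> listTables(Supplier<Collection<String>> metadataQuery,
                                                Supplier<Collection<String>> extractor,
                                                Supplier<Collection<String>> infoSchemaFallback) {
        Collection<String> tables;
        try {
            tables = metadataQuery.get();
        } catch (Exception e) {
            // Treat a metadata-query failure the same as an empty result.
            System.err.println("DuckLake metadata query failed, falling back: " + e.getMessage());
            tables = List.of();
        }
        if (tables.isEmpty()) {
            tables = extractor.get();        // standard extractor path
        }
        if (tables.isEmpty()) {
            tables = infoSchemaFallback.get();  // information_schema fallback
        }
        return tables;
    }

    public static void main(String[] args) {
        Collection<String> result = listTables(
            () -> { throw new RuntimeException("internal schema changed"); },
            List::of,
            () -> List.of("orders"));
        System.out.println(result);
    }
}
```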

---

Duplicate comments:
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 126-136: openDuckDbConnection creates a new Connection (temp) but
never copies userName/password from the original Connection, so
JDBCUtils.setJDBCAuth(temp) drops credentials; update openDuckDbConnection to
copy authentication fields (e.g., temp.setUserName(original.getUserName()) and
temp.setPassword(original.getPassword()) or equivalent getters/setters) before
calling JDBCUtils.setJDBCAuth(temp), mirroring the credential-forwarding logic
used in createDuckDbConnection, so the Properties passed to
driver.connect(duckdbUrl, props) include the source credentials.

---

Nitpick comments:
In `@cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java`:
- Around line 42-64: Update the misleading comment and reduce duplication:
change the comment above the block in ConfigYmlConverter to state that this
performs intra-connection field substitution using the connection's own
StringSubstitutor (not environment-variable resolution which is handled by
processEnvParameters), and refactor the repeated null-check/substitute/set
pattern by extracting a small helper method (e.g.,
applySubstitutionIfNotNull(StringSubstitutor, Function/Getter like
connection::getDucklakeDataPath, BiConsumer/Setter like
connection::setDucklakeDataPath) or a private utility that accepts the value,
performs stringSubstitutor.replace if non-null and returns the replaced value)
and use that helper for fields referenced
(getDuckdbDatabasePath/getDucklakeDataPath/getDucklakeMetadataDb/getS3Region/getS3AccessKeyId/getS3SecretAccessKey)
to eliminate boilerplate while keeping semantics identical.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 168-207: The INSTALL ducklake call in setupDuckLake currently
swallows all SQLExceptions; update the try/catch around stmt.execute("INSTALL
ducklake") to inspect the SQLException message (similar to the attach block) and
only ignore known benign messages (e.g., "already installed", "extension already
exists" or whatever DuckLake returns), but rethrow for other errors; locate the
stmt.execute("INSTALL ducklake") invocation inside setupDuckLake and replace the
broad catch with logic that checks e.getMessage() and suppresses only expected
"already installed" text, otherwise rethrow the exception.
- Around line 304-344: Consolidate the duplicated reflection/fallback logic in
loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor
into a single generic helper (e.g., loadExtractor) that takes Connection,
package name, RosettaModuleTypes, a Supplier fallback, and optional ctor arg
types/values; have the helper call
ModuleLoader.loadModuleByAnnotationClassValues, log the fallback with
moduleType, and perform reflective instantiation (no-arg or with Connection)
catching ReflectiveOperationException and rethrowing a RuntimeException; then
replace the three methods to call this helper (passing
DefaultTablesExtractor::new, DefaultViewExtractor::new, and () -> new
ColumnsExtractor(connection) as fallbacks and appropriate ctor arg types) so all
reflection/fallback logic is centralized.
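A possible shape for the consolidated loader. `ExtractorLoader.loadOrDefault` is a hypothetical name, and `Optional<Class<?>>` stands in for the result of `ModuleLoader.loadModuleByAnnotationClassValues`; the real helper would also log the fallback with its module type:

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical consolidation sketch: prefer a dynamically discovered class,
// fall back to a default supplier, and wrap reflective failures uniformly.
final class ExtractorLoader {
    static <T> T loadOrDefault(Optional<Class<?>> discovered,
                               Supplier<T> fallback,
                               Class<?>[] ctorTypes,
                               Object... ctorArgs) {
        if (discovered.isEmpty()) {
            // real code would log "falling back to default for <moduleType>"
            return fallback.get();
        }
        try {
            @SuppressWarnings("unchecked")
            T instance = (T) discovered.get()
                    .getDeclaredConstructor(ctorTypes)
                    .newInstance(ctorArgs);
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Failed to instantiate " + discovered.get(), e);
        }
    }
}
```

The three loader methods would then reduce to one-line calls, e.g. `loadOrDefault(found, () -> new ColumnsExtractor(connection), new Class<?>[]{Connection.class}, connection)` for the column extractor.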
- Around line 183-185: The try/catch around stmt.execute("INSTALL httpfs") in
DuckLakeGenerator where dataPath.startsWith("s3://") is currently swallowing
SQLException; replace the empty catch with proper handling: catch the
SQLException and log the error (include the exception message/stacktrace) using
the class logger or error reporting used elsewhere in this class, and keep
execution flow consistent with how INSTALL ducklake is handled so the failure is
visible for debugging while allowing the subsequent stmt.execute("LOAD httpfs")
behavior to proceed or fail explicitly.
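A minimal sketch of the visible-failure handling for the httpfs install. The `SqlAction` interface and boolean return are illustrative design choices, not the class's current structure:

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: log an INSTALL httpfs failure instead of swallowing
// it, and tell the caller whether the install step succeeded so the
// subsequent LOAD httpfs can proceed or fail explicitly.
final class HttpfsInstall {
    private static final Logger LOG = Logger.getLogger(HttpfsInstall.class.getName());

    interface SqlAction { void run() throws SQLException; }

    static boolean tryInstall(SqlAction installHttpfs) {
        try {
            installHttpfs.run();
            return true;
        } catch (SQLException e) {
            LOG.log(Level.WARNING, "INSTALL httpfs failed: " + e.getMessage(), e);
            return false;
        }
    }
}
```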
- Around line 238-261: The code in listTablesFromDuckLakeMetadata relies on the
internal metadata catalog name constructed in metadataCatalog and on internal
tables ducklake_table and ducklake_schema which are fragile across DuckLake
versions; update the method by (1) adding a clear comment above
listTablesFromDuckLakeMetadata stating the DuckLake version(s) this logic was
tested against and that these internal names are implementation details, and (2)
wrap the metadata-query block (the PreparedStatement/ResultSet usage that
queries ducklake_table and ducklake_schema) in a safe SQLException catch that
logs a clear warning including the exception and falls back gracefully (so
callers can try alternate discovery paths) instead of letting the exception
propagate silently. Ensure the comment and the try/catch reference the function
name listTablesFromDuckLakeMetadata and the
metadataCatalog/ducklake_table/ducklake_schema identifiers for easy location.
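The guarded-query-plus-fallback pattern can be sketched generically. `SafeMetadataQuery` and `SqlQuery` are hypothetical names; the real primary path would run the `ducklake_table`/`ducklake_schema` query and the fallback would be the existing alternate discovery path:

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: DuckLake's internal catalog layout is an
// implementation detail, so a failure of the metadata query should be
// logged and routed to a stable fallback rather than propagated silently.
final class SafeMetadataQuery {
    private static final Logger LOG = Logger.getLogger(SafeMetadataQuery.class.getName());

    interface SqlQuery<T> { T run() throws SQLException; }

    static <T> T queryOrFallback(SqlQuery<T> primary, SqlQuery<T> fallback) {
        try {
            return primary.run();
        } catch (SQLException e) {
            LOG.log(Level.WARNING,
                    "listTablesFromDuckLakeMetadata failed; internal DuckLake "
                    + "catalog layout may have changed: " + e.getMessage(), e);
            try {
                return fallback.run();
            } catch (SQLException e2) {
                throw new RuntimeException("Fallback table discovery also failed", e2);
            }
        }
    }
}
```

A comment above the real method should still record which DuckLake version(s) the internal names were verified against.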

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cli/src/test/java/integration/RedshiftDDLIntegrationTest.java (1)

95-109: ⚠️ Potential issue | 🟡 Minor

Removed columns from DDL but assertions still reference them — test is silently broken.

Per the AI summary, ctimetz TIME WITH TIME ZONE and ctimestamptz TIMESTAMP WITH TIME ZONE were removed from CREATE_ALL_TYPES_DDL. However, the testExtractDDL method (lines 302–326) still has case "c_timestamptz" and case "c_timetz" assertions referencing the all_data_types table. Additionally, textExtract at line 139 asserts a table count of 5, which may also be stale.

Since @Disabled hides these failures, re-enabling the test later will require fixing these stale assertions. Please either clean up the dead assertions now or leave a TODO noting the inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/test/java/integration/RedshiftDDLIntegrationTest.java` around lines
95 - 109, The test assertions are stale because CREATE_ALL_TYPES_DDL no longer
defines ctimetz/ctimestamptz; update the testExtractDDL method to remove or
adjust the case branches referencing "c_timestamptz" and "c_timetz" (or restore
those columns in CREATE_ALL_TYPES_DDL if intended), and update the textExtract
assertion that expects 5 tables if that count changed; alternatively add a TODO
comment in testExtractDDL and textExtract noting the DDL change so the TODO can
be resolved before un-disabling the test.
♻️ Duplicate comments (3)
cli/src/test/java/integration/RedshiftDDLIntegrationTest.java (1)

26-27: Same redundant @Ignore + @Disabled pattern as DB2IntegrationTest.

See the comment on DB2IntegrationTest.java: @Disabled alone is sufficient for JUnit 5.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/test/java/integration/RedshiftDDLIntegrationTest.java` around lines
26 - 27, The test class RedshiftDDLIntegrationTest currently uses both `@Ignore`
and `@Disabled`; remove the redundant JUnit 4 `@Ignore` annotation and keep only the
JUnit 5 `@Disabled` on the class (or test method) to disable it; update any
imports or annotations related to `@Ignore` (from org.junit or junit.framework) so
only the JUnit 5 org.junit.jupiter.api.Disabled remains referenced.
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (2)

246-269: ⚠️ Potential issue | 🟡 Minor

Tables from the primary metadata path are still missing type, unlike the fallback path.

listTablesFromDuckLakeMetadata sets name and schema but never calls t.setType(...). In contrast, listTablesFallback (line 284) sets type to "BASE TABLE". If downstream code checks table.getType(), behavior will differ depending on which path was taken. This was flagged in a prior review and appears to still be unaddressed.

Proposed fix
                 Table t = new Table();
                 t.setName(tableName);
                 t.setSchema(schemaName);
+                t.setType("BASE TABLE");
                 out.add(t);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 246 - 269, The method listTablesFromDuckLakeMetadata currently
populates Table instances with name and schema but never sets the table type,
causing inconsistent behavior versus listTablesFallback; update
listTablesFromDuckLakeMetadata to set the type on each Table (e.g., call
t.setType("BASE TABLE") or the appropriate type) before adding to out so
downstream calls to Table.getType() behave identically to listTablesFallback.

217-230: S3 credentials are injected as SQL SET literals — inherent log-exposure risk.

While not directly logged in this code, if DuckDB's JDBC driver has statement-level logging enabled (or a connection pool / monitoring wrapper intercepts statements), the s3_access_key_id and s3_secret_access_key values will appear in plaintext. This is inherent to the SET-based approach. Worth documenting as a known limitation or investigating whether DuckDB supports setting these via connection properties instead.

🧹 Nitpick comments (4)
cli/src/test/java/integration/DB2IntegrationTest.java (1)

34-35: Redundant @Ignore — the test class uses JUnit 5 (Jupiter), so @Disabled alone is sufficient.

The test methods use @Test / @BeforeAll / @DisplayName from JUnit Jupiter (line 19 wildcard import). @Ignore is a JUnit 4 annotation and has no effect on the Jupiter test engine. Consider dropping it to avoid confusion.

Additionally, it's unclear why this test class is being disabled as part of a DuckLake feature PR. A comment or linked issue explaining the suppression reason would help future maintainers.

Proposed simplification
-@Ignore
 @Disabled
 @Testcontainers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/src/test/java/integration/DB2IntegrationTest.java` around lines 34 - 35,
Remove the redundant JUnit4 `@Ignore` annotation from the DB2IntegrationTest class
and keep the JUnit5 `@Disabled` annotation (i.e., delete the line containing
`@Ignore` and retain `@Disabled` on the class), and add a brief inline comment above
the `@Disabled` annotation referencing why the class is suppressed (or link to the
tracking issue/PR) so future maintainers understand the reason for disabling
this test; target the DB2IntegrationTest class and its class-level annotations
when making this change.
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (3)

312-352: Extractor loader methods duplicate logic from SourceGeneratorFactory.

loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor closely mirror the loadTableExtractor, loadViewExtractor, and loadColumnExtractor methods in SourceGeneratorFactory.java (lines 25–65 in that file). Consider extracting the shared pattern into a utility method to reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 312 - 352, The three DuckDB loader methods
(loadDuckDbTableExtractor, loadDuckDbViewExtractor, loadDuckDbColumnExtractor)
duplicate the same module-loading/instantiation pattern found in
SourceGeneratorFactory.loadTableExtractor/loadViewExtractor/loadColumnExtractor;
refactor by extracting a reusable utility method (e.g.,
ModuleLoaderUtil.loadExtractorOrDefault) that accepts the package/class token,
RosettaModuleTypes enum value, a default-instance supplier (or default class),
and optional constructor parameter types/args, then use that utility from
loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor
to remove the duplicated ModuleLoader.loadModuleByAnnotationClassValues +
getDeclaredConstructor + newInstance logic and preserve the same fallback
warnings and exception wrapping.

146-157: buildDuckDbUrl silently accepts any URL value without DuckDB-specific validation.

If connection.getUrl() is set to a non-DuckDB JDBC URL (e.g., jdbc:postgresql://...), this method will use it as-is since it doesn't start with jdbc:duckdb: and will prefix it — resulting in jdbc:duckdb:jdbc:postgresql://..., which will fail at connection time with a confusing error. Consider validating the URL scheme or logging a warning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 146 - 157, The buildDuckDbUrl method currently accepts any
connection.getUrl() and naïvely prefixes it, which can produce invalid URLs like
"jdbc:duckdb:jdbc:postgresql:..."; update buildDuckDbUrl to validate the scheme
from Connection.getUrl(): if url is non-empty and already
startsWith("jdbc:duckdb:") return it; if it startsWith("jdbc:") but not
"jdbc:duckdb:" treat it as invalid (either throw IllegalArgumentException with a
clear message referencing buildDuckDbUrl and Connection.getUrl(), or log a
warning and fall back to connection.getDuckdbDatabasePath()); only prefix
"jdbc:duckdb:" when the raw value is not a full JDBC URL (no leading "jdbc:").
Also apply the same validation logic when using
connection.getDuckdbDatabasePath().
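A sketch of the suggested validation follows; the method placement and the in-memory default when both values are blank are assumptions for illustration:

```java
// Hypothetical sketch of buildDuckDbUrl validation: accept DuckDB JDBC URLs
// as-is, reject other JDBC schemes with a clear error, and prefix bare paths.
final class UrlBuilder {
    static String buildDuckDbUrl(String url, String databasePath) {
        String raw = (url != null && !url.isBlank()) ? url : databasePath;
        if (raw == null || raw.isBlank()) {
            return "jdbc:duckdb:"; // assumed default: in-memory database
        }
        if (raw.startsWith("jdbc:duckdb:")) {
            return raw;
        }
        if (raw.startsWith("jdbc:")) {
            throw new IllegalArgumentException(
                "buildDuckDbUrl: expected a DuckDB URL or plain path, got: " + raw);
        }
        return "jdbc:duckdb:" + raw;
    }
}
```

This avoids the confusing `jdbc:duckdb:jdbc:postgresql://...` failure mode by rejecting foreign JDBC schemes up front.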

110-130: Validation is thorough for identifiers; consider also validating path values for dangerous characters.

ducklakeDataPath and ducklakeMetadataDb are only checked for null/blank. While they are later passed through escapeSqlSingleQuotes, a path containing characters like backticks or semicolons could still be problematic depending on how DuckDB parses the ATTACH statement. The identifier validation is good; this is a defense-in-depth suggestion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`
around lines 110 - 130, validateDuckLakeConfig currently only checks
ducklakeDataPath and ducklakeMetadataDb for null/blank; add defense-in-depth by
validating these path values for dangerous characters (e.g., backticks `,
semicolons ;, newlines, null bytes, path traversal sequences like ../, and other
shell/SQL-control chars) before they are used in ATTACH statements. Update
validateDuckLakeConfig to either reject values that match a conservative safe
path regex (allowing only expected characters: letters, digits, underscores,
dashes, slashes, dots) or explicitly check for and reject unsafe
characters/patterns, and reference c.getDucklakeDataPath() and
c.getDucklakeMetadataDb() in the checks; keep SAFE_IDENTIFIER unchanged for
identifiers and continue to use escapeSqlSingleQuotes where appropriate.
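The defense-in-depth check might look like the following; the allowed character set is a conservative assumption and may need widening for Windows paths or S3 URLs with query strings:

```java
import java.util.regex.Pattern;

// Hypothetical path check for ducklakeDataPath / ducklakeMetadataDb:
// allow a conservative character set and reject traversal sequences.
final class PathValidator {
    // ':' is included so s3:// prefixes and drive letters pass.
    private static final Pattern SAFE_PATH = Pattern.compile("[A-Za-z0-9_./:\\-]+");

    static void requireSafePath(String name, String value) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException(name + " must not be blank");
        }
        if (!SAFE_PATH.matcher(value).matches() || value.contains("..")) {
            throw new IllegalArgumentException(
                name + " contains unsafe characters: " + value);
        }
    }
}
```

`validateDuckLakeConfig` could call this for `c.getDucklakeDataPath()` and `c.getDucklakeMetadataDb()` while keeping SAFE_IDENTIFIER and escapeSqlSingleQuotes unchanged.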
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 83-88: The generated Database currently sets databaseType via
Database.setDatabaseType(connection.getDbType()) to "ducklake", but
DDLFactory.ddlForDatabaseType(...) (called from Cli.java:184) requires a
registered `@RosettaModule` for that database type; either change
DuckLakeGenerator to emit a databaseType that already has DDL handlers (e.g., a
supported alias) or add the missing DDL pieces: implement and register a DDL
generator class annotated with `@RosettaModule`(name="ducklake",
type=DDL_GENERATOR) and a corresponding executor implementation (DDL executor
registration) so DDLFactory can locate handlers for "ducklake"; update
DuckLakeGenerator, the new `@RosettaModule` classes, and ensure tests exercise
DDLFactory.ddlForDatabaseType("ducklake").


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a34ed8e and 8877ea2.

📒 Files selected for processing (3)
  • cli/src/test/java/integration/DB2IntegrationTest.java
  • cli/src/test/java/integration/RedshiftDDLIntegrationTest.java
  • source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java

@nbesimi nbesimi merged commit 9d9ebc6 into develop Feb 24, 2026
2 checks passed
