feature: added support as source for DuckLake (#295)
Conversation
📝 Walkthrough

Adds DuckLake support: six new Connection fields (three DuckLake paths, three S3 credentials), config-time substitutions for those fields, a new DuckLakeGenerator that uses DuckDB+DuckLake to extract metadata, factory routing that selects DuckLakeGenerator when dbType is "ducklake", and suppression of two integration test classes.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant Factory as SourceGeneratorFactory
    participant DLGen as DuckLakeGenerator
    participant JDBC as DuckDB JDBC
    participant DL as DuckLake Catalog
    participant S3 as S3 (optional)
    participant Extractors as Table/View/Column Extractors
    participant DB as Database Object
    Config->>Factory: sourceGenerator(connection, driverProvider)
    Factory->>Factory: if dbType == "ducklake"
    Factory->>DLGen: instantiate DuckLakeGenerator
    Factory->>DLGen: generate(connection)
    DLGen->>DLGen: validate ducklakeDataPath & metadata path
    DLGen->>JDBC: open DuckDB JDBC connection
    DLGen->>JDBC: install/load DuckLake
    DLGen->>DL: attach catalog (DATA_PATH, METADATA_DB)
    DLGen->>S3: apply S3 credentials if configured
    DLGen->>DL: select catalog/schema
    DLGen->>Extractors: load table/view/column extractors (with fallbacks)
    DLGen->>JDBC: query DuckLake metadata for tables (fallback to information_schema)
    DLGen->>JDBC: query views and columns
    DLGen->>DLGen: filter internal DuckLake metadata tables
    DLGen->>DB: assemble Database(name, tables, views, type)
    DLGen->>Factory: return Database
```
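The routing step above (`if dbType == "ducklake"`) can be illustrated with a simplified, hypothetical sketch; the interface and class names here are stand-ins, not the project's actual API:

```java
// Hypothetical sketch of the factory routing described in the walkthrough.
// "Generator", "DuckLakeGenerator", and "DefaultGenerator" are illustrative
// stand-ins for the project's real classes.
import java.util.Locale;

public class SourceGeneratorFactorySketch {

    interface Generator { String name(); }

    static class DuckLakeGenerator implements Generator {
        public String name() { return "ducklake"; }
    }

    static class DefaultGenerator implements Generator {
        public String name() { return "default"; }
    }

    // Route to DuckLakeGenerator only when dbType is "ducklake";
    // everything else falls through to the existing generator.
    static Generator sourceGenerator(String dbType) {
        if (dbType != null && "ducklake".equals(dbType.toLowerCase(Locale.ROOT))) {
            return new DuckLakeGenerator();
        }
        return new DefaultGenerator();
    }
}
```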
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 243-248: The SQL string in DuckLakeGenerator directly interpolates
parquetFile.getAbsolutePath() into createTableSql which allows file path
injection via single quotes; sanitize the path before interpolation by escaping
single quotes (e.g., replace "'" with "''") and then use the escaped value when
building createTableSql (references: DuckLakeGenerator, variables parquetFile,
tableName, createStmt, createTableSql) so the generated CREATE TABLE ...
read_parquet('...') literal cannot break SQL syntax.
- Around line 300-344: The executeDebugSQL method currently logs full SQL and
result rows via log.info; change it to avoid leaking sensitive data by (1)
updating the method signature of executeDebugSQL (and any callers) to accept a
boolean flag like enableResultLogging (or use an existing debug flag on
Connection), (2) change the SQL statement log to a non-default level (TRACE or
DEBUG) using log.trace(...) and only emit it when tracing is enabled, and (3)
wrap the result-row logging block (the loop that builds rows and the log.info
calls) in a conditional that checks enableResultLogging before logging row
contents; keep setupDuckLake/connect logic but ensure no sensitive output is
logged when the flag is false (optionally move this utility to a test/debug-only
module if desired).
- Around line 190-208: The ATTACH/USE SQL builds raw SQL from user inputs
(catalogName, rosettaConnection.getDucklakeDataPath()) in DuckLakeGenerator
which allows SQL injection; validate and sanitize these inputs before use:
enforce a strict regex (e.g. only [A-Za-z0-9_]+) for catalogName and reject or
normalize any value that doesn't match, and validate ducklakeDataPath to ensure
it is a safe filesystem path (no semicolons, quotes, or SQL metacharacters) or
escape/quote it correctly; then update the attachSql construction and the
useStmt.execute call (the String.format building attachSql and the "USE " +
catalogName call) to use the validated/escaped values only, failing fast with a
clear exception if validation fails.
- Around line 353-356: The SQL built in importCsvToDuckLake interpolates
tableName and csvFilePath directly causing SQL injection; update
importCsvToDuckLake to validate and safely quote/escape inputs: validate
tableName against a strict identifier regex (e.g. [A-Za-z_][A-Za-z0-9_]* ) and
reject or sanitize otherwise, quote the identifier properly (escape any internal
quotes) rather than raw string concat, and treat csvFilePath as a data parameter
or at minimum escape/quote it (escape single quotes) before inserting into the
SQL string; ensure similar protection for catalogName if it can be controlled
externally and use prepared statements or the DB driver's identifier-quoting
helper where available.
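The sanitization steps the comments above call for share the same primitives. A small sketch of what such helpers might look like (class and method names are illustrative, not existing project code): escape single-quoted literals by doubling quotes, validate identifiers against a strict pattern, and quote identifiers by doubling embedded double quotes:

```java
// Illustrative sanitization helpers for the injection fixes suggested above.
// None of these names exist in the project; they sketch the technique only.
import java.util.regex.Pattern;

public class SqlSanitizer {

    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Escape a value destined for a single-quoted SQL literal ('' escapes ').
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    // Fail fast on anything that is not a plain identifier (e.g. catalogName).
    static String requireIdentifier(String name) {
        if (name == null || !IDENTIFIER.matcher(name).matches()) {
            throw new IllegalArgumentException("Invalid identifier: " + name);
        }
        return name;
    }

    // Quote an identifier, doubling any embedded double quotes.
    static String quoteIdentifier(String name) {
        return "\"" + name.replace("\"", "\"\"") + "\"";
    }
}
```

A path or file name would then be interpolated as `"read_parquet('" + SqlSanitizer.escapeLiteral(path) + "')"`, and a table name as `"CREATE TABLE " + SqlSanitizer.quoteIdentifier(tableName) + " ..."`, so a stray quote cannot break out of the literal or identifier.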
🧹 Nitpick comments (4)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (4)
60-65: Unchecked casts could mask type mismatches. The casts on lines 60 and 65 assume that `tableExtractor.extract()` and `viewExtractor.extract()` return `Collection<Table>` and `Collection<View>` respectively. While this aligns with the expected interface contracts, consider using generics on the `TableExtractor` and `ViewExtractor` interfaces to make this type-safe, or add explicit type checking.
104-123: Consider pattern-based filtering for metadata tables. The hardcoded list of DuckLake metadata tables may become stale as DuckLake evolves. Consider using a prefix-based filter (e.g., tables starting with `ducklake_`), which would be more maintainable.

♻️ Alternative: Prefix-based filtering

```diff
 private Collection<Table> filterDuckLakeMetadataTables(Collection<Table> allTables) {
-    Set<String> metadataTableNames = Set.of(
-            "ducklake_column", "ducklake_column_tag", ...
-    );
-
     Collection<Table> userTables = new ArrayList<>();
     for (Table table : allTables) {
-        if (!metadataTableNames.contains(table.getName())) {
+        if (!table.getName().startsWith("ducklake_")) {
             userTables.add(table);
         }
     }
     return userTables;
 }
```
43-77: Consider try-with-resources for cleaner resource management. The current try-finally pattern works but could be simplified using try-with-resources for `java.sql.Connection`.

♻️ Refactor to try-with-resources

```diff
 Driver driver = driverProvider.getDriver(tempConnection);
 Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
-java.sql.Connection connect = driver.connect(duckdbUrl, properties);
-
-try {
+try (java.sql.Connection connect = driver.connect(duckdbUrl, properties)) {
     String attachedCatalogAlias = setupDuckLake(connect, connection);
     // ... rest of method ...
     return database;
-} finally {
-    connect.close();
 }
```
80-102: Code duplication with `generate()` method. The `validate()` method duplicates the connection setup and DuckLake initialization logic from `generate()`. Consider extracting the common setup into a private helper method to improve maintainability.

♻️ Extract common setup logic

```java
private java.sql.Connection createDuckLakeConnection(Connection connection) throws Exception {
    if (connection.getDucklakeDataPath() == null || connection.getDucklakeDataPath().trim().isEmpty()) {
        throw new IllegalArgumentException("ducklakeDataPath is required for DuckLake connections");
    }
    String duckdbUrl = buildDuckDbUrl(connection);
    Connection tempConnection = new Connection();
    tempConnection.setUrl(duckdbUrl);
    tempConnection.setDbType("duckdb");
    Driver driver = driverProvider.getDriver(tempConnection);
    Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
    return driver.connect(duckdbUrl, properties);
}
```

Then use in both `generate()` and `validate()`.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 330-361: The public static helper executeDebugSQL in
DuckLakeGenerator exposes an arbitrary-SQL execution surface; remove or relocate
it out of production code by either (1) moving the method into a test utility
class (e.g., create a DuckLakeTestUtils and place executeDebugSQL there for
test-only use), or (2) eliminate it entirely if not required, or (3) at minimum
reduce its visibility (make it private/package-private) and annotate with
`@VisibleForTesting` so it cannot be called from production code; update any
callers to reference the new test utility or remove them accordingly and ensure
the DuckLakeGenerator class no longer exposes executeDebugSQL publicly.
- Around line 115-126: openDuckDbConnection creates a new Connection (temp) but
never copies authentication from the original, so JDBCUtils.setJDBCAuth(temp)
yields empty creds; fix by propagating credentials from the original Connection
into temp (or bypass temp and build Properties from original) before calling
JDBCUtils.setJDBCAuth. Update openDuckDbConnection to copy userName and password
(and any other auth fields) from the original parameter to temp prior to
driverProvider.getDriver / JDBCUtils.setJDBCAuth so the resulting Properties
include the source credentials.
- Around line 248-265: listTablesFallback creates Table instances with only the
name set, causing downstream column extraction to fail because table.getSchema()
and table.getType() are null; update listTablesFallback (the method) to populate
the Table.schema and Table.type fields as you build each Table: set schema from
the query parameter or from the result set column (e.g., result column
"table_schema") and set type to the table_type value (or hardcode "BASE
TABLE"/use rs.getString("table_type")), so that ColumnsExtractor.extract() can
call getMetaData().getColumns(databaseName, table.getSchema(), ...) correctly.
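A minimal sketch of the fallback-table fix described above, using a stand-in `Table` class (the project's real model class has proper getters/setters); the point is that every fallback entry gets a non-null schema and type so column extraction can proceed:

```java
// Hypothetical sketch: always populate schema and type when building fallback
// Table entries. The Table class here is a simplified stand-in for the
// project's model class, shown only to illustrate the shape of the fix.
public class TableFallbackSketch {

    static class Table {
        String name, schema, type;
    }

    // Build a fully populated Table so downstream column extraction can rely
    // on schema and type being non-null.
    static Table fallbackTable(String name, String schema) {
        Table t = new Table();
        t.name = name;
        t.schema = schema != null ? schema : "main"; // assumed default schema
        t.type = "BASE TABLE";                        // matches information_schema table_type
        return t;
    }
}
```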
🧹 Nitpick comments (3)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (3)
288-328: Three near-identical extractor loader methods; consider extracting a generic helper. `loadDuckDbTableExtractor`, `loadDuckDbViewExtractor`, and `loadDuckDbColumnExtractor` share the same structure: call `ModuleLoader`, check `isEmpty`, reflectively instantiate, catch the same exceptions. A single generic helper (parameterized on the extractor type) would eliminate the duplication.
267-286: Hardcoded metadata table list is fragile; consider prefix-based filtering. If DuckLake adds new internal tables in future versions, this static set will miss them. A prefix-based filter (`name.startsWith("ducklake_")`) would be more resilient, assuming user tables don't use that prefix.
182-184: Silently catching all exceptions from `INSTALL ducklake` could mask real failures. While the subsequent `LOAD ducklake` would likely fail if installation truly failed, consider at least logging the caught exception at `DEBUG` level for troubleshooting.
Actionable comments posted: 3
🧹 Nitpick comments (5)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (4)
168-207: Silent swallow of `INSTALL ducklake` failure may hide real errors. Line 178 catches and ignores all `SQLException` from `INSTALL ducklake`. This is presumably to handle the "already installed" case, but it also hides genuine failures (e.g., network errors downloading the extension, permission issues). Consider checking the error message, similar to how line 198 checks for "already exists", to only ignore expected errors.

Proposed fix

```diff
-try { stmt.execute("INSTALL ducklake"); } catch (SQLException ignored) {}
+try {
+    stmt.execute("INSTALL ducklake");
+} catch (SQLException e) {
+    String msg = e.getMessage() == null ? "" : e.getMessage();
+    if (!msg.contains("already installed") && !msg.contains("is already installed")) {
+        log.warn("INSTALL ducklake failed (non-fatal): {}", msg);
+    }
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 168 - 207, The INSTALL ducklake call in setupDuckLake currently swallows all SQLExceptions; update the try/catch around stmt.execute("INSTALL ducklake") to inspect the SQLException message (similar to the attach block) and only ignore known benign messages (e.g., "already installed", "extension already exists" or whatever DuckLake returns), but rethrow for other errors; locate the stmt.execute("INSTALL ducklake") invocation inside setupDuckLake and replace the broad catch with logic that checks e.getMessage() and suppresses only expected "already installed" text, otherwise rethrow the exception.
304-344: Three nearly identical extractor-loading methods can be consolidated. `loadDuckDbTableExtractor`, `loadDuckDbViewExtractor`, and `loadDuckDbColumnExtractor` share the same ModuleLoader → fallback → reflective-instantiation pattern. A generic helper would eliminate the duplication.

♻️ Sketch of a generic helper

```java
@SuppressWarnings("unchecked")
private <T> T loadExtractor(Connection connection, String pkg, RosettaModuleTypes moduleType,
                            java.util.function.Supplier<T> fallback, Class<?>... ctorArgTypes) {
    Optional<Class<?>> mod = ModuleLoader.loadModuleByAnnotationClassValues(
            pkg, moduleType, connection.getDbType());
    if (mod.isEmpty()) {
        log.warn("{} extractor not found for DuckDB, falling back to default.", moduleType);
        return fallback.get();
    }
    try {
        if (ctorArgTypes.length == 0) {
            return (T) mod.get().getDeclaredConstructor().newInstance();
        }
        return (T) mod.get().getDeclaredConstructor(ctorArgTypes).newInstance(connection);
    } catch (ReflectiveOperationException e) {
        throw new RuntimeException("Failed to instantiate " + moduleType + " extractor", e);
    }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 304 - 344, Consolidate the duplicated reflection/fallback logic in loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor into a single generic helper (e.g., loadExtractor) that takes Connection, package name, RosettaModuleTypes, a Supplier fallback, and optional ctor arg types/values; have the helper call ModuleLoader.loadModuleByAnnotationClassValues, log the fallback with moduleType, and perform reflective instantiation (no-arg or with Connection) catching ReflectiveOperationException and rethrowing a RuntimeException; then replace the three methods to call this helper (passing DefaultTablesExtractor::new, DefaultViewExtractor::new, and () -> new ColumnsExtractor(connection) as fallbacks and appropriate ctor arg types) so all reflection/fallback logic is centralized.
183-185: Same silent-swallow pattern for `INSTALL httpfs`. Same concern as above for `INSTALL ducklake`: line 184 silently catches all `SQLException` from `INSTALL httpfs`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 183 - 185, The try/catch around stmt.execute("INSTALL httpfs") in DuckLakeGenerator where dataPath.startsWith("s3://") is currently swallowing SQLException; replace the empty catch with proper handling: catch the SQLException and log the error (include the exception message/stacktrace) using the class logger or error reporting used elsewhere in this class, and keep execution flow consistent with how INSTALL ducklake is handled so the failure is visible for debugging while allowing the subsequent stmt.execute("LOAD httpfs") behavior to proceed or fail explicitly.
238-261: Reliance on DuckLake internal metadata schema (`__ducklake_metadata_<catalog>`) is fragile. The metadata catalog naming convention `__ducklake_metadata_<catalog>` and the internal table names (`ducklake_table`, `ducklake_schema`) are implementation details of DuckLake. If these change in a future DuckLake version, this query will silently fail (or throw). The fallback paths mitigate this partially, but only if the exception-handling gap above is fixed. Consider adding a comment documenting which DuckLake version this was tested against, so future maintainers know when to verify compatibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 238 - 261, The code in listTablesFromDuckLakeMetadata relies on the internal metadata catalog name constructed in metadataCatalog and on internal tables ducklake_table and ducklake_schema which are fragile across DuckLake versions; update the method by (1) adding a clear comment above listTablesFromDuckLakeMetadata stating the DuckLake version(s) this logic was tested against and that these internal names are implementation details, and (2) wrap the metadata-query block (the PreparedStatement/ResultSet usage that queries ducklake_table and ducklake_schema) in a safe SQLException catch that logs a clear warning including the exception and falls back gracefully (so callers can try alternate discovery paths) instead of letting the exception propagate silently. Ensure the comment and the try/catch reference the function name listTablesFromDuckLakeMetadata and the metadataCatalog/ducklake_table/ducklake_schema identifiers for easy location.

cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java (1)
42-64: Misleading comment: this is config-parameter cross-referencing, not environment variable substitution. Environment variables are already resolved in `processEnvParameters` (lines 29–33). The `StringSubstitutor` on line 39 is built from the connection's own field map, so this block performs intra-connection field substitution (e.g., `${databaseName}` inside `ducklakeDataPath`). The comment on line 43 should be updated to avoid confusion. Also, the repeated null-check-substitute-set pattern is getting verbose; consider extracting a small helper to reduce duplication.
♻️ Suggested refactor

```diff
+private static String substituteIfPresent(StringSubstitutor sub, String value) {
+    return value != null ? sub.replace(value) : null;
+}
+
 private Config processConfigParameters(String configContent) throws IOException {
     Config config = new ObjectMapper(new YAMLFactory()).readValue(configContent, Config.class);
     for (Connection connection : config.getConnections()) {
         Map<String, String> configParameters = connection.toMap();
         StringSubstitutor stringSubstitutor = new StringSubstitutor(configParameters, "${", "}");
         String processedUrl = stringSubstitutor.replace(connection.getUrl());
         connection.setUrl(processedUrl);
-        // Process DuckLake-specific fields for environment variable substitution
-        if (connection.getDuckdbDatabasePath() != null) {
-            String processedDuckdbPath = stringSubstitutor.replace(connection.getDuckdbDatabasePath());
-            connection.setDuckdbDatabasePath(processedDuckdbPath);
-        }
-        if (connection.getDucklakeDataPath() != null) {
-            String processedDataPath = stringSubstitutor.replace(connection.getDucklakeDataPath());
-            connection.setDucklakeDataPath(processedDataPath);
-        }
-        if (connection.getDucklakeMetadataDb() != null) {
-            String processedMetadataDb = stringSubstitutor.replace(connection.getDucklakeMetadataDb());
-            connection.setDucklakeMetadataDb(processedMetadataDb);
-        }
-        if (connection.getS3Region() != null) {
-            connection.setS3Region(stringSubstitutor.replace(connection.getS3Region()));
-        }
-        if (connection.getS3AccessKeyId() != null) {
-            connection.setS3AccessKeyId(stringSubstitutor.replace(connection.getS3AccessKeyId()));
-        }
-        if (connection.getS3SecretAccessKey() != null) {
-            connection.setS3SecretAccessKey(stringSubstitutor.replace(connection.getS3SecretAccessKey()));
-        }
+        // Process DuckLake-specific fields for config parameter cross-referencing
+        connection.setDuckdbDatabasePath(substituteIfPresent(stringSubstitutor, connection.getDuckdbDatabasePath()));
+        connection.setDucklakeDataPath(substituteIfPresent(stringSubstitutor, connection.getDucklakeDataPath()));
+        connection.setDucklakeMetadataDb(substituteIfPresent(stringSubstitutor, connection.getDucklakeMetadataDb()));
+        connection.setS3Region(substituteIfPresent(stringSubstitutor, connection.getS3Region()));
+        connection.setS3AccessKeyId(substituteIfPresent(stringSubstitutor, connection.getS3AccessKeyId()));
+        connection.setS3SecretAccessKey(substituteIfPresent(stringSubstitutor, connection.getS3SecretAccessKey()));
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java` around lines 42 - 64, Update the misleading comment and reduce duplication: change the comment above the block in ConfigYmlConverter to state that this performs intra-connection field substitution using the connection's own StringSubstitutor (not environment-variable resolution which is handled by processEnvParameters), and refactor the repeated null-check/substitute/set pattern by extracting a small helper method (e.g., applySubstitutionIfNotNull(StringSubstitutor, Function/Getter like connection::getDucklakeDataPath, BiConsumer/Setter like connection::setDucklakeDataPath) or a private utility that accepts the value, performs stringSubstitutor.replace if non-null and returns the replaced value) and use that helper for fields referenced (getDuckdbDatabasePath/getDucklakeDataPath/getDucklakeMetadataDb/getS3Region/getS3AccessKeyId/getS3SecretAccessKey) to eliminate boilerplate while keeping semantics identical.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 209-222: applyS3Credentials currently injects secrets into SQL
literals via stmt.execute("SET s3_access_key_id=...") and stmt.execute("SET
s3_secret_access_key=..."), which can expose credentials in JDBC/SQL logs;
change the implementation to avoid embedding secrets in SQL: set S3 credentials
on the connection/config level (e.g., use Connection.setClientInfo or pass
s3_access_key_id/s3_secret_access_key as connection properties when creating the
Connection) or populate AWS env vars before creating the Connection instead of
calling stmt.execute with secret strings; update applyS3Credentials to use
rosetta connection-level configuration APIs (replace the stmt.execute lines) and
ensure any SQL statement logging is disabled or masked for these settings.
- Around line 49-60: The call to listTablesFromDuckLakeMetadata in
DuckLakeGenerator can throw and currently bypasses all fallbacks; wrap that
initial call in a try-catch around listTablesFromDuckLakeMetadata(jdbc, catalog,
duckdbConnection.getSchemaName()) and on any Exception treat it the same as an
empty result by logging the exception (use log.warn with the exception) and then
attempt the same fallback sequence: call
tableExtractor.extract(duckdbConnection, jdbc) and if that returns empty call
listTablesFallback(jdbc, catalog, duckdbConnection.getSchemaName()); ensure the
existing catch block behavior and logging remains for errors raised by
tableExtractor.extract as well.
- Around line 246-255: The listTablesFromDuckLakeMetadata path in
DuckLakeGenerator populates Table.name and Table.schema but never sets
Table.type, causing inconsistency with listTablesFallback which calls
t.setType("BASE TABLE"); update listTablesFromDuckLakeMetadata to set the type
on each Table (e.g., t.setType("BASE TABLE") or the appropriate JDBC type
constant) before adding to out so downstream code observing table.getType() sees
the same value as the fallback path.
---
Duplicate comments:
In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 126-136: openDuckDbConnection creates a new Connection (temp) but
never copies userName/password from the original Connection, so
JDBCUtils.setJDBCAuth(temp) drops credentials; update openDuckDbConnection to
copy authentication fields (e.g., temp.setUserName(original.getUserName()) and
temp.setPassword(original.getPassword()) or equivalent getters/setters) before
calling JDBCUtils.setJDBCAuth(temp), mirroring the credential-forwarding logic
used in createDuckDbConnection, so the Properties passed to
driver.connect(duckdbUrl, props) include the source credentials.
---
Nitpick comments:
In `@cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java`:
- Around line 42-64: Update the misleading comment and reduce duplication:
change the comment above the block in ConfigYmlConverter to state that this
performs intra-connection field substitution using the connection's own
StringSubstitutor (not environment-variable resolution which is handled by
processEnvParameters), and refactor the repeated null-check/substitute/set
pattern by extracting a small helper method (e.g.,
applySubstitutionIfNotNull(StringSubstitutor, Function/Getter like
connection::getDucklakeDataPath, BiConsumer/Setter like
connection::setDucklakeDataPath) or a private utility that accepts the value,
performs stringSubstitutor.replace if non-null and returns the replaced value)
and use that helper for fields referenced
(getDuckdbDatabasePath/getDucklakeDataPath/getDucklakeMetadataDb/getS3Region/getS3AccessKeyId/getS3SecretAccessKey)
to eliminate boilerplate while keeping semantics identical.
In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 168-207: The INSTALL ducklake call in setupDuckLake currently
swallows all SQLExceptions; update the try/catch around stmt.execute("INSTALL
ducklake") to inspect the SQLException message (similar to the attach block) and
only ignore known benign messages (e.g., "already installed", "extension already
exists" or whatever DuckLake returns), but rethrow for other errors; locate the
stmt.execute("INSTALL ducklake") invocation inside setupDuckLake and replace the
broad catch with logic that checks e.getMessage() and suppresses only expected
"already installed" text, otherwise rethrow the exception.
- Around line 304-344: Consolidate the duplicated reflection/fallback logic in
loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor
into a single generic helper (e.g., loadExtractor) that takes Connection,
package name, RosettaModuleTypes, a Supplier fallback, and optional ctor arg
types/values; have the helper call
ModuleLoader.loadModuleByAnnotationClassValues, log the fallback with
moduleType, and perform reflective instantiation (no-arg or with Connection)
catching ReflectiveOperationException and rethrowing a RuntimeException; then
replace the three methods to call this helper (passing
DefaultTablesExtractor::new, DefaultViewExtractor::new, and () -> new
ColumnsExtractor(connection) as fallbacks and appropriate ctor arg types) so all
reflection/fallback logic is centralized.
- Around line 183-185: The try/catch around stmt.execute("INSTALL httpfs") in
DuckLakeGenerator where dataPath.startsWith("s3://") is currently swallowing
SQLException; replace the empty catch with proper handling: catch the
SQLException and log the error (include the exception message/stacktrace) using
the class logger or error reporting used elsewhere in this class, and keep
execution flow consistent with how INSTALL ducklake is handled so the failure is
visible for debugging while allowing the subsequent stmt.execute("LOAD httpfs")
behavior to proceed or fail explicitly.
- Around line 238-261: The code in listTablesFromDuckLakeMetadata relies on the
internal metadata catalog name constructed in metadataCatalog and on internal
tables ducklake_table and ducklake_schema which are fragile across DuckLake
versions; update the method by (1) adding a clear comment above
listTablesFromDuckLakeMetadata stating the DuckLake version(s) this logic was
tested against and that these internal names are implementation details, and (2)
wrap the metadata-query block (the PreparedStatement/ResultSet usage that
queries ducklake_table and ducklake_schema) in a safe SQLException catch that
logs a clear warning including the exception and falls back gracefully (so
callers can try alternate discovery paths) instead of letting the exception
propagate silently. Ensure the comment and the try/catch reference the function
name listTablesFromDuckLakeMetadata and the
metadataCatalog/ducklake_table/ducklake_schema identifiers for easy location.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cli/src/test/java/integration/RedshiftDDLIntegrationTest.java (1)
95-109: ⚠️ Potential issue | 🟡 Minor

Removed columns from DDL but assertions still reference them; the test is silently broken. Per the AI summary, `ctimtz TIME WITH TIME ZONE` and `ctimestamptz TIMESTAMP WITH TIME ZONE` were removed from `CREATE_ALL_TYPES_DDL`. However, the `testExtractDDL` method (lines 302–326) still has `case "c_timestamptz"` and `case "c_timetz"` assertions referencing the `all_data_types` table. Additionally, `textExtract` at line 139 asserts a table count of 5, which may also be stale. Since `@Disabled` hides these failures, re-enabling the test later will require fixing these stale assertions. Please either clean up the dead assertions now or leave a TODO noting the inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/src/test/java/integration/RedshiftDDLIntegrationTest.java` around lines 95 - 109, The test assertions are stale because CREATE_ALL_TYPES_DDL no longer defines ctimetz/ctimestamptz; update the testExtractDDL method to remove or adjust the case branches referencing "c_timestamptz" and "c_timetz" (or restore those columns in CREATE_ALL_TYPES_DDL if intended), and update the textExtract assertion that expects 5 tables if that count changed; alternatively add a TODO comment in testExtractDDL and textExtract noting the DDL change so the TODO can be resolved before un-disabling the test.
♻️ Duplicate comments (3)
cli/src/test/java/integration/RedshiftDDLIntegrationTest.java (1)
26-27: Same redundant `@Ignore` + `@Disabled` pattern as DB2IntegrationTest. See the comment on `DB2IntegrationTest.java`: `@Disabled` alone is sufficient for JUnit 5.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/src/test/java/integration/RedshiftDDLIntegrationTest.java` around lines 26 - 27, The test class RedshiftDDLIntegrationTest currently uses both `@Ignore` and `@Disabled`; remove the redundant JUnit 4 `@Ignore` annotation and keep only the JUnit 5 `@Disabled` on the class (or test method) to disable it; update any imports or annotations related to `@Ignore` (from org.junit or junit.framework) so only the JUnit 5 org.junit.jupiter.api.Disabled remains referenced.

source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (2)
246-269: ⚠️ Potential issue | 🟡 Minor

Tables from the primary metadata path are still missing `type`, unlike the fallback path.

`listTablesFromDuckLakeMetadata` sets `name` and `schema` but never calls `t.setType(...)`. In contrast, `listTablesFallback` (line 284) sets `type` to `"BASE TABLE"`. If downstream code checks `table.getType()`, behavior will differ depending on which path was taken. This was flagged in a prior review and appears to still be unaddressed.

Proposed fix
```diff
 Table t = new Table();
 t.setName(tableName);
 t.setSchema(schemaName);
+t.setType("BASE TABLE");
 out.add(t);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 246 - 269, The method listTablesFromDuckLakeMetadata currently populates Table instances with name and schema but never sets the table type, causing inconsistent behavior versus listTablesFallback; update listTablesFromDuckLakeMetadata to set the type on each Table (e.g., call t.setType("BASE TABLE") or the appropriate type) before adding to out so downstream calls to Table.getType() behave identically to listTablesFallback.
217-230: S3 credentials are injected as SQL `SET` literals, an inherent log-exposure risk.

While not directly logged in this code, if DuckDB's JDBC driver has statement-level logging enabled (or a connection pool / monitoring wrapper intercepts statements), the `s3_access_key_id` and `s3_secret_access_key` values will appear in plaintext. This is inherent to the `SET`-based approach. Worth documenting as a known limitation or investigating whether DuckDB supports setting these via connection properties instead.
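The `SET`-based injection amounts to embedding the secret in the statement text, which is exactly what a statement logger would capture. A minimal sketch of the pattern plus a masking helper for log output (class and method names here are illustrative, not the actual DuckLakeGenerator API):

```java
// Sketch of SET-based credential injection. The secret is embedded in
// the statement text, so any JDBC statement logger would capture it;
// maskedForLog shows one way to redact before logging.
final class S3CredentialSql {
    // Double up single quotes so the value is a valid SQL string literal.
    static String escapeSqlSingleQuotes(String v) {
        return v.replace("'", "''");
    }

    static String setAccessKey(String accessKeyId) {
        return "SET s3_access_key_id='" + escapeSqlSingleQuotes(accessKeyId) + "';";
    }

    // Redacted form suitable for logging instead of the real statement.
    static String maskedForLog(String statement) {
        return statement.replaceAll("='[^']*'", "='***'");
    }
}
```

The statement itself must still carry the plaintext value, which is why the comment above suggests documenting the limitation or moving the credentials out of SQL text entirely.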
🧹 Nitpick comments (4)
cli/src/test/java/integration/DB2IntegrationTest.java (1)
34-35: Redundant `@Ignore`: the test class uses JUnit 5 (Jupiter), so `@Disabled` alone is sufficient.

The test methods use `@Test`/`@BeforeAll`/`@DisplayName` from JUnit Jupiter (line 19 wildcard import). `@Ignore` is a JUnit 4 annotation and has no effect on the Jupiter test engine. Consider dropping it to avoid confusion.

Additionally, it's unclear why this test class is being disabled as part of a DuckLake feature PR. A comment or linked issue explaining the suppression reason would help future maintainers.
Proposed simplification
```diff
-@Ignore
 @Disabled
 @Testcontainers
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/src/test/java/integration/DB2IntegrationTest.java` around lines 34 - 35, Remove the redundant JUnit4 `@Ignore` annotation from the DB2IntegrationTest class and keep the JUnit5 `@Disabled` annotation (i.e., delete the line containing `@Ignore` and retain `@Disabled` on the class), and add a brief inline comment above the `@Disabled` annotation referencing why the class is suppressed (or link to the tracking issue/PR) so future maintainers understand the reason for disabling this test; target the DB2IntegrationTest class and its class-level annotations when making this change.

source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (3)
312-352: Extractor loader methods duplicate logic from `SourceGeneratorFactory`.

`loadDuckDbTableExtractor`, `loadDuckDbViewExtractor`, and `loadDuckDbColumnExtractor` closely mirror the `loadTableExtractor`, `loadViewExtractor`, and `loadColumnExtractor` methods in `SourceGeneratorFactory.java` (lines 25–65 in that file). Consider extracting the shared pattern into a utility method to reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 312 - 352, The three DuckDB loader methods (loadDuckDbTableExtractor, loadDuckDbViewExtractor, loadDuckDbColumnExtractor) duplicate the same module-loading/instantiation pattern found in SourceGeneratorFactory.loadTableExtractor/loadViewExtractor/loadColumnExtractor; refactor by extracting a reusable utility method (e.g., ModuleLoaderUtil.loadExtractorOrDefault) that accepts the package/class token, RosettaModuleTypes enum value, a default-instance supplier (or default class), and optional constructor parameter types/args, then use that utility from loadDuckDbTableExtractor, loadDuckDbViewExtractor, and loadDuckDbColumnExtractor to remove the duplicated ModuleLoader.loadModuleByAnnotationClassValues + getDeclaredConstructor + newInstance logic and preserve the same fallback warnings and exception wrapping.
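The proposed consolidation could be sketched as below; `ModuleLoaderUtil.loadExtractorOrDefault` is the reviewer's suggested name, not existing project API, and the real version would wrap `ModuleLoader.loadModuleByAnnotationClassValues` plus reflective instantiation rather than a plain `Supplier`:

```java
import java.util.function.Supplier;

// Sketch of a shared "load module or fall back to default" helper that
// the three loadDuckDb*Extractor methods could delegate to.
final class ModuleLoaderUtil {
    // Try the module lookup; on exception or empty result, warn and
    // return the supplied default instance instead.
    static <T> T loadExtractorOrDefault(Supplier<T> moduleLookup,
                                        Supplier<T> defaultSupplier,
                                        String extractorName) {
        try {
            T loaded = moduleLookup.get();
            if (loaded != null) {
                return loaded;
            }
        } catch (Exception e) {
            System.err.println("Failed to load " + extractorName + ": " + e.getMessage());
        }
        System.err.println("Falling back to default " + extractorName);
        return defaultSupplier.get();
    }
}
```

Each of the three DuckDB loader methods would then collapse to a single call with its own lookup token and default supplier, keeping the fallback warning and exception handling in one place.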
146-157: `buildDuckDbUrl` silently accepts any URL value without DuckDB-specific validation.

If `connection.getUrl()` is set to a non-DuckDB JDBC URL (e.g., `jdbc:postgresql://...`), the method sees that it doesn't start with `jdbc:duckdb:` and prefixes it anyway, resulting in `jdbc:duckdb:jdbc:postgresql://...`, which will fail at connection time with a confusing error. Consider validating the URL scheme or logging a warning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 146 - 157, The buildDuckDbUrl method currently accepts any connection.getUrl() and naïvely prefixes it, which can produce invalid URLs like "jdbc:duckdb:jdbc:postgresql:..."; update buildDuckDbUrl to validate the scheme from Connection.getUrl(): if url is non-empty and already startsWith("jdbc:duckdb:") return it; if it startsWith("jdbc:") but not "jdbc:duckdb:" treat it as invalid (either throw IllegalArgumentException with a clear message referencing buildDuckDbUrl and Connection.getUrl(), or log a warning and fall back to connection.getDuckdbDatabasePath()); only prefix "jdbc:duckdb:" when the raw value is not a full JDBC URL (no leading "jdbc:"). Also apply the same validation logic when using connection.getDuckdbDatabasePath().
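The validation described above can be sketched as a standalone method (the real code would read both values from `Connection` rather than take them as parameters):

```java
// Sketch of the proposed buildDuckDbUrl validation: pass through real
// DuckDB URLs, fail fast on foreign JDBC URLs, and prefix bare paths.
final class DuckDbUrl {
    static String buildDuckDbUrl(String url, String duckdbDatabasePath) {
        if (url != null && !url.isBlank()) {
            if (url.startsWith("jdbc:duckdb:")) {
                return url;                      // already a DuckDB URL
            }
            if (url.startsWith("jdbc:")) {       // some other JDBC driver: reject
                throw new IllegalArgumentException(
                        "Expected a jdbc:duckdb: URL but got: " + url);
            }
            return "jdbc:duckdb:" + url;         // bare path: prefix it
        }
        String path = duckdbDatabasePath == null ? "" : duckdbDatabasePath;
        return "jdbc:duckdb:" + path;            // empty path = in-memory DuckDB
    }
}
```

Failing fast on a foreign `jdbc:` scheme turns the confusing connection-time error into an immediate, explicit one.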
110-130: Validation is thorough for identifiers; consider also validating path values for dangerous characters.
`ducklakeDataPath` and `ducklakeMetadataDb` are only checked for null/blank. While they are later passed through `escapeSqlSingleQuotes`, a path containing characters like backticks or semicolons could still be problematic depending on how DuckDB parses the `ATTACH` statement. The identifier validation is good; this is a defense-in-depth suggestion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java` around lines 110 - 130, validateDuckLakeConfig currently only checks ducklakeDataPath and ducklakeMetadataDb for null/blank; add defense-in-depth by validating these path values for dangerous characters (e.g., backticks `, semicolons ;, newlines, null bytes, path traversal sequences like ../, and other shell/SQL-control chars) before they are used in ATTACH statements. Update validateDuckLakeConfig to either reject values that match a conservative safe path regex (allowing only expected characters: letters, digits, underscores, dashes, slashes, dots) or explicitly check for and reject unsafe characters/patterns, and reference c.getDucklakeDataPath() and c.getDucklakeMetadataDb() in the checks; keep SAFE_IDENTIFIER unchanged for identifiers and continue to use escapeSqlSingleQuotes where appropriate.
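One way to sketch that defense-in-depth check; the allowed character set is an assumption that intentionally admits `s3://...` URIs and may need widening for other deployments:

```java
import java.util.regex.Pattern;

// Conservative path check sketched from the suggestion above: reject
// blank values, traversal sequences, and anything outside a small
// whitelist (letters, digits, underscore, dot, slash, colon, dash).
final class DuckLakePathValidator {
    private static final Pattern SAFE_PATH =
            Pattern.compile("[A-Za-z0-9_./:\\-]+");

    static void validatePath(String value, String fieldName) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException(fieldName + " must not be blank");
        }
        if (value.contains("..")) {
            throw new IllegalArgumentException(fieldName + " must not contain '..'");
        }
        if (!SAFE_PATH.matcher(value).matches()) {
            throw new IllegalArgumentException(fieldName + " contains unsafe characters");
        }
    }
}
```

This complements, rather than replaces, `escapeSqlSingleQuotes` and the existing `SAFE_IDENTIFIER` check for identifiers.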
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 83-88: The generated Database currently sets databaseType via
Database.setDatabaseType(connection.getDbType()) to "ducklake", but
DDLFactory.ddlForDatabaseType(...) (called from Cli.java:184) requires a
registered `@RosettaModule` for that database type; either change
DuckLakeGenerator to emit a databaseType that already has DDL handlers (e.g., a
supported alias) or add the missing DDL pieces: implement and register a DDL
generator class annotated with `@RosettaModule`(name="ducklake",
type=DDL_GENERATOR) and a corresponding executor implementation (DDL executor
registration) so DDLFactory can locate handlers for "ducklake"; update
DuckLakeGenerator, the new `@RosettaModule` classes, and ensure tests exercise
DDLFactory.ddlForDatabaseType("ducklake").
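For illustration, a registration along the lines this comment describes might look like the following; the `@RosettaModule` annotation and `RosettaModuleTypes` enum are stubbed here so the sketch compiles standalone, and their real shapes in the project may differ:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stubs standing in for the project's real annotation and enum.
enum RosettaModuleTypes { DDL_GENERATOR }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface RosettaModule {
    String name();
    RosettaModuleTypes type();
}

// Hypothetical DDL generator registered for dbType "ducklake", so a
// module loader scanning for @RosettaModule annotations could find it.
@RosettaModule(name = "ducklake", type = RosettaModuleTypes.DDL_GENERATOR)
final class DuckLakeDDLGenerator {
    String createTableDdl(String schema, String table) {
        return "CREATE TABLE \"" + schema + "\".\"" + table + "\" (...);";
    }
}
```

The point of the sketch is the annotation wiring: without a class carrying `name = "ducklake"` and the DDL generator type (plus a matching executor), `DDLFactory.ddlForDatabaseType("ducklake")` has nothing to resolve.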
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cli/src/test/java/integration/DB2IntegrationTest.java
- cli/src/test/java/integration/RedshiftDDLIntegrationTest.java
- source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java
Summary by CodeRabbit
New Features
Improvements
Tests