-
Notifications
You must be signed in to change notification settings - Fork 287
feat: add multi-extractor support with improved UI and error handling… #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughExtraction endpoints changed to accept multiple extractors as a Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Endpoint as /extract Endpoint
participant Source as createSource()
participant Pipeline as performExtraction()
participant Collector as QuadCollectorDestination
participant Formatter as createFormatter()
participant Response as Response Builder
Client->>Endpoint: GET/POST with extractors: List[String], format, title/revid/xml
Endpoint->>Source: createSource(title, revid, xmlUrl)
Source-->>Endpoint: Source (XML or Wiki)
loop per extractor
Endpoint->>Pipeline: performExtraction(extractor, source, Collector)
Pipeline->>Collector: emit quads
Collector-->>Pipeline: accumulated quads
Pipeline-->>Endpoint: success or error (with warning)
end
Endpoint->>Formatter: createFormatter(format, headers)
Formatter-->>Endpoint: formatter instance
Endpoint->>Response: format accumulated quads + include warnings/errors
Response-->>Client: HTTP response (body + status reflecting partial failures)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)
191-228: Improve XML parsing error handling and resource management in createSource
- Wrap
XML.loadStringand response reading in try/catch to handleSAXException,IOExceptionetc., and rethrow as aWebApplicationException(e.g. 400 Bad Request) so malformed XML doesn’t bubble up as a 500.- Avoid loading the entire response into memory via
mkString; switch to a streaming parser or enforce a maximum response size.- Ensure the HTTP response (not just the client) is closed in the
finallyblock—e.g. callresponse.close()or explicitly close the entity’s input stream.
🧹 Nitpick comments (2)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)
72-86: Consider accessibility improvements for the checkbox UI.The checkbox UI is functional, but consider these enhancements:
- The
<select>dropdown at line 73 and the checkboxes below create a somewhat unusual UX pattern where users first pick a "mode" and then optionally customize. This could be simplified.- ID sanitization at line 83 is good for avoiding HTML/CSS issues, but ensure the original extractor names are preserved in the
valueattribute (which they are).- Consider adding ARIA attributes to improve screen reader support, e.g.,
aria-describedbyto link the select with the checkbox container.
177-187: Document thread-safety assumption for QuadCollectorDestination.The
QuadCollectorDestinationusesArrayBuffer, which is not thread-safe. While the current usage inperformExtraction(lines 265-289) is sequential and safe, future modifications could inadvertently introduce concurrent writes.Consider adding a comment to document this constraint:
/** - * Custom destination that collects quads in memory + * Custom destination that collects quads in memory. + * Note: Not thread-safe. Assumes sequential write operations. */ private class QuadCollectorDestination extends Destination {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
TerseFormatter(13-14)core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
DeduplicatingDestination(9-10)core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
WriterDestination(16-17)server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)
Server(26-31)Server(159-159)getInstance(171-176)extractWithSpecificExtractor(87-106)server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
writeHeader(13-18)core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
fromXML(61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)
333-387: GET endpoint properly handles errors and defaults.The refactored GET endpoint has solid error handling:
- Line 344-345: Validates required parameters before processing
- Lines 370-386: Comprehensive exception handling with appropriate HTTP status codes
- Line 378-384: Full stack trace logging for debugging
- Lines 358-362: Backward-compatible default to "mappings" when no extractors specified
The comma-joining at line 361 is related to the fragility flagged in the
performExtractionmethod—consider the alternative approach suggested there.
389-399: LGTM!The
createFormattermethod cleanly maps format strings to their corresponding formatter implementations using pattern matching. The default case falling back to TriX is reasonable.
449-462: POST endpoint follows GET pattern.The POST endpoint has been refactored to match the GET endpoint's structure, delegating to
performExtraction. The same comma-joining approach at line 460 has the fragility noted in the GET endpoint review.Otherwise, the delegation pattern is clean and reduces code duplication as intended.
| <script type="text/javascript"> | ||
| {scala.xml.Unparsed(""" | ||
| function toggleCustomExtractors() { | ||
| var mode = document.getElementById('extractorMode').value; | ||
| var customDiv = document.getElementById('customExtractors'); | ||
| if (mode === 'custom-selection') { | ||
| customDiv.style.display = 'block'; | ||
| } else { | ||
| customDiv.style.display = 'none'; | ||
| } | ||
| } | ||
| document.querySelector('form').addEventListener('submit', function(e) { | ||
| var mode = document.getElementById('extractorMode').value; | ||
| if (mode === 'mappings') { | ||
| var input = document.createElement('input'); | ||
| input.type = 'hidden'; | ||
| input.name = 'extractors'; | ||
| input.value = 'mappings'; | ||
| this.appendChild(input); | ||
| } else if (mode === 'custom') { | ||
| var input = document.createElement('input'); | ||
| input.type = 'hidden'; | ||
| input.name = 'extractors'; | ||
| input.value = 'custom'; | ||
| this.appendChild(input); | ||
| } | ||
| }); | ||
| """)} | ||
| </script> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JavaScript form manipulation has potential issues.
The JavaScript code dynamically appends hidden inputs during form submission (lines 105-117), but:
- If the user submits the form multiple times (unlikely but possible with back button), hidden inputs could accumulate since they're never removed.
- The script doesn't validate that at least one extractor is selected when in "custom-selection" mode, potentially submitting an empty list.
Consider this improvement:
document.querySelector('form').addEventListener('submit', function(e) {
var mode = document.getElementById('extractorMode').value;
+
+ // Remove any previously added hidden inputs
+ var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]');
+ existing.forEach(function(el) { el.remove(); });
if (mode === 'mappings') {
var input = document.createElement('input');
input.type = 'hidden';
input.name = 'extractors';
input.value = 'mappings';
this.appendChild(input);
} else if (mode === 'custom') {
var input = document.createElement('input');
input.type = 'hidden';
input.name = 'extractors';
input.value = 'custom';
this.appendChild(input);
+ } else if (mode === 'custom-selection') {
+ // Validate at least one checkbox is selected
+ var checked = this.querySelectorAll('input[name="extractors"]:checked');
+ if (checked.length === 0) {
+ alert('Please select at least one extractor');
+ e.preventDefault();
+ return false;
+ }
}
});| case specificExtractor if specificExtractor.contains(",") => | ||
| val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty) | ||
| logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comma-based multi-extractor detection is fragile.
Line 258 uses contains(",") to detect multiple extractors, and line 259 splits on comma. This approach has issues:
- If an extractor name legitimately contains a comma (unlikely but not impossible), it would be incorrectly split.
- The logic assumes the frontend will send comma-separated values when multiple checkboxes are selected. However, HTML forms typically send multiple values with the same name as separate query parameters (e.g.,
?extractors=foo&extractors=bar), which JAX-RS automatically collects into theList[String]parameter. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.
Consider this refactor to work directly with the list:
- case specificExtractor if specificExtractor.contains(",") =>
- val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
+ case specificExtractor if specificExtractor == "MULTI" || extractors.asScala.size > 1 =>
+ val extractorNames = extractors.asScala.toArrayThen at the call site (line 361), don't join:
- extractors.asScala.mkString(",")
+ if (extractors.asScala.size == 1) extractors.asScala.head else "MULTI"This eliminates the string manipulation entirely.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala
around lines 258 to 260, the code detects multiple extractors by checking for
contains(",") and splitting the string; instead, treat extractors as a
List[String] (the JAX-RS behaviour) and remove any comma-based parsing: replace
the contains/split branch with logic that accepts the incoming List[String]
directly (use .map(_.trim).filter(_.nonEmpty) on the list), log
extractorNames.length and extractorNames.mkString(", "), and process
extractorNames without string splitting; also update the call site around line
361 to stop joining the list into a single comma string and pass the
List[String] through as-is so no comma manipulation is necessary anywhere.
b67ad04 to
da82f68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)
258-262: Comma-based multi-extractor parsing is fragile.Line 258 detects multiple extractors using
contains(",")and line 259 splits on comma. This approach has issues:
- If an extractor name contains a comma (unlikely but possible), it would be split incorrectly.
- HTML forms naturally send multiple parameters as
?extractors=foo&extractors=bar, which JAX-RS collects intoList[String]automatically. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.- The logic conflates string manipulation with multi-value handling.
Consider refactoring to work directly with the list:
-case specificExtractor if specificExtractor.contains(",") => - val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty) - logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}") +case _ => + // Split the comma-separated string into individual extractor names + val extractorNames = extractorName.split(",").map(_.trim).filter(_.nonEmpty) + + // If only one extractor after parsing, delegate to single-extractor branch + if (extractorNames.length == 1) { + // Handle as single extractor (reuse existing logic) + val singleName = extractorNames.head + val writer = new StringWriter + val formatter = createFormatter(finalFormat, writer) + val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter)) + val source = sourceFactory() + singleName match { + case "mappings" => Server.getInstance().extractor.extract(source, destination, language, false) + case "custom" => Server.getInstance().extractor.extract(source, destination, language, true) + case _ => Server.getInstance().extractWithSpecificExtractor(source, destination, language, singleName) + } + Response.ok(writer.toString) + .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8") + .build() + } else { + // Handle multiple extractors + logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")This eliminates the fragile pattern matching on
contains(",")while maintaining current functionality.
291-299: Error status logic relies on fragile string matching.Lines 296-297 determine HTTP status by checking if error messages contain "not found". This string matching is fragile:
- Line 280 creates "Extractor '$name' not found" for
IllegalArgumentException- Line 284 creates generic messages for other exceptions
- Line 296 checks for literal "not found" substring, which could have false positives if other errors mention those words
Consider tracking exception types separately:
val errors = new ArrayBuffer[String]() + val notFoundErrors = new ArrayBuffer[String]() val collector = new QuadCollectorDestination() for (name <- extractorNames) { try { // ... extraction logic ... } catch { case e: IllegalArgumentException => val errorMsg = s"Extractor '$name' not found" logger.warning(s"$errorMsg: ${e.getMessage}") errors += errorMsg + notFoundErrors += name case e: Exception => val errorMsg = s"Failed to extract with '$name': ${e.getMessage}" logger.severe(errorMsg) e.printStackTrace() errors += errorMsg } } if (errors.nonEmpty) { logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}") if (collector.quads.isEmpty) { val errorMessage = if (errors.size == 1) errors.head else s"Multiple extractor errors:\n${errors.mkString("\n")}" - val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST + val statusCode = if (notFoundErrors.nonEmpty) Response.Status.BAD_REQUEST else Response.Status.INTERNAL_SERVER_ERROR throw new WebApplicationException(new Exception(errorMessage), statusCode) } }
89-120: Consider improving form submission handling.The JavaScript form manipulation has two potential improvements:
Accumulation of hidden inputs: Lines 105-117 append hidden inputs on form submission, but if the user navigates back and resubmits (unlikely but possible), inputs could accumulate since they're never removed.
Missing validation: In "custom-selection" mode, there's no client-side check that at least one extractor is selected, potentially submitting an empty list.
The backend will default to "mappings" if no extractors are provided (line 358-359), so these are low-risk improvements:
document.querySelector('form').addEventListener('submit', function(e) { var mode = document.getElementById('extractorMode').value; + + // Remove any previously added hidden inputs to prevent accumulation + var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]'); + existing.forEach(function(el) { el.remove(); }); if (mode === 'mappings') { var input = document.createElement('input'); input.type = 'hidden'; input.name = 'extractors'; input.value = 'mappings'; this.appendChild(input); } else if (mode === 'custom') { var input = document.createElement('input'); input.type = 'hidden'; input.name = 'extractors'; input.value = 'custom'; this.appendChild(input); + } else if (mode === 'custom-selection') { + // Optional: validate at least one checkbox is selected + var checked = this.querySelectorAll('input[name="extractors"]:checked'); + if (checked.length === 0) { + alert('Please select at least one extractor'); + e.preventDefault(); + return false; + } } });
🧹 Nitpick comments (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)
358-362: Consider eliminating comma-joining to simplify multi-extractor handling.Line 361 joins the extractor list with commas:
extractors.asScala.mkString(","). This creates the comma-separated string that must be parsed later inperformExtraction(line 258-259).Since
performExtractionalready handles different extractor modes via pattern matching, consider passing a marker for multi-extractor mode instead:val extractorName = if (extractors == null || extractors.isEmpty) { "mappings" +} else if (extractors.asScala.size == 1) { + extractors.asScala.head } else { - extractors.asScala.mkString(",") + // Join for backward compatibility, but consider passing List directly in future refactor + extractors.asScala.mkString(",") }Better yet, refactor
performExtractionto acceptList[String]directly instead of a single string, eliminating the need for comma manipulation entirely. This would require updating the method signature and pattern matching logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
dump/src/test/resources/shaclTestsCoverageTable.md(14 hunks)server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
TerseFormatter(13-14)core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
DeduplicatingDestination(9-10)core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
WriterDestination(16-17)server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)
Server(26-31)Server(159-159)extractWithSpecificExtractor(87-106)server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
writeHeader(13-18)core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
fromXML(61-61)
🪛 markdownlint-cli2 (0.18.1)
dump/src/test/resources/shaclTestsCoverageTable.md
29-29: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
29-29: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
30-30: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
30-30: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
31-31: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
31-31: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
32-32: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
32-32: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
43-43: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
43-43: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
54-54: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
54-54: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
64-64: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
64-64: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
65-65: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
65-65: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
66-66: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
66-66: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
67-67: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
67-67: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
68-68: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
68-68: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
69-69: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
69-69: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
70-70: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
70-70: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
71-71: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
71-71: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
82-82: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
82-82: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
92-92: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
92-92: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
93-93: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
93-93: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
94-94: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
94-94: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
95-95: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
95-95: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
104-104: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
104-104: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
114-114: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
114-114: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
124-124: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
124-124: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
125-125: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
125-125: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
126-126: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
126-126: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
127-127: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
127-127: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
128-128: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
128-128: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
129-129: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
129-129: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
130-130: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
131-131: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
131-131: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
132-132: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
132-132: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
133-133: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
133-133: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
134-134: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
134-134: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
135-135: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
135-135: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
136-136: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
136-136: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
137-137: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
137-137: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
138-138: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
138-138: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
139-139: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
139-139: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
148-148: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
148-148: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
159-159: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
159-159: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data
(MD056, table-column-count)
160-160: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
160-160: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
161-161: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
161-161: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
162-162: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
162-162: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
163-163: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
163-163: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
164-164: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
164-164: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data
(MD056, table-column-count)
165-165: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
165-165: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
166-166: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
166-166: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
167-167: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
167-167: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
168-168: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
168-168: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
169-169: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
169-169: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
170-170: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
170-170: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
171-171: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
171-171: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
172-172: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
172-172: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
173-173: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
174-174: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
174-174: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
218-218: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
218-218: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
219-219: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
219-219: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
231-231: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
231-231: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
255-255: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
255-255: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
256-256: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
256-256: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
257-257: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe
(MD055, table-pipe-style)
257-257: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data
(MD056, table-column-count)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
… (no whitespaces) Add support for selecting and running multiple extractors simultaneously through an improved checkbox-based UI with proper error handling and code refactoring. Changes: - Replace dropdown with checkboxes for extractor selection - Update @QueryParam to accept List<String> for multiple extractors - Add QuadCollectorDestination to accumulate results from multiple extractors - Sanitize HTML IDs to prevent invalid characters in checkbox elements - Return appropriate HTTP status codes (400 for invalid, 500 for failures) Refactoring: - Extract common logic into performExtraction() helper method - Reduce code duplication from 30.6% to <3% - Add createSource() and createFormatter() helper methods - GET/POST endpoints now delegate to helper with source factory pattern - Fixes SonarQube quality gate failure Bug fixes: - Fix "mappings" and "custom" handling in multi-extractor selection Added pattern matching to route these mode flags correctly instead of treating them as extractor names, allowing users to combine extraction modes with specific extractors in a single request
- Redesign extractor selection with dropdown and custom panel - Allow partial results when some extractors fail - Add X-Extraction-Warnings response header - Include warning comments in text-based output formats - Enhanced error logging per extractor
da82f68 to
4731988
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)
89-120: Past review concern remains valid.The JavaScript form manipulation issue flagged in the previous review (potential accumulation of hidden inputs on multiple submissions and missing validation for empty selections in custom-selection mode) has not been addressed in this commit.
258-260: Past review concern remains valid.The comma-based multi-extractor detection issue flagged in the previous review (fragile string splitting that should be replaced with direct List handling) has not been addressed in this commit.
🧹 Nitpick comments (4)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (4)
199-211: Consider extracting user agent configuration logic.The user agent configuration logic (lines 199-211) appears to duplicate logic that might exist in
WikiApi. If this is the case, consider extracting it into a shared utility method to maintain consistency and reduce duplication.// In a shared utility object: object HttpClientUtils { def configureUserAgent(request: HttpGet): Unit = { val customUserAgentEnabled = System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean val customUserAgentText = System.getProperty("extract.wikiapi.customUserAgent.text", "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])") if (customUserAgentEnabled) { request.setHeader("User-Agent", customUserAgentText) } } }Then in
createSource:val request = new HttpGet(xmlUrl) - // Apply user agent if enabled (same logic as WikiApi) - val customUserAgentEnabled = try { - System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean - } catch { - case ex: Exception => false - } - val customUserAgentText = try { - System.getProperty("extract.wikiapi.customUserAgent.text", "curl/8.6.0") - } catch { - case ex: Exception => "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])" - } - if (customUserAgentEnabled) { - request.setHeader("User-Agent", customUserAgentText) - } + HttpClientUtils.configureUserAgent(request)
238-256: Reduce duplication between 'mappings' and 'custom' cases.The "mappings" and "custom" cases (lines 238-256) are nearly identical, differing only in the boolean parameter passed to
extract(). This duplication can be eliminated.Apply this diff to consolidate the cases:
- case "mappings" => - val writer = new StringWriter - val formatter = createFormatter(finalFormat, writer) - val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter)) - val source = sourceFactory() - Server.getInstance().extractor.extract(source, destination, language, false) - Response.ok(writer.toString) - .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8") - .build() - - case "custom" => + case "mappings" | "custom" => + val useAllExtractors = extractorName == "custom" val writer = new StringWriter val formatter = createFormatter(finalFormat, writer) val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter)) val source = sourceFactory() - Server.getInstance().extractor.extract(source, destination, language, true) + Server.getInstance().extractor.extract(source, destination, language, useAllExtractors) Response.ok(writer.toString) .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8") .build()
269-276: Extract single-extractor logic to eliminate duplication.The nested pattern match (lines 269-276) duplicates the extraction logic from the outer pattern match (lines 238-256, 321-330). This creates maintenance issues if the logic needs to change.
Extract the single-extractor logic into a helper method:
private def extractSingle( name: String, source: org.dbpedia.extraction.sources.Source, destination: Destination, language: Language ): Unit = { name match { case "mappings" => Server.getInstance().extractor.extract(source, destination, language, false) case "custom" => Server.getInstance().extractor.extract(source, destination, language, true) case _ => Server.getInstance().extractWithSpecificExtractor(source, destination, language, name) } }Then use it in the loop:
for (name <- extractorNames) { try { logger.info(s"Processing extractor: $name") val freshSource = sourceFactory() - name match { - case "mappings" => - Server.getInstance().extractor.extract(freshSource, collector, language, false) - case "custom" => - Server.getInstance().extractor.extract(freshSource, collector, language, true) - case _ => - Server.getInstance().extractWithSpecificExtractor(freshSource, collector, language, name) - } + extractSingle(name, freshSource, collector, language) logger.info(s"Completed extractor: $name")This also makes it easier to use the same logic in the outer pattern match cases.
358-362: Remove unnecessary List-to-string-to-array conversion.Lines 358-362 (and similarly line 460) join the List[String] into a comma-separated string, which is then split back into an array at line 259. This roundtrip string manipulation is fragile and unnecessary.
Refactor
performExtractionto acceptList[String]directly:private def performExtraction( - extractorName: String, + extractorNames: List[String], sourceFactory: () => org.dbpedia.extraction.sources.Source, finalFormat: String, contentType: String ): Response = { - extractorName match { - case "mappings" | "custom" => /* ... */ - case specificExtractor if specificExtractor.contains(",") => - val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty) + extractorNames match { + case List("mappings") | List("custom") => /* ... */ + case names if names.size > 1 => + val cleanedNames = names.map(_.trim).filter(_.nonEmpty) // ... rest of multi-extractor logic ... - case specificExtractor => /* single extractor case */ + case List(name) => /* single extractor case using name */ + case _ => throw new IllegalArgumentException("No extractors specified") } }Then update the call sites:
- val extractorName = if (extractors == null || extractors.isEmpty) { - "mappings" - } else { - extractors.asScala.mkString(",") - } + val extractorsList = if (extractors == null || extractors.isEmpty) { + List("mappings") + } else { + extractors.asScala.toList + } try { performExtraction( - extractorName, + extractorsList, () => createSource(xmlUrl, revid, title), finalFormat, contentType )This eliminates all comma-based string manipulation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
TerseFormatter(13-14)core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
DeduplicatingDestination(9-10)core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
WriterDestination(16-17)server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)
Server(26-31)Server(159-159)extractWithSpecificExtractor(87-106)server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
writeHeader(13-18)core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
fromXML(61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)
82-85: LGTM: HTML ID sanitization is well-implemented.The ID sanitization on line 83 correctly replaces non-alphanumeric characters (except underscore and hyphen) with underscores, preventing invalid HTML IDs from extractor names that might contain special characters.
177-187: LGTM: QuadCollectorDestination implementation is correct for single-threaded use.The in-memory quad collector is well-implemented for accumulating extraction results. The
writemethod efficiently appends all quads using++=.Note: The implementation is not thread-safe. If multiple extractors might run concurrently in the future, consider using a synchronized collection or explicit locking.
389-399: LGTM: createFormatter implementation is clean and correct.The formatter mapping logic is well-structured with appropriate defaults. The pattern match covers all supported formats and defaults to TriX for unknown formats.
| if (errors.nonEmpty) { | ||
| logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}") | ||
| if (collector.quads.isEmpty) { | ||
| val errorMessage = if (errors.size == 1) errors.head | ||
| else s"Multiple extractor errors:\n${errors.mkString("\n")}" | ||
| val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST | ||
| else Response.Status.INTERNAL_SERVER_ERROR | ||
| throw new WebApplicationException(new Exception(errorMessage), statusCode) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String-based error status logic still present.
The past review comment for this section was marked as "Addressed," but the current code (line 296) still uses fragile string matching (errors.exists(_.contains("not found"))) to decide between HTTP 400 and 500 status codes.
The suggested type-based approach from the previous review has not been implemented.
| if (errors.nonEmpty && (finalFormat.contains("turtle") || finalFormat.contains("n-triples") || finalFormat.contains("n-quads"))) { | ||
| writer.write("# Warning: Partial results - some extractors failed:\n") | ||
| errors.foreach(err => writer.write(s"# - $err\n")) | ||
| writer.write("\n") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning comments only visible for text-based formats.
Lines 305-309 write warning comments only for turtle and N-Triples/N-Quads formats. Warnings are not included in the response body for trix and rdf-json formats, making errors invisible to users who don't inspect HTTP headers.
Consider either:
- Adding format-appropriate warnings for all formats (e.g., XML comments for trix, JSON field for rdf-json)
- Documenting that partial-error warnings are only available via the
X-Extraction-WarningsHTTP header for XML and JSON formats
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala
around lines 305-309, the code writes warning comments only for text-based RDF
serializations (turtle, n-triples, n-quads), leaving trix and rdf-json responses
without visible partial-extraction warnings; update the response generation so
that for trix you emit an XML comment or a dedicated <warnings> element
appropriate to the TRiX wrapper before the RDF graph, and for rdf-json you
include a top-level "warnings" array/object in the JSON response (or embed
warnings in the JSON-LD structure) while retaining the existing
X-Extraction-Warnings header; ensure the format-specific warning insertion is
only applied when errors.nonEmpty and does not break valid serialization for
each format.
|
|
||
| val source = if (xmlUrl != null && !xmlUrl.isEmpty) { | ||
| // Fetch XML from custom URL with user agent | ||
| // Helper method to create source from xmlUrl or fallback to WikiSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Helper method to create source from xmlUrl or fallback to WikiSource | |
| // Helper method to create source from xmlUrl or fall back to WikiSource |
- Added test-extraction-combinations.sh script to test different extractor combinations - Tests mappings-only, all-enabled, and PageId+Label combinations - Validates both N-Triples and Turtle output formats - Integrated into GitHub Actions workflow - TriX format warnings already implemented in Extraction.scala How to run manually: cd scripts/src/main/bash chmod +x test-extraction-combinations.sh ./test-extraction-combinations.sh Note: Server must be running on localhost:9999 before running tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/src/main/bash/test-extraction-combinations.sh (1)
1-21: Consider adding error case tests.The current tests only cover successful extraction scenarios. Consider adding tests for error cases to improve robustness:
- Invalid extractor names (should return 400)
- Missing required parameters (should return 400)
- Requesting partial results when some extractors fail (should include
X-Extraction-Warningsheader)Example additional tests:
# Test 5: Invalid extractor (should fail with 400) if ! curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=NonExistentExtractor" >/dev/null 2>&1; then echo "✓ Invalid extractor rejected" else echo "✗ FAIL: Invalid extractor should return error" FAILED=1 fi # Test 6: Missing title/revid/xmlUrl (should fail with 400) if ! curl -sf "$API/extract?format=n-triples&extractors=mappings" >/dev/null 2>&1; then echo "✓ Missing parameters rejected" else echo "✗ FAIL: Missing parameters should return error" FAILED=1 fi
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/server-web-api-test.yml(1 hunks)scripts/src/main/bash/test-extraction-combinations.sh(1 hunks)server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
TerseFormatter(13-14)core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
DeduplicatingDestination(9-10)core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
WriterDestination(16-17)server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)
Server(26-31)Server(159-159)getInstance(171-176)extractWithSpecificExtractor(87-106)server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
writeHeader(13-18)core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
fromXML(61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
- GitHub Check: build
🔇 Additional comments (8)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (7)
6-6: LGTM - Imports support new multi-extractor functionality.The new imports for formatters, destinations, Quad, and ArrayBuffer are all necessary for the multi-extractor flow and in-memory quad collection introduced in this PR.
Also applies to: 16-17, 20-20, 23-23
177-187: LGTM - Clean in-memory quad collector.The
QuadCollectorDestinationprovides a simple, correct implementation for accumulating quads from multiple extractors. The use ofArrayBufferis safe here since extractors run sequentially (line 265).
190-228: LGTM - Solid source creation logic.The
createSourcehelper properly handles XML URL fetching with appropriate error handling, custom user-agent support, and resource cleanup. The fallback to WikiSource is clear.Note: TallTed already suggested improving the comment at line 190 in a previous review.
230-289: LGTM - Well-structured multi-extractor flow.The
performExtractionhelper centralizes extraction logic effectively. Key strengths:
- Lines 269-276: The nested pattern match correctly distinguishes "mappings" and "custom" mode flags from actual extractor names, enabling mode+extractor combinations.
- Lines 262-289: Sequential extractor processing with per-extractor error collection allows partial results when some extractors fail.
- Line 293: Proper handling when all extractors fail vs. partial success.
The overall design supports the multi-extractor feature while maintaining error transparency.
352-353: LGTM - Good validation and delegation.Lines 352-353 properly validate that at least one of title, revid, or xmlUrl is provided. The delegation to
performExtractionwith the source factory pattern (lines 372-377) is clean and eliminates code duplication.Also applies to: 371-377
397-407: LGTM - Clean formatter factory.The
createFormatterhelper effectively centralizes format selection logic, reducing duplication across GET and POST endpoints. All supported formats are correctly mapped to their respective formatter implementations.
457-470: LGTM - Consistent POST endpoint refactoring.The POST endpoint correctly mirrors the GET endpoint pattern: accepting
java.util.List[String]for extractors and delegating toperformExtractionwith an XML source factory. The refactoring eliminates code duplication while maintaining consistent behavior..github/workflows/server-web-api-test.yml (1)
54-57: LGTM - Good test coverage for multi-extractor feature.The new CI step properly exercises multi-extractor combinations, validating the core functionality introduced in this PR. The placement after the coordinate test and before server shutdown is appropriate.
| #!/bin/bash | ||
| # Test Multi-Extractor API (PR #794) | ||
|
|
||
| API="http://localhost:9999/extraction/en" | ||
| TITLE="Albert_Einstein" | ||
|
|
||
| echo "Testing Multi-Extractor API" | ||
|
|
||
| # Test 1: Mappings-only + N-Triples | ||
| curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL" | ||
|
|
||
| # Test 2: All-enabled + Turtle | ||
| curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL" | ||
|
|
||
| # Test 3: PageId + Label + N-Triples | ||
| curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL" | ||
|
|
||
| # Test 4: PageId + Label + Turtle | ||
| curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL" | ||
|
|
||
| echo "Done" No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Test script doesn't fail CI on test failures.
The script prints "✗ FAIL" messages but never exits with a non-zero status code, so the CI step will always pass even when tests fail. Additionally, curl errors (network issues, server errors) are silently ignored.
Apply these changes to make tests fail the CI build:
#!/bin/bash
# Test Multi-Extractor API (PR #794)
+set -e # Exit on first error
+FAILED=0
API="http://localhost:9999/extraction/en"
TITLE="Albert_Einstein"
echo "Testing Multi-Extractor API"
# Test 1: Mappings-only + N-Triples
-curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then
+ echo "✓ Mappings N-Triples"
+else
+ echo "✗ FAIL: Mappings N-Triples"
+ FAILED=1
+fi
# Test 2: All-enabled + Turtle
-curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then
+ echo "✓ All-enabled Turtle"
+else
+ echo "✗ FAIL: All-enabled Turtle"
+ FAILED=1
+fi
# Test 3: PageId + Label + N-Triples
-curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then
+ echo "✓ PageId+Label N-Triples"
+else
+ echo "✗ FAIL: PageId+Label N-Triples"
+ FAILED=1
+fi
# Test 4: PageId + Label + Turtle
-curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then
+ echo "✓ PageId+Label Turtle"
+else
+ echo "✗ FAIL: PageId+Label Turtle"
+ FAILED=1
+fi
-echo "Done"
+if [ $FAILED -eq 0 ]; then
+ echo "✓ All tests passed"
+ exit 0
+else
+ echo "✗ Some tests failed"
+ exit 1
+fiNote: The -f flag makes curl fail on HTTP errors (4xx, 5xx), ensuring server errors are caught.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/bin/bash | |
| # Test Multi-Extractor API (PR #794) | |
| API="http://localhost:9999/extraction/en" | |
| TITLE="Albert_Einstein" | |
| echo "Testing Multi-Extractor API" | |
| # Test 1: Mappings-only + N-Triples | |
| curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL" | |
| # Test 2: All-enabled + Turtle | |
| curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL" | |
| # Test 3: PageId + Label + N-Triples | |
| curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL" | |
| # Test 4: PageId + Label + Turtle | |
| curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL" | |
| echo "Done" | |
| #!/bin/bash | |
| # Test Multi-Extractor API (PR #794) | |
| set -e # Exit on first error | |
| FAILED=0 | |
| API="http://localhost:9999/extraction/en" | |
| TITLE="Albert_Einstein" | |
| echo "Testing Multi-Extractor API" | |
| # Test 1: Mappings-only + N-Triples | |
| if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then | |
| echo "✓ Mappings N-Triples" | |
| else | |
| echo "✗ FAIL: Mappings N-Triples" | |
| FAILED=1 | |
| fi | |
| # Test 2: All-enabled + Turtle | |
| if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then | |
| echo "✓ All-enabled Turtle" | |
| else | |
| echo "✗ FAIL: All-enabled Turtle" | |
| FAILED=1 | |
| fi | |
| # Test 3: PageId + Label + N-Triples | |
| if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then | |
| echo "✓ PageId+Label N-Triples" | |
| else | |
| echo "✗ FAIL: PageId+Label N-Triples" | |
| FAILED=1 | |
| fi | |
| # Test 4: PageId + Label + Turtle | |
| if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then | |
| echo "✓ PageId+Label Turtle" | |
| else | |
| echo "✗ FAIL: PageId+Label Turtle" | |
| FAILED=1 | |
| fi | |
| if [ $FAILED -eq 0 ]; then | |
| echo "✓ All tests passed" | |
| exit 0 | |
| else | |
| echo "✗ Some tests failed" | |
| exit 1 | |
| fi |
🤖 Prompt for AI Agents
In scripts/src/main/bash/test-extraction-combinations.sh lines 1-21, the script
prints "✗ FAIL" but never returns a non-zero exit code and it silently ignores
curl HTTP/network errors; enable strict failure handling (e.g., set -euo
pipefail) at the top, make curl fail on HTTP errors and surface failures using
-f (and -S for error output) while keeping -s if desired, and ensure each test
checks the combined exit status (curl | grep) and exits non-zero on failure
(e.g., replace the current || echo "✗ FAIL" with a branch that echoes the
failure and exits 1) so the CI step fails when any test or curl fails.
- Add yellow warning banner & detailed error section in TriX XSLT - Detect WARNING comments and display user-friendly notices - Fix exit code logic in test-extraction-combinations.sh to fail correctly
|



… (no whitespaces)
Add support for selecting and running multiple extractors simultaneously through an improved checkbox-based UI with proper error handling and code refactoring.
Changes:
Refactoring:
Bug fixes:
Summary by CodeRabbit
New Features
Tests