feat: add multi-extractor support with improved UI and error handling… #794

haniyakonain · 2025-10-08T19:56:17Z

… (no whitespaces)

Add support for selecting and running multiple extractors simultaneously through an improved checkbox-based UI with proper error handling and code refactoring.

Changes:

Replace dropdown with checkboxes for extractor selection
Update @QueryParam to accept List for multiple extractors
Add QuadCollectorDestination to accumulate results from multiple extractors
Sanitize HTML IDs to prevent invalid characters in checkbox elements
Return appropriate HTTP status codes (400 for invalid, 500 for failures)

Refactoring:

Extract common logic into performExtraction() helper method
Reduce code duplication from 30.6% to <3%
Add createSource() and createFormatter() helper methods
GET/POST endpoints now delegate to helper with source factory pattern
Fixes SonarQube quality gate failure

Bug fixes:

Fix "mappings" and "custom" handling in multi-extractor selection Added pattern matching to route these mode flags correctly instead of treating them as extractor names, allowing users to combine extraction modes with specific extractors in a single request

Summary by CodeRabbit

New Features
- Support multiple extractors per request with selectable extractor modes and per-extractor controls.
- In-memory accumulation of extraction results enabling combined outputs and partial-result emission with warnings.
- Unified extraction pipeline with improved content negotiation and aggregated error reporting.
Tests
- Added CI step and script to exercise multi-extractor combinations and verify output formats.

coderabbitai · 2025-10-08T19:57:12Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Extraction endpoints changed to accept multiple extractors as a java.util.List[String]. An in-memory quad collector (QuadCollectorDestination) and unified extraction helpers (createSource, createFormatter, performExtraction) were added. UI, GET and POST flows, content negotiation, and error/partial-result handling were updated to support multi-extractor runs.

Changes

Cohort / File(s)	Summary
Extraction endpoint & pipeline `server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala`	Replace single-`String` extractor parameter with `java.util.List[String]` for GET and POST. Add `QuadCollectorDestination` to accumulate quads in-memory. Introduce helpers `createSource`, `createFormatter`, and `performExtraction` to centralize extraction logic, per-extractor processing, partial-error collection and aggregated response formatting. Update content negotiation and destination wiring.
CI workflow `.github/workflows/server-web-api-test.yml`	Add new workflow step "Run Multi-Extractor Combination Tests" to execute the test script in CI.
Test script `scripts/src/main/bash/test-extraction-combinations.sh`	Add a Bash script exercising multi-extractor API combinations and formats (n-triples, turtle), verifying responses with grep and printing pass/fail results.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint as /extract Endpoint
    participant Source as createSource()
    participant Pipeline as performExtraction()
    participant Collector as QuadCollectorDestination
    participant Formatter as createFormatter()
    participant Response as Response Builder

    Client->>Endpoint: GET/POST with extractors: List[String], format, title/revid/xml
    Endpoint->>Source: createSource(title, revid, xmlUrl)
    Source-->>Endpoint: Source (XML or Wiki)

    loop per extractor
        Endpoint->>Pipeline: performExtraction(extractor, source, Collector)
        Pipeline->>Collector: emit quads
        Collector-->>Pipeline: accumulated quads
        Pipeline-->>Endpoint: success or error (with warning)
    end

    Endpoint->>Formatter: createFormatter(format, headers)
    Formatter-->>Endpoint: formatter instance
    Endpoint->>Response: format accumulated quads + include warnings/errors
    Response-->>Client: HTTP response (body + status reflecting partial failures)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pay attention to Extraction.scala sections that wire destinations and manage per-extractor error aggregation.
Verify correctness of content negotiation and formatter selection for multi-extractor output.
Review the new test script for shell robustness and CI path correctness.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "feat: add multi-extractor support with improved UI and error handling…" directly and accurately summarizes the main changes in the changeset. The core modification is the introduction of multi-extractor support, which is substantiated by the refactored Extraction.scala endpoint methods that now accept `java.util.List[String]` instead of single strings, the introduction of QuadCollectorDestination for accumulating results, and the unified extraction pipeline. The secondary improvements mentioned in the title—improved UI and error handling—are also clearly present in the changes, including the refactored checkbox-based UI and enhanced HTTP status handling. The title is specific, concise, and clear enough that a teammate reviewing the commit history would understand the primary purpose of this PR.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)

191-228: Improve XML parsing error handling and resource management in createSource

Wrap XML.loadString and response reading in try/catch to handle SAXException, IOException etc., and rethrow as a WebApplicationException (e.g. 400 Bad Request) so malformed XML doesn’t bubble up as a 500.

Avoid loading the entire response into memory via mkString; switch to a streaming parser or enforce a maximum response size.

Ensure the HTTP response (not just the client) is closed in the finally block—e.g. call response.close() or explicitly close the entity’s input stream.

🧹 Nitpick comments (2)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)
72-86: Consider accessibility improvements for the checkbox UI.

The checkbox UI is functional, but consider these enhancements:

The <select> dropdown at line 73 and the checkboxes below create a somewhat unusual UX pattern where users first pick a "mode" and then optionally customize. This could be simplified.

ID sanitization at line 83 is good for avoiding HTML/CSS issues, but ensure the original extractor names are preserved in the value attribute (which they are).

Consider adding ARIA attributes to improve screen reader support, e.g., aria-describedby to link the select with the checkbox container.

177-187: Document thread-safety assumption for QuadCollectorDestination.

The QuadCollectorDestination uses ArrayBuffer, which is not thread-safe. While the current usage in performExtraction (lines 265-289) is sequential and safe, future modifications could inadvertently introduce concurrent writes.

Consider adding a comment to document this constraint:
  /**
-  * Custom destination that collects quads in memory
+  * Custom destination that collects quads in memory.
+  * Note: Not thread-safe. Assumes sequential write operations.
   */
  private class QuadCollectorDestination extends Destination {

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7485d00 and b67ad04.

📒 Files selected for processing (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)

core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)

TerseFormatter (13-14)

core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)

DeduplicatingDestination (9-10)

core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)

WriterDestination (16-17)

server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)

Server (26-31)

Server (159-159)

getInstance (171-176)

extractWithSpecificExtractor (87-106)

server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)

writeHeader (13-18)

core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)

fromXML (61-61)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build
GitHub Check: build
GitHub Check: build

🔇 Additional comments (3)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

333-387: GET endpoint properly handles errors and defaults.

The refactored GET endpoint has solid error handling:

Line 344-345: Validates required parameters before processing

Lines 370-386: Comprehensive exception handling with appropriate HTTP status codes

Line 378-384: Full stack trace logging for debugging

Lines 358-362: Backward-compatible default to "mappings" when no extractors specified

The comma-joining at line 361 is related to the fragility flagged in the performExtraction method—consider the alternative approach suggested there.

389-399: LGTM!

The createFormatter method cleanly maps format strings to their corresponding formatter implementations using pattern matching. The default case falling back to TriX is reasonable.

449-462: POST endpoint follows GET pattern.

The POST endpoint has been refactored to match the GET endpoint's structure, delegating to performExtraction. The same comma-joining approach at line 460 has the fragility noted in the GET endpoint review.

Otherwise, the delegation pattern is clean and reduces code duplication as intended.

coderabbitai · 2025-10-14T16:59:47Z

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala

+               <script type="text/javascript">
+                {scala.xml.Unparsed("""
+                function toggleCustomExtractors() {
+                  var mode = document.getElementById('extractorMode').value;
+                  var customDiv = document.getElementById('customExtractors');
+
+                  if (mode === 'custom-selection') {
+                    customDiv.style.display = 'block';
+                  } else {
+                    customDiv.style.display = 'none';
+                  }
+                }
+
+                document.querySelector('form').addEventListener('submit', function(e) {
+                  var mode = document.getElementById('extractorMode').value;
+
+                  if (mode === 'mappings') {
+                    var input = document.createElement('input');
+                    input.type = 'hidden';
+                    input.name = 'extractors';
+                    input.value = 'mappings';
+                    this.appendChild(input);
+                  } else if (mode === 'custom') {
+                    var input = document.createElement('input');
+                    input.type = 'hidden';
+                    input.name = 'extractors';
+                    input.value = 'custom';
+                    this.appendChild(input);
+                  }
+                });
+                """)}
+              </script>


⚠️ Potential issue | 🟡 Minor

JavaScript form manipulation has potential issues.

The JavaScript code dynamically appends hidden inputs during form submission (lines 105-117), but:

If the user submits the form multiple times (unlikely but possible with back button), hidden inputs could accumulate since they're never removed.

The script doesn't validate that at least one extractor is selected when in "custom-selection" mode, potentially submitting an empty list.

Consider this improvement:

document.querySelector('form').addEventListener('submit', function(e) { var mode = document.getElementById('extractorMode').value; + + // Remove any previously added hidden inputs + var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]'); + existing.forEach(function(el) { el.remove(); }); if (mode === 'mappings') { var input = document.createElement('input'); input.type = 'hidden'; input.name = 'extractors'; input.value = 'mappings'; this.appendChild(input); } else if (mode === 'custom') { var input = document.createElement('input'); input.type = 'hidden'; input.name = 'extractors'; input.value = 'custom'; this.appendChild(input); + } else if (mode === 'custom-selection') { + // Validate at least one checkbox is selected + var checked = this.querySelectorAll('input[name="extractors"]:checked'); + if (checked.length === 0) { + alert('Please select at least one extractor'); + e.preventDefault(); + return false; + } } });

coderabbitai · 2025-10-14T16:59:48Z

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala

+      case specificExtractor if specificExtractor.contains(",") =>
+        val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
+        logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")


⚠️ Potential issue | 🟡 Minor

Comma-based multi-extractor detection is fragile.

Line 258 uses contains(",") to detect multiple extractors, and line 259 splits on comma. This approach has issues:

If an extractor name legitimately contains a comma (unlikely but not impossible), it would be incorrectly split.

The logic assumes the frontend will send comma-separated values when multiple checkboxes are selected. However, HTML forms typically send multiple values with the same name as separate query parameters (e.g., ?extractors=foo&extractors=bar), which JAX-RS automatically collects into the List[String] parameter. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.

Consider this refactor to work directly with the list:

- case specificExtractor if specificExtractor.contains(",") => - val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty) + case specificExtractor if specificExtractor == "MULTI" || extractors.asScala.size > 1 => + val extractorNames = extractors.asScala.toArray

Then at the call site (line 361), don't join:

- extractors.asScala.mkString(",") + if (extractors.asScala.size == 1) extractors.asScala.head else "MULTI"

This eliminates the string manipulation entirely.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala around lines 258 to 260, the code detects multiple extractors by checking for contains(",") and splitting the string; instead, treat extractors as a List[String] (the JAX-RS behaviour) and remove any comma-based parsing: replace the contains/split branch with logic that accepts the incoming List[String] directly (use .map(_.trim).filter(_.nonEmpty) on the list), log extractorNames.length and extractorNames.mkString(", "), and process extractorNames without string splitting; also update the call site around line 361 to stop joining the list into a single comma string and pass the List[String] through as-is so no comma manipulation is necessary anywhere.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

258-262: Comma-based multi-extractor parsing is fragile.

Line 258 detects multiple extractors using contains(",") and line 259 splits on comma. This approach has issues:

If an extractor name contains a comma (unlikely but possible), it would be split incorrectly.
HTML forms naturally send multiple parameters as ?extractors=foo&extractors=bar, which JAX-RS collects into List[String] automatically. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.
The logic conflates string manipulation with multi-value handling.

Consider refactoring to work directly with the list:

-case specificExtractor if specificExtractor.contains(",") =>
-  val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
-  logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")
+case _ =>
+  // Split the comma-separated string into individual extractor names
+  val extractorNames = extractorName.split(",").map(_.trim).filter(_.nonEmpty)
+  
+  // If only one extractor after parsing, delegate to single-extractor branch
+  if (extractorNames.length == 1) {
+    // Handle as single extractor (reuse existing logic)
+    val singleName = extractorNames.head
+    val writer = new StringWriter
+    val formatter = createFormatter(finalFormat, writer)
+    val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
+    val source = sourceFactory()
+    singleName match {
+      case "mappings" => Server.getInstance().extractor.extract(source, destination, language, false)
+      case "custom" => Server.getInstance().extractor.extract(source, destination, language, true)
+      case _ => Server.getInstance().extractWithSpecificExtractor(source, destination, language, singleName)
+    }
+    Response.ok(writer.toString)
+      .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
+      .build()
+  } else {
+    // Handle multiple extractors
+    logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")

This eliminates the fragile pattern matching on contains(",") while maintaining current functionality.

291-299: Error status logic relies on fragile string matching.

Lines 296-297 determine HTTP status by checking if error messages contain "not found". This string matching is fragile:

Line 280 creates "Extractor '$name' not found" for IllegalArgumentException
Line 284 creates generic messages for other exceptions
Line 296 checks for literal "not found" substring, which could have false positives if other errors mention those words

Consider tracking exception types separately:

  val errors = new ArrayBuffer[String]()
+ val notFoundErrors = new ArrayBuffer[String]()
  val collector = new QuadCollectorDestination()

  for (name <- extractorNames) {
    try {
      // ... extraction logic ...
    } catch {
      case e: IllegalArgumentException =>
        val errorMsg = s"Extractor '$name' not found"
        logger.warning(s"$errorMsg: ${e.getMessage}")
        errors += errorMsg
+       notFoundErrors += name
      case e: Exception =>
        val errorMsg = s"Failed to extract with '$name': ${e.getMessage}"
        logger.severe(errorMsg)
        e.printStackTrace()
        errors += errorMsg
    }
  }

  if (errors.nonEmpty) {
    logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}")
    if (collector.quads.isEmpty) {
      val errorMessage = if (errors.size == 1) errors.head
                        else s"Multiple extractor errors:\n${errors.mkString("\n")}"
-     val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST
+     val statusCode = if (notFoundErrors.nonEmpty) Response.Status.BAD_REQUEST
                      else Response.Status.INTERNAL_SERVER_ERROR
      throw new WebApplicationException(new Exception(errorMessage), statusCode)
    }
  }

89-120: Consider improving form submission handling.

The JavaScript form manipulation has two potential improvements:

Accumulation of hidden inputs: Lines 105-117 append hidden inputs on form submission, but if the user navigates back and resubmits (unlikely but possible), inputs could accumulate since they're never removed.
Missing validation: In "custom-selection" mode, there's no client-side check that at least one extractor is selected, potentially submitting an empty list.

The backend will default to "mappings" if no extractors are provided (line 358-359), so these are low-risk improvements:

  document.querySelector('form').addEventListener('submit', function(e) {
    var mode = document.getElementById('extractorMode').value;
+   
+   // Remove any previously added hidden inputs to prevent accumulation
+   var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]');
+   existing.forEach(function(el) { el.remove(); });

    if (mode === 'mappings') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'mappings';
      this.appendChild(input);
    } else if (mode === 'custom') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'custom';
      this.appendChild(input);
+   } else if (mode === 'custom-selection') {
+     // Optional: validate at least one checkbox is selected
+     var checked = this.querySelectorAll('input[name="extractors"]:checked');
+     if (checked.length === 0) {
+       alert('Please select at least one extractor');
+       e.preventDefault();
+       return false;
+     }
    }
  });

🧹 Nitpick comments (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)
358-362: Consider eliminating comma-joining to simplify multi-extractor handling.

Line 361 joins the extractor list with commas: extractors.asScala.mkString(","). This creates the comma-separated string that must be parsed later in performExtraction (line 258-259).

Since performExtraction already handles different extractor modes via pattern matching, consider passing a marker for multi-extractor mode instead:
 val extractorName = if (extractors == null || extractors.isEmpty) {
   "mappings"
+} else if (extractors.asScala.size == 1) {
+  extractors.asScala.head
 } else {
-  extractors.asScala.mkString(",")
+  // Join for backward compatibility, but consider passing List directly in future refactor
+  extractors.asScala.mkString(",")
 }
Better yet, refactor performExtraction to accept List[String] directly instead of a single string, eliminating the need for comma manipulation entirely. This would require updating the method signature and pattern matching logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b67ad04 and da82f68.

📒 Files selected for processing (2)

dump/src/test/resources/shaclTestsCoverageTable.md (14 hunks)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)

core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)

TerseFormatter (13-14)

core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)

DeduplicatingDestination (9-10)

core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)

WriterDestination (16-17)

server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)

Server (26-31)

Server (159-159)

extractWithSpecificExtractor (87-106)

server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)

writeHeader (13-18)

core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)

fromXML (61-61)

🪛 markdownlint-cli2 (0.18.1)

dump/src/test/resources/shaclTestsCoverageTable.md

29-29: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

29-29: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

30-30: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

30-30: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

31-31: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

31-31: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

32-32: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

32-32: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

43-43: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

43-43: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

54-54: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

54-54: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

64-64: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

64-64: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

65-65: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

65-65: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

66-66: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

66-66: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

67-67: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

67-67: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

68-68: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

68-68: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

69-69: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

69-69: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

70-70: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

70-70: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

71-71: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

71-71: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

82-82: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

82-82: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

92-92: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

92-92: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

93-93: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

93-93: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

94-94: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

94-94: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

95-95: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

95-95: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

104-104: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

104-104: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

114-114: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

114-114: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

124-124: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

124-124: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

125-125: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

125-125: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

126-126: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

126-126: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

127-127: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

127-127: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

128-128: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

128-128: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

129-129: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

129-129: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

130-130: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

131-131: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

131-131: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

132-132: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

132-132: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

133-133: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

133-133: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

134-134: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

134-134: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

135-135: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

135-135: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

136-136: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

136-136: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

137-137: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

137-137: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

138-138: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

138-138: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

139-139: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

139-139: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

148-148: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

148-148: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

159-159: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

159-159: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)

160-160: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

160-160: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

161-161: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

161-161: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

162-162: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

162-162: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

163-163: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

163-163: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

164-164: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

164-164: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)

165-165: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

165-165: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

166-166: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

166-166: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

167-167: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

167-167: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

168-168: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

168-168: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

169-169: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

169-169: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

170-170: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

170-170: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

171-171: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

171-171: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

172-172: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

172-172: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

173-173: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

174-174: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

174-174: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

218-218: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

218-218: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

219-219: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

219-219: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

231-231: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

231-231: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

255-255: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

255-255: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

256-256: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

256-256: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

257-257: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)

257-257: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build
GitHub Check: build
GitHub Check: build

@QueryParam

… (no whitespaces) Add support for selecting and running multiple extractors simultaneously through an improved checkbox-based UI with proper error handling and code refactoring. Changes: - Replace dropdown with checkboxes for extractor selection - Update @QueryParam to accept List<String> for multiple extractors - Add QuadCollectorDestination to accumulate results from multiple extractors - Sanitize HTML IDs to prevent invalid characters in checkbox elements - Return appropriate HTTP status codes (400 for invalid, 500 for failures) Refactoring: - Extract common logic into performExtraction() helper method - Reduce code duplication from 30.6% to <3% - Add createSource() and createFormatter() helper methods - GET/POST endpoints now delegate to helper with source factory pattern - Fixes SonarQube quality gate failure Bug fixes: - Fix "mappings" and "custom" handling in multi-extractor selection Added pattern matching to route these mode flags correctly instead of treating them as extractor names, allowing users to combine extraction modes with specific extractors in a single request

- Redesign extractor selection with dropdown and custom panel - Allow partial results when some extractors fail - Add X-Extraction-Warnings response header - Include warning comments in text-based output formats - Enhanced error logging per extractor

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)

89-120: Past review concern remains valid.

The JavaScript form manipulation issue flagged in the previous review (potential accumulation of hidden inputs on multiple submissions and missing validation for empty selections in custom-selection mode) has not been addressed in this commit.

258-260: Past review concern remains valid.

The comma-based multi-extractor detection issue flagged in the previous review (fragile string splitting that should be replaced with direct List handling) has not been addressed in this commit.

🧹 Nitpick comments (4)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (4)

199-211: Consider extracting user agent configuration logic.

The user agent configuration logic (lines 199-211) appears to duplicate logic that might exist in WikiApi. If this is the case, consider extracting it into a shared utility method to maintain consistency and reduce duplication.

// In a shared utility object:
object HttpClientUtils {
  def configureUserAgent(request: HttpGet): Unit = {
    val customUserAgentEnabled = System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean
    val customUserAgentText = System.getProperty("extract.wikiapi.customUserAgent.text", 
      "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])")
    if (customUserAgentEnabled) {
      request.setHeader("User-Agent", customUserAgentText)
    }
  }
}

Then in createSource:

            val request = new HttpGet(xmlUrl)
-           // Apply user agent if enabled (same logic as WikiApi)
-           val customUserAgentEnabled = try {
-             System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean
-           } catch {
-             case ex: Exception => false
-           }
-           val customUserAgentText = try {
-             System.getProperty("extract.wikiapi.customUserAgent.text", "curl/8.6.0")
-           } catch {
-             case ex: Exception => "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])"
-           }
-           if (customUserAgentEnabled) {
-             request.setHeader("User-Agent", customUserAgentText)
-           }
+           HttpClientUtils.configureUserAgent(request)

238-256: Reduce duplication between 'mappings' and 'custom' cases.

The "mappings" and "custom" cases (lines 238-256) are nearly identical, differing only in the boolean parameter passed to extract(). This duplication can be eliminated.

Apply this diff to consolidate the cases:

-      case "mappings" =>
-        val writer = new StringWriter
-        val formatter = createFormatter(finalFormat, writer)
-        val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
-        val source = sourceFactory()
-        Server.getInstance().extractor.extract(source, destination, language, false)
-        Response.ok(writer.toString)
-          .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
-          .build()
-
-        case "custom" =>
+      case "mappings" | "custom" =>
+        val useAllExtractors = extractorName == "custom"
         val writer = new StringWriter
         val formatter = createFormatter(finalFormat, writer)
         val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
         val source = sourceFactory()
-        Server.getInstance().extractor.extract(source, destination, language, true)
+        Server.getInstance().extractor.extract(source, destination, language, useAllExtractors)
         Response.ok(writer.toString)
           .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
           .build()

269-276: Extract single-extractor logic to eliminate duplication.

The nested pattern match (lines 269-276) duplicates the extraction logic from the outer pattern match (lines 238-256, 321-330). This creates maintenance issues if the logic needs to change.

Extract the single-extractor logic into a helper method:

private def extractSingle(
  name: String,
  source: org.dbpedia.extraction.sources.Source,
  destination: Destination,
  language: Language
): Unit = {
  name match {
    case "mappings" =>
      Server.getInstance().extractor.extract(source, destination, language, false)
    case "custom" =>
      Server.getInstance().extractor.extract(source, destination, language, true)
    case _ =>
      Server.getInstance().extractWithSpecificExtractor(source, destination, language, name)
  }
}

Then use it in the loop:

         for (name <- extractorNames) {
           try {
             logger.info(s"Processing extractor: $name")
             val freshSource = sourceFactory()
-            name match {
-              case "mappings" =>
-                Server.getInstance().extractor.extract(freshSource, collector, language, false)
-              case "custom" =>
-                Server.getInstance().extractor.extract(freshSource, collector, language, true)
-              case _ =>
-                Server.getInstance().extractWithSpecificExtractor(freshSource, collector, language, name)
-            }
+            extractSingle(name, freshSource, collector, language)
             logger.info(s"Completed extractor: $name")

This also makes it easier to use the same logic in the outer pattern match cases.

358-362: Remove unnecessary List-to-string-to-array conversion.

Lines 358-362 (and similarly line 460) join the List[String] into a comma-separated string, which is then split back into an array at line 259. This roundtrip string manipulation is fragile and unnecessary.

Refactor performExtraction to accept List[String] directly:

  private def performExtraction(
-   extractorName: String,
+   extractorNames: List[String],
    sourceFactory: () => org.dbpedia.extraction.sources.Source,
    finalFormat: String,
    contentType: String
  ): Response = {
-   extractorName match {
-     case "mappings" | "custom" => /* ... */
-     case specificExtractor if specificExtractor.contains(",") =>
-       val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
+   extractorNames match {
+     case List("mappings") | List("custom") => /* ... */
+     case names if names.size > 1 =>
+       val cleanedNames = names.map(_.trim).filter(_.nonEmpty)
        // ... rest of multi-extractor logic ...
-     case specificExtractor => /* single extractor case */
+     case List(name) => /* single extractor case using name */
+     case _ => throw new IllegalArgumentException("No extractors specified")
    }
  }

Then update the call sites:

- val extractorName = if (extractors == null || extractors.isEmpty) {
-   "mappings"
- } else {
-   extractors.asScala.mkString(",")
- }
+ val extractorsList = if (extractors == null || extractors.isEmpty) {
+   List("mappings")
+ } else {
+   extractors.asScala.toList
+ }
  try {
    performExtraction(
-     extractorName,
+     extractorsList,
      () => createSource(xmlUrl, revid, title),
      finalFormat,
      contentType
    )

This eliminates all comma-based string manipulation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da82f68 and 4731988.

📒 Files selected for processing (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)

core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)

TerseFormatter (13-14)

core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)

DeduplicatingDestination (9-10)

core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)

WriterDestination (16-17)

server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)

Server (26-31)

Server (159-159)

extractWithSpecificExtractor (87-106)

server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)

writeHeader (13-18)

core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)

fromXML (61-61)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: build
GitHub Check: build
GitHub Check: build

🔇 Additional comments (3)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

82-85: LGTM: HTML ID sanitization is well-implemented.

The ID sanitization on line 83 correctly replaces non-alphanumeric characters (except underscore and hyphen) with underscores, preventing invalid HTML IDs from extractor names that might contain special characters.

177-187: LGTM: QuadCollectorDestination implementation is correct for single-threaded use.

The in-memory quad collector is well-implemented for accumulating extraction results. The write method efficiently appends all quads using ++=.

Note: The implementation is not thread-safe. If multiple extractors might run concurrently in the future, consider using a synchronized collection or explicit locking.

389-399: LGTM: createFormatter implementation is clean and correct.

The formatter mapping logic is well-structured with appropriate defaults. The pattern match covers all supported formats and defaults to TriX for unknown formats.

coderabbitai · 2025-10-16T18:17:33Z

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala

+        if (errors.nonEmpty) {
+          logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}")
+          if (collector.quads.isEmpty) {
+          val errorMessage = if (errors.size == 1) errors.head
+                            else s"Multiple extractor errors:\n${errors.mkString("\n")}"
+          val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST
+                          else Response.Status.INTERNAL_SERVER_ERROR
+          throw new WebApplicationException(new Exception(errorMessage), statusCode)
+          }


⚠️ Potential issue | 🟡 Minor

String-based error status logic still present.

The past review comment for this section was marked as "Addressed," but the current code (line 296) still uses fragile string matching (errors.exists(_.contains("not found"))) to decide between HTTP 400 and 500 status codes.

The suggested type-based approach from the previous review has not been implemented.

coderabbitai · 2025-10-16T18:17:33Z

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala

+        if (errors.nonEmpty && (finalFormat.contains("turtle") || finalFormat.contains("n-triples") || finalFormat.contains("n-quads"))) {
+          writer.write("# Warning: Partial results - some extractors failed:\n")
+          errors.foreach(err => writer.write(s"# - $err\n"))
+          writer.write("\n")
+        }


⚠️ Potential issue | 🟡 Minor

Warning comments only visible for text-based formats.

Lines 305-309 write warning comments only for turtle and N-Triples/N-Quads formats. Warnings are not included in the response body for trix and rdf-json formats, making errors invisible to users who don't inspect HTTP headers.

Consider either:

Adding format-appropriate warnings for all formats (e.g., XML comments for trix, JSON field for rdf-json)

Documenting that partial-error warnings are only available via the X-Extraction-Warnings HTTP header for XML and JSON formats

🤖 Prompt for AI Agents

In server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala around lines 305-309, the code writes warning comments only for text-based RDF serializations (turtle, n-triples, n-quads), leaving trix and rdf-json responses without visible partial-extraction warnings; update the response generation so that for trix you emit an XML comment or a dedicated <warnings> element appropriate to the TRiX wrapper before the RDF graph, and for rdf-json you include a top-level "warnings" array/object in the JSON response (or embed warnings in the JSON-LD structure) while retaining the existing X-Extraction-Warnings header; ensure the format-specific warning insertion is only applied when errors.nonEmpty and does not break valid serialization for each format.

TallTed · 2025-10-17T18:04:59Z

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala


-        val source = if (xmlUrl != null && !xmlUrl.isEmpty) {
-          // Fetch XML from custom URL with user agent
+   // Helper method to create source from xmlUrl or fallback to WikiSource


Suggested change

// Helper method to create source from xmlUrl or fallback to WikiSource

// Helper method to create source from xmlUrl or fall back to WikiSource

- Added test-extraction-combinations.sh script to test different extractor combinations - Tests mappings-only, all-enabled, and PageId+Label combinations - Validates both N-Triples and Turtle output formats - Integrated into GitHub Actions workflow - TriX format warnings already implemented in Extraction.scala How to run manually: cd scripts/src/main/bash chmod +x test-extraction-combinations.sh ./test-extraction-combinations.sh Note: Server must be running on localhost:9999 before running tests

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

scripts/src/main/bash/test-extraction-combinations.sh (1)
1-21: Consider adding error case tests.

The current tests only cover successful extraction scenarios. Consider adding tests for error cases to improve robustness:

Invalid extractor names (should return 400)

Missing required parameters (should return 400)

Requesting partial results when some extractors fail (should include X-Extraction-Warnings header)

Example additional tests:
# Test 5: Invalid extractor (should fail with 400)
if ! curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=NonExistentExtractor" >/dev/null 2>&1; then
  echo "✓ Invalid extractor rejected"
else
  echo "✗ FAIL: Invalid extractor should return error"
  FAILED=1
fi

# Test 6: Missing title/revid/xmlUrl (should fail with 400)
if ! curl -sf "$API/extract?format=n-triples&extractors=mappings" >/dev/null 2>&1; then
  echo "✓ Missing parameters rejected"
else
  echo "✗ FAIL: Missing parameters should return error"
  FAILED=1
fi

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4731988 and b6ce78e.

📒 Files selected for processing (3)

.github/workflows/server-web-api-test.yml (1 hunks)
scripts/src/main/bash/test-extraction-combinations.sh (1 hunks)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)

core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)

TerseFormatter (13-14)

core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)

DeduplicatingDestination (9-10)

core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)

WriterDestination (16-17)

server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)

Server (26-31)

Server (159-159)

getInstance (171-176)

extractWithSpecificExtractor (87-106)

server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)

writeHeader (13-18)

core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)

fromXML (61-61)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: build
GitHub Check: build
GitHub Check: deploy
GitHub Check: build

🔇 Additional comments (8)

server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (7)

6-6: LGTM - Imports support new multi-extractor functionality.

The new imports for formatters, destinations, Quad, and ArrayBuffer are all necessary for the multi-extractor flow and in-memory quad collection introduced in this PR.

Also applies to: 16-17, 20-20, 23-23

177-187: LGTM - Clean in-memory quad collector.

The QuadCollectorDestination provides a simple, correct implementation for accumulating quads from multiple extractors. The use of ArrayBuffer is safe here since extractors run sequentially (line 265).

190-228: LGTM - Solid source creation logic.

The createSource helper properly handles XML URL fetching with appropriate error handling, custom user-agent support, and resource cleanup. The fallback to WikiSource is clear.

Note: TallTed already suggested improving the comment at line 190 in a previous review.

230-289: LGTM - Well-structured multi-extractor flow.

The performExtraction helper centralizes extraction logic effectively. Key strengths:

Lines 269-276: The nested pattern match correctly distinguishes "mappings" and "custom" mode flags from actual extractor names, enabling mode+extractor combinations.

Lines 262-289: Sequential extractor processing with per-extractor error collection allows partial results when some extractors fail.

Line 293: Proper handling when all extractors fail vs. partial success.

The overall design supports the multi-extractor feature while maintaining error transparency.

352-353: LGTM - Good validation and delegation.

Lines 352-353 properly validate that at least one of title, revid, or xmlUrl is provided. The delegation to performExtraction with the source factory pattern (lines 372-377) is clean and eliminates code duplication.

Also applies to: 371-377

397-407: LGTM - Clean formatter factory.

The createFormatter helper effectively centralizes format selection logic, reducing duplication across GET and POST endpoints. All supported formats are correctly mapped to their respective formatter implementations.

457-470: LGTM - Consistent POST endpoint refactoring.

The POST endpoint correctly mirrors the GET endpoint pattern: accepting java.util.List[String] for extractors and delegating to performExtraction with an XML source factory. The refactoring eliminates code duplication while maintaining consistent behavior.

.github/workflows/server-web-api-test.yml (1)

54-57: LGTM - Good test coverage for multi-extractor feature.

The new CI step properly exercises multi-extractor combinations, validating the core functionality introduced in this PR. The placement after the coordinate test and before server shutdown is appropriate.

coderabbitai · 2025-10-29T20:50:48Z

scripts/src/main/bash/test-extraction-combinations.sh

+#!/bin/bash
+# Test Multi-Extractor API (PR #794)
+
+API="http://localhost:9999/extraction/en"
+TITLE="Albert_Einstein"
+
+echo "Testing Multi-Extractor API"
+
+# Test 1: Mappings-only + N-Triples
+curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"
+
+# Test 2: All-enabled + Turtle
+curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"
+
+# Test 3: PageId + Label + N-Triples
+curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"
+
+# Test 4: PageId + Label + Turtle
+curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"
+
+echo "Done"


⚠️ Potential issue | 🔴 Critical

Critical: Test script doesn't fail CI on test failures.

The script prints "✗ FAIL" messages but never exits with a non-zero status code, so the CI step will always pass even when tests fail. Additionally, curl errors (network issues, server errors) are silently ignored.

Apply these changes to make tests fail the CI build:

#!/bin/bash # Test Multi-Extractor API (PR #794) +set -e # Exit on first error +FAILED=0 API="http://localhost:9999/extraction/en" TITLE="Albert_Einstein" echo "Testing Multi-Extractor API" # Test 1: Mappings-only + N-Triples -curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL" +if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then + echo "✓ Mappings N-Triples" +else + echo "✗ FAIL: Mappings N-Triples" + FAILED=1 +fi # Test 2: All-enabled + Turtle -curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL" +if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then + echo "✓ All-enabled Turtle" +else + echo "✗ FAIL: All-enabled Turtle" + FAILED=1 +fi # Test 3: PageId + Label + N-Triples -curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL" +if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then + echo "✓ PageId+Label N-Triples" +else + echo "✗ FAIL: PageId+Label N-Triples" + FAILED=1 +fi # Test 4: PageId + Label + Turtle -curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL" +if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then + echo "✓ PageId+Label Turtle" +else + echo "✗ FAIL: PageId+Label Turtle" + FAILED=1 +fi -echo "Done" +if [ $FAILED -eq 0 ]; then + echo "✓ All tests passed" + exit 0 +else + echo "✗ Some tests failed" + exit 1 +fi

Note: The -f flag makes curl fail on HTTP errors (4xx, 5xx), ensuring server errors are caught.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#!/bin/bash

# Test Multi-Extractor API (PR #794)

API="http://localhost:9999/extraction/en"

TITLE="Albert_Einstein"

echo "Testing Multi-Extractor API"

# Test 1: Mappings-only + N-Triples

curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"

# Test 2: All-enabled + Turtle

curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"

# Test 3: PageId + Label + N-Triples

curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"

# Test 4: PageId + Label + Turtle

curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"

echo "Done"

#!/bin/bash

# Test Multi-Extractor API (PR #794)

set -e # Exit on first error

FAILED=0

API="http://localhost:9999/extraction/en"

TITLE="Albert_Einstein"

echo "Testing Multi-Extractor API"

# Test 1: Mappings-only + N-Triples

if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then

echo "✓ Mappings N-Triples"

else

echo "✗ FAIL: Mappings N-Triples"

FAILED=1

fi

# Test 2: All-enabled + Turtle

if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then

echo "✓ All-enabled Turtle"

else

echo "✗ FAIL: All-enabled Turtle"

FAILED=1

fi

# Test 3: PageId + Label + N-Triples

if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then

echo "✓ PageId+Label N-Triples"

else

echo "✗ FAIL: PageId+Label N-Triples"

FAILED=1

fi

# Test 4: PageId + Label + Turtle

if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then

echo "✓ PageId+Label Turtle"

else

echo "✗ FAIL: PageId+Label Turtle"

FAILED=1

fi

if [ $FAILED -eq 0 ]; then

echo "✓ All tests passed"

exit 0

else

echo "✗ Some tests failed"

exit 1

fi

🤖 Prompt for AI Agents

In scripts/src/main/bash/test-extraction-combinations.sh lines 1-21, the script prints "✗ FAIL" but never returns a non-zero exit code and it silently ignores curl HTTP/network errors; enable strict failure handling (e.g., set -euo pipefail) at the top, make curl fail on HTTP errors and surface failures using -f (and -S for error output) while keeping -s if desired, and ensure each test checks the combined exit status (curl | grep) and exits non-zero on failure (e.g., replace the current || echo "✗ FAIL" with a branch that echoes the failure and exits 1) so the CI step fails when any test or curl fails.

- Add yellow warning banner & detailed error section in TriX XSLT - Detect WARNING comments and display user-friendly notices - Fix exit code logic in test-extraction-combinations.sh to fail correctly

sonarqubecloud · 2025-11-03T18:44:41Z

Quality Gate passed

Issues
9 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

haniyakonain force-pushed the multi-extractor-final branch from b67ad04 to da82f68 Compare October 15, 2025 16:35

coderabbitai bot reviewed Oct 15, 2025

View reviewed changes

haniyakonain and others added 2 commits October 16, 2025 23:42

haniyakonain force-pushed the multi-extractor-final branch from da82f68 to 4731988 Compare October 16, 2025 18:13

coderabbitai bot reviewed Oct 16, 2025

View reviewed changes

TallTed reviewed Oct 17, 2025

View reviewed changes

coderabbitai bot reviewed Oct 29, 2025

View reviewed changes

haniyakonain marked this pull request as draft November 3, 2025 18:38

fix: improve XSLT warnings + test script exit handling

d190f20

- Add yellow warning banner & detailed error section in TriX XSLT - Detect WARNING comments and display user-friendly notices - Fix exit code logic in test-extraction-combinations.sh to fail correctly

	// Helper method to create source from xmlUrl or fallback to WikiSource
	// Helper method to create source from xmlUrl or fall back to WikiSource

feat: add multi-extractor support with improved UI and error handling… #794

Are you sure you want to change the base?

feat: add multi-extractor support with improved UI and error handling… #794

Uh oh!

Conversation

haniyakonain commented Oct 8, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 16, 2025

Choose a reason for hiding this comment

Uh oh!

TallTed Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud bot commented Nov 3, 2025

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

haniyakonain commented Oct 8, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 8, 2025 •

edited

Loading