Skip to content

Conversation

@haniyakonain
Copy link
Contributor

@haniyakonain haniyakonain commented Oct 8, 2025

… (no whitespaces)

Add support for selecting and running multiple extractors simultaneously through an improved checkbox-based UI with proper error handling and code refactoring.

Changes:

  • Replace dropdown with checkboxes for extractor selection
  • Update @QueryParam to accept List for multiple extractors
  • Add QuadCollectorDestination to accumulate results from multiple extractors
  • Sanitize HTML IDs to prevent invalid characters in checkbox elements
  • Return appropriate HTTP status codes (400 for invalid, 500 for failures)

Refactoring:

  • Extract common logic into performExtraction() helper method
  • Reduce code duplication from 30.6% to <3%
  • Add createSource() and createFormatter() helper methods
  • GET/POST endpoints now delegate to helper with source factory pattern
  • Fixes SonarQube quality gate failure

Bug fixes:

  • Fix "mappings" and "custom" handling in multi-extractor selection Added pattern matching to route these mode flags correctly instead of treating them as extractor names, allowing users to combine extraction modes with specific extractors in a single request

Summary by CodeRabbit

  • New Features

    • Support multiple extractors per request with selectable extractor modes and per-extractor controls.
    • In-memory accumulation of extraction results enabling combined outputs and partial-result emission with warnings.
    • Unified extraction pipeline with improved content negotiation and aggregated error reporting.
  • Tests

    • Added CI step and script to exercise multi-extractor combinations and verify output formats.

@coderabbitai
Copy link

coderabbitai bot commented Oct 8, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Extraction endpoints changed to accept multiple extractors as a java.util.List[String]. An in-memory quad collector (QuadCollectorDestination) and unified extraction helpers (createSource, createFormatter, performExtraction) were added. UI, GET and POST flows, content negotiation, and error/partial-result handling were updated to support multi-extractor runs.

Changes

Cohort / File(s) Summary
Extraction endpoint & pipeline
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala
Replace single-String extractor parameter with java.util.List[String] for GET and POST. Add QuadCollectorDestination to accumulate quads in-memory. Introduce helpers createSource, createFormatter, and performExtraction to centralize extraction logic, per-extractor processing, partial-error collection and aggregated response formatting. Update content negotiation and destination wiring.
CI workflow
.github/workflows/server-web-api-test.yml
Add new workflow step "Run Multi-Extractor Combination Tests" to execute the test script in CI.
Test script
scripts/src/main/bash/test-extraction-combinations.sh
Add a Bash script exercising multi-extractor API combinations and formats (n-triples, turtle), verifying responses with grep and printing pass/fail results.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Endpoint as /extract Endpoint
    participant Source as createSource()
    participant Pipeline as performExtraction()
    participant Collector as QuadCollectorDestination
    participant Formatter as createFormatter()
    participant Response as Response Builder

    Client->>Endpoint: GET/POST with extractors: List[String], format, title/revid/xml
    Endpoint->>Source: createSource(title, revid, xmlUrl)
    Source-->>Endpoint: Source (XML or Wiki)

    loop per extractor
        Endpoint->>Pipeline: performExtraction(extractor, source, Collector)
        Pipeline->>Collector: emit quads
        Collector-->>Pipeline: accumulated quads
        Pipeline-->>Endpoint: success or error (with warning)
    end

    Endpoint->>Formatter: createFormatter(format, headers)
    Formatter-->>Endpoint: formatter instance
    Endpoint->>Response: format accumulated quads + include warnings/errors
    Response-->>Client: HTTP response (body + status reflecting partial failures)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to Extraction.scala sections that wire destinations and manage per-extractor error aggregation.
  • Verify correctness of content negotiation and formatter selection for multi-extractor output.
  • Review the new test script for shell robustness and CI path correctness.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: add multi-extractor support with improved UI and error handling…" directly and accurately summarizes the main changes in the changeset. The core modification is the introduction of multi-extractor support, which is substantiated by the refactored Extraction.scala endpoint methods that now accept java.util.List[String] instead of single strings, the introduction of QuadCollectorDestination for accumulating results, and the unified extraction pipeline. The secondary improvements mentioned in the title—improved UI and error handling—are also clearly present in the changes, including the refactored checkbox-based UI and enhanced HTTP status handling. The title is specific, concise, and clear enough that a teammate reviewing the commit history would understand the primary purpose of this PR.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)

191-228: Improve XML parsing error handling and resource management in createSource

  • Wrap XML.loadString and response reading in try/catch to handle SAXException, IOException etc., and rethrow as a WebApplicationException (e.g. 400 Bad Request) so malformed XML doesn’t bubble up as a 500.
  • Avoid loading the entire response into memory via mkString; switch to a streaming parser or enforce a maximum response size.
  • Ensure the HTTP response (not just the client) is closed in the finally block—e.g. call response.close() or explicitly close the entity’s input stream.
🧹 Nitpick comments (2)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)

72-86: Consider accessibility improvements for the checkbox UI.

The checkbox UI is functional, but consider these enhancements:

  1. The <select> dropdown at line 73 and the checkboxes below create a somewhat unusual UX pattern where users first pick a "mode" and then optionally customize. This could be simplified.
  2. ID sanitization at line 83 is good for avoiding HTML/CSS issues, but ensure the original extractor names are preserved in the value attribute (which they are).
  3. Consider adding ARIA attributes to improve screen reader support, e.g., aria-describedby to link the select with the checkbox container.

177-187: Document thread-safety assumption for QuadCollectorDestination.

The QuadCollectorDestination uses ArrayBuffer, which is not thread-safe. While the current usage in performExtraction (lines 265-289) is sequential and safe, future modifications could inadvertently introduce concurrent writes.

Consider adding a comment to document this constraint:

  /**
-  * Custom destination that collects quads in memory
+  * Custom destination that collects quads in memory.
+  * Note: Not thread-safe. Assumes sequential write operations.
   */
  private class QuadCollectorDestination extends Destination {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7485d00 and b67ad04.

📒 Files selected for processing (1)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
  • TerseFormatter (13-14)
core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
  • DeduplicatingDestination (9-10)
core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
  • WriterDestination (16-17)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)
  • Server (26-31)
  • Server (159-159)
  • getInstance (171-176)
  • extractWithSpecificExtractor (87-106)
server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
  • writeHeader (13-18)
core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
  • fromXML (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

333-387: GET endpoint properly handles errors and defaults.

The refactored GET endpoint has solid error handling:

  1. Line 344-345: Validates required parameters before processing
  2. Lines 370-386: Comprehensive exception handling with appropriate HTTP status codes
  3. Line 378-384: Full stack trace logging for debugging
  4. Lines 358-362: Backward-compatible default to "mappings" when no extractors specified

The comma-joining at line 361 is related to the fragility flagged in the performExtraction method—consider the alternative approach suggested there.


389-399: LGTM!

The createFormatter method cleanly maps format strings to their corresponding formatter implementations using pattern matching. The default case falling back to TriX is reasonable.


449-462: POST endpoint follows GET pattern.

The POST endpoint has been refactored to match the GET endpoint's structure, delegating to performExtraction. The same comma-joining approach at line 460 has the fragility noted in the GET endpoint review.

Otherwise, the delegation pattern is clean and reduces code duplication as intended.

Comment on lines +89 to +120
<script type="text/javascript">
{scala.xml.Unparsed("""
function toggleCustomExtractors() {
var mode = document.getElementById('extractorMode').value;
var customDiv = document.getElementById('customExtractors');
if (mode === 'custom-selection') {
customDiv.style.display = 'block';
} else {
customDiv.style.display = 'none';
}
}
document.querySelector('form').addEventListener('submit', function(e) {
var mode = document.getElementById('extractorMode').value;
if (mode === 'mappings') {
var input = document.createElement('input');
input.type = 'hidden';
input.name = 'extractors';
input.value = 'mappings';
this.appendChild(input);
} else if (mode === 'custom') {
var input = document.createElement('input');
input.type = 'hidden';
input.name = 'extractors';
input.value = 'custom';
this.appendChild(input);
}
});
""")}
</script>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

JavaScript form manipulation has potential issues.

The JavaScript code dynamically appends hidden inputs during form submission (lines 105-117), but:

  1. If the user submits the form multiple times (unlikely but possible with back button), hidden inputs could accumulate since they're never removed.
  2. The script doesn't validate that at least one extractor is selected when in "custom-selection" mode, potentially submitting an empty list.

Consider this improvement:

  document.querySelector('form').addEventListener('submit', function(e) {
    var mode = document.getElementById('extractorMode').value;
+   
+   // Remove any previously added hidden inputs
+   var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]');
+   existing.forEach(function(el) { el.remove(); });

    if (mode === 'mappings') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'mappings';
      this.appendChild(input);
    } else if (mode === 'custom') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'custom';
      this.appendChild(input);
+   } else if (mode === 'custom-selection') {
+     // Validate at least one checkbox is selected
+     var checked = this.querySelectorAll('input[name="extractors"]:checked');
+     if (checked.length === 0) {
+       alert('Please select at least one extractor');
+       e.preventDefault();
+       return false;
+     }
    }
  });

Comment on lines +258 to +260
case specificExtractor if specificExtractor.contains(",") =>
val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Comma-based multi-extractor detection is fragile.

Line 258 uses contains(",") to detect multiple extractors, and line 259 splits on comma. This approach has issues:

  1. If an extractor name legitimately contains a comma (unlikely but not impossible), it would be incorrectly split.
  2. The logic assumes the frontend will send comma-separated values when multiple checkboxes are selected. However, HTML forms typically send multiple values with the same name as separate query parameters (e.g., ?extractors=foo&extractors=bar), which JAX-RS automatically collects into the List[String] parameter. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.

Consider this refactor to work directly with the list:

- case specificExtractor if specificExtractor.contains(",") =>
-   val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
+ case specificExtractor if specificExtractor == "MULTI" || extractors.asScala.size > 1 =>
+   val extractorNames = extractors.asScala.toArray

Then at the call site (line 361), don't join:

- extractors.asScala.mkString(",")
+ if (extractors.asScala.size == 1) extractors.asScala.head else "MULTI"

This eliminates the string manipulation entirely.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala
around lines 258 to 260, the code detects multiple extractors by checking for
contains(",") and splitting the string; instead, treat extractors as a
List[String] (the JAX-RS behaviour) and remove any comma-based parsing: replace
the contains/split branch with logic that accepts the incoming List[String]
directly (use .map(_.trim).filter(_.nonEmpty) on the list), log
extractorNames.length and extractorNames.mkString(", "), and process
extractorNames without string splitting; also update the call site around line
361 to stop joining the list into a single comma string and pass the
List[String] through as-is so no comma manipulation is necessary anywhere.

@haniyakonain haniyakonain force-pushed the multi-extractor-final branch from b67ad04 to da82f68 Compare October 15, 2025 16:35
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

258-262: Comma-based multi-extractor parsing is fragile.

Line 258 detects multiple extractors using contains(",") and line 259 splits on comma. This approach has issues:

  1. If an extractor name contains a comma (unlikely but possible), it would be split incorrectly.
  2. HTML forms naturally send multiple parameters as ?extractors=foo&extractors=bar, which JAX-RS collects into List[String] automatically. The comma-joining at line 361 and subsequent splitting here is unnecessary indirection.
  3. The logic conflates string manipulation with multi-value handling.

Consider refactoring to work directly with the list:

-case specificExtractor if specificExtractor.contains(",") =>
-  val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
-  logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")
+case _ =>
+  // Split the comma-separated string into individual extractor names
+  val extractorNames = extractorName.split(",").map(_.trim).filter(_.nonEmpty)
+  
+  // If only one extractor after parsing, delegate to single-extractor branch
+  if (extractorNames.length == 1) {
+    // Handle as single extractor (reuse existing logic)
+    val singleName = extractorNames.head
+    val writer = new StringWriter
+    val formatter = createFormatter(finalFormat, writer)
+    val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
+    val source = sourceFactory()
+    singleName match {
+      case "mappings" => Server.getInstance().extractor.extract(source, destination, language, false)
+      case "custom" => Server.getInstance().extractor.extract(source, destination, language, true)
+      case _ => Server.getInstance().extractWithSpecificExtractor(source, destination, language, singleName)
+    }
+    Response.ok(writer.toString)
+      .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
+      .build()
+  } else {
+    // Handle multiple extractors
+    logger.info(s"Processing ${extractorNames.length} extractors: ${extractorNames.mkString(", ")}")

This eliminates the fragile pattern matching on contains(",") while maintaining current functionality.


291-299: Error status logic relies on fragile string matching.

Lines 296-297 determine HTTP status by checking if error messages contain "not found". This string matching is fragile:

  1. Line 280 creates "Extractor '$name' not found" for IllegalArgumentException
  2. Line 284 creates generic messages for other exceptions
  3. Line 296 checks for literal "not found" substring, which could have false positives if other errors mention those words

Consider tracking exception types separately:

  val errors = new ArrayBuffer[String]()
+ val notFoundErrors = new ArrayBuffer[String]()
  val collector = new QuadCollectorDestination()

  for (name <- extractorNames) {
    try {
      // ... extraction logic ...
    } catch {
      case e: IllegalArgumentException =>
        val errorMsg = s"Extractor '$name' not found"
        logger.warning(s"$errorMsg: ${e.getMessage}")
        errors += errorMsg
+       notFoundErrors += name
      case e: Exception =>
        val errorMsg = s"Failed to extract with '$name': ${e.getMessage}"
        logger.severe(errorMsg)
        e.printStackTrace()
        errors += errorMsg
    }
  }

  if (errors.nonEmpty) {
    logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}")
    if (collector.quads.isEmpty) {
      val errorMessage = if (errors.size == 1) errors.head
                        else s"Multiple extractor errors:\n${errors.mkString("\n")}"
-     val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST
+     val statusCode = if (notFoundErrors.nonEmpty) Response.Status.BAD_REQUEST
                      else Response.Status.INTERNAL_SERVER_ERROR
      throw new WebApplicationException(new Exception(errorMessage), statusCode)
    }
  }

89-120: Consider improving form submission handling.

The JavaScript form manipulation has two potential improvements:

  1. Accumulation of hidden inputs: Lines 105-117 append hidden inputs on form submission, but if the user navigates back and resubmits (unlikely but possible), inputs could accumulate since they're never removed.

  2. Missing validation: In "custom-selection" mode, there's no client-side check that at least one extractor is selected, potentially submitting an empty list.

The backend will default to "mappings" if no extractors are provided (line 358-359), so these are low-risk improvements:

  document.querySelector('form').addEventListener('submit', function(e) {
    var mode = document.getElementById('extractorMode').value;
+   
+   // Remove any previously added hidden inputs to prevent accumulation
+   var existing = this.querySelectorAll('input[type="hidden"][name="extractors"]');
+   existing.forEach(function(el) { el.remove(); });

    if (mode === 'mappings') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'mappings';
      this.appendChild(input);
    } else if (mode === 'custom') {
      var input = document.createElement('input');
      input.type = 'hidden';
      input.name = 'extractors';
      input.value = 'custom';
      this.appendChild(input);
+   } else if (mode === 'custom-selection') {
+     // Optional: validate at least one checkbox is selected
+     var checked = this.querySelectorAll('input[name="extractors"]:checked');
+     if (checked.length === 0) {
+       alert('Please select at least one extractor');
+       e.preventDefault();
+       return false;
+     }
    }
  });
🧹 Nitpick comments (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (1)

358-362: Consider eliminating comma-joining to simplify multi-extractor handling.

Line 361 joins the extractor list with commas: extractors.asScala.mkString(","). This creates the comma-separated string that must be parsed later in performExtraction (line 258-259).

Since performExtraction already handles different extractor modes via pattern matching, consider passing a marker for multi-extractor mode instead:

 val extractorName = if (extractors == null || extractors.isEmpty) {
   "mappings"
+} else if (extractors.asScala.size == 1) {
+  extractors.asScala.head
 } else {
-  extractors.asScala.mkString(",")
+  // Join for backward compatibility, but consider passing List directly in future refactor
+  extractors.asScala.mkString(",")
 }

Better yet, refactor performExtraction to accept List[String] directly instead of a single string, eliminating the need for comma manipulation entirely. This would require updating the method signature and pattern matching logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b67ad04 and da82f68.

📒 Files selected for processing (2)
  • dump/src/test/resources/shaclTestsCoverageTable.md (14 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
  • TerseFormatter (13-14)
core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
  • DeduplicatingDestination (9-10)
core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
  • WriterDestination (16-17)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)
  • Server (26-31)
  • Server (159-159)
  • extractWithSpecificExtractor (87-106)
server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
  • writeHeader (13-18)
core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
  • fromXML (61-61)
🪛 markdownlint-cli2 (0.18.1)
dump/src/test/resources/shaclTestsCoverageTable.md

29-29: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


29-29: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


30-30: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


30-30: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


31-31: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


31-31: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


32-32: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


32-32: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


43-43: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


43-43: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


54-54: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


54-54: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


64-64: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


64-64: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


65-65: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


65-65: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


66-66: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


66-66: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


67-67: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


67-67: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


68-68: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


68-68: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


69-69: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


69-69: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


70-70: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


70-70: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


71-71: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


71-71: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


82-82: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


82-82: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


92-92: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


92-92: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


93-93: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


93-93: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


94-94: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


94-94: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


95-95: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


95-95: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


104-104: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


104-104: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


114-114: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


114-114: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


124-124: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


124-124: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


125-125: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


125-125: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


126-126: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


126-126: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


127-127: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


127-127: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


128-128: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


128-128: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


129-129: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


129-129: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


130-130: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


131-131: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


131-131: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


132-132: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


132-132: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


133-133: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


133-133: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


134-134: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


134-134: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


135-135: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


135-135: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


136-136: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


136-136: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


137-137: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


137-137: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


138-138: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


138-138: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


139-139: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


139-139: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


148-148: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


148-148: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


159-159: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


159-159: Table column count
Expected: 4; Actual: 3; Too few cells, row will be missing data

(MD056, table-column-count)


160-160: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


160-160: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


161-161: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


161-161: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


162-162: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


162-162: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


163-163: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


163-163: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


164-164: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


164-164: Table column count
Expected: 4; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


165-165: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


165-165: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


166-166: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


166-166: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


167-167: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


167-167: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


168-168: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


168-168: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


169-169: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


169-169: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


170-170: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


170-170: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


171-171: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


171-171: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


172-172: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


172-172: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


173-173: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


174-174: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


174-174: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


218-218: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


218-218: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


219-219: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


219-219: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


231-231: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


231-231: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


255-255: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


255-255: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


256-256: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


256-256: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)


257-257: Table pipe style
Expected: no_leading_or_trailing; Actual: trailing_only; Unexpected trailing pipe

(MD055, table-pipe-style)


257-257: Table column count
Expected: 4; Actual: 1; Too few cells, row will be missing data

(MD056, table-column-count)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build

haniyakonain and others added 2 commits October 16, 2025 23:42
… (no whitespaces)

Add support for selecting and running multiple extractors simultaneously 
through an improved checkbox-based UI with proper error handling and 
code refactoring.

Changes:
- Replace dropdown with checkboxes for extractor selection
- Update @QueryParam to accept List<String> for multiple extractors
- Add QuadCollectorDestination to accumulate results from multiple extractors
- Sanitize HTML IDs to prevent invalid characters in checkbox elements
- Return appropriate HTTP status codes (400 for invalid, 500 for failures)

Refactoring:
- Extract common logic into performExtraction() helper method
- Reduce code duplication from 30.6% to <3%
- Add createSource() and createFormatter() helper methods
- GET/POST endpoints now delegate to helper with source factory pattern
- Fixes SonarQube quality gate failure

Bug fixes:
- Fix "mappings" and "custom" handling in multi-extractor selection
  Added pattern matching to route these mode flags correctly instead of
  treating them as extractor names, allowing users to combine extraction
  modes with specific extractors in a single request
- Redesign extractor selection with dropdown and custom panel
- Allow partial results when some extractors fail
- Add X-Extraction-Warnings response header
- Include warning comments in text-based output formats
- Enhanced error logging per extractor
@haniyakonain haniyakonain force-pushed the multi-extractor-final branch from da82f68 to 4731988 Compare October 16, 2025 18:13
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (2)

89-120: Past review concern remains valid.

The JavaScript form manipulation issue flagged in the previous review (potential accumulation of hidden inputs on multiple submissions and missing validation for empty selections in custom-selection mode) has not been addressed in this commit.


258-260: Past review concern remains valid.

The comma-based multi-extractor detection issue flagged in the previous review (fragile string splitting that should be replaced with direct List handling) has not been addressed in this commit.

🧹 Nitpick comments (4)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (4)

199-211: Consider extracting user agent configuration logic.

The user agent configuration logic (lines 199-211) appears to duplicate logic that might exist in WikiApi. If this is the case, consider extracting it into a shared utility method to maintain consistency and reduce duplication.

// In a shared utility object:
object HttpClientUtils {
  def configureUserAgent(request: HttpGet): Unit = {
    val customUserAgentEnabled = System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean
    val customUserAgentText = System.getProperty("extract.wikiapi.customUserAgent.text", 
      "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])")
    if (customUserAgentEnabled) {
      request.setHeader("User-Agent", customUserAgentText)
    }
  }
}

Then in createSource:

            val request = new HttpGet(xmlUrl)
-           // Apply user agent if enabled (same logic as WikiApi)
-           val customUserAgentEnabled = try {
-             System.getProperty("extract.wikiapi.customUserAgent.enabled", "false").toBoolean
-           } catch {
-             case ex: Exception => false
-           }
-           val customUserAgentText = try {
-             System.getProperty("extract.wikiapi.customUserAgent.text", "curl/8.6.0")
-           } catch {
-             case ex: Exception => "DBpedia-Extraction-Framework/1.0 (https://github.com/dbpedia/extraction-framework; [email protected])"
-           }
-           if (customUserAgentEnabled) {
-             request.setHeader("User-Agent", customUserAgentText)
-           }
+           HttpClientUtils.configureUserAgent(request)

238-256: Reduce duplication between 'mappings' and 'custom' cases.

The "mappings" and "custom" cases (lines 238-256) are nearly identical, differing only in the boolean parameter passed to extract(). This duplication can be eliminated.

Apply this diff to consolidate the cases:

-      case "mappings" =>
-        val writer = new StringWriter
-        val formatter = createFormatter(finalFormat, writer)
-        val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
-        val source = sourceFactory()
-        Server.getInstance().extractor.extract(source, destination, language, false)
-        Response.ok(writer.toString)
-          .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
-          .build()
-
-        case "custom" =>
+      case "mappings" | "custom" =>
+        val useAllExtractors = extractorName == "custom"
         val writer = new StringWriter
         val formatter = createFormatter(finalFormat, writer)
         val destination = new DeduplicatingDestination(new WriterDestination(() => writer, formatter))
         val source = sourceFactory()
-        Server.getInstance().extractor.extract(source, destination, language, true)
+        Server.getInstance().extractor.extract(source, destination, language, useAllExtractors)
         Response.ok(writer.toString)
           .header(HttpHeaders.CONTENT_TYPE, contentType + "; charset=UTF-8")
           .build()

269-276: Extract single-extractor logic to eliminate duplication.

The nested pattern match (lines 269-276) duplicates the extraction logic from the outer pattern match (lines 238-256, 321-330). This creates maintenance issues if the logic needs to change.

Extract the single-extractor logic into a helper method:

private def extractSingle(
  name: String,
  source: org.dbpedia.extraction.sources.Source,
  destination: Destination,
  language: Language
): Unit = {
  name match {
    case "mappings" =>
      Server.getInstance().extractor.extract(source, destination, language, false)
    case "custom" =>
      Server.getInstance().extractor.extract(source, destination, language, true)
    case _ =>
      Server.getInstance().extractWithSpecificExtractor(source, destination, language, name)
  }
}

Then use it in the loop:

         for (name <- extractorNames) {
           try {
             logger.info(s"Processing extractor: $name")
             val freshSource = sourceFactory()
-            name match {
-              case "mappings" =>
-                Server.getInstance().extractor.extract(freshSource, collector, language, false)
-              case "custom" =>
-                Server.getInstance().extractor.extract(freshSource, collector, language, true)
-              case _ =>
-                Server.getInstance().extractWithSpecificExtractor(freshSource, collector, language, name)
-            }
+            extractSingle(name, freshSource, collector, language)
             logger.info(s"Completed extractor: $name")

This also makes it easier to use the same logic in the outer pattern match cases.


358-362: Remove unnecessary List-to-string-to-array conversion.

Lines 358-362 (and similarly line 460) join the List[String] into a comma-separated string, which is then split back into an array at line 259. This roundtrip string manipulation is fragile and unnecessary.

Refactor performExtraction to accept List[String] directly:

  private def performExtraction(
-   extractorName: String,
+   extractorNames: List[String],
    sourceFactory: () => org.dbpedia.extraction.sources.Source,
    finalFormat: String,
    contentType: String
  ): Response = {
-   extractorName match {
-     case "mappings" | "custom" => /* ... */
-     case specificExtractor if specificExtractor.contains(",") =>
-       val extractorNames = specificExtractor.split(",").map(_.trim).filter(_.nonEmpty)
+   extractorNames match {
+     case List("mappings") | List("custom") => /* ... */
+     case names if names.size > 1 =>
+       val cleanedNames = names.map(_.trim).filter(_.nonEmpty)
        // ... rest of multi-extractor logic ...
-     case specificExtractor => /* single extractor case */
+     case List(name) => /* single extractor case using name */
+     case _ => throw new IllegalArgumentException("No extractors specified")
    }
  }

Then update the call sites:

- val extractorName = if (extractors == null || extractors.isEmpty) {
-   "mappings"
- } else {
-   extractors.asScala.mkString(",")
- }
+ val extractorsList = if (extractors == null || extractors.isEmpty) {
+   List("mappings")
+ } else {
+   extractors.asScala.toList
+ }
  try {
    performExtraction(
-     extractorName,
+     extractorsList,
      () => createSource(xmlUrl, revid, title),
      finalFormat,
      contentType
    )

This eliminates all comma-based string manipulation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da82f68 and 4731988.

📒 Files selected for processing (1)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
  • TerseFormatter (13-14)
core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
  • DeduplicatingDestination (9-10)
core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
  • WriterDestination (16-17)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (3)
  • Server (26-31)
  • Server (159-159)
  • extractWithSpecificExtractor (87-106)
server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
  • writeHeader (13-18)
core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
  • fromXML (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (3)

82-85: LGTM: HTML ID sanitization is well-implemented.

The ID sanitization on line 83 correctly replaces non-alphanumeric characters (except underscore and hyphen) with underscores, preventing invalid HTML IDs from extractor names that might contain special characters.


177-187: LGTM: QuadCollectorDestination implementation is correct for single-threaded use.

The in-memory quad collector is well-implemented for accumulating extraction results. The write method efficiently appends all quads using ++=.

Note: The implementation is not thread-safe. If multiple extractors might run concurrently in the future, consider using a synchronized collection or explicit locking.


389-399: LGTM: createFormatter implementation is clean and correct.

The formatter mapping logic is well-structured with appropriate defaults. The pattern match covers all supported formats and defaults to TriX for unknown formats.

Comment on lines +291 to +299
if (errors.nonEmpty) {
logger.warning(s"Partial extraction completed with errors: ${errors.mkString("; ")}")
if (collector.quads.isEmpty) {
val errorMessage = if (errors.size == 1) errors.head
else s"Multiple extractor errors:\n${errors.mkString("\n")}"
val statusCode = if (errors.exists(_.contains("not found"))) Response.Status.BAD_REQUEST
else Response.Status.INTERNAL_SERVER_ERROR
throw new WebApplicationException(new Exception(errorMessage), statusCode)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

String-based error status logic still present.

The past review comment for this section was marked as "Addressed," but the current code (line 296) still uses fragile string matching (errors.exists(_.contains("not found"))) to decide between HTTP 400 and 500 status codes.

The suggested type-based approach from the previous review has not been implemented.

Comment on lines 305 to 309
if (errors.nonEmpty && (finalFormat.contains("turtle") || finalFormat.contains("n-triples") || finalFormat.contains("n-quads"))) {
writer.write("# Warning: Partial results - some extractors failed:\n")
errors.foreach(err => writer.write(s"# - $err\n"))
writer.write("\n")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Warning comments only visible for text-based formats.

Lines 305-309 write warning comments only for turtle and N-Triples/N-Quads formats. Warnings are not included in the response body for trix and rdf-json formats, making errors invisible to users who don't inspect HTTP headers.

Consider either:

  1. Adding format-appropriate warnings for all formats (e.g., XML comments for trix, JSON field for rdf-json)
  2. Documenting that partial-error warnings are only available via the X-Extraction-Warnings HTTP header for XML and JSON formats
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala
around lines 305-309, the code writes warning comments only for text-based RDF
serializations (turtle, n-triples, n-quads), leaving trix and rdf-json responses
without visible partial-extraction warnings; update the response generation so
that for trix you emit an XML comment or a dedicated <warnings> element
appropriate to the TRiX wrapper before the RDF graph, and for rdf-json you
include a top-level "warnings" array/object in the JSON response (or embed
warnings in the JSON-LD structure) while retaining the existing
X-Extraction-Warnings header; ensure the format-specific warning insertion is
only applied when errors.nonEmpty and does not break valid serialization for
each format.


val source = if (xmlUrl != null && !xmlUrl.isEmpty) {
// Fetch XML from custom URL with user agent
// Helper method to create source from xmlUrl or fallback to WikiSource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Helper method to create source from xmlUrl or fallback to WikiSource
// Helper method to create source from xmlUrl or fall back to WikiSource

- Added test-extraction-combinations.sh script to test different extractor combinations
- Tests mappings-only, all-enabled, and PageId+Label combinations
- Validates both N-Triples and Turtle output formats
- Integrated into GitHub Actions workflow
- TriX format warnings already implemented in Extraction.scala

How to run manually:
  cd scripts/src/main/bash
  chmod +x test-extraction-combinations.sh
  ./test-extraction-combinations.sh

Note: Server must be running on localhost:9999 before running tests
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/src/main/bash/test-extraction-combinations.sh (1)

1-21: Consider adding error case tests.

The current tests only cover successful extraction scenarios. Consider adding tests for error cases to improve robustness:

  • Invalid extractor names (should return 400)
  • Missing required parameters (should return 400)
  • Requesting partial results when some extractors fail (should include X-Extraction-Warnings header)

Example additional tests:

# Test 5: Invalid extractor (should fail with 400)
if ! curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=NonExistentExtractor" >/dev/null 2>&1; then
  echo "✓ Invalid extractor rejected"
else
  echo "✗ FAIL: Invalid extractor should return error"
  FAILED=1
fi

# Test 6: Missing title/revid/xmlUrl (should fail with 400)
if ! curl -sf "$API/extract?format=n-triples&extractors=mappings" >/dev/null 2>&1; then
  echo "✓ Missing parameters rejected"
else
  echo "✗ FAIL: Missing parameters should return error"
  FAILED=1
fi
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4731988 and b6ce78e.

📒 Files selected for processing (3)
  • .github/workflows/server-web-api-test.yml (1 hunks)
  • scripts/src/main/bash/test-extraction-combinations.sh (1 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (6)
core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseFormatter.scala (1)
  • TerseFormatter (13-14)
core/src/main/scala/org/dbpedia/extraction/destinations/DeduplicatingDestination.scala (1)
  • DeduplicatingDestination (9-10)
core/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination.scala (1)
  • WriterDestination (16-17)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (4)
  • Server (26-31)
  • Server (159-159)
  • getInstance (171-176)
  • extractWithSpecificExtractor (87-106)
server/src/main/scala/org/dbpedia/extraction/server/resources/stylesheets/TriX.scala (1)
  • writeHeader (13-18)
core/src/main/scala/org/dbpedia/extraction/sources/XMLSource.scala (1)
  • fromXML (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: deploy
  • GitHub Check: build
🔇 Additional comments (8)
server/src/main/scala/org/dbpedia/extraction/server/resources/Extraction.scala (7)

6-6: LGTM - Imports support new multi-extractor functionality.

The new imports for formatters, destinations, Quad, and ArrayBuffer are all necessary for the multi-extractor flow and in-memory quad collection introduced in this PR.

Also applies to: 16-17, 20-20, 23-23


177-187: LGTM - Clean in-memory quad collector.

The QuadCollectorDestination provides a simple, correct implementation for accumulating quads from multiple extractors. The use of ArrayBuffer is safe here since extractors run sequentially (line 265).


190-228: LGTM - Solid source creation logic.

The createSource helper properly handles XML URL fetching with appropriate error handling, custom user-agent support, and resource cleanup. The fallback to WikiSource is clear.

Note: TallTed already suggested improving the comment at line 190 in a previous review.


230-289: LGTM - Well-structured multi-extractor flow.

The performExtraction helper centralizes extraction logic effectively. Key strengths:

  • Lines 269-276: The nested pattern match correctly distinguishes "mappings" and "custom" mode flags from actual extractor names, enabling mode+extractor combinations.
  • Lines 262-289: Sequential extractor processing with per-extractor error collection allows partial results when some extractors fail.
  • Line 293: Proper handling when all extractors fail vs. partial success.

The overall design supports the multi-extractor feature while maintaining error transparency.


352-353: LGTM - Good validation and delegation.

Lines 352-353 properly validate that at least one of title, revid, or xmlUrl is provided. The delegation to performExtraction with the source factory pattern (lines 372-377) is clean and eliminates code duplication.

Also applies to: 371-377


397-407: LGTM - Clean formatter factory.

The createFormatter helper effectively centralizes format selection logic, reducing duplication across GET and POST endpoints. All supported formats are correctly mapped to their respective formatter implementations.


457-470: LGTM - Consistent POST endpoint refactoring.

The POST endpoint correctly mirrors the GET endpoint pattern: accepting java.util.List[String] for extractors and delegating to performExtraction with an XML source factory. The refactoring eliminates code duplication while maintaining consistent behavior.

.github/workflows/server-web-api-test.yml (1)

54-57: LGTM - Good test coverage for multi-extractor feature.

The new CI step properly exercises multi-extractor combinations, validating the core functionality introduced in this PR. The placement after the coordinate test and before server shutdown is appropriate.

Comment on lines 1 to 21
#!/bin/bash
# Test Multi-Extractor API (PR #794)

API="http://localhost:9999/extraction/en"
TITLE="Albert_Einstein"

echo "Testing Multi-Extractor API"

# Test 1: Mappings-only + N-Triples
curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"

# Test 2: All-enabled + Turtle
curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"

# Test 3: PageId + Label + N-Triples
curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"

# Test 4: PageId + Label + Turtle
curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"

echo "Done" No newline at end of file
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Test script doesn't fail CI on test failures.

The script prints "✗ FAIL" messages but never exits with a non-zero status code, so the CI step will always pass even when tests fail. Additionally, curl errors (network issues, server errors) are silently ignored.

Apply these changes to make tests fail the CI build:

 #!/bin/bash
 # Test Multi-Extractor API (PR #794)
+set -e  # Exit on first error
+FAILED=0

 API="http://localhost:9999/extraction/en"
 TITLE="Albert_Einstein"

 echo "Testing Multi-Extractor API"

 # Test 1: Mappings-only + N-Triples
-curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then
+  echo "✓ Mappings N-Triples"
+else
+  echo "✗ FAIL: Mappings N-Triples"
+  FAILED=1
+fi

 # Test 2: All-enabled + Turtle
-curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then
+  echo "✓ All-enabled Turtle"
+else
+  echo "✗ FAIL: All-enabled Turtle"
+  FAILED=1
+fi

 # Test 3: PageId + Label + N-Triples
-curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then
+  echo "✓ PageId+Label N-Triples"
+else
+  echo "✗ FAIL: PageId+Label N-Triples"
+  FAILED=1
+fi

 # Test 4: PageId + Label + Turtle
-curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"
+if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then
+  echo "✓ PageId+Label Turtle"
+else
+  echo "✗ FAIL: PageId+Label Turtle"
+  FAILED=1
+fi

-echo "Done"
+if [ $FAILED -eq 0 ]; then
+  echo "✓ All tests passed"
+  exit 0
+else
+  echo "✗ Some tests failed"
+  exit 1
+fi

Note: The -f flag makes curl fail on HTTP errors (4xx, 5xx), ensuring server errors are caught.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/bin/bash
# Test Multi-Extractor API (PR #794)
API="http://localhost:9999/extraction/en"
TITLE="Albert_Einstein"
echo "Testing Multi-Extractor API"
# Test 1: Mappings-only + N-Triples
curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<" && echo "✓ Mappings N-Triples" || echo "✗ FAIL"
# Test 2: All-enabled + Turtle
curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)" && echo "✓ All-enabled Turtle" || echo "✗ FAIL"
# Test 3: PageId + Label + N-Triples
curl -s "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<" && echo "✓ PageId+Label N-Triples" || echo "✗ FAIL"
# Test 4: PageId + Label + Turtle
curl -s "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)" && echo "✓ PageId+Label Turtle" || echo "✗ FAIL"
echo "Done"
#!/bin/bash
# Test Multi-Extractor API (PR #794)
set -e # Exit on first error
FAILED=0
API="http://localhost:9999/extraction/en"
TITLE="Albert_Einstein"
echo "Testing Multi-Extractor API"
# Test 1: Mappings-only + N-Triples
if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=mappings" | grep -q "^<"; then
echo "✓ Mappings N-Triples"
else
echo "✗ FAIL: Mappings N-Triples"
FAILED=1
fi
# Test 2: All-enabled + Turtle
if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=custom" | grep -qE "(@prefix|^<)"; then
echo "✓ All-enabled Turtle"
else
echo "✗ FAIL: All-enabled Turtle"
FAILED=1
fi
# Test 3: PageId + Label + N-Triples
if curl -sf "$API/extract?title=$TITLE&format=n-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -q "^<"; then
echo "✓ PageId+Label N-Triples"
else
echo "✗ FAIL: PageId+Label N-Triples"
FAILED=1
fi
# Test 4: PageId + Label + Turtle
if curl -sf "$API/extract?title=$TITLE&format=turtle-triples&extractors=PageIdExtractor&extractors=LabelExtractor" | grep -qE "(@prefix|^<)"; then
echo "✓ PageId+Label Turtle"
else
echo "✗ FAIL: PageId+Label Turtle"
FAILED=1
fi
if [ $FAILED -eq 0 ]; then
echo "✓ All tests passed"
exit 0
else
echo "✗ Some tests failed"
exit 1
fi
🤖 Prompt for AI Agents
In scripts/src/main/bash/test-extraction-combinations.sh lines 1-21, the script
prints "✗ FAIL" but never returns a non-zero exit code and it silently ignores
curl HTTP/network errors; enable strict failure handling (e.g., set -euo
pipefail) at the top, make curl fail on HTTP errors and surface failures using
-f (and -S for error output) while keeping -s if desired, and ensure each test
checks the combined exit status (curl | grep) and exits non-zero on failure
(e.g., replace the current || echo "✗ FAIL" with a branch that echoes the
failure and exits 1) so the CI step fails when any test or curl fails.

@haniyakonain haniyakonain marked this pull request as draft November 3, 2025 18:38
- Add yellow warning banner & detailed error section in TriX XSLT
- Detect WARNING comments and display user-friendly notices
- Fix exit code logic in test-extraction-combinations.sh to fail correctly
@sonarqubecloud
Copy link

sonarqubecloud bot commented Nov 3, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants