
Conversation

Contributor

@haniyakonain haniyakonain commented Oct 6, 2025

Enable Wikidata by adding 'wikidata' to the languages configuration.

Setup:

  1. Add 'languages=de,en,wikidata' to server.default.properties
  2. Run the server
  3. Navigate to http://localhost:9999/server/extraction/wikidata/

Example page titles to extract:

  • Q42 (Douglas Adams)
  • Q5 (human)
  • Q2 (Earth)
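The setup above can be smoke-tested from the command line. The sketch below is illustrative only: the base path comes from step 3, but the `title` query parameter is an assumption about the extraction endpoint's interface and may need adjusting for your server version.

```shell
# Probe the Wikidata extraction endpoint for the example entities above.
# BASE_URL and the ?title= parameter are assumptions; adjust as needed.
BASE_URL="http://localhost:9999/server/extraction/wikidata"

for title in Q42 Q5 Q2; do
    status=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
        "${BASE_URL}/?title=${title}" 2>/dev/null)
    status=${status:-000}   # curl missing or unreachable -> "000"
    echo "${title}: HTTP ${status}"
done
```

An HTTP 200 for each title indicates the `wikidata` language was picked up from `server.default.properties`; "000" means the server is not reachable.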

Summary by CodeRabbit

  • New Features

    • Added Wikidata extraction support with multiple modes (standard, mappings, custom) and additional extraction types.
  • Improvements

    • Extraction subsystem initialized earlier for faster availability.
    • Added basic Wikidata statistics for server reporting.
  • Tests

    • New stats & redirects test script and CI step to validate statistics and redirects endpoints.
  • Limitations

    • Some Wikidata extractor types are disabled for live extraction.
    • Template mapping statistics return 501 for Wikidata.


coderabbitai bot commented Oct 6, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Replaces explicit languages with @mappings; adds multiple Wikidata extractor modes and a Wikidata statistics file; exposes a public DynamicExtractionManager on Server startup and calls updateAll; TemplateStatistics endpoints return HTTP 501 for Wikidata; adds CI step and a stats/redirects test script.

Changes

Cohort / File(s) Summary
Configuration
server/server.default.properties
Replaced explicit language list with @mappings; added extractors.wikidata blocks (default, mappings, custom) listing many Wikidata extractors; commented out WikidataLexemeExtractor and WikidataPropertyExtractor with a note about dump-only namespaces.
Statistics data
server/src/main/statistics/mappingstats_wikidata.txt
New static statistics file containing key-value mappings for Wikidata.
Server initialization
server/src/main/scala/.../Server.scala
Added public val extractor: ExtractionManager initialized as a DynamicExtractionManager(...) and calls extractor.updateAll() during Server construction.
API guard
server/src/main/scala/.../resources/TemplateStatistics.scala
Added guard: when language.wikiCode == "wikidata", endpoints return HTTP 501 with message Template mapping statistics are not available for Wikidata.
CI workflow
.github/workflows/server-web-api-test.yml
Removed trailing space from JAVA_TOOL_OPTIONS, set MAVEN_OPTS="-Daa.languages=@mappings" when starting the server, and added a step to run stats/redirects tests.
Tests / Scripts
scripts/src/main/bash/stats-redirects-test.sh
New Bash script that discovers languages from /mappings/ and checks /statistics/{lang}/ and /mappings/{lang}/redirects/ endpoints for HTTP 200; summarizes results and exits nonzero on failures.
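A minimal sketch of the check loop that stats-redirects-test.sh is described as performing. Function and variable names here are illustrative, not necessarily those used in the actual script, and the hardcoded language list stands in for the real script's discovery step.

```shell
# Sketch of the stats/redirects endpoint check; BASE_URL is an assumption.
BASE_URL="${BASE_URL:-http://localhost:9999/server}"
FAILED=0

check_url() {
    local label="$1" url="$2" code
    code=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
    code=${code:-000}   # treat curl errors as "000"
    if [ "$code" = "200" ]; then
        echo "OK   $label (HTTP $code)"
    else
        echo "FAIL $label (HTTP $code)"
        FAILED=$((FAILED + 1))
    fi
}

for lang in en de; do   # the real script discovers languages from /mappings/
    check_url "$lang statistics" "${BASE_URL}/statistics/${lang}/"
    check_url "$lang redirects"  "${BASE_URL}/mappings/${lang}/redirects/"
done

echo "Failures: $FAILED"   # the real script exits nonzero when this is > 0
```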

Sequence Diagram(s)

sequenceDiagram
    participant Main as Server.main
    participant Server as Server(instance)
    participant Extractor as DynamicExtractionManager

    Main->>Server: construct Server()
    Note right of Server: exposes public `val extractor`
    Server->>Extractor: instantiate DynamicExtractionManager(managers(...), languages, paths, ...)
    Server->>Extractor: extractor.updateAll()
    Note left of Extractor: triggers extraction update/refresh
sequenceDiagram
    participant Client as API client
    participant API as TemplateStatistics endpoint
    participant Server as Server(instance)

    Client->>API: GET /statistics/{lang}/
    alt lang == "wikidata"
        API-->>Client: HTTP 501 "Template mapping statistics are not available for Wikidata"
    else
        API->>Server: validate language & lookup manager
        Server-->>API: manager / stats
        API-->>Client: HTTP 200 + stats
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "chore: enable Wikidata extraction in server GUI" accurately reflects the main objective of the changeset. The changes across multiple files (configuration updates adding Wikidata to the languages list, new Wikidata extractor configuration, server class modifications, and supporting test infrastructure) all serve this single goal. The title is concise and specific enough that a teammate scanning repository history would immediately understand that this PR adds Wikidata support to the server GUI.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e91b21 and 0831779.

📒 Files selected for processing (2)
  • server/server.default.properties (1 hunks)
  • server/src/main/statistics/mappingstats_wikidata.txt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)

205-209: Consider adding explicit error handling for consistency.

The initialization calls updateAll without a try-catch block, unlike the cache loader pattern at lines 62-73 which includes explicit error handling and logging. While the fail-fast behavior (server won't start if initialization fails) is reasonable, adding explicit error handling would provide clearer error messages and be consistent with the existing error handling pattern in the codebase.

Example:

-      // Initialize all extractors and statistics
-      logger.info("Initializing extractors and statistics...")
-      _instance.extractor.updateAll
-      logger.info("Extractors and statistics initialized")
+      // Initialize all extractors and statistics
+      try {
+        logger.info("Initializing extractors and statistics...")
+        _instance.extractor.updateAll
+        logger.info("Extractors and statistics initialized")
+      } catch {
+        case e: Exception =>
+          logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
+          e.printStackTrace()
+          throw e
+      }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0831779 and 7c2ba28.

📒 Files selected for processing (3)
  • server/server.default.properties (2 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)
server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala (1)
  • updateAll (64-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1)

90-94: LGTM!

The guard correctly blocks Wikidata from template statistics with an appropriate 501 (Not Implemented) status. The check is properly placed before the managers validation to provide a more specific error message. This aligns with the fact that Wikidata uses a different structure (Q-identifiers) rather than templates.

server/server.default.properties (2)

21-21: Previous concern resolved.

The critical issue from the previous review has been addressed. Adding wikidata to the languages list ensures that the Wikidata extractors defined at lines 110-115 will be properly activated and the extraction endpoint will be available at http://localhost:9999/server/extraction/wikidata/ as intended.


110-115: LGTM!

The Wikidata extractors configuration is comprehensive and includes appropriate extractors for Wikidata entities (labels, descriptions, aliases, references, etc.). The comment explaining why WikidataLexemeExtractor and WikidataPropertyExtractor are excluded provides valuable context about namespace handling limitations in live server extraction.

@haniyakonain haniyakonain force-pushed the enable-wikidata-server branch 2 times, most recently from d80f64e to 8b343aa on October 16, 2025 at 19:30
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d80f64e and 8b343aa.

📒 Files selected for processing (4)
  • server/server.default.properties (2 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1 hunks)
  • server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1 hunks)
  • server/src/main/statistics/mappingstats_wikidata.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/src/main/statistics/mappingstats_wikidata.txt
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)
server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala (1)
  • updateAll (64-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (1)
server/server.default.properties (1)

21-21: LGTM! Wikidata configuration properly implemented.

The changes correctly address the PR objective and the previous review feedback:

  1. ✅ Line 21: Added wikidata to the languages list, resolving the critical issue raised in the previous review
  2. ✅ Lines 110-124: Comprehensive Wikidata extractor configuration with:
    • Default extractors for general Wikidata extraction
    • Mappings mode for label and same-as extraction
    • Custom mode for advanced extraction options
    • Clear documentation about excluded extractors (line 115) that require special namespace handling

The configuration follows established patterns used for other languages and provides flexibility with multiple extraction modes.

Also applies to: 110-124

Comment on lines +90 to +93
// Wikidata doesn't support template mapping statistics
if (language.wikiCode == "wikidata") {
  throw new WebApplicationException(new Exception("Template mapping statistics are not available for Wikidata"), 501)
}
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify Wikidata guard coverage across all endpoints.

The guard correctly prevents template statistics access for Wikidata in the HTML view. However, verify that other code paths properly handle Wikidata:

  1. JSON endpoint (line 288): The getAsJson method accesses Server.instance.managers(language).holder without a Wikidata guard. If a client requests /statistics/wikidata/json/, it will attempt to retrieve template statistics from the Wikidata manager.

  2. All-languages aggregation (line 46): The allLanguages method iterates through all managers including Wikidata and aggregates their template statistics. If Wikidata's manager contains no meaningful template statistics, this could produce misleading totals.

Run the following script to check if the JSON endpoint and all-languages view need similar guards:


🏁 Script executed:

#!/bin/bash
# Description: Check Wikidata handling in other statistics endpoints

# Check if getAsJson method has Wikidata guard
echo "=== Checking getAsJson method for Wikidata guard ==="
rg -n -A 10 -B 2 'def getAsJson' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala

# Check if allLanguages method has Wikidata filtering
echo "=== Checking allLanguages method for Wikidata filtering ==="
rg -n -A 5 -B 2 'for.*managers' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala

Length of output: 1137


🏁 Script executed:

#!/bin/bash
# Get full context of getAsJson method including else block and subsequent lines
echo "=== Full getAsJson method ==="
rg -n -A 30 'def getAsJson' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala | head -50

# Check Language.getOrElse to see if it filters Wikidata
echo -e "\n=== Search for Language handling and Wikidata checks ==="
rg -n -B 2 -A 2 'Language\.getOrElse|wikidata|wikiCode' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala

Length of output: 1907


Add Wikidata guard to getAsJson method and verify allLanguages method.

The Wikidata guard in singleLanguage (lines 90-93) does not protect the JSON endpoint. The getAsJson method lacks this guard and will process Wikidata in both code paths:

  1. Wildcard request (langCode="*"): Line 278 iterates all managers including Wikidata
  2. Specific language (langCode="wikidata"): Line 282 retrieves the language without checking if it's Wikidata, then line 288 accesses its statistics

Additionally, the allLanguages method (line 46) iterates all managers without filtering Wikidata. Apply the same Wikidata guard logic (lines 90-93) to both getAsJson (lines 277-283) and allLanguages (line 46) to ensure consistent behavior across all endpoints.

🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala
around lines 46 and 277-283, add the same Wikidata guard used in singleLanguage
(lines 90-93) so Wikidata is excluded from JSON and all-languages endpoints: in
allLanguages (line 46) filter out managers/languages whose wikiCode ==
"wikidata" before iterating, and in getAsJson (lines 277-283) check
language.wikiCode == "wikidata" and throw the same WebApplicationException(…,
501) when encountered (apply this both for wildcard iteration—skip Wikidata
managers—and for specific langCode="wikidata"—reject with the exception) to
ensure consistent handling.

Comment on lines +205 to +208
// Initialize all extractors and statistics
logger.info("Initializing extractors and statistics...")
_instance.extractor.updateAll
logger.info("Extractors and statistics initialized")
⚠️ Potential issue | 🟠 Major

Add error handling and consider timeout for blocking initialization.

The synchronous updateAll call lacks error handling, unlike the similar initialization pattern in singleExtractorCache (lines 62-72) that wraps initialization in try-catch. If updateAll throws an exception, the server fails to start without clear logging of the root cause.

Additionally, this blocking call delays HTTP server startup (line 222). If extractor initialization takes a long time (e.g., loading large mapping files, downloading statistics), the server becomes unresponsive during startup.

Consider these improvements:

1. Add error handling:

 // Initialize all extractors and statistics
 logger.info("Initializing extractors and statistics...")
-_instance.extractor.updateAll
-logger.info("Extractors and statistics initialized")
+try {
+  _instance.extractor.updateAll
+  logger.info("Extractors and statistics initialized")
+} catch {
+  case e: Exception =>
+    logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
+    e.printStackTrace()
+    throw e
+}

2. Consider timeout or async initialization:

If initialization can take significant time, consider adding a timeout mechanism or making it asynchronous with proper status tracking, so the HTTP server can start responding to health checks while initialization completes in the background.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-// Initialize all extractors and statistics
-logger.info("Initializing extractors and statistics...")
-_instance.extractor.updateAll
-logger.info("Extractors and statistics initialized")
+// Initialize all extractors and statistics
+logger.info("Initializing extractors and statistics...")
+try {
+  _instance.extractor.updateAll
+  logger.info("Extractors and statistics initialized")
+} catch {
+  case e: Exception =>
+    logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
+    e.printStackTrace()
+    throw e
+}
🤖 Prompt for AI Agents
In server/src/main/scala/org/dbpedia/extraction/server/Server.scala around lines
205 to 208, the synchronous call `_instance.extractor.updateAll` is unprotected
and can throw, and it also blocks HTTP startup; wrap the call in a try/catch to
log the full exception and fail-fast with a clear message, or better, make
initialization asynchronous: start the HTTP server immediately and run
`updateAll` in a background Future with configurable timeout and proper recovery
that logs errors and updates an initialization-ready flag or health endpoint; if
you prefer still-blocking behavior, use a configurable timeout (Await.result or
equivalent) and catch TimeoutException to log and handle it gracefully rather
than letting the server die silently.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
scripts/src/main/bash/stats-redirects-test.sh (1)

22-36: Consider explicit error handling for curl failures.

If curl encounters network errors or timeouts, the http_code variable may be empty or contain unexpected values. While the subsequent check will fail safely, adding explicit error detection improves robustness and debugging.

 test_endpoint() {
     local lang="$1"
     local type="$2"
     local url="$3"
 
-    http_code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
+    http_code=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null) || http_code="000"
 
     if [ "$http_code" = "200" ]; then
         echo "✅ $lang $type: HTTP $http_code"
     else
         echo "❌ $lang $type: HTTP $http_code"
         FAILED=$((FAILED + 1))
         FAILED_TESTS+=("$lang $type (HTTP $http_code)")
     fi
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad3157 and 066afc3.

📒 Files selected for processing (2)
  • .github/workflows/server-web-api-test.yml (3 hunks)
  • scripts/src/main/bash/stats-redirects-test.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/src/main/bash/stats-redirects-test.sh

[warning] 15-15: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)


[error] 18-18: Argument mixes string and array. Use * or separate argument.

(SC2145)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: deploy
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (1)
.github/workflows/server-web-api-test.yml (1)

19-19: LGTM: Workflow integration looks correct.

The changes properly integrate the new test step and configure the server to load all mapped languages:

  • Line 31: MAVEN_OPTS="-Daa.languages=@mappings" enables dynamic language discovery including Wikidata
  • Lines 54-57: New test step validates statistics and redirects endpoints for all discovered languages
  • Step placement is correct: runs after coordinate test, before server shutdown

The workflow will now validate that Wikidata and other language endpoints are operational as part of the CI pipeline.

Also applies to: 31-31, 54-57

MAPPINGS_PAGE=$(curl -s "${BASE_URL}/mappings/")
[ -z "$MAPPINGS_PAGE" ] && { echo "❌ ERROR: Could not fetch mappings page"; exit 1; }

LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
⚠️ Potential issue | 🟡 Minor

Fix shellcheck SC2207: Properly handle array assignment from command output.

The unquoted command substitution in array assignment can cause word-splitting issues. Use mapfile for more robust handling.

-LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
+mapfile -t LANGUAGES < <(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u)
📝 Committable suggestion


Suggested change
-LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
+mapfile -t LANGUAGES < <(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u)
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 15-15: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)

🤖 Prompt for AI Agents
In scripts/src/main/bash/stats-redirects-test.sh around line 15, the array
assignment using unquoted command substitution can trigger SC2207 and causes
word-splitting; replace the current LANGUAGES=($(...)) pattern with a mapfile
approach that reads the pipeline output (echoing MAPPINGS_PAGE piped to grep -oP
... and sort -u) into the LANGUAGES array using mapfile -t from a process
substitution, and ensure subsequent uses of LANGUAGES are properly quoted; this
removes word-splitting, handles entries with spaces safely, and satisfies
ShellCheck SC2207.

LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
[ ${#LANGUAGES[@]} -eq 0 ] && { echo "❌ ERROR: No languages found"; exit 1; }

echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[@]}"
⚠️ Potential issue | 🟡 Minor

Fix shellcheck SC2145: Use proper array expansion syntax.

When expanding an array variable in a string context, use the @ descriptor with proper quoting to ensure each element is treated as a separate word.

-echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[@]}"
+echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[*]}"

Note: ${LANGUAGES[*]} joins array elements with spaces (more readable for output), while ${LANGUAGES[@]} expands each element as a separate word. For this echo statement, either is acceptable, but [*] is idiomatic for display.
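The difference between the two expansions is easy to see in a small standalone demo (the array contents here are illustrative):

```shell
# "${arr[*]}" joins elements into one space-separated string;
# "${arr[@]}" expands each element as a separate word.
LANGUAGES=(en de wikidata)

joined="${LANGUAGES[*]}"           # one string: "en de wikidata"

count_at=0
for item in "${LANGUAGES[@]}"; do  # iterates three separate words
    count_at=$((count_at + 1))
done

echo "joined: $joined"
echo "elements seen via [@]: $count_at"
```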

📝 Committable suggestion


Suggested change
-echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[@]}"
+echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[*]}"
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 18-18: Argument mixes string and array. Use * or separate argument.

(SC2145)

🤖 Prompt for AI Agents
In scripts/src/main/bash/stats-redirects-test.sh around line 18, the echo
expands the LANGUAGES array with ${LANGUAGES[@]} without proper quoting which
triggers shellcheck SC2145; change the expansion to a quoted form for display
such as "${LANGUAGES[*]}" (or use a separate quoted argument with
"${LANGUAGES[@]}") so each element is handled correctly and the array prints as
intended.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 066afc3 and f5d36b0.

📒 Files selected for processing (1)
  • server/server.default.properties (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: deploy
🔇 Additional comments (1)
server/server.default.properties (1)

110-123: Well-structured Wikidata extractor configuration.

The Wikidata extractors are properly organized with three extraction modes (default, mappings, custom) and thoughtful comments about namespace limitations. The commented-out extractors (WikidataLexemeExtractor, WikidataPropertyExtractor) with explanation is good defensive documentation.

@haniyakonain haniyakonain marked this pull request as draft October 24, 2025 19:09
haniyakonain added 5 commits November 1, 2025 02:58
Enable Wikidata as a language option in the server extraction interface
by adding the required statistics file and configuration.

Changes:
- Add wikidata extractors to server.default.properties
- Create mappingstats_wikidata.txt with initial statistics
Add Wikidata configuration with WikidataSameAsExtractor, WikidataRawExtractor,
WikidataLabelExtractor, WikidataDescriptionExtractor, WikidataAliasExtractor,
WikidataR2RExtractor, WikidataReferenceExtractor, WikidataNameSpaceSameAsExtractor,
and WikidataLLExtractor. Return HTTP 501 for Wikidata template statistics.
Initialize extractors at startup to prevent NPE
Add extractor configurations for Wikidata:
- mappings mode: WikidataLabelExtractor, WikidataSameAsExtractor
- custom mode: 9 extractors for comprehensive entity extraction
@haniyakonain haniyakonain force-pushed the enable-wikidata-server branch from 5eb208b to 16b545c on October 31, 2025 at 21:31
haniyakonain added 2 commits November 2, 2025 21:34
Add comprehensive test script to validate all Wikidata extractors
in the server GUI and integrate it into the CI/CD pipeline.

Changes:
- Add test-wikidata-extractors.sh script that tests each Wikidata
  extractor individually against Barack Obama (Q76) entity
- Update GitHub Actions workflow to run Wikidata extractor tests
  after coordinate tests

Test Details:
- Tests 9 Wikidata extractors: SameAs, Raw, Label, Description,
  Alias, R2R, Reference, NameSpaceSameAs, and LL extractors
- Uses revision ID 2424202969 for reproducible results
- Validates HTTP 200 response and checks for error keywords
- Provides colored pass/fail output with summary
- Returns proper exit codes for CI/CD integration

This ensures all configured Wikidata extractors are functional
when the server starts with Wikidata language support.
Fix non-parsable Wikidata dates by adding proper zero-padding
to month and day values in ISO 8601 format.

Before: "1961-8-4"^^xsd:date (invalid)
After: "1961-08-04"^^xsd:date (valid)

Changes:
- Update WikidataUtil.getValue() for TimeValue case
- Use Scala string formatting with %04d for year, %02d for month/day
- Fixes dates in all Wikidata extractors (R2R, Raw, Reference, etc.)
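The zero-padding described in this commit can be illustrated with printf-style format specifiers (shown here in shell for brevity; the actual fix uses Scala's equivalent string formatting):

```shell
# %04d left-pads the year to 4 digits, %02d pads month and day to 2 digits,
# producing a parsable ISO 8601 date from bare integer components.
year=1961 month=8 day=4
date_str=$(printf '%04d-%02d-%02d' "$year" "$month" "$day")
echo "$date_str"   # 1961-08-04
```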

sonarqubecloud bot commented Nov 2, 2025

