chore: enable Wikidata extraction in server GUI #792
base: master
Conversation
**Important:** Review skipped — draft detected. Please check the settings in the CodeRabbit UI. You can disable this status message by setting the … option.

📝 Walkthrough

Replaces explicit languages with …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Server.main
    participant Server as Server(instance)
    participant Extractor as DynamicExtractionManager
    Main->>Server: construct Server()
    Note right of Server: exposes public `val extractor`
    Server->>Extractor: instantiate DynamicExtractionManager(managers(...), languages, paths, ...)
    Server->>Extractor: extractor.updateAll()
    Note left of Extractor: triggers extraction update/refresh
```

```mermaid
sequenceDiagram
    participant Client as API client
    participant API as TemplateStatistics endpoint
    participant Server as Server(instance)
    Client->>API: GET /statistics/{lang}/
    alt lang == "wikidata"
        API-->>Client: HTTP 501 "Template mapping statistics are not available for Wikidata"
    else
        API->>Server: validate language & lookup manager
        Server-->>API: manager / stats
        API-->>Client: HTTP 200 + stats
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- server/server.default.properties (1 hunk)
- server/src/main/statistics/mappingstats_wikidata.txt (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)

205-209: Consider adding explicit error handling for consistency.

The initialization calls `updateAll` without a try-catch block, unlike the cache loader pattern at lines 62-73, which includes explicit error handling and logging. While the fail-fast behavior (the server won't start if initialization fails) is reasonable, adding explicit error handling would provide clearer error messages and be consistent with the existing error-handling pattern in the codebase. Example:

```diff
-// Initialize all extractors and statistics
-logger.info("Initializing extractors and statistics...")
-_instance.extractor.updateAll
-logger.info("Extractors and statistics initialized")
+// Initialize all extractors and statistics
+try {
+  logger.info("Initializing extractors and statistics...")
+  _instance.extractor.updateAll
+  logger.info("Extractors and statistics initialized")
+} catch {
+  case e: Exception =>
+    logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
+    e.printStackTrace()
+    throw e
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- server/server.default.properties (2 hunks)
- server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1 hunk)
- server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)
server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala (1)
- `updateAll` (64-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (3)
server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1)
90-94: LGTM! The guard correctly blocks Wikidata from template statistics with an appropriate 501 (Not Implemented) status. The check is properly placed before the managers validation to provide a more specific error message. This aligns with the fact that Wikidata uses a different structure (Q-identifiers) rather than templates.
server/server.default.properties (2)
21-21: Previous concern resolved.

The critical issue from the previous review has been addressed. Adding `wikidata` to the languages list ensures that the Wikidata extractors defined at lines 110-115 will be properly activated and the extraction endpoint will be available at `http://localhost:9999/server/extraction/wikidata/` as intended.
110-115: LGTM!

The Wikidata extractors configuration is comprehensive and includes appropriate extractors for Wikidata entities (labels, descriptions, aliases, references, etc.). The comment explaining why `WikidataLexemeExtractor` and `WikidataPropertyExtractor` are excluded provides valuable context about namespace handling limitations in live server extraction.
Force-pushed from d80f64e to 8b343aa (Compare)
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- server/server.default.properties (2 hunks)
- server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1 hunk)
- server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala (1 hunk)
- server/src/main/statistics/mappingstats_wikidata.txt (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- server/src/main/statistics/mappingstats_wikidata.txt
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/scala/org/dbpedia/extraction/server/Server.scala (1)
server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala (1)
- `updateAll` (64-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (1)
server/server.default.properties (1)
21-21: LGTM! Wikidata configuration properly implemented.

The changes correctly address the PR objective and the previous review feedback:

- ✅ Line 21: Added `wikidata` to the languages list, resolving the critical issue raised in the previous review
- ✅ Lines 110-124: Comprehensive Wikidata extractor configuration with:
  - Default extractors for general Wikidata extraction
  - Mappings mode for label and same-as extraction
  - Custom mode for advanced extraction options
  - Clear documentation about excluded extractors (line 115) that require special namespace handling
The configuration follows established patterns used for other languages and provides flexibility with multiple extraction modes.
Also applies to: 110-124
```scala
// Wikidata doesn't support template mapping statistics
if (language.wikiCode == "wikidata") {
  throw new WebApplicationException(new Exception("Template mapping statistics are not available for Wikidata"), 501)
}
```
🧩 Analysis chain
Verify Wikidata guard coverage across all endpoints.
The guard correctly prevents template statistics access for Wikidata in the HTML view. However, verify that other code paths properly handle Wikidata:
- JSON endpoint (line 288): The `getAsJson` method accesses `Server.instance.managers(language).holder` without a Wikidata guard. If a client requests `/statistics/wikidata/json/`, it will attempt to retrieve template statistics from the Wikidata manager.
- All-languages aggregation (line 46): The `allLanguages` method iterates through all managers including Wikidata and aggregates their template statistics. If Wikidata's manager contains no meaningful template statistics, this could produce misleading totals.
Run the following script to check if the JSON endpoint and all-languages view need similar guards:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check Wikidata handling in other statistics endpoints

# Check if getAsJson method has Wikidata guard
echo "=== Checking getAsJson method for Wikidata guard ==="
rg -n -A 10 -B 2 'def getAsJson' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala

# Check if allLanguages method has Wikidata filtering
echo "=== Checking allLanguages method for Wikidata filtering ==="
rg -n -A 5 -B 2 'for.*managers' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala
```

Length of output: 1137
🏁 Script executed:

```bash
#!/bin/bash
# Get full context of getAsJson method including else block and subsequent lines
echo "=== Full getAsJson method ==="
rg -n -A 30 'def getAsJson' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala | head -50

# Check Language.getOrElse to see if it filters Wikidata
echo -e "\n=== Search for Language handling and Wikidata checks ==="
rg -n -B 2 -A 2 'Language\.getOrElse|wikidata|wikiCode' server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala
```

Length of output: 1907
Add Wikidata guard to getAsJson method and verify allLanguages method.
The Wikidata guard in `singleLanguage` (lines 90-93) does not protect the JSON endpoint. The `getAsJson` method lacks this guard and will process Wikidata in both code paths:

- Wildcard request (`langCode="*"`): Line 278 iterates all managers, including Wikidata
- Specific language (`langCode="wikidata"`): Line 282 retrieves the language without checking whether it is Wikidata, then line 288 accesses its statistics

Additionally, the `allLanguages` method (line 46) iterates all managers without filtering Wikidata. Apply the same Wikidata guard logic (lines 90-93) to both `getAsJson` (lines 277-283) and `allLanguages` (line 46) to ensure consistent behavior across all endpoints.
🤖 Prompt for AI Agents
In
server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala
around lines 46 and 277-283, add the same Wikidata guard used in singleLanguage
(lines 90-93) so Wikidata is excluded from JSON and all-languages endpoints: in
allLanguages (line 46) filter out managers/languages whose wikiCode ==
"wikidata" before iterating, and in getAsJson (lines 277-283) check
language.wikiCode == "wikidata" and throw the same WebApplicationException(…,
501) when encountered (apply this both for wildcard iteration—skip Wikidata
managers—and for specific langCode="wikidata"—reject with the exception) to
ensure consistent handling.
```scala
// Initialize all extractors and statistics
logger.info("Initializing extractors and statistics...")
_instance.extractor.updateAll
logger.info("Extractors and statistics initialized")
```
Add error handling and consider timeout for blocking initialization.
The synchronous `updateAll` call lacks error handling, unlike the similar initialization pattern in `singleExtractorCache` (lines 62-72), which wraps initialization in a try-catch. If `updateAll` throws an exception, the server fails to start without clear logging of the root cause.
Additionally, this blocking call delays HTTP server startup (line 222). If extractor initialization takes a long time (e.g., loading large mapping files, downloading statistics), the server becomes unresponsive during startup.
Consider these improvements:
1. Add error handling:

```diff
 // Initialize all extractors and statistics
 logger.info("Initializing extractors and statistics...")
-_instance.extractor.updateAll
-logger.info("Extractors and statistics initialized")
+try {
+  _instance.extractor.updateAll
+  logger.info("Extractors and statistics initialized")
+} catch {
+  case e: Exception =>
+    logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
+    e.printStackTrace()
+    throw e
+}
```

2. Consider timeout or async initialization:
If initialization can take significant time, consider adding a timeout mechanism or making it asynchronous with proper status tracking, so the HTTP server can start responding to health checks while initialization completes in the background.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
// Initialize all extractors and statistics
logger.info("Initializing extractors and statistics...")
try {
  _instance.extractor.updateAll
  logger.info("Extractors and statistics initialized")
} catch {
  case e: Exception =>
    logger.severe(s"Failed to initialize extractors and statistics: ${e.getMessage}")
    e.printStackTrace()
    throw e
}
```
🤖 Prompt for AI Agents
In server/src/main/scala/org/dbpedia/extraction/server/Server.scala around lines
205 to 208, the synchronous call `_instance.extractor.updateAll` is unprotected
and can throw, and it also blocks HTTP startup; wrap the call in a try/catch to
log the full exception and fail-fast with a clear message, or better, make
initialization asynchronous: start the HTTP server immediately and run
`updateAll` in a background Future with configurable timeout and proper recovery
that logs errors and updates an initialization-ready flag or health endpoint; if
you prefer still-blocking behavior, use a configurable timeout (Await.result or
equivalent) and catch TimeoutException to log and handle it gracefully rather
than letting the server die silently.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/src/main/bash/stats-redirects-test.sh (1)
22-36: Consider explicit error handling for curl failures.

If curl encounters network errors or timeouts, the `http_code` variable may be empty or contain unexpected values. While the subsequent check will fail safely, adding explicit error detection improves robustness and debugging.

```diff
 test_endpoint() {
   local lang="$1"
   local type="$2"
   local url="$3"
-  http_code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
+  http_code=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null) || http_code="000"
   if [ "$http_code" = "200" ]; then
     echo "✅ $lang $type: HTTP $http_code"
   else
     echo "❌ $lang $type: HTTP $http_code"
     FAILED=$((FAILED + 1))
     FAILED_TESTS+=("$lang $type (HTTP $http_code)")
   fi
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/server-web-api-test.yml (3 hunks)
- scripts/src/main/bash/stats-redirects-test.sh (1 hunk)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
scripts/src/main/bash/stats-redirects-test.sh
[warning] 15-15: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
[error] 18-18: Argument mixes string and array. Use * or separate argument.
(SC2145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: deploy
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (1)
.github/workflows/server-web-api-test.yml (1)
19-19: LGTM: Workflow integration looks correct.

The changes properly integrate the new test step and configure the server to load all mapped languages:

- Line 31: `MAVEN_OPTS="-Daa.languages=@mappings"` enables dynamic language discovery including Wikidata
- Lines 54-57: New test step validates statistics and redirects endpoints for all discovered languages
- Step placement is correct: runs after coordinate test, before server shutdown
The workflow will now validate that Wikidata and other language endpoints are operational as part of the CI pipeline.
Also applies to: 31-31, 54-57
```bash
MAPPINGS_PAGE=$(curl -s "${BASE_URL}/mappings/")
[ -z "$MAPPINGS_PAGE" ] && { echo "❌ ERROR: Could not fetch mappings page"; exit 1; }

LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
```
Fix shellcheck SC2207: Properly handle array assignment from command output.
The unquoted command substitution in array assignment can cause word-splitting issues. Use mapfile for more robust handling.
```diff
-LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
+mapfile -t LANGUAGES < <(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u)
```

📝 Committable suggestion
```bash
mapfile -t LANGUAGES < <(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u)
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 15-15: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
🤖 Prompt for AI Agents
In scripts/src/main/bash/stats-redirects-test.sh around line 15, the array
assignment using unquoted command substitution can trigger SC2207 and causes
word-splitting; replace the current LANGUAGES=($(...)) pattern with a mapfile
approach that reads the pipeline output (echoing MAPPINGS_PAGE piped to grep -oP
... and sort -u) into the LANGUAGES array using mapfile -t from a process
substitution, and ensure subsequent uses of LANGUAGES are properly quoted; this
removes word-splitting, handles entries with spaces safely, and satisfies
ShellCheck SC2207.
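The `mapfile -t` pattern recommended above can be checked offline. A self-contained sketch, where the HTML snippet and the `grep`/`sed` pipeline are simplified stand-ins for the real mappings page and its `href` regex:

```shell
# Hypothetical HTML fragment standing in for the real /mappings/ page.
MAPPINGS_PAGE='<a href="en/">en</a><a href="wikidata/">wikidata</a><a href="de/">de</a>'

# mapfile -t reads one array element per line of the process substitution,
# with no word-splitting or globbing of the command output.
mapfile -t LANGUAGES < <(echo "$MAPPINGS_PAGE" | grep -o 'href="[^/"]*/"' | sed 's/href="//; s%/"%%' | sort -u)

echo "count: ${#LANGUAGES[@]}"   # → count: 3
echo "first: ${LANGUAGES[0]}"    # → first: de
```

Unlike `LANGUAGES=($(...))`, this stays correct even if an extracted entry ever contained whitespace.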
```bash
LANGUAGES=($(echo "$MAPPINGS_PAGE" | grep -oP 'href="\K[^/"]+(?=/">)' | sort -u))
[ ${#LANGUAGES[@]} -eq 0 ] && { echo "❌ ERROR: No languages found"; exit 1; }

echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[@]}"
```
Fix shellcheck SC2145: Use proper array expansion syntax.
When expanding an array variable inside a string, use the `*` subscript with proper quoting so the elements are joined into a single word.

```diff
-echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[@]}"
+echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[*]}"
```

Note: `${LANGUAGES[*]}` joins array elements with spaces (more readable for output), while `${LANGUAGES[@]}` expands each element as a separate word. For this echo statement either is acceptable, but `[*]` is idiomatic for display.
📝 Committable suggestion
```bash
echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[*]}"
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 18-18: Argument mixes string and array. Use * or separate argument.
(SC2145)
🤖 Prompt for AI Agents
In scripts/src/main/bash/stats-redirects-test.sh around line 18, the echo
expands the LANGUAGES array with ${LANGUAGES[@]} without proper quoting which
triggers shellcheck SC2145; change the expansion to a quoted form for display
such as "${LANGUAGES[*]}" (or use a separate quoted argument with
"${LANGUAGES[@]}") so each element is handled correctly and the array prints as
intended.
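The difference between the two expansions is easy to demonstrate with an element that contains a space (the language list below is a made-up example, not the real server output):

```shell
LANGUAGES=(en de "pt br" wikidata)

# "${arr[*]}" joins all elements with a space into ONE word — good for display.
count_star=0
for x in "${LANGUAGES[*]}"; do count_star=$((count_star + 1)); done

# "${arr[@]}" expands to one word PER element — good for iteration.
count_at=0
for x in "${LANGUAGES[@]}"; do count_at=$((count_at + 1)); done

echo "Found ${#LANGUAGES[@]} languages: ${LANGUAGES[*]}"
echo "words with [*]: $count_star, with [@]: $count_at"  # → words with [*]: 1, with [@]: 4
```

This is why ShellCheck flags `"${arr[@]}"` inside a larger string: it silently splits the string into several arguments.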
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- server/server.default.properties (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
🔇 Additional comments (1)
server/server.default.properties (1)
110-123: Well-structured Wikidata extractor configuration.

The Wikidata extractors are properly organized into three extraction modes (default, mappings, custom) with thoughtful comments about namespace limitations. Keeping the excluded extractors (WikidataLexemeExtractor, WikidataPropertyExtractor) commented out with an explanation is good defensive documentation.
Enable Wikidata as a language option in the server extraction interface by adding the required statistics file and configuration.

Changes:
- Add wikidata extractors to server.default.properties
- Create mappingstats_wikidata.txt with initial statistics
Add Wikidata configuration with WikidataSameAsExtractor, WikidataRawExtractor, WikidataLabelExtractor, WikidataDescriptionExtractor, WikidataAliasExtractor, WikidataR2RExtractor, WikidataReferenceExtractor, WikidataNameSpaceSameAsExtractor, and WikidataLLExtractor. Return HTTP 501 for Wikidata template statistics. Initialize extractors at startup to prevent NPE
Add extractor configurations for Wikidata:
- mappings mode: WikidataLabelExtractor, WikidataSameAsExtractor
- custom mode: 9 extractors for comprehensive entity extraction
Force-pushed from 5eb208b to 16b545c (Compare)
Add comprehensive test script to validate all Wikidata extractors in the server GUI and integrate it into the CI/CD pipeline.

Changes:
- Add test-wikidata-extractors.sh script that tests each Wikidata extractor individually against the Barack Obama (Q76) entity
- Update GitHub Actions workflow to run Wikidata extractor tests after coordinate tests

Test details:
- Tests 9 Wikidata extractors: SameAs, Raw, Label, Description, Alias, R2R, Reference, NameSpaceSameAs, and LL extractors
- Uses revision ID 2424202969 for reproducible results
- Validates HTTP 200 responses and checks for error keywords
- Provides colored pass/fail output with a summary
- Returns proper exit codes for CI/CD integration

This ensures all configured Wikidata extractors are functional when the server starts with Wikidata language support.
Fix non-parsable Wikidata dates by adding proper zero-padding to month and day values in ISO 8601 format.

Before: "1961-8-4"^^xsd:date (invalid)
After: "1961-08-04"^^xsd:date (valid)

Changes:
- Update WikidataUtil.getValue() for the TimeValue case
- Use Scala string formatting with %04d for year, %02d for month/day
- Fixes dates in all Wikidata extractors (R2R, Raw, Reference, etc.)
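The padding behavior described in this commit can be illustrated with printf-style format strings. A shell sketch mirroring the `%04d`/`%02d` formatting the commit applies in Scala (the sample date is taken from the commit message):

```shell
year=1961; month=8; day=4

# Zero-pad the components into a parsable xsd:date literal.
padded=$(printf '%04d-%02d-%02d' "$year" "$month" "$day")
unpadded="$year-$month-$day"

echo "before: $unpadded"  # → before: 1961-8-4
echo "after:  $padded"    # → after:  1961-08-04
```

Without the padding, XML Schema date parsers reject the literal, since xsd:date requires exactly two digits for month and day.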
Enable Wikidata by adding 'wikidata' to the languages configuration.
Setup:
Example page titles to extract:
Summary by CodeRabbit
New Features
Improvements
Tests
Limitations