
Conversation

@datalogism
Member

@datalogism datalogism commented Dec 6, 2022

First prototype of DBpedia History

Summary by CodeRabbit

  • New Features

    • REST-based NIF extraction path.
    • New History extraction module producing HistoryData and HistoryStats (standard and Spark jobs).
    • Sample/minidump generation scripts for testing and demos.
  • Improvements

    • More robust link parsing and HTML cleaning in NIF extraction.
    • Expanded MediaWiki connection options; default parallelism reduced for stability.
    • Updated template mappings for English, French, and Hungarian.
  • Documentation

    • Added History module README and abstract test guide.
  • Tests

    • New end-to-end extraction tests and utilities.

celian_ringwald and others added 30 commits September 14, 2022 17:58
datalogism and others added 12 commits December 8, 2022 09:40
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
@sonarqubecloud

sonarqubecloud bot commented Dec 8, 2022

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 0 (rating A)

No Coverage information
No Duplication information

Contributor

@TallTed TallTed left a comment


Minor...

datalogism and others added 3 commits January 6, 2023 13:07
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
@sonarqubecloud

sonarqubecloud bot commented Jan 6, 2023

Please retry analysis of this Pull-Request directly on SonarCloud.

1 similar comment
@sonarqubecloud

sonarqubecloud bot commented Jan 6, 2023

Please retry analysis of this Pull-Request directly on SonarCloud.

@JJ-Author
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Oct 9, 2025

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Oct 9, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Adds a new History module for revision-based extraction (config, sources, extractors, Spark pipeline, recorder/monitor), introduces REST-based NIF extraction path with new MediaWiki connectors, updates configurations (including MediaWikiConnection signature and defaults), extends datasets (history), adds dump test tooling/scripts, and enriches multiple mapping files.

Changes

Cohort / File(s) Summary
NIF + MediaWiki connectors
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java, core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala, core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector*.scala
Adds REST-aware link parsing and NIF extraction path; introduces abstract/configured/REST MediaWiki connectors; updates HtmlNifExtractor flow and visibility tweaks; deprecates old connector; adds REST-specific extractor.
Core config updates
core/src/main/scala/org/dbpedia/extraction/config/Config.scala
Expands MediaWikiConnection case class and construction (apiType and HTTP params); changes parallelProcesses default to 1.
Datasets & provenance
core/src/main/resources/datasetdefinitions.json, core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala
Adds history datasets: history_dataset and history_stats, with corresponding provenance entries.
Dump test tooling
dump/src/test/bash/*, dump/src/test/resources/extraction-configs/*, dump/src/test/scala/.../ExtractionTestAbstract.*
Adds scripts to create samples/minidumps, updates test extraction configs (REST/MWC params, abstracts-only), and a new abstract extraction test with logging.
History module: setup & configs
history/ReadMe.md, history/pom.xml, history/download.properties, history/extraction.properties, history/src/test/resources/extraction-configs/extraction.config.properties
Introduces History module POM, docs, and properties for downloads and extraction.
History parsing & sources
history/src/main/java/.../WikipediaDumpParserHistory.java, history/src/main/scala/.../sources/Source2.scala, .../sources/XMLSource2.scala
Adds history-capable dump parser and new Source2 abstractions for single/multi-file XML reading.
History configuration
history/src/main/scala/org/dbpedia/extraction/config/Config2.scala
New Config2 with structured properties (NIF/Abstract parameters, MediaWikiConnection, languages, datasets, etc.).
History extraction pipeline
history/src/main/scala/.../dump/extract/ConfigLoader2.scala, .../Extraction2.scala, .../ExtractionJob2.scala, .../SparkExtractionJob2.scala, .../SerializableUtils2.scala, .../destinations/WriterDestination2.scala
Adds job loaders, orchestrators (sequential/parallel/Spark), serialization helpers, and a writer destination with recording.
History model & extractors
history/src/main/scala/.../wikiparser/WikiPageWithRevisions.scala, .../wikiparser/RevisionNode.scala, .../mappings/WikiPageWithRevisionsExtractor.scala, .../mappings/HistoryPageExtractor.scala, .../mappings/HistoryStatsExtractor.scala, .../mappings/RedirectExtractor2.scala, .../mappings/Composite*2.scala, .../mappings/Redirects2.scala, .../mappings/RevisionNode.scala, .../mappings/RevisionNodeExtractor.scala, .../mappings/ExtractionMonitor2.scala, .../util/DumpExtractionContext2.scala, .../util/SparkExtractionContext2.scala, .../util/ExtractionRecorder2.scala
Introduces revision/page models, history/statistics extractors, redirects handling, composite extractors, monitoring and recording utilities, and context traits.
History tests
history/src/test/scala/.../ExtractionTest.scala
Adds a ScalaTest spec to run history extraction via Config2/ConfigLoader2.
Mappings updates
mappings/Mapping_en.xml, mappings/Mapping_fr.xml, mappings/Mapping_hu.xml, mappings/Mapping_el.xml
Substantial additions/refinements to template-to-ontology mappings across EN/FR/HU/EL.
Root build meta
pom.xml
Adds commented history module entry; no functional build change.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as NifExtractor
  participant Conf as Config.mediawikiConnection
  participant REST as MediaWikiConnectorRest
  participant MWC as MediawikiConnectorConfigured
  participant ExtrR as WikipediaNifExtractorRest
  participant Extr as WikipediaNifExtractor

  alt apiType == "rest"
    Runner->>REST: retrievePage(title, params)
    REST-->>Runner: html opt or error
    Runner->>ExtrR: extract(html)
    ExtrR-->>Runner: NIF sections/triples
  else non-REST
    Runner->>MWC: retrievePage(title, params)
    MWC-->>Runner: html opt or error
    Runner->>Extr: extract(html)
    Extr-->>Runner: NIF sections/triples
  end
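
For readers who prefer code to diagrams, here is a minimal Scala sketch of the same dispatch. The class names mirror those in the walkthrough, but every signature below is a simplified assumption for illustration, not the actual connector/extractor API added by this PR.

```scala
// Illustrative only: simplified stand-ins for the connectors/extractors named above.
trait PageConnector {
  def retrievePage(title: String, params: String): Option[String] // HTML, or None on error
}

class MediaWikiConnectorRest extends PageConnector {
  def retrievePage(title: String, params: String): Option[String] =
    Some(s"<html><!-- REST HTML for $title --></html>") // placeholder for the REST call
}

class MediawikiConnectorConfigured extends PageConnector {
  def retrievePage(title: String, params: String): Option[String] =
    Some(s"<html><!-- action-API HTML for $title --></html>") // placeholder for the MWC call
}

object NifDispatchSketch {
  // Pick the connector (and, in the real code, the matching extractor)
  // based on the configured apiType, as in the diagram above.
  def extract(apiType: String, title: String, params: String): Seq[String] = {
    val connector: PageConnector =
      if (apiType == "rest") new MediaWikiConnectorRest else new MediawikiConnectorConfigured
    connector.retrievePage(title, params) match {
      case Some(html) => Seq(s"nif-section(${html.length} chars)") // stands in for NIF triples
      case None       => Seq.empty                                 // error path
    }
  }

  def main(args: Array[String]): Unit =
    println(extract("rest", "Main_Page", "redirect=true"))
}
```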
sequenceDiagram
  autonumber
  participant Main as Extraction2
  participant Cfg as Config2
  participant Loader as ConfigLoader2
  participant Job as ExtractionJob2
  participant Src as Source2/XMLSource2
  participant Ext as CompositeParseExtractor2
  participant Dest as WriterDestination2
  participant Rec as ExtractionRecorder2

  Main->>Cfg: load(args)
  Main->>Loader: new(config)
  Loader-->>Main: jobs (by language)
  loop per job
    Main->>Job: run()
    Job->>Src: foreach(page)
    Src-->>Job: WikiPageWithRevisions
    Job->>Ext: extract(page, subjectUri)
    Ext-->>Job: quads
    Job->>Dest: write(quads)
    Job->>Rec: record page/quad status
  end
  Main-->>Main: wait for all jobs
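
As a rough Scala sketch of the loop above (all types here are simplified stand-ins for Config2, ConfigLoader2, ExtractionJob2 and friends, not their real signatures):

```scala
// Simplified stand-ins; the real classes added in this PR are richer.
case class Quad(subject: String, predicate: String, obj: String)
case class WikiPageWithRevisions(title: String, revisionCount: Int)

object HistoryJobSketch {
  // One ExtractionJob2-style run: extract quads per page, write them, record progress.
  def runJob(pages: Seq[WikiPageWithRevisions],
             extract: WikiPageWithRevisions => Seq[Quad],
             write: Seq[Quad] => Unit,
             record: (String, Int) => Unit): Unit =
    for (page <- pages) {
      val quads = extract(page)      // CompositeParseExtractor2 role
      write(quads)                   // WriterDestination2 role
      record(page.title, quads.size) // ExtractionRecorder2 role
    }

  def main(args: Array[String]): Unit =
    runJob(
      pages   = Seq(WikiPageWithRevisions("Page_A", 12)),
      extract = p => Seq(Quad(p.title, "dc:modified", p.revisionCount.toString)),
      write   = qs => qs.foreach(println),
      record  = (title, n) => println(s"recorded $title: $n quads")
    )
}
```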

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180+ minutes

Poem

A hop through history, ears held high,
We wired new REST to catch the sky.
With quads in paws and maps refined,
We sift the past, by dumps aligned.
Spark trails glow, connectors sing—
Thump goes progress: spring-spring-spring! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.05%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title “Dbpedia History” clearly identifies the main focus of this pull request as introducing DBpedia’s History functionality and is concise without unnecessary details.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 53

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mappings/Mapping_hu.xml (1)

265-266: Fix incorrect ontology targets for church founding metadata.

alapítva now writes to dbo:publicationDate and alapító to dbo:bodyDiscovered, which will pollute those properties with church founding data. Please map them to the standard founding predicates instead.

-        {{PropertyMapping | templateProperty = alapítva | ontologyProperty = publicationDate }}
-        {{PropertyMapping | templateProperty = alapító | ontologyProperty = bodyDiscovered }}
+        {{PropertyMapping | templateProperty = alapítva | ontologyProperty = foundingDate }}
+        {{PropertyMapping | templateProperty = alapító | ontologyProperty = founder }}
🧹 Nitpick comments (39)
dump/src/test/bash/createSampleRandomFromPageIDdataset.sh (1)

21-22: Quote parameters passed to external commands.

$file, $n, and the constructed output name are expanded unquoted. Any whitespace or glob characters in the file path or generated list name will break the pipeline or trigger unintended globbing. Quote those substitutions before passing them to grep, shuf, and the final redirection target.

-grep -v "resource\/\w*\:" $file > temp.txt
-shuf -n $n temp.txt | ... > uri_sample_random_${lang}_${n}.lst
+grep -v "resource/\w*:" "$file" > temp.txt
+shuf -n "$n" temp.txt | ... > "uri_sample_random_${lang}_${n}.lst"
pom.xml (1)

64-65: History module is commented out—clarify activation plan

If the history module should build in CI, uncomment it or gate it behind a Maven profile (e.g., -P history). Otherwise, document why it’s intentionally disabled to avoid confusion.

history/ReadMe.md (7)

3-3: Tighten intro sentence

Minor copy edit and trailing space.

-DBpedia History enables the history of a Wikipedia chapter to be extracted into an RDF format 
+DBpedia History enables the history of a Wikipedia chapter to be extracted into RDF.

10-15: Avoid bare URLs; format citations cleanly (MD034)

Replace separate bare URLs with Markdown links.

-Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data. IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI' 16), Oct 2016, Omaha, United States. <hal-01359575>
-https://hal.inria.fr/hal-01359575
-
-Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Materializing the editing history of Wikipedia as linked data in DBpedia. ISWC 2016 - 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. <http://iswc2016.semanticweb.org/>. <hal-01359583>
-https://hal.inria.fr/hal-01359583
+– [Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data (WI'16)](https://hal.inria.fr/hal-01359575) — Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin.
+
+– [Materializing the editing history of Wikipedia as linked data in DBpedia (ISWC 2016)](https://hal.inria.fr/hal-01359583) — Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin.

36-43: Fix typos, grammar, and list indentation (MD007)

Tighten phrasing and remove extra indent spaces before bullets.

-  * Extract all revisions of every Wikipedia page
-  * Use the foaf, xsd, rdf, prov, dc, sioc ontologies 
-  * Describre each revisions of each page, the content / date / size / importance of that revision, the author of this one and the delta with the last version of the page updated by this one
-  * the id of the user are based depending on what is available : ip / nickname or the wikipedia id
+* Extract all revisions of every Wikipedia page
+* Use the FOAF, XSD, RDF, PROV, DC, and SIOC ontologies
+* Describe each revision of each page: content, date, size, importance, author, and the delta vs. the previous revision
+* The user identifier is based on what is available: IP address, username, or Wikipedia user ID

44-50: Clarify wording in HistoryStats bullets (grammar, MD007)

Improve list grammar and remove extra indent spaces.

-  * Extract statistics about revision activity for every page of Wikipedia : 
-    * number of revision per year / months 
-    * avg size of revision per year / month
-    * number of unique contribution
-  * This exctraction add some computation and could be not necessary
-  * Use dc, rdf, rdfs ontologies
+* Extract statistics about revision activity for every Wikipedia page:
+  * number of revisions per year/month
+  * average revision size per year/month
+  * number of unique contributors
+* This extraction adds additional computation and may be optional
+* Uses DC, RDF, and RDFS ontologies

53-64: Minor heading/punctuation fixes

Standardize headings and code formatting.

-## How to run it ? 
+## How to run it?
@@
-* and run  ```../run download download.properties```
+* and run: `../run download download.properties`
@@
-* and run  ```../run run extraction.properties```
+* and run: `../run run extraction.properties`

71-86: Specify language for fenced code block (MD040)

These are N-Triples; add a language hint for better rendering.

-```
+```ntriples
 ...

88-182: Specify language for fenced code block (MD040)

Same for the stats example.

-```
+```ntriples
 ...

dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.scala (2)

97-101: Remove unused debug variable/print

`parallelProcesses` isn’t used; drop it and the println.

```diff
-    val parallelProcesses = 1
-    println(parallelProcesses)
+    // parallelism is determined by job config; keep output minimal in tests

140-141: Don’t swallow exceptions silently

Log or record the exception to aid debugging.

-      } catch {
-        case e: Exception =>  None
-      }
+      } catch {
+        case e: Exception =>
+          println(s"[warn] Failed to list or aggregate failed pages for ${lang.wikiCode}: ${e.getMessage}")
+      }
history/src/main/scala/org/dbpedia/extraction/dump/extract/SparkExtractionJob2.scala (3)

49-50: Remove stray debug print

Leftover println adds noise in Spark driver logs.

-    println("SPARKKKK")

70-73: Use consistent composite extractor implementation

File uses CompositeParseExtractor here and CompositeParseExtractor2 later; unify to the intended variant.

-      val extractor = CompositeParseExtractor.load(extractors, context)
+      val extractor = CompositeParseExtractor2.load(extractors, context)

211-218: Avoid external bash for concat if possible

Relying on an external script can break in cluster environments. Consider merging via Hadoop/Spark APIs (e.g., coalesce + save) or Java NIO, with retries and logging.
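
One possible shape for such a merge, sketched with java.nio; the directory layout and "part-" naming are assumptions, not the paths the real job writes:

```scala
// Merge Spark part-files without shelling out to bash. Illustrative sketch only.
import java.io.File
import java.nio.file.{Files, Paths, StandardOpenOption}

object PartFileMergeSketch {
  def mergePartFiles(partsDir: String, target: String): Unit = {
    // list and order the part files; an absent directory yields an empty merge
    val parts = Option(new File(partsDir).listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
    val out = Files.newOutputStream(Paths.get(target),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
    try parts.foreach(p => Files.copy(p.toPath, out)) // append each part in order
    finally out.close()
  }
}
```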

core/src/main/resources/datasetdefinitions.json (1)

629-642: Update "history" dataset names and descriptions

     "history_dataset": {
-      "name": "History Links",
+      "name": "History Data",
       "traits":"LinkedData, Published",
-      "desc": "All data related to history",
+      "desc": "Per-revision history data extracted from Wikipedia",
       "defaultgraph": "dataset"
     },
     "history_stats": {
       "name": "History Stats",
       "traits":"LinkedData, Published",
-      "desc": "Statistics related to edition statistics",
+      "desc": "Aggregated editing statistics per page (e.g., revisions per year/month, average sizes, unique contributors)",
       "defaultgraph": "dataset"
     }

Scala constants HistoryData and HistoryStats already map to these keys.

history/download.properties (2)

4-4: Remove commented debug/test URL.

Line 4 contains a commented-out URL that appears to be for testing/debugging. This should be removed from production configuration files.

-#https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Exporter&action=submit&history=1&pages=H%C3%B4tes_de_passage%0ARaymond_Beaudet

15-15: Document or parameterize the hardcoded dump date.

The dump date is hardcoded to 20221001, which will become outdated. The comment on line 14 states "if empty: the most recent dump-date is used."

Consider one of the following:

  1. Leaving it empty to automatically use the most recent dump
  2. Adding documentation explaining why this specific date is required
  3. Making it configurable via an environment variable or build parameter (sketched below)
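
A tiny sketch of the third option, assuming the date is read from the environment; the variable name is hypothetical:

```scala
// Hypothetical: fall back to "" so the downloader picks the most recent dump-date.
val dumpDate: String = sys.env.getOrElse("DBPEDIA_DUMP_DATE", "")
```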
history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1)

7-7: Address or clarify the TODO comment.

The TODO comment "generic type may not be optimal" lacks context. Either:

  1. Elaborate on specific concerns with the generic type approach
  2. Remove the TODO if the current design is acceptable
  3. File an issue to track the concern if it's deferred work

If the generic type is working as intended, consider replacing with a clarifying comment:

-  * TODO: generic type may not be optimal.
+  * Generic composite extractor that aggregates multiple extractors of the same input type.
+  * Type parameter N allows composition of extractors for different node types (WikiPageWithRevisions, RevisionNode, etc.).
history/src/main/scala/org/dbpedia/extraction/mappings/CompositeWikiPageWithRevisionExtractor.scala (1)

5-7: Align extends/with indentation.

Scala convention keeps extends/with indented under the class header for readability. Please indent these two lines to match surrounding code.

dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties (1)

34-34: Trailing space in languages value

Trim to avoid parsing surprises in loaders that don’t auto-trim.

-languages=fr 
+languages=fr
history/src/main/scala/org/dbpedia/extraction/mappings/RedirectExtractor2.scala (1)

28-30: Remove unused namespaces val

Declared but never used. Drop to reduce noise.

-  private val namespaces = if (language == Language.Commons) ExtractorUtils.commonsNamespacesContainingMetadata
-    else Set(Namespace.Main, Namespace.Template, Namespace.Category)
+  // Intentionally no namespace set here; see extract() filter.
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala (2)

93-96: Link section relationships (top/sub and prev/next) like base extractor

Parity with WikipediaNifExtractor improves consumers relying on navigation.

-              tocMap.append(section)
+              tocMap.append(section)
+              section.top match {
+                case Some(s) => s.sub = Option(section)
+                case None => ()
+              }
+              section.prev match {
+                case Some(s) => s.next = Option(section)
+                case None => ()
+              }

45-47: Readability: avoid shadowing parameter name in lambda

Rename the lambda variable to prevent confusion with method param currentNodes.

-      subnodes = subnodes.dropWhile(currentNodes => !currentNodes.nodeName().matches("h\\d") && !currentNodes.nodeName().matches("section"))
+      subnodes = subnodes.dropWhile(n => !n.nodeName().matches("h\\d") && !n.nodeName().matches("section"))
history/src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala (3)

49-57: Avoid reparsing timestamp twice.

getYearMonth parses the ISO timestamp twice. Cache the ZonedDateTime locally (or reuse the instance from getYear) to avoid the duplicate parse and associated exception risk when the timestamp is malformed twice.


60-64: Tighten boolean parsing.

parseBoolean treats any non-empty string as true, so "True" or "yes" behave differently than "false", and typos silently coerce to true. Consider normalizing with str.equalsIgnoreCase("true") (and trimming) so unexpected inputs yield false (or throw).


66-73: Return Option instead of sentinel -1.

Returning -1 for missing numeric fields leaks sentinel values into downstream logic. Prefer Option[Int] / Option[Long] or throw on malformed input to keep the API type-safe.
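
A sketch combining the three suggestions above (cached timestamp parse, strict boolean parsing, Option instead of -1). The field names are assumptions based on the review text, not the actual RevisionNode definition.

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

class RevisionFieldsSketch(timestamp: String, deletedRaw: String, sizeRaw: String) {
  // parse the ISO timestamp once and reuse it for both year and year-month
  private lazy val parsed: Option[ZonedDateTime] =
    Try(ZonedDateTime.parse(timestamp, DateTimeFormatter.ISO_DATE_TIME)).toOption

  def getYear: Option[Int]         = parsed.map(_.getYear)
  def getYearMonth: Option[String] = parsed.map(d => f"${d.getYear}%04d-${d.getMonthValue}%02d")

  // only a literal "true" (trimmed, case-insensitive) counts as true
  def isDeleted: Boolean = deletedRaw.trim.equalsIgnoreCase("true")

  // Option instead of a -1 sentinel for malformed or missing sizes
  def textSize: Option[Long] = Try(sizeRaw.trim.toLong).toOption
}
```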

history/src/main/scala/org/dbpedia/extraction/mappings/ExtractionMonitor2.scala (2)

135-141: Return numeric successful count, not AtomicLong

Summaries generally expect values, not mutables.

Apply:

-    summary.put("SUCCESSFUL", s)
+    summary.put("SUCCESSFUL", Long.box(s.get()))

183-185: Replace printStackTrace with proper logging

Avoid printing to stdout in libraries.

Use a logger (java.util.logging or slf4j) with context and stack trace:

// logger.log(Level.WARNING, s"DATAID compare failed for $dataIDUrl", ex)
history/src/main/scala/org/dbpedia/extraction/config/Config2.scala (1)

260-281: mediawikiConnection: handle unknown mwc-type explicitly

Unknown values cause a MatchError that is wrapped and rethrown. Prefer explicit validation for clearer error messages.

Example:

-      apiUrl = this.getProperty("mwc-type").trim match {
+      apiUrl = this.getProperty("mwc-type", "").trim match {
         case "rest" =>  this.getProperty("mwc-apiRestUrl", "").trim
         case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
         case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
+        case other => throw new IllegalArgumentException(s"Unsupported mwc-type='$other'. Expected one of: rest|mwc|local.")
       },
history/src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala (2)

50-57: Typo: isRetryy -> isRetry

Minor naming nit for consistency.

Apply:

-  private var isRetryy = false
+  private var isRetry = false
...
-    this.isRetryy = !this.isRetryy
+    this.isRetry = !this.isRetry
...
-  def isRetry: Boolean = this.isRetryy
+  def isRetry: Boolean = this.isRetry

96-98: Avoid mapValues (lazy view) and return strict maps

mapValues returns a view in newer Scala; compute eagerly.

Apply:

-  def getRevPerYear: Map[String,Int] = {
-    this.revisions.groupBy(_.getYear).mapValues(_.size)
-  }
+  def getRevPerYear: Map[String,Int] =
+    this.revisions.groupBy(_.getYear).map { case (k, v) => k -> v.size }.toMap
...
-  def getRevPerYearMonth: Map[String, Int] = {
-     this.revisions.groupBy(_.getYearMonth).mapValues(_.size)
-  }
+  def getRevPerYearMonth: Map[String, Int] =
+    this.revisions.groupBy(_.getYearMonth).map { case (k, v) => k -> v.size }.toMap
...
-  def getRevPerYearMonthAvgSize: Map[String, Int] = {
-     this.revisions.groupBy(_.getYearMonth).map {
-      case (candidate, group) =>
-        candidate -> group.map { _.text_size }.sum / group.size
-    }
-  }
+  def getRevPerYearMonthAvgSize: Map[String, Int] =
+    this.revisions.groupBy(_.getYearMonth).map { case (k, g) => k -> (g.map(_.text_size).sum / g.size) }.toMap

Also applies to: 107-109, 111-116

core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala (2)

60-67: Parameter construction/escaping is fragile; prefer proper URL-encoding and clearer templating.

Manual replacements on title and ad-hoc replace/format on apiParameterString are error-prone. Use URLEncoder.encode for values and build parameters with a small helper to avoid mismatched %s/%d templates and accidental deletions.

  • Build a Map[String,String] of parameters (uselang, page or pageid, maxlag).
  • URL-encode only values with UTF-8.
  • Join with '&' to a body string.
    I can provide a helper if you want it in this PR.

Also applies to: 50-53
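
A minimal sketch of such a helper, assuming the connector only needs key/value pairs URL-encoded and joined into a request body (parameter names in the example are illustrative):

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object QueryBuilderSketch {
  // Encode keys and values with UTF-8 and join with '&'; no %s/%d templating needed.
  def buildQuery(params: Map[String, String]): String =
    params
      .map { case (k, v) =>
        URLEncoder.encode(k, StandardCharsets.UTF_8.name()) + "=" +
          URLEncoder.encode(v, StandardCharsets.UTF_8.name())
      }
      .mkString("&")

  def main(args: Array[String]): Unit =
    println(buildQuery(Map("uselang" -> "fr", "page" -> "Hôtes de passage", "maxlag" -> "5")))
}
```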


26-28: Remove unused vars and null Try; simplify.

  • availableProcessors, gzipok are unused.
  • parsedAnswer should not be initialized to null.

Apply:

-  private val osBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean
-  private val availableProcessors = osBean.getAvailableProcessors
+  // remove unused osBean/availableProcessors

-    var gzipok = true
-    var parsedAnswer: Try[String] = null
+    var parsedAnswer: Try[String] = Failure(new IllegalStateException("uninitialized"))

Also drop gzipok and guard gzip by Content-Encoding as suggested above.

Also applies to: 41-43, 107-115

history/src/main/scala/org/dbpedia/extraction/util/ExtractionRecorder2.scala (4)

220-223: Don’t increment “successful pages” for quads; use a dedicated triples counter.

recordQuad invokes increaseAndGetSuccessfulPages, skewing page stats. Use the existing successfulTripleCount keyed by quad.dataset.

Apply:

-  def recordQuad(quad: Quad, severity: RecordSeverity.Value, lang:Language): Unit = synchronized {
-    if(increaseAndGetSuccessfulPages(lang) % reportInterval == 0)
+  def recordQuad(quad: Quad, severity: RecordSeverity.Value, lang: Language): Unit = synchronized {
+    val triples = increaseAndGetSuccessfulTriples(quad.dataset)
+    if (triples % reportInterval == 0)
       printLabeledLine("processed {page} quads; {mspp} per quad; {fail} failed quads", severity, lang)
   }

Optionally adapt placeholders to reflect triple count and ms/triple.


323-339: Avoid finalize; provide explicit close and implement AutoCloseable.

finalize is deprecated and unreliable. Expose a close() (or closeWriter()) and let callers use try/finally or Loan pattern.

Suggested change:

-  override def finalize(): Unit ={
-    if(writerOpen){
-      logWriter.close()
-      writerOpen = false
-    }
-    ...
-    super.finalize()
-  }
+  def close(): Unit = synchronized {
+    if (writerOpen) {
+      try logWriter.close() finally writerOpen = false
+    }
+  }

Then call recorder.close() from the orchestration layer when finishing.


468-481: Set Content-Type when posting Slack JSON; optionally add timeouts.

Slack expects application/json; missing header may fail depending on gateway.

Apply:

-      val resp = Http(url).postData(new String(baos.toByteArray, Charset.defaultCharset())).asString
+      val resp = Http(url)
+        .header("Content-Type", "application/json; charset=UTF-8")
+        .timeout(connTimeoutMs = 5000, readTimeoutMs = 10000)
+        .postData(new String(baos.toByteArray, Charset.forName("UTF-8")))
+        .asString

126-151: Minor: remove unused val and simplify branch.

val msg is unused; just execute side effects.

Apply:

-        case _  =>
-          val msg = Option(record.errorMsg) match{
-            case Some(m) => printLabeledLine(m, record.severity, record.language)
-            case None =>
-              if(record.error != null) failedRecord(null, record.page, record.error, record.language)
-              else recordGenericPage(record.language, record.page.toString)
-          }
+        case _  =>
+          Option(record.errorMsg) match {
+            case Some(m) => printLabeledLine(m, record.severity, record.language)
+            case None =>
+              if (record.error != null) failedRecord(null, record.page, record.error, record.language)
+              else recordGenericPage(record.language, record.page.toString)
+          }
core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorAbstract.scala (2)

8-11: Remove unused imports.

OutputStreamWriter, URL, ChronoUnit are unused in this abstract class.

Apply:

-import java.io.{InputStream, OutputStreamWriter}
-import java.net.{HttpURLConnection, URL}
-import java.time.temporal.ChronoUnit
+import java.io.InputStream

42-44: Drop unused availableProcessors.

These fields aren’t used; dead code.

Apply:

-  private val osBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean
-  private val availableProcessors = osBean.getAvailableProcessors
+  // removed unused OS bean/utilization fields
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 284bafc and f97dafa.

⛔ Files ignored due to path filters (3)
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-history-dataset.ttl.bz2 is excluded by !**/*.bz2
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-history-stats.ttl.bz2 is excluded by !**/*.bz2
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-template-redirects.obj is excluded by !**/*.obj
📒 Files selected for processing (57)
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (8 hunks)
  • core/src/main/resources/datasetdefinitions.json (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/config/Config.scala (3 hunks)
  • core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala (3 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala (2 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala (4 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorAbstract.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala (1 hunks)
  • dump/src/test/bash/createMinidump_custom_sample.sh (1 hunks)
  • dump/src/test/bash/createSampleRandomFromPageIDdataset.sh (1 hunks)
  • dump/src/test/bash/create_custom_sample.sh (1 hunks)
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties (3 hunks)
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties (3 hunks)
  • dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.md (1 hunks)
  • dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.scala (1 hunks)
  • history/ReadMe.md (1 hunks)
  • history/download.properties (1 hunks)
  • history/extraction.properties (1 hunks)
  • history/pom.xml (1 hunks)
  • history/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParserHistory.java (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/config/Config2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/SerializableUtils2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/SparkExtractionJob2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeParseExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeWikiPageWithRevisionExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/ExtractionMonitor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/HistoryPageExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/HistoryStatsExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RedirectExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/Redirects2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNode.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNodeExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/WikiPageWithRevisionsExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/sources/Source2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/sources/XMLSource2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/DumpExtractionContext2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/ExtractionRecorder2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/SparkExtractionContext2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala (1 hunks)
  • history/src/test/resources/extraction-configs/extraction.config.properties (1 hunks)
  • history/src/test/scala/org/dbpedia/extraction/dump/extract/ExtractionTest.scala (1 hunks)
  • mappings/Mapping_el.xml (1 hunks)
  • mappings/Mapping_en.xml (30 hunks)
  • mappings/Mapping_fr.xml (31 hunks)
  • mappings/Mapping_hu.xml (3 hunks)
  • pom.xml (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
history/ReadMe.md

11-11: Bare URL used

(MD034, no-bare-urls)


14-14: Bare URL used

(MD034, no-bare-urls)


38-38: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


39-39: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


40-40: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


41-41: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


44-44: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


45-45: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


46-46: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


47-47: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


48-48: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


49-49: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


89-89: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
dump/src/test/bash/create_custom_sample.sh

[warning] 8-13: Invalid flags are not handled. Add a *) case.

(SC2220)


[error] 35-35: Double quote array expansions to avoid re-splitting elements.

(SC2068)


[warning] 35-35: In POSIX sh, array references are undefined.

(SC3054)


[warning] 37-37: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 37-37: In POSIX sh, =~ regex matching is undefined.

(SC3015)


[warning] 37-37: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)


[warning] 42-42: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 42-42: In POSIX sh, == in place of = is undefined.

(SC3014)


[warning] 55-55: In POSIX sh, 'declare' is undefined.

(SC3044)


[warning] 57-57: In POSIX sh, $'..' is undefined.

(SC3003)


[warning] 57-57: In POSIX sh, arrays are undefined.

(SC3030)


[warning] 57-57: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 58-58: In POSIX sh, array references are undefined.

(SC3054)


[warning] 59-59: In POSIX sh, array references are undefined.

(SC3054)


[warning] 60-60: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 61-61: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 64-64: In POSIX sh, array references are undefined.

(SC3054)


[warning] 70-70: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 70-70: In POSIX sh, == in place of = is undefined.

(SC3014)


[warning] 71-71: In POSIX sh, indirect expansion is undefined.

(SC3053)


[warning] 71-71: In POSIX sh, array key expansion is undefined.

(SC3055)


[warning] 73-73: In POSIX sh, array references are undefined.

(SC3054)


[warning] 76-76: In POSIX sh, indirect expansion is undefined.

(SC3053)


[warning] 76-76: In POSIX sh, array key expansion is undefined.

(SC3055)


[warning] 78-78: In POSIX sh, array references are undefined.

(SC3054)

dump/src/test/bash/createSampleRandomFromPageIDdataset.sh

[warning] 7-12: Invalid flags are not handled. Add a *) case.

(SC2220)

dump/src/test/bash/createMinidump_custom_sample.sh

[warning] 5-7: Invalid flags are not handled. Add a *) case.

(SC2220)


[warning] 54-54: The surrounding quotes actually unquote this. Remove or escape them.

(SC2027)

🔇 Additional comments (8)
history/src/test/resources/extraction-configs/extraction.config.properties (1)

21-21: Verify extractor configuration completeness.

The AI summary indicates that history extraction uses both HistoryPageExtractor and HistoryStatsExtractor, but this test configuration only includes HistoryPageExtractor. The main history/extraction.properties file configures both extractors (lines 21-22).

Please confirm whether this test intentionally uses a subset of extractors or if HistoryStatsExtractor should also be included for comprehensive testing.

history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNodeExtractor.scala (1)

1-3: LGTM!

The trait definition is clean and follows the established extractor pattern, providing type safety for RevisionNode extractors.

history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1)

11-15: LGTM - composition pattern correctly implemented.

The composite extractor properly aggregates datasets and extraction results from multiple extractors. The use of flatMap and toSet ensures correct union semantics for datasets, and the extraction concatenation follows the expected composite pattern.

history/src/main/scala/org/dbpedia/extraction/mappings/WikiPageWithRevisionsExtractor.scala (1)

1-10: LGTM!

The trait provides clear type safety for WikiPageWithRevisions extractors. The documentation appropriately explains the purpose and relationship to CompositeExtractor.

core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1)

281-286: LGTM!

The new history dataset declarations follow the established pattern and are properly documented. The mappings to history_dataset and history_stats align with the dataset definitions introduced in this PR.

core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector.scala (1)

20-20: Verification script run against the PR diff:

#!/bin/bash
# Verify if 'new MediaWikiConnector' instantiations were added in this PR's diff
git fetch origin main
git diff --unified=0 origin/main... | rg '^\+\s*new\s+MediaWikiConnector'
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala (1)

213-223: Good change: helpers to protected for subclassing

Visibility relaxation is appropriate for REST subclass overrides; no behavior change.

history/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction2.scala (1)

29-31: Validate parallelProcesses >= 1

If config.parallelProcesses is 0, the scheduler stalls. Enforce min 1 or validate in Config2.
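
A minimal validation sketch, assuming the value is parsed as an Int from the properties file (the actual Config2 wiring may differ):

```scala
// Fail fast with a clear message instead of letting a 0 stall the scheduler.
def validatedParallelProcesses(raw: Int): Int =
  if (raw >= 1) raw
  else throw new IllegalArgumentException(s"parallel-processes must be >= 1, got $raw")
```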

Comment on lines +194 to 204
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if(uri.contains(linkPrefix)){
uri=uri.substring(uri.indexOf("?title=")+7);
} else if (uri.contains(linkPrefix2)) {
uri=uri.substring(uri.indexOf("?title=")+3);
}
//TODO central string management
if(!this.context.language.equals("en")) {

uri="http://"+this.context.language+".dbpedia.org/resource/"+uri.substring(uri.indexOf("?title=")+7);
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri;


⚠️ Potential issue | 🔴 Critical

Incorrect substring when stripping REST wiki links

The new cleanLink branch drops the first three characters after ./ and assumes ?title= is absent. For REST HTML such as href="./?title=Main_Page&oldid=123", the current code returns tle=Main_Page&oldid=123, which later becomes http://dbpedia.org/resource/tle=Main_Page&oldid=123. Similarly, /wiki/Foo?oldid=123 keeps the ?oldid tail. Both produce invalid resources. Please normalize the prefixes and strip any ?title=/query/fragment portions before building the DBpedia URI.

-            if(uri.contains(linkPrefix)){
-                uri=uri.substring(uri.indexOf("?title=")+7);
-            } else if (uri.contains(linkPrefix2)) {
-                uri=uri.substring(uri.indexOf("?title=")+3);
-            }
+            if (uri.startsWith(linkPrefix)) {
+                uri = uri.substring(linkPrefix.length());
+            } else if (uri.startsWith(linkPrefix2)) {
+                uri = uri.substring(linkPrefix2.length());
+            }
+            if (uri.contains("?title=")) {
+                uri = uri.substring(uri.indexOf("?title=") + 7);
+            }
+            int queryIdx = uri.indexOf('?');
+            if (queryIdx >= 0) {
+                uri = uri.substring(0, queryIdx);
+            }
+            int fragmentIdx = uri.indexOf('#');
+            if (fragmentIdx >= 0) {
+                uri = uri.substring(0, fragmentIdx);
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if(uri.contains(linkPrefix)){
uri=uri.substring(uri.indexOf("?title=")+7);
} else if (uri.contains(linkPrefix2)) {
uri=uri.substring(uri.indexOf("?title=")+3);
}
//TODO central string management
if(!this.context.language.equals("en")) {
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri.substring(uri.indexOf("?title=")+7);
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri;
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if (uri.startsWith(linkPrefix)) {
uri = uri.substring(linkPrefix.length());
} else if (uri.startsWith(linkPrefix2)) {
uri = uri.substring(linkPrefix2.length());
}
if (uri.contains("?title=")) {
uri = uri.substring(uri.indexOf("?title=") + 7);
}
int queryIdx = uri.indexOf('?');
if (queryIdx >= 0) {
uri = uri.substring(0, queryIdx);
}
int fragmentIdx = uri.indexOf('#');
if (fragmentIdx >= 0) {
uri = uri.substring(0, fragmentIdx);
}
//TODO central string management
if (!this.context.language.equals("en")) {
uri = "http://" + this.context.language + ".dbpedia.org/resource/" + uri;
}

Comment on lines +264 to 270
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,

⚠️ Potential issue | 🔴 Critical

Handle unknown mwc-type safely

The match on this.getProperty("mwc-type").trim has no fallback, so a missing or misspelled property now throws MatchError at startup. Please keep the previous behaviour by providing a sensible default (or an explicit failure with a clear message).

-      apiUrl = this.getProperty("mwc-type").trim match {
-        case "rest" =>  this.getProperty("mwc-apiRestUrl", "").trim
-        case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
-        case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
-      },
+      apiUrl = this.getProperty("mwc-type", "mwc").trim match {
+        case "rest"  => this.getProperty("mwc-apiRestUrl", "").trim
+        case "mwc"   => this.getProperty("mwc-apiMWCUrl", "").trim
+        case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
+        case other   => throw new IllegalArgumentException(s"Unsupported mwc-type '$other'")
+      },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type", "mwc").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
case other => throw new IllegalArgumentException(s"Unsupported mwc-type '$other'")
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/config/Config.scala around lines
264 to 270, the pattern match on this.getProperty("mwc-type").trim can throw
MatchError for unknown or missing values; change it to match the apiType value
you already read and add a fallback case (case _ => "") or an explicit
IllegalArgumentException with a clear message. Concretely, use the apiType val
in the match and add a default branch that returns a sensible default (e.g. an
empty string) or throws a descriptive error so startup fails with a clear
message instead of a MatchError.

Comment on lines +37 to +39
var nodes = doc.select("body").first.childNodes.asScala

val currentSection = new ListBuffer[Int]() //keeps track of section number

⚠️ Potential issue | 🟡 Minor

Add null-safety around body/section selection

doc.select("body").first and .select("section").first can be null for some pages. Guard to avoid NPEs.

-    var nodes = doc.select("body").first.childNodes.asScala
+    val body = doc.select("body").first
+    if (body == null) return tocMap
+    var nodes = body.childNodes.asScala
@@
-    val abstractSect=doc.select("body").select("section").first.childNodes.asScala //get first section
+    val firstSection = body.select("section").first
+    if (firstSection == null) return tocMap
+    val abstractSect = firstSection.childNodes.asScala

Also applies to: 108-111

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala
around lines 37-39 (and similarly at lines 108-111), calls to
doc.select("body").first and doc.select("section").first can return null and
cause NPEs; add null-safety by checking for null (or wrapping in Option) before
accessing .first or .childNodes, and handle the absent case by skipping
processing or returning early as appropriate; specifically, replace direct
.first usage with a safe lookup (e.g., Option(...).toList or if (elem != null)
...) and only access childNodes or call .asScala when the element is present so
the extractor continues safely when body/section is missing.

Comment on lines +52 to +120
val parameters = "redirect=true"
val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))



//println(s"mediawikiurl: $apiUrl")


for (counter <- 1 to maxRetries) {

val conn = apiUrl.openConnection
conn.setDoOutput(true) // POST REQUEST to verify

val start = java.time.LocalTime.now()

conn.setConnectTimeout(retryFactor * connectMs)
conn.setReadTimeout(retryFactor * readMs)
conn.setRequestProperty("accept", apiAccept)
conn.setRequestProperty("charset", apiCharset)
conn.setRequestProperty("profile", apiProfile)
conn.setRequestProperty("Accept-Language", pageTitle.language.wikiCode)
conn.setRequestProperty("User-Agent", userAgent)

val inputStream = conn.getInputStream
val answerHeader = conn.getHeaderFields()
val answerClean = answerHeader.asScala.filterKeys(_ != null)

if(conn.getHeaderField(null).contains("HTTP/1.1 200 OK") ){


val end = java.time.LocalTime.now()
conn match {
case connection: HttpURLConnection =>
log.debug("Request type: " + connection.getRequestMethod + "; URL: " + connection.getURL +
"; Parameters: " + parameters + "; HTTP code: " + connection.getHeaderField(null) +
"; Request time: " + start + "; Response time: " + end + "; Time needed: " +
start.until(end, ChronoUnit.MILLIS))
case _ =>
}
// Read answer
parsedAnswer = readInAbstract(inputStream)
SuccessParsing = parsedAnswer match {
case Success(str) => true
case Failure(_) => false
}
}
if(!SuccessParsing){
var sleepMs = sleepFactorMs
if (retryAfter && answerClean.contains("retry-after")) {
//println("GIVEN RETRY-AFTER > "+ answer_clean("retry-after").get(0))
waitingTime = Integer.parseInt(answerClean("retry-after").get(0)) * 1000

// exponential backoff test
sleepMs = pow(waitingTime, counter).toInt
//println("WITH EXPONENTIAL BACK OFF" + counter)
//println("Sleeping time double >>>>>>>>>>>" + pow(waiting_time, counter))
//println("Sleeping time int >>>>>>>>>>>" + sleepMs)

}
if (counter < maxRetries)
Thread.sleep(sleepMs)
else
throw new Exception("Timeout error retrieving abstract of " + pageTitle + " in " + counter + " tries.")
} else {


//println(s"mediawikiurl: $apiUrl?$parameters")
return parsedAnswer match {
case Success(str) => Option(str)

⚠️ Potential issue | 🟠 Major

Use the supplied apiParameterString when building the REST URL

retrievePage ignores apiParameterString and always forces redirect=true. Callers rely on that parameter (e.g., to request specific sections or formats); dropping it makes the REST connector fetch the wrong content. Please merge the caller-provided parameters with your default instead of discarding them.

-    val parameters = "redirect=true"
-    val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))
+    val baseParams = Option(apiParameterString)
+      .map(_.stripPrefix("?"))
+      .filter(_.nonEmpty)
+      .getOrElse("")
+    val allParams = Seq(baseParams, "redirect=true").filter(_.nonEmpty).mkString("&")
+    val suffix = if (allParams.isEmpty) "" else "?" + allParams
+    val apiUrl: URL = new URL(url.concat(titleParam).concat(suffix))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
val parameters = "redirect=true"
val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))
//println(s"mediawikiurl: $apiUrl")
for (counter <- 1 to maxRetries) {
val conn = apiUrl.openConnection
conn.setDoOutput(true) // POST REQUEST to verify
val start = java.time.LocalTime.now()
conn.setConnectTimeout(retryFactor * connectMs)
conn.setReadTimeout(retryFactor * readMs)
conn.setRequestProperty("accept", apiAccept)
conn.setRequestProperty("charset", apiCharset)
conn.setRequestProperty("profile", apiProfile)
conn.setRequestProperty("Accept-Language", pageTitle.language.wikiCode)
conn.setRequestProperty("User-Agent", userAgent)
val inputStream = conn.getInputStream
val answerHeader = conn.getHeaderFields()
val answerClean = answerHeader.asScala.filterKeys(_ != null)
if(conn.getHeaderField(null).contains("HTTP/1.1 200 OK") ){
val end = java.time.LocalTime.now()
conn match {
case connection: HttpURLConnection =>
log.debug("Request type: " + connection.getRequestMethod + "; URL: " + connection.getURL +
"; Parameters: " + parameters + "; HTTP code: " + connection.getHeaderField(null) +
"; Request time: " + start + "; Response time: " + end + "; Time needed: " +
start.until(end, ChronoUnit.MILLIS))
case _ =>
}
// Read answer
parsedAnswer = readInAbstract(inputStream)
SuccessParsing = parsedAnswer match {
case Success(str) => true
case Failure(_) => false
}
}
if(!SuccessParsing){
var sleepMs = sleepFactorMs
if (retryAfter && answerClean.contains("retry-after")) {
//println("GIVEN RETRY-AFTER > "+ answer_clean("retry-after").get(0))
waitingTime = Integer.parseInt(answerClean("retry-after").get(0)) * 1000
// exponential backoff test
sleepMs = pow(waitingTime, counter).toInt
//println("WITH EXPONENTIAL BACK OFF" + counter)
//println("Sleeping time double >>>>>>>>>>>" + pow(waiting_time, counter))
//println("Sleeping time int >>>>>>>>>>>" + sleepMs)
}
if (counter < maxRetries)
Thread.sleep(sleepMs)
else
throw new Exception("Timeout error retrieving abstract of " + pageTitle + " in " + counter + " tries.")
} else {
//println(s"mediawikiurl: $apiUrl?$parameters")
return parsedAnswer match {
case Success(str) => Option(str)
val baseParams = Option(apiParameterString)
.map(_.stripPrefix("?"))
.filter(_.nonEmpty)
.getOrElse("")
val allParams = Seq(baseParams, "redirect=true").filter(_.nonEmpty).mkString("&")
val suffix = if (allParams.isEmpty) "" else "?" + allParams
val apiUrl: URL = new URL(url.concat(titleParam).concat(suffix))

Comment on lines +1 to +80
#!/bin/sh
lang="";
n=1000;
date_archive=$(date -d "$(date +%Y-%m-01) -1 day" +%Y-%m);
sort="desc"
while getopts l:d:n:s: flag
do
case "${flag}" in
l) lang=${OPTARG};;
d) date_archive=${OPTARG};;
n) n=${OPTARG};;
s) sort=${OPTARG};;
esac
done
echo "========================="
echo "lang: $lang";
echo "date: $date_archive";
echo "n: $n";
echo "sort: $sort";
echo "========================="


clickstream_data="clickstream_data_${lang}_${date_archive}"
if [ -f "$clickstream_data" ]
then
echo "File found"
else
echo "File not found"
clickstream_url="https://dumps.wikimedia.org/other/clickstream/";
content=$(curl -L "$clickstream_url$date_archive/")
links=$( echo $content | grep -Po '(?<=href=")[^"]*');
toextract="";
substr="-${lang}wiki-"
echo $substr
for link in ${links[@]}; do
echo $link
if [[ $link =~ "-${lang}wiki-" ]];then
toextract="$clickstream_url$date_archive/$link";
fi
done

if [[ $toextract == "" ]]; then
echo "Lang not found in clickstream";
exit 1;
fi

echo ">>>> DOWNLOAD $toextract and save it"

wget -O "${clickstream_data}.gz" $toextract;
gzip -d "${clickstream_data}.gz"
fi


echo ">>>> COMPUTE SUM OF CLICKS"
declare -A dict
while IFS= read -r line; do
IFS=$'\t'; arrIN=($line); unset IFS;
key=${arrIN[1]}
val=${arrIN[3]}
if [[ ${key} != *"List"* ]];then
if [[ ${#dict[${key}]} -eq 0 ]] ;then
dict[${key}]=$(($val));
else
dict[${key}]=$((${dict[${key}]}+$val));
fi
fi
done < $clickstream_data

echo ">>>> SORT IT AND SAVE TEMP"
if [[ $sort == "desc" ]]; then
for page in "${!dict[@]}"
do
echo "$page ${dict[$page]}"
done | sort -rn -k2 | head -n "$n" | cut -d ' ' -f 1 >> temp.txt;
else
for page in "${!dict[@]}"
do
echo "$page ${dict[$page]}"
done | sort -n -k2 | head -n "$n" | cut -d ' ' -f 1 >> temp.txt;
fi

⚠️ Potential issue | 🔴 Critical

Fix shebang or drop Bash-specific syntax.

The script declares #!/bin/sh yet relies on Bash-only features (declare -A, ${array[@]}, [[ … ]], regex =~). On systems where /bin/sh is not Bash, the script aborts. Either switch the shebang to #!/bin/bash (and ensure Bash is available) or refactor to POSIX-compliant constructs.

-#!/bin/sh
+#!/bin/bash
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 8-13: Invalid flags are not handled. Add a *) case. (SC2220)
[error] 35-35: Double quote array expansions to avoid re-splitting elements. (SC2068)
[warning] 35-35: In POSIX sh, array references are undefined. (SC3054)
[warning] 37-37: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 37-37: In POSIX sh, =~ regex matching is undefined. (SC3015)
[warning] 37-37: Remove quotes from right-hand side of =~ to match as a regex rather than literally. (SC2076)
[warning] 42-42: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 42-42: In POSIX sh, == in place of = is undefined. (SC3014)
[warning] 55-55: In POSIX sh, 'declare' is undefined. (SC3044)
[warning] 57-57: In POSIX sh, $'..' is undefined. (SC3003)
[warning] 57-57: In POSIX sh, arrays are undefined. (SC3030)
[warning] 57-57: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a. (SC2206)
[warning] 58-58: In POSIX sh, array references are undefined. (SC3054)
[warning] 59-59: In POSIX sh, array references are undefined. (SC3054)
[warning] 60-60: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 61-61: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 64-64: In POSIX sh, array references are undefined. (SC3054)
[warning] 70-70: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 70-70: In POSIX sh, == in place of = is undefined. (SC3014)
[warning] 71-71: In POSIX sh, indirect expansion is undefined. (SC3053)
[warning] 71-71: In POSIX sh, array key expansion is undefined. (SC3055)
[warning] 73-73: In POSIX sh, array references are undefined. (SC3054)
[warning] 76-76: In POSIX sh, indirect expansion is undefined. (SC3053)
[warning] 76-76: In POSIX sh, array key expansion is undefined. (SC3055)
[warning] 78-78: In POSIX sh, array references are undefined. (SC3054)
🤖 Prompt for AI Agents
In dump/src/test/bash/create_custom_sample.sh lines 1-80, the script uses
Bash-only features (declare -A, [[ ]], =~, ${array[@]}) but declares #!/bin/sh;
change the shebang to a Bash-aware interpreter (e.g., #!/usr/bin/env bash) to
guarantee Bash semantics across systems, or alternatively refactor all
Bash-specific constructs to POSIX-compliant equivalents; update the shebang to
the env-backed bash if you choose the simpler fix and verify the script runs on
target CI/hosts.

@@ -0,0 +1,27 @@
# download and extraction target dir
#base-dir= moved to $extraction-framework/core/src/main/resources/universal.properties
base-dir=/home/cringwal/Desktop/DBpediaHistory/extraction-framework/history/sample-xml-dump

⚠️ Potential issue | 🟠 Major

Remove hardcoded absolute path with username.

The hardcoded path /home/cringwal/Desktop/DBpediaHistory/... contains a specific username and will fail on other machines or CI environments.

Consider one of these approaches:

  1. Use a relative path:
-base-dir=/home/cringwal/Desktop/DBpediaHistory/extraction-framework/history/sample-xml-dump
+base-dir=./sample-xml-dump
  2. Or rely on the universal.properties default and remove this line entirely if the default is suitable for tests.
🤖 Prompt for AI Agents
In history/src/test/resources/extraction-configs/extraction.config.properties
around line 3, the property base-dir contains a hardcoded absolute path with a
username; remove this platform-specific path and either delete the line so the
test falls back to the universal.properties default, or replace it with a
portable relative path (e.g. ./sample-xml-dump or ../sample-xml-dump) that
points to the test resource directory so the tests run on any machine/CI.

Comment on lines +17 to +41
test("test Historic extraction", HistoTestTag) {
val jobsRunning = new ConcurrentLinkedQueue[Future[Unit]]()//
val classLoader: ClassLoader = getClass.getClassLoader
val histoConfig = new Config2(classLoader.getResource("extraction-configs/extraction.config.properties").getFile)
println(classLoader.getResource("extraction-configs/extraction.config.properties").getFile.toString)
println("BEFORE EXTRACT")
extract(histoConfig, jobsRunning)
println("AFTER EXTRACT")
}

def extract(config: Config2, jobsRunning: ConcurrentLinkedQueue[Future[Unit]]): Unit = {
val configLoader = new ConfigLoader2(config)
val jobs = configLoader.getExtractionJobs
println(">>>>>>>>> EXTRACT - NBJOBS > " + jobs.size)
println("LAUNCH JOBS")
for (job <- jobs) {
job.run()
}
while (jobsRunning.size() > 0) {

Thread.sleep(1000)
}

jobsRunning.clear()
}

⚠️ Potential issue | 🟠 Major

Resource loading via getFile breaks for off-classpath resources and paths containing spaces.

classLoader.getResource(...).getFile only works for plain file URLs without spaces. In Maven/Gradle runs the resource often sits in target/test-classes, so toURI is safer; inside a JAR, getFile fails outright. Please resolve the URL properly (Paths.get(resource.toURI)) or stream the resource instead.

Fix sketch:

-    val histoConfig = new Config2(classLoader.getResource("extraction-configs/extraction.config.properties").getFile)
+    val url = classLoader.getResource("extraction-configs/extraction.config.properties")
+    require(url != null, "Missing extraction-config resource")
+    val histoConfig = new Config2(Paths.get(url.toURI).toString)

Add the requisite java.nio.file.Paths import.
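
Alternatively, a stream-based load sidesteps path handling entirely and also works when the resource is packed in a JAR. A minimal sketch, assuming Config2 accepts a plain filesystem path as it does in the test above; the temp-file copy is purely illustrative:

// Copy the classpath resource to a temp file so it can be handed to Config2 by path.
// (That Config2 takes a path is assumed from the test code above.)
import java.nio.file.{Files, StandardCopyOption}

val in = getClass.getClassLoader.getResourceAsStream("extraction-configs/extraction.config.properties")
require(in != null, "Missing extraction-config resource")
val tmp = Files.createTempFile("extraction.config", ".properties")
Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
in.close()
val histoConfig = new Config2(tmp.toString)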

🤖 Prompt for AI Agents
In
history/src/test/scala/org/dbpedia/extraction/dump/extract/ExtractionTest.scala
around lines 17 to 41, the test uses classLoader.getResource(...).getFile which
fails for off-classpath resources, spaces, and JAR-packed resources; replace
that usage by resolving the URL to a URI and converting via
Paths.get(resource.toURI()) or, better, load the resource as a stream with
getResourceAsStream for JAR safety, add a null-check for the resource, and add
the requisite java.nio.file.Paths import (or use java.io.InputStream handling)
so the test works in Maven/Gradle and inside JARs.

Comment on lines 3396 to +3399
{{PropertyMapping | templateProperty = όνομα | ontologyProperty = foaf:name }}
{{PropertyMapping | templateProperty = Χώρα | ontologyProperty = country }}
{{PropertyMapping | templateProperty = Πληθυσμός | ontologyProperty = population }}


⚠️ Potential issue | 🔴 Critical

Use populationTotal for Πληθυσμός.

ontologyProperty = population is not defined in the DBpedia ontology, so the extractor will drop these values instead of populating the village population. Swap to populationTotal, which is already used elsewhere for settlements and validates cleanly.

-            {{PropertyMapping | templateProperty = Πληθυσμός  | ontologyProperty = population }}
+            {{PropertyMapping | templateProperty = Πληθυσμός  | ontologyProperty = populationTotal }}
🤖 Prompt for AI Agents
In mappings/Mapping_el.xml around lines 3396 to 3399, the mapping uses
ontologyProperty = population which is not defined in DBpedia ontology; change
it to ontologyProperty = populationTotal so the extractor will retain and
populate village population values consistently with other settlement mappings.

Comment on lines +21493 to +21499
{{ PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
{{ PropertyMapping | templateProperty = date | ontologyProperty = date }}
{{ PropertyMapping | templateProperty = disease | ontologyProperty = disease }}
{{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virus_strain }}
{{ PropertyMapping | templateProperty = location | ontologyProperty = Location }}
{{ PropertyMapping | templateProperty = deaths | ontologyProperty = Deaths }}
{{ PropertyMapping | templateProperty = symptom | ontologyProperty = Symptom }}

⚠️ Potential issue | 🔴 Critical

Fix invalid ontology property names in Infobox pandemic mapping.

Location, Deaths, Symptom, and virus_strain are not valid DBpedia ontology properties; the extractor will either mint non-existent predicates or drop the data. Please map to the canonical property names (e.g., location, numberOfDeaths, symptom, virusStrain).

-   {{ PropertyMapping | templateProperty = location | ontologyProperty = Location }}
-   {{ PropertyMapping | templateProperty = deaths | ontologyProperty = Deaths }}
-   {{ PropertyMapping | templateProperty = symptom | ontologyProperty = Symptom }}
-   {{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virus_strain }}
+   {{ PropertyMapping | templateProperty = location | ontologyProperty = location }}
+   {{ PropertyMapping | templateProperty = deaths | ontologyProperty = numberOfDeaths }}
+   {{ PropertyMapping | templateProperty = symptom | ontologyProperty = symptom }}
+   {{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virusStrain }}
🤖 Prompt for AI Agents
In mappings/Mapping_en.xml around lines 21493 to 21499, several ontologyProperty
values in the Infobox pandemic PropertyMapping entries are invalid; update the
mappings so DBpedia uses canonical property names: change
ontologyProperty="Location" to ontologyProperty="location", "Deaths" to
"numberOfDeaths", "Symptom" to "symptom", and "virus_strain" to the canonical
"virusStrain" (also keep foaf:name and date/disease mappings as-is); ensure
exact spelling and casing to prevent minting non-existent predicates or data
loss.

Comment on lines +5374 to +5375
{{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = nextTrackNumber}}
{{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = previousTrackNumber}}

⚠️ Potential issue | 🟠 Major

Swap previous/next track property mappings

numéro piste préc (previous track) is mapped to nextTrackNumber, and numéro piste suiv (next track) is mapped to previousTrackNumber, inverting the semantics returned by the extractor.

-        {{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = nextTrackNumber}}
-        {{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = previousTrackNumber}}
+        {{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = previousTrackNumber}}
+        {{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = nextTrackNumber}}
🤖 Prompt for AI Agents
In mappings/Mapping_fr.xml around lines 5374-5375, the two PropertyMapping
entries have their ontologyProperty values inverted: "numéro piste préc"
(previous track) is incorrectly mapped to nextTrackNumber and "numéro piste
suiv" (next track) is incorrectly mapped to previousTrackNumber; swap the
ontologyProperty values so "numéro piste préc" maps to previousTrackNumber and
"numéro piste suiv" maps to nextTrackNumber to restore correct semantics.
