
Conversation

@datalogism
Member

@datalogism datalogism commented Dec 6, 2022

First prototype of DBpedia History

Summary by CodeRabbit

  • New Features

    • REST-based NIF extraction path.
    • New History extraction module producing HistoryData and HistoryStats (standard and Spark jobs).
    • Sample/minidump generation scripts for testing and demos.
  • Improvements

    • More robust link parsing and HTML cleaning in NIF extraction.
    • Expanded MediaWiki connection options; default parallelism reduced for stability.
    • Updated template mappings for English, French, and Hungarian.
  • Documentation

    • Added History module README and abstract test guide.
  • Tests

    • New end-to-end extraction tests and utilities.

celian_ringwald and others added 30 commits September 14, 2022 17:58
datalogism and others added 12 commits December 8, 2022 09:40
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
@sonarqubecloud

sonarqubecloud bot commented Dec 8, 2022

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 0 (rating A)

No Coverage information
No Duplication information

Contributor

@TallTed TallTed left a comment


Minor...

datalogism and others added 3 commits January 6, 2023 13:07
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
@sonarqubecloud

sonarqubecloud bot commented Jan 6, 2023

Please retry analysis of this Pull-Request directly on SonarCloud.

1 similar comment
@sonarqubecloud

sonarqubecloud bot commented Jan 6, 2023

Please retry analysis of this Pull-Request directly on SonarCloud.

@JJ-Author
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Oct 9, 2025

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Oct 9, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Adds a new History module for revision-based extraction (config, sources, extractors, Spark pipeline, recorder/monitor), introduces REST-based NIF extraction path with new MediaWiki connectors, updates configurations (including MediaWikiConnection signature and defaults), extends datasets (history), adds dump test tooling/scripts, and enriches multiple mapping files.

Changes

Cohort / File(s) Summary
NIF + MediaWiki connectors
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java, core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala, core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala, core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector*.scala
Adds REST-aware link parsing and NIF extraction path; introduces abstract/configured/REST MediaWiki connectors; updates HtmlNifExtractor flow and visibility tweaks; deprecates old connector; adds REST-specific extractor.
Core config updates
core/src/main/scala/org/dbpedia/extraction/config/Config.scala
Expands MediaWikiConnection case class and construction (apiType and HTTP params); changes parallelProcesses default to 1.
Datasets & provenance
core/src/main/resources/datasetdefinitions.json, core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala
Adds history datasets: history_dataset and history_stats, with corresponding provenance entries.
Dump test tooling
dump/src/test/bash/*, dump/src/test/resources/extraction-configs/*, dump/src/test/scala/.../ExtractionTestAbstract.*
Adds scripts to create samples/minidumps, updates test extraction configs (REST/MWC params, abstracts-only), and a new abstract extraction test with logging.
History module: setup & configs
history/ReadMe.md, history/pom.xml, history/download.properties, history/extraction.properties, history/src/test/resources/extraction-configs/extraction.config.properties
Introduces History module POM, docs, and properties for downloads and extraction.
History parsing & sources
history/src/main/java/.../WikipediaDumpParserHistory.java, history/src/main/scala/.../sources/Source2.scala, .../sources/XMLSource2.scala
Adds history-capable dump parser and new Source2 abstractions for single/multi-file XML reading.
History configuration
history/src/main/scala/org/dbpedia/extraction/config/Config2.scala
New Config2 with structured properties (NIF/Abstract parameters, MediaWikiConnection, languages, datasets, etc.).
History extraction pipeline
history/src/main/scala/.../dump/extract/ConfigLoader2.scala, .../Extraction2.scala, .../ExtractionJob2.scala, .../SparkExtractionJob2.scala, .../SerializableUtils2.scala, .../destinations/WriterDestination2.scala
Adds job loaders, orchestrators (sequential/parallel/Spark), serialization helpers, and a writer destination with recording.
History model & extractors
history/src/main/scala/.../wikiparser/WikiPageWithRevisions.scala, .../wikiparser/RevisionNode.scala, .../mappings/WikiPageWithRevisionsExtractor.scala, .../mappings/HistoryPageExtractor.scala, .../mappings/HistoryStatsExtractor.scala, .../mappings/RedirectExtractor2.scala, .../mappings/Composite*2.scala, .../mappings/Redirects2.scala, .../mappings/RevisionNode.scala, .../mappings/RevisionNodeExtractor.scala, .../mappings/ExtractionMonitor2.scala, .../util/DumpExtractionContext2.scala, .../util/SparkExtractionContext2.scala, .../util/ExtractionRecorder2.scala
Introduces revision/page models, history/statistics extractors, redirects handling, composite extractors, monitoring and recording utilities, and context traits.
History tests
history/src/test/scala/.../ExtractionTest.scala
Adds a ScalaTest spec to run history extraction via Config2/ConfigLoader2.
Mappings updates
mappings/Mapping_en.xml, mappings/Mapping_fr.xml, mappings/Mapping_hu.xml, mappings/Mapping_el.xml
Substantial additions/refinements to template-to-ontology mappings across EN/FR/HU/EL.
Root build meta
pom.xml
Adds commented history module entry; no functional build change.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as NifExtractor
  participant Conf as Config.mediawikiConnection
  participant REST as MediaWikiConnectorRest
  participant MWC as MediawikiConnectorConfigured
  participant ExtrR as WikipediaNifExtractorRest
  participant Extr as WikipediaNifExtractor

  alt apiType == "rest"
    Runner->>REST: retrievePage(title, params)
    REST-->>Runner: html opt or error
    Runner->>ExtrR: extract(html)
    ExtrR-->>Runner: NIF sections/triples
  else non-REST
    Runner->>MWC: retrievePage(title, params)
    MWC-->>Runner: html opt or error
    Runner->>Extr: extract(html)
    Extr-->>Runner: NIF sections/triples
  end
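
For readers who prefer code to diagrams, here is a minimal Scala sketch of the same dispatch. The class names mirror those in the walkthrough, but every signature below is a simplified assumption for illustration, not the actual connector/extractor API added by this PR.

```scala
// Illustrative only: simplified stand-ins for the connectors/extractors named above.
trait PageConnector {
  def retrievePage(title: String, params: String): Option[String] // HTML, or None on error
}

class MediaWikiConnectorRest extends PageConnector {
  def retrievePage(title: String, params: String): Option[String] =
    Some(s"<html><!-- REST HTML for $title --></html>") // placeholder for the REST call
}

class MediawikiConnectorConfigured extends PageConnector {
  def retrievePage(title: String, params: String): Option[String] =
    Some(s"<html><!-- action-API HTML for $title --></html>") // placeholder for the MWC call
}

object NifDispatchSketch {
  // Pick the connector (and, in the real code, the matching extractor)
  // based on the configured apiType, as in the diagram above.
  def extract(apiType: String, title: String, params: String): Seq[String] = {
    val connector: PageConnector =
      if (apiType == "rest") new MediaWikiConnectorRest else new MediawikiConnectorConfigured
    connector.retrievePage(title, params) match {
      case Some(html) => Seq(s"nif-section(${html.length} chars)") // stands in for NIF triples
      case None       => Seq.empty                                 // error path
    }
  }

  def main(args: Array[String]): Unit =
    println(extract("rest", "Main_Page", "redirect=true"))
}
```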
sequenceDiagram
  autonumber
  participant Main as Extraction2
  participant Cfg as Config2
  participant Loader as ConfigLoader2
  participant Job as ExtractionJob2
  participant Src as Source2/XMLSource2
  participant Ext as CompositeParseExtractor2
  participant Dest as WriterDestination2
  participant Rec as ExtractionRecorder2

  Main->>Cfg: load(args)
  Main->>Loader: new(config)
  Loader-->>Main: jobs (by language)
  loop per job
    Main->>Job: run()
    Job->>Src: foreach(page)
    Src-->>Job: WikiPageWithRevisions
    Job->>Ext: extract(page, subjectUri)
    Ext-->>Job: quads
    Job->>Dest: write(quads)
    Job->>Rec: record page/quad status
  end
  Main-->>Main: wait for all jobs
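
As a rough Scala sketch of the loop above (all types here are simplified stand-ins for Config2, ConfigLoader2, ExtractionJob2 and friends, not their real signatures):

```scala
// Simplified stand-ins; the real classes added in this PR are richer.
case class Quad(subject: String, predicate: String, obj: String)
case class WikiPageWithRevisions(title: String, revisionCount: Int)

object HistoryJobSketch {
  // One ExtractionJob2-style run: extract quads per page, write them, record progress.
  def runJob(pages: Seq[WikiPageWithRevisions],
             extract: WikiPageWithRevisions => Seq[Quad],
             write: Seq[Quad] => Unit,
             record: (String, Int) => Unit): Unit =
    for (page <- pages) {
      val quads = extract(page)      // CompositeParseExtractor2 role
      write(quads)                   // WriterDestination2 role
      record(page.title, quads.size) // ExtractionRecorder2 role
    }

  def main(args: Array[String]): Unit =
    runJob(
      pages   = Seq(WikiPageWithRevisions("Page_A", 12)),
      extract = p => Seq(Quad(p.title, "dc:modified", p.revisionCount.toString)),
      write   = qs => qs.foreach(println),
      record  = (title, n) => println(s"recorded $title: $n quads")
    )
}
```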

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180+ minutes

Poem

A hop through history, ears held high,
We wired new REST to catch the sky.
With quads in paws and maps refined,
We sift the past, by dumps aligned.
Spark trails glow, connectors sing—
Thump goes progress: spring-spring-spring! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.05%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title “Dbpedia History” clearly identifies the main focus of this pull request as introducing DBpedia’s History functionality and is concise without unnecessary details.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 53

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mappings/Mapping_hu.xml (1)

265-266: Fix incorrect ontology targets for church founding metadata.

alapítva now writes to dbo:publicationDate and alapító to dbo:bodyDiscovered, which will pollute those properties with church founding data. Please map them to the standard founding predicates instead.

-        {{PropertyMapping | templateProperty = alapítva | ontologyProperty = publicationDate }}
-        {{PropertyMapping | templateProperty = alapító | ontologyProperty = bodyDiscovered }}
+        {{PropertyMapping | templateProperty = alapítva | ontologyProperty = foundingDate }}
+        {{PropertyMapping | templateProperty = alapító | ontologyProperty = founder }}
🧹 Nitpick comments (39)
dump/src/test/bash/createSampleRandomFromPageIDdataset.sh (1)

21-22: Quote parameters passed to external commands.

$file, $n, and the constructed output name are expanded unquoted. Any whitespace or glob characters in the file path or generated list name will break the pipeline or trigger unintended globbing. Quote those substitutions before passing them to grep, shuf, and the final redirection target.

-grep -v "resource\/\w*\:" $file > temp.txt
-shuf -n $n temp.txt | ... > uri_sample_random_${lang}_${n}.lst
+grep -v "resource/\w*:" "$file" > temp.txt
+shuf -n "$n" temp.txt | ... > "uri_sample_random_${lang}_${n}.lst"
pom.xml (1)

64-65: History module is commented out—clarify activation plan

If the history module should build in CI, uncomment it or gate it behind a Maven profile (e.g., -P history). Otherwise, document why it’s intentionally disabled to avoid confusion.

history/ReadMe.md (7)

3-3: Tighten intro sentence

Minor copy edit and trailing space.

-DBpedia History enables the history of a Wikipedia chapter to be extracted into an RDF format 
+DBpedia History enables the history of a Wikipedia chapter to be extracted into RDF.

10-15: Avoid bare URLs; format citations cleanly (MD034)

Replace separate bare URLs with Markdown links.

-Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data. IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI' 16), Oct 2016, Omaha, United States. <hal-01359575>
-https://hal.inria.fr/hal-01359575
-
-Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin. Materializing the editing history of Wikipedia as linked data in DBpedia. ISWC 2016 - 15th International Semantic Web Conference, Oct 2016, Kobe, Japan. <http://iswc2016.semanticweb.org/>. <hal-01359583>
-https://hal.inria.fr/hal-01359583
+– [Wikipedia editing history in DBpedia: extracting and publishing the encyclopedia editing activity as linked data (WI'16)](https://hal.inria.fr/hal-01359575) — Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin.
+
+– [Materializing the editing history of Wikipedia as linked data in DBpedia (ISWC 2016)](https://hal.inria.fr/hal-01359583) — Fabien Gandon, Raphael Boyer, Olivier Corby, Alexandre Monnin.

36-43: Fix typos, grammar, and list indentation (MD007)

Tighten phrasing and remove extra indent spaces before bullets.

-  * Extract all revisions of every Wikipedia page
-  * Use the foaf, xsd, rdf, prov, dc, sioc ontologies 
-  * Describre each revisions of each page, the content / date / size / importance of that revision, the author of this one and the delta with the last version of the page updated by this one
-  * the id of the user are based depending on what is available : ip / nickname or the wikipedia id
+* Extract all revisions of every Wikipedia page
+* Use the FOAF, XSD, RDF, PROV, DC, and SIOC ontologies
+* Describe each revision of each page: content, date, size, importance, author, and the delta vs. the previous revision
+* The user identifier is based on what is available: IP address, username, or Wikipedia user ID

44-50: Clarify wording in HistoryStats bullets (grammar, MD007)

Improve list grammar and remove extra indent spaces.

-  * Extract statistics about revision activity for every page of Wikipedia : 
-    * number of revision per year / months 
-    * avg size of revision per year / month
-    * number of unique contribution
-  * This exctraction add some computation and could be not necessary
-  * Use dc, rdf, rdfs ontologies
+* Extract statistics about revision activity for every Wikipedia page:
+  * number of revisions per year/month
+  * average revision size per year/month
+  * number of unique contributors
+* This extraction adds additional computation and may be optional
+* Uses DC, RDF, and RDFS ontologies

53-64: Minor heading/punctuation fixes

Standardize headings and code formatting.

-## How to run it ? 
+## How to run it?
@@
-* and run  ```../run download download.properties```
+* and run: `../run download download.properties`
@@
-* and run  ```../run run extraction.properties```
+* and run: `../run run extraction.properties`

71-86: Specify language for fenced code block (MD040)

These are N-Triples; add a language hint for better rendering.

-```
+```ntriples
 ...

88-182: Specify language for fenced code block (MD040)

Same for the stats example.

-```
+```ntriples
 ...

dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.scala (2)

97-101: Remove unused debug variable/print

`parallelProcesses` isn’t used; drop it and the println.

```diff
-    val parallelProcesses = 1
-    println(parallelProcesses)
+    // parallelism is determined by job config; keep output minimal in tests

140-141: Don’t swallow exceptions silently

Log or record the exception to aid debugging.

-      } catch {
-        case e: Exception =>  None
-      }
+      } catch {
+        case e: Exception =>
+          println(s"[warn] Failed to list or aggregate failed pages for ${lang.wikiCode}: ${e.getMessage}")
+      }
history/src/main/scala/org/dbpedia/extraction/dump/extract/SparkExtractionJob2.scala (3)

49-50: Remove stray debug print

Leftover println adds noise in Spark driver logs.

-    println("SPARKKKK")

70-73: Use consistent composite extractor implementation

File uses CompositeParseExtractor here and CompositeParseExtractor2 later; unify to the intended variant.

-      val extractor = CompositeParseExtractor.load(extractors, context)
+      val extractor = CompositeParseExtractor2.load(extractors, context)

211-218: Avoid external bash for concat if possible

Relying on an external script can break in cluster environments. Consider merging via Hadoop/Spark APIs (e.g., coalesce + save) or Java NIO, with retries and logging.
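
One possible shape for such a merge, sketched with java.nio; the directory layout and "part-" naming are assumptions, not the paths the real job writes:

```scala
// Merge Spark part-files without shelling out to bash. Illustrative sketch only.
import java.io.File
import java.nio.file.{Files, Paths, StandardOpenOption}

object PartFileMergeSketch {
  def mergePartFiles(partsDir: String, target: String): Unit = {
    // list and order the part files; an absent directory yields an empty merge
    val parts = Option(new File(partsDir).listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
    val out = Files.newOutputStream(Paths.get(target),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
    try parts.foreach(p => Files.copy(p.toPath, out)) // append each part in order
    finally out.close()
  }
}
```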

core/src/main/resources/datasetdefinitions.json (1)

629-642: Update "history" dataset names and descriptions

     "history_dataset": {
-      "name": "History Links",
+      "name": "History Data",
       "traits":"LinkedData, Published",
-      "desc": "All data related to history",
+      "desc": "Per-revision history data extracted from Wikipedia",
       "defaultgraph": "dataset"
     },
     "history_stats": {
       "name": "History Stats",
       "traits":"LinkedData, Published",
-      "desc": "Statistics related to edition statistics",
+      "desc": "Aggregated editing statistics per page (e.g., revisions per year/month, average sizes, unique contributors)",
       "defaultgraph": "dataset"
     }

Scala constants HistoryData and HistoryStats already map to these keys.

history/download.properties (2)

4-4: Remove commented debug/test URL.

Line 4 contains a commented-out URL that appears to be for testing/debugging. This should be removed from production configuration files.

-#https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Exporter&action=submit&history=1&pages=H%C3%B4tes_de_passage%0ARaymond_Beaudet

15-15: Document or parameterize the hardcoded dump date.

The dump date is hardcoded to 20221001, which will become outdated. The comment on line 14 states "if empty: the most recent dump-date is used."

Consider one of the following:

  1. Leaving it empty to automatically use the most recent dump
  2. Adding documentation explaining why this specific date is required
  3. Making it configurable via an environment variable or build parameter (sketched below)
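
A tiny sketch of the third option, assuming the date is read from the environment; the variable name is hypothetical:

```scala
// Hypothetical: fall back to "" so the downloader picks the most recent dump-date.
val dumpDate: String = sys.env.getOrElse("DBPEDIA_DUMP_DATE", "")
```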
history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1)

7-7: Address or clarify the TODO comment.

The TODO comment "generic type may not be optimal" lacks context. Either:

  1. Elaborate on specific concerns with the generic type approach
  2. Remove the TODO if the current design is acceptable
  3. File an issue to track the concern if it's deferred work

If the generic type is working as intended, consider replacing with a clarifying comment:

-  * TODO: generic type may not be optimal.
+  * Generic composite extractor that aggregates multiple extractors of the same input type.
+  * Type parameter N allows composition of extractors for different node types (WikiPageWithRevisions, RevisionNode, etc.).
history/src/main/scala/org/dbpedia/extraction/mappings/CompositeWikiPageWithRevisionExtractor.scala (1)

5-7: Align extends/with indentation.

Scala convention keeps extends/with indented under the class header for readability. Please indent these two lines to match surrounding code.

dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties (1)

34-34: Trailing space in languages value

Trim to avoid parsing surprises in loaders that don’t auto-trim.

-languages=fr 
+languages=fr
history/src/main/scala/org/dbpedia/extraction/mappings/RedirectExtractor2.scala (1)

28-30: Remove unused namespaces val

Declared but never used. Drop to reduce noise.

-  private val namespaces = if (language == Language.Commons) ExtractorUtils.commonsNamespacesContainingMetadata
-    else Set(Namespace.Main, Namespace.Template, Namespace.Category)
+  // Intentionally no namespace set here; see extract() filter.
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala (2)

93-96: Link section relationships (top/sub and prev/next) like base extractor

Parity with WikipediaNifExtractor improves consumers relying on navigation.

-              tocMap.append(section)
+              tocMap.append(section)
+              section.top match {
+                case Some(s) => s.sub = Option(section)
+                case None => ()
+              }
+              section.prev match {
+                case Some(s) => s.next = Option(section)
+                case None => ()
+              }

45-47: Readability: avoid shadowing parameter name in lambda

Rename the lambda variable to prevent confusion with method param currentNodes.

-      subnodes = subnodes.dropWhile(currentNodes => !currentNodes.nodeName().matches("h\\d") && !currentNodes.nodeName().matches("section"))
+      subnodes = subnodes.dropWhile(n => !n.nodeName().matches("h\\d") && !n.nodeName().matches("section"))
history/src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala (3)

49-57: Avoid reparsing timestamp twice.

getYearMonth parses the ISO timestamp twice. Cache the ZonedDateTime locally (or reuse the instance from getYear) to avoid the duplicate parse and associated exception risk when the timestamp is malformed twice.


60-64: Tighten boolean parsing.

parseBoolean treats any non-empty string as true, so "True" or "yes" behave differently than "false", and typos silently coerce to true. Consider normalizing with str.equalsIgnoreCase("true") (and trimming) so unexpected inputs yield false (or throw).


66-73: Return Option instead of sentinel -1.

Returning -1 for missing numeric fields leaks sentinel values into downstream logic. Prefer Option[Int] / Option[Long] or throw on malformed input to keep the API type-safe.
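
A sketch combining the three suggestions above (cached timestamp parse, strict boolean parsing, Option instead of -1). The field names are assumptions based on the review text, not the actual RevisionNode definition.

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

class RevisionFieldsSketch(timestamp: String, deletedRaw: String, sizeRaw: String) {
  // parse the ISO timestamp once and reuse it for both year and year-month
  private lazy val parsed: Option[ZonedDateTime] =
    Try(ZonedDateTime.parse(timestamp, DateTimeFormatter.ISO_DATE_TIME)).toOption

  def getYear: Option[Int]         = parsed.map(_.getYear)
  def getYearMonth: Option[String] = parsed.map(d => f"${d.getYear}%04d-${d.getMonthValue}%02d")

  // only a literal "true" (trimmed, case-insensitive) counts as true
  def isDeleted: Boolean = deletedRaw.trim.equalsIgnoreCase("true")

  // Option instead of a -1 sentinel for malformed or missing sizes
  def textSize: Option[Long] = Try(sizeRaw.trim.toLong).toOption
}
```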

history/src/main/scala/org/dbpedia/extraction/mappings/ExtractionMonitor2.scala (2)

135-141: Return numeric successful count, not AtomicLong

Summaries generally expect values, not mutables.

Apply:

-    summary.put("SUCCESSFUL", s)
+    summary.put("SUCCESSFUL", Long.box(s.get()))

183-185: Replace printStackTrace with proper logging

Avoid printing to stdout in libraries.

Use a logger (java.util.logging or slf4j) with context and stack trace:

// logger.log(Level.WARNING, s"DATAID compare failed for $dataIDUrl", ex)
history/src/main/scala/org/dbpedia/extraction/config/Config2.scala (1)

260-281: mediawikiConnection: handle unknown mwc-type explicitly

Unknown values cause a MatchError that is wrapped and rethrown. Prefer explicit validation for clearer error messages.

Example:

-      apiUrl = this.getProperty("mwc-type").trim match {
+      apiUrl = this.getProperty("mwc-type", "").trim match {
         case "rest" =>  this.getProperty("mwc-apiRestUrl", "").trim
         case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
         case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
+        case other => throw new IllegalArgumentException(s"Unsupported mwc-type='$other'. Expected one of: rest|mwc|local.")
       },
history/src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala (2)

50-57: Typo: isRetryy -> isRetry

Minor naming nit for consistency.

Apply:

-  private var isRetryy = false
+  private var isRetry = false
...
-    this.isRetryy = !this.isRetryy
+    this.isRetry = !this.isRetry
...
-  def isRetry: Boolean = this.isRetryy
+  def isRetry: Boolean = this.isRetry

96-98: Avoid mapValues (lazy view) and return strict maps

mapValues returns a view in newer Scala; compute eagerly.

Apply:

-  def getRevPerYear: Map[String,Int] = {
-    this.revisions.groupBy(_.getYear).mapValues(_.size)
-  }
+  def getRevPerYear: Map[String,Int] =
+    this.revisions.groupBy(_.getYear).map { case (k, v) => k -> v.size }.toMap
...
-  def getRevPerYearMonth: Map[String, Int] = {
-     this.revisions.groupBy(_.getYearMonth).mapValues(_.size)
-  }
+  def getRevPerYearMonth: Map[String, Int] =
+    this.revisions.groupBy(_.getYearMonth).map { case (k, v) => k -> v.size }.toMap
...
-  def getRevPerYearMonthAvgSize: Map[String, Int] = {
-     this.revisions.groupBy(_.getYearMonth).map {
-      case (candidate, group) =>
-        candidate -> group.map { _.text_size }.sum / group.size
-    }
-  }
+  def getRevPerYearMonthAvgSize: Map[String, Int] =
+    this.revisions.groupBy(_.getYearMonth).map { case (k, g) => k -> (g.map(_.text_size).sum / g.size) }.toMap

Also applies to: 107-109, 111-116

core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala (2)

60-67: Parameter construction/escaping is fragile; prefer proper URL-encoding and clearer templating.

Manual replacements on title and ad-hoc replace/format on apiParameterString are error-prone. Use URLEncoder.encode for values and build parameters with a small helper to avoid mismatched %s/%d templates and accidental deletions.

  • Build a Map[String,String] of parameters (uselang, page or pageid, maxlag).
  • URL-encode only values with UTF-8.
  • Join with '&' to a body string.
    I can provide a helper if you want it in this PR.

Also applies to: 50-53
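
A minimal sketch of such a helper, assuming the connector only needs key/value pairs URL-encoded and joined into a request body (parameter names in the example are illustrative):

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object QueryBuilderSketch {
  // Encode keys and values with UTF-8 and join with '&'; no %s/%d templating needed.
  def buildQuery(params: Map[String, String]): String =
    params
      .map { case (k, v) =>
        URLEncoder.encode(k, StandardCharsets.UTF_8.name()) + "=" +
          URLEncoder.encode(v, StandardCharsets.UTF_8.name())
      }
      .mkString("&")

  def main(args: Array[String]): Unit =
    println(buildQuery(Map("uselang" -> "fr", "page" -> "Hôtes de passage", "maxlag" -> "5")))
}
```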


26-28: Remove unused vars and null Try; simplify.

  • availableProcessors, gzipok are unused.
  • parsedAnswer should not be initialized to null.

Apply:

-  private val osBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean
-  private val availableProcessors = osBean.getAvailableProcessors
+  // remove unused osBean/availableProcessors

-    var gzipok = true
-    var parsedAnswer: Try[String] = null
+    var parsedAnswer: Try[String] = Failure(new IllegalStateException("uninitialized"))

Also drop gzipok and guard gzip by Content-Encoding as suggested above.

Also applies to: 41-43, 107-115

history/src/main/scala/org/dbpedia/extraction/util/ExtractionRecorder2.scala (4)

220-223: Don’t increment “successful pages” for quads; use a dedicated triples counter.

recordQuad invokes increaseAndGetSuccessfulPages, skewing page stats. Use the existing successfulTripleCount keyed by quad.dataset.

Apply:

-  def recordQuad(quad: Quad, severity: RecordSeverity.Value, lang:Language): Unit = synchronized {
-    if(increaseAndGetSuccessfulPages(lang) % reportInterval == 0)
+  def recordQuad(quad: Quad, severity: RecordSeverity.Value, lang: Language): Unit = synchronized {
+    val triples = increaseAndGetSuccessfulTriples(quad.dataset)
+    if (triples % reportInterval == 0)
       printLabeledLine("processed {page} quads; {mspp} per quad; {fail} failed quads", severity, lang)
   }

Optionally adapt placeholders to reflect triple count and ms/triple.


323-339: Avoid finalize; provide explicit close and implement AutoCloseable.

finalize is deprecated and unreliable. Expose a close() (or closeWriter()) and let callers use try/finally or Loan pattern.

Suggested change:

-  override def finalize(): Unit ={
-    if(writerOpen){
-      logWriter.close()
-      writerOpen = false
-    }
-    ...
-    super.finalize()
-  }
+  def close(): Unit = synchronized {
+    if (writerOpen) {
+      try logWriter.close() finally writerOpen = false
+    }
+  }

Then call recorder.close() from the orchestration layer when finishing.


468-481: Set Content-Type when posting Slack JSON; optionally add timeouts.

Slack expects application/json; missing header may fail depending on gateway.

Apply:

-      val resp = Http(url).postData(new String(baos.toByteArray, Charset.defaultCharset())).asString
+      val resp = Http(url)
+        .header("Content-Type", "application/json; charset=UTF-8")
+        .timeout(connTimeoutMs = 5000, readTimeoutMs = 10000)
+        .postData(new String(baos.toByteArray, Charset.forName("UTF-8")))
+        .asString

126-151: Minor: remove unused val and simplify branch.

val msg is unused; just execute side effects.

Apply:

-        case _  =>
-          val msg = Option(record.errorMsg) match{
-            case Some(m) => printLabeledLine(m, record.severity, record.language)
-            case None =>
-              if(record.error != null) failedRecord(null, record.page, record.error, record.language)
-              else recordGenericPage(record.language, record.page.toString)
-          }
+        case _  =>
+          Option(record.errorMsg) match {
+            case Some(m) => printLabeledLine(m, record.severity, record.language)
+            case None =>
+              if (record.error != null) failedRecord(null, record.page, record.error, record.language)
+              else recordGenericPage(record.language, record.page.toString)
+          }
core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorAbstract.scala (2)

8-11: Remove unused imports.

OutputStreamWriter, URL, ChronoUnit are unused in this abstract class.

Apply:

-import java.io.{InputStream, OutputStreamWriter}
-import java.net.{HttpURLConnection, URL}
-import java.time.temporal.ChronoUnit
+import java.io.InputStream

42-44: Drop unused availableProcessors.

These fields aren’t used; dead code.

Apply:

-  private val osBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean
-  private val availableProcessors = osBean.getAvailableProcessors
+  // removed unused OS bean/utilization fields
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 284bafc and f97dafa.

⛔ Files ignored due to path filters (3)
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-history-dataset.ttl.bz2 is excluded by !**/*.bz2
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-history-stats.ttl.bz2 is excluded by !**/*.bz2
  • history/sample-xml-dump/frwiki/20220101/frwiki-20220101-template-redirects.obj is excluded by !**/*.obj
📒 Files selected for processing (57)
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java (8 hunks)
  • core/src/main/resources/datasetdefinitions.json (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/config/Config.scala (3 hunks)
  • core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/NifExtractor.scala (3 hunks)
  • core/src/main/scala/org/dbpedia/extraction/mappings/PlainAbstractExtractor.scala (2 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala (4 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorAbstract.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnectorRest.scala (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/MediawikiConnectorConfigured.scala (1 hunks)
  • dump/src/test/bash/createMinidump_custom_sample.sh (1 hunks)
  • dump/src/test/bash/createSampleRandomFromPageIDdataset.sh (1 hunks)
  • dump/src/test/bash/create_custom_sample.sh (1 hunks)
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties (3 hunks)
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties (3 hunks)
  • dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.md (1 hunks)
  • dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract.scala (1 hunks)
  • history/ReadMe.md (1 hunks)
  • history/download.properties (1 hunks)
  • history/extraction.properties (1 hunks)
  • history/pom.xml (1 hunks)
  • history/src/main/java/org/dbpedia/extraction/sources/WikipediaDumpParserHistory.java (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/config/Config2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/destinations/WriterDestination2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/ConfigLoader2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/ExtractionJob2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/SerializableUtils2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/dump/extract/SparkExtractionJob2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeParseExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/CompositeWikiPageWithRevisionExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/ExtractionMonitor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/HistoryPageExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/HistoryStatsExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RedirectExtractor2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/Redirects2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNode.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNodeExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/mappings/WikiPageWithRevisionsExtractor.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/sources/Source2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/sources/XMLSource2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/DumpExtractionContext2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/ExtractionRecorder2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/util/SparkExtractionContext2.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/wikiparser/RevisionNode.scala (1 hunks)
  • history/src/main/scala/org/dbpedia/extraction/wikiparser/WikiPageWithRevisions.scala (1 hunks)
  • history/src/test/resources/extraction-configs/extraction.config.properties (1 hunks)
  • history/src/test/scala/org/dbpedia/extraction/dump/extract/ExtractionTest.scala (1 hunks)
  • mappings/Mapping_el.xml (1 hunks)
  • mappings/Mapping_en.xml (30 hunks)
  • mappings/Mapping_fr.xml (31 hunks)
  • mappings/Mapping_hu.xml (3 hunks)
  • pom.xml (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
history/ReadMe.md

11-11: Bare URL used

(MD034, no-bare-urls)


14-14: Bare URL used

(MD034, no-bare-urls)


38-38: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


39-39: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


40-40: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


41-41: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


44-44: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


45-45: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


46-46: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


47-47: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


48-48: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


49-49: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


89-89: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
dump/src/test/bash/create_custom_sample.sh

[warning] 8-13: Invalid flags are not handled. Add a *) case.

(SC2220)


[error] 35-35: Double quote array expansions to avoid re-splitting elements.

(SC2068)


[warning] 35-35: In POSIX sh, array references are undefined.

(SC3054)


[warning] 37-37: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 37-37: In POSIX sh, =~ regex matching is undefined.

(SC3015)


[warning] 37-37: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)


[warning] 42-42: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 42-42: In POSIX sh, == in place of = is undefined.

(SC3014)


[warning] 55-55: In POSIX sh, 'declare' is undefined.

(SC3044)


[warning] 57-57: In POSIX sh, $'..' is undefined.

(SC3003)


[warning] 57-57: In POSIX sh, arrays are undefined.

(SC3030)


[warning] 57-57: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 58-58: In POSIX sh, array references are undefined.

(SC3054)


[warning] 59-59: In POSIX sh, array references are undefined.

(SC3054)


[warning] 60-60: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 61-61: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 64-64: In POSIX sh, array references are undefined.

(SC3054)


[warning] 70-70: In POSIX sh, [[ ]] is undefined.

(SC3010)


[warning] 70-70: In POSIX sh, == in place of = is undefined.

(SC3014)


[warning] 71-71: In POSIX sh, indirect expansion is undefined.

(SC3053)


[warning] 71-71: In POSIX sh, array key expansion is undefined.

(SC3055)


[warning] 73-73: In POSIX sh, array references are undefined.

(SC3054)


[warning] 76-76: In POSIX sh, indirect expansion is undefined.

(SC3053)


[warning] 76-76: In POSIX sh, array key expansion is undefined.

(SC3055)


[warning] 78-78: In POSIX sh, array references are undefined.

(SC3054)

dump/src/test/bash/createSampleRandomFromPageIDdataset.sh

[warning] 7-12: Invalid flags are not handled. Add a *) case.

(SC2220)

dump/src/test/bash/createMinidump_custom_sample.sh

[warning] 5-7: Invalid flags are not handled. Add a *) case.

(SC2220)


[warning] 54-54: The surrounding quotes actually unquote this. Remove or escape them.

(SC2027)

🔇 Additional comments (8)
history/src/test/resources/extraction-configs/extraction.config.properties (1)

21-21: Verify extractor configuration completeness.

The AI summary indicates that history extraction uses both HistoryPageExtractor and HistoryStatsExtractor, but this test configuration only includes HistoryPageExtractor. The main history/extraction.properties file configures both extractors (lines 21-22).

Please confirm whether this test intentionally uses a subset of extractors or if HistoryStatsExtractor should also be included for comprehensive testing.

history/src/main/scala/org/dbpedia/extraction/mappings/RevisionNodeExtractor.scala (1)

1-3: LGTM!

The trait definition is clean and follows the established extractor pattern, providing type safety for RevisionNode extractors.

history/src/main/scala/org/dbpedia/extraction/mappings/CompositeExtractor2.scala (1)

11-15: LGTM - composition pattern correctly implemented.

The composite extractor properly aggregates datasets and extraction results from multiple extractors. The use of flatMap and toSet ensures correct union semantics for datasets, and the extraction concatenation follows the expected composite pattern.

history/src/main/scala/org/dbpedia/extraction/mappings/WikiPageWithRevisionsExtractor.scala (1)

1-10: LGTM!

The trait provides clear type safety for WikiPageWithRevisions extractors. The documentation appropriately explains the purpose and relationship to CompositeExtractor.

core/src/main/scala/org/dbpedia/extraction/config/provenance/DBpediaDatasets.scala (1)

281-286: LGTM!

The new history dataset declarations follow the established pattern and are properly documented. The mappings to history_dataset and history_stats align with the dataset definitions introduced in this PR.

core/src/main/scala/org/dbpedia/extraction/util/MediaWikiConnector.scala (1)

20-20: Verification script run against the PR diff:

#!/bin/bash
# Verify if 'new MediaWikiConnector' instantiations were added in this PR's diff
git fetch origin main
git diff --unified=0 origin/main... | rg '^\+\s*new\s+MediaWikiConnector'
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractor.scala (1)

213-223: Good change: helpers to protected for subclassing

Visibility relaxation is appropriate for REST subclass overrides; no behavior change.

history/src/main/scala/org/dbpedia/extraction/dump/extract/Extraction2.scala (1)

29-31: Validate parallelProcesses >= 1

If config.parallelProcesses is 0, the scheduler stalls. Enforce min 1 or validate in Config2.
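
A minimal validation sketch, assuming the value is parsed as an Int from the properties file (the actual Config2 wiring may differ):

```scala
// Fail fast with a clear message instead of letting a 0 stall the scheduler.
def validatedParallelProcesses(raw: Int): Int =
  if (raw >= 1) raw
  else throw new IllegalArgumentException(s"parallel-processes must be >= 1, got $raw")
```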

Comment on lines +194 to 204
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if(uri.contains(linkPrefix)){
uri=uri.substring(uri.indexOf("?title=")+7);
} else if (uri.contains(linkPrefix2)) {
uri=uri.substring(uri.indexOf("?title=")+3);
}
//TODO central string management
if(!this.context.language.equals("en")) {

uri="http://"+this.context.language+".dbpedia.org/resource/"+uri.substring(uri.indexOf("?title=")+7);
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri;


⚠️ Potential issue | 🔴 Critical

Incorrect substring when stripping REST wiki links

The new cleanLink branch drops the first three characters after ./ and assumes ?title= is absent. For REST HTML such as href="./?title=Main_Page&oldid=123", the current code returns tle=Main_Page&oldid=123, which later becomes http://dbpedia.org/resource/tle=Main_Page&oldid=123. Similarly, /wiki/Foo?oldid=123 keeps the ?oldid tail. Both produce invalid resources. Please normalize the prefixes and strip any ?title=/query/fragment portions before building the DBpedia URI.

-            if(uri.contains(linkPrefix)){
-                uri=uri.substring(uri.indexOf("?title=")+7);
-            } else if (uri.contains(linkPrefix2)) {
-                uri=uri.substring(uri.indexOf("?title=")+3);
-            }
+            if (uri.startsWith(linkPrefix)) {
+                uri = uri.substring(linkPrefix.length());
+            } else if (uri.startsWith(linkPrefix2)) {
+                uri = uri.substring(linkPrefix2.length());
+            }
+            if (uri.contains("?title=")) {
+                uri = uri.substring(uri.indexOf("?title=") + 7);
+            }
+            int queryIdx = uri.indexOf('?');
+            if (queryIdx >= 0) {
+                uri = uri.substring(0, queryIdx);
+            }
+            int fragmentIdx = uri.indexOf('#');
+            if (fragmentIdx >= 0) {
+                uri = uri.substring(0, fragmentIdx);
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if(uri.contains(linkPrefix)){
uri=uri.substring(uri.indexOf("?title=")+7);
} else if (uri.contains(linkPrefix2)) {
uri=uri.substring(uri.indexOf("?title=")+3);
}
//TODO central string management
if(!this.context.language.equals("en")) {
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri.substring(uri.indexOf("?title=")+7);
uri="http://"+this.context.language+".dbpedia.org/resource/"+uri;
String linkPrefix = "/wiki/";
String linkPrefix2= "./";
if (uri.startsWith(linkPrefix)) {
uri = uri.substring(linkPrefix.length());
} else if (uri.startsWith(linkPrefix2)) {
uri = uri.substring(linkPrefix2.length());
}
if (uri.contains("?title=")) {
uri = uri.substring(uri.indexOf("?title=") + 7);
}
int queryIdx = uri.indexOf('?');
if (queryIdx >= 0) {
uri = uri.substring(0, queryIdx);
}
int fragmentIdx = uri.indexOf('#');
if (fragmentIdx >= 0) {
uri = uri.substring(0, fragmentIdx);
}
//TODO central string management
if (!this.context.language.equals("en")) {
uri = "http://" + this.context.language + ".dbpedia.org/resource/" + uri;
}

Comment on lines +264 to 270
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,

⚠️ Potential issue | 🔴 Critical

Handle unknown mwc-type safely

The match on this.getProperty("mwc-type").trim has no fallback, so a missing or misspelled property now throws MatchError at startup. Please keep the previous behaviour by providing a sensible default (or an explicit failure with a clear message).

-      apiUrl = this.getProperty("mwc-type").trim match {
-        case "rest" =>  this.getProperty("mwc-apiRestUrl", "").trim
-        case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
-        case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
-      },
+      apiUrl = this.getProperty("mwc-type", "mwc").trim match {
+        case "rest"  => this.getProperty("mwc-apiRestUrl", "").trim
+        case "mwc"   => this.getProperty("mwc-apiMWCUrl", "").trim
+        case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
+        case other   => throw new IllegalArgumentException(s"Unsupported mwc-type '$other'")
+      },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,
apiType = this.getProperty("mwc-type", "").trim,
apiUrl = this.getProperty("mwc-type", "mwc").trim match {
case "rest" => this.getProperty("mwc-apiRestUrl", "").trim
case "mwc" => this.getProperty("mwc-apiMWCUrl", "").trim
case "local" => this.getProperty("mwc-apiLocalUrl", "").trim
case other => throw new IllegalArgumentException(s"Unsupported mwc-type '$other'")
},
maxRetries = this.getProperty("mwc-maxRetries", "4").trim.toInt,
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/config/Config.scala around lines
264 to 270, the pattern match on this.getProperty("mwc-type").trim can throw
MatchError for unknown or missing values; change it to match the apiType value
you already read and add a fallback case (case _ => "") or an explicit
IllegalArgumentException with a clear message. Concretely, use the apiType val
in the match and add a default branch that returns a sensible default (e.g. an
empty string) or throws a descriptive error so startup fails with a clear
message instead of a MatchError.

Comment on lines +37 to +39
var nodes = doc.select("body").first.childNodes.asScala

val currentSection = new ListBuffer[Int]() //keeps track of section number

⚠️ Potential issue | 🟡 Minor

Add null-safety around body/section selection

doc.select("body").first and .select("section").first can be null for some pages. Guard to avoid NPEs.

-    var nodes = doc.select("body").first.childNodes.asScala
+    val body = doc.select("body").first
+    if (body == null) return tocMap
+    var nodes = body.childNodes.asScala
@@
-    val abstractSect=doc.select("body").select("section").first.childNodes.asScala //get first section
+    val firstSection = body.select("section").first
+    if (firstSection == null) return tocMap
+    val abstractSect = firstSection.childNodes.asScala

Also applies to: 108-111

🤖 Prompt for AI Agents
In
core/src/main/scala/org/dbpedia/extraction/nif/WikipediaNifExtractorRest.scala
around lines 37-39 (and similarly at lines 108-111), calls to
doc.select("body").first and doc.select("section").first can return null and
cause NPEs; add null-safety by checking for null (or wrapping in Option) before
accessing .first or .childNodes, and handle the absent case by skipping
processing or returning early as appropriate; specifically, replace direct
.first usage with a safe lookup (e.g., Option(...).toList or if (elem != null)
...) and only access childNodes or call .asScala when the element is present so
the extractor continues safely when body/section is missing.

Comment on lines +52 to +120
val parameters = "redirect=true"
val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))



//println(s"mediawikiurl: $apiUrl")


for (counter <- 1 to maxRetries) {

val conn = apiUrl.openConnection
conn.setDoOutput(true) // POST REQUEST to verify

val start = java.time.LocalTime.now()

conn.setConnectTimeout(retryFactor * connectMs)
conn.setReadTimeout(retryFactor * readMs)
conn.setRequestProperty("accept", apiAccept)
conn.setRequestProperty("charset", apiCharset)
conn.setRequestProperty("profile", apiProfile)
conn.setRequestProperty("Accept-Language", pageTitle.language.wikiCode)
conn.setRequestProperty("User-Agent", userAgent)

val inputStream = conn.getInputStream
val answerHeader = conn.getHeaderFields()
val answerClean = answerHeader.asScala.filterKeys(_ != null)

if(conn.getHeaderField(null).contains("HTTP/1.1 200 OK") ){


val end = java.time.LocalTime.now()
conn match {
case connection: HttpURLConnection =>
log.debug("Request type: " + connection.getRequestMethod + "; URL: " + connection.getURL +
"; Parameters: " + parameters + "; HTTP code: " + connection.getHeaderField(null) +
"; Request time: " + start + "; Response time: " + end + "; Time needed: " +
start.until(end, ChronoUnit.MILLIS))
case _ =>
}
// Read answer
parsedAnswer = readInAbstract(inputStream)
SuccessParsing = parsedAnswer match {
case Success(str) => true
case Failure(_) => false
}
}
if(!SuccessParsing){
var sleepMs = sleepFactorMs
if (retryAfter && answerClean.contains("retry-after")) {
//println("GIVEN RETRY-AFTER > "+ answer_clean("retry-after").get(0))
waitingTime = Integer.parseInt(answerClean("retry-after").get(0)) * 1000

// exponential backoff test
sleepMs = pow(waitingTime, counter).toInt
//println("WITH EXPONENTIAL BACK OFF" + counter)
//println("Sleeping time double >>>>>>>>>>>" + pow(waiting_time, counter))
//println("Sleeping time int >>>>>>>>>>>" + sleepMs)

}
if (counter < maxRetries)
Thread.sleep(sleepMs)
else
throw new Exception("Timeout error retrieving abstract of " + pageTitle + " in " + counter + " tries.")
} else {


//println(s"mediawikiurl: $apiUrl?$parameters")
return parsedAnswer match {
case Success(str) => Option(str)

⚠️ Potential issue | 🟠 Major

Use the supplied apiParameterString when building the REST URL

retrievePage ignores apiParameterString and always forces redirect=true. Callers rely on that parameter (e.g., to request specific sections or formats); dropping it makes the REST connector fetch the wrong content. Please merge the caller-provided parameters with your default instead of discarding them.

-    val parameters = "redirect=true"
-    val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))
+    val baseParams = Option(apiParameterString)
+      .map(_.stripPrefix("?"))
+      .filter(_.nonEmpty)
+      .getOrElse("")
+    val allParams = Seq(baseParams, "redirect=true").filter(_.nonEmpty).mkString("&")
+    val suffix = if (allParams.isEmpty) "" else "?" + allParams
+    val apiUrl: URL = new URL(url.concat(titleParam).concat(suffix))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
val parameters = "redirect=true"
val apiUrl: URL = new URL(url.concat(titleParam).concat("?"+parameters))
//println(s"mediawikiurl: $apiUrl")
for (counter <- 1 to maxRetries) {
val conn = apiUrl.openConnection
conn.setDoOutput(true) // POST REQUEST to verify
val start = java.time.LocalTime.now()
conn.setConnectTimeout(retryFactor * connectMs)
conn.setReadTimeout(retryFactor * readMs)
conn.setRequestProperty("accept", apiAccept)
conn.setRequestProperty("charset", apiCharset)
conn.setRequestProperty("profile", apiProfile)
conn.setRequestProperty("Accept-Language", pageTitle.language.wikiCode)
conn.setRequestProperty("User-Agent", userAgent)
val inputStream = conn.getInputStream
val answerHeader = conn.getHeaderFields()
val answerClean = answerHeader.asScala.filterKeys(_ != null)
if(conn.getHeaderField(null).contains("HTTP/1.1 200 OK") ){
val end = java.time.LocalTime.now()
conn match {
case connection: HttpURLConnection =>
log.debug("Request type: " + connection.getRequestMethod + "; URL: " + connection.getURL +
"; Parameters: " + parameters + "; HTTP code: " + connection.getHeaderField(null) +
"; Request time: " + start + "; Response time: " + end + "; Time needed: " +
start.until(end, ChronoUnit.MILLIS))
case _ =>
}
// Read answer
parsedAnswer = readInAbstract(inputStream)
SuccessParsing = parsedAnswer match {
case Success(str) => true
case Failure(_) => false
}
}
if(!SuccessParsing){
var sleepMs = sleepFactorMs
if (retryAfter && answerClean.contains("retry-after")) {
//println("GIVEN RETRY-AFTER > "+ answer_clean("retry-after").get(0))
waitingTime = Integer.parseInt(answerClean("retry-after").get(0)) * 1000
// exponential backoff test
sleepMs = pow(waitingTime, counter).toInt
//println("WITH EXPONENTIAL BACK OFF" + counter)
//println("Sleeping time double >>>>>>>>>>>" + pow(waiting_time, counter))
//println("Sleeping time int >>>>>>>>>>>" + sleepMs)
}
if (counter < maxRetries)
Thread.sleep(sleepMs)
else
throw new Exception("Timeout error retrieving abstract of " + pageTitle + " in " + counter + " tries.")
} else {
//println(s"mediawikiurl: $apiUrl?$parameters")
return parsedAnswer match {
case Success(str) => Option(str)
val baseParams = Option(apiParameterString)
.map(_.stripPrefix("?"))
.filter(_.nonEmpty)
.getOrElse("")
val allParams = Seq(baseParams, "redirect=true").filter(_.nonEmpty).mkString("&")
val suffix = if (allParams.isEmpty) "" else "?" + allParams
val apiUrl: URL = new URL(url.concat(titleParam).concat(suffix))

Comment on lines +1 to +80
#!/bin/sh
lang="";
n=1000;
date_archive=$(date -d "$(date +%Y-%m-01) -1 day" +%Y-%m);
sort="desc"
while getopts l:d:n:s: flag
do
case "${flag}" in
l) lang=${OPTARG};;
d) date_archive=${OPTARG};;
n) n=${OPTARG};;
s) sort=${OPTARG};;
esac
done
echo "========================="
echo "lang: $lang";
echo "date: $date_archive";
echo "n: $n";
echo "sort: $sort";
echo "========================="


clickstream_data="clickstream_data_${lang}_${date_archive}"
if [ -f "$clickstream_data" ]
then
echo "File found"
else
echo "File not found"
clickstream_url="https://dumps.wikimedia.org/other/clickstream/";
content=$(curl -L "$clickstream_url$date_archive/")
links=$( echo $content | grep -Po '(?<=href=")[^"]*');
toextract="";
substr="-${lang}wiki-"
echo $substr
for link in ${links[@]}; do
echo $link
if [[ $link =~ "-${lang}wiki-" ]];then
toextract="$clickstream_url$date_archive/$link";
fi
done

if [[ $toextract == "" ]]; then
echo "Lang not found in clickstream";
exit 1;
fi

echo ">>>> DOWNLOAD $toextract and save it"

wget -O "${clickstream_data}.gz" $toextract;
gzip -d "${clickstream_data}.gz"
fi


echo ">>>> COMPUTE SUM OF CLICKS"
declare -A dict
while IFS= read -r line; do
IFS=$'\t'; arrIN=($line); unset IFS;
key=${arrIN[1]}
val=${arrIN[3]}
if [[ ${key} != *"List"* ]];then
if [[ ${#dict[${key}]} -eq 0 ]] ;then
dict[${key}]=$(($val));
else
dict[${key}]=$((${dict[${key}]}+$val));
fi
fi
done < $clickstream_data

echo ">>>> SORT IT AND SAVE TEMP"
if [[ $sort == "desc" ]]; then
for page in "${!dict[@]}"
do
echo "$page ${dict[$page]}"
done | sort -rn -k2 | head -n "$n" | cut -d ' ' -f 1 >> temp.txt;
else
for page in "${!dict[@]}"
do
echo "$page ${dict[$page]}"
done | sort -n -k2 | head -n "$n" | cut -d ' ' -f 1 >> temp.txt;
fi

⚠️ Potential issue | 🔴 Critical

Fix shebang or drop Bash-specific syntax.

The script declares #!/bin/sh yet relies on Bash-only features (declare -A, ${array[@]}, [[ … ]], regex =~). On systems where /bin/sh is not Bash, the script aborts. Either switch the shebang to #!/bin/bash (and ensure Bash is available) or refactor to POSIX-compliant constructs.

-#!/bin/sh
+#!/bin/bash
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 8-13: Invalid flags are not handled. Add a *) case. (SC2220)
[error] 35-35: Double quote array expansions to avoid re-splitting elements. (SC2068)
[warning] 35-35: In POSIX sh, array references are undefined. (SC3054)
[warning] 37-37: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 37-37: In POSIX sh, =~ regex matching is undefined. (SC3015)
[warning] 37-37: Remove quotes from right-hand side of =~ to match as a regex rather than literally. (SC2076)
[warning] 42-42: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 42-42: In POSIX sh, == in place of = is undefined. (SC3014)
[warning] 55-55: In POSIX sh, 'declare' is undefined. (SC3044)
[warning] 57-57: In POSIX sh, $'..' is undefined. (SC3003)
[warning] 57-57: In POSIX sh, arrays are undefined. (SC3030)
[warning] 57-57: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a. (SC2206)
[warning] 58-58: In POSIX sh, array references are undefined. (SC3054)
[warning] 59-59: In POSIX sh, array references are undefined. (SC3054)
[warning] 60-60: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 61-61: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 64-64: In POSIX sh, array references are undefined. (SC3054)
[warning] 70-70: In POSIX sh, [[ ]] is undefined. (SC3010)
[warning] 70-70: In POSIX sh, == in place of = is undefined. (SC3014)
[warning] 71-71: In POSIX sh, indirect expansion is undefined. (SC3053)
[warning] 71-71: In POSIX sh, array key expansion is undefined. (SC3055)
[warning] 73-73: In POSIX sh, array references are undefined. (SC3054)
[warning] 76-76: In POSIX sh, indirect expansion is undefined. (SC3053)
[warning] 76-76: In POSIX sh, array key expansion is undefined. (SC3055)
[warning] 78-78: In POSIX sh, array references are undefined. (SC3054)
🤖 Prompt for AI Agents
In dump/src/test/bash/create_custom_sample.sh lines 1-80, the script uses
Bash-only features (declare -A, [[ ]], =~, ${array[@]}) but declares #!/bin/sh;
change the shebang to a Bash-aware interpreter (e.g., #!/usr/bin/env bash) to
guarantee Bash semantics across systems, or alternatively refactor all
Bash-specific constructs to POSIX-compliant equivalents; update the shebang to
the env-backed bash if you choose the simpler fix and verify the script runs on
target CI/hosts.

@@ -0,0 +1,27 @@
# download and extraction target dir
#base-dir= moved to $extraction-framework/core/src/main/resources/universal.properties
base-dir=/home/cringwal/Desktop/DBpediaHistory/extraction-framework/history/sample-xml-dump

⚠️ Potential issue | 🟠 Major

Remove hardcoded absolute path with username.

The hardcoded path /home/cringwal/Desktop/DBpediaHistory/... contains a specific username and will fail on other machines or CI environments.

Consider one of these approaches:

  1. Use a relative path:
-base-dir=/home/cringwal/Desktop/DBpediaHistory/extraction-framework/history/sample-xml-dump
+base-dir=./sample-xml-dump
  2. Or rely on the universal.properties default and remove this line entirely if the default is suitable for tests.
🤖 Prompt for AI Agents
In history/src/test/resources/extraction-configs/extraction.config.properties
around line 3, the property base-dir contains a hardcoded absolute path with a
username; remove this platform-specific path and either delete the line so the
test falls back to the universal.properties default, or replace it with a
portable relative path (e.g. ./sample-xml-dump or ../sample-xml-dump) that
points to the test resource directory so the tests run on any machine/CI.

Comment on lines +17 to +41
test("test Historic extraction", HistoTestTag) {
val jobsRunning = new ConcurrentLinkedQueue[Future[Unit]]()//
val classLoader: ClassLoader = getClass.getClassLoader
val histoConfig = new Config2(classLoader.getResource("extraction-configs/extraction.config.properties").getFile)
println(classLoader.getResource("extraction-configs/extraction.config.properties").getFile.toString)
println("BEFORE EXTRACT")
extract(histoConfig, jobsRunning)
println("AFTER EXTRACT")
}

def extract(config: Config2, jobsRunning: ConcurrentLinkedQueue[Future[Unit]]): Unit = {
val configLoader = new ConfigLoader2(config)
val jobs = configLoader.getExtractionJobs
println(">>>>>>>>> EXTRACT - NBJOBS > " + jobs.size)
println("LAUNCH JOBS")
for (job <- jobs) {
job.run()
}
while (jobsRunning.size() > 0) {

Thread.sleep(1000)
}

jobsRunning.clear()
}

⚠️ Potential issue | 🟠 Major

Resource loading via getFile breaks for off-classpath resources and paths containing spaces.

classLoader.getResource(...).getFile only works for plain file URLs without spaces. In Maven/Gradle runs the resource often sits in target/test-classes, so toURI is safer; inside a JAR, getFile fails outright. Please resolve the URL properly (Paths.get(resource.toURI)) or stream the resource instead.

Fix sketch:

-    val histoConfig = new Config2(classLoader.getResource("extraction-configs/extraction.config.properties").getFile)
+    val url = classLoader.getResource("extraction-configs/extraction.config.properties")
+    require(url != null, "Missing extraction-config resource")
+    val histoConfig = new Config2(Paths.get(url.toURI).toString)

Add the requisite java.nio.file.Paths import.
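
Alternatively, a stream-based load sidesteps path handling entirely and also works when the resource is packed in a JAR. A minimal sketch, assuming Config2 accepts a plain filesystem path as it does in the test above; the temp-file copy is purely illustrative:

// Copy the classpath resource to a temp file so it can be handed to Config2 by path.
// (That Config2 takes a path is assumed from the test code above.)
import java.nio.file.{Files, StandardCopyOption}

val in = getClass.getClassLoader.getResourceAsStream("extraction-configs/extraction.config.properties")
require(in != null, "Missing extraction-config resource")
val tmp = Files.createTempFile("extraction.config", ".properties")
Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
in.close()
val histoConfig = new Config2(tmp.toString)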

🤖 Prompt for AI Agents
In
history/src/test/scala/org/dbpedia/extraction/dump/extract/ExtractionTest.scala
around lines 17 to 41, the test uses classLoader.getResource(...).getFile which
fails for off-classpath resources, spaces, and JAR-packed resources; replace
that usage by resolving the URL to a URI and converting via
Paths.get(resource.toURI()) or, better, load the resource as a stream with
getResourceAsStream for JAR safety, add a null-check for the resource, and add
the requisite java.nio.file.Paths import (or use java.io.InputStream handling)
so the test works in Maven/Gradle and inside JARs.

Comment on lines 3396 to +3399
{{PropertyMapping | templateProperty = όνομα | ontologyProperty = foaf:name }}
{{PropertyMapping | templateProperty = Χώρα | ontologyProperty = country }}
{{PropertyMapping | templateProperty = Πληθυσμός | ontologyProperty = population }}


⚠️ Potential issue | 🔴 Critical

Use populationTotal for Πληθυσμός.

ontologyProperty = population is not defined in the DBpedia ontology, so the extractor will drop these values instead of populating the village population. Swap to populationTotal, which is already used elsewhere for settlements and validates cleanly.

-            {{PropertyMapping | templateProperty = Πληθυσμός  | ontologyProperty = population }}
+            {{PropertyMapping | templateProperty = Πληθυσμός  | ontologyProperty = populationTotal }}
🤖 Prompt for AI Agents
In mappings/Mapping_el.xml around lines 3396 to 3399, the mapping uses
ontologyProperty = population which is not defined in DBpedia ontology; change
it to ontologyProperty = populationTotal so the extractor will retain and
populate village population values consistently with other settlement mappings.

Comment on lines +21493 to +21499
{{ PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
{{ PropertyMapping | templateProperty = date | ontologyProperty = date }}
{{ PropertyMapping | templateProperty = disease | ontologyProperty = disease }}
{{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virus_strain }}
{{ PropertyMapping | templateProperty = location | ontologyProperty = Location }}
{{ PropertyMapping | templateProperty = deaths | ontologyProperty = Deaths }}
{{ PropertyMapping | templateProperty = symptom | ontologyProperty = Symptom }}

⚠️ Potential issue | 🔴 Critical

Fix invalid ontology property names in Infobox pandemic mapping.

Location, Deaths, Symptom, and virus_strain are not valid DBpedia ontology properties; the extractor will either mint non-existent predicates or drop the data. Please map to the canonical property names (e.g., location, numberOfDeaths, symptom, virusStrain).

-   {{ PropertyMapping | templateProperty = location | ontologyProperty = Location }}
-   {{ PropertyMapping | templateProperty = deaths | ontologyProperty = Deaths }}
-   {{ PropertyMapping | templateProperty = symptom | ontologyProperty = Symptom }}
-   {{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virus_strain }}
+   {{ PropertyMapping | templateProperty = location | ontologyProperty = location }}
+   {{ PropertyMapping | templateProperty = deaths | ontologyProperty = numberOfDeaths }}
+   {{ PropertyMapping | templateProperty = symptom | ontologyProperty = symptom }}
+   {{ PropertyMapping | templateProperty = virus_strain | ontologyProperty = virusStrain }}
🤖 Prompt for AI Agents
In mappings/Mapping_en.xml around lines 21493 to 21499, several ontologyProperty
values in the Infobox pandemic PropertyMapping entries are invalid; update the
mappings so DBpedia uses canonical property names: change
ontologyProperty="Location" to ontologyProperty="location", "Deaths" to
"numberOfDeaths", "Symptom" to "symptom", and "virus_strain" to the canonical
"virusStrain" (also keep foaf:name and date/disease mappings as-is); ensure
exact spelling and casing to prevent minting non-existent predicates or data
loss.

Comment on lines +5374 to +5375
{{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = nextTrackNumber}}
{{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = previousTrackNumber}}

⚠️ Potential issue | 🟠 Major

Swap previous/next track property mappings

numéro piste préc (previous track) is mapped to nextTrackNumber, and numéro piste suiv (next track) is mapped to previousTrackNumber, inverting the semantics returned by the extractor.

-        {{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = nextTrackNumber}}
-        {{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = previousTrackNumber}}
+        {{PropertyMapping | templateProperty = numéro piste préc | ontologyProperty = previousTrackNumber}}
+        {{PropertyMapping | templateProperty = numéro piste suiv | ontologyProperty = nextTrackNumber}}
🤖 Prompt for AI Agents
In mappings/Mapping_fr.xml around lines 5374-5375, the two PropertyMapping
entries have their ontologyProperty values inverted: "numéro piste préc"
(previous track) is incorrectly mapped to nextTrackNumber and "numéro piste
suiv" (next track) is incorrectly mapped to previousTrackNumber; swap the
ontologyProperty values so "numéro piste préc" maps to previousTrackNumber and
"numéro piste suiv" maps to nextTrackNumber to restore correct semantics.
