Skip to content

Conversation

@ghost-2362003
Copy link
Contributor

@ghost-2362003 ghost-2362003 commented Nov 2, 2025

Updated the fromLines() and fromLine() methods of WikiInfo.scala for proper parsing of wikipedias.csv.

Summary by CodeRabbit

  • Chores

    • Added CSV processing library to project dependencies
  • Bug Fixes

    • Improved CSV data parsing with enhanced field validation and error handling for invalid entries

@coderabbitai
Copy link

coderabbitai bot commented Nov 2, 2025

📝 Walkthrough

Walkthrough

A Maven dependency on the scala-csv library is added to the project. WikiInfo.scala refactors CSV parsing from naive string splitting to proper CSV reader-based parsing with explicit field validation and language code verification, replacing exceptions with warning logs and None returns for invalid input.

Changes

Cohort / File(s) Summary
Maven Dependency
core/pom.xml
Added scala-csv library dependency (com.github.tototoshi:scala-csv_2.11:1.3.10) to support CSV parsing functionality
CSV Parsing Refactor
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala
Migrated fromLines and fromLine methods from naive string splitting to CSVReader-based parsing; added field count validation (≥15 fields); added language code validation; replaced exception throwing with warning logs and None returns for invalid input; ensured reader resource cleanup in finally block

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Line(s)
    participant Parser as CSVReader
    participant Validator as Field Validator
    participant LangCheck as Language Validator
    participant Output as Result

    Input->>Parser: Parse CSV
    Parser->>Validator: Extract fields
    Validator->>Validator: Check field count ≥ 15
    alt Field count valid
        Validator->>LangCheck: Extract & validate language code
        alt Language code valid
            LangCheck->>Output: Return WikiInfo(Some)
        else Language code invalid
            LangCheck->>Output: Log warning, Return None
        end
    else Field count invalid
        Validator->>Output: Log warning, Return None
    end
    Output-->>Input: Result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • CSV parsing correctness: Verify that CSVReader properly handles edge cases, field extraction, and line joining logic in fromLines
  • Validation logic: Ensure field count (≥15) and language code validation are consistent and correct
  • Error handling strategy: Confirm that warning logs and None returns are appropriate fallback behavior instead of exceptions
  • Resource management: Verify CSVReader is properly closed via finally block in fromLine
  • Dependency compatibility: Confirm scala-csv 1.3.10 is compatible with the Scala 2.11 version used in the project

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ❓ Inconclusive The pull request title "Patch WikiInfo.scala" is vague and uses non-descriptive language that fails to convey the actual nature of the changes. While the title correctly identifies WikiInfo.scala as the modified file, it employs the generic term "Patch" without explaining what the patch accomplishes. According to the PR objectives, the main goal is to update the CSV parsing methods to enable proper parsing of wikipedias.csv, which is a meaningful change that should be reflected in the title. The current title does not communicate this purpose to reviewers scanning the repository history. Consider revising the title to be more specific and descriptive, such as "Add CSV parsing support to WikiInfo.scala" or "Implement proper CSV parsing for wikipedias.csv in WikiInfo.scala". This would clearly communicate the main objective of the pull request and help reviewers understand the changeset at a glance without needing to read the detailed description.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Nov 2, 2025

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (2)

63-63: Use a specific exception type instead of generic Exception.

Throwing a generic Exception is not a best practice. Consider using IllegalStateException or a custom exception type for better error handling.

Apply this diff:

-      if (allRows.isEmpty) throw new Exception("empty file")
+      if (allRows.isEmpty) throw new IllegalStateException("CSV file is empty")

66-66: Consider validating the CSV header.

The code skips the first row assuming it's a header, but doesn't validate that it contains the expected column names. This could lead to silent failures if the CSV format changes.

Consider adding a header validation check:

// After line 63
val header = allRows.head
val expectedColumns = Seq("id", "lang", "wiki", "...", "...", "good", ...)  // adjust as needed
if (!header.startsWith(expectedColumns)) {
  logger.warning(s"Unexpected CSV header format: ${header.take(5).mkString(",")}")
}
core/pom.xml (1)

112-116: Update scala-csv dependency to a newer version.

The scala-csv_2.11:1.3.10 version is outdated; version 2.0.0 is available on Maven Central. No security vulnerabilities were found for this library. Consider updating to the latest version, but be aware that upgrading from 1.3.10 to 2.0.0 may include breaking changes (major version bump)—review the changelog and test thoroughly before upgrading.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cca4b and 3a0d2f4.

📒 Files selected for processing (2)
  • core/pom.xml (1 hunks)
  • core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (1)
core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala (1)
  • ConfigUtils (22-211)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: build
  • GitHub Check: deploy
🔇 Additional comments (3)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (3)

56-61: Consider memory implications of loading entire file into memory.

The current implementation loads all lines into a single string and then reads all rows into a Seq, which could be memory-intensive for large CSV files. While this may be acceptable for the specific wikipedias.csv file, consider whether a streaming approach would be more robust.

If the wikipedias.csv file is expected to remain reasonably sized, this approach is acceptable. However, if the file could grow significantly, consider refactoring to process rows in a streaming manner without loading everything into memory upfront.


67-75: Good implementation of safe parsing and validation.

The code correctly validates row length before accessing fields, safely parses the pages integer with a fallback, and validates the language code. This is a solid improvement over naive string splitting.


86-110: Good implementation with proper validation and error handling.

The fromLine method is well-structured with:

  • Option return type for functional error handling
  • Field count validation with helpful warnings
  • Safe integer parsing with fallback to 0
  • Language code validation
  • Proper resource cleanup

This is a solid improvement over the previous implementation.

Comment on lines +8 to +9
import java.io.StringReader
import java.io.StringReader
Copy link

@coderabbitai coderabbitai bot Nov 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove duplicate import.

StringReader is imported twice on lines 8 and 9. Remove one of the duplicate imports.

Apply this diff to fix the duplicate import:

 import java.io.File
 import java.net.URL
 import java.util.logging.Logger
 import java.io.StringReader
-import java.io.StringReader
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import java.io.StringReader
import java.io.StringReader
import java.io.File
import java.net.URL
import java.util.logging.Logger
import java.io.StringReader
🤖 Prompt for AI Agents
In core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala around lines 8
to 9, there is a duplicate import of java.io.StringReader; remove the redundant
import line so StringReader is imported only once (delete one of the two
identical import statements).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this mean
Failed to execute goal org.sonatype.plugins:nexus-staging-maven-plugin:1.6.8:deploy (injected-nexus-deploy) on project server: Failed to deploy artifacts: Could not transfer artifact org.dbpedia.extraction:core:jar:javadoc:pr-806-20251102.073958-1 from/to github (https://maven.pkg.github.com/dbpedia/extraction-framework): status code: 403, reason phrase: Forbidden (403) -> [Help 1]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🌐 Web query:

GitHub Packages Maven 403 Forbidden error authentication

💡 Result:

Quick checklist to fix GitHub Packages (Maven) 403 Forbidden authentication:

  1. Use a valid token:

    • Use a personal access token (classic) or GITHUB_TOKEN (in Actions). PAT must include read:packages to download and read:packages plus write:packages to publish; private-package access may also need repo. Renew/replace expired tokens. [1][2]
  2. Ensure settings.xml server id matches POM:

    • The in ~/.m2/settings.xml must exactly match the used in your pom.xml repository. Username = your GitHub username (or GITHUB_ACTOR in Actions); password = the token. Run mvn -X to confirm which settings.xml is used. [1][3]
  3. Use the correct repository URL for your GitHub instance:

    • For GitHub.com use https://maven.pkg.github.com/OWNER/REPOSITORY (or the documented URL pattern for enterprise/self-hosted instances). If your instance uses subdomain isolation use the maven.HOSTNAME form. URL must match the repo owner and name. [1][3]
  4. Check token scope vs. package visibility:

    • Public packages may need fewer scopes; private/internal packages require the write/read scopes (and repo for classic tokens depending on registry permissions). [2]
  5. CI / Actions specifics:

    • If using GITHUB_TOKEN in Actions, ensure the workflow has permissions: packages: read/write (or set permissions in workflow). For PATs used in CI, store as secrets and reference in settings.xml or MAVEN_SETTINGS. [2]
  6. Debug tips:

    • Run mvn -X to view auth attempts and which server id/credentials Maven uses.
    • Confirm there are no typos, uppercase owner names (must be lowercase in repository URL), and the PAT isn’t rate-limited or revoked. [3]

If you want, share your relevant settings.xml server block and your pom.xml distributionManagement snippet (redact the token) and I’ll spot the mismatch.

Sources: GitHub Packages Maven docs and community guide. [1] [2] [3]

References
[1] Working with the Apache Maven registry — GitHub Docs. (docs.github.com)
[2] REST API / About permissions for GitHub Packages — GitHub Docs. (docs.github.com)
[3] Practical guide: private GitHub Maven packages (example & settings.xml tips). (joe-mccarthy.github.io)

Citations:


@ghost-2362003, this error means Maven failed to authenticate when trying to deploy your package to GitHub Packages. A 403 Forbidden status indicates an authentication or authorization issue.

Common causes and solutions:

  1. Missing or invalid GitHub token:

    • You need a personal access token (PAT) with write:packages scope to publish packages
    • If the token expired or was revoked, generate a new one at: https://github.com/settings/tokens
    • For CI/CD workflows, ensure GITHUB_TOKEN has proper permissions
  2. Maven settings.xml configuration:

    • The <server><id> in your ~/.m2/settings.xml must exactly match the <id> in your pom.xml <distributionManagement> section
    • Your settings.xml should contain:
      <server>
        <id>github</id>
        <username>YOUR_GITHUB_USERNAME</username>
        <password>YOUR_GITHUB_TOKEN</password>
      </server>
  3. Repository URL format:

    • The URL must follow the pattern https://maven.pkg.github.com/OWNER/REPOSITORY (owner must be lowercase)
    • Verify your pom.xml has the correct repository URL
  4. Debug the issue:

    • Run mvn -X deploy to see which credentials Maven is using and identify mismatches

Since this appears to be happening in a CI/CD pipeline (based on the artifact name with timestamp), make sure your GitHub Actions workflow has the necessary permissions and token configured. Let me know if you need help checking your Maven configuration!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changes can i make from my end for this

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all pom.xml files to check distribution management configuration
fd -e xml pom.xml --exec cat {}

Length of output: 78895


@ghost-2362003, I found the issue! Looking at your parent pom.xml, there's a repository URL mismatch that's causing the 403 error.

The Problem:
In your parent pom.xml (around line 48), the github.repo.url property points to:

<github.repo.url>https://maven.pkg.github.com/DBpedia-Enterprise/extraction-framework</github.repo.url>

But your actual repository is dbpedia/extraction-framework (not DBpedia-Enterprise/extraction-framework).

What you need to change:

Update the github.repo.url property in your parent pom.xml to match your actual repository:

<github.repo.url>https://maven.pkg.github.com/dbpedia/extraction-framework</github.repo.url>

Location to edit:

  • File: pom.xml (parent/root pom.xml)
  • Line: ~48 (in the <properties> section)

This mismatch is causing Maven to try deploying to the wrong repository URL, which results in the 403 Forbidden error because you don't have permissions to deploy to DBpedia-Enterprise/extraction-framework.

After making this change, the deployment should work correctly (assuming you have the proper GitHub token configured in your CI/CD environment or local Maven settings).


🧬 Code Graph Analysis Results

core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala

<possibly_relevant_code_snippets>core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala</possibly_relevant_code_snippets>

Snippet (lines 22-211)

object ConfigUtils {

  /**
    * Simple regex matching Wikipedia language codes.
    * Language codes have at least two characters, start with a lower-case letter and contain only
    * lower-case letters and dash, but there are also dumps for "wikimania2005wiki" etc.
    */
  val LanguageRegex: Regex = """([a-z][a-z0-9-]+)""".r

  /**
    * Regex used for excluding languages from the import.
    */
  val ExcludedLanguageRegex: Regex = """!([a-z][a-z0-9-]+)""".r

  /**
    * Regex for numeric range, both limits optional
    */
  val RangeRegex: Regex = """(\d*)-(\d*)""".r

  //val baseDir = getValue(universalConfig , "base-dir", true){
   // x => new File(x)
      //if (! dir.exists) throw error("dir "+dir+" does not exist")
      //dir
  //}

  def loadConfig(filePath: String, charset: String = "UTF-8"): Properties = {
    val file = new File(filePath)
    loadFromStream(new FileInputStream(file), charset)
  }

  def loadConfig(url: URL): Object = {

    url match {
      case selection =>
        if(selection.getFile.endsWith(".json"))
          loadJsonComfig(url)
        else
          loadFromStream(url.openStream())
    }
  }

  def loadJsonComfig(url: URL): JsonNode ={
    val objectMapper = new ObjectMapper(new JsonFactory())
    val objectReader: ObjectReader = objectMapper.reader()
    val inputStream = url.openStream()
    val res = objectReader.readTree(inputStream)
    inputStream.close()
    res
  }

  private def loadFromStream(file: InputStream, charset: String = "UTF-8"): Properties ={
    val config = new Properties()
    try config.load(new InputStreamReader(file, charset))
    finally file.close()
    config
  }


  def getValues[T](config: Properties, key: String, sep: String, required: Boolean = false)(map: String => T): Seq[T] = {
    getStrings(config, key, sep, required).map(map(_))
  }

  def getStrings(config: Properties, key: String, sep: String, required: Boolean = false): Seq[String] = {
    val string = getString(config, key, required)
    if (string == null) Seq.empty
    else string.trimSplit(sep)
  }

  def getStringMap(config: Properties, key: String, sep: String, required: Boolean = false): Map[String, String] = {
    getStrings(config, key, sep, required).map(x => x.split("->")).map( y => y(0) -> y(1)).toMap
  }

  def getValue[T](config: Properties, key: String, required: Boolean = false)(map: String => T): T = {
    val string = getString(config, key, required)
    if (string == null) null.asInstanceOf[T]
    else map(string)
  }
  
  def getString(config: Properties, key: String, required: Boolean = false): String = {
    val string = config.getProperty(key)
    if (string != null) string
    else if (! required) null
    else throw new IllegalArgumentException("property '"+key+"' not defined.")
  }
  
  /**
   * @param baseDir directory of wikipedia.csv, needed to resolve article count ranges
   * @param args array of space- or comma-separated language codes or article count ranges
   * @return languages, sorted by language code
   */
  def parseLanguages(baseDir: File, args: Seq[String], wikiPostFix: String = "wiki"): Array[Language] = {
    if(!baseDir.exists())
      throw new IllegalArgumentException("Base directory does not exist yet: " + baseDir)
    
    val keys = for(arg <- args; key <- arg.split("[,\\s]"); if key.nonEmpty) yield key
        
    var languages = SortedSet[Language]()
    var excludedLanguages = SortedSet[Language]()
    
    val ranges = new mutable.HashSet[(Int,Int)]
  
    for (key <- keys) key match {
      case "@mappings" => languages ++= Namespace.mappingLanguages
      case "@chapters" => languages ++= Namespace.chapterLanguages
      case "@downloaded" => languages ++= downloadedLanguages(baseDir, wikiPostFix)
      case "@all" => languages ++= Language.map.values
      case "@abstracts" =>
        //@downloaded - Commons & Wikidata
        languages ++= downloadedLanguages(baseDir, wikiPostFix)
        excludedLanguages += Language.Commons
        excludedLanguages += Language.Wikidata
      case RangeRegex(from, to) => ranges += toRange(from, to)
      case LanguageRegex(language) => languages += Language(language)
      case ExcludedLanguageRegex(language) => excludedLanguages += Language(language)
      case other => throw new IllegalArgumentException("Invalid language / range '"+other+"'")
    }
    
    // resolve page count ranges to languages
    if (ranges.nonEmpty)
    {
      // Note: the file is in ASCII, any non-ASCII chars are XML-encoded like '&#231;'. 
      // There is no Codec.ASCII, but UTF-8 also works for ASCII. Luckily we don't use 
      // these non-ASCII chars anyway, so we don't have to unescape them.
      
      // for all wikis in one of the desired ranges...
      languages ++= (for ((from, to) <- ranges; lang <- Language.map.values; if from <= lang.pages && lang.pages <= to) yield lang)
    }

    languages --= excludedLanguages
    languages.toArray
  }

  private def downloadedLanguages(baseDir: File, wikiPostFix: String = "wiki"): Array[Language] = {
    (for (file <- baseDir.listFiles().filter(x => x.isDirectory)) yield
      Language.get(file.getName.replaceAll(wikiPostFix + "$", "").replace("_", "-")) match{
        case Some(l) => l
        case None => null
      }).filter(x => x != null)
  }

  def toRange(from: String, to: String): (Int, Int) = {
    val lo: Int = if (from.isEmpty) 0 else from.toInt
    val hi: Int = if (to.isEmpty) Int.MaxValue else to.toInt
    if (lo > hi) throw new NumberFormatException
    (lo, hi)
  }

  def parseVersionString(str: String): Try[String] =Try {
    Option(str) match {
      case Some(v) => "2\\d{3}-\\d{2}".r.findFirstMatchIn(v.trim) match {
        case Some(y) => if (y.end == 7) v.trim else throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
        case None => throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
      }
      case None => throw new IllegalArgumentException("No version string was provided.")
    }
  }

  /**
    * This function was extracted from the ImageExtractor object, since
    *  the free & nonfree images are now extracted before starting the extraction jobs
    * @param source pages_articles of a given language
    * @param wikiCode the wikicode of a given language
    * @return two lists: ._1: list of free images, ._2: list of nonfree images
    */
  def loadImages(source: Source, wikiCode: String, extractionRecorder: ExtractionRecorder[WikiPage] = null): (Seq[String], Seq[String]) =
  {
    val freeImages = new mutable.HashSet[String]()
    val nonFreeImages = new mutable.HashSet[String]()

    for(page <- source if page.title.namespace == Namespace.File;
        ImageExtractorConfig.ImageLinkRegex() <- List(page.title.encoded) )
    {
      if(extractionRecorder != null) {
        val records = page.getExtractionRecords() match {
          case seq: Seq[RecordEntry[WikiPage]] if seq.nonEmpty => seq
          case _ => Seq(new RecordEntry[WikiPage](page, page.uri, RecordSeverity.Info, page.title.language))
        }
        //forward all records to the recorder
        extractionRecorder.record(records:_*)
      }
      ImageExtractorConfig.NonFreeRegex(wikiCode).findFirstIn(page.source) match
      {
        case Some(_) => nonFreeImages += page.title.encoded
        case None => if (freeImages != null) freeImages += page.title.encoded
      }
    }

    (freeImages.toSeq, nonFreeImages.toSeq)
  }
}

</possibly_relevant_code_snippets>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried CodeRabbit’s suggestion to update the repository URL, but the deployment still fails with a 403 error. It looks like this might be a permissions issue affecting contributor PRs

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to jfery and he said that it needs to be fixed form his end.

@haniyakonain
Copy link
Contributor

Hi @ghost-2362003! Could you confirm how you tested these parsing changes?

@ghost-2362003
Copy link
Contributor Author

I tested these by running the entire framework using redeploy-server script.
Earlier it used to gave errors due to parsing of wikipedias.csv
Now it does not do so. The framework compiles fine

@haniyakonain
Copy link
Contributor

Could you share the exact error message/stack trace that was occurring before your fix when you ran the redeploy-server script?

The current test failures I'm seeing are:

  • NonIsoLanguagesMappingTest - failing due to Wikipedia API user-agent issue
  • BooleanParserTest - failing due to Language$ class initialization

These appear unrelated to your WikiInfo CSV parsing changes. I want to see what the original CSV parsing error looked like to confirm your fix addresses the right issue.

@ghost-2362003
Copy link
Contributor Author

Unfortunately I did not take any screenshots
I only pasted the error in the prompt to understand and fix the error

@haniyakonain
Copy link
Contributor

Maybe you can try rebasing with the latest master and run the redeploy-server script again you might get the same error message. If it appears, please paste it here so we can confirm it matches the original CSV parsing issue.

@ghost-2362003
Copy link
Contributor Author

Well I was not able to reproduce the exact error, however i did get this by resetting a dummy branch to a relatively old commit, about 2 months old. Would this be of any help ?
Screenshot 2025-11-11 214452

@haniyakonain
Copy link
Contributor

Thanks! The screenshot shows an XML dump parsing error (corrupted/incomplete dump), not the CSV issue so it doesn’t confirm the original problem.

@ghost-2362003
Copy link
Contributor Author

@haniyakonain Do you know how to get past the error of snapshot deploy ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants