Patch WikiInfo.scala #806
Conversation
📝 Walkthrough

A Maven dependency on the scala-csv library is added to the project. WikiInfo.scala refactors CSV parsing from naive string splitting to proper CSV reader-based parsing, with explicit field validation and language-code verification, replacing exceptions with warning logs and None returns for invalid input.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Input as Input Line(s)
    participant Parser as CSVReader
    participant Validator as Field Validator
    participant LangCheck as Language Validator
    participant Output as Result
    Input->>Parser: Parse CSV
    Parser->>Validator: Extract fields
    Validator->>Validator: Check field count ≥ 15
    alt Field count valid
        Validator->>LangCheck: Extract & validate language code
        alt Language code valid
            LangCheck->>Output: Return WikiInfo(Some)
        else Language code invalid
            LangCheck->>Output: Log warning, Return None
        end
    else Field count invalid
        Validator->>Output: Log warning, Return None
    end
    Output-->>Input: Result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)

✨ Finishing touches: 🧪 Generate unit tests (beta)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (3)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (2)
63-63: Use a specific exception type instead of generic Exception.

Throwing a generic Exception is not a best practice. Consider using IllegalStateException or a custom exception type for better error handling.

Apply this diff:

```diff
- if (allRows.isEmpty) throw new Exception("empty file")
+ if (allRows.isEmpty) throw new IllegalStateException("CSV file is empty")
```
66-66: Consider validating the CSV header.

The code skips the first row assuming it's a header, but doesn't validate that it contains the expected column names. This could lead to silent failures if the CSV format changes.

Consider adding a header validation check:

```scala
// After line 63
val header = allRows.head
val expectedColumns = Seq("id", "lang", "wiki", "...", "...", "good", ...) // adjust as needed
if (!header.startsWith(expectedColumns)) {
  logger.warning(s"Unexpected CSV header format: ${header.take(5).mkString(",")}")
}
```

core/pom.xml (1)
112-116: Update the scala-csv dependency to a newer version.

The scala-csv_2.11:1.3.10 version is outdated; version 2.0.0 is available on Maven Central. No security vulnerabilities were found for this library. Consider updating to the latest version, but be aware that upgrading from 1.3.10 to 2.0.0 is a major version bump and may include breaking changes; review the changelog and test thoroughly before upgrading.
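As a side note on why a dedicated CSV parser is worth the dependency: naive `split(",")` miscounts fields whenever a value contains an embedded comma. A minimal, self-contained illustration in plain Scala (the object name and the sample row are hypothetical; the actual patch uses scala-csv's reader rather than split):

```scala
object NaiveSplitDemo {
  def main(args: Array[String]): Unit = {
    // A CSV row whose second field is quoted and contains a comma.
    val row = "en,\"English, simple\",12345"

    // Naive splitting ignores the quotes and breaks the row into 4 pieces.
    val naive = row.split(",")
    println(naive.length)   // prints 4, although the row has 3 logical fields
    println(naive(1))       // prints "English  (only half of the quoted field)

    // A quote-aware CSV reader would instead yield:
    //   Seq("en", "English, simple", "12345")
  }
}
```

This is exactly the failure mode that a field-count check alone cannot fully guard against, which is why the refactor swaps the split for a real CSV reader.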
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
core/pom.xml (1 hunks)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (1)
core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala (1)
ConfigUtils(22-211)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build
- GitHub Check: build
- GitHub Check: build
- GitHub Check: deploy
🔇 Additional comments (3)
core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala (3)
56-61: Consider memory implications of loading the entire file into memory.

The current implementation loads all lines into a single string and then reads all rows into a Seq, which could be memory-intensive for large CSV files. While this may be acceptable for the specific wikipedias.csv file, consider whether a streaming approach would be more robust.

If the wikipedias.csv file is expected to remain reasonably sized, this approach is acceptable. However, if the file could grow significantly, consider refactoring to process rows in a streaming manner without loading everything into memory upfront.
67-75: Good implementation of safe parsing and validation.

The code correctly validates row length before accessing fields, safely parses the pages integer with a fallback, and validates the language code. This is a solid improvement over naive string splitting.
86-110: Good implementation with proper validation and error handling.

The fromLine method is well-structured with:
- Option return type for functional error handling
- Field count validation with helpful warnings
- Safe integer parsing with fallback to 0
- Language code validation
- Proper resource cleanup

This is a solid improvement over the previous implementation.
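To make the reviewed pattern concrete, the shape described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual WikiInfo code: the WikiInfoSketch type, the field indices (2 for language, 5 for pages), and the pre-split Seq input are illustrative assumptions, and the language regex mirrors the ConfigUtils.LanguageRegex quoted elsewhere in this review.

```scala
import java.util.logging.Logger
import scala.util.Try

// Hypothetical simplified model; the real WikiInfo carries more fields
// and parses raw lines with a CSV reader rather than taking a pre-split Seq.
case class WikiInfoSketch(language: String, pages: Int)

object WikiInfoSketch {
  private val logger = Logger.getLogger(getClass.getName)

  // Same shape as ConfigUtils.LanguageRegex: at least two lower-case chars.
  private val LanguageRegex = """([a-z][a-z0-9-]+)""".r

  def fromFields(fields: Seq[String]): Option[WikiInfoSketch] = {
    // Field-count validation: warn and return None instead of throwing.
    if (fields.length < 15) {
      logger.warning(s"expected at least 15 fields, got ${fields.length}")
      return None
    }
    // Safe integer parsing with a fallback of 0.
    val pages = Try(fields(5).trim.toInt).getOrElse(0)
    // Language-code validation: reject codes the regex doesn't fully match.
    fields(2).trim match {
      case LanguageRegex(lang) => Some(WikiInfoSketch(lang, pages))
      case other =>
        logger.warning(s"invalid language code: '$other'")
        None
    }
  }
}
```

Callers can then flatMap over parsed lines, so a single malformed row produces a warning instead of aborting the whole wikipedias.csv import.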
```scala
import java.io.StringReader
import java.io.StringReader
```
Remove duplicate import.

StringReader is imported twice on lines 8 and 9. Remove one of the duplicate imports.

Apply this diff to fix the duplicate import:

```diff
 import java.io.File
 import java.net.URL
 import java.util.logging.Logger
 import java.io.StringReader
-import java.io.StringReader
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```scala
import java.io.File
import java.net.URL
import java.util.logging.Logger
import java.io.StringReader
```
🤖 Prompt for AI Agents

In core/src/main/scala/org/dbpedia/extraction/util/WikiInfo.scala around lines 8 to 9, there is a duplicate import of java.io.StringReader; remove the redundant import line so StringReader is imported only once (delete one of the two identical import statements).
what does this mean
Failed to execute goal org.sonatype.plugins:nexus-staging-maven-plugin:1.6.8:deploy (injected-nexus-deploy) on project server: Failed to deploy artifacts: Could not transfer artifact org.dbpedia.extraction:core:jar:javadoc:pr-806-20251102.073958-1 from/to github (https://maven.pkg.github.com/dbpedia/extraction-framework): status code: 403, reason phrase: Forbidden (403) -> [Help 1]
🧩 Analysis chain
🌐 Web query:
GitHub Packages Maven 403 Forbidden error authentication
💡 Result:
Quick checklist to fix GitHub Packages (Maven) 403 Forbidden authentication:

- Use a valid token: use a personal access token (classic) or GITHUB_TOKEN (in Actions). The PAT must include read:packages to download, plus write:packages to publish; private-package access may also need repo. Renew or replace expired tokens. [1][2]
- Ensure the settings.xml server id matches the POM: the server id in ~/.m2/settings.xml must exactly match the repository id used in your pom.xml. Username = your GitHub username (or GITHUB_ACTOR in Actions); password = the token. Run mvn -X to confirm which settings.xml is used. [1][3]
- Use the correct repository URL for your GitHub instance: for GitHub.com use https://maven.pkg.github.com/OWNER/REPOSITORY (or the documented URL pattern for enterprise/self-hosted instances). If your instance uses subdomain isolation, use the maven.HOSTNAME form. The URL must match the repo owner and name. [1][3]
- Check token scope vs. package visibility: public packages may need fewer scopes; private/internal packages require the read/write scopes (and repo for classic tokens, depending on registry permissions). [2]
- CI / Actions specifics: if using GITHUB_TOKEN in Actions, ensure the workflow has packages: read/write permissions (or set permissions in the workflow). For PATs used in CI, store them as secrets and reference them in settings.xml or MAVEN_SETTINGS. [2]
- Debug tips: run mvn -X to view auth attempts and which server id/credentials Maven uses. Confirm there are no typos, no uppercase owner names (the repository URL must be lowercase), and that the PAT isn't rate-limited or revoked. [3]

If you want, share your relevant settings.xml server block and your pom.xml distributionManagement snippet (redact the token) and I'll spot the mismatch.

Sources: GitHub Packages Maven docs and community guide. [1] [2] [3]
References
[1] Working with the Apache Maven registry — GitHub Docs. (docs.github.com)
[2] REST API / About permissions for GitHub Packages — GitHub Docs. (docs.github.com)
[3] Practical guide: private GitHub Maven packages (example & settings.xml tips). (joe-mccarthy.github.io)
Citations:
- 1: https://docs.github.com/en/enterprise-server%403.15/packages/working-with-a-github-packages-registry/working-with-the-apache-maven-registry?utm_source=openai
- 2: https://docs.github.com/rest/reference/packages?utm_source=openai
- 3: https://joe-mccarthy.github.io/private-github-maven-packages/?utm_source=openai
@ghost-2362003, this error means Maven failed to authenticate when trying to deploy your package to GitHub Packages. A 403 Forbidden status indicates an authentication or authorization issue.
Common causes and solutions:
- Missing or invalid GitHub token:
  - You need a personal access token (PAT) with write:packages scope to publish packages.
  - If the token expired or was revoked, generate a new one at: https://github.com/settings/tokens
  - For CI/CD workflows, ensure GITHUB_TOKEN has proper permissions.

- Maven settings.xml configuration:
  - The `<server><id>` in your ~/.m2/settings.xml must exactly match the `<id>` in your pom.xml `<distributionManagement>` section.
  - Your settings.xml should contain:

    ```xml
    <server>
      <id>github</id>
      <username>YOUR_GITHUB_USERNAME</username>
      <password>YOUR_GITHUB_TOKEN</password>
    </server>
    ```

- Repository URL format:
  - The URL must follow the pattern https://maven.pkg.github.com/OWNER/REPOSITORY (owner must be lowercase).
  - Verify your pom.xml has the correct repository URL.

- Debug the issue:
  - Run mvn -X deploy to see which credentials Maven is using and identify mismatches.
Since this appears to be happening in a CI/CD pipeline (based on the artifact name with timestamp), make sure your GitHub Actions workflow has the necessary permissions and token configured. Let me know if you need help checking your Maven configuration!
what changes can i make from my end for this
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Find all pom.xml files to check distribution management configuration
fd -e xml pom.xml --exec cat {}
```

Length of output: 78895
@ghost-2362003, I found the issue! Looking at your parent pom.xml, there's a repository URL mismatch that's causing the 403 error.
The Problem:
In your parent pom.xml (around line 48), the github.repo.url property points to:

```xml
<github.repo.url>https://maven.pkg.github.com/DBpedia-Enterprise/extraction-framework</github.repo.url>
```

But your actual repository is dbpedia/extraction-framework (not DBpedia-Enterprise/extraction-framework).

What you need to change:

Update the github.repo.url property in your parent pom.xml to match your actual repository:

```xml
<github.repo.url>https://maven.pkg.github.com/dbpedia/extraction-framework</github.repo.url>
```

Location to edit:
- File: pom.xml (the parent/root pom.xml)
- Line: ~48 (in the `<properties>` section)
This mismatch is causing Maven to try deploying to the wrong repository URL, which results in the 403 Forbidden error because you don't have permissions to deploy to DBpedia-Enterprise/extraction-framework.
After making this change, the deployment should work correctly (assuming you have the proper GitHub token configured in your CI/CD environment or local Maven settings).
🧬 Code Graph Analysis Results
core/src/main/scala/org/dbpedia/extraction/config/ConfigUtils.scala
Snippet (lines 22-211):

```scala
object ConfigUtils {

  /**
   * Simple regex matching Wikipedia language codes.
   * Language codes have at least two characters, start with a lower-case letter and contain only
   * lower-case letters and dash, but there are also dumps for "wikimania2005wiki" etc.
   */
  val LanguageRegex: Regex = """([a-z][a-z0-9-]+)""".r

  /**
   * Regex used for excluding languages from the import.
   */
  val ExcludedLanguageRegex: Regex = """!([a-z][a-z0-9-]+)""".r

  /**
   * Regex for numeric range, both limits optional
   */
  val RangeRegex: Regex = """(\d*)-(\d*)""".r

  //val baseDir = getValue(universalConfig , "base-dir", true){
  //  x => new File(x)
  //  if (! dir.exists) throw error("dir "+dir+" does not exist")
  //  dir
  //}

  def loadConfig(filePath: String, charset: String = "UTF-8"): Properties = {
    val file = new File(filePath)
    loadFromStream(new FileInputStream(file), charset)
  }

  def loadConfig(url: URL): Object = {
    url match {
      case selection =>
        if (selection.getFile.endsWith(".json"))
          loadJsonComfig(url)
        else
          loadFromStream(url.openStream())
    }
  }

  def loadJsonComfig(url: URL): JsonNode = {
    val objectMapper = new ObjectMapper(new JsonFactory())
    val objectReader: ObjectReader = objectMapper.reader()
    val inputStream = url.openStream()
    val res = objectReader.readTree(inputStream)
    inputStream.close()
    res
  }

  private def loadFromStream(file: InputStream, charset: String = "UTF-8"): Properties = {
    val config = new Properties()
    try config.load(new InputStreamReader(file, charset))
    finally file.close()
    config
  }

  def getValues[T](config: Properties, key: String, sep: String, required: Boolean = false)(map: String => T): Seq[T] = {
    getStrings(config, key, sep, required).map(map(_))
  }

  def getStrings(config: Properties, key: String, sep: String, required: Boolean = false): Seq[String] = {
    val string = getString(config, key, required)
    if (string == null) Seq.empty
    else string.trimSplit(sep)
  }

  def getStringMap(config: Properties, key: String, sep: String, required: Boolean = false): Map[String, String] = {
    getStrings(config, key, sep, required).map(x => x.split("->")).map(y => y(0) -> y(1)).toMap
  }

  def getValue[T](config: Properties, key: String, required: Boolean = false)(map: String => T): T = {
    val string = getString(config, key, required)
    if (string == null) null.asInstanceOf[T]
    else map(string)
  }

  def getString(config: Properties, key: String, required: Boolean = false): String = {
    val string = config.getProperty(key)
    if (string != null) string
    else if (! required) null
    else throw new IllegalArgumentException("property '"+key+"' not defined.")
  }

  /**
   * @param baseDir directory of wikipedia.csv, needed to resolve article count ranges
   * @param args array of space- or comma-separated language codes or article count ranges
   * @return languages, sorted by language code
   */
  def parseLanguages(baseDir: File, args: Seq[String], wikiPostFix: String = "wiki"): Array[Language] = {
    if (!baseDir.exists())
      throw new IllegalArgumentException("Base directory does not exist yet: " + baseDir)

    val keys = for (arg <- args; key <- arg.split("[,\\s]"); if key.nonEmpty) yield key

    var languages = SortedSet[Language]()
    var excludedLanguages = SortedSet[Language]()
    val ranges = new mutable.HashSet[(Int,Int)]

    for (key <- keys) key match {
      case "@mappings" => languages ++= Namespace.mappingLanguages
      case "@chapters" => languages ++= Namespace.chapterLanguages
      case "@downloaded" => languages ++= downloadedLanguages(baseDir, wikiPostFix)
      case "@all" => languages ++= Language.map.values
      case "@abstracts" =>
        //@downloaded - Commons & Wikidata
        languages ++= downloadedLanguages(baseDir, wikiPostFix)
        excludedLanguages += Language.Commons
        excludedLanguages += Language.Wikidata
      case RangeRegex(from, to) => ranges += toRange(from, to)
      case LanguageRegex(language) => languages += Language(language)
      case ExcludedLanguageRegex(language) => excludedLanguages += Language(language)
      case other => throw new IllegalArgumentException("Invalid language / range '"+other+"'")
    }

    // resolve page count ranges to languages
    if (ranges.nonEmpty)
    {
      // Note: the file is in ASCII, any non-ASCII chars are XML-encoded like 'ç'.
      // There is no Codec.ASCII, but UTF-8 also works for ASCII. Luckily we don't use
      // these non-ASCII chars anyway, so we don't have to unescape them.

      // for all wikis in one of the desired ranges...
      languages ++= (for ((from, to) <- ranges; lang <- Language.map.values; if from <= lang.pages && lang.pages <= to) yield lang)
    }

    languages --= excludedLanguages
    languages.toArray
  }

  private def downloadedLanguages(baseDir: File, wikiPostFix: String = "wiki"): Array[Language] = {
    (for (file <- baseDir.listFiles().filter(x => x.isDirectory)) yield
      Language.get(file.getName.replaceAll(wikiPostFix + "$", "").replace("_", "-")) match {
        case Some(l) => l
        case None => null
      }).filter(x => x != null)
  }

  def toRange(from: String, to: String): (Int, Int) = {
    val lo: Int = if (from.isEmpty) 0 else from.toInt
    val hi: Int = if (to.isEmpty) Int.MaxValue else to.toInt
    if (lo > hi) throw new NumberFormatException
    (lo, hi)
  }

  def parseVersionString(str: String): Try[String] = Try {
    Option(str) match {
      case Some(v) => "2\\d{3}-\\d{2}".r.findFirstMatchIn(v.trim) match {
        case Some(y) => if (y.end == 7) v.trim else throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
        case None => throw new IllegalArgumentException("Provided version string did not match 2\\d{3}-\\d{2}")
      }
      case None => throw new IllegalArgumentException("No version string was provided.")
    }
  }

  /**
   * This function was extracted from the ImageExtractor object, since
   * the free & nonfree images are now extracted before starting the extraction jobs
   * @param source pages_articles of a given language
   * @param wikiCode the wikicode of a given language
   * @return two lists: ._1: list of free images, ._2: list of nonfree images
   */
  def loadImages(source: Source, wikiCode: String, extractionRecorder: ExtractionRecorder[WikiPage] = null): (Seq[String], Seq[String]) =
  {
    val freeImages = new mutable.HashSet[String]()
    val nonFreeImages = new mutable.HashSet[String]()

    for (page <- source if page.title.namespace == Namespace.File;
         ImageExtractorConfig.ImageLinkRegex() <- List(page.title.encoded))
    {
      if (extractionRecorder != null) {
        val records = page.getExtractionRecords() match {
          case seq: Seq[RecordEntry[WikiPage]] if seq.nonEmpty => seq
          case _ => Seq(new RecordEntry[WikiPage](page, page.uri, RecordSeverity.Info, page.title.language))
        }
        //forward all records to the recorder
        extractionRecorder.record(records:_*)
      }

      ImageExtractorConfig.NonFreeRegex(wikiCode).findFirstIn(page.source) match
      {
        case Some(_) => nonFreeImages += page.title.encoded
        case None => if (freeImages != null) freeImages += page.title.encoded
      }
    }

    (freeImages.toSeq, nonFreeImages.toSeq)
  }
}
```
We tried CodeRabbit's suggestion to update the repository URL, but the deployment still fails with a 403 error. It looks like this might be a permissions issue affecting contributor PRs.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
I talked to jfery and he said that it needs to be fixed from his end.
Hi @ghost-2362003! Could you confirm how you tested these parsing changes?

I tested these by running the entire framework using

Could you share the exact error message/stack trace that was occurring before your fix when you ran the redeploy-server script? The current test failures I'm seeing are:

These appear unrelated to your WikiInfo CSV parsing changes. I want to see what the original CSV parsing error looked like to confirm your fix addresses the right issue.

Unfortunately I did not take any screenshots.

Maybe you can try rebasing with the latest master and run the redeploy-server script again; you might get the same error message. If it appears, please paste it here so we can confirm it matches the original CSV parsing issue.

Thanks! The screenshot shows an XML dump parsing error (corrupted/incomplete dump), not the CSV issue, so it doesn't confirm the original problem.

@haniyakonain Do you know how to get past the error of snapshot deploy?
Updated the fromLines() and fromLine() methods of WikiInfo.scala for proper parsing of wikipedias.csv.

Summary by CodeRabbit

- Chores
- Bug Fixes