
feat(ontology): add rdfs:comment to infobox property definitions#840

Open
anshuman9468 wants to merge 8 commits into dbpedia:master from anshuman9468:feature/ontology-consistency-updates

Conversation

@anshuman9468 (Contributor) commented Apr 16, 2026

This PR adds rdfs:comment to the properties extracted from Wikipedia infoboxes in the InfoboxExtractor. This provides descriptive metadata about the original raw property, derived from the Wikipedia infobox key. This change addresses the requirement to include more information in the infobox-property-definitions dataset so it can be viewed as HTML in the store.
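For illustration, the emitted statement could look like the following (the property name is hypothetical; the comment string follows the "Raw property extracted from Wikipedia infobox: " prefix used by the extractor):

```turtle
<http://dbpedia.org/property/birthPlace>
    <http://www.w3.org/2000/01/rdf-schema#comment>
    "Raw property extracted from Wikipedia infobox: birth_place"@en .
```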

Summary by CodeRabbit

  • New Features

    • Extraction and semantic tagging of citation links, including linking citation notes to external targets
    • Automatic comment generation for infobox property definitions
  • Bug Fixes

    • Safer disambiguation title handling with a fallback when mappings are missing
    • Improved robustness when files, dates or optional mapping inputs are missing
    • Removed an interfering reference selector to improve reference extraction quality

@coderabbitai bot commented Apr 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d6c83d23-394e-421c-ad52-8f738d1e15ee

📥 Commits

Reviewing files that changed from the base of the PR and between f49805a and 9fc7dfd.

📒 Files selected for processing (5)
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties
  • dump/src/test/resources/extraction-configs/generic-spark.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/mappings.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/wikidata.extraction.properties
✅ Files skipped from review due to trivial changes (4)
  • dump/src/test/resources/extraction-configs/wikidata.extraction.properties
  • dump/src/test/resources/extraction-configs/mappings.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties

📝 Walkthrough

Walkthrough

Adds citation tracking and emission: Link and LinkExtractor now identify in-page citation anchors; HtmlNifExtractor collects citation URLs and emits citation-specific RDF; NIF config removes sup.reference; InfoboxExtractor adds rdfs:comment; various utilities and scripts tolerate missing files/dates; CI deploy adds -DskipNexusStagingDeployMojo=true.

Changes

Cohort / File(s) Summary
Citation model
core/src/main/java/org/dbpedia/extraction/nif/Link.java
Added citation boolean and citationId string with getters/setters; minor formatting tweaks.
Citation detection
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
Detect href starting with #cite_note-, mark Link as citation and set citationId; small control-flow/formatting normalizations.
Citation extraction & RDF emission
core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
Added in-memory citationMap from li#cite_note-* → external a.external.text href; makeWordsFromLinks emits dbpedia:Citation and uses citationMap for its:taIdentRef when link.isCitation.
NIF config
core/src/main/resources/nifextractionconfig.json
Removed selector sup.reference from nif-remove-elements.
Infobox metadata
core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
Emit additional rdfs:comment quad when a property URI is first seen.
Date/finder resilience
core/src/main/scala/org/dbpedia/extraction/util/DateFinder.scala, core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
byName/byPattern gain required param and avoid .last on empty results; Finder.dates now deduplicates with .distinct.
Script robustness
scripts/src/main/scala/org/dbpedia/extraction/scripts/QuadReader.scala, .../MapObjectUris.scala, .../ResolveTransitiveLinks.scala, .../TypeConsistencyCheck.scala
readQuads/read-multiple accept required flag; callers now handle missing files/dates gracefully, skip processing and log instead of failing; safer Option handling for input/output.
Disambiguation fallback
core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
Use getOrElse for disambiguation title mapping to avoid key lookup exceptions.
CI deploy
.github/workflows/snapshot_deploy.yml
Maven mvn deploy invocation adds -DskipNexusStagingDeployMojo=true.
Test configs
dump/src/test/resources/extraction-configs/*
Added compare-dataset-ids=false to multiple test extraction config files.
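The citation-detection rule summarized above (an href beginning with #cite_note- marks the Link as a citation and sets its citationId) can be sketched as a standalone snippet; the Link class here is a simplified stand-in for illustration, not the framework's actual class:

```java
// Minimal sketch of the detection rule: an in-page anchor whose href starts
// with "#cite_note-" is flagged as a citation and keyed by its note id.
// The Link class below is a simplified stand-in for illustration only.
public class CitationDetectDemo {
    static final class Link {
        boolean citation;
        String citationId;
    }

    static void markIfCitation(Link link, String href) {
        if (href != null && href.startsWith("#cite_note-")) {
            link.citation = true;
            link.citationId = href.substring(1); // drop '#': key for citationMap lookups
        }
    }

    public static void main(String[] args) {
        Link note = new Link();
        markIfCitation(note, "#cite_note-7");
        if (!note.citation || !"cite_note-7".equals(note.citationId))
            throw new AssertionError("citation anchor not detected");

        Link plain = new Link();
        markIfCitation(plain, "/wiki/DBpedia");
        if (plain.citation)
            throw new AssertionError("ordinary link wrongly flagged");

        System.out.println("ok");
    }
}
```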

Sequence Diagram(s)

sequenceDiagram
    participant HTML as HTML Document
    participant HtmlExt as HtmlNifExtractor
    participant LinkExt as LinkExtractor
    participant CitMap as CitationMap
    participant RDF as RDF Emitter

    HTML->>HtmlExt: parse document
    HtmlExt->>HtmlExt: extract li elements id="cite_note-*" and store id→external href in CitMap
    HtmlExt->>LinkExt: traverse <a> nodes
    LinkExt->>LinkExt: detect href starting "#cite_note-*" → set citation=true, citationId
    LinkExt-->>HtmlExt: return Link objects (with citation flag/id)
    HtmlExt->>CitMap: lookup citationId → URL (if available)
    HtmlExt->>RDF: emit dbpedia:Citation triple for word
    HtmlExt->>RDF: emit its:taIdentRef with citation URL (if found)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(ontology): add rdfs:comment to infobox property definitions' accurately reflects the primary functional change in the PR: adding rdfs:comment annotations to infobox property definitions in InfoboxExtractor.scala, though the changeset also includes several broader refactoring efforts related to citation handling and optional file resolution.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment
Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java`:
- Around line 274-279: The loop in getTableCount() incorrectly references the
field/variable "paragraph" instead of the loop variable "p", which can produce
wrong counts or NPEs; update the body to use p.getHtmlStrings().size() and add a
null-safe check (skip if p is null or if p.getHtmlStrings() is null) so
getTableCount() iterates over this.getParagraphs() using the loop variable "p"
and sums sizes safely.

In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`:
- Line 25: The current line initializes replaceString by silently falling back
to the English literal " (disambiguation)"; instead, change initialization in
DisambiguationExtractor (the replaceString assignment referencing
DisambiguationExtractorConfig.disambiguationTitlePartMap and language.wikiCode)
to fail-fast or emit an explicit warning/metric when the map has no entry for
language.wikiCode—e.g., check
DisambiguationExtractorConfig.disambiguationTitlePartMap.contains(language.wikiCode)
and if missing either throw a clear exception (fail-fast) or
log/processLogger.warn and increment a metric before using a fallback; ensure
the message includes language.wikiCode so missing configs are discoverable.

In `@core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala`:
- Around line 169-170: Normalize property.key before including it in the
InfoboxPropertyDefinitions comment: replace newline and tab characters and trim
the string the same way as used elsewhere (e.g., .replace("\n", "
").replace("\t", " ").trim) when building propertyComment in
InfoboxExtractor.scala so that the value passed to new Quad(...,
DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty,
propertyComment, splitNode.sourceIri, rdfLangStrDt) is sanitized and consistent
with the normalization used for statistics extraction.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala`:
- Around line 42-43: The extractor's mutable field citationMap in
HtmlNifExtractor is reused across documents causing cross-page contamination;
reset or clear citationMap at the start of each document processing run (e.g.,
in the method that begins processing a page / processDocument / extract or the
function around the code handling each page before collecting cite_note-* links)
so the map is empty for each page; ensure any other per-document state
initialized in lines referenced (358-365) is also reinitialized or cleared to
avoid stale URL lookups.
- Around line 291-293: The citation branch is emitting its:taIdentRef using the
raw citation URL (citationMap.get(link.getCitationId)) without validating it;
update the code in HtmlNifExtractor.scala so that the citation path uses
UriUtils.createURI() (the same validation used in the non-citation link branch)
to parse/validate the citation URL before calling nifLinks (words +=
nifLinks(...)), and if UriUtils.createURI() fails, skip emitting the
its:taIdentRef (and optionally log the invalid citation URL) rather than writing
an invalid IRI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 356bed8d-b522-4f7b-b5be-51fc33ae5f53

📥 Commits

Reviewing files that changed from the base of the PR and between e3dfe8b and 38d6293.

📒 Files selected for processing (7)
  • .github/workflows/snapshot_deploy.yml
  • core/src/main/java/org/dbpedia/extraction/nif/Link.java
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
  • core/src/main/resources/nifextractionconfig.json
  • core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
  • core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
  • core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
  • core/src/main/resources/nifextractionconfig.json

Comment on lines +274 to 279
public int getTableCount() {
    int count = 0;
    for (Paragraph p : this.getParagraphs()) {
        count += paragraph.getHtmlStrings().size();
    }
    return count;

⚠️ Potential issue | 🔴 Critical

Use the loop variable in getTableCount() (current code can NPE).

Line 277 uses paragraph instead of p. This can produce wrong counts and fail at runtime when paragraph is null.

💡 Proposed fix
 public int getTableCount() {
 	int count = 0;
 	for (Paragraph p : this.getParagraphs()) {
-		count += paragraph.getHtmlStrings().size();
+		count += p.getHtmlStrings().size();
 	}
 	return count;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 274 - 279, The loop in getTableCount() incorrectly references the
field/variable "paragraph" instead of the loop variable "p", which can produce
wrong counts or NPEs; update the body to use p.getHtmlStrings().size() and add a
null-safe check (skip if p is null or if p.getHtmlStrings() is null) so
getTableCount() iterates over this.getParagraphs() using the loop variable "p"
and sums sizes safely.
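Beyond the one-character fix in the diff, the prompt above also asks for a null-safe variant. A sketch under those assumptions (Paragraph here is a minimal stand-in for the framework's class, and the assumption that getParagraphs() or getHtmlStrings() may return null comes from the prompt, not the source):

```java
// Null-safe sketch of getTableCount(): uses the loop variable "p" and skips
// null paragraphs or null HTML-string lists. Paragraph is a simplified
// stand-in for illustration only.
import java.util.Arrays;
import java.util.List;

public class TableCountDemo {
    static final class Paragraph {
        final List<String> htmlStrings;
        Paragraph(List<String> htmlStrings) { this.htmlStrings = htmlStrings; }
        List<String> getHtmlStrings() { return htmlStrings; }
    }

    static int getTableCount(List<Paragraph> paragraphs) {
        if (paragraphs == null) return 0;
        int count = 0;
        for (Paragraph p : paragraphs) {
            if (p == null || p.getHtmlStrings() == null) continue; // null-safe skip
            count += p.getHtmlStrings().size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<Paragraph> ps = Arrays.asList(
            new Paragraph(Arrays.asList("<table>a</table>", "<table>b</table>")),
            null,                      // skipped: null paragraph
            new Paragraph(null),       // skipped: null html-string list
            new Paragraph(Arrays.asList("<table>c</table>")));
        if (getTableCount(ps) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```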

  private val language = context.language

- private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+ private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")

⚠️ Potential issue | 🟠 Major

Avoid silently defaulting to English when language config is missing.

Using " (disambiguation)" as a universal fallback can silently mis-handle non-English wikis and hide missing configuration. Prefer fail-fast (or at least emit explicit warning/metric) so missing disambiguationTitlePartMap entries are visible.

Suggested fix (fail-fast for missing language mapping)
-  private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+  private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode).getOrElse {
+    throw new IllegalArgumentException(
+      s"Missing disambiguation title suffix for language '${language.wikiCode}' in DisambiguationExtractorConfig.disambiguationTitlePartMap"
+    )
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`
at line 25, The current line initializes replaceString by silently falling back
to the English literal " (disambiguation)"; instead, change initialization in
DisambiguationExtractor (the replaceString assignment referencing
DisambiguationExtractorConfig.disambiguationTitlePartMap and language.wikiCode)
to fail-fast or emit an explicit warning/metric when the map has no entry for
language.wikiCode—e.g., check
DisambiguationExtractorConfig.disambiguationTitlePartMap.contains(language.wikiCode)
and if missing either throw a clear exception (fail-fast) or
log/processLogger.warn and increment a metric before using a fallback; ensure
the message includes language.wikiCode so missing configs are discoverable.

Comment on lines +169 to +170
val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
fd -t f "InfoboxExtractor.scala" --exec wc -l {} \;

Repository: dbpedia/extraction-framework

Length of output: 152


🏁 Script executed:

# Read the relevant section of the file
sed -n '160,180p' core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 1250


🏁 Script executed:

# Search for property.key definition and usage to understand data flow
rg "property\.key" core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala -B 3 -A 3

Repository: dbpedia/extraction-framework

Length of output: 2924


🏁 Script executed:

# Check for existing normalization or escaping patterns in the file
rg "\.replace|\.trim|normalize|escape" core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala -B 2 -A 2

Repository: dbpedia/extraction-framework

Length of output: 1155


Inconsistent normalization of property.key in infobox property definitions.

Line 169 injects raw property.key without normalization, while the same data is normalized elsewhere in the file (around line 215) with .replace("\n", " ").replace("\t", " ").trim for statistics extraction. Apply the same normalization here for consistency and to prevent malformed content in RDF comments.

Proposed hardening diff
-                                val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
+                                val normalizedKey = property.key.replace("\n", " ").replace("\t", " ").trim
+                                val propertyComment = "Raw property extracted from Wikipedia infobox: " + normalizedKey
                                 quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala`
around lines 169 - 170, Normalize property.key before including it in the
InfoboxPropertyDefinitions comment: replace newline and tab characters and trim
the string the same way as used elsewhere (e.g., .replace("\n", "
").replace("\t", " ").trim) when building propertyComment in
InfoboxExtractor.scala so that the value passed to new Quad(...,
DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty,
propertyComment, splitNode.sourceIri, rdfLangStrDt) is sanitized and consistent
with the normalization used for statistics extraction.

Comment on lines +42 to 43
private val citationMap = new mutable.HashMap[String, String]()


⚠️ Potential issue | 🟠 Major

Clear citationMap per document to prevent cross-page contamination.

citationMap is shared mutable state on the extractor instance and is only appended to. If the same extractor processes multiple pages, old cite_note-* IDs can resolve to stale URLs.

💡 Proposed fix
 protected def getJsoupDoc(html: String): Document = {
   val doc = Jsoup.parse(cleanHtml(html))
+  citationMap.clear()

   //extract citations
   for(note <- doc.select("li[id^=cite_note-]").asScala){

Also applies to: 358-365

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around
lines 42 - 43, The extractor's mutable field citationMap in HtmlNifExtractor is
reused across documents causing cross-page contamination; reset or clear
citationMap at the start of each document processing run (e.g., in the method
that begins processing a page / processDocument / extract or the function around
the code handling each page before collecting cite_note-* links) so the map is
empty for each page; ensure any other per-document state initialized in lines
referenced (358-365) is also reinitialized or cleared to avoid stale URL
lookups.

Comment on lines +291 to +293
citationMap.get(link.getCitationId) match {
    case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
    case None =>

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the specific file and the lines in question
head -n 310 core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala | tail -n 50

Repository: dbpedia/extraction-framework

Length of output: 3428


🏁 Script executed:

# Let's get a broader view to understand the context
wc -l core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 145


🏁 Script executed:

# Search for the non-citation validation path mentioned in the comment
rg -n "nifLinks" core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala | head -20

Repository: dbpedia/extraction-framework

Length of output: 1543


🏁 Script executed:

# Check if UriUtils.createURI is used elsewhere in this file
rg -n "UriUtils" core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 204


Validate citation URLs before emitting its:taIdentRef.

The non-citation link path (line 296) validates URLs using UriUtils.createURI() and handles failures, but the citation path (line 292) uses citation URLs directly without validation. Malformed href attributes from HTML can produce invalid IRIs in output quads.

Proposed fix
-          citationMap.get(link.getCitationId) match {
-            case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
-            case None =>
-          }
+          citationMap.get(link.getCitationId) match {
+            case Some(url) =>
+              UriUtils.createURI(url) match {
+                case Success(u) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", u.toString, sourceUrl, null)
+                case Failure(_) =>
+              }
+            case None =>
+          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around
lines 291 - 293, The citation branch is emitting its:taIdentRef using the raw
citation URL (citationMap.get(link.getCitationId)) without validating it; update
the code in HtmlNifExtractor.scala so that the citation path uses
UriUtils.createURI() (the same validation used in the non-citation link branch)
to parse/validate the citation URL before calling nifLinks (words +=
nifLinks(...)), and if UriUtils.createURI() fails, skip emitting the
its:taIdentRef (and optionally log the invalid citation URL) rather than writing
an invalid IRI.

Added rdfs:comment to the properties extracted from Wikipedia infoboxes in the InfoboxExtractor. This provides descriptive metadata about the original raw property.

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala (1)

122-135: ⚠️ Potential issue | 🟡 Minor

Heads-up: DateFinder._date is shared mutable state across worker threads.

Both Workers.work pools operate on the same outer finder (line 123). DateFinder.byName(..., auto = true, ...) lazily sets _date under no synchronization, so concurrent workers in this pool (and between the two pools) can race. This predates the PR, but the new required = false paths make it much more likely the mapping pool never assigns _date, and then the input pool races to auto-resolve it from multiple threads.

Consider giving each worker its own DateFinder, or resolving the date once up-front before kicking off the workers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 122 - 135, The DateFinder instance `finder` is shared across
concurrent workers (used in `Workers.work` / `SimpleWorkers`), and
`DateFinder._date` is set lazily (via `DateFinder.byName(..., auto = true)`),
causing races; fix by avoiding shared mutable `finder`: either pre-resolve the
date once before spinning up workers (call `DateFinder.byName(..., auto = true)`
/ ensure `_date` is initialized) or construct a new `DateFinder` per worker
(move creation of `new DateFinder(baseDir, language)` into the worker closure so
each worker gets its own instance); update references to `finder` in
`QuadMapper().readQuads(...)` accordingly and remove reliance on unsynchronized
lazy `_date`.
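The "resolve the date once up-front" option can be illustrated with a small concurrency sketch; the names are hypothetical, and the final String date stands in for DateFinder's lazily set date:

```java
// Sketch of the "resolve once up-front" option: the shared value is computed
// on the main thread before any worker starts, and workers only read an
// immutable snapshot, so no lazy initialization can race across threads.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ResolveOnceDemo {
    public static void main(String[] args) throws Exception {
        // stands in for resolving the date once, before the worker pools run
        final String date = "2026-04-16";

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int lang = i;
            // workers read the final snapshot; nobody writes shared state
            results.add(pool.submit(() -> "lang-" + lang + "@" + date));
        }
        for (Future<String> r : results) {
            if (!r.get().endsWith("@2026-04-16")) throw new AssertionError();
        }
        pool.shutdown();
        System.out.println("ok");
    }
}
```

The same idea applies to the per-worker-instance alternative: moving construction of the finder into the worker closure gives each thread its own state, at the cost of resolving the date once per worker.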
scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)

67-119: ⚠️ Potential issue | 🔴 Critical

Don't use return inside the SimpleWorkers lambda (lines 73 and 118).

Scala compiles return in a closure into a throw scala.runtime.NonLocalReturnControl targeting the enclosing method frame (main). Because Workers.work executes the lambda on worker threads via worker.process() (Workers.scala:234), that frame is no longer on the current thread's stack. The exception either propagates as an uncaught exception and kills the worker thread, or is silently swallowed—either way, the intended "skip this language" behavior fails.

Restructure with if/else (or a pattern match) so the lambda naturally completes instead of returning early.

🐛 Proposed fix
       val finder = new DateFinder(baseDir, language)
-      val inputFile = finder.byName(input + suffix, auto = true, required = false)
-      if (inputFile.isEmpty) {
-        err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
-        return
-      }
-
-      // use LinkedHashMap to preserve order
-      val map = new LinkedHashMap[String, String]()
-      ...
-      if (predicate == null) {
-        err.println(language.wikiCode + ": input dataset " + input + " not found or empty, skipping.")
-        return
-      }
-      err.println(language.wikiCode + ": " + count + " redirects were suppressed ...")
+      finder.byName(input + suffix, auto = true, required = false) match {
+        case None =>
+          err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
+        case Some(_) =>
+          // existing body (map building, predicate check, closure, writing)
+          // replace the second `return` with an `if (predicate != null) { ... }`
+          // or wrap the tail in the `else` branch.
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`
around lines 67 - 119, The lambda passed to Workers.work (SimpleWorkers)
currently uses early return statements to skip a language (the two "return"
usages inside the worker lambda), which causes NonLocalReturnControl exceptions
when run on worker threads; replace both returns by restructuring the closure so
it completes normally (e.g., wrap the worker body in an if/else or pattern match
that checks inputFile.isEmpty and predicate==null conditions), ensuring you do
not call return from inside the lambda used by Workers.work or QuadMapper
callbacks (refer to variables/methods: Workers.work, SimpleWorkers(...){
language => ... }, finder.byName, inputFile.isEmpty, and the predicate
null-check) so the worker thread simply exits the closure instead of throwing.
🧹 Nitpick comments (2)
scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)

85-85: .get here is safe, but brittle — consider consistency with the new optional contract.

By the time this line runs, _date is guaranteed set (line 70 succeeded), and finder.file returns Some(path) without probing existence, so .get won't fail today. Still, once you fix the return control-flow above, you could thread the Option through for consistency rather than leaving a stray .get.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`
at line 85, The use of finder.byName(output + suffix).get is brittle; replace
the direct .get with proper Option handling so the optional contract is
preserved—e.g., call finder.byName(output + suffix) and either pattern-match
(case Some(path) => val outputFile = path ...) or use .getOrElse to fail with a
clear error; ensure this flows consistently with the earlier _date/return
control-flow so outputFile is only used when the Option is present (refer to
finder.byName, output, suffix and the val outputFile).
scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala (1)

101-117: Tolerant skip looks correct.

dates(required = false) + pattern-matching both file options with an exists guard cleanly avoids the previous .last/.get hard-failure paths.

Optional: the _ => branch swallows which file is missing; including the names in the log would make triage easier when only one of the two datasets is absent.

♻️ Suggested log tweak
-          case _ =>
-            Console.err.println(lang.wikiCode + ": required datasets missing, skipping.")
+          case _ =>
+            val missing = Seq(
+              typeDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(typesDataset),
+              mappedTripleDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(mappedTripleDataset)
+            ).filter(_.nonEmpty).mkString(", ")
+            Console.err.println(s"${lang.wikiCode}: required datasets missing ($missing), skipping.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala`
around lines 101 - 117, The default match branch currently logs a generic
"required datasets missing" message; update it to report which dataset(s) are
missing by checking the Options returned from finder.file for typesDataset and
mappedTripleDataset and whether the underlying File.exists is false, then
include those dataset names (typesDataset and/or mappedTripleDataset) and their
expected paths in the Console.err message; modify the match block around
(typeDatasetFileOption, mappedTripleDatasetFileOption) in
TypeConsistencyCheck.scala so the case _ branch inspects typeDatasetFileOption
and mappedTripleDatasetFileOption (and their .exists) and emits a clear,
specific error like "<lang.wikiCode>: missing <datasetName> at <path>" before
skipping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 138-139: Remove the duplicate local variable declaration for
count: there are two identical lines declaring "var count = 0" in MapObjectUris
(duplicate in the same scope), which causes a compilation error; keep a single
"var count = 0" and delete the extra declaration so only one variable named
count is defined in that scope.

---

Outside diff comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 122-135: The DateFinder instance `finder` is shared across
concurrent workers (used in `Workers.work` / `SimpleWorkers`), and
`DateFinder._date` is set lazily (via `DateFinder.byName(..., auto = true)`),
causing races; fix by avoiding shared mutable `finder`: either pre-resolve the
date once before spinning up workers (call `DateFinder.byName(..., auto = true)`
/ ensure `_date` is initialized) or construct a new `DateFinder` per worker
(move creation of `new DateFinder(baseDir, language)` into the worker closure so
each worker gets its own instance); update references to `finder` in
`QuadMapper().readQuads(...)` accordingly and remove reliance on unsynchronized
lazy `_date`.
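The race comes from a hand-rolled lazily-set `var` being shared across workers. As a sketch of the safer direction, Scala's built-in `lazy val` initializes under the enclosing object's lock, so pre-resolving the date once into such a field (or giving each worker its own instance) removes the race (illustrative code only; `DateFinder` itself is not modeled here):

```scala
object LazyInitSketch {
  // A `lazy val` is initialized at most once, under the instance's monitor,
  // unlike an unsynchronized `var _date` that each worker may race to set.
  final class DateResolver(compute: () => String) {
    lazy val date: String = compute() // safe to read from many threads
  }
}
```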

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Around line 67-119: The lambda passed to Workers.work (SimpleWorkers)
currently uses early return statements to skip a language (the two "return"
usages inside the worker lambda), which causes NonLocalReturnControl exceptions
when run on worker threads; replace both returns by restructuring the closure so
it completes normally (e.g., wrap the worker body in an if/else or pattern match
that checks inputFile.isEmpty and predicate==null conditions), ensuring you do
not call return from inside the lambda used by Workers.work or QuadMapper
callbacks (refer to variables/methods: Workers.work, SimpleWorkers(...){
language => ... }, finder.byName, inputFile.isEmpty, and the predicate
null-check) so the worker thread simply exits the closure instead of throwing.

---

Nitpick comments:
In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Line 85: The use of finder.byName(output + suffix).get is brittle; replace the
direct .get with proper Option handling so the optional contract is
preserved—e.g., call finder.byName(output + suffix) and either pattern-match
(case Some(path) => val outputFile = path ...) or use .getOrElse to fail with a
clear error; ensure this flows consistently with the earlier _date/return
control-flow so outputFile is only used when the Option is present (refer to
finder.byName, output, suffix and the val outputFile).

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala`:
- Around line 101-117: The default match branch currently logs a generic
"required datasets missing" message; update it to report which dataset(s) are
missing by checking the Options returned from finder.file for typesDataset and
mappedTripleDataset and whether the underlying File.exists is false, then
include those dataset names (typesDataset and/or mappedTripleDataset) and their
expected paths in the Console.err message; modify the match block around
(typeDatasetFileOption, mappedTripleDatasetFileOption) in
TypeConsistencyCheck.scala so the case _ branch inspects typeDatasetFileOption
and mappedTripleDatasetFileOption (and their .exists) and emits a clear,
specific error like "<lang.wikiCode>: missing <datasetName> at <path>" before
skipping.


📥 Commits

Reviewing files that changed from the base of the PR and between 38d6293 and ac97387.

📒 Files selected for processing (6)
  • core/src/main/scala/org/dbpedia/extraction/util/DateFinder.scala
  • core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/QuadReader.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala

Comment thread scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 150-164: The changeCount is incremented for every non-literal quad
but should reflect quads actually remapped; move or adjust the increment so it
only increases when a mapping exists (i.e. when ret is nonEmpty) and count the
number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.
- Around line 139-142: The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.


📥 Commits

Reviewing files that changed from the base of the PR and between ac97387 and f49805a.

📒 Files selected for processing (1)
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Comment on lines +139 to +142
val inputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + input._2))) else finder.byName(input._1 + input._2, auto = true, required = false).map(x => x: FileLike[_])
val outputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + extension + input._2))) else finder.byName(input._1 + extension + input._2, auto = true, required = false).map(x => x: FileLike[_])

if (inputFileOption.isDefined && outputFileOption.isDefined && inputFileOption.get.exists) {

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect DateFinder.byName to confirm whether it returns None for files that don't exist.
fd -t f 'DateFinder.scala'
fd -t f 'DateFinder.scala' --exec cat {}

Repository: dbpedia/extraction-framework

Length of output: 1287


🏁 Script executed:

fd -t f 'Finder.scala' -o -type f -name 'Finder.scala'

Repository: dbpedia/extraction-framework

Length of output: 188


🏁 Script executed:

fd -t f 'Finder.scala' | head -20

Repository: dbpedia/extraction-framework

Length of output: 197


🏁 Script executed:

cat core/src/main/scala/org/dbpedia/extraction/util/Finder.scala

Repository: dbpedia/extraction-framework

Length of output: 4341


🏁 Script executed:

# Check if resolve() implementation or FileLike checks existence
rg -A 5 'def resolve' core/src/main/scala/org/dbpedia/extraction/util/ --max-count=10

Repository: dbpedia/extraction-framework

Length of output: 3521


🏁 Script executed:

# Look at the MapObjectUris context around lines 139-142 to understand the full picture
sed -n '130,150p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 1386


🏁 Script executed:

# Check if there's a wikiDate method or alternative path construction mentioned in the review
rg 'wikiDate|wikiDir' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 54


🏁 Script executed:

# Look at what finder actually is in MapObjectUris context
sed -n '100,145p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 2630


Fix the output path lookup to not depend on file existence.

The finder.byName(input._1 + extension + input._2, auto = true, required = false) call at line 140 will return None for the output file because DateFinder.byName() internally calls finder.dates(name, required), which filters dates based on whether files matching that name actually exist on disk. Since the output file doesn't exist yet, no matching date is found, dates() returns empty, and byName() returns None. This causes the guard at line 142 to fail, silently skipping the entire language×dataset pair.

The output path should be constructed directly (similar to the isExternal branch using RichFile and new File()) rather than looked up via byName(). For example, use finder.file(extension + input._2) to get the path in the main wiki directory, or construct the path directly for the date directory.
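Constructing rather than probing can be sketched with plain `java.io.File` (the directory layout and dataset names here are illustrative placeholders, not the framework's actual conventions):

```scala
import java.io.File

object OutputPathSketch {
  // Build the output path from its known parts; a path for a file that
  // does not exist yet cannot be discovered by scanning the disk.
  def outputFile(dateDir: File, dataset: String, extension: String, suffix: String): File =
    new File(dateDir, dataset + extension + suffix)
}
```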

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 139 - 142, The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.

Comment on lines +150 to +164
else {
val uris = map.get(quad.value).asScala
changeCount = changeCount + 1
val ret = for (uri <- uris)
yield quad.copy(
value = uri, // change object URI
context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
// none found
if(ret.isEmpty)
List(quad)
else
ret
}
}
err.println(input._1 + ": changed " + changeCount + " quads.")

⚠️ Potential issue | 🟡 Minor

changeCount over-reports — increments before checking whether a mapping exists.

At line 152 the counter is bumped for every non-literal quad, but the ret.isEmpty branch at lines 158–159 keeps the original quad unchanged when no mapping is found. The log at line 164 ("changed N quads") therefore conflates "non-literal quads processed" with "quads actually remapped". Given the rename from count to changeCount in this PR, the intended semantics appears to be the latter.

🐛 Proposed fix
             else {
               val uris = map.get(quad.value).asScala
-              changeCount = changeCount + 1
               val ret = for (uri <- uris)
                 yield quad.copy(
                   value = uri, // change object URI
                   context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
               // none found
               if(ret.isEmpty)
                 List(quad)
-              else
+              else {
+                changeCount += 1
                 ret
+              }
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 150 - 164, The changeCount is incremented for every non-literal
quad but should reflect quads actually remapped; move or adjust the increment so
it only increases when a mapping exists (i.e. when ret is nonEmpty) and count
the number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.

@sonarqubecloud

