
feat(ontology): add rdfs:comment to infobox property definitions#840

Open
anshuman9468 wants to merge 8 commits into dbpedia:master from anshuman9468:feature/ontology-consistency-updates

Conversation

@anshuman9468 (Contributor) commented Apr 16, 2026

This PR adds rdfs:comment to the properties extracted from Wikipedia infoboxes in the InfoboxExtractor. This provides descriptive metadata about the original raw property, derived from the Wikipedia infobox key. This change addresses the requirement to include more information in the infobox-property-definitions dataset so it can be viewed as HTML in the store.
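For illustration, the emitted statement could look like the following (the property name is hypothetical; the comment string follows the "Raw property extracted from Wikipedia infobox: " prefix used by the extractor):

```turtle
<http://dbpedia.org/property/birthPlace>
    <http://www.w3.org/2000/01/rdf-schema#comment>
    "Raw property extracted from Wikipedia infobox: birth_place"@en .
```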

Summary by CodeRabbit

  • New Features

    • Extraction and semantic tagging of citation links, including linking citation notes to external targets
    • Automatic comment generation for infobox property definitions
  • Bug Fixes

    • Safer disambiguation title handling with a fallback when mappings are missing
    • Improved robustness when files, dates or optional mapping inputs are missing
    • Removed an interfering reference selector to improve reference extraction quality

@coderabbitai bot commented Apr 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d6c83d23-394e-421c-ad52-8f738d1e15ee

📥 Commits

Reviewing files that changed from the base of the PR and between f49805a and 9fc7dfd.

📒 Files selected for processing (5)
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties
  • dump/src/test/resources/extraction-configs/generic-spark.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/mappings.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/wikidata.extraction.properties
✅ Files skipped from review due to trivial changes (4)
  • dump/src/test/resources/extraction-configs/wikidata.extraction.properties
  • dump/src/test/resources/extraction-configs/mappings.extraction.minidump.properties
  • dump/src/test/resources/extraction-configs/extraction.nif.abstracts.properties
  • dump/src/test/resources/extraction-configs/extraction.plain.abstracts.properties

📝 Walkthrough

Walkthrough

Adds citation tracking and emission: Link and LinkExtractor now identify in-page citation anchors; HtmlNifExtractor collects citation URLs and emits citation-specific RDF; NIF config removes sup.reference; InfoboxExtractor adds rdfs:comment; various utilities and scripts tolerate missing files/dates; CI deploy adds -DskipNexusStagingDeployMojo=true.

Changes

Cohort / File(s) Summary
Citation model
core/src/main/java/org/dbpedia/extraction/nif/Link.java
Added citation boolean and citationId string with getters/setters; minor formatting tweaks.
Citation detection
core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
Detect href starting with #cite_note-, mark Link as citation and set citationId; small control-flow/formatting normalizations.
Citation extraction & RDF emission
core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
Added in-memory citationMap from li#cite_note-* → external a.external.text href; makeWordsFromLinks emits dbpedia:Citation and uses citationMap for its:taIdentRef when link.isCitation.
NIF config
core/src/main/resources/nifextractionconfig.json
Removed selector sup.reference from nif-remove-elements.
Infobox metadata
core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
Emit additional rdfs:comment quad when a property URI is first seen.
Date/finder resilience
core/src/main/scala/org/dbpedia/extraction/util/DateFinder.scala, core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
byName/byPattern gain required param and avoid .last on empty results; Finder.dates now deduplicates with .distinct.
Script robustness
scripts/src/main/scala/org/dbpedia/extraction/scripts/QuadReader.scala, .../MapObjectUris.scala, .../ResolveTransitiveLinks.scala, .../TypeConsistencyCheck.scala
readQuads/read-multiple accept required flag; callers now handle missing files/dates gracefully, skip processing and log instead of failing; safer Option handling for input/output.
Disambiguation fallback
core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
Use getOrElse for disambiguation title mapping to avoid key lookup exceptions.
CI deploy
.github/workflows/snapshot_deploy.yml
Maven mvn deploy invocation adds -DskipNexusStagingDeployMojo=true.
Test configs
dump/src/test/resources/extraction-configs/*
Added compare-dataset-ids=false to multiple test extraction config files.
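The citation-detection rule summarized above (an href beginning with #cite_note- marks the Link as a citation and sets its citationId) can be sketched as a standalone snippet; the Link class here is a simplified stand-in for illustration, not the framework's actual class:

```java
// Minimal sketch of the detection rule: an in-page anchor whose href starts
// with "#cite_note-" is flagged as a citation and keyed by its note id.
// The Link class below is a simplified stand-in for illustration only.
public class CitationDetectDemo {
    static final class Link {
        boolean citation;
        String citationId;
    }

    static void markIfCitation(Link link, String href) {
        if (href != null && href.startsWith("#cite_note-")) {
            link.citation = true;
            link.citationId = href.substring(1); // drop '#': key for citationMap lookups
        }
    }

    public static void main(String[] args) {
        Link note = new Link();
        markIfCitation(note, "#cite_note-7");
        if (!note.citation || !"cite_note-7".equals(note.citationId))
            throw new AssertionError("citation anchor not detected");

        Link plain = new Link();
        markIfCitation(plain, "/wiki/DBpedia");
        if (plain.citation)
            throw new AssertionError("ordinary link wrongly flagged");

        System.out.println("ok");
    }
}
```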

Sequence Diagram(s)

sequenceDiagram
    participant HTML as HTML Document
    participant HtmlExt as HtmlNifExtractor
    participant LinkExt as LinkExtractor
    participant CitMap as CitationMap
    participant RDF as RDF Emitter

    HTML->>HtmlExt: parse document
    HtmlExt->>HtmlExt: extract li elements id="cite_note-*" and store id→external href in CitMap
    HtmlExt->>LinkExt: traverse <a> nodes
    LinkExt->>LinkExt: detect href starting "#cite_note-*" → set citation=true, citationId
    LinkExt-->>HtmlExt: return Link objects (with citation flag/id)
    HtmlExt->>CitMap: lookup citationId → URL (if available)
    HtmlExt->>RDF: emit dbpedia:Citation triple for word
    HtmlExt->>RDF: emit its:taIdentRef with citation URL (if found)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(ontology): add rdfs:comment to infobox property definitions' accurately reflects the primary functional change in the PR: adding rdfs:comment annotations to infobox property definitions in InfoboxExtractor.scala, though the changeset also includes several broader refactoring efforts related to citation handling and optional file resolution.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment
Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java`:
- Around line 274-279: The loop in getTableCount() incorrectly references the
field/variable "paragraph" instead of the loop variable "p", which can produce
wrong counts or NPEs; update the body to use p.getHtmlStrings().size() and add a
null-safe check (skip if p is null or if p.getHtmlStrings() is null) so
getTableCount() iterates over this.getParagraphs() using the loop variable "p"
and sums sizes safely.

In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`:
- Line 25: The current line initializes replaceString by silently falling back
to the English literal " (disambiguation)"; instead, change initialization in
DisambiguationExtractor (the replaceString assignment referencing
DisambiguationExtractorConfig.disambiguationTitlePartMap and language.wikiCode)
to fail-fast or emit an explicit warning/metric when the map has no entry for
language.wikiCode—e.g., check
DisambiguationExtractorConfig.disambiguationTitlePartMap.contains(language.wikiCode)
and if missing either throw a clear exception (fail-fast) or
log/processLogger.warn and increment a metric before using a fallback; ensure
the message includes language.wikiCode so missing configs are discoverable.

In `@core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala`:
- Around line 169-170: Normalize property.key before including it in the
InfoboxPropertyDefinitions comment: replace newline and tab characters and trim
the string the same way as used elsewhere (e.g., .replace("\n", "
").replace("\t", " ").trim) when building propertyComment in
InfoboxExtractor.scala so that the value passed to new Quad(...,
DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty,
propertyComment, splitNode.sourceIri, rdfLangStrDt) is sanitized and consistent
with the normalization used for statistics extraction.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala`:
- Around line 42-43: The extractor's mutable field citationMap in
HtmlNifExtractor is reused across documents causing cross-page contamination;
reset or clear citationMap at the start of each document processing run (e.g.,
in the method that begins processing a page / processDocument / extract or the
function around the code handling each page before collecting cite_note-* links)
so the map is empty for each page; ensure any other per-document state
initialized in lines referenced (358-365) is also reinitialized or cleared to
avoid stale URL lookups.
- Around line 291-293: The citation branch is emitting its:taIdentRef using the
raw citation URL (citationMap.get(link.getCitationId)) without validating it;
update the code in HtmlNifExtractor.scala so that the citation path uses
UriUtils.createURI() (the same validation used in the non-citation link branch)
to parse/validate the citation URL before calling nifLinks (words +=
nifLinks(...)), and if UriUtils.createURI() fails, skip emitting the
its:taIdentRef (and optionally log the invalid citation URL) rather than writing
an invalid IRI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 356bed8d-b522-4f7b-b5be-51fc33ae5f53

📥 Commits

Reviewing files that changed from the base of the PR and between e3dfe8b and 38d6293.

📒 Files selected for processing (7)
  • .github/workflows/snapshot_deploy.yml
  • core/src/main/java/org/dbpedia/extraction/nif/Link.java
  • core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
  • core/src/main/resources/nifextractionconfig.json
  • core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
  • core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
  • core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
  • core/src/main/resources/nifextractionconfig.json

Comment on lines +274 to 279
public int getTableCount() {
    int count = 0;
    for (Paragraph p : this.getParagraphs()) {
        count += paragraph.getHtmlStrings().size();
    }
    return count;

⚠️ Potential issue | 🔴 Critical

Use the loop variable in getTableCount() (current code can NPE).

Line 277 uses paragraph instead of p. This can produce wrong counts and fail at runtime when paragraph is null.

💡 Proposed fix
 public int getTableCount() {
 	int count = 0;
 	for (Paragraph p : this.getParagraphs()) {
-		count += paragraph.getHtmlStrings().size();
+		count += p.getHtmlStrings().size();
 	}
 	return count;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java` around
lines 274 - 279, The loop in getTableCount() incorrectly references the
field/variable "paragraph" instead of the loop variable "p", which can produce
wrong counts or NPEs; update the body to use p.getHtmlStrings().size() and add a
null-safe check (skip if p is null or if p.getHtmlStrings() is null) so
getTableCount() iterates over this.getParagraphs() using the loop variable "p"
and sums sizes safely.
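Beyond the one-character fix in the diff, the prompt above also asks for a null-safe variant. A sketch under those assumptions (Paragraph here is a minimal stand-in for the framework's class, and the assumption that getParagraphs() or getHtmlStrings() may return null comes from the prompt, not the source):

```java
// Null-safe sketch of getTableCount(): uses the loop variable "p" and skips
// null paragraphs or null HTML-string lists. Paragraph is a simplified
// stand-in for illustration only.
import java.util.Arrays;
import java.util.List;

public class TableCountDemo {
    static final class Paragraph {
        final List<String> htmlStrings;
        Paragraph(List<String> htmlStrings) { this.htmlStrings = htmlStrings; }
        List<String> getHtmlStrings() { return htmlStrings; }
    }

    static int getTableCount(List<Paragraph> paragraphs) {
        if (paragraphs == null) return 0;
        int count = 0;
        for (Paragraph p : paragraphs) {
            if (p == null || p.getHtmlStrings() == null) continue; // null-safe skip
            count += p.getHtmlStrings().size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<Paragraph> ps = Arrays.asList(
            new Paragraph(Arrays.asList("<table>a</table>", "<table>b</table>")),
            null,                      // skipped: null paragraph
            new Paragraph(null),       // skipped: null html-string list
            new Paragraph(Arrays.asList("<table>c</table>")));
        if (getTableCount(ps) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```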

  private val language = context.language

- private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+ private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")

⚠️ Potential issue | 🟠 Major

Avoid silently defaulting to English when language config is missing.

Using " (disambiguation)" as a universal fallback can silently mis-handle non-English wikis and hide missing configuration. Prefer fail-fast (or at least emit explicit warning/metric) so missing disambiguationTitlePartMap entries are visible.

Suggested fix (fail-fast for missing language mapping)
-  private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+  private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode).getOrElse {
+    throw new IllegalArgumentException(
+      s"Missing disambiguation title suffix for language '${language.wikiCode}' in DisambiguationExtractorConfig.disambiguationTitlePartMap"
+    )
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`
at line 25, The current line initializes replaceString by silently falling back
to the English literal " (disambiguation)"; instead, change initialization in
DisambiguationExtractor (the replaceString assignment referencing
DisambiguationExtractorConfig.disambiguationTitlePartMap and language.wikiCode)
to fail-fast or emit an explicit warning/metric when the map has no entry for
language.wikiCode—e.g., check
DisambiguationExtractorConfig.disambiguationTitlePartMap.contains(language.wikiCode)
and if missing either throw a clear exception (fail-fast) or
log/processLogger.warn and increment a metric before using a fallback; ensure
the message includes language.wikiCode so missing configs are discoverable.

Comment on lines +169 to +170
val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
fd -t f "InfoboxExtractor.scala" --exec wc -l {} \;

Repository: dbpedia/extraction-framework

Length of output: 152


🏁 Script executed:

# Read the relevant section of the file
sed -n '160,180p' core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 1250


🏁 Script executed:

# Search for property.key definition and usage to understand data flow
rg "property\.key" core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala -B 3 -A 3

Repository: dbpedia/extraction-framework

Length of output: 2924


🏁 Script executed:

# Check for existing normalization or escaping patterns in the file
rg "\.replace|\.trim|normalize|escape" core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala -B 2 -A 2

Repository: dbpedia/extraction-framework

Length of output: 1155


Inconsistent normalization of property.key in infobox property definitions.

Line 169 injects raw property.key without normalization, while the same data is normalized elsewhere in the file (around line 215) with .replace("\n", " ").replace("\t", " ").trim for statistics extraction. Apply the same normalization here for consistency and to prevent malformed content in RDF comments.

Proposed hardening diff
-                                val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
+                                val normalizedKey = property.key.replace("\n", " ").replace("\t", " ").trim
+                                val propertyComment = "Raw property extracted from Wikipedia infobox: " + normalizedKey
                                 quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala`
around lines 169 - 170, Normalize property.key before including it in the
InfoboxPropertyDefinitions comment: replace newline and tab characters and trim
the string the same way as used elsewhere (e.g., .replace("\n", "
").replace("\t", " ").trim) when building propertyComment in
InfoboxExtractor.scala so that the value passed to new Quad(...,
DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty,
propertyComment, splitNode.sourceIri, rdfLangStrDt) is sanitized and consistent
with the normalization used for statistics extraction.

Comment on lines +42 to 43
private val citationMap = new mutable.HashMap[String, String]()


⚠️ Potential issue | 🟠 Major

Clear citationMap per document to prevent cross-page contamination.

citationMap is shared mutable state on the extractor instance and is only appended to. If the same extractor processes multiple pages, old cite_note-* IDs can resolve to stale URLs.

💡 Proposed fix
 protected def getJsoupDoc(html: String): Document = {
   val doc = Jsoup.parse(cleanHtml(html))
+  citationMap.clear()

   //extract citations
   for(note <- doc.select("li[id^=cite_note-]").asScala){

Also applies to: 358-365

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around
lines 42 - 43, The extractor's mutable field citationMap in HtmlNifExtractor is
reused across documents causing cross-page contamination; reset or clear
citationMap at the start of each document processing run (e.g., in the method
that begins processing a page / processDocument / extract or the function around
the code handling each page before collecting cite_note-* links) so the map is
empty for each page; ensure any other per-document state initialized in lines
referenced (358-365) is also reinitialized or cleared to avoid stale URL
lookups.

Comment on lines +291 to +293
citationMap.get(link.getCitationId) match {
    case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
    case None =>

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the specific file and the lines in question
head -n 310 core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala | tail -n 50

Repository: dbpedia/extraction-framework

Length of output: 3428


🏁 Script executed:

# Let's get a broader view to understand the context
wc -l core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 145


🏁 Script executed:

# Search for the non-citation validation path mentioned in the comment
rg -n "nifLinks" core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala | head -20

Repository: dbpedia/extraction-framework

Length of output: 1543


🏁 Script executed:

# Check if UriUtils.createURI is used elsewhere in this file
rg -n "UriUtils" core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala

Repository: dbpedia/extraction-framework

Length of output: 204


Validate citation URLs before emitting its:taIdentRef.

The non-citation link path (line 296) validates URLs using UriUtils.createURI() and handles failures, but the citation path (line 292) uses citation URLs directly without validation. Malformed href attributes from HTML can produce invalid IRIs in output quads.

Proposed fix
-          citationMap.get(link.getCitationId) match {
-            case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
-            case None =>
-          }
+          citationMap.get(link.getCitationId) match {
+            case Some(url) =>
+              UriUtils.createURI(url) match {
+                case Success(u) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", u.toString, sourceUrl, null)
+                case Failure(_) =>
+              }
+            case None =>
+          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala` around
lines 291 - 293, The citation branch is emitting its:taIdentRef using the raw
citation URL (citationMap.get(link.getCitationId)) without validating it; update
the code in HtmlNifExtractor.scala so that the citation path uses
UriUtils.createURI() (the same validation used in the non-citation link branch)
to parse/validate the citation URL before calling nifLinks (words +=
nifLinks(...)), and if UriUtils.createURI() fails, skip emitting the
its:taIdentRef (and optionally log the invalid citation URL) rather than writing
an invalid IRI.

Added rdfs:comment to the properties extracted from Wikipedia infoboxes in the InfoboxExtractor. This provides descriptive metadata about the original raw property.

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala (1)

122-135: ⚠️ Potential issue | 🟡 Minor

Heads-up: DateFinder._date is shared mutable state across worker threads.

Both Workers.work pools operate on the same outer finder (line 123). DateFinder.byName(..., auto = true, ...) lazily sets _date under no synchronization, so concurrent workers in this pool (and between the two pools) can race. This predates the PR, but the new required = false paths make it much more likely the mapping pool never assigns _date, and then the input pool races to auto-resolve it from multiple threads.

Consider giving each worker its own DateFinder, or resolving the date once up-front before kicking off the workers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 122 - 135, The DateFinder instance `finder` is shared across
concurrent workers (used in `Workers.work` / `SimpleWorkers`), and
`DateFinder._date` is set lazily (via `DateFinder.byName(..., auto = true)`),
causing races; fix by avoiding shared mutable `finder`: either pre-resolve the
date once before spinning up workers (call `DateFinder.byName(..., auto = true)`
/ ensure `_date` is initialized) or construct a new `DateFinder` per worker
(move creation of `new DateFinder(baseDir, language)` into the worker closure so
each worker gets its own instance); update references to `finder` in
`QuadMapper().readQuads(...)` accordingly and remove reliance on unsynchronized
lazy `_date`.
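The "resolve the date once up-front" option can be illustrated with a small concurrency sketch; the names are hypothetical, and the final String date stands in for DateFinder's lazily set date:

```java
// Sketch of the "resolve once up-front" option: the shared value is computed
// on the main thread before any worker starts, and workers only read an
// immutable snapshot, so no lazy initialization can race across threads.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ResolveOnceDemo {
    public static void main(String[] args) throws Exception {
        // stands in for resolving the date once, before the worker pools run
        final String date = "2026-04-16";

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int lang = i;
            // workers read the final snapshot; nobody writes shared state
            results.add(pool.submit(() -> "lang-" + lang + "@" + date));
        }
        for (Future<String> r : results) {
            if (!r.get().endsWith("@2026-04-16")) throw new AssertionError();
        }
        pool.shutdown();
        System.out.println("ok");
    }
}
```

The same idea applies to the per-worker-instance alternative: moving construction of the finder into the worker closure gives each thread its own state, at the cost of resolving the date once per worker.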
scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)

67-119: ⚠️ Potential issue | 🔴 Critical

Don't use return inside the SimpleWorkers lambda (lines 73 and 118).

Scala compiles return in a closure into a throw scala.runtime.NonLocalReturnControl targeting the enclosing method frame (main). Because Workers.work executes the lambda on worker threads via worker.process() (Workers.scala:234), that frame is no longer on the current thread's stack. The exception either propagates as an uncaught exception and kills the worker thread, or is silently swallowed—either way, the intended "skip this language" behavior fails.

Restructure with if/else (or a pattern match) so the lambda naturally completes instead of returning early.

🐛 Proposed fix
       val finder = new DateFinder(baseDir, language)
-      val inputFile = finder.byName(input + suffix, auto = true, required = false)
-      if (inputFile.isEmpty) {
-        err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
-        return
-      }
-
-      // use LinkedHashMap to preserve order
-      val map = new LinkedHashMap[String, String]()
-      ...
-      if (predicate == null) {
-        err.println(language.wikiCode + ": input dataset " + input + " not found or empty, skipping.")
-        return
-      }
-      err.println(language.wikiCode + ": " + count + " redirects were suppressed ...")
+      finder.byName(input + suffix, auto = true, required = false) match {
+        case None =>
+          err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
+        case Some(_) =>
+          // existing body (map building, predicate check, closure, writing)
+          // replace the second `return` with an `if (predicate != null) { ... }`
+          // or wrap the tail in the `else` branch.
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`
around lines 67 - 119, The lambda passed to Workers.work (SimpleWorkers)
currently uses early return statements to skip a language (the two "return"
usages inside the worker lambda), which causes NonLocalReturnControl exceptions
when run on worker threads; replace both returns by restructuring the closure so
it completes normally (e.g., wrap the worker body in an if/else or pattern match
that checks inputFile.isEmpty and predicate==null conditions), ensuring you do
not call return from inside the lambda used by Workers.work or QuadMapper
callbacks (refer to variables/methods: Workers.work, SimpleWorkers(...){
language => ... }, finder.byName, inputFile.isEmpty, and the predicate
null-check) so the worker thread simply exits the closure instead of throwing.
🧹 Nitpick comments (2)
scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)

85-85: .get here is safe, but brittle — consider consistency with the new optional contract.

By the time this line runs, _date is guaranteed set (line 70 succeeded), and finder.file returns Some(path) without probing existence, so .get won't fail today. Still, once you fix the return control-flow above, you could thread the Option through for consistency rather than leaving a stray .get.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`
at line 85, The use of finder.byName(output + suffix).get is brittle; replace
the direct .get with proper Option handling so the optional contract is
preserved—e.g., call finder.byName(output + suffix) and either pattern-match
(case Some(path) => val outputFile = path ...) or use .getOrElse to fail with a
clear error; ensure this flows consistently with the earlier _date/return
control-flow so outputFile is only used when the Option is present (refer to
finder.byName, output, suffix and the val outputFile).
scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala (1)

101-117: Tolerant skip looks correct.

dates(required = false) + pattern-matching both file options with an exists guard cleanly avoids the previous .last/.get hard-failure paths.

Optional: the _ => branch swallows which file is missing; including the names in the log would make triage easier when only one of the two datasets is absent.

♻️ Suggested log tweak
-          case _ =>
-            Console.err.println(lang.wikiCode + ": required datasets missing, skipping.")
+          case _ =>
+            val missing = Seq(
+              typeDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(typesDataset),
+              mappedTripleDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(mappedTripleDataset)
+            ).filter(_.nonEmpty).mkString(", ")
+            Console.err.println(s"${lang.wikiCode}: required datasets missing ($missing), skipping.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala`
around lines 101 - 117, The default match branch currently logs a generic
"required datasets missing" message; update it to report which dataset(s) are
missing by checking the Options returned from finder.file for typesDataset and
mappedTripleDataset and whether the underlying File.exists is false, then
include those dataset names (typesDataset and/or mappedTripleDataset) and their
expected paths in the Console.err message; modify the match block around
(typeDatasetFileOption, mappedTripleDatasetFileOption) in
TypeConsistencyCheck.scala so the case _ branch inspects typeDatasetFileOption
and mappedTripleDatasetFileOption (and their .exists) and emits a clear,
specific error like "<lang.wikiCode>: missing <datasetName> at <path>" before
skipping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 138-139: Remove the duplicate local variable declaration for
count: there are two identical lines declaring "var count = 0" in MapObjectUris
(duplicate in the same scope), which causes a compilation error; keep a single
"var count = 0" and delete the extra declaration so only one variable named
count is defined in that scope.

---

Outside diff comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 122-135: The DateFinder instance `finder` is shared across
concurrent workers (used in `Workers.work` / `SimpleWorkers`), and
`DateFinder._date` is set lazily (via `DateFinder.byName(..., auto = true)`),
causing races; fix by avoiding shared mutable `finder`: either pre-resolve the
date once before spinning up workers (call `DateFinder.byName(..., auto = true)`
/ ensure `_date` is initialized) or construct a new `DateFinder` per worker
(move creation of `new DateFinder(baseDir, language)` into the worker closure so
each worker gets its own instance); update references to `finder` in
`QuadMapper().readQuads(...)` accordingly and remove reliance on unsynchronized
lazy `_date`.
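The race comes from a hand-rolled lazily-set `var` being shared across workers. As a sketch of the safer direction, Scala's built-in `lazy val` initializes under the enclosing object's lock, so pre-resolving the date once into such a field (or giving each worker its own instance) removes the race (illustrative code only; `DateFinder` itself is not modeled here):

```scala
object LazyInitSketch {
  // A `lazy val` is initialized at most once, under the instance's monitor,
  // unlike an unsynchronized `var _date` that each worker may race to set.
  final class DateResolver(compute: () => String) {
    lazy val date: String = compute() // safe to read from many threads
  }
}
```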

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Around line 67-119: The lambda passed to Workers.work (SimpleWorkers)
currently uses early return statements to skip a language (the two "return"
usages inside the worker lambda), which causes NonLocalReturnControl exceptions
when run on worker threads; replace both returns by restructuring the closure so
it completes normally (e.g., wrap the worker body in an if/else or pattern match
that checks inputFile.isEmpty and predicate==null conditions), ensuring you do
not call return from inside the lambda used by Workers.work or QuadMapper
callbacks (refer to variables/methods: Workers.work, SimpleWorkers(...){
language => ... }, finder.byName, inputFile.isEmpty, and the predicate
null-check) so the worker thread simply exits the closure instead of throwing.

---

Nitpick comments:
In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Line 85: The use of finder.byName(output + suffix).get is brittle; replace the
direct .get with proper Option handling so the optional contract is
preserved—e.g., call finder.byName(output + suffix) and either pattern-match
(case Some(path) => val outputFile = path ...) or use .getOrElse to fail with a
clear error; ensure this flows consistently with the earlier _date/return
control-flow so outputFile is only used when the Option is present (refer to
finder.byName, output, suffix and the val outputFile).

In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala`:
- Around line 101-117: The default match branch currently logs a generic
"required datasets missing" message; update it to report which dataset(s) are
missing by checking the Options returned from finder.file for typesDataset and
mappedTripleDataset and whether the underlying File.exists is false, then
include those dataset names (typesDataset and/or mappedTripleDataset) and their
expected paths in the Console.err message; modify the match block around
(typeDatasetFileOption, mappedTripleDatasetFileOption) in
TypeConsistencyCheck.scala so the case _ branch inspects typeDatasetFileOption
and mappedTripleDatasetFileOption (and their .exists) and emits a clear,
specific error like "<lang.wikiCode>: missing <datasetName> at <path>" before
skipping.


📥 Commits

Reviewing files that changed from the base of the PR and between 38d6293 and ac97387.

📒 Files selected for processing (6)
  • core/src/main/scala/org/dbpedia/extraction/util/DateFinder.scala
  • core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/QuadReader.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala

Comment thread scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 150-164: The changeCount is incremented for every non-literal quad
but should reflect quads actually remapped; move or adjust the increment so it
only increases when a mapping exists (i.e. when ret is nonEmpty) and count the
number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.
- Around line 139-142: The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.


📥 Commits

Reviewing files that changed from the base of the PR and between ac97387 and f49805a.

📒 Files selected for processing (1)
  • scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Comment on lines +139 to +142
val inputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + input._2))) else finder.byName(input._1 + input._2, auto = true, required = false).map(x => x: FileLike[_])
val outputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + extension + input._2))) else finder.byName(input._1 + extension + input._2, auto = true, required = false).map(x => x: FileLike[_])

if (inputFileOption.isDefined && outputFileOption.isDefined && inputFileOption.get.exists) {

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect DateFinder.byName to confirm whether it returns None for files that don't exist.
fd -t f 'DateFinder.scala'
fd -t f 'DateFinder.scala' --exec cat {}

Repository: dbpedia/extraction-framework

Length of output: 1287


🏁 Script executed:

fd -t f 'Finder.scala' -o -type f -name 'Finder.scala'

Repository: dbpedia/extraction-framework

Length of output: 188


🏁 Script executed:

fd -t f 'Finder.scala' | head -20

Repository: dbpedia/extraction-framework

Length of output: 197


🏁 Script executed:

cat core/src/main/scala/org/dbpedia/extraction/util/Finder.scala

Repository: dbpedia/extraction-framework

Length of output: 4341


🏁 Script executed:

# Check if resolve() implementation or FileLike checks existence
rg -A 5 'def resolve' core/src/main/scala/org/dbpedia/extraction/util/ --max-count=10

Repository: dbpedia/extraction-framework

Length of output: 3521


🏁 Script executed:

# Look at the MapObjectUris context around lines 139-142 to understand the full picture
sed -n '130,150p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 1386


🏁 Script executed:

# Check if there's a wikiDate method or alternative path construction mentioned in the review
rg 'wikiDate|wikiDir' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 54


🏁 Script executed:

# Look at what finder actually is in MapObjectUris context
sed -n '100,145p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework

Length of output: 2630


Fix the output path lookup to not depend on file existence.

The finder.byName(input._1 + extension + input._2, auto = true, required = false) call at line 140 will return None for the output file because DateFinder.byName() internally calls finder.dates(name, required), which filters dates based on whether files matching that name actually exist on disk. Since the output file doesn't exist yet, no matching date is found, dates() returns empty, and byName() returns None. This causes the guard at line 142 to fail, silently skipping the entire language×dataset pair.

The output path should be constructed directly (similar to the isExternal branch using RichFile and new File()) rather than looked up via byName(). For example, use finder.file(extension + input._2) to get the path in the main wiki directory, or construct the path directly for the date directory.
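Constructing rather than probing can be sketched with plain `java.io.File` (the directory layout and dataset names here are illustrative placeholders, not the framework's actual conventions):

```scala
import java.io.File

object OutputPathSketch {
  // Build the output path from its known parts; a path for a file that
  // does not exist yet cannot be discovered by scanning the disk.
  def outputFile(dateDir: File, dataset: String, extension: String, suffix: String): File =
    new File(dateDir, dataset + extension + suffix)
}
```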

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 139 - 142, The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.

Comment on lines +150 to +164
else {
val uris = map.get(quad.value).asScala
changeCount = changeCount + 1
val ret = for (uri <- uris)
yield quad.copy(
value = uri, // change object URI
context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
// none found
if(ret.isEmpty)
List(quad)
else
ret
}
}
err.println(input._1 + ": changed " + changeCount + " quads.")

⚠️ Potential issue | 🟡 Minor

changeCount over-reports — increments before checking whether a mapping exists.

At line 152 the counter is bumped for every non-literal quad, but the ret.isEmpty branch at lines 158–159 keeps the original quad unchanged when no mapping is found. The log at line 164 ("changed N quads") therefore conflates "non-literal quads processed" with "quads actually remapped". Given the rename from count to changeCount in this PR, the intended semantics appears to be the latter.

🐛 Proposed fix
             else {
               val uris = map.get(quad.value).asScala
-              changeCount = changeCount + 1
               val ret = for (uri <- uris)
                 yield quad.copy(
                   value = uri, // change object URI
                   context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
               // none found
               if(ret.isEmpty)
                 List(quad)
-              else
+              else {
+                changeCount += 1
                 ret
+              }
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 150 - 164, The changeCount is incremented for every non-literal
quad but should reflect quads actually remapped; move or adjust the increment so
it only increases when a mapping exists (i.e. when ret is nonEmpty) and count
the number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.

@sonarqubecloud

