feat(ontology): add rdfs:comment to infobox property definitions #840

anshuman9468 wants to merge 8 commits into dbpedia:master
📝 Walkthrough

Adds citation tracking and emission: Link and LinkExtractor now identify in-page citation anchors; HtmlNifExtractor collects citation URLs and emits citation-specific RDF; NIF config removes …
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant HTML as HTML Document
    participant HtmlExt as HtmlNifExtractor
    participant LinkExt as LinkExtractor
    participant CitMap as CitationMap
    participant RDF as RDF Emitter
    HTML->>HtmlExt: parse document
    HtmlExt->>HtmlExt: extract li elements id="cite_note-*" and store id→external href in CitMap
    HtmlExt->>LinkExt: traverse <a> nodes
    LinkExt->>LinkExt: detect href starting "#cite_note-*" → set citation=true, citationId
    LinkExt-->>HtmlExt: return Link objects (with citation flag/id)
    HtmlExt->>CitMap: lookup citationId → URL (if available)
    HtmlExt->>RDF: emit dbpedia:Citation triple for word
    HtmlExt->>RDF: emit its:taIdentRef with citation URL (if found)
```
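The flow above can be sketched in isolation. Below is a minimal Java illustration of the cite-note detection step; the class and method names (`CitationRef`, `citationId`) are hypothetical and not the PR's actual API, which lives in `Link`/`LinkExtractor`.

```java
import java.util.Optional;

public class CitationRef {
    // In-page citation anchors look like <a href="#cite_note-Smith-3">
    static final String CITE_PREFIX = "#cite_note-";

    /** Returns the citation id if the href targets an in-page citation note. */
    static Optional<String> citationId(String href) {
        if (href != null && href.startsWith(CITE_PREFIX)) {
            // Drop the leading '#' so the id matches the <li id="cite_note-..."> element
            return Optional.of(href.substring(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(citationId("#cite_note-Smith-3").orElse("none")); // cite_note-Smith-3
        System.out.println(citationId("/wiki/Berlin").orElse("none"));       // none
    }
}
```

The extractor would then look the returned id up in the citation map built from the page's `<li id="cite_note-*">` elements to recover the external URL.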
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java`:
- Around line 274-279: The loop in getTableCount() incorrectly references the
field/variable "paragraph" instead of the loop variable "p", which can produce
wrong counts or NPEs; update the body to use p.getHtmlStrings().size() and add a
null-safe check (skip if p is null or if p.getHtmlStrings() is null) so
getTableCount() iterates over this.getParagraphs() using the loop variable "p"
and sums sizes safely.
In
`@core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala`:
- Line 25: The current line initializes replaceString by silently falling back
to the English literal " (disambiguation)"; instead, change initialization in
DisambiguationExtractor (the replaceString assignment referencing
DisambiguationExtractorConfig.disambiguationTitlePartMap and language.wikiCode)
to fail-fast or emit an explicit warning/metric when the map has no entry for
language.wikiCode—e.g., check
DisambiguationExtractorConfig.disambiguationTitlePartMap.contains(language.wikiCode)
and if missing either throw a clear exception (fail-fast) or
log/processLogger.warn and increment a metric before using a fallback; ensure
the message includes language.wikiCode so missing configs are discoverable.
In `@core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala`:
- Around line 169-170: Normalize property.key before including it in the
InfoboxPropertyDefinitions comment: replace newline and tab characters and trim
the string the same way as used elsewhere (e.g., .replace("\n", "
").replace("\t", " ").trim) when building propertyComment in
InfoboxExtractor.scala so that the value passed to new Quad(...,
DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty,
propertyComment, splitNode.sourceIri, rdfLangStrDt) is sanitized and consistent
with the normalization used for statistics extraction.
In `@core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala`:
- Around line 42-43: The extractor's mutable field citationMap in
HtmlNifExtractor is reused across documents causing cross-page contamination;
reset or clear citationMap at the start of each document processing run (e.g.,
in the method that begins processing a page / processDocument / extract or the
function around the code handling each page before collecting cite_note-* links)
so the map is empty for each page; ensure any other per-document state
initialized in lines referenced (358-365) is also reinitialized or cleared to
avoid stale URL lookups.
- Around line 291-293: The citation branch is emitting its:taIdentRef using the
raw citation URL (citationMap.get(link.getCitationId)) without validating it;
update the code in HtmlNifExtractor.scala so that the citation path uses
UriUtils.createURI() (the same validation used in the non-citation link branch)
to parse/validate the citation URL before calling nifLinks (words +=
nifLinks(...)), and if UriUtils.createURI() fails, skip emitting the
its:taIdentRef (and optionally log the invalid citation URL) rather than writing
an invalid IRI.
📒 Files selected for processing (7)

- .github/workflows/snapshot_deploy.yml
- core/src/main/java/org/dbpedia/extraction/nif/Link.java
- core/src/main/java/org/dbpedia/extraction/nif/LinkExtractor.java
- core/src/main/resources/nifextractionconfig.json
- core/src/main/scala/org/dbpedia/extraction/mappings/DisambiguationExtractor.scala
- core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
- core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala
💤 Files with no reviewable changes (1)
- core/src/main/resources/nifextractionconfig.json
```java
public int getTableCount() {
  int count = 0;
  for (Paragraph p : this.getParagraphs()) {
    count += paragraph.getHtmlStrings().size();
  }
  return count;
}
```
Use the loop variable in getTableCount() (current code can NPE).
Line 277 uses paragraph instead of p. This can produce wrong counts and fail at runtime when paragraph is null.
💡 Proposed fix

```diff
 public int getTableCount() {
   int count = 0;
   for (Paragraph p : this.getParagraphs()) {
-    count += paragraph.getHtmlStrings().size();
+    count += p.getHtmlStrings().size();
   }
   return count;
 }
```
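The combined prompt above also asks for a null-safe guard that the diff itself does not add. A self-contained Java sketch of that hardened variant, with `Paragraph` reduced to a stub (the real class has more state):

```java
import java.util.Arrays;
import java.util.List;

public class TableCount {
    static class Paragraph {
        private final List<String> htmlStrings;
        Paragraph(List<String> htmlStrings) { this.htmlStrings = htmlStrings; }
        List<String> getHtmlStrings() { return htmlStrings; }
    }

    /** Sums HTML fragments across paragraphs, skipping null entries safely. */
    static int getTableCount(List<Paragraph> paragraphs) {
        int count = 0;
        for (Paragraph p : paragraphs) {
            if (p == null || p.getHtmlStrings() == null) continue; // null-safe skip
            count += p.getHtmlStrings().size(); // loop variable, not a field
        }
        return count;
    }

    public static void main(String[] args) {
        List<Paragraph> ps = Arrays.asList(
                new Paragraph(Arrays.asList("<table>a</table>", "<table>b</table>")),
                null,
                new Paragraph(null));
        System.out.println(getTableCount(ps)); // 2
    }
}
```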
```diff
 private val language = context.language

-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap(language.wikiCode)
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
```
Avoid silently defaulting to English when language config is missing.
Using " (disambiguation)" as a universal fallback can silently mis-handle non-English wikis and hide missing configuration. Prefer fail-fast (or at least emit explicit warning/metric) so missing disambiguationTitlePartMap entries are visible.
Suggested fix (fail-fast for missing language mapping)
```diff
-private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.getOrElse(language.wikiCode, " (disambiguation)")
+private val replaceString = DisambiguationExtractorConfig.disambiguationTitlePartMap.get(language.wikiCode).getOrElse {
+  throw new IllegalArgumentException(
+    s"Missing disambiguation title suffix for language '${language.wikiCode}' in DisambiguationExtractorConfig.disambiguationTitlePartMap"
+  )
+}
```
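The fail-fast lookup recommended above translates directly to any map-backed config. A Java sketch; the suffix table and `suffixFor` are illustrative stand-ins for `disambiguationTitlePartMap`, not the extractor's real API:

```java
import java.util.HashMap;
import java.util.Map;

public class DisambiguationSuffix {
    static final Map<String, String> SUFFIXES = new HashMap<>();
    static {
        SUFFIXES.put("en", " (disambiguation)");
        SUFFIXES.put("de", " (Begriffsklärung)");
    }

    /** Fail-fast lookup: a missing wiki code throws with the code in the message. */
    static String suffixFor(String wikiCode) {
        String suffix = SUFFIXES.get(wikiCode);
        if (suffix == null) {
            throw new IllegalArgumentException(
                    "Missing disambiguation title suffix for language '" + wikiCode + "'");
        }
        return suffix;
    }
}
```

This makes a missing config entry loud at startup instead of silently stripping an English suffix from non-English titles.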
```scala
val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)
```
Inconsistent normalization of property.key in infobox property definitions.
Line 169 injects raw property.key without normalization, while the same data is normalized elsewhere in the file (around line 215) with .replace("\n", " ").replace("\t", " ").trim for statistics extraction. Apply the same normalization here for consistency and to prevent malformed content in RDF comments.
Proposed hardening diff

```diff
-val propertyComment = "Raw property extracted from Wikipedia infobox: " + property.key
+val normalizedKey = property.key.replace("\n", " ").replace("\t", " ").trim
+val propertyComment = "Raw property extracted from Wikipedia infobox: " + normalizedKey
 quads += new Quad(language, DBpediaDatasets.InfoboxPropertyDefinitions, propertyUri, commentProperty, propertyComment, splitNode.sourceIri, rdfLangStrDt)
```
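The normalization requested above is a plain whitespace collapse. A minimal Java sketch of the same `replace`/`trim` chain (class name hypothetical):

```java
public class PropertyKeyNormalizer {
    /** Collapses newlines and tabs to spaces and trims, mirroring the statistics path. */
    static String normalize(String key) {
        return key.replace("\n", " ").replace("\t", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize(" birth\tdate\n") + "]"); // [birth date]
    }
}
```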
```scala
private val citationMap = new mutable.HashMap[String, String]()
```
Clear citationMap per document to prevent cross-page contamination.
citationMap is shared mutable state on the extractor instance and is only appended to. If the same extractor processes multiple pages, old cite_note-* IDs can resolve to stale URLs.
💡 Proposed fix

```diff
 protected def getJsoupDoc(html: String): Document = {
   val doc = Jsoup.parse(cleanHtml(html))
+  citationMap.clear()
   //extract citations
   for(note <- doc.select("li[id^=cite_note-]").asScala){
```

Also applies to: 358-365
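The per-document reset can be shown in isolation. This Java sketch (names hypothetical) models a stateful extractor whose first step on each document wipes the map, so citation ids never resolve to URLs from a previous page:

```java
import java.util.HashMap;
import java.util.Map;

public class StatefulExtractor {
    private final Map<String, String> citationMap = new HashMap<>();

    /** Processes one document; clearing first prevents cross-page contamination. */
    Map<String, String> process(Map<String, String> pageCitations) {
        citationMap.clear(); // per-document reset
        citationMap.putAll(pageCitations);
        return new HashMap<>(citationMap); // snapshot for the caller
    }

    public static void main(String[] args) {
        StatefulExtractor e = new StatefulExtractor();
        Map<String, String> page1 = new HashMap<>();
        page1.put("cite_note-1", "http://example.org/source");
        e.process(page1);
        // A second page with no citations must not see page1's entry.
        System.out.println(e.process(new HashMap<>()).isEmpty()); // true
    }
}
```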
```scala
citationMap.get(link.getCitationId) match {
  case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
  case None =>
```
Validate citation URLs before emitting its:taIdentRef.
The non-citation link path (line 296) validates URLs using UriUtils.createURI() and handles failures, but the citation path (line 292) uses citation URLs directly without validation. Malformed href attributes from HTML can produce invalid IRIs in output quads.
Proposed fix

```diff
-citationMap.get(link.getCitationId) match {
-  case Some(url) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", url, sourceUrl, null)
-  case None =>
-}
+citationMap.get(link.getCitationId) match {
+  case Some(url) =>
+    UriUtils.createURI(url) match {
+      case Success(u) => words += nifLinks(word, "http://www.w3.org/2005/11/its/rdf#taIdentRef", u.toString, sourceUrl, null)
+      case Failure(_) =>
+    }
+  case None =>
+}
```
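The validate-before-emit pattern can be sketched with `java.net.URI` standing in for `UriUtils.createURI` (an assumption; the real helper's acceptance rules may differ):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class CitationUrlFilter {
    /** Returns the URL only if it parses as an absolute URI; empty means "skip emitting". */
    static Optional<String> validated(String url) {
        try {
            URI u = new URI(url);
            if (u.isAbsolute()) {
                return Optional.of(u.toString());
            }
        } catch (URISyntaxException e) {
            // Invalid citation URL: skip the its:taIdentRef triple rather than emit a bad IRI
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validated("http://example.org/page").isPresent()); // true
        System.out.println(validated("ht tp://broken").isPresent());          // false
    }
}
```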
Added rdfs:comment to the properties extracted from Wikipedia infoboxes in the InfoboxExtractor. This provides descriptive metadata about the original raw property.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala (1)

122-135: ⚠️ Potential issue | 🟡 Minor

Heads-up: `DateFinder._date` is shared mutable state across worker threads.

Both `Workers.work` pools operate on the same outer `finder` (line 123). `DateFinder.byName(..., auto = true, ...)` lazily sets `_date` under no synchronization, so concurrent workers in this pool (and between the two pools) can race. This predates the PR, but the new `required = false` paths make it much more likely the mapping pool never assigns `_date`, and then the input pool races to auto-resolve it from multiple threads.

Consider giving each worker its own `DateFinder`, or resolving the date once up-front before kicking off the workers.

scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)
67-119: ⚠️ Potential issue | 🔴 Critical

Don't use `return` inside the `SimpleWorkers` lambda (lines 73 and 118).

Scala compiles `return` in a closure into a `throw scala.runtime.NonLocalReturnControl` targeting the enclosing method frame (`main`). Because `Workers.work` executes the lambda on worker threads via `worker.process()` (Workers.scala:234), that frame is no longer on the current thread's stack. The exception either propagates as an uncaught exception and kills the worker thread, or is silently swallowed; either way, the intended "skip this language" behavior fails.

Restructure with `if`/`else` (or a pattern match) so the lambda naturally completes instead of returning early.

🐛 Proposed fix

```diff
   val finder = new DateFinder(baseDir, language)
-  val inputFile = finder.byName(input + suffix, auto = true, required = false)
-  if (inputFile.isEmpty) {
-    err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
-    return
-  }
-
-  // use LinkedHashMap to preserve order
-  val map = new LinkedHashMap[String, String]()
-  ...
-  if (predicate == null) {
-    err.println(language.wikiCode + ": input dataset " + input + " not found or empty, skipping.")
-    return
-  }
-  err.println(language.wikiCode + ": " + count + " redirects were suppressed ...")
+  finder.byName(input + suffix, auto = true, required = false) match {
+    case None =>
+      err.println(language.wikiCode + ": input dataset " + input + " not found, skipping.")
+    case Some(_) =>
+      // existing body (map building, predicate check, closure, writing)
+      // replace the second `return` with an `if (predicate != null) { ... }`
+      // or wrap the tail in the `else` branch.
+  }
```
🧹 Nitpick comments (2)

scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala (1)

85: `.get` here is safe, but brittle; consider consistency with the new optional contract.

By the time this line runs, `_date` is guaranteed set (line 70 succeeded), and `finder.file` returns `Some(path)` without probing existence, so `.get` won't fail today. Still, once you fix the `return` control-flow above, you could thread the `Option` through for consistency rather than leaving a stray `.get`.

scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala (1)

101-117: Tolerant skip looks correct.

`dates(required = false)` plus pattern-matching both file options with an `exists` guard cleanly avoids the previous `.last`/`.get` hard-failure paths.

Optional: the `case _ =>` branch swallows which file is missing; including the names in the log would make triage easier when only one of the two datasets is absent.

♻️ Suggested log tweak

```diff
-      case _ =>
-        Console.err.println(lang.wikiCode + ": required datasets missing, skipping.")
+      case _ =>
+        val missing = Seq(
+          typeDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(typesDataset),
+          mappedTripleDatasetFileOption.filter(_.exists).map(_ => "").getOrElse(mappedTripleDataset)
+        ).filter(_.nonEmpty).mkString(", ")
+        Console.err.println(s"${lang.wikiCode}: required datasets missing ($missing), skipping.")
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 138-139: Remove the duplicate local variable declaration for
count: there are two identical lines declaring "var count = 0" in MapObjectUris
(duplicate in the same scope), which causes a compilation error; keep a single
"var count = 0" and delete the extra declaration so only one variable named
count is defined in that scope.
---
Outside diff comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 122-135: The DateFinder instance `finder` is shared across
concurrent workers (used in `Workers.work` / `SimpleWorkers`), and
`DateFinder._date` is set lazily (via `DateFinder.byName(..., auto = true)`),
causing races; fix by avoiding shared mutable `finder`: either pre-resolve the
date once before spinning up workers (call `DateFinder.byName(..., auto = true)`
/ ensure `_date` is initialized) or construct a new `DateFinder` per worker
(move creation of `new DateFinder(baseDir, language)` into the worker closure so
each worker gets its own instance); update references to `finder` in
`QuadMapper().readQuads(...)` accordingly and remove reliance on unsynchronized
lazy `_date`.
In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Around line 67-119: The lambda passed to Workers.work (SimpleWorkers)
currently uses early return statements to skip a language (the two "return"
usages inside the worker lambda), which causes NonLocalReturnControl exceptions
when run on worker threads; replace both returns by restructuring the closure so
it completes normally (e.g., wrap the worker body in an if/else or pattern match
that checks inputFile.isEmpty and predicate==null conditions), ensuring you do
not call return from inside the lambda used by Workers.work or QuadMapper
callbacks (refer to variables/methods: Workers.work, SimpleWorkers(...){
language => ... }, finder.byName, inputFile.isEmpty, and the predicate
null-check) so the worker thread simply exits the closure instead of throwing.
---
Nitpick comments:
In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala`:
- Line 85: The use of finder.byName(output + suffix).get is brittle; replace the
direct .get with proper Option handling so the optional contract is
preserved—e.g., call finder.byName(output + suffix) and either pattern-match
(case Some(path) => val outputFile = path ...) or use .getOrElse to fail with a
clear error; ensure this flows consistently with the earlier _date/return
control-flow so outputFile is only used when the Option is present (refer to
finder.byName, output, suffix and the val outputFile).
In
`@scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala`:
- Around line 101-117: The default match branch currently logs a generic
"required datasets missing" message; update it to report which dataset(s) are
missing by checking the Options returned from finder.file for typesDataset and
mappedTripleDataset and whether the underlying File.exists is false, then
include those dataset names (typesDataset and/or mappedTripleDataset) and their
expected paths in the Console.err message; modify the match block around
(typeDatasetFileOption, mappedTripleDatasetFileOption) in
TypeConsistencyCheck.scala so the case _ branch inspects typeDatasetFileOption
and mappedTripleDatasetFileOption (and their .exists) and emits a clear,
specific error like "<lang.wikiCode>: missing <datasetName> at <path>" before
skipping.
📒 Files selected for processing (6)

- core/src/main/scala/org/dbpedia/extraction/util/DateFinder.scala
- core/src/main/scala/org/dbpedia/extraction/util/Finder.scala
- scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala
- scripts/src/main/scala/org/dbpedia/extraction/scripts/QuadReader.scala
- scripts/src/main/scala/org/dbpedia/extraction/scripts/ResolveTransitiveLinks.scala
- scripts/src/main/scala/org/dbpedia/extraction/scripts/TypeConsistencyCheck.scala
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`:
- Around line 150-164: The changeCount is incremented for every non-literal quad
but should reflect quads actually remapped; move or adjust the increment so it
only increases when a mapping exists (i.e. when ret is nonEmpty) and count the
number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.
- Around line 139-142: The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.
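The counting fix in the first comment above (add the number of replacements actually produced, not one per candidate) can be illustrated independently; all names below are hypothetical:

```java
import java.util.List;
import java.util.Map;

public class RemapCounter {
    /** Remaps values through `map` into `out`; returns how many replacements were produced. */
    static int remap(List<String> values, Map<String, List<String>> map, List<String> out) {
        int changeCount = 0;
        for (String v : values) {
            List<String> uris = map.get(v);
            if (uris != null && !uris.isEmpty()) {
                out.addAll(uris);
                changeCount += uris.size(); // count produced replacements only
            } else {
                out.add(v); // unmapped values pass through unchanged
            }
        }
        return changeCount;
    }
}
```

A value mapped to two URIs contributes 2, an unmapped value contributes 0, so the final log line reports the true remapped count.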
📒 Files selected for processing (1)
scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala
```scala
val inputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + input._2))) else finder.byName(input._1 + input._2, auto = true, required = false).map(x => x: FileLike[_])
val outputFileOption: Option[FileLike[_]] = if(isExternal) Some(new RichFile(new File(secondary, input._1 + extension + input._2))) else finder.byName(input._1 + extension + input._2, auto = true, required = false).map(x => x: FileLike[_])

if (inputFileOption.isDefined && outputFileOption.isDefined && inputFileOption.get.exists) {
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect DateFinder.byName to confirm whether it returns None for files that don't exist.
fd -t f 'DateFinder.scala'
fd -t f 'DateFinder.scala' --exec cat {}

Repository: dbpedia/extraction-framework
Length of output: 1287
🏁 Script executed:
fd -t f 'Finder.scala' -o -type f -name 'Finder.scala'

Repository: dbpedia/extraction-framework
Length of output: 188
🏁 Script executed:
fd -t f 'Finder.scala' | head -20

Repository: dbpedia/extraction-framework
Length of output: 197
🏁 Script executed:
cat core/src/main/scala/org/dbpedia/extraction/util/Finder.scala

Repository: dbpedia/extraction-framework
Length of output: 4341
🏁 Script executed:
# Check if resolve() implementation or FileLike checks existence
rg -A 5 'def resolve' core/src/main/scala/org/dbpedia/extraction/util/ --max-count=10

Repository: dbpedia/extraction-framework
Length of output: 3521
🏁 Script executed:
# Look at the MapObjectUris context around lines 139-142 to understand the full picture
sed -n '130,150p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework
Length of output: 1386
🏁 Script executed:
# Check if there's a wikiDate method or alternative path construction mentioned in the review
rg 'wikiDate|wikiDir' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework
Length of output: 54
🏁 Script executed:
# Look at what finder actually is in MapObjectUris context
sed -n '100,145p' scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala

Repository: dbpedia/extraction-framework
Length of output: 2630
Fix the output path lookup to not depend on file existence.
The finder.byName(input._1 + extension + input._2, auto = true, required = false) call at line 140 will return None for the output file because DateFinder.byName() internally calls finder.dates(name, required), which filters dates based on whether files matching that name actually exist on disk. Since the output file doesn't exist yet, no matching date is found, dates() returns empty, and byName() returns None. This causes the guard at line 142 to fail, silently skipping the entire language×dataset pair.
The output path should be constructed directly (similar to the isExternal branch using RichFile and new File()) rather than looked up via byName(). For example, use finder.file(extension + input._2) to get the path in the main wiki directory, or construct the path directly for the date directory.
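Following that suggestion, here is a minimal sketch of the direct construction. It assumes `finder.file` resolves a name under the already-selected date directory, as the review text implies; the exact `Finder`/`DateFinder` method name and arguments should be verified against the current code before adopting it.

```scala
// Sketch only: keep the existence-based lookup for the input (which must
// already exist), but build the output path directly, since byName() filters
// on files that are already present on disk.
val inputFileOption: Option[FileLike[_]] =
  if (isExternal) Some(new RichFile(new File(secondary, input._1 + input._2)))
  else finder.byName(input._1 + input._2, auto = true, required = false).map(x => x: FileLike[_])

// The output file does not exist yet, so construct its path instead of
// searching for it. finder.file(...) here is the review's suggested helper,
// not a verified signature.
val outputFileOption: Option[FileLike[_]] =
  if (isExternal) Some(new RichFile(new File(secondary, input._1 + extension + input._2)))
  else Some(finder.file(input._1 + extension + input._2))
```

With this change the guard `inputFileOption.isDefined && outputFileOption.isDefined && inputFileOption.get.exists` only depends on the input file existing, which matches the intended semantics.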
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 139 - 142, The output path lookup uses finder.byName for
outputFileOption which returns None because the output file doesn't exist yet;
change outputFileOption so it doesn't depend on byName existence checks: when
isExternal keep using new RichFile(new File(secondary, input._1 + extension +
input._2)), otherwise construct the output path directly (e.g. use
finder.file(extension + input._2) or build the File for the date directory)
instead of finder.byName(input._1 + extension + input._2, ...); update the
variable outputFileOption construction so inputFileOption.isDefined &&
outputFileOption.isDefined check no longer fails due to missing on-disk output.
```scala
    else {
      val uris = map.get(quad.value).asScala
      changeCount = changeCount + 1
      val ret = for (uri <- uris)
        yield quad.copy(
          value = uri, // change object URI
          context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
      // none found
      if(ret.isEmpty)
        List(quad)
      else
        ret
    }
  }
  err.println(input._1 + ": changed " + changeCount + " quads.")
```
changeCount over-reports — increments before checking whether a mapping exists.
At line 152 the counter is bumped for every non-literal quad, but the ret.isEmpty branch at lines 158–159 keeps the original quad unchanged when no mapping is found. The log at line 164 ("changed N quads") therefore conflates "non-literal quads processed" with "quads actually remapped". Given the rename from count → changeCount in this PR, the intended semantics appears to be the latter.
🐛 Proposed fix
```diff
 else {
   val uris = map.get(quad.value).asScala
-  changeCount = changeCount + 1
   val ret = for (uri <- uris)
     yield quad.copy(
       value = uri, // change object URI
       context = if (quad.context == null) quad.context else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
   // none found
   if(ret.isEmpty)
     List(quad)
-  else
+  else {
+    changeCount += 1
     ret
+  }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/src/main/scala/org/dbpedia/extraction/scripts/MapObjectUris.scala`
around lines 150 - 164, The changeCount is incremented for every non-literal
quad but should reflect quads actually remapped; move or adjust the increment so
it only increases when a mapping exists (i.e. when ret is nonEmpty) and count
the number of produced replacements (use ret.size or ret.length) rather than
unconditionally adding 1. Locate the block in MapObjectUris.scala where
map.get(quad.value) is assigned to uris, ret is yielded from uris, and
changeCount is updated; update changeCount to add ret.size after determining ret
is nonEmpty (or compute ret first then increment accordingly) so the final
err.println reports true remapped quad count.
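The agent prompt above asks for `ret.size` rather than a flat increment, which also handles the case where one object URI maps to several replacement quads. A sketch of that variant, using the same names as the quoted block (untested against the build):

```scala
else {
  val uris = map.get(quad.value).asScala
  val ret = for (uri <- uris)
    yield quad.copy(
      value = uri, // change object URI
      context = if (quad.context == null) quad.context
                else quad.context + "&objectMappedFrom=" + quad.value) // add change provenance
  if (ret.isEmpty)
    List(quad) // no mapping found: keep the original quad, counter untouched
  else {
    changeCount += ret.size // count every replacement quad actually produced
    ret
  }
}
```

Either placement fixes the over-reporting; the `ret.size` form additionally makes the log line reflect the number of emitted replacements rather than the number of remapped source quads.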
This PR adds rdfs:comment triples to the properties extracted from Wikipedia infoboxes by the InfoboxExtractor. Each comment provides descriptive metadata about the original raw property, derived from the Wikipedia infobox key. This addresses the requirement to include more information in the infobox-property-definitions dataset so that it can be viewed as HTML in the store.
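As a rough illustration of the kind of triple this adds — the actual `Quad` constructor, dataset constant, and namespace helpers in the extraction framework may differ, so every name below is an assumption, not the PR's code:

```scala
// Hypothetical sketch: attach an rdfs:comment to a raw infobox property URI.
val propertyUri = language.propertyUri.append(infoboxKey)
val comment = "Raw property derived from the Wikipedia infobox key '" + infoboxKey + "'."

quads += new Quad(
  language,
  DBpediaDatasets.InfoboxPropertyDefinitions, // dataset name assumed
  propertyUri,                                // subject: the raw property
  RdfNamespace.RDFS.append("comment"),        // predicate: rdfs:comment
  comment,                                    // object: descriptive literal
  node.sourceIri,                             // provenance context
  new Datatype("rdf:langString"))             // language-tagged literal
```

The point is only the shape of the output: one extra language-tagged `rdfs:comment` literal per raw property URI, alongside the existing rdf:type and label triples in the infobox-property-definitions dataset.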
Summary by CodeRabbit
New Features
Bug Fixes