Change TDMLRunner to use XMLTextInfosetInputter/Outputter as default #1650
base: main
Changes from 5 commits
e630050
0ac13c9
47912e4
6ddad8d
1e8ba75
bd27e15
ee366fb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,6 +42,8 @@ import org.apache.daffodil.lib.iapi.URISchemaSource | |
| import org.apache.daffodil.lib.schema.annotation.props.LookupLocation | ||
| import org.apache.daffodil.lib.util.Maybe | ||
| import org.apache.daffodil.lib.util.Misc | ||
| import org.apache.daffodil.runtime1.infoset.InvalidInfosetException | ||
| import org.apache.daffodil.runtime1.infoset.XMLTextInfoset | ||
|
|
||
| import org.apache.commons.io.IOUtils | ||
| import org.xml.sax.XMLReader | ||
|
|
@@ -599,6 +601,14 @@ object XMLUtils { | |
|
|
||
| def removeComments(e: Node): Node = { | ||
| e match { | ||
| case x @ Elem( | ||
| null, | ||
| XMLTextInfoset.stringAsXml, | ||
| Null, | ||
| NamespaceBinding(null, null | "", _), | ||
| _* | ||
| ) => | ||
| x | ||
| case Elem(prefix, label, attribs, scope, child*) => { | ||
| val newChildren = child.filterNot { _.isInstanceOf[Comment] }.map { removeComments(_) } | ||
| Elem(prefix, label, attribs, scope, true, newChildren*) | ||
|
|
@@ -638,40 +648,111 @@ object XMLUtils { | |
| res | ||
| } | ||
|
|
||
| /** | ||
| * normalizes CRLF to LF within text nodes in non-stringAsXML elements | ||
| */ | ||
|
Member
Can you add something that explains why this is important? Something about how fields in infosets could contain LF, but those could be changed to CRLF by git's autocrlf feature. And since infoset outputters always output LF, we need to undo what git might do and normalize those CRLFs to LF. |
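To make the concern concrete, here is a minimal sketch on plain strings: a checked-out expected value whose LF was rewritten to CRLF no longer equals the LF-only output until line endings are normalized. The object and method names are hypothetical and are not part of XMLUtils.

```scala
object CrlfSketch {
  // Collapse CRLF pairs first, then any remaining bare CR, to a single LF.
  def normalize(s: String): String =
    s.replace("\r\n", "\n").replace("\r", "\n")

  def main(args: Array[String]): Unit = {
    val expectedAfterCheckout = "line1\r\nline2" // as a CRLF-converting checkout might leave it
    val actualFromOutputter = "line1\nline2"     // outputters emit LF, per the comment above
    assert(expectedAfterCheckout != actualFromOutputter)
    assert(normalize(expectedAfterCheckout) == actualFromOutputter)
  }
}
```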
||
| private def normalizeCRLFtoLF(ns: Node): Node = { | ||
| if (!ns.isInstanceOf[Elem]) return ns | ||
|
Member
Instead of this early return, can we change the default case to |
||
|
|
||
| ns match { | ||
| // NOTE: this is specifically for the stringAsXml feature as we avoid | ||
| // making changes to any of its children requiring that stringAsXml in | ||
| // the infoset match results exactly. | ||
| case e @ Elem( | ||
|
Member
We have this same kind-of-complex match/case in a number of places to check for the stringAsXml element. Thoughts on doing something like this:
    private def isStringAsXmlElem(e: Elem) = e match {
      case Elem(null, XMLTextInfoset.stringAsXml, ...) => true
      case _ => false
    }
And then these become something like
It also makes it easier to update that one function if we ever change what stringAsXml elements look like. |
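One plausible shape for such call sites is a pattern guard on the shared predicate. The sketch below uses small stand-in types instead of scala.xml so that it is self-contained; `Node`, `Elem`, `Text`, `describe`, and the literal label are assumptions for illustration, and the real check would also cover the attributes and namespace binding shown in the diff.

```scala
object StringAsXmlSketch {
  // Stand-in types; the real code matches on scala.xml.Elem.
  sealed trait Node
  final case class Text(data: String) extends Node
  final case class Elem(prefix: String, label: String, child: List[Node]) extends Node

  // Assumed label; the real value comes from XMLTextInfoset.stringAsXml.
  val stringAsXml = "stringAsXml"

  // The single place that knows what a stringAsXml element looks like.
  def isStringAsXmlElem(e: Elem): Boolean =
    e.prefix == null && e.label == stringAsXml

  // A call site: a guard replaces the repeated multi-line pattern.
  def describe(n: Node): String = n match {
    case e: Elem if isStringAsXmlElem(e) => "stringAsXml"
    case _: Elem                         => "element"
    case _: Text                         => "text"
  }
}
```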
||
| null, | ||
| XMLTextInfoset.stringAsXml, | ||
| Null, | ||
| NamespaceBinding(null, null | "", _), | ||
| _* | ||
| ) => { | ||
| e | ||
| } | ||
| case _ => { | ||
| val e = ns.asInstanceOf[Elem] | ||
| val children = e.child | ||
| val normalized = children | ||
| .map { | ||
| case Text(data) if data.contains("\r") => { | ||
| val replaced = data.replaceAll("\r\n", "\n").replaceAll("\r", "\n") | ||
| Text(replaced) | ||
| } | ||
| case c => c | ||
| } | ||
| .map(normalizeCRLFtoLF) | ||
|
Member
Might be a bit cleaner to do something like:
    case e: Elem ... => e // stringAsXml case
    case e: Elem => {
      val normalized = e.child.map(normalizeCRLFtoLF)
      if (normalized eq e.child) e
      else e.copy(child = normalized)
    }
    case Text(data) if data.contains("\r") => {
      val replaced = ...
      Text(replaced)
    }
    case _ => ns
This makes it a bit more clear that it just recurses down non-stringAsXml elements and the only thing that actually changes is Text nodes that contain CRs. |
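The single-match structure suggested here can be sketched end to end with stand-in types. Everything below (`Node`, `Elem`, `Text`, the literal label) is an assumption for illustration and is not the scala.xml API or the final XMLUtils code.

```scala
object NormalizeSketch {
  // Stand-in types; the real code works on scala.xml nodes.
  sealed trait Node
  final case class Text(data: String) extends Node
  final case class Elem(label: String, child: List[Node]) extends Node

  // Assumed label; the real value comes from XMLTextInfoset.stringAsXml.
  val stringAsXml = "stringAsXml"

  def normalizeCRLFtoLF(n: Node): Node = n match {
    // stringAsXml content must match exactly, so it is returned untouched.
    case e: Elem if e.label == stringAsXml => e
    // Other elements only recurse; they never change text themselves.
    case e: Elem =>
      val normalized = e.child.map(normalizeCRLFtoLF)
      // Structural equality, because map builds a new list for non-empty children.
      if (normalized == e.child) e else e.copy(child = normalized)
    // The only case that rewrites anything: text containing a CR.
    case Text(data) if data.contains("\r") =>
      Text(data.replace("\r\n", "\n").replace("\r", "\n"))
    case other => other
  }
}
```

The sketch compares with `==` rather than `eq` so that an unchanged subtree is actually reused; that is a choice made for this example only.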
||
| val res = | ||
| if (normalized eq children) e | ||
| else e.copy(child = normalized) | ||
| res | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * removes insignificant whitespace from between elements | ||
| */ | ||
|
|
||
| private def removeMixedWhitespace(ns: Node): Node = { | ||
| if (!ns.isInstanceOf[Elem]) return ns | ||
| val e = ns.asInstanceOf[Elem] | ||
| val children = e.child | ||
| val noMixedChildren = | ||
| if (children.exists(_.isInstanceOf[Elem])) { | ||
| children | ||
| .filter { | ||
| case Text(data) if data.matches("""\s*""") => false | ||
| case Text(data) => | ||
| throw new Exception("Element %s contains mixed data: %s".format(e.label, data)) | ||
| case _ => true | ||
| } | ||
| .map(removeMixedWhitespace) | ||
| } else { | ||
| children.filter { | ||
| // | ||
| // So this is a bit strange, but we're dropping nodes that are Empty String. | ||
| // | ||
| // In XML we cannot tell <foo></foo> where there is a Text("") child, from <foo></foo> with Nil children | ||
| // | ||
| case Text("") => false // drop empty strings | ||
| case _ => true | ||
|
|
||
| ns match { | ||
| // NOTE: this is specifically for the stringAsXml feature as we avoid | ||
| // making changes to any of its children except removing any surrounding | ||
| // whitespace, requiring that stringAsXml in the infoset match results exactly. | ||
| case e @ Elem( | ||
| null, | ||
| XMLTextInfoset.stringAsXml, | ||
| Null, | ||
| NamespaceBinding(null, null | "", _), | ||
| _* | ||
| ) => { | ||
| val (elemChildren, nonElemChildren) = e.child.partition { | ||
| _.isInstanceOf[Elem] | ||
| } | ||
| if (elemChildren.length != 1) | ||
| throw new InvalidInfosetException("stringAsXml must contain a single child element.") | ||
| nonElemChildren.foreach { | ||
| case Text(data) if data.matches("""\s*""") => // no-op, empty text siblings are fine | ||
| case x => | ||
| throw new Exception( | ||
| "%s is some kind of mixed content not allowed as a stringAsXml child".format(x) | ||
| ) | ||
| } | ||
| e.asInstanceOf[Elem].copy(child = elemChildren) | ||
| } | ||
| case _ => { | ||
| val e = ns.asInstanceOf[Elem] | ||
| val children = e.child | ||
| val noMixedChildren = | ||
| if (children.exists(_.isInstanceOf[Elem])) { | ||
| children | ||
| .filter { | ||
| case Text(data) if data.matches("""\s*""") => false | ||
| case Text(data) => | ||
| throw new Exception( | ||
| "Element %s contains mixed data: %s".format(e.label, data) | ||
| ) | ||
| case _ => true | ||
| } | ||
| .map(removeMixedWhitespace) | ||
| } else { | ||
| children.filter { | ||
| // | ||
| // So this is a bit strange, but we're dropping nodes that are Empty String. | ||
| // | ||
| // In XML we cannot tell <foo></foo> where there is a Text("") child, from <foo></foo> with Nil children | ||
| // | ||
| case Text("") => false // drop empty strings | ||
| case _ => true | ||
| } | ||
| } | ||
|
|
||
| val res = | ||
| if (noMixedChildren eq children) e | ||
| else e.copy(child = noMixedChildren) | ||
| res | ||
| } | ||
| } | ||
|
|
||
| val res = | ||
| if (noMixedChildren eq children) e | ||
| else e.copy(child = noMixedChildren) | ||
| res | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -700,6 +781,15 @@ object XMLUtils { | |
| ): NodeSeq = { | ||
| val res = n match { | ||
|
|
||
| case e @ Elem( | ||
| null, | ||
| XMLTextInfoset.stringAsXml, | ||
| Null, | ||
| NamespaceBinding(null, null | "", _), | ||
| _* | ||
| ) => | ||
| e | ||
|
|
||
| case e @ Elem(prefix, label, attributes, scope, children*) => { | ||
|
|
||
| val filteredScope = if (ns.length > 0) filterScope(scope, ns) else xml.TopScope | ||
|
|
@@ -808,7 +898,8 @@ object XMLUtils { | |
| val noPCData = convertPCDataToText(noComments) | ||
| val combinedText = coalesceAllAdjacentTextNodes(noPCData) | ||
| val noMixedWS = removeMixedWhitespace(combinedText) | ||
| noMixedWS | ||
| val noCRLFs = normalizeCRLFtoLF(noMixedWS) | ||
| noCRLFs | ||
| } | ||
|
|
||
| class XMLDifferenceException(message: String) extends Exception(message) | ||
|
|
@@ -973,6 +1064,15 @@ Differences were (path, expected, actual): | |
| } else if (checkPrefixes && prefixA != prefixB) { | ||
| // different prefix | ||
| List((zPath + "/" + labelA + "@prefix", prefixA, prefixB)) | ||
| } else if (checkPrefixes && a.scope.getURI(prefixA) != b.scope.getURI(prefixB)) { | ||
| // prefixes doesn't resolve to same namespace | ||
| List( | ||
| ( | ||
| zPath + "/" + labelA + "@prefix-namespace", | ||
| a.scope.getURI(prefixA), | ||
| b.scope.getURI(prefixB) | ||
| ) | ||
| ) | ||
|
Member
Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?
Contributor
Author
I don't think we validate the expected infoset; it looks like it's still an open ticket (https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288). That being said, I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of a failing elem example. |
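For reference, the kind of mismatch the new branch is written to report can be sketched with a plain map standing in for the namespace scope. The `Scope` alias and `prefixNamespaceDiff` are hypothetical, and whether real inputs can ever reach this branch in XMLUtils is exactly the open question in this thread.

```scala
object PrefixNamespaceSketch {
  // Hypothetical stand-in for a namespace scope: prefix -> URI.
  type Scope = Map[String, String]

  // Returns (path, expected, actual) entries, mirroring the tuple shape in the diff above.
  def prefixNamespaceDiff(
    zPath: String,
    label: String,
    prefixA: String,
    scopeA: Scope,
    prefixB: String,
    scopeB: Scope
  ): List[(String, Option[String], Option[String])] = {
    val uriA = scopeA.get(prefixA)
    val uriB = scopeB.get(prefixB)
    // Report only when the two prefixes resolve to different URIs.
    if (uriA != uriB) List((zPath + "/" + label + "@prefix-namespace", uriA, uriB))
    else Nil
  }
}
```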
||
| } else if (checkNamespaces && mappingsA != mappingsB) { | ||
| // different namespace bindings | ||
| List((zPath + "/" + labelA + "@xmlns", mappingsA, mappingsB)) | ||
|
|
@@ -1055,6 +1155,28 @@ Differences were (path, expected, actual): | |
| computeTextDiff(zPath, tA, tB, maybeType, maybeFloatEpsilon, maybeDoubleEpsilon) | ||
| thisDiff | ||
| } | ||
| case (cA: Comment, cB: Comment) => { | ||
| val thisDiff = computeTextDiff( | ||
| zPath, | ||
|
Member
It might be worth adding something to zpath to make it clear this is a comment that is a child of whatever zpath is. Maybe something like |
||
| cA.toString, | ||
| cB.toString, | ||
| maybeType, | ||
| maybeFloatEpsilon, | ||
| maybeDoubleEpsilon | ||
|
Member
Should we pass in None for type and epsilons for comments and PCData? I'm not even really sure how maybeType could be defined here. Maybe if stringAsXml contained something like this:
    <stringAsXml xmlns="">
      <foo xsi:type="xs:int">
        <!-- inline comment -->
        5
      </foo>
    </stringAsXml>
I think in that case we might be using xs:int for type-aware comparisons? And we'll try to compare the comment as if it were an int, which might break? I'm not positive, but I think we can avoid all of this if we just pass in None for these to disable type awareness.
Contributor
Author
So I tested this, and the inline comment ends up having its own type, which is None. foo is a separate iteration of computeDiff and is what actually carries the type. But as an extra precaution, will do! |
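The precaution can be illustrated with a deliberately simplified comparison. `sameText` below is hypothetical and is not the real computeTextDiff; it only shows why passing None sidesteps the question of how comment text would behave under a numeric type.

```scala
import scala.util.Try

object TypeAwareSketch {
  // Hypothetical simplification of a type-aware text comparison.
  def sameText(a: String, b: String, maybeType: Option[String]): Boolean =
    maybeType match {
      case Some("xs:int") =>
        // Type-aware path: compare parsed values; unparseable text counts as a mismatch here.
        (for {
          x <- Try(a.trim.toInt)
          y <- Try(b.trim.toInt)
        } yield x == y).getOrElse(false)
      case _ =>
        // No type information: plain string equality.
        a == b
    }
}
```

With `Some("xs:int")`, two identical comment strings would be reported as different in this sketch because neither parses as an int; with `None`, they compare equal as plain text.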
||
| ) | ||
| thisDiff | ||
| } | ||
| case (pcA: PCData, pcB: PCData) => { | ||
| val thisDiff = computeTextDiff( | ||
| zPath, | ||
| pcA.toString, | ||
| pcB.toString, | ||
| maybeType, | ||
| maybeFloatEpsilon, | ||
| maybeDoubleEpsilon | ||
| ) | ||
| thisDiff | ||
| } | ||
| case (pA: ProcInstr, pB: ProcInstr) => { | ||
| val ProcInstr(tA1label, tA1content) = pA | ||
| val ProcInstr(tB1label, tB1content) = pB | ||
|
|
||
Did you look into any other alternatives to this? It would be nice if we didn't need to have this one special case in a hidden file. It might be hard to remember that this exists if this ever leads to issues.
If we do keep this, instead of doing a directory glob, can we list the exact files that this applies to, to make it clear exactly which files have this issue? I think it's only one or two XML files? Is it also possible to add a comment to those XML files that explains that the file intentionally contains CRs for testing purposes, and that git's autocrlf feature is disabled for this file via the .gitattributes to prevent that from changing?