4 changes: 3 additions & 1 deletion .gitattributes
Original file line number Diff line number Diff line change
@@ -14,4 +14,6 @@
# limitations under the License.

# Do not include KEYS in archived source releases
/KEYS export-ignore
/KEYS export-ignore
# ensure stringAsXml file line endings are not normalized in windows
/daffodil-test/src/test/resources/org/apache/daffodil/infoset/stringAsXml/** -text
Member

Did you look into any other alternatives to this? It would be nice if we didn't need to have this one special case in a hidden file. It might be hard to remember that this exists if this ever leads to issues.

If we do keep this, instead of using a directory glob, can we list the exact files this applies to, to make it clear exactly which files have this issue? I think it's only one or two XML files. Is it also possible to add a comment to those XML files explaining that the file intentionally contains CRs for testing purposes, and that git's line-ending normalization (core.autocrlf) is disabled for this file via .gitattributes to prevent that from changing?

20 changes: 15 additions & 5 deletions daffodil-core/src/main/resources/org/apache/daffodil/xsd/tdml.xsd
Original file line number Diff line number Diff line change
@@ -224,11 +224,21 @@
</simpleType>

<simpleType name="validationType">
<restriction base="xs:token">
<enumeration value="on"/>
<enumeration value="limited"/>
<enumeration value="off"/>
</restriction>
<union>
<simpleType>
<restriction base="xs:token">
<enumeration value="on"/>
<enumeration value="limited"/>
<enumeration value="off"/>
</restriction>
</simpleType>

<simpleType>
<restriction base="xs:token">
<pattern value="[A-Za-z0-9_]+"/>
</restriction>
</simpleType>
</union>
</simpleType>

<element name="document" type="tns:documentType"/>
Original file line number Diff line number Diff line change
@@ -94,13 +94,16 @@ object Position {
* behavior of normalizing CRLF to LF, and solitary CR to LF.
* Defaults to true. Should only be changed in special circumstances
* as not normalizing CRLFs is non-standard for XML.
*
* @param removeComments True to remove comments. This is used to keep the XML as close to the original as possible
* @param removeProcInstr True to remove processing instructions. This is used to keep the XML as close to the original as possible
*/
class DaffodilConstructingLoader private[xml] (
uri: URI,
errorHandler: org.xml.sax.ErrorHandler,
addPositionAttributes: Boolean,
normalizeCRLFtoLF: Boolean
normalizeCRLFtoLF: Boolean,
removeComments: Boolean,
removeProcInstr: Boolean
) extends ConstructingParser(
{
// Note: we must open the XML carefully since it might be in some non
@@ -122,7 +125,14 @@ class DaffodilConstructingLoader private[xml] (
errorHandler: org.xml.sax.ErrorHandler,
addPositionAttributes: Boolean = false
) =
this(uri, errorHandler, addPositionAttributes, normalizeCRLFtoLF = true)
this(
uri,
errorHandler,
addPositionAttributes,
normalizeCRLFtoLF = true,
removeComments = true,
removeProcInstr = true
)

/**
* Ensures that DOCTYPES aka DTDs, if encountered, are rejected.
@@ -316,19 +326,30 @@ class DaffodilConstructingLoader private[xml] (
}

/**
* Drops comments
* Drops comments if removeComments is true
*
* This is optionally controlled by a constructor parameter.
*/
override def comment(pos: Int, s: String): Comment = {
// returning null drops comments
null
if (removeComments) {
// returning null drops comments
null
} else {
super.comment(pos, s)
}
}

/**
* Drops processing instructions
* Drops processing instructions if removeProcInstr is true
*
* This is optionally controlled by a constructor parameter.
*/
override def procInstr(pos: Int, target: String, txt: String) = {
// returning null drops processing instructions
null
if (removeProcInstr) { // returning null drops processing instructions
null
} else {
super.procInstr(pos, target, txt)
}
}

private def parseXMLPrologAttributes(
Original file line number Diff line number Diff line change
@@ -702,31 +702,20 @@ class DaffodilXMLLoader(val errorHandler: org.xml.sax.ErrorHandler)
* @param optSchemaURI Optional URI for XML schema for the XML source document.
* @param addPositionAttributes True to add dafint:file dafint:line attributes to all elements.
* Defaults to false.
* @return an scala.xml.Node (Element actually) which is the document element of the source.
*/
def load(
source: DaffodilSchemaSource,
optSchemaURI: Option[URI],
addPositionAttributes: Boolean = false
): scala.xml.Node =
load(source, optSchemaURI, addPositionAttributes, normalizeCRLFtoLF = true)

/**
* package private constructor gives access to normalizeCRLFtoLF feature.
*
* @param source The URI for the XML document which may be a XML or DFDL schema, or just XML data.
* @param optSchemaURI Optional URI for XML schema for the XML source document.
* @param addPositionAttributes True to add dafint:file dafint:line attributes to all elements.
* Defaults to false.
* @param normalizeCRLFtoLF True to normalize CRLF and isolated CR to LF. This should usually be true,
* but some special case situations may require preservation of CRLF/CR.
* @param removeComments True to remove comments. This is used to keep the XML as close to the original as possible
* @param removeProcInstr True to remove processing instructions. This is used to keep the XML as close to the original as possible
*
* @return an scala.xml.Node (Element actually) which is the document element of the source.
*/
private[xml] def load(
def load(
source: DaffodilSchemaSource,
optSchemaURI: Option[URI],
addPositionAttributes: Boolean,
normalizeCRLFtoLF: Boolean
addPositionAttributes: Boolean = false,
normalizeCRLFtoLF: Boolean = true,
removeComments: Boolean = true,
removeProcInstr: Boolean = true
): scala.xml.Node = {
//
// First we invoke the validator to explicitly validate the XML against
@@ -819,7 +808,9 @@ class DaffodilXMLLoader(val errorHandler: org.xml.sax.ErrorHandler)
source.uriForLoading,
errorHandler,
addPositionAttributes,
normalizeCRLFtoLF
normalizeCRLFtoLF,
removeComments,
removeProcInstr
)
val res =
try {
174 changes: 148 additions & 26 deletions daffodil-core/src/main/scala/org/apache/daffodil/lib/xml/XMLUtils.scala
Original file line number Diff line number Diff line change
@@ -42,6 +42,8 @@ import org.apache.daffodil.lib.iapi.URISchemaSource
import org.apache.daffodil.lib.schema.annotation.props.LookupLocation
import org.apache.daffodil.lib.util.Maybe
import org.apache.daffodil.lib.util.Misc
import org.apache.daffodil.runtime1.infoset.InvalidInfosetException
import org.apache.daffodil.runtime1.infoset.XMLTextInfoset

import org.apache.commons.io.IOUtils
import org.xml.sax.XMLReader
@@ -599,6 +601,14 @@ object XMLUtils {

def removeComments(e: Node): Node = {
e match {
case x @ Elem(
null,
XMLTextInfoset.stringAsXml,
Null,
NamespaceBinding(null, null | "", _),
_*
) =>
x
case Elem(prefix, label, attribs, scope, child*) => {
val newChildren = child.filterNot { _.isInstanceOf[Comment] }.map { removeComments(_) }
Elem(prefix, label, attribs, scope, true, newChildren*)
@@ -638,40 +648,111 @@ object XMLUtils {
res
}

/**
* normalizes CRLF to LF within text nodes in non-stringAsXML elements
*/
Member

Can you add something that explains why this is important? Something about how fields in infosets could contain LF, but could be changed to CRLF by git's autocrlf feature. Since infoset outputters always output LF, we need to undo what git might do and normalize those CRLFs to LF.
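
For illustration only, here is a minimal sketch of the string-level normalization this comment refers to. It assumes nothing beyond the behavior visible in the diff (CRLF pairs and lone CRs both become LF). The object and method names are hypothetical and are not part of Daffodil.

```scala
// Hypothetical helper, not Daffodil API: shows only the string-level rule.
object LineEndings {

  /** Replaces CRLF pairs first, then any remaining lone CR, with LF. */
  def normalize(data: String): String =
    data
      .replace("\r\n", "\n") // CRLF -> LF, so the pair does not become two LFs
      .replace("\r", "\n")   // any CR still present is a lone CR -> LF
}
```

The order matters: if lone CRs were replaced first, each CRLF would turn into two LFs. With this ordering, text that git checked out with CRLF line endings compares equal to the LF-only text an infoset outputter produces.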

private def normalizeCRLFtoLF(ns: Node): Node = {
if (!ns.isInstanceOf[Elem]) return ns
Member

Instead of this early return, can we change the default case to case e: Elem =>, and then have the default case be case _ => ns? It's maybe a bit slower since we'll have to do some matching, but it's more scala-y and I imagine it doesn't make that much of a difference in performance.


ns match {
// NOTE: this is specifically for the stringAsXml feature as we avoid
// making changes to any of its children requiring that stringAsXml in
// the infoset match results exactly.
case e @ Elem(
Member

We have this same kind-of-complex match/case in a number of places to check for the stringAsXml element. Thoughts on doing something like this:

private def isStringAsXmlElem(e: Elem) = e match {
  case Elem(null, XMLTextInfoset.stringAsXml, ...) => true
  case _ => false
}

And then these become something like

case e: Elem if isStringAsXmlElem(e) => ...

It also makes it easier to update that one function if we ever change what stringAsXml elements look like.

null,
XMLTextInfoset.stringAsXml,
Null,
NamespaceBinding(null, null | "", _),
_*
) => {
e
}
case _ => {
val e = ns.asInstanceOf[Elem]
val children = e.child
val normalized = children
.map {
case Text(data) if data.contains("\r") => {
val replaced = data.replaceAll("\r\n", "\n").replaceAll("\r", "\n")
Text(replaced)
}
case c => c
}
.map(normalizeCRLFtoLF)
Member

Might be a bit cleaner to do something like:

case e: Elem ... => e // stringAsXml case
case e: Elem => {
  val normalized = e.child.map(normalizeCRLFtoLF)
  if (normalized eq e.child) e
  else e.copy(child = normalized)
}
case Text(data) if data.contains("\r") => {
  val replaced = ...
  Text(replaced)
}
case _ => ns

This makes it a bit more clear that it just recurses down non-stringAsXml elements and the only thing that actually changes is Text nodes that contain CRs.
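
A self-contained sketch of the structure suggested above, for illustration. To avoid depending on scala.xml, it uses simplified stand-in types (Node, Elem, Text) that are hypothetical. The stringAsXml check is reduced to a label comparison, whereas the real match in the diff also inspects the prefix, attributes, and namespace binding. The label value "stringAsXml" is an assumption.

```scala
// Simplified, hypothetical stand-ins for scala.xml nodes, used only so
// this sketch compiles without any library dependency.
sealed trait Node
final case class Text(data: String) extends Node
final case class Elem(label: String, child: List[Node]) extends Node

object NormalizeSketch {
  // Assumed marker label; the real check is stricter than a label comparison.
  private val StringAsXmlLabel = "stringAsXml"

  private def isStringAsXmlElem(e: Elem): Boolean =
    e.label == StringAsXmlLabel

  /** Recurses through ordinary elements and rewrites only Text nodes containing CR. */
  def normalizeCRLFtoLF(n: Node): Node = n match {
    case e: Elem if isStringAsXmlElem(e) => e // subtree left exactly as-is
    case e: Elem => e.copy(child = e.child.map(normalizeCRLFtoLF))
    case Text(data) if data.contains("\r") =>
      Text(data.replace("\r\n", "\n").replace("\r", "\n"))
    case other => other
  }
}
```

With this shape, each case has one job: the guard skips stringAsXml subtrees, the plain Elem case only recurses, and the Text case is the only place where content changes.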

val res =
if (normalized eq children) e
else e.copy(child = normalized)
res
}
}
}

/**
* removes insignificant whitespace from between elements
*/

private def removeMixedWhitespace(ns: Node): Node = {
if (!ns.isInstanceOf[Elem]) return ns
val e = ns.asInstanceOf[Elem]
val children = e.child
val noMixedChildren =
if (children.exists(_.isInstanceOf[Elem])) {
children
.filter {
case Text(data) if data.matches("""\s*""") => false
case Text(data) =>
throw new Exception("Element %s contains mixed data: %s".format(e.label, data))
case _ => true
}
.map(removeMixedWhitespace)
} else {
children.filter {
//
// So this is a bit strange, but we're dropping nodes that are Empty String.
//
// In XML we cannot tell <foo></foo> where there is a Text("") child, from <foo></foo> with Nil children
//
case Text("") => false // drop empty strings
case _ => true

ns match {
// NOTE: this is specifically for the stringAsXml feature as we avoid
// making changes to any of its children except removing any surrounding
// whitespace, requiring that stringAsXml in the infoset match results exactly.
case e @ Elem(
null,
XMLTextInfoset.stringAsXml,
Null,
NamespaceBinding(null, null | "", _),
_*
) => {
val (elemChildren, nonElemChildren) = e.child.partition {
_.isInstanceOf[Elem]
}
if (elemChildren.length != 1)
throw new InvalidInfosetException("stringAsXml must contain a single child element.")
nonElemChildren.foreach {
case Text(data) if data.matches("""\s*""") => // no-op, empty text siblings are fine
case x =>
throw new Exception(
"%s is some kind of mixed content not allowed as a stringAsXml child".format(x)
)
}
e.asInstanceOf[Elem].copy(child = elemChildren)
}
case _ => {
val e = ns.asInstanceOf[Elem]
val children = e.child
val noMixedChildren =
if (children.exists(_.isInstanceOf[Elem])) {
children
.filter {
case Text(data) if data.matches("""\s*""") => false
case Text(data) =>
throw new Exception(
"Element %s contains mixed data: %s".format(e.label, data)
)
case _ => true
}
.map(removeMixedWhitespace)
} else {
children.filter {
//
// So this is a bit strange, but we're dropping nodes that are Empty String.
//
// In XML we cannot tell <foo></foo> where there is a Text("") child, from <foo></foo> with Nil children
//
case Text("") => false // drop empty strings
case _ => true
}
}

val res =
if (noMixedChildren eq children) e
else e.copy(child = noMixedChildren)
res
}
}

val res =
if (noMixedChildren eq children) e
else e.copy(child = noMixedChildren)
res
}

/**
@@ -700,6 +781,15 @@ object XMLUtils {
): NodeSeq = {
val res = n match {

case e @ Elem(
null,
XMLTextInfoset.stringAsXml,
Null,
NamespaceBinding(null, null | "", _),
_*
) =>
e

case e @ Elem(prefix, label, attributes, scope, children*) => {

val filteredScope = if (ns.length > 0) filterScope(scope, ns) else xml.TopScope
@@ -808,7 +898,8 @@ object XMLUtils {
val noPCData = convertPCDataToText(noComments)
val combinedText = coalesceAllAdjacentTextNodes(noPCData)
val noMixedWS = removeMixedWhitespace(combinedText)
noMixedWS
val noCRLFs = normalizeCRLFtoLF(noMixedWS)
noCRLFs
}

class XMLDifferenceException(message: String) extends Exception(message)
@@ -973,6 +1064,15 @@ Differences were (path, expected, actual):
} else if (checkPrefixes && prefixA != prefixB) {
// different prefix
List((zPath + "/" + labelA + "@prefix", prefixA, prefixB))
} else if (checkPrefixes && a.scope.getURI(prefixA) != b.scope.getURI(prefixB)) {
// prefixes don't resolve to the same namespace
List(
(
zPath + "/" + labelA + "@prefix-namespace",
a.scope.getURI(prefixA),
b.scope.getURI(prefixB)
)
)
Member

Is it possible for these new checks to fail? I thought we validated the expected infoset to make sure it was valid, which I think should check to make sure namespaces resolve?

Contributor Author

I don't think we validate the expected infoset; it looks like that's still an open ticket:

https://issues.apache.org/jira/projects/DAFFODIL/issues/DAFFODIL-288

That being said, I don't think it can actually fail, since a.getNamespace(prefixA) is usually equal to nsbA.getURI(prefixA). I cannot think of an example Elem that would fail.

} else if (checkNamespaces && mappingsA != mappingsB) {
// different namespace bindings
List((zPath + "/" + labelA + "@xmlns", mappingsA, mappingsB))
@@ -1055,6 +1155,28 @@ Differences were (path, expected, actual):
computeTextDiff(zPath, tA, tB, maybeType, maybeFloatEpsilon, maybeDoubleEpsilon)
thisDiff
}
case (cA: Comment, cB: Comment) => {
val thisDiff = computeTextDiff(
zPath,
Member

It might be worth adding something to zPath to make it clear this is a comment that is a child of whatever zPath is. Maybe something like zPath + "/@comment". The path step makes it clear it's a child, and @ is what we've been using to indicate something that is like an attribute or something other than an element. Same idea for PCData below.
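
A rough sketch of what the suggested path step could look like, for illustration. It assumes the (path, expected, actual) tuple shape used by the other diff entries in this file and stands in for computeTextDiff with a plain string comparison. The object and method names are hypothetical.

```scala
// Hypothetical sketch, not the Daffodil implementation: a plain string
// comparison stands in for computeTextDiff.
object DiffPathSketch {

  /** Returns a single (path, expected, actual) entry when the two comments differ. */
  def commentDiff(
    zPath: String,
    expected: String,
    actual: String
  ): List[(String, String, String)] =
    if (expected == actual) Nil
    else List((zPath + "/@comment", expected, actual)) // "/@comment" marks a non-element child
}
```

The same pattern would apply to PCData with a different step name.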

cA.toString,
cB.toString,
maybeType,
maybeFloatEpsilon,
maybeDoubleEpsilon
Member

Should we pass in None for the type and epsilons for comments and PCData? I'm not even really sure how maybeType could be defined here. Maybe if stringAsXml contained something like this:

<stringAsXml xmlns="">
  <foo xsi:type="xs:int">
     <!-- inline comment -->
      5
  </foo>
</stringAsXml>

I think in that case we might be using xs:int for type-aware comparisons? And we'd try to compare the comment as if it were an int, which might break? I'm not positive, but I think we can avoid all of this if we just pass in None for these to disable type awareness.

Contributor Author

So I tested this, and the inline comment ends up with its own type, which is None. foo is handled in a separate iteration of computeDiff and is what actually carries the type. But as an extra precaution, will do!

)
thisDiff
}
case (pcA: PCData, pcB: PCData) => {
val thisDiff = computeTextDiff(
zPath,
pcA.toString,
pcB.toString,
maybeType,
maybeFloatEpsilon,
maybeDoubleEpsilon
)
thisDiff
}
case (pA: ProcInstr, pB: ProcInstr) => {
val ProcInstr(tA1label, tA1content) = pA
val ProcInstr(tB1label, tB1content) = pB