Providing Input to XMLUnit

Stefan Bodewig edited this page Apr 10, 2023 · 16 revisions

Source and ISource

All core parts of XMLUnit use a single abstraction for "pieces of XML" they are supposed to work on. For Java this is javax.xml.transform.Source and for .NET we've created Org.XmlUnit.ISource which basically adds a wrapper around an XmlReader.

For Java, many implementations of this interface are part of the Java class library; for .NET we've added the corresponding implementations:

  • ReaderSource - just wraps an existing XmlReader
  • DOMSource - creates a Source from an XmlNode
  • StreamSource - creates a Source from a TextReader, Stream or a string holding a URI
  • LinqSource - creates a Source from an XNode

At the time of this writing there is no XML-Serialization based equivalent of JAXBSource for .NET.
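On the Java side, the Source implementations mentioned above ship with the JDK itself. The following is a minimal sketch using only JDK classes; the class name JdkSources and its helper methods are made up for illustration and are not part of XMLUnit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical helper class, only for illustration.
class JdkSources {
    // Wraps XML held in a String; nothing is parsed until a consumer reads the Source.
    static Source fromText(String xml) {
        return new StreamSource(new StringReader(xml));
    }

    // Wraps a DOM tree that was built in memory.
    static Source fromNewDom(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return new DOMSource(doc);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromText("<element/>").getClass().getName());
        System.out.println(fromNewDom("element").getClass().getName());
    }
}
```

Both return values can be handled through the common javax.xml.transform.Source type, which is what makes the abstraction convenient.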

In order to make it easier to create instances of Source or ISource there is a builder that provides a fluent API.

CommentLessSource

CommentLessSource is a decorator of a different source and provides XML that consists of the original source's content with all comments removed.

Use this wrapper if you want XMLUnit to ignore comments.

This class is used under the covers if you tell DiffBuilder to ignore comments.
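To illustrate the effect of removing comments, here is a small sketch that strips comment nodes from a DOM tree using only JDK classes. It is not how CommentLessSource is implemented; the class CommentStripSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration, not XMLUnit's implementation.
class CommentStripSketch {
    // Recursively removes every comment node below the given node.
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    // Counts comment nodes in the subtree rooted at node.
    static int countComments(Node node) {
        int count = node.getNodeType() == Node.COMMENT_NODE ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countComments(c);
        }
        return count;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><!-- one --><b><!-- two -->text</b></a>");
        System.out.println(countComments(doc));
        removeComments(doc);
        System.out.println(countComments(doc));
    }
}
```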

WhitespaceStrippedSource

When using XMLUnit.NET of version 2.10.0 or later, you may want to use XmlWhitespaceStrippedSource instead - see below.

WhitespaceStrippedSource is a decorator of a different source that removes all empty text nodes and trims the remaining text nodes.

If you only want to remove all "element content whitespace", i.e. text content between XML elements that is just an artifact of "pretty printing" XML then you should use ElementContentWhitespaceStrippedSource instead.

Examples

Empty text nodes are removed:

<element>
</element>

becomes

<element></element>

Text Nodes are stripped:

<element>
  foo
</element>

becomes

<element>foo</element>

If the XML content has been created in memory rather than having been deserialized from an external source, it could contain adjacent Text nodes so that

<element>
  foo
  bar
</element>

could become

<element>foobar</element>

or

<element>
foo
bar
</element>

depending on how the document has been structured. In order to get more control, the input has to be normalized before wrapping it in a WhitespaceStrippedSource - either by calling Document.normalize() or XmlDocument.Normalize(), or by using an additional NormalizedSource wrapper.
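The dependency on the document's structure can be reproduced with plain DOM code. The sketch below trims each Text child individually, as a simplified stand-in for what a whitespace-stripping decorator does; AdjacentTextSketch and its methods are hypothetical names and this is not XMLUnit's implementation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration of why adjacent Text nodes matter.
class AdjacentTextSketch {
    // Builds <element> with one Text child per entry in parts.
    static Element elementWithTextNodes(String... parts) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element e = doc.createElement("element");
            doc.appendChild(e);
            for (String part : parts) {
                e.appendChild(doc.createTextNode(part));
            }
            return e;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Trims each Text child on its own and concatenates the results.
    static String trimEachTextChild(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue().trim());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same characters, split across two Text nodes versus held in one.
        System.out.println(trimEachTextChild(elementWithTextNodes("\n  foo\n", "  bar\n")));
        System.out.println(trimEachTextChild(elementWithTextNodes("\n  foo\n  bar\n")));
    }
}
```

With two Text nodes the inner whitespace disappears because each node is trimmed separately; with a single Text node only the outer whitespace is removed.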

WhitespaceNormalizedSource

When using XMLUnit.NET of version 2.10.0 or later, you may want to use XmlWhitespaceNormalizedSource instead - see below.

WhitespaceNormalizedSource is a decorator of a different source that replaces all whitespace characters found in Text nodes with Space characters and collapses consecutive whitespace characters into a single Space.

Examples

<element>a

    b
</element>

becomes

<element>a b </element>
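The replace-and-collapse rule described above can be sketched with a single regular expression. This takes the description literally and is not the library's code, which may apply further steps; the class name WhitespaceNormalizeSketch is hypothetical, and the character class assumes XML's definition of whitespace (space, tab, carriage return, line feed).

```java
// Hypothetical illustration of the replace-and-collapse rule.
class WhitespaceNormalizeSketch {
    // Replaces every run of XML whitespace characters with a single space.
    static String normalize(String text) {
        return text.replaceAll("[ \\t\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        // Brackets make the trailing space visible.
        System.out.println("[" + normalize("a\n\n    b\n") + "]"); // prints [a b ]
    }
}
```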

NormalizedSource

NormalizedSource performs XML normalization on the wrapped document. This means adjacent Text nodes are merged into single nodes and empty Text nodes are removed (recursively). For Java, additional normalizations may be performed when wrapping a Document rather than a Node. See XmlNode.Normalize for .NET and Node#normalize as well as Document#normalizeDocument for Java.

When reading documents a parser usually puts the document into normalized form anyway. You will only need to perform XML normalization on DOM trees you have created programmatically.
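What normalization does to a programmatically created tree can be seen with the JDK's own Node#normalize. The following is a minimal sketch; NormalizeSketch and its methods are hypothetical names, and NormalizedSource itself is not used here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration of DOM normalization.
class NormalizeSketch {
    // Builds an element holding three Text children: "a", "" and "b".
    static Element buildUnnormalized() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element e = doc.createElement("element");
            doc.appendChild(e);
            e.appendChild(doc.createTextNode("a"));
            e.appendChild(doc.createTextNode(""));
            e.appendChild(doc.createTextNode("b"));
            return e;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    static int childCount(Element e) {
        return e.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Element e = buildUnnormalized();
        System.out.println(childCount(e));
        e.normalize(); // merges adjacent Text nodes and drops empty ones
        System.out.println(childCount(e) + " " + e.getFirstChild().getNodeValue());
    }
}
```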

ElementContentWhitespaceStrippedSource

When using XMLUnit.NET of version 2.10.0 or later, you may want to use XmlElementContentWhitespaceStrippedSource instead - see below.

ElementContentWhitespaceStrippedSource is a decorator of a different source that removes all text nodes solely consisting of whitespace.

The main use of this decorator is to remove all "element content whitespace", i.e. text content between XML elements that is just an artifact of "pretty printing" XML.

This class has been added with XMLUnit 2.6.0.

Examples

Empty text nodes are removed:

<element>
</element>

becomes

<element></element>

Text Nodes are not stripped:

<element>
  foo
</element>

remains

<element>
  foo
</element>
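The difference from trimming can be sketched with plain DOM code: whitespace-only Text nodes are dropped, while all other Text nodes are left untouched. This is an illustration under the assumption that "whitespace" means XML's four whitespace characters; ElementContentWhitespaceSketch and its methods are hypothetical and not the library's implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration, not XMLUnit's implementation.
class ElementContentWhitespaceSketch {
    // True if s consists only of space, tab, carriage return or line feed.
    static boolean isXmlWhitespaceOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // Recursively removes Text nodes that contain nothing but XML whitespace.
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && isXmlWhitespaceOnly(child.getNodeValue())) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root>\n  <element>\n  foo\n</element>\n</root>");
        strip(doc);
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```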

InputBuilder

With the helper class Input you can obtain an Input.Builder that creates Source instances.

Source source = Input.fromFile("file:/..../test.xml").build();

or with XSL transformations:

Source source = Input.byTransforming(Input.fromFile("file:/..../test.xml"))
		.withStylesheet(Input.fromFile("file:/..../test.xsl"))
		.build();

In .NET the code examples are very similar; see the API documentation:
Java: http://www.xmlunit.org/api/java/master/org/xmlunit/builder/Input.html
.NET: http://www.xmlunit.org/api/net/master/Org.XmlUnit.Builder/Input.html

Input.from(Object)

A special case is the helper method Input.from(Object). This generic method creates a Builder instance depending on the type of the given Object:

| Java type | .NET type | Description |
| --- | --- | --- |
| org.xmlunit.builder.Input.Builder | Org.XmlUnit.Builder.Input.IBuilder | Builder to create an XML-Source. |
| javax.xml.transform.Source | Org.XmlUnit.ISource | XML-Source |
| org.w3c.dom.Document | System.Xml.XmlDocument | dom Document |
| org.w3c.dom.Node | System.Xml.XmlNode | dom Node |
| - | System.Xml.Linq.XDocument | Linq Document |
| - | System.Xml.Linq.XNode | Linq Node |
| byte[] | byte[] | byte[] which is an XML-Content. |
| String | string | String which is an XML-Content. |
| java.io.File | - | File which contains XML. |
| java.net.URL | - | URL to an XML |
| java.net.URI | System.Uri | URI to an XML |
| java.io.InputStream | System.IO.Stream | Stream from an XML. |
| java.nio.channels.ReadableByteChannel | System.IO.TextReader | ReadableByteChannel or TextReader of an XML |
| A Jaxb Object | - | Object which can be transformed to XML by javax.xml.bind.JAXB.marshal(...) |

This method simplifies the API of DiffBuilder and CompareMatcher which can accept nearly any Object as input to generate a valid Source.
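The idea of choosing a Source based on the runtime type of the argument can be sketched with JDK classes alone. This is not the code behind Input.from(Object): the class FromObjectSketch and the method toSource are hypothetical, only a few of the Java types from the table are covered, and unknown types are rejected here instead of being handed to JAXB.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Node;

// Hypothetical illustration of type-based dispatch.
class FromObjectSketch {
    static Source toSource(Object input) {
        if (input == null) {
            throw new IllegalArgumentException("input must not be null");
        }
        if (input instanceof Source) {
            return (Source) input;                 // already a Source
        }
        if (input instanceof Node) {
            return new DOMSource((Node) input);    // covers Document as well
        }
        if (input instanceof String) {
            // The String is treated as XML content, not as a file name.
            return new StreamSource(new StringReader((String) input));
        }
        if (input instanceof byte[]) {
            return new StreamSource(new ByteArrayInputStream((byte[]) input));
        }
        if (input instanceof File) {
            return new StreamSource((File) input);
        }
        if (input instanceof InputStream) {
            return new StreamSource((InputStream) input);
        }
        throw new IllegalArgumentException(
                "unsupported input type: " + input.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(toSource("<a/>").getClass().getSimpleName());
        System.out.println(toSource(new byte[] {'<', 'a', '/', '>'}).getClass().getSimpleName());
    }
}
```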

XXE Prevention

Whenever you parse XML there is the danger of being vulnerable to XML External Entity Processing - XXE for short.

XMLUnit for Java

When passing input to XMLUnit the input is transformed to a DOM document with the help of a DocumentBuilder most of the time. Prior to XMLUnit for Java 2.6.0 the DocumentBuilder used by default was not configured to prevent XXE as Java's defaults are vulnerable. Starting with XMLUnit 2.6.0 the default DocumentBuilder is configured according to OWASP's XXE Prevention Cheat Sheet.

This means if you want to protect yourself against XXE and you use a version of XMLUnit prior to 2.6.0 you have to explicitly set a DocumentBuilderFactory that is configured properly. Likewise if you rely on DTD loading or expansion of external entities you must provide an explicit DocumentBuilderFactory when using XMLUnit 2.6.0 or later.
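A DocumentBuilderFactory hardened against XXE might look like the sketch below. It applies a selection of settings in the spirit of OWASP's cheat sheet using JDK classes only; it is an assumption about what "configured properly" could mean, not a copy of the configuration XMLUnit applies, and the class XxeSafeFactorySketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;

// Hypothetical illustration of an XXE-hardened factory.
class XxeSafeFactorySketch {
    static DocumentBuilderFactory hardened() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Rejecting DOCTYPE declarations rules out entity definitions altogether.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);
            return factory;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the XML parses, false if the parser rejects it.
    static boolean parses(DocumentBuilderFactory factory, String xml) {
        try {
            factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        DocumentBuilderFactory factory = hardened();
        System.out.println(parses(factory, "<a/>"));
        System.out.println(parses(factory, "<!DOCTYPE a [<!ENTITY x \"y\">]><a>&x;</a>"));
    }
}
```

Conversely, a test that relies on DTD loading would need a factory where these restrictions are relaxed.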

If you use the legacy module, XXE prevention is disabled by default. Starting with XMLUnit 2.6.0 the XMLUnit class has a new setEnableXXEPrevention method that can be used to enable it.

XMLUnit.NET

When using .NET 4.5.2 or newer the default settings used by XMLUnit.NET have always been safe according to OWASP's XXE Prevention Cheat Sheet. Prior to XMLUnit.NET 2.6.0 there were a few places where XmlDocument was used without explicitly disabling the XmlResolver, which means these places were vulnerable.

If you rely on XmlDocument loading external entities you will need to provide an XmlResolver of your own starting with XMLUnit.NET 2.6.0.

Whitespace in XML and Unicode

The XML specification has a very limited set of characters it considers whitespace while Unicode knows a lot more whitespace characters.

Some of the sources provided by XMLUnit are used to ignore whitespace differences - they use the trim/Trim methods of the String class respectively. For Java, trim's idea of whitespace is compatible with the XML definition (it also removes some control characters which would be illegal inside an XML document). For .NET things are different, though: Trim uses Unicode's definition of whitespace and thus may hide differences in non-XML whitespace.

Starting with XMLUnit.NET 2.10.0 new sources XmlWhitespaceStrippedSource, XmlWhitespaceNormalizedSource, and XmlElementContentWhitespaceStrippedSource have been added that only act on whitespace by XML's definition.

This means Java's WhitespaceStrippedSource acts more like .NET's XmlWhitespaceStrippedSource than WhitespaceStrippedSource - and the same is true for the other sources. "Fixing" the original .NET sources would have broken too many existing tests, so new types have been added.

Java 11 introduces a new strip method to the String class that acts like .NET's Trim and could be used to implement Source types that act like .NET's WhitespaceStrippedSource, WhitespaceNormalizedSource, and ElementContentWhitespaceStrippedSource respectively.
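The difference between the two Java methods shows up with a character that Unicode treats as whitespace but XML does not. The sketch below uses U+2003 (EM SPACE) and requires Java 11 or later; the class name TrimVersusStrip is hypothetical.

```java
// Hypothetical illustration; requires Java 11+ for String.strip().
class TrimVersusStrip {
    // trim() only removes leading/trailing characters up to U+0020.
    static String trimmed(String s) {
        return s.trim();
    }

    // strip() removes leading/trailing characters that Character.isWhitespace accepts.
    static String stripped(String s) {
        return s.strip();
    }

    public static void main(String[] args) {
        String padded = "\u2003foo\u2003"; // EM SPACE on both sides
        System.out.println(trimmed(padded).length());  // prints 5, EM SPACE is kept
        System.out.println(stripped(padded).length()); // prints 3, EM SPACE is removed
    }
}
```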