

Understanding the Well-Formedness Constraints

Besides the basic properties required of an XML document, listed in the previous section, "Defining the XML Document as a Whole," an XML document must meet certain extra criteria called well-formedness constraints. The following list describes them:

  • Parameter-entity references in the internal subset can occur only where markup declarations can occur. They can't occur inside markup declarations. The restriction is somewhat arbitrary; it was adopted to simplify the task of parsing internal DTD subsets.

  • The name in an element's end-tag must match the name in the start-tag. This is almost trivial. Few of us would expect to be able to close a <cite> tag with a </citation> tag.

  • An attribute name cannot appear more than once in the same start-tag or empty-element tag. Again, this is fairly obvious. What are you supposed to do with a malformed tag like this?

    <image src="one.gif" src="two.gif" />

  • Attribute values cannot contain direct or indirect references to external entities. This constraint is more subtle; it exists to simplify life for XML processors. In an environment in which arbitrary character encodings are possible in external entities, it would be hard to handle them all correctly in an attribute.

  • The replacement text of an entity referred to directly or indirectly in an attribute value (other than "&lt;") must not contain a <. This restriction serves simplicity and error handling. If you allowed an unescaped < inside an attribute, it would be hard to catch a missing final quote mark. Also, because you have to escape < in running text anyway, treating it differently inside an attribute value would be inconsistent.

  • Characters referred to using character references must be legal characters. In other words, you can't hide characters that would otherwise be illegal by indirection or by defining them as numeric equivalents. So, for example, &#x0000; is not a legal character no matter how you refer to it.

  • In a document without a DTD, a document with only an internal DTD subset containing no parameter-entity references, or a document whose XML declaration specifies standalone="yes", the name given in the entity reference must match that in an entity declaration. One exception is that well-formed documents need not declare any of the following entities: &amp;, &lt;, &gt;, &apos;, or &quot;.

    Basically, the declaration of a parameter entity must precede any reference to it, but there are some situations in which a non-validating XML processor stops processing entity declarations. If the non-validating processor is confident that all declarations have been read and processed, it can declare a well-formedness error and abort processing when it finds an undeclared entity. On the other hand, if any of the conditions that cause a non-validating XML processor to stop processing entity declarations has occurred, then encountering an undeclared entity is not an error. This is a complicated way of saying that non-validating XML processors might or might not catch undeclared entities, depending on the situation.

  • An entity reference must not contain the name of an unparsed entity. In short, you can't plunk binary data into the middle of text without some sort of handling mechanism declared. An unparsed entity can be referred to only by name, without the & and ; delimiters, in an attribute value declared to be of type ENTITY or ENTITIES. So the following code refers to an unparsed entity, whose identifiers are passed on to the user agent or browser, only if the src attribute has been declared as type ENTITY and myimage has been declared as an external unparsed entity with an associated notation:

        <image src="myimage" />

  • A parsed entity must not contain a recursive reference to itself, either directly or indirectly. Although dictionary makers might like to declare that a hat is a chapeau and a chapeau is a hat as if this means something, XML won't let you get away with it.

  • Parameter-entity references may appear only in the DTD. In other words, you can't carry over processing data into the final document and expect it to mean anything. You might as well insert a C statement, say printf("Hello, world\n");, onto your typewritten page and expect it to be replaced with some value and have the carriage returned for you. On the other hand, although it's a logical error to expect that to happen, such text wouldn't actually break anything either, any more than printing the above line of C code in your text generates an error. Because % is only a character and doesn't have to be escaped inside your document, it's hard to see how such an "error" would be found out. Although the constraint is factually interesting, it has no practical effect as far as error processing goes.
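Several of the constraints above can be watched in action with a non-validating parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which wraps the Expat parser); the helper is_well_formed and every sample document are hypothetical illustrations, not part of the original text, and the exact error messages depend on the Expat version, so the sketch reports only acceptance or rejection.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False if it raises a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical documents, each violating one well-formedness constraint.
SAMPLES = {
    "end-tag name differs from start-tag name":
        "<cite>text</citation>",
    "attribute name repeated in one tag":
        '<image src="one.gif" src="two.gif" />',
    "external entity referenced in an attribute value":
        '<!DOCTYPE a [<!ENTITY ext SYSTEM "ext.txt">]><a b="&ext;"/>',
    "entity used in an attribute expands to <":
        '<!DOCTYPE a [<!ENTITY less "<">]><a b="&less;"/>',
    "character reference to an illegal character":
        "<a>&#x0000;</a>",
    "undeclared entity in a document without a DTD":
        "<a>&nosuch;</a>",
    "entity reference naming an unparsed entity":
        '<!DOCTYPE a [<!NOTATION gif SYSTEM "image/gif">'
        '<!ENTITY myimage SYSTEM "one.gif" NDATA gif>]><a>&myimage;</a>',
    "entity that refers to itself indirectly":
        '<!DOCTYPE a [<!ENTITY hat "&chapeau;">'
        '<!ENTITY chapeau "&hat;">]><a>&hat;</a>',
}

# The five predefined entities need no declaration.
PREDEFINED_CONTROL = '<a b="&lt;">&amp;&gt;&apos;&quot;</a>'

# Outside the DTD, % is only a character, so this is ordinary text.
PERCENT_CONTROL = "<a>%pe;</a>"

if __name__ == "__main__":
    for label, doc in SAMPLES.items():
        verdict = "accepted" if is_well_formed(doc) else "rejected"
        print(f"{label}: {verdict}")
    print("predefined entities:", is_well_formed(PREDEFINED_CONTROL))
    print("percent sign in content:", is_well_formed(PERCENT_CONTROL))
```

The two control documents are there to show the other side of the list: the predefined entities pass without any declaration, and text that merely looks like a parameter-entity reference passes because the parser never treats it as one outside the DTD.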


