XML

Hey JSON and YAML users! Remember XML? Well, it’s still around.

What is XML?

eXtensible Markup Language

XML is a meta markup language, meaning it tells you what form the markup takes, not what markup is allowed.

“Specific” markup languages are called applications of XML, perhaps things like:

XML was designed to be:

The official specification is at: http://www.w3.org/TR/REC-xml. There is a nice annotated version of an older spec at http://www.xml.com/axml/testaxml.htm.

An Example XML Document

<people>
  <person social="235432099">
    <name>
      <first>Seán</first>
      <last>Mchunu</last>
    </name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <birthdate>1975-06-22</birthdate>
    <married spouse="355641111"/>
    <picture src="http://smchunu.name/me.jpg" width="60" height="80"/>
    <birthplace>
      <city>Los Angeles</city>
      <country>us</country>
    </birthplace>
  </person>
</people>

Document Structure

Physical Structure

When viewed as a character sequence, an XML document has both:

  1. Character Data
  2. Markup: start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions. (More on these later.)

XML documents are always Unicode character sequences. If you have a wimpy text editor, you can always use character references, for example:

<greeting>&#1055;&#1088;&#1080;&#x432;&#x435;&#1090;, &#x41c;&#1080;&#x440;!</greeting>

See how you can use hex or decimal for the codepoints?
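
To check this for yourself, here is a minimal sketch using Python's standard xml.etree.ElementTree module (Python is our choice for illustration; the notes themselves do not prescribe a language). It parses the greeting and shows that a decimal and a hexadecimal reference to the same code point produce the same character.

```python
import xml.etree.ElementTree as ET

# The greeting from above, written entirely with character references.
doc = ("<greeting>&#1055;&#1088;&#1080;&#x432;&#x435;&#1090;, "
       "&#x41c;&#1080;&#x440;!</greeting>")

# The parser replaces every reference with the character it names.
text = ET.fromstring(doc).text
print(text)

# Decimal 1055 and hexadecimal 41F name the same code point.
decimal_form = ET.fromstring("<c>&#1055;</c>").text
hex_form = ET.fromstring("<c>&#x41F;</c>").text
same = decimal_form == hex_form
print(same)
```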

Logical Structure

The document defines a structured object:

[Figure: peoplexmltree.gif, the example document drawn as a tree]

This shows elements and attributes. Note the difference between elements and tags.

There are actually 7 kinds of nodes

More on these later.
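
One way to see the logical structure is to let a parser build the tree and then walk it. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the example document; the outline helper is a hypothetical name introduced here for illustration. Note that ElementTree exposes elements and attributes but not every kind of node.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example document.
doc = """<people>
  <person social="235432099">
    <name><first>Seán</first><last>Mchunu</last></name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <married spouse="355641111"/>
  </person>
</people>"""

root = ET.fromstring(doc)

def outline(element, depth=0):
    """Return one line per element: indentation, name, and attributes."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))

# <married .../> is written as one empty-element tag, yet it is still
# an element in the tree: it has attributes but no text and no children.
married = root.find("person/married")
jobs = [job.text for job in root.iter("job")]
```

The tags are gone by the time you hold the tree: what remains are elements, each with a name, attributes, text, and children.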

Well-Formed Documents

A document is well-formed if

  1. It is derived from the start symbol of the official grammar,
  2. It meets all of the well-formedness constraints in the official spec, and
  3. Each of the parsed entities referenced in the document is well formed.

Documents that are not well-formed should be rejected by a processing program.
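
As a rough illustration of that rejection, the following sketch feeds a few strings to Python's standard xml.etree.ElementTree parser and records which ones it accepts. The is_well_formed helper and the sample strings are ours, and the check is only as thorough as the underlying parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "ok": "<a><b/></a>",
    "mismatched tags": "<a><b></a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<a x=1/>",
    "whitespace before declaration": ' <?xml version="1.0"?><a/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, accepted in results.items():
    print(label, "->", "accepted" if accepted else "rejected")
```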

XML Grammar

The full XML grammar is in the spec.

A few of the grammar rules:

    document  ::=  prolog element Misc*
    Char  ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    S  ::=  (#x20 | #x9 | #xD | #xA)+
    prolog ::=  XMLDecl? Misc* (doctypedecl Misc*)?
    XMLDecl ::=  '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
    VersionInfo ::=  S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
    Eq ::=  S? '=' S?
    VersionNum  ::=  '1.0'
    Misc ::=  Comment | PI | S
    element ::=  EmptyElemTag  | STag content ETag
    STag ::=  '<' Name (S Attribute)* S? '>'
    Attribute ::=  Name Eq AttValue
    ETag ::=  '</' Name S? '>'
    content ::=  CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
    EmptyElemTag ::=  '<' Name (S Attribute)* S? '/>'
        .
        .
        .

This means an XML document:

  • Starts with an optional XML declaration
  • Then has an optional Document Type Declaration
  • Then has a single Element
  • Spaces, comments, and processing instructions can appear sprinkled throughout (but only in certain places)

Well formedness constraints

As in all non-trivial languages, there are some things you can’t express in a context-free grammar.

Some consequences of grammar rules and WFCs

XML Declaration

Example:

<?xml version="1.0" encoding="utf-8" standalone="no"?>

If present, must be the first thing in the document. No whitespace or comments may precede it. Okay well a Unicode byte-order mark can, but that’s different. That way a processor can guess the encoding well enough to get to the encoding declaration from the first few bytes of the file.

If a BOM is present:

00 00 fe ff   UTF-32BE (1234)
ff fe 00 00   UTF-32LE (4321)
fe ff 00 3c   UTF-16BE
ff fe 3c 00   UTF-16LE
ef bb bf      UTF-8

If there's no BOM:

00 00 00 3c   UTF-32BE
3c 00 00 00   UTF-32LE
00 3c 00 3f   UTF-16BE
3c 00 3f 00   UTF-16LE
3c 3f 78 6d   UTF-8, Latin-1, ASCII, etc.
4c 6f a7 94   EBCDIC
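
The byte patterns above translate fairly directly into a lookup. Here is a minimal sketch in Python; sniff_encoding and the two tables are hypothetical names, and a real processor would go on to read the encoding declaration rather than stop at this guess. For UTF-16 with a BOM it only tests the two BOM bytes.

```python
ASCII_COMPATIBLE = "UTF-8, Latin-1, ASCII, etc. (read the encoding declaration)"

# UTF-32 is tested before UTF-16 because ff fe 00 00 also starts with ff fe.
BOM_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# Without a BOM, the bytes of "<?xm" (or just "<") give the encoding away.
NO_BOM_SIGNATURES = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x3c\x3f\x78\x6d", ASCII_COMPATIBLE),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

def sniff_encoding(data):
    """Guess the encoding family from the first bytes of a document."""
    for signature, name in BOM_SIGNATURES + NO_BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    return "UTF-8"  # no BOM and no XML declaration: assume the default

print(sniff_encoding('<?xml version="1.0"?><a/>'.encode("utf-8")))
print(sniff_encoding("\ufeff<a/>".encode("utf-16-le")))
```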

Encoding could be utf-8, utf-16, iso-10646-UCS-2, iso-10646-UCS-4, iso-8859-1, ..., iso-8859-15, iso-2022-jp, Shift_JIS, EUC-JP, ...

Standalone must be "no" (or omitted) whenever the document refers to entities that are externally declared.

Document Type Definitions (DTDs)

A document type definition (DTD) explains precisely which elements and entities may appear where in a document, and what those elements’ contents and attributes are. A DTD for the example document above:

<!ELEMENT people (person*)>
<!ELEMENT person (name, job*, birthdate, married?, picture?,
                  birthplace?)>
  <!ATTLIST person
      social ID #REQUIRED>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT job (#PCDATA)>
  <!ATTLIST job
      salaried (yes|no) "yes">
<!ELEMENT married EMPTY>
  <!ATTLIST married
      spouse IDREF #REQUIRED>
<!ELEMENT picture EMPTY>
  <!ATTLIST picture
      src CDATA #REQUIRED
      width CDATA #IMPLIED
      height CDATA #IMPLIED>
<!ELEMENT birthplace (city, (state|province)?, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT province (#PCDATA)>
<!ELEMENT country (#PCDATA)>

A document states what DTD it is using in its document type declaration. The DTD can be in another file on the same machine:

<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "mydts/people.dtd">
<people>...</people>

or at some other URI:

<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "http://someother.place.com/people.dtd">
<people>...</people>

or embedded directly within the XML document itself:

<?xml version="1.0"?>
<!DOCTYPE people [
  <!ELEMENT ...>
  ...
]>
<people>...</people>

or part external and part internal:

<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "name.dtd" [
  <!ELEMENT people (person*)>
]>
<people>...</people>
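
A parser can report what the document type declaration says. The sketch below uses Python's standard xml.dom.minidom to read the declared root name and system identifier; it only inspects the declaration and does not load the DTD or validate against it.

```python
from xml.dom import minidom

# External DTD: the declaration names a system identifier.
external = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!DOCTYPE people SYSTEM "mydts/people.dtd">'
    '<people/>')

# Internal DTD: the declarations sit inside the square brackets.
internal = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!DOCTYPE people [ <!ELEMENT people (person*)> ]>'
    '<people/>')

for doc in (external, internal):
    doctype = doc.doctype
    print(doctype.name, doctype.systemId)
```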

Defining Elements

Element content: #PCDATA, sequences, choice, ?, *, +, grouping with parentheses, mixed content, EMPTY, ANY.

Attribute types: CDATA, NMTOKEN, NMTOKENS, Enumeration (in which each value must be a name token), ENTITY, ENTITIES, ID, IDREF, IDREFS, NOTATION.

Attribute defaults: #REQUIRED, #IMPLIED, #FIXED followed by a value, or just a default value.

A document that conforms to its DTD is valid. A document can be well formed but not valid. You can write a validator.
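
To get a feel for what a validator does, note that an element content model is a regular expression over child element names. The sketch below checks only that one aspect, with three content models translated to Python regular expressions by hand. CONTENT_MODELS and content_errors are hypothetical names, and the sketch ignores attributes, text, and ID/IDREF checks, so it is far from a real validator.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child name becomes "name," so that
# a sequence of children can be matched as one string.
CONTENT_MODELS = {
    "people": r"(person,)*",
    "person": r"name,(job,)*birthdate,(married,)?(picture,)?(birthplace,)?",
    "name": r"first,(middle,)?last,",
}

def content_errors(root):
    """Return the names of elements whose child sequence breaks its model."""
    errors = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # this sketch records no rule for the element
        children = "".join(child.tag + "," for child in element)
        if re.fullmatch(model, children) is None:
            errors.append(element.tag)
    return errors

valid = ET.fromstring(
    "<people><person><name><first>A</first><last>B</last></name>"
    "<job>Clerk</job><birthdate>1975-06-22</birthdate></person></people>")

# Well formed, but birthdate comes before name, which the model forbids.
invalid = ET.fromstring(
    "<people><person><birthdate>1975-06-22</birthdate>"
    "<name><first>A</first><last>B</last></name></person></people>")

print(content_errors(valid))
print(content_errors(invalid))
```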

Defining Entities

The DTD can also contain entity declarations. In your document, or elsewhere in the DTD, you can make entity references to these entities.

A general entity is defined in the DTD:

<!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">

and referenced in the document:

<footer>This program is &notice;</footer>
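
Here is a small sketch of a general entity being expanded, using Python's standard xml.etree.ElementTree with the entity declared in an internal DTD. The surrounding footer document is assembled here for illustration, and the second half shows that a reference to an undeclared entity is rejected.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE footer [
  <!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
]>
<footer>This program is &notice;</footer>"""

# The parser substitutes the entity's replacement text for &notice;.
footer = ET.fromstring(doc)
print(footer.text)

# Referencing an entity that was never declared is an error.
try:
    ET.fromstring("<footer>&nosuch;</footer>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```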

A parameter entity is defined in the DTD:

<!ENTITY % weekdays "Mo|Tu|We|Th|Fr">

and also referenced in the DTD:

<!ATTLIST meeting day (Su|%weekdays;|Sa) "Fr">

More about elements and attributes

Comments

Comments begin with <!-- and end with --> and may not contain -- at all (except as part of the closer).

Processing Instructions

Example

<?php ......... ?>

Note the XML declaration is not a PI.
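
You can watch a parser make that distinction. In this sketch, Python's standard xml.dom.minidom parses a document with an XML declaration and one processing instruction; only the latter shows up as a processing instruction node. The sample document is ours.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="simple.css"?>'
    '<person/>')

# Collect the processing instruction nodes at the top level of the document.
pis = [node for node in doc.childNodes
       if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

for pi in pis:
    # A PI is a target name followed by free-form data.
    print(pi.target, "|", pi.data)
```

The data looks like attributes here, but to the parser it is just a string; interpreting it is up to whatever application the target names.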

Namespaces

Often you might want to make a document by mixing content from two or more separate XML applications (e.g. XHTML + MathML + SVG + RDF, say). Element and attribute names may conflict! Namespaces can partition them.

A namespace is really just a URI, but you bind prefixes to it with a namespace declaration. A namespace declaration is NOT an attribute; it only looks like one.

Here is an example, slightly modified from XML in a Nutshell by Harold and Means:

<?xml version="1.0"?>
<htm:html xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <htm:head><htm:title>Three Namespaces</htm:title></htm:head>
  <htm:body>
  <htm:h1>An ellipse and a rectangle</htm:h1>
  <svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="12cm" height="10cm">
    <svg:ellipse rx="110" ry="130"/>
    <svg:rect x="4cm" y="1cm" width="3cm" height="6cm"/>
  </svg:svg>
  <htm:p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</htm:p>
  <htm:p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</htm:p>
  </htm:body>
</htm:html>

Actually you really want to take advantage of namespace defaulting:

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <head><title>Three Namespaces</title></head>
  <body>
  <h1>An ellipse and a rectangle</h1>
  <svg xmlns="http://www.w3.org/2000/svg" width="12cm" height="10cm">
    <ellipse rx="110" ry="130"/>
    <rect x="4cm" y="1cm" width="3cm" height="6cm"/>
  </svg>
  <p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</p>
  <p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</p>
  </body>
</html>
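
The two versions mean the same thing to a namespace-aware parser. As a sketch, Python's standard xml.etree.ElementTree writes every name as {namespace URI}local-name, so cut-down copies of the two documents can be compared directly. The names helper and the shortened documents are ours.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
XLINK = "http://www.w3.org/1999/xlink"

prefixed = f"""<htm:html xmlns:htm="{XHTML}" xmlns:xlink="{XLINK}">
  <htm:body>
    <svg:svg xmlns:svg="{SVG}" width="12cm"><svg:rect x="4cm"/></svg:svg>
    <htm:p xlink:href="rectangles.html">More about rectangles</htm:p>
  </htm:body>
</htm:html>"""

defaulted = f"""<html xmlns="{XHTML}" xmlns:xlink="{XLINK}">
  <body>
    <svg xmlns="{SVG}" width="12cm"><rect x="4cm"/></svg>
    <p xlink:href="rectangles.html">More about rectangles</p>
  </body>
</html>"""

def names(xml_text):
    """List (element name, sorted attribute names) in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, sorted(el.attrib)) for el in root.iter()]

for name, attributes in names(defaulted):
    print(name, attributes)

print(names(prefixed) == names(defaulted))
```

Two details are worth noticing in the output: the xmlns declarations do not appear among the attributes, and an unprefixed attribute such as width stays in no namespace even when a default namespace is in effect.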

Styling

XML documents should contain structure, not presentation. Presentation is specified in a style sheet. Connect a style sheet to an XML document with the xml-stylesheet processing instruction. For example:

simple.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?xml-stylesheet type="text/css" href="simple.css" ?>
<!DOCTYPE person [
  <!ELEMENT person (name,phone*)>
  <!ATTLIST person id CDATA #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<person id="123456789">
  <name>Alice</name>
  <phone>8005551212</phone>
  <phone>8885551212</phone>
</person>

is connected to this stylesheet:

simple.css
name {
  display: block;
  font-size: 16pt;
  font-weight: bold;
  text-align: center;
  color: white;
  background-color: blue;
}

phone {
  display: block;
  font-size: 12pt;
  text-align: left;
  color: black;
  background-color: pink;
}

and the result looks like this:

[Figure: alice.gif, simple.xml rendered with simple.css]

The main style languages are CSS and XSL-FO (wut?). You might have to transform your XML before styling it; use XSLT for that.

Classic XML Applications

Related Technologies