eXtensible Markup Language
XML is a meta markup language, meaning it tells you what form the markup takes, not what markup is allowed.
“Specific” markup languages are called applications of XML, perhaps things like:
XML was designed to be:
The official specification is at: http://www.w3.org/TR/REC-xml. There is a nice annotated version of an older spec at http://www.xml.com/axml/testaxml.htm.
<people>
<person social="235432099">
<name>
<first>Seán</first>
<last>Mchunu</last>
</name>
<job>Teacher</job>
<job salaried="no">Clerk</job>
<birthdate>1975-06-22</birthdate>
<married spouse="355641111"/>
<picture src="http://smchunu.name/me.jpg" width="60" height="80"/>
<birthplace>
<city>Los Angeles</city>
<country>us</country>
</birthplace>
</person>
</people>
When viewed as a character sequence, an XML document has both:
XML documents are always Unicode character sequences. If you have a wimpy text editor, you can always use character entity references, for example:
<greeting>&#x41f;&#x440;&#x438;&#x432;&#x435;&#x442;, &#1052;&#1080;&#1088;!</greeting>
See how you can use hex or decimal for the codepoints?
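A quick way to convince yourself that hex and decimal references are interchangeable is to parse both forms and compare the results. This is only a sketch using Python's standard xml.etree.ElementTree module; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same six Cyrillic letters, written once with hex references
# (&#x...;) and once with decimal references (&#...;).
hex_form = "<greeting>&#x41f;&#x440;&#x438;&#x432;&#x435;&#x442;</greeting>"
dec_form = "<greeting>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;</greeting>"

# The parser replaces each reference with the character it names.
hex_text = ET.fromstring(hex_form).text
dec_text = ET.fromstring(dec_form).text

print(hex_text)
print(hex_text == dec_text)
```

Either way, the application sees plain Unicode text; the references exist only in the serialized document.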
The document defines a structured object:

This shows elements and attributes. Note the difference between elements and tags.
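To see elements and attributes as a tree rather than as a run of tags, you can load a document into a parser and poke around. The sketch below uses Python's standard xml.etree.ElementTree module on a trimmed copy of the people document; the trimming and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example document.
doc = """<people>
  <person social="235432099">
    <name><first>Seán</first><last>Mchunu</last></name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <married spouse="355641111"/>
  </person>
</people>"""

root = ET.fromstring(doc)            # the single root element
person = root.find("person")         # first child element named person

# Elements carry text content; attributes hang off the element itself.
first = person.findtext("name/first")
jobs = [job.text for job in person.findall("job")]
spouse = person.find("married").get("spouse")

print(root.tag, person.attrib)
print(first, jobs, spouse)
```

Note that married is one element even though it is written with a single empty-element tag, while each job is one element written with two tags.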
There are actually 7 kinds of nodes
More on these later.
A document is well-formed if
Documents that are not well-formed should be rejected by a processing program.
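A conforming parser does the rejecting for you. As a rough sketch (assuming Python's standard xml.etree.ElementTree, with a hypothetical helper named is_well_formed), you can see a few classic violations get refused:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it raises ParseError."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>ok</b></a>"))     # properly nested
print(is_well_formed("<a><b>oops</a></b>"))   # overlapping tags
print(is_well_formed("<a x='1' x='2'/>"))     # duplicate attribute name
print(is_well_formed("<a/><b/>"))             # more than one root element
```

Only the first call should report True. This checks well-formedness only; it says nothing about validity against a DTD.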
Find the complete XML grammar in the spec.
A few of the grammar rules:
document ::= prolog element Misc*
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S ::= (#x20 | #x9 | #xD | #xA)+
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
Eq ::= S? '=' S?
VersionNum ::= '1.0'
Misc ::= Comment | PI | S
element ::= EmptyElemTag | STag content ETag
STag ::= '<' Name (S Attribute)* S? '>'
Attribute ::= Name Eq AttValue
ETag ::= '</' Name S? '>'
content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
.
.
.
This means an XML document:
- Starts with an optional XML declaration
- Then has an optional Document Type Declaration
- Then has a single Element
- Spaces, comments, and processing instructions can appear sprinkled throughout (but only in certain places)
As with all non-trivial languages, there are some things you can’t express in a context-free grammar:
Literal < characters are not allowed in attribute values
Entities must be declared (though &lt;, &gt;, &apos;, &quot;, and &amp; are pre-declared for you)
&lt; ==> <
&amp; ==> &
]]&gt; ==> ]]>
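In practice you rarely write these escapes by hand when generating XML. Here is a small sketch, assuming Python's standard xml.sax.saxutils helpers escape and quoteattr; the sample string and the element name code are invented for the example.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = 'if a < b && c then "x" ]]> done'

escaped = escape(raw)      # replaces &, < and > with entity references
attr = quoteattr(raw)      # escapes and adds suitable surrounding quotes

# Build a document by hand, then parse it to check the round trip.
doc = f"<code note={attr}>{escaped}</code>"
elem = ET.fromstring(doc)

print(escaped)
print(elem.text == raw, elem.get("note") == raw)
```

The parser undoes the escaping, so the application gets back exactly the original string in both the element content and the attribute value.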
Example:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
If present, must be the first thing in the document. No whitespace or comments may precede it. Okay well a Unicode byte-order mark can, but that’s different. That way a processor can guess the encoding well enough to get to the encoding declaration from the first few bytes of the file.
If a BOM is present:
00 00 fe ff    UTF-32BE (1234)
ff fe 00 00    UTF-32LE (4321)
fe ff 00 3c    UTF-16BE
ff fe 3c 00    UTF-16LE
ef bb bf       UTF-8
If there's no BOM:
00 00 00 3c    UTF-32BE
3c 00 00 00    UTF-32LE
00 3c 00 3f    UTF-16BE
3c 00 3f 00    UTF-16LE
3c 3f 78 6d    UTF-8, Latin-1, ASCII, etc.
4c 6f a7 94    EBCDIC
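The guessing game can be sketched in a few lines. This is not how any particular parser does it, just an illustration of the byte patterns in Python; sniff_encoding and the two table names are hypothetical. Note that the UTF-32BE byte-order mark is 00 00 fe ff, and that only the two-byte marks are checked for UTF-16, since the bytes after them vary.

```python
# Byte-order marks. Longer prefixes come first so that ff fe 00 00
# (UTF-32LE) is not mistaken for ff fe (UTF-16LE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# No BOM: look at how the leading "<?xm" (or just "<") is encoded.
NO_BOM = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),  # UTF-8, Latin-1, ASCII, etc.
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

def sniff_encoding(data):
    """Guess the encoding family from the first bytes, or return None."""
    for prefix, name in BOMS + NO_BOM:
        if data.startswith(prefix):
            return name
    return None

print(sniff_encoding('<?xml version="1.0"?>'.encode("utf-8")))
print(sniff_encoding('<?xml version="1.0"?>'.encode("utf-16-le")))
print(sniff_encoding(b"\xef\xbb\xbf<a/>"))
```

The guess only needs to be good enough to read the encoding declaration, which then names the actual encoding.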
Encoding could be utf-8, utf-16, iso-10646-UCS-2, iso-10646-UCS-4, iso-8859-1, ..., iso-8859-15, iso-2022-jp, Shift_JIS, EUC-JP, ...
Standalone must be "no" (or omitted) whenever the
document refers to entities that are externally declared.
A document type definition (DTD) explains precisely which elements and entities may appear where in a document, and what those elements’ contents and attributes are. A DTD for the example document above:
<!ELEMENT people (person*)>
<!ELEMENT person (name, job*, birthdate, married?, picture?,
birthplace?)>
<!ATTLIST person
social ID #REQUIRED>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT job (#PCDATA)>
<!ATTLIST job
salaried (yes|no) "yes">
<!ELEMENT married EMPTY>
<!ATTLIST married
spouse IDREF #REQUIRED>
<!ELEMENT picture EMPTY>
<!ATTLIST picture
src CDATA #REQUIRED
width CDATA #IMPLIED
height CDATA #IMPLIED>
<!ELEMENT birthplace (city, (state|province)?, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT province (#PCDATA)>
<!ELEMENT country (#PCDATA)>
A document states what DTD it is using in its document type declaration. The DTD can be in another file on the same machine:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "mydts/people.dtd">
<people>...</people>
or at some other URI:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "http://someother.place.com/people.dtd">
<people>...</people>
or embedded directly within the XML document itself:
<?xml version="1.0"?>
<!DOCTYPE people [
<!ELEMENT ...>
...
]>
<people>...</people>
or part external and part internal:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "name.dtd" [
<!ELEMENT people (person*)>
]>
<people>...</people>
Element content: #PCDATA, sequences, choice, ?, *, +, grouping with parentheses, mixed content, EMPTY, ANY.
Attribute types: CDATA, NMTOKEN, NMTOKENS, Enumeration (in which each value must be a name token), ENTITY, ENTITIES, ID, IDREF, IDREFS, NOTATION.
Attribute defaults:
#IMPLIED
#REQUIRED
#FIXED
A document that conforms to its DTD is valid. A document can be well formed but not valid. You can write a validator.
The DTD can also contain entity declarations. In your document, or elsewhere in the DTD, you can make entity references to these entities.
A general entity is defined in the DTD:
<!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
and referenced in the document:
<footer>This program is &notice;</footer>
A parameter entity is defined in the DTD:
<!ENTITY % weekdays "Mo|Tu|We|Th|Fr">
and also referenced in the DTD:
<!ATTLIST meeting day (Su|%weekdays;|Sa) "Fr">
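To watch a general entity being expanded, you can declare it in an internal DTD subset and let a parser do the substitution. This sketch assumes Python's standard xml.etree.ElementTree, which expands internal general entities; it does not demonstrate parameter entities, and the tiny footer document is put together just for the example.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset and
# referenced as &notice; in the element content.
doc = """<?xml version="1.0"?>
<!DOCTYPE footer [
  <!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
]>
<footer>This program is &notice;</footer>"""

footer = ET.fromstring(doc)

# By the time the application sees the text, the reference is gone.
print(footer.text)
```

The character reference inside the entity's replacement text is resolved as well, so the copyright sign arrives as an ordinary character.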
xml:space: has either the value default or preserve.
xml:lang: identifies the language used in this element.
<!ELEMENT A (#PCDATA | B | C)*>
With this declaration, A has a mixed content model: the content of A contains arbitrary character data with any number of B’s and C’s mixed in, in any order.
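Mixed content is where tree APIs get a little awkward, because the character data is chopped up by the child elements. As an illustration only (assuming Python's standard xml.etree.ElementTree, which stores text before the first child in text and text after each child in that child's tail):

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<A>one <B>two</B> three <C/> four</A>")

# Character data before the first child lives on the parent;
# character data after a child lives in that child's tail.
pieces = [a.text] + [(child.tag, child.text, child.tail) for child in a]
print(pieces)

# itertext() walks the tree and yields all the character data in order.
flat = "".join(a.itertext())
print(flat)
```

Other APIs (DOM, for instance) model the same content as a list of interleaved text and element nodes instead.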
<![CDATA[Blah<Blah>Blah]]>
They can’t nest — the first ]]> ends the section!
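A CDATA section is purely a convenience for the person typing the document; the parser hands over the same characters either way. A minimal sketch, assuming Python's standard xml.etree.ElementTree and an invented code element:

```python
import xml.etree.ElementTree as ET

# The same content written two ways: inside a CDATA section,
# and with each markup character escaped by hand.
with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c) {}]]></code>")
with_escapes = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c) {}</code>")

print(with_cdata.text)
print(with_cdata.text == with_escapes.text)
```

Nothing in the parsed tree records which form was used.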
Comments begin with <!-- and end with -->
and may not contain -- at all (except as part of the closer).
Example
<?php ......... ?>
Note the XML declaration is not a PI.
Often you might want to make a document by mixing content from two or more separate XML applications (e.g. XHTML + MathML + SVG + RDF, say). Element and attribute names may conflict! Namespaces can partition them.
A namespace is really just a URI, but you bind prefixes to it with a namespace declaration. A namespace declaration is NOT an attribute declaration; it only looks like one.
Here is an example, slightly modified from XML in a Nutshell by Harold and Means:
<?xml version="1.0"?>
<htm:html xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink">
<htm:head><htm:title>Three Namespaces</htm:title></htm:head>
<htm:body>
<htm:h1>An ellipse and a rectangle</htm:h1>
<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="12cm" height="10cm">
<svg:ellipse rx="110" ry="130"/>
<svg:rect x="4cm" y="1cm" width="3cm" height="6cm"/>
</svg:svg>
<htm:p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</htm:p>
<htm:p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</htm:p>
</htm:body>
</htm:html>
Actually you really want to take advantage of namespace defaulting:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xlink="http://www.w3.org/1999/xlink">
<head><title>Three Namespaces</title></head>
<body>
<h1>An ellipse and a rectangle</h1>
<svg xmlns="http://www.w3.org/2000/svg" width="12cm" height="10cm">
<ellipse rx="110" ry="130"/>
<rect x="4cm" y="1cm" width="3cm" height="6cm"/>
</svg>
<p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</p>
<p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</p>
</body>
</html>
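The prefixed and defaulted versions really do name the same elements, since what matters is the namespace URI and the local name, not the prefix. Here is a sketch that checks this on cut-down versions of the two documents, assuming Python's standard xml.etree.ElementTree (which reports names as {uri}local); the helper tags is hypothetical.

```python
import xml.etree.ElementTree as ET

prefixed = """<htm:html xmlns:htm="http://www.w3.org/1999/xhtml">
  <htm:body>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg>
  </htm:body>
</htm:html>"""

defaulted = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>
  </body>
</html>"""

def tags(text):
    """List every element name in document order, as {namespace-uri}local."""
    return [elem.tag for elem in ET.fromstring(text).iter()]

print(tags(defaulted))
print(tags(prefixed) == tags(defaulted))
```

The prefixes htm and svg never show up in the result; they are just local shorthand for the URIs.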
XML documents should contain structure, not presentation.
Presentation is specified in a style sheet. Connect a style sheet
to an XML document with the xml-stylesheet
processing instruction. For example:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?xml-stylesheet type="text/css" href="simple.css" ?>
<!DOCTYPE person [
<!ELEMENT person (name,phone*)>
<!ATTLIST person id CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<person id="123456789">
<name>Alice</name>
<phone>8005551212</phone>
<phone>8885551212</phone>
</person>
is connected to this stylesheet:
name {
display: block;
font-size: 16pt;
font-weight: bold;
text-align: center;
color: white;
background-color: blue;
}
phone {
display: block;
font-size: 12pt;
text-align: left;
color: black;
background-color: pink;
}
and the result shows the name as a centered, bold, white-on-blue block, followed by each phone number as a black-on-pink block.

The main style languages are CSS and XSL-FO (wut?). You might have to transform your XML before styling it; use XSLT for that.