eXtensible Markup Language
XML is a meta markup language, meaning it tells you what form the markup takes, not what markup is allowed.
“Specific” markup languages are called applications of XML, perhaps things like:
XML was designed to be:
The official specification is at: http://www.w3.org/TR/REC-xml. There is a nice annotated version of an older spec at http://www.xml.com/axml/testaxml.htm.
<people>
  <person social="235432099">
    <name>
      <first>Seán</first>
      <last>Mchunu</last>
    </name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <birthdate>1975-06-22</birthdate>
    <married spouse="355641111"/>
    <picture src="http://smchunu.name/me.jpg" width="60" height="80"/>
    <birthplace>
      <city>Los Angeles</city>
      <country>us</country>
    </birthplace>
  </person>
</people>
When viewed as a character sequence, an XML document has both:
XML documents are always Unicode character sequences. If you have a wimpy text editor, you can always use character references, for example:
<greeting>&#x41f;&#x440;&#x438;&#1074;&#1077;&#1090;, &#x41c;&#x438;&#x440;!</greeting>
See how you can use hex or decimal for the codepoints?
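As a quick check of the point about hex and decimal code points, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. The particular mix of hex and decimal references is only an illustration; any mix denotes the same characters.

```python
import xml.etree.ElementTree as ET

# The greeting written with a mix of hex (&#x...;) and decimal (&#...;)
# character references. Which characters use which form is arbitrary.
escaped = (
    "<greeting>&#x41f;&#x440;&#x438;&#1074;&#1077;&#1090;, "
    "&#x41c;&#x438;&#x440;!</greeting>"
)
# The same greeting typed directly as Unicode text.
literal = "<greeting>Привет, Мир!</greeting>"

text_from_refs = ET.fromstring(escaped).text
text_from_literal = ET.fromstring(literal).text

# Both forms parse to the same character sequence.
print(text_from_refs == text_from_literal)
```

The parser resolves the references before the application ever sees the text, so the two documents are indistinguishable once parsed.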
The document defines a structured object:
This shows elements and attributes. Note the difference between elements and tags.
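To make the element/attribute distinction concrete, the following sketch parses a trimmed copy of the example document with Python's standard `xml.etree.ElementTree`. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example document.
doc = """<people>
  <person social="235432099">
    <name><first>Seán</first><last>Mchunu</last></name>
    <job>Teacher</job>
    <job salaried="no">Clerk</job>
    <married spouse="355641111"/>
  </person>
</people>"""

root = ET.fromstring(doc)            # the <people> element
person = root.find("person")         # a child element
social = person.get("social")        # an attribute of <person>
jobs = [job.text for job in person.findall("job")]  # element content
married = person.find("married")     # an empty element: one tag, no content

print(root.tag, social, jobs, married.get("spouse"))
```

Note that `married` is one element written with a single empty-element tag, while each `job` is one element written with two tags (start and end).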
There are actually 7 kinds of nodes
More on these later.
A document is well-formed if
Documents that are not well-formed should be rejected by a processing program.
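Here is a small sketch of what "rejected by a processing program" looks like in practice, assuming Python's standard `xml.etree.ElementTree` (which is built on the expat parser). The helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<a><b>ok</b></a>",      # fine
    "<a><b></a></b>",        # tags improperly nested
    "<a/><b/>",              # more than one root element
    '<a x="1" x="2"/>',      # same attribute twice
    '<a x="<"/>',            # raw < inside an attribute value
]
for sample in samples:
    print(sample, is_well_formed(sample))
```

Only the first sample parses; the parser raises `ParseError` on each of the others instead of trying to guess what was meant.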
Find the XML grammar in the spec, or here.
A few of the grammar rules:
document     ::= prolog element Misc*
Char         ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S            ::= (#x20 | #x9 | #xD | #xA)+
prolog       ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl      ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo  ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
Eq           ::= S? '=' S?
VersionNum   ::= '1.0'
Misc         ::= Comment | PI | S
element      ::= EmptyElemTag | STag content ETag
STag         ::= '<' Name (S Attribute)* S? '>'
Attribute    ::= Name Eq AttValue
ETag         ::= '</' Name S? '>'
content      ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
. . .
This means an XML document:
- Starts with an optional XML declaration
- Then has an optional Document Type Declaration
- Then has a single Element
- Spaces, comments, and processing instructions can appear sprinkled throughout (but only in certain places)
Like all non-trivial languages, there are some things you can’t express in a context free grammar....
- No < characters are allowed in attribute values
- The entities &lt;, &gt;, &apos;, &quot;, and &amp; are pre-declared for you
&lt; ==> <
&amp; ==> &
]]&gt; ==> ]]>
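The escapes above can be applied mechanically. This is a minimal sketch using `escape` and `unescape` from Python's standard `xml.sax.saxutils`; the sample string is invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "1 < 2 & ]]>"

# escape() rewrites &, <, and > so the text is safe as element content.
escaped = escape(raw)
print(escaped)

# A parser turns the escapes back into the original characters.
round_trip = ET.fromstring("<t>" + escaped + "</t>").text

# unescape() reverses escape() without involving a parser.
restored = unescape(escaped)
```

Escaping `>` everywhere is more than XML requires, but it is the simplest way to make sure the sequence `]]>` never appears in character data.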
Example:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
If present, must be the first thing in the document. No whitespace or comments may precede it. Okay well a Unicode byte-order mark can, but that’s different. That way a processor can guess the encoding well enough to get to the encoding declaration from the first few bytes of the file.
If a BOM is present:
00 00 fe ff    UTF-32BE (1234)
ff fe 00 00    UTF-32LE (4321)
fe ff 00 3c    UTF-16BE
ff fe 3c 00    UTF-16LE
ef bb bf       UTF-8
If there's no BOM:
00 00 00 3c    UTF-32BE
3c 00 00 00    UTF-32LE
00 3c 00 3f    UTF-16BE
3c 00 3f 00    UTF-16LE
3c 3f 78 6d    UTF-8, Latin-1, ASCII, etc.
4c 6f a7 94    EBCDIC
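The byte patterns above translate directly into a lookup. The sketch below is an illustration, not how any particular parser does it: the function name `sniff_encoding` and the labels it returns are made up here, and a real processor would go on to read the encoding declaration after this first guess.

```python
def sniff_encoding(head: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes."""
    signatures = [
        # With a byte-order mark. The 4-byte marks are checked before the
        # 2-byte ones because ff fe is a prefix of ff fe 00 00.
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xfe\xff", "UTF-16BE"),
        (b"\xff\xfe", "UTF-16LE"),
        (b"\xef\xbb\xbf", "UTF-8"),
        # Without a BOM: how "<?xm" (or just "<") looks in each encoding.
        (b"\x00\x00\x00\x3c", "UTF-32BE"),
        (b"\x3c\x00\x00\x00", "UTF-32LE"),
        (b"\x00\x3c\x00\x3f", "UTF-16BE"),
        (b"\x3c\x00\x3f\x00", "UTF-16LE"),
        (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),
    ]
    for prefix, name in signatures:
        if head.startswith(prefix):
            return name
    return "unknown"

print(sniff_encoding("<?xml version='1.0'?>".encode("utf-16-le")))
print(sniff_encoding(b"<?xml version='1.0' encoding='iso-8859-1'?>"))
```

For the "ASCII-compatible" case the guess is deliberately coarse: it is only good enough to read the declaration, which then names the actual encoding.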
Encoding could be utf-8, utf-16, iso-10646-UCS-2, iso-10646-UCS-4, iso-8859-1, ..., iso-8859-15, iso-2022-jp, Shift_JIS, EUC-JP, ...
Standalone must be "no" (or omitted) whenever the document refers to entities that are externally declared.
A document type definition (DTD) explains precisely which elements and entities may appear where in a document, and what those elements’ contents and attributes are. A DTD for the example document above:
<!ELEMENT people (person*)>
<!ELEMENT person (name, job*, birthdate, married?, picture?, birthplace?)>
<!ATTLIST person social ID #REQUIRED>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT job (#PCDATA)>
<!ATTLIST job salaried (yes|no) "yes">
<!ELEMENT married EMPTY>
<!ATTLIST married spouse IDREF #REQUIRED>
<!ELEMENT picture EMPTY>
<!ATTLIST picture src CDATA #REQUIRED
                  width CDATA #IMPLIED
                  height CDATA #IMPLIED>
<!ELEMENT birthplace (city, (state|province), country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT province (#PCDATA)>
<!ELEMENT country (#PCDATA)>
A document states what DTD it is using in its document type declaration. The DTD can be in another file on the same machine:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "mydts/people.dtd">
<people>...</people>
or at some other URI:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "http://someother.place.com/people.dtd">
<people>...</people>
or embedded directly within the XML document itself:
<?xml version="1.0"?>
<!DOCTYPE people [
  <!ELEMENT ...>
  ...
]>
<people>...</people>
or part external and part internal:
<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "name.dtd" [
  <!ELEMENT people (person*)>
]>
<people>...</people>
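A processor can read the document type declaration without fetching the DTD. This is a minimal sketch with Python's standard `xml.dom.minidom`, which by default records the declaration but does not load the external file; the one-element document is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!DOCTYPE people SYSTEM "mydts/people.dtd">'
    '<people/>'
)

doctype = doc.doctype
# The name in the declaration must match the root element's name.
print(doctype.name, doctype.systemId, doc.documentElement.tagName)
```

Because the parser here is non-validating, the document is accepted even though `mydts/people.dtd` does not exist on the machine running the code.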
Element content: #PCDATA, sequences, choice, ?, *, +, grouping with parentheses, mixed content, EMPTY, ANY.
Attribute types: CDATA, NMTOKEN, NMTOKENS, Enumeration (in which each value must be a name token), ENTITY, ENTITIES, ID, IDREF, IDREFS, NOTATION.
Attribute defaults:
#IMPLIED
#REQUIRED
#FIXED
A document that conforms to its DTD is valid. A document can be well formed but not valid. You can write a validator.
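As a hint of how a validator might start, note that a DTD content model is essentially a regular expression over child element names. The sketch below checks just one rule, the content model of `person`, which has been translated by hand into a Python regex. The names `PERSON_MODEL` and `children_match` are invented for this example, and a real validator would also have to check attributes, IDs, and every other declaration.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model
#   (name, job*, birthdate, married?, picture?, birthplace?)
# Each child name is followed by a comma so names cannot run together.
PERSON_MODEL = re.compile(
    r"name,(job,)*birthdate,(married,)?(picture,)?(birthplace,)?"
)

def children_match(element, model):
    """Check the sequence of child element names against a content model."""
    child_names = "".join(child.tag + "," for child in element)
    return model.fullmatch(child_names) is not None

good = ET.fromstring(
    "<person><name/><job/><job/><birthdate/><married/></person>"
)
bad = ET.fromstring("<person><job/><name/><birthdate/></person>")

print(children_match(good, PERSON_MODEL))
print(children_match(bad, PERSON_MODEL))
```

Both sample documents are well-formed; only the first has its children in an order the content model allows.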
The DTD can also contain entity declarations. In your document, or elsewhere in the DTD, you can make entity references to these entities.
A general entity is defined in the DTD:
<!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
and referenced in the document:
<footer>This program is &notice;</footer>
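To see a general entity being expanded, here is a small sketch with Python's standard `xml.etree.ElementTree`. It assumes the entity is declared in an internal DTD subset so the example is self-contained; the character reference for the copyright sign is written in hex as `&#xa9;`.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE footer [
  <!ENTITY notice "Copyright &#xa9; 2003 Ticketmaster">
]>
<footer>This program is &notice;</footer>"""

# The parser replaces &notice; with the entity's replacement text.
footer_text = ET.fromstring(doc).text
print(footer_text)
```

The application only ever sees the expanded text; the reference itself does not survive parsing.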
A parameter entity is defined in the DTD:
<!ENTITY % weekdays "Mo|Tu|We|Th|Fr">
and also referenced in the DTD:
<!ATTLIST meeting day (Su|%weekdays;|Sa) "Fr">
xml:space: has either the value default or preserve.
xml:lang: identifies the language used in this element.
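These attributes use the reserved `xml` prefix, which never needs to be declared. The sketch below shows how Python's standard `xml.etree.ElementTree` reports them; the sample element and the constant name `XML_NS` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The xml prefix is permanently bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

para = ET.fromstring('<p xml:lang="en-GB" xml:space="preserve">  colour  </p>')

lang = para.get("{" + XML_NS + "}lang")
space = para.get("{" + XML_NS + "}space")
print(lang, space, repr(para.text))
```

The parser keeps the whitespace in the text either way; `xml:space="preserve"` is a signal to downstream applications about how to treat it.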
If an element is declared like this:
<!ELEMENT A (#PCDATA | B | C)*>
then A has a mixed content model, and the content of A contains arbitrary character data with any number of B’s and C’s mixed in, in any order.
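Mixed content shows up in a parsed tree as text interleaved with child elements. This sketch uses Python's standard `xml.etree.ElementTree`, which stores the text before the first child in `.text` and the text following each child in that child's `.tail`. The sample content is made up.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<A>one<B/>two<C>see</C>three</A>")

# Character data directly inside A, in document order.
pieces = [a.text] + [child.tail for child in a]
child_names = [child.tag for child in a]

# All character data in A, including text nested inside children.
all_text = "".join(a.itertext())
print(pieces, child_names, all_text)
```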
<![CDATA[Blah<Blah>Blah]]>
CDATA sections can’t nest: the first ]]> ends the section!
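A CDATA section is only a way of writing character data, so a parser hands back plain text. Here is a minimal check with Python's standard `xml.etree.ElementTree`; the wrapping `code` element is invented for the example.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[Blah<Blah>Blah]]></code>")
with_escapes = ET.fromstring("<code>Blah&lt;Blah&gt;Blah</code>")

# Both spellings produce the same character data.
print(with_cdata.text == with_escapes.text)
```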
Comments begin with <!-- and end with --> and may not contain -- at all (except as part of the closer).
Example
<?php ......... ?>
Note the XML declaration is not a PI.
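The difference between the XML declaration and a processing instruction is visible in a DOM tree. This sketch uses Python's standard `xml.dom.minidom` with an `xml-stylesheet` instruction as the sample PI; the one-element document is made up.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="simple.css"?>'
    '<person/>'
)

# Only the PI and the root element are nodes; the XML declaration is not.
top_level = [node.nodeName for node in doc.childNodes]
pi = doc.firstChild
print(top_level, pi.target, pi.data)
```

A PI has a target (the name right after `<?`) and a data string. The data may look like attributes, but the parser treats it as opaque text.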
Often you might want to make a document by mixing content from two or more separate XML applications (e.g. XHTML + MathML + SVG + RDF, say). Element and attribute names may conflict! Namespaces can partition them.
A namespace is really just a URI, but you bind prefixes to it with a namespace declaration. A namespace declaration is NOT an attribute declaration; it only looks like one.
Here is an example, slightly modified from XML in a Nutshell by Harold and Means:
<?xml version="1.0"?>
<htm:html xmlns:htm="http://www.w3.org/1999/xhtml"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <htm:head><htm:title>Three Namespaces</htm:title></htm:head>
  <htm:body>
    <htm:h1>An ellipse and a rectangle</htm:h1>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="12cm" height="10cm">
      <svg:ellipse rx="110" ry="130"/>
      <svg:rect x="4cm" y="1cm" width="3cm" height="6cm"/>
    </svg:svg>
    <htm:p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</htm:p>
    <htm:p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</htm:p>
  </htm:body>
</htm:html>
Actually you really want to take advantage of namespace defaulting:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <head><title>Three Namespaces</title></head>
  <body>
    <h1>An ellipse and a rectangle</h1>
    <svg xmlns="http://www.w3.org/2000/svg" width="12cm" height="10cm">
      <ellipse rx="110" ry="130"/>
      <rect x="4cm" y="1cm" width="3cm" height="6cm"/>
    </svg>
    <p xlink:type="simple" xlink:href="ellipses.html">More about ellipses</p>
    <p xlink:type="simple" xlink:href="rectangles.html">More about rectangles</p>
  </body>
</html>
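Prefixes and defaulting are two spellings of the same names. The sketch below parses cut-down versions of both documents with Python's standard `xml.etree.ElementTree`, which reports each element as `{namespace-URI}local-name`. The helper `names` is made up for the example.

```python
import xml.etree.ElementTree as ET

prefixed = """<htm:html xmlns:htm="http://www.w3.org/1999/xhtml">
  <htm:body>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
      <svg:rect x="4cm" y="1cm" width="3cm" height="6cm"/>
    </svg:svg>
  </htm:body>
</htm:html>"""

defaulted = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg xmlns="http://www.w3.org/2000/svg">
      <rect x="4cm" y="1cm" width="3cm" height="6cm"/>
    </svg>
  </body>
</html>"""

def names(text):
    """List the namespace-qualified name of every element, in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]

for name in names(defaulted):
    print(name)
print(names(prefixed) == names(defaulted))
```

The prefix itself never reaches the application; only the URI and the local name matter. The unprefixed attributes on `rect` stay in no namespace in both versions, since defaulting applies to elements only.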
XML documents should contain structure, not presentation. Presentation is specified in a style sheet. Connect a style sheet to an XML document with the xml-stylesheet processing instruction. For example:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?xml-stylesheet type="text/css" href="simple.css" ?>
<!DOCTYPE person [
  <!ELEMENT person (name,phone*)>
  <!ATTLIST person id CDATA #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<person id="123456789">
  <name>Alice</name>
  <phone>8005551212</phone>
  <phone>8885551212</phone>
</person>
is connected to this stylesheet:
name {
  display: block;
  font-size: 16pt;
  font-weight: bold;
  text-align: center;
  color: white;
  background-color: blue;
}
phone {
  display: block;
  font-size: 12pt;
  text-align: left;
  color: black;
  background-color: pink;
}
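and the result shows the name as a centered, bold, white-on-blue block, with each phone number below it as a left-aligned, black-on-pink block.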
The main style languages are CSS and XSL-FO (wut?). You might have to first transform your XML before styling it; use XSLT for that.