Introduction to HTML

Most people can hack together a little bit of HTML, but there’s actually a lot to this technology.

What Is It?

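HTML is a markup language used in online document publishing, in creating form-based applications, and in creating full-fledged interactive applications.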

HTML documents are rendered by both visual and aural user agents. Make sure to author documents that work properly on both kinds of agents. For example, don’t author documents for which color has semantic meaning. And make sure to mark up abbreviations, language changes, and stress so that aural agents can read properly.
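As a small illustration of the advice above, here is a sketch of markup that stays meaningful to an aural user agent. The sentences and the title text are made up for the example.

```html
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Accessible markup sketch</title>
  </head>
  <body>
    <!-- abbr supplies the expansion of an abbreviation -->
    <p>The <abbr title="World Wide Web Consortium">W3C</abbr> published HTML 4.</p>

    <!-- lang marks a language change so a speech synthesizer can switch pronunciation -->
    <p>She greeted us with a cheerful <span lang="fr">bonjour</span>.</p>

    <!-- em puts the stress in the markup itself, not only in visual styling -->
    <p>You <em>must</em> save your work before closing the window.</p>

    <!-- The status is stated in words, so its meaning does not depend on color -->
    <p>Build status: <strong>failed</strong></p>
  </body>
</html>
```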

History

Read Dave Raggett’s A History of HTML, covering the period 1989-1998.

HTML has undergone a few revisions since its inception. The interesting versions are:

Version | Date | Notes | Docs
HTML 1.0 | 1991 | Early version, used before most people took note. |
HTML 2.0 | 1994 | First version to get an official spec. | W3C, RFC
HTML 3.0 | | Very ambitious. Never actually implemented. |
HTML 3.2 | 1997 | A scaling back of 3.0. | W3C
HTML 4 | 1998 (4.0), 1999 (4.01) | More multimedia support, better accessibility, better internationalization. | W3C
HTML 5 | 2008-present (still evolving) | The modern version. | Living Standard

There was a time when most documents on the web were written in HTML 4 or earlier, but that time has long passed, and there is really no reason to learn the old versions anymore.

HTML5 is current. You should not author any new documents with earlier versions. You should be aware that they exist, though.

Official Specification

The official specification is known as the HTML Living Standard.

It is maintained by the WHATWG (Web Hypertext Application Technology Working Group).

Language Basics

Since you’ve already written your own simple web apps, let’s take a look at HTML through a computer science (language) lens, beginning with the question “what is the basic structure of an HTML document?”

Surface Syntax

The surface syntax of an HTML document, like that of any computer-friendly language, is a sequence of characters collected into tokens and assembled into phrases according to some rules. Here is an example document:

<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Hello</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>Hello, world, this is my <a href="fido.html">dog</a>.
      <img src="fido.jpg" alt="fido">
      <!-- That wasn't too bad -->
    </p>
  </body>
</html>

The first line is a doctype that identifies the document as HTML5. There are other doctypes that tell the browser to use a different version of HTML. If you omit the doctype, the browser will resort to quirks mode — a rendering mode that tries to make old web pages written for hacked and buggy browsers back in the day render sort of the way they were intended.

Never use quirks mode.

It’s gross, but a necessary evil, for now.

Always use <!doctype html>.
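One way to check which rendering mode a browser chose is to read document.compatMode from a script. The page below is a minimal sketch; the id "mode" is an arbitrary name chosen for this example.

```html
<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Rendering mode check</title>
  </head>
  <body>
    <p id="mode"></p>
    <script>
      // document.compatMode is "BackCompat" in quirks mode and "CSS1Compat" otherwise
      document.getElementById("mode").textContent = document.compatMode;
    </script>
  </body>
</html>
```

Deleting the first line and reloading should make the reported mode change, which is a quick way to see the effect of omitting the doctype.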

After the doctype come the elements, defining a tree with root element html. Elements are written with start tags and end tags. Sometimes the end tag can be omitted, and sometimes the start tag can be omitted, too! Elements may have attributes. The content of an element may include other elements, as well as comments, text, and a few other things.
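To make tag omission concrete, here is a sketch of a document that leaves out every tag the HTML syntax allows it to. The text content is invented for the example.

```html
<!doctype html>
<meta charset="utf-8">
<title>Hello</title>
<h1>Welcome</h1>
<p>First paragraph.
<p>Second paragraph.
<ul>
  <li>One
  <li>Two
</ul>
```

The parser still builds html, head, and body elements, and it closes each p and li when the next one starts. Writing the tags out explicitly, as in the earlier example, is usually easier to read and maintain.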

There are multiple definitions of the surface syntax

It turns out that there are different syntaxes in which to write HTML documents!

The two official syntaxes are the HTML syntax and the XML syntax.
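For comparison, here is a sketch of roughly the same example document in the XML syntax. It assumes the document is served with an XML MIME type such as application/xhtml+xml. In this syntax the namespace declaration is required, no tags may be omitted, and elements without content, such as img, must be explicitly closed.

```html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Hello</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>Hello, world, this is my <a href="fido.html">dog</a>.
      <img src="fido.jpg" alt="fido"/>
      <!-- That wasn't too bad -->
    </p>
  </body>
</html>
```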

Internal Representation

The internal representation of an HTML document is called the DOM, which is short for “document object model.” The DOM is a tree data structure made up of the following types of nodes: element, attribute, text, cdata, entity reference, entity, processing instruction, comment, document, document type, document fragment, and notation. The HTML document above is represented with the following DOM (here we use the character ⏎ to represent newlines and the character ‿ to represent spaces):

document
    doctype html
    element html
        text ⏎‿‿
        element head
            text ⏎‿‿‿‿
            element meta (charset=utf-8)
            text ⏎‿‿‿‿
            element title
                text Hello
            text ⏎‿‿
        text ⏎‿‿
        element body
            text ⏎‿‿‿‿
            element h1
                text Welcome
            text ⏎‿‿‿‿
            element p
                text Hello,‿world,‿this‿is‿my‿
                element a (href=fido.html)
                    text dog
                text .⏎‿‿‿‿‿‿
                element img (src=fido.jpg alt=fido)
                text ⏎‿‿‿‿‿‿
                comment ‿That‿wasn't‿too‿bad‿
                text ⏎‿‿‿‿
            text ⏎‿‿
        text

It’s fine to draw DOM trees without the inter-element whitespace (the empty text nodes and the text nodes containing only whitespace that is not part of elements that can contain significant text). So the following picture would be more common:

[Figure: DOM tree of the example document, drawn without inter-element whitespace]

The in-memory representation (DOM HTML) is what ultimately matters.
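To inspect the in-memory tree that a browser actually builds, a page can walk its own DOM from a script. The following is a minimal sketch; the describe function and the id "out" are hypothetical names introduced for this example.

```html
<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <title>DOM walk</title>
  </head>
  <body>
    <p>Hello, <a href="fido.html">dog</a>.</p>
    <pre id="out"></pre>
    <script>
      // Return one line of text per node, indented by depth in the tree.
      function describe(node, depth = 0) {
        const indent = "    ".repeat(depth);
        let line;
        switch (node.nodeType) {
          case Node.DOCUMENT_NODE:
            line = "document";
            break;
          case Node.DOCUMENT_TYPE_NODE:
            line = "doctype " + node.name;
            break;
          case Node.ELEMENT_NODE:
            line = "element " + node.localName;
            break;
          case Node.TEXT_NODE:
            // JSON.stringify makes newlines and spaces visible
            line = "text " + JSON.stringify(node.data);
            break;
          case Node.COMMENT_NODE:
            line = "comment " + JSON.stringify(node.data);
            break;
          default:
            line = "node type " + node.nodeType;
        }
        const lines = [indent + line];
        for (const child of node.childNodes) {
          lines.push(...describe(child, depth + 1));
        }
        return lines;
      }

      // Show the tree as it exists at the moment this script runs.
      document.getElementById("out").textContent = describe(document).join("\n");
    </script>
  </body>
</html>
```

Because the script runs while the page is still being parsed, the output reflects only the nodes created so far, including the pre and script elements themselves.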

Learning HTML

Here are big picture topics to keep in mind when learning HTML:

The Elements of HTML5

Quick reference time. Let’s face it. The elements are the central thing. So why not see the big list?

List of Elements

Here are the elements, grouped by category. I’ve tried to be up-to-date, but might have missed some. You can always find the complete list by going to the Elements section of the HTML Living Standard.

There are actually some cooler lists you might want to browse. One is the Periodic Table of the Elements. HTML5 Doctor’s Element Index is really useful too. And of course, there is MDN’s HTML Elements Reference.

ROOT
  html: The document root element
METADATA
  head: The metadata container
  title: The document’s title
  base: The URL to use for resolving relative URLs
  link: A link from this document to another resource
  meta: Metadata not specifiable via title, base, link, style, or script
  style: Embedded styling information
SECTIONING
  body: Main document content
  article: Self-contained composition in a document, independently distributable (syndicatable)
  section: Thematic grouping of content, typically with a heading, such as chapters in a book, or a web page’s introduction, news, and contact info sections
  nav: A (major) section that contains navigation links
  aside: Tangentially related content, such as would appear in a sidebar
  h1: A “Level 1” section heading
  h2: A “Level 2” section heading
  h3: A “Level 3” section heading
  h4: A “Level 4” section heading
  h5: A “Level 5” section heading
  h6: A “Level 6” section heading
  hgroup: A section heading that can contain multiple levels (e.g. headings and subheadings)
  header: A group of introductory or navigational aids for a document or section, such as a wrapper for a table of contents, logo, company name, and search form
  footer: A footer for its document or section, perhaps containing copyright info, author info, license agreements, etc.
  address: Contact info for nearest article or body ancestor
GROUPING
  p: Paragraph
  hr: Paragraph-level thematic break, such as a scene change in a story, or a transition to another topic within a section
  pre: Preformatted text
  blockquote: A section quoted from another source
  ol: Ordered list
  ul: Unordered list
  menu: A semantic alternative to ul to express an unordered list of commands (a "toolbar")
  li: List item
  dl: Description list (name/value groups), such as for questions/answers and terms/definitions
  dt: A name part of a name/value group in a description list
  dd: A value part of a name/value group in a description list
  figure: Content (such as illustrations, diagrams, photos, code listings) optionally with a caption, that is self-contained and is typically referenced as a single unit from the main flow of the document
  figcaption: A caption or legend for the rest of the contents of the figcaption element’s parent figure element, if any
  main: The dominant contents of the document
  search: Container for a set of form controls or other content related to performing a search or filtering operation
  div: Generic wrapper for a group of consecutive elements (should only be used as a last resort, when no existing element is suitable)
TEXT-LEVEL SEMANTICS
  a: A hyperlink, or a placeholder for a hyperlink
  em: Stress emphasis
  strong: Importance
  small: Side comments (e.g., fine print)
  s: Content that is no longer accurate or relevant
  cite: Title of a work, such as a book, paper, essay, poem, score, song, script, film, game, painting, play, musical, exhibition, or similar
  q: Content quoted from another source
  dfn: The defining instance of a term
  abbr: Abbreviation or acronym
  ruby: Text spans containing ruby markup
  rt: The ruby text component of a ruby annotation
  rp: Parentheses around a ruby text component of a ruby annotation, to be shown by user agents that don’t support ruby annotations
  data: Content tagged with a machine-readable format (in the value attribute)
  time: A date, time, or datetime (human readable in content, machine readable in datetime attribute)
  code: A fragment of computer code
  var: A variable
  samp: (Sample) output from a computer program or system
  kbd: User input, typically keyboard input, but could be voice or other kind of input
  sub: Subscript
  sup: Superscript
  i: Text in an alternate voice or mood, e.g., foreign words, technical terms, terms from a taxonomy, ship names, stage directions in a script, thoughts, hand-written notes in a document, voice-overs. (Do not use if some other element such as em, strong, dfn, var, cite, q applies.)
  b: Text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood, such as key words in a document abstract, product names in a review, actionable words in interactive text-driven software, or an article lede
  u: Text with an unarticulated, though explicitly rendered, non-textual annotation, such as labeling the text as being a proper name in Chinese text (a Chinese proper name mark), or labeling the text as being misspelt
  mark: Text marked or highlighted for reference purposes, due to its relevance in another context
  bdi: Text to be isolated from its surroundings for the purposes of bidirectional text formatting
  bdo: Bidirectional override
  span: Generic phrase-level wrapper
  br: Line break (only used when the break is part of the content, as in a poem or address)
  wbr: Line break opportunity (usually inside of a very long word or source code line)
EDITS
  ins: An addition to the document
  del: A removal from the document
EMBEDDING
  picture: Container that provides multiple sources to its contained img element
  source: Alternative media resource for a media element
  img: An image
  iframe: A nested browsing context
  embed: An integration point for an external (typically non-HTML) application or interactive content
  object: An external resource, which, depending on the type of the resource, will either be treated as an image, as a nested browsing context, or as an external resource to be processed by a plugin
  param: A parameter for plugins invoked by object elements
  video: A video
  audio: A sound or audio stream
  track: Explicit external timed text track for media elements
  map: An image map
  area: A hyperlink with some text and a corresponding area on an image map, or a dead area on an image map
TABULAR
  table: A table
  caption: The title of a table
  colgroup: A group of table columns
  col: A column in a colgroup
  tbody: A block of rows making up the main content of a table
  thead: A block of rows making up the column labels (headers) of a table
  tfoot: A block of rows making up the column summaries (footers) of a table
  tr: Table row
  td: Table cell
  th: Header cell in a table
FORMS
  form: A collection of form-associated elements, some of which can represent editable values that can be submitted to a server for processing
  label: A caption in a user interface, generally for a specific form control
  input: A typed data field, usually with a form control. The types include hidden, text, search, tel, url, email, password, datetime, date, month, week, time, datetime-local, number, range, color, checkbox, radio, file, submit, image, reset, button
  button: A button
  select: A control for selecting among a set of options
  datalist: A set of option elements that represent predefined options for other controls
  optgroup: A group of option elements with a common label
  option: An option in a select element or as part of a list of suggestions in a datalist element
  textarea: A multiline plain text edit control
  output: The result of a calculation
  progress: The completion progress of a task
  meter: A scalar measurement within a known range, or a fractional value; for example disk usage, the relevance of a query result, or the fraction of a voting population to have selected a particular candidate
  fieldset: A group of form controls optionally grouped under a common name
  legend: A caption for a fieldset
INTERACTIVE
  details: A disclosure widget from which the user can obtain additional information or controls
  summary: A summary, caption, or legend for the rest of the contents of the summary element’s parent details element, if any
  dialog: Part of an application that a user interacts with (e.g., dialog box, inspector, window)
SCRIPTING
  script: A script, either embedded or external
  noscript: Content activated only if scripting is disabled
  template: Fragments of HTML that can be cloned and inserted in the document by script
  slot: A slot in a shadow tree
  canvas: A resolution-dependent bitmap canvas, which can be used for rendering graphs, game graphics, or other visual images on the fly
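To show how several of these elements fit together, here is a sketch of a small page built from sectioning, grouping, and text-level elements. The file names, headings, and date are invented for the example.

```html
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Fido's Page</title>
  </head>
  <body>
    <header>
      <h1>Fido's Page</h1>
      <nav>
        <ul>
          <li><a href="index.html">Home</a></li>
          <li><a href="photos.html">Photos</a></li>
        </ul>
      </nav>
    </header>
    <main>
      <article>
        <h2>A Day at the Park</h2>
        <p>Posted on <time datetime="2024-05-01">May 1, 2024</time>.</p>
        <figure>
          <img src="fido.jpg" alt="Fido catching a ball">
          <figcaption>Fido in mid-leap</figcaption>
        </figure>
      </article>
      <aside>
        <p>Fido also enjoys long naps.</p>
      </aside>
    </main>
    <footer>
      <p><small>Photos may not be reused without permission.</small></p>
    </footer>
  </body>
</html>
```

Note that no div or span is needed here, since a more specific element fits each piece of content.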

Element Relationships

You can’t just put any element inside any other. The rules are sometimes very complicated; see the HTML spec for details. The following image will give you an idea of some of the relationships, but not all:

[Figure: htmlelements.png, a diagram of some of the element relationships]

Attributes

Which attributes are allowed for which elements? Some, called the global attributes, are allowed on all elements. Some are allowed only on specific elements.

Element: Allowed Attributes
(ALL): accesskey class contenteditable contextmenu dir draggable dropzone hidden id lang spellcheck style tabindex title onabort onblur oncanplay oncanplaythrough onchange onclick oncontextmenu oncuechange ondblclick ondrag ondragend ondragenter ondragleave ondragover ondragstart ondrop ondurationchange onemptied onended onerror onfocus oninput oninvalid onkeydown onkeypress onkeyup onload onloadeddata onloadedmetadata onloadstart onmousedown onmousemove onmouseout onmouseover onmouseup onmousewheel onpause onplay onplaying onprogress onratechange onreset onscroll onseeked onseeking onselect onshow onstalled onsubmit onsuspend ontimeupdate onvolumechange onwaiting
html: manifest
base: href target
link: rel href media hreflang type sizes title
meta: name http-equiv content charset
style: media type scoped title
script: src async defer type charset
body: onafterprint onbeforeprint onbeforeunload onblur onerror onfocus onhashchange onload onmessage onoffline ononline onpagehide onpageshow onpopstate onredo onresize onscroll onstorage onundo onunload
blockquote: cite
ol: reversed start type
li: value
a: href target download ping rel media hreflang type
q: cite
data: value
time: datetime
ins: cite datetime
del: cite datetime
img: alt src srcset crossorigin usemap ismap width height
iframe: src srcdoc name sandbox seamless width height
embed: src type width height
object: data type typemustmatch name usemap form width height
param: name value
video: src crossorigin poster preload autoplay mediagroup loop muted controls width height
audio: src crossorigin preload autoplay mediagroup loop muted controls
source: src type media
track: kind src srclang label default
canvas: width height
map: name
area: alt coords shape href target download ping rel media hreflang type
colgroup: span
col: span
td: colspan rowspan headers
th: colspan rowspan headers scope
form: accept-charset action autocomplete enctype method name novalidate target
fieldset: disabled form name
label: form for
input: accept alt autocomplete autofocus checked dirname disabled form formaction formenctype formmethod formnovalidate formtarget height inputmode list max maxlength min multiple name pattern placeholder readonly required size src step type value width
button: autofocus disabled form formaction formenctype formmethod formnovalidate formtarget name type value
select: autofocus disabled form multiple name required size
optgroup: disabled label
option: disabled label selected value
textarea: autocomplete autofocus cols dirname disabled form inputmode maxlength name placeholder readonly required rows wrap
keygen: autofocus challenge disabled form keytype name
output: for form name
progress: value max
meter: value min max low high optimum
details: open
command: type label icon disabled checked radiogroup command
menu: type label
dialog: open
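The difference between global and element-specific attributes can be sketched with a few fragments. The ids, class names, file names, and text below are invented for the example.

```html
<!-- Global attributes (id, class, lang, title, hidden) may appear on any element -->
<p id="greeting" class="intro" lang="en" title="A friendly greeting">Hello, world.</p>
<div hidden>Not rendered while the hidden attribute is present.</div>

<!-- Element-specific attributes are meaningful only on the elements that define them -->
<a href="fido.html" hreflang="en" download>Fido's page</a>
<img src="fido.jpg" alt="fido" width="320" height="240">
<ol start="3" reversed>
  <li>Three</li>
  <li>Two</li>
  <li>One</li>
</ol>
<input type="email" name="contact" placeholder="you@example.com" required>
```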

Obsolete Features

As HTML has evolved, elements have come and gone. Yes, some have gone. You might still see some of these in browsers. If you do, well, yikes. Just make sure YOU don’t ever use them. If you are copying and pasting code from someone else, be on the lookout for these obsolete features and replace them.

The elements that have been removed include acronym, applet, bgsound, dir, frame, frameset, noframes, isindex, keygen, listing, menuitem, nextid, noembed, plaintext, rb, rtc, strike, xmp, basefont, big, blink, center, font, marquee, multicol, nobr, spacer, tt.

There are quite a few attributes that should no longer be used, too.

Here is the complete list of obsolete features.
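As a rough guide to replacing obsolete markup found in copied code, here is a sketch showing a few obsolete elements next to one possible rewrite. The class name banner and the sample text are invented for the example, and other replacements may suit a given page better.

```html
<!-- Obsolete: presentational elements, plus acronym, strike, and tt -->
<center><font color="red"><big>Big Sale</big></font></center>
<acronym title="Hypertext Markup Language">HTML</acronym>
<strike>$20</strike>
<tt>ls -l</tt>

<!-- A possible rewrite: current elements, with presentation moved to CSS -->
<style>
  .banner { text-align: center; color: red; font-size: larger; }
</style>
<p class="banner">Big Sale</p>
<abbr title="Hypertext Markup Language">HTML</abbr>
<s>$20</s>
<code>ls -l</code>
```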

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

  1. HTML isn’t really a programming language; instead, it is a ____________ language.
    Markup.
  2. What is the purpose of HTML?
    Online document publishing, creating form-based applications, creating full-fledged interactive applications.
  3. What are the two official syntaxes for writing HTML documents?
    HTML syntax and XML syntax.

Summary

We’ve covered:

  • What HTML is
  • What it is used for
  • Structure of an HTML document
  • Elements and attributes
  • The list of elements and attributes
  • Obsolete features