URIs

On the web, everyone needs to know your name. Or actually, your address.

Unit Goals

To understand URIs deeply: everyone uses them, but they come with many gotchas and details that few people know.

What is a URI?

A URI is a Uniform Resource Identifier: a string of characters used to identify a resource. URIs have been in use since as far back as 1990.

They were first officially documented in RFC 1630, from 1994, written by Tim Berners-Lee himself, describing the ideas behind URIs and their early syntax.

RFC 3986 is the important one, the one with the generic syntax for URIs.

Individual RFCs describe particular schemes, such as RFC 4248 for telnet, RFC 4266 for gopher, RFC 8089 for file.

Exercise: Search for a list of schemes, and record the RFCs for each.

Understand the term URI:

Example: These are the example URIs from RFC 3986:
  • ftp://ftp.is.co.za/rfc/rfc1808.txt
  • http://www.ietf.org/rfc/rfc2396.txt
  • ldap://[2001:db8::7]/c=GB?objectClass?one
  • mailto:John.Doe@example.com
  • news:comp.infosystems.www.servers.unix
  • tel:+1-816-555-1212
  • telnet://192.0.2.16:80/
  • urn:oasis:names:specification:docbook:dtd:xml:4.1.2
URIs, URLs, and URNs

URIs can be used to name things, to locate things, or both. Sometimes the term URL (for uniform resource locator) is used if the URI describes how to locate the resource, and the term URN (for uniform resource name) is used for URIs with the urn scheme. It is probably best to avoid those restrictive terms.

Remember, just because something has a URI does not mean you can find it or even access it!

Components

A URI has a hierarchical structure, with the most general components on the left and the most specific on the right. There are between two and four components:

  scheme  :  path  ?  query  #  fragment

The ? query and # fragment parts are both optional. If both appear, the query has to come first.

where

Some components may be optional in certain contexts.
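As a rough illustration of the component layout above, the sketch below splits a few of the RFC 3986 example URIs with Python's standard urllib.parse.urlsplit. The helper name split_uri is hypothetical, chosen only for this example. Note that urlsplit reports the authority (the part after "//") separately from the path, whereas the simplified picture above folds it into the path.

```python
from urllib.parse import urlsplit

def split_uri(uri: str) -> dict:
    """Split a URI into named components (hypothetical helper for illustration)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # empty when the URI has no "//" part
        "path": parts.path,
        "query": parts.query,       # text after "?" and before "#", if any
        "fragment": parts.fragment, # text after "#", if any
    }

# A few of the example URIs from RFC 3986.
for uri in [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "ldap://[2001:db8::7]/c=GB?objectClass?one",
    "mailto:John.Doe@example.com",
    "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
]:
    print(uri, "->", split_uri(uri))
```

In the ldap example, only the first ? starts the query; the second ? is simply part of the query text.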

Important to Know:

URI Syntax

This is just a copy of Appendix A of RFC 3986. It is the ABNF of the generic syntax for URIs. By generic syntax it is meant that it is a superset of all valid URIs. Each individual URI scheme will have its own specific grammar and restrictions on this generic syntax.

  URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

  hier-part     = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty

  URI-reference = URI / relative-ref

  absolute-URI  = scheme ":" hier-part [ "?" query ]

  relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

  relative-part = "//" authority path-abempty
                / path-absolute
                / path-noscheme
                / path-empty

  scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

  authority     = [ userinfo "@" ] host [ ":" port ]
  userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
  host          = IP-literal / IPv4address / reg-name
  port          = *DIGIT

  IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"

  IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

  IPv6address   =                            6( h16 ":" ) ls32
                /                       "::" 5( h16 ":" ) ls32
                / [               h16 ] "::" 4( h16 ":" ) ls32
                / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                / [ *4( h16 ":" ) h16 ] "::"              ls32
                / [ *5( h16 ":" ) h16 ] "::"              h16
                / [ *6( h16 ":" ) h16 ] "::"

  h16           = 1*4HEXDIG
  ls32          = ( h16 ":" h16 ) / IPv4address
  IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet
  dec-octet     = DIGIT                 ; 0-9
                / %x31-39 DIGIT         ; 10-99
                / "1" 2DIGIT            ; 100-199
                / "2" %x30-34 DIGIT     ; 200-249
                / "25" %x30-35          ; 250-255

  reg-name      = *( unreserved / pct-encoded / sub-delims )

  path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

  path-abempty  = *( "/" segment )
  path-absolute = "/" [ segment-nz *( "/" segment ) ]
  path-noscheme = segment-nz-nc *( "/" segment )
  path-rootless = segment-nz *( "/" segment )
  path-empty    = 0<pchar>

  segment       = *pchar
  segment-nz    = 1*pchar
  segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

  query         = *( pchar / "/" / "?" )

  fragment      = *( pchar / "/" / "?" )

  pct-encoded   = "%" HEXDIG HEXDIG

  unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
  reserved      = gen-delims / sub-delims
  gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
  sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="
The predefined core rules of ABNF (defined in RFC 5234) are:
SP      =  %x20               ; space
HTAB    =  %x09               ; horizontal tab
WSP     =  SP / HTAB          ; white space
CR      =  %x0D               ; carriage return
LF      =  %x0A               ; linefeed
CRLF    =  CR LF              ; Internet standard newline
LWSP    =  *(WSP / CRLF WSP)  ; linear white space (past newline)
BIT     =  "0" / "1"
ALPHA   =  %x41-5A / %x61-7A  ; A-Z / a-z
DIGIT   =  %x30-39            ; 0-9
CTL     =  %x00-1F / %x7F     ; controls
DQUOTE  =  %x22               ; " (Double Quote)
HEXDIG  =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
CHAR    =  %x01-7F            ; any 7-bit US-ASCII character, excluding NUL
VCHAR   =  %x21-7E            ; visible (printing) characters
OCTET   =  %x00-FF            ; 8 bits of data
Exercise: Work through the syntax generating a large number of both simple and complex URIs.
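One way to check hand-generated strings against the grammar is to translate individual rules into regular expressions. The sketch below does this for just three rules (scheme, dec-octet, and IPv4address). The function names are hypothetical, and this is a partial illustration, not a validator for the full URI grammar.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

# dec-octet: the five alternatives of the grammar, covering 0-255
# with no leading zeros.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4 = re.compile(rf"{DEC_OCTET}\.{DEC_OCTET}\.{DEC_OCTET}\.{DEC_OCTET}")

def is_scheme(text: str) -> bool:
    """True if the whole string matches the scheme rule."""
    return SCHEME.fullmatch(text) is not None

def is_ipv4address(text: str) -> bool:
    """True if the whole string matches the IPv4address rule."""
    return IPV4.fullmatch(text) is not None

print(is_scheme("svn+ssh"))          # a letter followed by allowed characters
print(is_scheme("1http"))            # rejected: must start with a letter
print(is_ipv4address("192.0.2.16"))  # the telnet example's host
print(is_ipv4address("256.0.0.1"))   # rejected: 256 is not a dec-octet
```

The character classes are written out as [0-9] and [A-Za-z] instead of \d and \w, because the grammar allows only ASCII letters and digits.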

Programming with URIs

Your favorite programming language should have safe, robust libraries for both:

Make sure to use these library functions and know how to use them well. There are many nasty security implications of trying to do this stuff yourself.

TODO JavaScript Examples

TODO Python Examples
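As a starting point, here is a minimal sketch using Python's standard urllib.parse module to build a URI from untrusted pieces and take it apart again. The function name build_search_uri, the host example.com, and the parameter names are all assumptions made up for this illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit, urlunsplit

def build_search_uri(host: str, path_segment: str, params: dict) -> str:
    """Assemble an https URI, letting the library percent-encode each piece."""
    # safe="" makes quote() encode "/" too, so the input stays one path segment.
    path = "/" + quote(path_segment, safe="")
    # urlencode() encodes keys and values and joins them with "&".
    query = urlencode(params)
    return urlunsplit(("https", host, path, query, ""))

uri = build_search_uri("example.com", "a/b c", {"q": "cats & dogs", "page": 2})
print(uri)  # https://example.com/a%2Fb%20c?q=cats+%26+dogs&page=2

# Going the other way: split first, then decode each component separately.
parts = urlsplit(uri)
print(unquote(parts.path))    # /a/b c
print(parse_qs(parts.query))  # {'q': ['cats & dogs'], 'page': ['2']}
```

The ordering matters: encode each component before joining, and split before decoding. Decoding the whole string first would turn %2F and %26 back into delimiters and change the structure of the URI, which is one of the ways hand-rolled string handling goes wrong.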

Summary

We’ve covered:

  • What a URI is for
  • Where the name URI comes from
  • The components of a URI: scheme, path, query, and fragment
  • The official syntax of a URI
  • Programming concerns when dealing with URIs