URIs

On the web, everyone needs to know your name. Or actually your address.

Really?

Why is there a whole page of notes on URIs?

Because everyone uses them, but they come with a lot of gotchas and details that few people know about.

Also, it’s a good computer science case study, as are all encoding schemes that use characters; escaping is often required and there are always security implications.

What is a URI?

A URI (Uniform Resource Identifier) is a string used to identify a resource. URIs have been in use since as far back as 1990.

They were first officially documented in RFC 1630, from 1994, written by Tim Berners-Lee himself, describing the ideas behind URIs and their early syntax.

RFC 3986 is the important one, the one with the generic syntax for URIs.

Individual RFCs describe particular schemes, such as RFC 4248 for telnet, RFC 4266 for gopher, RFC 8089 for file.

Exercise: Search for a list of schemes, and record the RFCs for each.

Understand the term URI:

Uniform means that no matter what kind of resource is being identified, or how that resource is located and accessed, the overall format of the identifier is the same.
Resource means anything identified by the URI (haha yes it’s a circular definition). The RFC states: “Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., “today's weather report for Los Angeles”), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., “parent” or “employee”), or numeric values (e.g., zero, one, and infinity).”
Identifier means that the URI “embodies the information required to distinguish what is being identified from all other things within its scope of identification.”

These are the example URIs from RFC 3986:

ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2

URIs, URLs, and URNs.

URIs can be used to name things, to locate things, or both. Sometimes the term URL (for uniform resource locator) is used if the URI describes how to locate the resource, and the term URN (for uniform resource name) is used for URIs with the urn scheme. It is probably best to avoid those restrictive terms.

Remember, just because something has a URI does not mean you can find it or even access it!

Components

A URI has a hierarchical structure, with the most general components on the left and the most specific on the right. The generic syntax has up to five components, of which only the scheme and the path (which may be empty) are always present:

scheme : authority path ? query # fragment

where:

scheme is something like file, ftp, http, https, gopher, news, telnet, wais, or mailto. Some schemes are well-known and officially registered at IANA.
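authority, when present, follows // and is made up of optional user information, a host, and an optional port
host is an IP address or DNS name
port is optional; leave it out to get the default port for the scheme
path identifies the resource within the scope of the scheme and authority; for http it is commonly resolved relative to the server's "document root"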

 foo://example.com:8042/over/there?name=ferret#nose
 \_/   \______________/\_________/ \_________/ \__/
  |           |            |            |        |
scheme     authority      path        query   fragment
  |   _____________________|__
 / \ /                        \
 urn:example:animal:ferret:nose
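
As a rough illustration, the two URIs in the diagram can be taken apart with Python's standard urllib.parse module. This is only a sketch: the helper name components is made up for this example, and Python calls the authority "netloc".

```python
from urllib.parse import urlsplit


def components(uri):
    """Split a URI into its generic components (hypothetical helper)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,   # Python's name for the authority
        "host": parts.hostname,      # None when there is no authority
        "port": parts.port,          # None when no port is given
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


for uri in ("foo://example.com:8042/over/there?name=ferret#nose",
            "urn:example:animal:ferret:nose"):
    print(uri)
    for name, value in components(uri).items():
        print(f"  {name:<9} {value!r}")
```

Note that the urn example has no authority at all: everything after the scheme and its colon is the path.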

The ? query and # fragment parts are both optional. If they both appear, the query has to come first.
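
The ordering rule has a practical consequence, sketched here with Python's urllib.parse (the example.com URIs are invented for illustration): once a # appears, everything after it belongs to the fragment, so a literal # in a query value has to be percent-encoded.

```python
from urllib.parse import quote, urlsplit

# Everything after the first "#" is the fragment, even if it contains "?".
parts = urlsplit("http://example.com/search?q=a#b?c=d")
print(parts.query)     # q=a
print(parts.fragment)  # b?c=d

# A literal "#" inside a query value must be percent-encoded first,
# otherwise it would end the query and start the fragment.
value = quote("C#", safe="")
escaped = urlsplit(f"http://example.com/search?q={value}#top")
print(escaped.query)     # q=C%23
print(escaped.fragment)  # top
```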

Other components may be optional in certain contexts.

A URI is a compact string of characters for identifying a physical or abstract resource. Did you get that first part? A URI is a sequence of CHARACTERS, NOT OCTETS. The octet stream representing the URI is dependent on some character encoding.
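
To see why the character encoding matters, here is a small Python sketch (the word "café" is just an example) that percent-encodes the same character under two different encodings. The resulting octets differ, so whoever decodes the URI has to assume the same encoding. UTF-8 is the usual choice today.

```python
from urllib.parse import quote, unquote

word = "café"

# The same character yields different octets under different encodings.
as_utf8 = quote(word, encoding="utf-8")
as_latin1 = quote(word, encoding="latin-1")
print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9

# Decoding only round-trips if the decoder assumes the matching encoding.
print(unquote(as_utf8, encoding="utf-8"))      # café
print(unquote(as_latin1, encoding="latin-1"))  # café
```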

URI characters come from a very restricted character set. After all we use them for the world wide web, so everyone should be able to make sense of them. Two classes of characters are important:

Character Class	Values	Notes
Reserved Characters	`: / ? # [ ] @ ! $ & ' ( ) * + , ; =`	Used as delimiters; must be escaped when used as data rather than as a delimiter. Encoded and unencoded instances are not equivalent.
Unreserved Characters	`A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 - _ . ~`	Should not normally be escaped. If you do, it doesn’t make any difference (unless you encode twice!)
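
The two classes can be checked against Python's urllib.parse.quote, as in the sketch below. The constant names and the sample paths are made up for this example, and it assumes Python 3.7 or later, where ~ is treated as unreserved.

```python
import string
from urllib.parse import quote, unquote

UNRESERVED = string.ascii_letters + string.digits + "-_.~"
RESERVED = ":/?#[]@!$&'()*+,;="

# Unreserved characters pass through untouched.
print(quote(UNRESERVED, safe="") == UNRESERVED)  # True

# With safe="" every reserved character is percent-encoded.
encoded = quote(RESERVED, safe="")
print(encoded)

# Encoded and unencoded reserved characters are not equivalent:
# "%2F" is data inside one segment, while "/" separates segments.
one = [unquote(seg) for seg in "/a%2Fb/c".split("/")[1:]]
two = [unquote(seg) for seg in "/a/b/c".split("/")[1:]]
print(one)  # ['a/b', 'c']
print(two)  # ['a', 'b', 'c']
```

Splitting on the delimiter first and decoding afterwards is what keeps the two cases apart.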

Characters in a URI are used either as (1) delimiters or (2) strings within delimited subsequences.

Remember to escape characters that are not unreserved when they appear as data between delimiters. Each one is written as a % followed by the two hexadecimal digits of its character code. Some of the ones worth memorizing are:

(space)  %20
"  %22
#  %23
$  %24
%  %25
&  %26
+  %2B
,  %2C
/  %2F
:  %3A
;  %3B
<  %3C
=  %3D
>  %3E
?  %3F
@  %40
[  %5B
\  %5C
]  %5D
^  %5E
`  %60
{  %7B
|  %7C
}  %7D
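
The table does not have to be memorized blindly: each entry is just the character's code in hexadecimal. A minimal Python sketch that regenerates it follows. The helper pct and the constant TABLE_CHARS are hypothetical names for this example, and UTF-8 is assumed as the character encoding.

```python
from urllib.parse import quote

TABLE_CHARS = " \"#$%&+,/:;<=>?@[\\]^`{|}"


def pct(ch):
    """Percent-encode one character via its UTF-8 octets (illustrative helper)."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


for ch in TABLE_CHARS:
    print(f"{ch}  {pct(ch)}")
```

In real code, urllib.parse.quote(ch, safe="") gives the same result for these characters.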

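Interestingly, a + character is also used to represent a space in query strings that follow the application/x-www-form-urlencoded format (the one HTML forms submit). This is a convention of that format, not part of the generic URI syntax: anywhere else in a URI, a + is just a +, so %20 is the safe choice.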
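
Python's urllib.parse keeps the two conventions apart, which makes the difference easy to see. The sample string below is made up for this sketch.

```python
from urllib.parse import parse_qs, quote, quote_plus, unquote, unquote_plus, urlencode

text = "a b+c"

# Generic percent-encoding: space becomes %20, a literal plus becomes %2B.
print(quote(text))       # a%20b%2Bc

# Form-style encoding: space becomes +, a literal plus still becomes %2B.
print(quote_plus(text))  # a+b%2Bc

# Decoding has to match: unquote leaves "+" alone, unquote_plus turns it into a space.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b

# urlencode and parse_qs use the form-style convention for query strings.
query = urlencode({"q": text})
print(query)            # q=a+b%2Bc
print(parse_qs(query))  # {'q': ['a b+c']}
```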

To know which characters are allowed in which components of a URI, see the Syntax section below.

URI Syntax

Here is a copy of Appendix A of RFC 3986. It is the ABNF of the generic syntax for URIs. The syntax is generic in the sense that it describes a superset of all valid URIs: each individual URI scheme adds its own specific grammar and restrictions on top of it.

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty

URI-reference = URI / relative-ref

absolute-URI  = scheme ":" hier-part [ "?" query ]

relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
host          = IP-literal / IPv4address / reg-name
port          = *DIGIT

IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"

IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address   =                            6( h16 ":" ) ls32
              /                       "::" 5( h16 ":" ) ls32
              / [               h16 ] "::" 4( h16 ":" ) ls32
              / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
              / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
              / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
              / [ *4( h16 ":" ) h16 ] "::"              ls32
              / [ *5( h16 ":" ) h16 ] "::"              h16
              / [ *6( h16 ":" ) h16 ] "::"

h16           = 1*4HEXDIG
ls32          = ( h16 ":" h16 ) / IPv4address
IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet     = DIGIT                 ; 0-9
              / %x31-39 DIGIT         ; 10-99
              / "1" 2DIGIT            ; 100-199
              / "2" %x30-34 DIGIT     ; 200-249
              / "25" %x30-35          ; 250-255

reg-name      = *( unreserved / pct-encoded / sub-delims )

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

query         = *( pchar / "/" / "?" )

fragment      = *( pchar / "/" / "?" )

pct-encoded   = "%" HEXDIG HEXDIG

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved      = gen-delims / sub-delims
gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

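If you are new to ABNF, it’s good to know its predefined core rules (defined in RFC 5234). Note that quoted strings in ABNF are case-insensitive, so HEXDIG matches a through f as well as A through F: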

SP      =  %x20               ; space
HTAB    =  %x09               ; horizontal tab
WSP     =  SP / HTAB          ; white space
CR      =  %x0D               ; carriage return
LF      =  %x0A               ; linefeed
CRLF    =  CR LF              ; Internet standard newline
LWSP    =  *(WSP / CRLF WSP)  ; linear white space (past newline)
BIT     =  "0" / "1"
ALPHA   =  %x41-5A / %x61-7A  ; A-Z / a-z
DIGIT   =  %x30-39            ; 0-9
CTL     =  %x00-1F / %x7F     ; controls
DQUOTE  =  %x22               ; " (Double Quote)
HEXDIG  =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
CHAR    =  %x01-7F            ; any 7-bit US-ASCII character, excluding NUL
VCHAR   =  %x21-7E            ; visible (printing) characters
OCTET   =  %x00-FF            ; 8 bits of data
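
One way to get a feel for the grammar is to translate a few of the simpler rules into regular expressions. The Python sketch below does this for scheme, dec-octet, IPv4address, and pct-encoded. The variable names are invented for this example, and a hand translation like this is meant for study, not as a replacement for a real URI parser.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# dec-octet: 250-255, 200-249, 100-199, 10-99, 0-9 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# pct-encoded = "%" HEXDIG HEXDIG  (ABNF strings are case-insensitive)
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

for candidate in ("http", "svn+ssh", "3com", "a_b"):
    print(candidate, bool(SCHEME.fullmatch(candidate)))

for candidate in ("192.0.2.16", "256.0.0.1", "01.2.3.4"):
    print(candidate, bool(IPV4.fullmatch(candidate)))
```

Using fullmatch matters here: each rule has to account for the whole string, not just a prefix of it.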

Exercise: Work through the syntax generating a large number of both simple and complex URIs.

Programming with URIs

Your favorite programming language should have safe, robust libraries for both:

Constructing URIs from components (schemes, path components, query parameters, and fragments)
Parsing strings representing URIs into their components

Make sure to use these library functions and know how to use them well. There are many nasty security implications of trying to do this stuff yourself.
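
As one possible starting point, here is a Python sketch using the standard urllib.parse module. The helper build_uri, the host example.com, and all the sample values are assumptions made for this illustration. The idea is that each piece of data is escaped by the library on its way in, and unescaped by the library on its way out.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit, urlunsplit


def build_uri(scheme, host, segments, params, fragment=""):
    """Assemble a URI from unescaped parts (hypothetical helper)."""
    # Escape each path segment separately so a "/" inside a segment stays data.
    path = "/" + "/".join(quote(segment, safe="") for segment in segments)
    query = urlencode(params)  # form-style encoding: spaces become "+"
    return urlunsplit((scheme, host, path, query, quote(fragment, safe="")))


uri = build_uri("https", "example.com", ["docs", "a/b c"],
                {"q": "ferrets & stoats", "page": 2}, "top")
print(uri)

# Parsing: split on the delimiters first, then decode each piece.
parts = urlsplit(uri)
segments = [unquote(segment) for segment in parts.path.split("/")[1:]]
params = parse_qs(parts.query)
print(segments)  # ['docs', 'a/b c']
print(params)    # {'q': ['ferrets & stoats'], 'page': ['2']}
```

Notice that query values always come back as strings, and that the & inside the search text survives the round trip only because urlencode escaped it.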

Exercise: For at least Java, JavaScript, and Python, construct examples of how to create and parse URIs using the standard libraries. Make sure to include examples with schemes, user information, authorities, paths, query parameters, and fragments.

Exercise: Make a complete table of differences between JavaScript’s encodeURI and encodeURIComponent. What exactly are the differences in the characters they escape and do not escape? Why do they differ? When should you use one over the other?

Summary

We’ve covered:

What a URI is for
Where the URI got its name
The components of a URI: scheme, authority, path, query, and fragment
The official syntax of a URI
Programming concerns when dealing with URIs