A URI (Uniform Resource Identifier) is a string that identifies a resource. URIs were in use as far back as 1990.
They were first officially documented in RFC 1630, from 1994, written by Tim Berners-Lee himself, describing the ideas behind URIs and their early syntax.
RFC 3986 is the important one, the one with the generic syntax for URIs.
Individual RFCs describe particular schemes, such as RFC 4248 for telnet, RFC 4266 for gopher, RFC 8089 for file.
Understand the term URI:
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
URIs, URLs, and URNs.
URIs can be used to name things, to locate things, or both. Sometimes the term URL (for uniform resource locator) is used if the URI describes how to locate the resource, and the term URN (for uniform resource name) is used for URIs with the urn scheme. It is probably best to avoid those restrictive terms.
Remember, just because something has a URI does not mean you can find it or even access it!
A URI has a hierarchical structure, with the most general components on the left and the most specific on the right. There are between two and four components:
scheme : path ? query # fragment
The ? query and # fragment parts are both optional. If both appear, the query must come first.
where
Some components may be optional in certain contexts.
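As a rough illustration of this left-to-right structure, the sketch below uses the regular expression given in Appendix B of RFC 3986 to pull a URI apart. Note that it further separates an authority (the part after `//`) from the rest of the path, so it reports five pieces rather than four. The helper name `split_uri` is made up for this example.

```python
import re

# Regular expression from RFC 3986, Appendix B, for splitting a URI reference.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment).

    Components that are absent come back as None; the path is always a
    string, possibly empty. split_uri is a hypothetical helper name.
    """
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))

print(split_uri("ldap://[2001:db8::7]/c=GB?objectClass?one"))
# → ('ldap', '[2001:db8::7]', '/c=GB', 'objectClass?one', None)
print(split_uri("mailto:John.Doe@example.com"))
# → ('mailto', None, 'John.Doe@example.com', None, None)
```

This only splits the string; it does not check that each piece is valid for its scheme.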
Important to Know:
Character Class | Values | Notes |
---|---|---|
Reserved Characters | ; / ? : @ & = + $ , | Used as delimiters; must be escaped if not used as delimiters |
Unreserved Characters | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 - _ . ! ~ * ' ( ) | RFC 3986 narrows this set to letters, digits, and - . _ ~ |
DISALLOWED Characters | #x0 through #x1f, #x7f, #x20 (space), < > # % " | |
Unwise Characters | { } \| \ ^ [ ] ` | |
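Character | Percent-Encoding |
---|---|
(space) | %20 |
" | %22 |
# | %23 |
$ | %24 |
% | %25 |
& | %26 |
+ | %2B |
, | %2C |
/ | %2F |
: | %3A |
; | %3B |
< | %3C |
= | %3D |
> | %3E |
? | %3F |
@ | %40 |
[ | %5B |
\ | %5C |
] | %5D |
^ | %5E |
` | %60 |
{ | %7B |
\| | %7C |
} | %7D |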
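The encodings above are just each character's ASCII code written as `%` followed by two hex digits. A minimal sketch using Python's standard `urllib.parse` module shows the round trip; the sample string is invented for illustration.

```python
from urllib.parse import quote, unquote

# A made-up string containing several characters from the table above.
raw = "rock & roll/50% off?"

# safe="" asks quote() to escape "/" as well; by default it leaves "/" alone.
encoded = quote(raw, safe="")
print(encoded)           # → rock%20%26%20roll%2F50%25%20off%3F
print(unquote(encoded))  # → rock & roll/50% off?

# Each escape is "%" plus the two-digit uppercase hex code of the character.
for ch in ' "#<>|':
    assert quote(ch, safe="") == "%{:02X}".format(ord(ch))
```

Which characters need escaping depends on the component being built (path segment, query value, and so on), so the `safe` argument should be chosen with that in mind.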
This is just a copy of Appendix A of RFC 3986, the ABNF of the generic syntax for URIs. "Generic" means the grammar describes a superset of all valid URIs; each individual URI scheme has its own specific grammar and places further restrictions on this generic syntax.
    URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

    hier-part     = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

    URI-reference = URI / relative-ref

    absolute-URI  = scheme ":" hier-part [ "?" query ]

    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty

    scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

    authority     = [ userinfo "@" ] host [ ":" port ]
    userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
    host          = IP-literal / IPv4address / reg-name
    port          = *DIGIT

    IP-literal    = "[" ( IPv6address / IPvFuture ) "]"

    IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

    IPv6address   =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

    h16           = 1*4HEXDIG
    ls32          = ( h16 ":" h16 ) / IPv4address
    IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet

    dec-octet     = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

    reg-name      = *( unreserved / pct-encoded / sub-delims )

    path          = path-abempty    ; begins with "/" or is empty
                  / path-absolute   ; begins with "/" but not "//"
                  / path-noscheme   ; begins with a non-colon segment
                  / path-rootless   ; begins with a segment
                  / path-empty      ; zero characters

    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    path-empty    = 0<pchar>

    segment       = *pchar
    segment-nz    = 1*pchar
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                  ; non-zero-length segment without any colon ":"

    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

    query         = *( pchar / "/" / "?" )

    fragment      = *( pchar / "/" / "?" )

    pct-encoded   = "%" HEXDIG HEXDIG

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
The predefined categories in ABNF are:

    SP     = %x20                  ; space
    HTAB   = %x09                  ; horizontal tab
    WSP    = SP / HTAB             ; white space
    CR     = %x0D                  ; carriage return
    LF     = %x0A                  ; linefeed
    CRLF   = CR LF                 ; Internet standard newline
    LWSP   = *(WSP / CRLF WSP)     ; linear white space (past newline)
    BIT    = "0" / "1"
    ALPHA  = %x41-5A / %x61-7A     ; A-Z / a-z
    DIGIT  = %x30-39               ; 0-9
    CTL    = %x00-1F / %x7F        ; controls
    DQUOTE = %x22                  ; " (Double Quote)
    HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
    CHAR   = %x01-7F               ; any 7-bit US-ASCII character, excluding NUL
    VCHAR  = %x21-7E               ; visible (printing) characters
    OCTET  = %x00-FF               ; 8 bits of data
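To get a feel for how the ABNF reads, here is a small sketch that transcribes two of the simpler rules, `scheme` and `IPv4address`, into Python regular expressions. The function names `is_scheme` and `is_ipv4` are invented for this example, and the sketch covers only these two rules, not the whole grammar.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

# dec-octet: 0-9, 10-99, 100-199, 200-249, 250-255 (no leading zeros)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

def is_scheme(text):
    """True if text matches the scheme rule in its entirety."""
    return SCHEME.fullmatch(text) is not None

def is_ipv4(text):
    """True if text matches the IPv4address rule in its entirety."""
    return IPV4.fullmatch(text) is not None

print(is_scheme("svn+ssh"), is_scheme("1http"))    # → True False
print(is_ipv4("192.0.2.16"), is_ipv4("256.0.0.1")) # → True False
```

Hand-written matchers like these are useful for understanding the grammar, but for real work a tested library is the safer choice.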
Your favorite programming language should have safe, robust libraries for both:
Make sure to use these library functions and know how to use them well. There are many nasty security implications of trying to do this stuff yourself.
TODO JavaScript Examples
TODO Python Examples
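As one possible sketch in Python, the standard `urllib.parse` module can take a URI apart and put one together without any hand-written string slicing or escaping. The URI and the query values below are invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, urlencode, parse_qs

# Taking a (made-up) URI apart.
parts = urlsplit("https://example.com:8443/search?q=uri+syntax&lang=en#results")
print(parts.scheme)           # → https
print(parts.hostname)         # → example.com
print(parts.port)             # → 8443
print(parts.path)             # → /search
print(parts.fragment)         # → results
print(parse_qs(parts.query))  # → {'q': ['uri syntax'], 'lang': ['en']}

# Building a URI. The value "a&b=c" contains delimiters; urlencode escapes
# them so they cannot be mistaken for an extra query parameter.
query = urlencode({"q": "a&b=c", "page": 2})
built = urlunsplit(("https", "example.com", "/search", query, ""))
print(built)                  # → https://example.com/search?q=a%26b%3Dc&page=2
```

Gluing the same pieces together with plain string concatenation would have produced `?q=a&b=c&page=2`, silently adding a `b` parameter, which is the kind of bug the library functions are there to prevent.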
We’ve covered: