Search

What are computers good for? Many things. But one of the biggest things they help with is getting information. How do we get information? We search.

Technical Landscape

Let’s frame the world of search by (1) what kinds of searches we do, (2) what kind of data types we can use to collect information for searching, and (3) what kind of actual data structures can enable efficient search.

For serious searching, we have:

Actions

Does something exist?
What is its position?
Fetch object by id
Find text 💪

Abstractions

Sets
Dictionaries
Search Indexes

Structures

Sorted Arrays
Search Trees
Prefix Trees
Hashtables

For “less serious” in-memory search for very small data sets, we can find the index of an array or list element in $\Theta(n)$ time in an unordered sequence. But you already know how to do that. It isn’t super interesting.

We are interested in efficient searches.

Binary Search

If an array or list is sorted, we can find the index of an array or list element in $\Theta(\log n)$ time using binary search.

Exercise: Write a binary search of a sorted array or list in the language of your choice.

Exercise: Write a small Java app that illustrates the use of the built-in binary search method of Java.

Sets

A set is an unordered collection of (unique) items. Examples:

Here’s a bunch of operations, including mutators and non-mutators:

s.add(e): Add element e to the set s
s.remove(e): Remove element e from the set s
s.contains(e): Return whether e is a member of s
s.size(): Return the number of elements in s
s.isEmpty(): Return whether there are no elements in s
s.addAll(t): Add all the elements of set t to set s
s.removeAll(t): Remove all the elements of t from s
s.retainAll(t): Remove all the elements from s that are not in t
s.union(t): Return a new set which contains all the elements of set s and all the elements of set t, and no others
s.intersect(t): Return a new set which contains all and only those elements in both s and t
s.minus(t): Return a new set which contains all and only those elements in s but not in t
s.symmetricMinus(t): Return a new set which contains all and only those elements in either set s or set t but not both

Many programming languages have a built-in set type. Check out the documentation:

Dictionaries

A dictionary (also called a map, associative array, hash, or lookup table) is a collection of key-value pairs where every key is unique. We don’t care as much about the pairs as we do about looking up the value associated with a given key. Example:


www.nato.int	152.152.96.1
cr.yp.to	131.193.178.175
losangeles.citysearch.com	209.104.35.15
www.cl.cam.ac.uk	128.232.0.20
mit.edu	18.7.22.69
www.altan.ie	212.108.64.74
securitysearch.net	64.94.136.5
linuxassembly.org	64.202.189.170
regular-expressions.info	66.39.67.31
jpl.nasa.gov	137.78.160.180
groups.google.com	216.239.57.147
script.aculo.us	86.59.5.67

Say: www.nato.int maps to 152.152.96.1, or www.nato.int is bound to 152.152.96.1.

Fun Fact
A set is just a kind of dictionary in which the value part is ignored.

Here are some sample operations, both mutators and non-mutators:

m.put(k,v): Associate k with v (will either add a new entry or replace the existing entry for k)
m.putIfAbsent(k,v): If there is no entry with key k, add a new entry associating k with v
m.replace(k,v): If there an entry with key k, replace its value with v
m.merge(k,v,f): If there an entry with key k and value oldv, replace its value with f(oldv, v)
m.get(k): Return the value associated with key k (failure can handled in many different ways)
m.getOrDefault(k): Return the value associated with key k, or return d if key k is not present
m.containsKey(k): Return whether there’s a key-value pair in map m with key k
m.containsValue(v): Return whether there’s a key-value pair in map m with value v
m.putAll(otherMap): Add all the key-value pairs of otherMap to m
m.keySet(): Return the set of keys of m
m.values(): Return the values of m in some collection (not a set, since they don’t have to be unique)
m.entrySet(): Return the set of key-value pairs in m
m.remove(k): Remove the key-value pair with key k from map m
m.clear(): Remove all the key-value pairs from m
m.size(): Return the number of key-value pairs in m
m.isEmpty(): Return whether there are zero pairs in map m

Many programming languages have a built-in dictionary/map type. Check out the documentation:

Search Indexes

A search index is a fancy object that acts a lot like a dictionary, but rather than giving it a key and having the dictionary look up the associated value, you give a search index a search term and it produces a collection of matching documents.

The search term can be a specific string or it can be a pattern that can be matched by many strings.

The documents might be entities that contain the search term directly, or be documents related to the search term.

I don’t have any detailed notes on this topic, but if interested, start your journey at Wikipedia.

Representations

There are two primary approaches to representing sets, dictionaries, and search indexes. You can either:

Navigate through a collection of items to search for (only works for items that can be ordered), or
Perform a (“hash”) computation on the key (or search term) that produces the index of a bucket where the thing you are looking for will be easily found.

For example, in Java we have:

Interface	Navigable Representation	Hashed Representation
Set	TreeSet	HashSet
Map	TreeMap	HashMap

A Taxonomy

These two approaches give a pretty good rough breakdown, but alas, there is always some overlap, and some crazy hybrids, and some seemingly creative and off-the-wall mechanisms, so the actual taxonomy of known representations is rather huge; here is but a small slice:

Association Lists / Associative Arrays
- Unsorted association lists
- Sorted association lists
- Judy Arrays
Search Trees
- Traditional Search Trees
  - (a,b) Trees
  - B Trees and B+ Trees
  - Binary Search Trees
    - Unbalanced Binary Search Trees
    - Self-Balancing Binary Search Trees
      - AVL Trees
      - Scapegoat Trees
      - Red-Black Trees
      - AA Trees
      - Splay Trees
      - Cartesian Trees
- Prefix-Based Trees
  - Tries
  - Radix (Patricia) Trees
Hashed Structures
- Hash Tables
Other
- Skip Lists
- Bitsets
- Bloom Filters

Evaluation

No one representation is best for all situations. You need to take into account:

Will the collection be loaded once and only read and never written?
Will insertions and deletions be frequent or uncommon, compared to lookups?
Will the collection contain only a very small number of elements or can it be huge?
Are there any restrictions on the keys? For example are keys only strings? Or only integers?
Will the keys come from an type that can be ordered?
Is the collection subject to algorithmic complexity attacks?

It is worth noting:

If your keys are integers in the range 0..N, use a PLAIN OLD ARRAY

As you go over each data structure, make a little fact sheet such as the following:

Unsorted Associative Array or List
Basic idea: Key-value pairs are stored in a simple unsorted array or linked list Insertion: Θ(n) if you check for duplicate keys, Θ(1) otherwise Deletion: Θ(n) since you have to find the item first Lookup: Θ(n) since the whole list must be scanned Key type restrictions: None at all! Good for: Small collections only

Sorted Associative Array
Basic idea: Key-value pairs are stored in an array sorted by key Insertion: Θ(n) in general due to sifting; Θ(1) if adding at the end Deletion: Θ(n) in general due to sifting; Θ(1) if from the end Lookup: Θ(log n) using binary search Key type restrictions: Must be ordered Good for: Memory-constrained environments; read-only collections or those filled only in ascending order

Sorted Associative Array

Basic idea: Key-value pairs are stored in an array sorted by key
Insertion: Θ(n) in general due to sifting; Θ(1) if adding at the end
Deletion: Θ(n) in general due to sifting; Θ(1) if from the end
Lookup: Θ(log n) using binary search
Key type restrictions: Must be ordered
Good for: Memory-constrained environments; read-only collections or those filled only in ascending order

Exercise: Complete fact sheets for as many of the other implementations as you can.

Summary

We’ve covered:

How to think about search
Binary Search
What sets are
Operations on sets
What dictionaries are
Operations on dictionaries
How sets and dictionaries are alike
Representations of sets and dictionaries, without details
Ideas of when to use different representations