Search
What are computers good for? Many things. But one of the biggest things they help with is getting information. How do we get information? We search.
Technical Landscape
Let’s frame the world of search by (1) what kinds of searches we do, (2) what kind of data types we can use to collect information for searching, and (3) what kind of actual data structures can enable efficient search.
For serious searching, we have:
Actions
Does something exist?
What is its position?
Fetch object by id
Find text 💪
Abstractions
Sets
Dictionaries
Search Indexes
Structures
Sorted Arrays
Search Trees
Prefix Trees
Hashtables
For “less serious” in-memory search for very small data sets, we can find the index of an array or list element in $\Theta(n)$ time in an unordered sequence. But you already know how to do that. It isn’t super interesting.
We are interested in efficient searches.
Binary Search
If an array or list is sorted, we can find the index of an array or list element in $\Theta(\log n)$ time using
binary search.
Exercise: Write a binary search of a sorted array or list in the language of your choice.
Exercise: Write a small Java app that illustrates the use of the built-in binary search method of Java.
Sets
A set is an unordered collection of (unique) items. Examples:
Here’s a bunch of operations, including mutators and non-mutators:
- s.add(e)
- Add element e to the set s
- s.remove(e)
- Remove element e from the set s
- s.contains(e)
- Return whether e is a member of s
- s.size()
- Return the number of elements in s
- s.isEmpty()
- Return whether there are no elements in s
- s.addAll(t)
- Add all the elements of set t to set s
- s.removeAll(t)
- Remove all the elements of t from s
- s.retainAll(t)
- Remove all the elements from s that are not in t
- s.union(t)
- Return a new set which contains all the elements of
set s and all the elements of set t, and no others
- s.intersect(t)
- Return a new set which contains all and only those
elements in both s and t
- s.minus(t)
- Return a new set which contains all and only those
elements in s but not in t
- s.symmetricMinus(t)
- Return a new set which contains all and only those
elements in either set s or set t but not both
Many programming languages have a built-in set type. Check out the documentation:
Dictionaries
A dictionary (also called a map, associative array, hash, or lookup table) is a collection of key-value pairs where every key is unique. We don’t care as much about the pairs as we do about looking up the value associated with a given key.
Example:
| |
www.nato.int | 152.152.96.1 |
cr.yp.to | 131.193.178.175 |
losangeles.citysearch.com | 209.104.35.15 |
www.cl.cam.ac.uk | 128.232.0.20 |
mit.edu | 18.7.22.69 |
www.altan.ie | 212.108.64.74 |
securitysearch.net | 64.94.136.5 |
linuxassembly.org | 64.202.189.170 |
regular-expressions.info | 66.39.67.31 |
jpl.nasa.gov | 137.78.160.180 |
groups.google.com | 216.239.57.147 |
script.aculo.us | 86.59.5.67 |
Say: www.nato.int maps to 152.152.96.1, or www.nato.int is bound to 152.152.96.1.
Fun FactA set is just a kind of dictionary in which the value part is ignored.
Here are some sample operations, both mutators and non-mutators:
- m.put(k,v)
- Associate k with v (will either add a new entry or replace the existing entry for k)
- m.putIfAbsent(k,v)
- If there is no entry with key k, add a new entry associating k with v
- m.replace(k,v)
- If there an entry with key k, replace its value with v
- m.merge(k,v,f)
- If there an entry with key k and value oldv, replace its value with f(oldv, v)
- m.get(k)
- Return the value associated with key k (failure can handled in many different ways)
- m.getOrDefault(k)
- Return the value associated with key k, or return d if key k is not present
- m.containsKey(k)
- Return whether there’s a key-value pair in map m with key k
- m.containsValue(v)
- Return whether there’s a key-value pair in map m with value v
- m.putAll(otherMap)
- Add all the key-value pairs of otherMap to m
- m.keySet()
- Return the set of keys of m
- m.values()
- Return the values of m in some collection (not a set, since they
don’t have to be unique)
- m.entrySet()
- Return the set of key-value pairs in m
- m.remove(k)
- Remove the key-value pair with key k from map m
- m.clear()
- Remove all the key-value pairs from m
- m.size()
- Return the number of key-value pairs in m
- m.isEmpty()
- Return whether there are zero pairs in map m
Many programming languages have a built-in dictionary/map type. Check out the documentation:
Search Indexes
A search index is a fancy object that acts a lot like a dictionary, but rather than giving it a key and having the dictionary look up the associated value, you give a search index a search term and it produces a collection of matching documents.
The search term can be a specific string or it can be a pattern that can be matched by many strings.
The documents might be entities that contain the search term directly, or be documents related to the search term.
I don’t have any detailed notes on this topic, but if interested, start your journey at Wikipedia.
Representations
There are two primary approaches to representing sets, dictionaries, and search indexes. You can either:
- Navigate through a collection of items to search for (only works for items that can be ordered), or
- Perform a (“hash”) computation on the key (or search term) that produces the index of a bucket where the thing you are looking for will be easily found.
For example, in Java we have:
Interface | Navigable Representation | Hashed Representation |
Set | TreeSet | HashSet |
Map | TreeMap | HashMap |
A Taxonomy
These two approaches give a pretty good rough breakdown, but alas, there is always some overlap, and some crazy hybrids, and some seemingly creative and off-the-wall mechanisms, so the actual taxonomy of known representations is rather huge; here is but a small slice:
- Association Lists / Associative Arrays
- Unsorted association lists
- Sorted association lists
- Judy Arrays
- Search Trees
- Traditional Search Trees
- (a,b) Trees
- B Trees and B+ Trees
- Binary Search Trees
- Unbalanced Binary Search Trees
- Self-Balancing Binary Search Trees
- AVL Trees
- Scapegoat Trees
- Red-Black Trees
- AA Trees
- Splay Trees
- Cartesian Trees
- Prefix-Based Trees
- Tries
- Radix (Patricia) Trees
- Hashed Structures
- Other
- Skip Lists
- Bitsets
- Bloom Filters
Evaluation
No one representation is best for all situations. You need to take into account:
- Will the collection be loaded once and only read and never written?
- Will insertions and deletions be frequent or uncommon, compared to lookups?
- Will the collection contain only a very small number of elements or can it be huge?
- Are there any restrictions on the keys? For example are keys only strings? Or only integers?
- Will the keys come from an type that can be ordered?
- Is the collection subject to algorithmic complexity
attacks?
It is worth noting:
If your keys are integers in the range 0..N, use a PLAIN OLD ARRAY
As you go over each data structure, make a little fact sheet such as the following:
Unsorted Associative Array or List |
Basic idea: Key-value pairs are stored in a simple unsorted array or linked list
Insertion: Θ(n) if you check for duplicate keys, Θ(1) otherwise
Deletion: Θ(n) since you have to find the item first
Lookup: Θ(n) since the whole list must be scanned
Key type restrictions: None at all!
Good for: Small collections only |
Sorted Associative Array |
Basic idea: Key-value pairs are stored in an array sorted by key
Insertion: Θ(n) in general due to sifting; Θ(1) if adding at the end
Deletion: Θ(n) in general due to sifting; Θ(1) if from the end
Lookup: Θ(log n) using binary search
Key type restrictions: Must be ordered
Good for: Memory-constrained environments; read-only collections
or those filled only in ascending order |
Exercise: Complete fact sheets for as many of the other implementations as you can.
Summary
We’ve covered:
- How to think about search
- Binary Search
- What sets are
- Operations on sets
- What dictionaries are
- Operations on dictionaries
- How sets and dictionaries are alike
- Representations of sets and dictionaries, without details
- Ideas of when to use different representations