Search

What are computers good for? Many things. But one of the biggest things they help with is getting information. How do we get information? We search.

Technical Landscape

Let’s frame the world of search by (1) what kinds of searches we do, (2) what kind of data types we can use to collect information for searching, and (3) what kind of actual data structures can enable efficient search.

For serious searching, we have:

Actions

Does something exist?
What is its position?
Fetch object by id
Find text 💪

Abstractions

Sets
Dictionaries
Search Indexes

Structures

Sorted Arrays
Search Trees
Prefix Trees
Hashtables

For “less serious” in-memory search for very small data sets, we can find the index of an array or list element in $\Theta(n)$ time in an unordered sequence. But you already know how to do that. It isn’t super interesting.

We are interested in efficient searches.

Binary Search

If an array or list is sorted, we can find the index of an array or list element in $\Theta(\log n)$ time using binary search.
Exercise: Write a binary search of a sorted array or list in the language of your choice.
Exercise: Write a small Java app that illustrates the use of the built-in binary search method of Java.

Sets

A set is an unordered collection of (unique) items. Examples:

examplesets.gif

Here’s a bunch of operations, including mutators and non-mutators:

s.add(e)
Add element e to the set s
s.remove(e)
Remove element e from the set s
s.contains(e)
Return whether e is a member of s
s.size()
Return the number of elements in s
s.isEmpty()
Return whether there are no elements in s
s.addAll(t)
Add all the elements of set t to set s
s.removeAll(t)
Remove all the elements of t from s
s.retainAll(t)
Remove all the elements from s that are not in t
s.union(t)
Return a new set which contains all the elements of set s and all the elements of set t, and no others
s.intersect(t)
Return a new set which contains all and only those elements in both s and t
s.minus(t)
Return a new set which contains all and only those elements in s but not in t
s.symmetricMinus(t)
Return a new set which contains all and only those elements in either set s or set t but not both

Many programming languages have a built-in set type. Check out the documentation:

Dictionaries

A dictionary (also called a map, associative array, hash, or lookup table) is a collection of key-value pairs where every key is unique. We don’t care as much about the pairs as we do about looking up the value associated with a given key. Example:

www.nato.int152.152.96.1
cr.yp.to131.193.178.175
losangeles.citysearch.com209.104.35.15
www.cl.cam.ac.uk128.232.0.20
mit.edu18.7.22.69
www.altan.ie212.108.64.74
securitysearch.net64.94.136.5
linuxassembly.org64.202.189.170
regular-expressions.info66.39.67.31
jpl.nasa.gov137.78.160.180
groups.google.com216.239.57.147
script.aculo.us86.59.5.67

Say: www.nato.int maps to 152.152.96.1, or www.nato.int is bound to 152.152.96.1.

Fun Fact

A set is just a kind of dictionary in which the value part is ignored.

Here are some sample operations, both mutators and non-mutators:

m.put(k,v)
Associate k with v (will either add a new entry or replace the existing entry for k)
m.putIfAbsent(k,v)
If there is no entry with key k, add a new entry associating k with v
m.replace(k,v)
If there an entry with key k, replace its value with v
m.merge(k,v,f)
If there an entry with key k and value oldv, replace its value with f(oldv, v)
m.get(k)
Return the value associated with key k (failure can handled in many different ways)
m.getOrDefault(k)
Return the value associated with key k, or return d if key k is not present
m.containsKey(k)
Return whether there’s a key-value pair in map m with key k
m.containsValue(v)
Return whether there’s a key-value pair in map m with value v
m.putAll(otherMap)
Add all the key-value pairs of otherMap to m
m.keySet()
Return the set of keys of m
m.values()
Return the values of m in some collection (not a set, since they don’t have to be unique)
m.entrySet()
Return the set of key-value pairs in m
m.remove(k)
Remove the key-value pair with key k from map m
m.clear()
Remove all the key-value pairs from m
m.size()
Return the number of key-value pairs in m
m.isEmpty()
Return whether there are zero pairs in map m

Many programming languages have a built-in dictionary/map type. Check out the documentation:

Search Indexes

A search index is a fancy object that acts a lot like a dictionary, but rather than giving it a key and having the dictionary look up the associated value, you give a search index a search term and it produces a collection of matching documents.

The search term can be a specific string or it can be a pattern that can be matched by many strings.

The documents might be entities that contain the search term directly, or be documents related to the search term.

I don’t have any detailed notes on this topic, but if interested, start your journey at Wikipedia.

Representations

There are two primary approaches to representing sets, dictionaries, and search indexes. You can either:

For example, in Java we have:

InterfaceNavigable
Representation
Hashed
Representation
SetTreeSetHashSet
MapTreeMapHashMap

A Taxonomy

These two approaches give a pretty good rough breakdown, but alas, there is always some overlap, and some crazy hybrids, and some seemingly creative and off-the-wall mechanisms, so the actual taxonomy of known representations is rather huge; here is but a small slice:

Evaluation

No one representation is best for all situations. You need to take into account:

It is worth noting:

If your keys are integers in the range 0..N, use a PLAIN OLD ARRAY

As you go over each data structure, make a little fact sheet such as the following:

Unsorted Associative Array or List
Basic idea: Key-value pairs are stored in a simple unsorted array or linked list
Insertion: Θ(n) if you check for duplicate keys, Θ(1) otherwise
Deletion: Θ(n) since you have to find the item first
Lookup: Θ(n) since the whole list must be scanned
Key type restrictions: None at all!
Good for: Small collections only
Sorted Associative Array
Basic idea: Key-value pairs are stored in an array sorted by key
Insertion: Θ(n) in general due to sifting; Θ(1) if adding at the end
Deletion: Θ(n) in general due to sifting; Θ(1) if from the end
Lookup: Θ(log n) using binary search
Key type restrictions: Must be ordered
Good for: Memory-constrained environments; read-only collections or those filled only in ascending order
Exercise: Complete fact sheets for as many of the other implementations as you can.

Summary

We’ve covered:

  • How to think about search
  • Binary Search
  • What sets are
  • Operations on sets
  • What dictionaries are
  • Operations on dictionaries
  • How sets and dictionaries are alike
  • Representations of sets and dictionaries, without details
  • Ideas of when to use different representations