Words

Words are cool.

What are Words?

Exactly what counts as a word is kind of hard to pin down, and it varies between languages. A word is typically understood to be a unit of meaning that can stand alone in an utterance. The variation is especially visible in orthographic representation: think of agglutinative languages like Finnish, Hungarian, Turkish, Japanese, Swahili, Korean, Quechua, and Tamil.

Tokenization in LLMs

This is exactly why tokenization is hard and why modified Byte-Pair Encoding (BPE) is a compromise, not a solution. When an LLM tokenizes your prompt, it is not operating on “words” in any linguistically meaningful sense.
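
To see why BPE segments text by frequency rather than by meaning, here is a minimal sketch of the classic BPE merge-learning loop. It is not the modified variant that any particular LLM uses. The toy corpus, the function names (most_frequent_pair, merge_pair, learn_bpe), and the number of merges are all assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties go to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and greedily merge the most frequent pair."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical word counts, invented for this example.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

On this toy corpus the learned merges are e+s, then es+t, then l+o. The unit est happens to line up with an English suffix, but lo is not a morpheme: the merges follow counts, not meaning.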

Basic Terminology

Here are some linguistic terms useful in the study of spoken words, just to get started:

Lexical and Grammatical Words

Words are often divided into lexical and grammatical words:

Lexical Words | Grammatical Words
Carry meaning | Function words, glue of syntax
Can stand alone | Cannot stand alone
Open class, meaning new members can be added freely (to google, selfie, ghosting) | Closed class, new words pretty much never added, and they dominate in frequency analysis
Example categories in English: Nouns, verbs, adjectives | Example categories in English: Prepositions, conjunctions, pronouns, determiners
Gazillions of these | Very few of these

To be honest, the boundary between these categories is not always clear-cut. For example, auxiliary verbs in English feel like they can be both.
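
The claim that grammatical words dominate frequency counts is easy to check on a small scale. The sketch below is only an illustration: the sample text is invented, and GRAMMATICAL_WORDS is a hand-picked, incomplete list assumed for this example, not an authoritative inventory.

```python
import re
from collections import Counter

# Assumption: a small hand-picked sample of English grammatical words
# (determiners, prepositions, conjunctions, pronouns). Not exhaustive.
GRAMMATICAL_WORDS = {
    "the", "a", "an", "of", "in", "on", "to",
    "and", "or", "but", "it", "they", "this", "that",
}


def grammatical_share(text):
    """Return token counts and the fraction of tokens that are grammatical words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    grammatical = sum(n for word, n in counts.items() if word in GRAMMATICAL_WORDS)
    return counts, grammatical / len(tokens)


# Invented sample text.
sample = (
    "The cat sat on the mat, and the dog slept in the sun. "
    "It was a warm day, but the cat and the dog stayed in the shade of the tree."
)
counts, share = grammatical_share(sample)
print(counts.most_common(3))
print(f"{share:.0%} of tokens are grammatical words")
```

Even in two short sentences, a handful of closed-class words accounts for more than half of the tokens. A real frequency analysis would use a large corpus and a complete word list.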

Exercise: Make a list of 100 new words added to the Oxford English Dictionary in the last 5 years. How many do you use regularly or just know without reading the definitions? Are any from a closed class?
Exercise: How would you characterize word-like units such as (English) oops, shhh, hmm, uh, um, haha, tsk-tsk, yikes? Are they words? If so, are they lexical or grammatical? Do they have meaning?
Exercise: Are (instinctive) screams, grunts, yawns, laughs, and other such vocalizations words? Do they have meaning?

Lexical words can range from the concrete (e.g., apple, fly, blue) to the abstract (e.g., freedom, oppression, justice).

Some words are iconic, meaning their form resembles their meaning (e.g., buzz, bang, hiss, plop), and others are arbitrary (e.g., dog, cat). Perhaps the separation here is fuzzy, too.

Exercise: Research the kiki vs. bouba experiment.
Exercise: Is the difference between bee and hippopotamus purely arbitrary? How about tiny vs. enormous? Make a list of words for small things and record the vowels they tend to use. Do the same for words for large things. Do you see a pattern? What does this suggest about the iconicity/arbitrariness distinction?

Lexicons

The lexicon of a language is its inventory of words (and their meanings).

The average adult English speaker knows ~50,000–100,000 words (their passive vocabulary), but frequently uses ~20,000–30,000. In theory, an unlimited number of new words can be coined, and speakers can usually figure out what they mean.

Exercise: Your laptop likely has a file with a list of words. On a Mac, it’s at /usr/share/dict/words. How many words are in that file? (Hint: use wc -l in the terminal.) Search the file (hint: use grep) for words you know but suspect might not be in the file.

Children acquire words at a remarkable pace, about 10 new words per day between ages 2–8, many of which are inferred from a single exposure in context.

Contrast this with LLMs: they are trained on trillions of tokens but don’t seem to generalize as easily.

Exercise: Try this prompt for a chatbot: Do LLMs still struggle with novel word generalization despite being trained on trillions of tokens? What is an example of such difficulty if so?

Morphemes

A word is made up of one or more morphemes, the smallest units of meaning.

They can be lexical (anchored to a concept) or grammatical (inflection, affix), and can be either free or bound.

English examples:

Exercise: Open the OpenAI tokenizer and paste in a compound German word like Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz or a complex Turkish verb form. Are the segmentations morphemes?

Phonemes

Phonemes are the smallest units of sound in a language that can distinguish meaning. Some spoken languages have only a few dozen phonemes, while others have over a hundred. The IPA (International Phonetic Alphabet) provides a standardized way to represent these sounds.

CLASSWORK
Let’s visit the interactive IPA chart and practice pronouncing some of the phonemes.
Exercise: Note the large number of terms (bilabial, alveolar, velar, glottal, uvular, plosive, affricate, etc.). Look up the ones that you may be interested in.

A lot of languages are written with the Latin alphabet, even when the sounds don’t match the letters perfectly. Watch the following video to see if the Latin alphabet is a good fit for Tlingit.

Exercise: How badly does the Latin alphabet fit English? How well does it fit Spanish? Why the difference? What other alphabets are used for English?
Exercise: Research the click sounds of Zulu. How are they represented in the Latin alphabet? How are they represented in the IPA?

Word Meanings

How do words get their meanings? It is a huge question. There are some theories, like Rosch’s Prototype Theory. The idea is that word categories aren’t defined by necessary and sufficient conditions, but are instead organized around prototypes (or best examples), such as a robin for the category "bird" (and not a penguin). Other members of the category are included based on their similarity to the prototype, leading to graded membership rather than a strict boundary.

The important question is whether a hot dog, or even a pop tart, is a sandwich.

Prototype Theory in LLMs

This theory is compelling, and its ideas show up in machine learning and LLMs: classification at category boundaries is hard, and embeddings do capture graded similarity!
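
Graded membership can be sketched as similarity to a prototype vector. The example below is a toy under loud assumptions: the feature names and the hand-made feature values are invented for illustration and are not real embeddings, and using the robin itself as the prototype is a simplification.

```python
import math

# Assumption: invented binary-ish features, in this order:
# flies, sings, small, has feathers, swims
birds = {
    "robin": [1, 1, 1, 1, 0],
    "sparrow": [1, 0.5, 1, 1, 0],
    "penguin": [0, 0, 0, 1, 1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Treat the robin as the prototype of the category "bird".
prototype = birds["robin"]
typicality = {name: cosine(vec, prototype) for name, vec in birds.items()}

# Rank category members from most to least typical.
ranked = sorted(typicality, key=typicality.get, reverse=True)
for name in ranked:
    print(f"{name}: {typicality[name]:.2f}")
```

Membership comes out as a score rather than a yes/no answer: the sparrow scores close to the prototype, and the penguin scores much lower while still sharing a feature with it.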

You might also check out Putnam’s Twin Earth Thought Experiment, which gives evidence for meaning not being entirely in one’s head (intensionally) but also (extensionally) within one’s environmental context.

Another theory is the Causal Theory of Reference, which suggests that words get their meaning anchored historically, through chains of use.

Exercise: Do LLMs have a chain of use? Do they have an initial anchoring event? What does this mean for their "understanding" of word meanings?

Lexical Relations

Words are definitely related to each other! Learn these terms about these relationships to impress your friends:

Synonymy
Words with very similar meanings, close to interchangeable but not always so. Examples:
  • big / large
  • happy / joyful
  • begin / start / initiate
  • watch / observe
  • smart / intelligent
Antonymy
Words that have opposite meanings. Gradable antonyms are at opposite ends of a spectrum, such as:
  • hot / cold
  • big / small
Complementary antonyms are absolute binary opposites, such as:
  • on / off
  • alive / dead
  • pass / fail
Converse antonyms are relational opposites, such as:
  • buy / sell
  • parent / child
  • teacher / student
Hyponymy
Words with a subcategory relationship. Examples:
  • dog is a hyponym of animal (the hypernym)
  • rose / flower
  • car / vehicle
  • poodle / dog / canine / animal
Polysemy
When a word has many related meanings (can be metaphorical or metonymic or systematic). Think: one lexeme with multiple senses. Examples:
  • foot (body part / mountain base)
  • head (body part / leader / top of something)
  • run (move fast on foot / operate a business)
  • get (obtain / understand / become)
  • paper (material / academic article / publication / newspaper company)
Homonymy
When words coincidentally look or sound the same, despite having completely different origins and distinct, unrelated meanings. Think: distinct lexemes sharing a surface form. Examples:
  • bat (flying mammal / baseball equipment)
  • bark (tree skin / dog sound)
  • lie (recline / falsehood)
  • right (correct / direction)
  • bank (financial institution / river bank)
  • stalk (plant stem / follow stealthily)
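
Of the relations above, hyponymy has the most obvious data-structure reading: each word points to its hypernym, and the relation is transitive. Here is a minimal sketch using the chains from the Hyponymy examples. The dictionary layout and the function names (hypernym_chain, is_hyponym_of) are assumptions for illustration, and real lexical databases allow a word to have more than one hypernym.

```python
# Each word maps to its immediate hypernym (simplified to one parent per word).
hypernym_of = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "animal",
    "rose": "flower",
    "car": "vehicle",
}


def hypernym_chain(word):
    """Follow hypernym links upward, returning the path from word to the top."""
    chain = [word]
    while chain[-1] in hypernym_of:
        chain.append(hypernym_of[chain[-1]])
    return chain


def is_hyponym_of(word, category):
    """True if category appears anywhere above word in the chain."""
    return category in hypernym_chain(word)[1:]


print(hypernym_chain("poodle"))
print(is_hyponym_of("poodle", "animal"))
print(is_hyponym_of("animal", "poodle"))
```

The relation only runs one way: a poodle is an animal, but an animal is not necessarily a poodle.
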
Word Vectors Encode These Relations

Embedding spaces capture many lexical relations geometrically. The famous example: king − man + woman ≈ queen. Synonymy, hyponymy, and analogy all emerge from distributional statistics — no one programmed them in.
Exercise: Run a live word embedding query using the TensorFlow Embedding Projector. Demonstrate synonymy, hyponymy, and vector analogy (king − man + woman ≈ queen). What does it mean that geometric relationships encode semantic ones?
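
The analogy arithmetic can be tried by hand before using a real embedding tool. The sketch below uses tiny made-up vectors, so it is only a picture of the mechanics: the three dimensions (royalty, gender, food) and all values are assumptions, and real embeddings are learned, have hundreds of dimensions, and give only approximate results.

```python
import math

# Assumption: invented 3-dimensional vectors (royalty, gender, food).
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "prince": [0.7, 1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word nearest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", vectors))
```

Excluding the query words from the candidates is a common convention in analogy demos. Without it, the nearest neighbor of the target is often one of the input words.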

TODO

TODO

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

  1. What is the difference between open-class and closed-class words? Give examples of each.
    Open-class words (nouns, verbs, adjectives) accept new members freely — "selfie," "to google." Closed-class words (prepositions, conjunctions, determiners) form a small, stable set and are the structural glue of syntax.
  2. What is fast mapping?
    Fast mapping is the ability of children to infer a word’s meaning from a single exposure in context. Children acquire roughly 10 new words per day between ages 2 and 8 this way.
  3. What is prototype theory (Rosch), and what does it predict about category membership?
    Categories are organized around best examples (prototypes) rather than necessary and sufficient conditions. Membership is graded — a robin is a better bird than a penguin. Category boundaries are fuzzy.
  4. Distinguish polysemy from homonymy, and give an example of each.
    Polysemy: one word with multiple related meanings (head: body part / leader). Homonymy: distinct words that share a form but have unrelated meanings (bat: animal / baseball bat).

Summary

We’ve covered:

  • TODO
  • TODO