Words

Words are cool.

What are Words?

Exactly what counts as a word is kind of hard to pin down, and varies between languages. It’s typically understood to be a unit of meaning that can stand alone in an utterance. There’s a lot of variation among languages, especially in the orthographic representations: think of agglutinative languages like Finnish, Hungarian, Turkish, Japanese, Swahili, Korean, Quechua, and Tamil.

Tokenization in LLMs
This is exactly why tokenization is hard and why modified Byte-Pair Encoding (BPE) is a compromise, not a solution. When an LLM tokenizes your prompt, it is not operating on “words” in any linguistically meaningful sense.
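To make the idea concrete, here is a minimal sketch of plain character-level BPE training in Python. It is a toy, not the modified BPE that any production tokenizer uses: the function names (`merge_pair`, `learn_bpe`, `tokenize`) and the tiny corpus are made up for illustration, and real tokenizers typically start from bytes and add pre-tokenization rules.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start from single characters; each word is a tuple of symbols with a count.
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge everywhere in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus, chosen only for illustration.
corpus = ["walk", "walked", "walking", "talk", "talked", "talking"]
merges = learn_bpe(corpus, num_merges=6)
print(merges)
print(tokenize("walking", merges))
print(tokenize("stalked", merges))
```

Notice that the merges are driven purely by pair counts. Whether the resulting pieces happen to line up with anything like walk + -ing depends on the corpus statistics, not on any knowledge of meaning.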

Basic Terminology

Here are some linguistic terms useful in the study of spoken words, just to get started:

Spoken Word: a unit of meaning that can stand alone in an utterance, made up of one or more phonemes
Phoneme: the smallest unit of speech sound that can distinguish meaning in a language
Vowel: a phoneme produced without significant constriction of the vocal tract
Consonant: a phoneme produced with some constriction of the vocal tract
Syllable: a larger sound unit built around a nucleus (usually a vowel), optionally with consonants before or after it
Phonology: the study of the functional patterns of speech sounds, specifically how phonemes create meaning
Morpheme: a part of a word that contributes meaning
Morphology: How words (tokens) are made up of smaller pieces
Syntax: How words are structured to form phrases and sentences
Semantics: What sentences mean
Pragmatics: How language is used to communicate ideas and concepts

Lexical and Grammatical Words

Words are often divided into lexical and grammatical words:

Lexical Words	Grammatical Words
Carry meaning	Function words, glue of syntax
Can stand alone	Cannot stand alone
Open class, meaning new members can be added freely—to google, selfie, ghosting	Closed class, new words pretty much never added, and they dominate in frequency analysis
Example categories in English: Nouns, verbs, adjectives	Example categories in English: Prepositions, conjunctions, pronouns, determiners
Gazillions of these	Very few of these

To be honest, the boundary between these categories is not always clear-cut. For example, auxiliary verbs in English feel like they can be both.
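The claim in the table that grammatical words dominate in frequency analysis is easy to check for yourself. Here is a small sketch; the `FUNCTION_WORDS` set is a short hand-picked list (an assumption for illustration, not a complete inventory), and the sample text is made up.

```python
from collections import Counter
import re

# A small, hand-picked set of English grammatical (function) words.
# This is an illustrative assumption, not a complete list.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "but", "in", "on", "at",
    "to", "it", "is", "that", "with", "for", "as",
}

def top_words(text, n=5):
    """Return the n most common lowercase word tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

# Made-up sample text; try pasting in a few paragraphs of your own instead.
sample = (
    "The cat sat on the mat and the dog sat on the rug. "
    "The owner of the cat and the dog looked at the mat and the rug."
)

for word, count in top_words(sample):
    kind = "grammatical" if word in FUNCTION_WORDS else "lexical"
    print(f"{word:>6}  {count}  {kind}")
```

Even in this tiny sample, "the" is by far the most frequent token. On larger texts, the top of the list tends to fill up with closed-class words.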

Exercise: Make a list of 100 new words added to the Oxford English Dictionary in the last 5 years. How many do you use regularly or just know without reading the definitions? Are any from a closed class?

Exercise: How would you characterize word-like units such as (English) oops, shhh, hmm, uh, um, haha, tsk-tsk, yikes? Are they words? If so, are they lexical or grammatical? Do they have meaning?

Exercise: Are (instinctive) screams, grunts, yawns, laughs, and other such vocalizations words? Do they have meaning?

Lexical words can range from the concrete (e.g., apple, fly, blue) to the abstract (e.g., freedom, oppression, justice).

Some words are iconic, meaning their form resembles their meaning (e.g., buzz, bang, hiss, plop), and others are arbitrary (e.g., dog, cat). Perhaps the separation here is fuzzy, too.

Exercise: Research the kiki vs. bouba experiment (also known as the bouba/kiki effect).

Exercise: Is the difference between bee and hippopotamus purely arbitrary? How about tiny vs. enormous? Make a list of words for small things and record the vowels they tend to use. Do the same for words for large things. Do you see a pattern? What does this suggest about the iconicity/arbitrariness distinction?

Lexicons

The lexicon of a language is its inventory of words (and their meanings).

The average adult English speaker knows ~50,000–100,000 words (their passive vocabulary), but frequently uses ~20,000–30,000. In theory, an unlimited number of new words can be coined, and listeners can usually figure out what they mean.

Exercise: Your laptop likely has a file with a list of words. On a Mac, it’s at /usr/share/dict/words. How many words are in that file? (Hint: use wc -l in the terminal.) Search the file (hint: use grep) for words you know but suspect might not be in the file.
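If you would rather poke at the file from Python than from the terminal, here is a rough equivalent of the wc -l and grep hints. The helper names (`load_words`, `missing_words`) and the candidate words are made up for illustration. Note that this sketch skips blank lines, so its count can differ slightly from wc -l.

```python
from pathlib import Path

def load_words(path):
    """Return the non-empty lines of a one-word-per-line file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]

def missing_words(candidates, words):
    """Return the candidates that do not appear in the list (case-insensitive)."""
    known = {w.lower() for w in words}
    return [c for c in candidates if c.lower() not in known]

if __name__ == "__main__":
    # Path from the exercise; it may not exist on every system.
    path = Path("/usr/share/dict/words")
    if path.exists():
        words = load_words(path)
        print(len(words), "words")
        # Swap in words you suspect are missing.
        print("not found:", missing_words(["selfie", "ghosting", "apple"], words))
    else:
        print(f"{path} not found")
```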

Children acquire words at a remarkable pace, about 10 new words per day between ages 2–8, many of which are inferred from a single exposure in context.

Contrast with LLMs: they are trained on trillions of tokens but don’t seem to generalize as easily.

Exercise: Try this prompt for a chatbot: Do LLMs still struggle with novel word generalization despite being trained on trillions of tokens? What is an example of such difficulty if so?

Morphemes

A word is made up of one or more morphemes, the smallest units of meaning.

They can be lexical (anchored to a concept) or grammatical (inflection, affix), and can be either free or bound.

English examples:

Free lexical morpheme (root), e.g. dog, happy, walk
Bound lexical morpheme, e.g. bio (as in biology), tele (as in telephone)
Free grammatical morpheme, e.g. and, but, the, as, of
Bound grammatical morpheme, e.g. -s (plural), -ed (past tense), -ing (progressive), -ment (noun-forming)
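To see why finding morphemes is harder than it looks, here is a deliberately naive sketch that strips the bound grammatical morphemes listed above from the ends of words. The function name `split_suffix` and the two-letter minimum stem are assumptions for illustration; this is not how real morphological analyzers work.

```python
# Bound grammatical morphemes from the English examples above, longest first.
SUFFIXES = ["ment", "ing", "ed", "s"]

def split_suffix(word):
    """Naively split off one known suffix; return (stem, suffix or None)."""
    for suffix in SUFFIXES:
        # Require a stem of at least two letters, which skips words like "is" or "red".
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)], "-" + suffix
    return word, None

for word in ["dogs", "walked", "walking", "payment", "bus", "bring", "red"]:
    print(word, "->", split_suffix(word))
```

The first four words split the way you would hope, but "bus" and "bring" get chopped into "bu" + "-s" and "br" + "-ing", which are not morphemes at all. Surface spelling alone does not tell you where the units of meaning are.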

Exercise: Open the OpenAI tokenizer and paste in a compound German word like Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz or a complex Turkish verb form. Are the segmentations morphemes?

Phonemes

Phonemes are the smallest units of sound in a language that can distinguish meaning. Some spoken languages have only a few dozen phonemes, while others have over a hundred. The IPA (International Phonetic Alphabet) provides a standardized way to represent these sounds.

CLASSWORK

Let’s visit the interactive IPA chart and practice pronouncing some of the phonemes.

Exercise: Note the large number of terms (bilabial, alveolar, velar, uvular, glottal, plosive, affricate, etc.). Look up the ones that you may be interested in.

A lot of languages are written with the Latin alphabet, even when the sounds don’t match the letters perfectly. Watch the following video to see if the Latin alphabet is a good fit for Tlingit.

Exercise: How badly does the Latin alphabet fit English? How well does it fit Spanish? Why the difference? What other alphabets are used for English?

Exercise: Research the click sounds of Zulu. How are they represented in the Latin alphabet? How are they represented in the IPA?

Word Meanings

How do words get their meanings? It is a huge question. There are some theories, like Rosch’s Prototype Theory. The idea is that word categories aren’t defined by necessary and sufficient conditions, but rather organized around prototypes (or best examples), such as a robin for the category "bird" (and not a penguin). Other members of the category are included based on their similarity to the prototype, leading to graded membership rather than a strict boundary.

The important question is whether a hot dog, or even a pop tart, is a sandwich.

Prototype Theory in LLMs
This theory is compelling and is used in machine learning and LLMs: classification at category boundaries is hard, and embeddings do capture graded similarity!

You might also check out Putnam’s Twin Earth Thought Experiment, which argues that meaning is not entirely in one’s head (intensionally) but also depends (extensionally) on one’s environmental context.

Another theory is the Causal Theory of Reference, which suggests that words get their meaning anchored historically, through chains of use.

Exercise: Do LLMs have a chain of use? Do they have an initial anchoring event? What does this mean for their "understanding" of word meanings?

Lexical Relations

Words are definitely related to each other! Learn these terms about these relationships to impress your friends:

Synonymy

Words with very similar meanings, close to interchangeable though not necessarily in every context. Examples:

big / large
happy / joyful
begin / start / initiate
watch / observe
smart / intelligent

Antonymy

Words that have opposite meanings. Gradable antonyms are at opposite ends of a spectrum, such as:

hot / cold
big / small

Complementary antonyms are absolute binary opposites, such as:

on / off
alive / dead
pass / fail

Converse antonyms are relational opposites, such as:

buy / sell
parent / child
teacher / student

Hyponymy

Words with a subcategory relationship. Examples:

dog is a hyponym of animal (the hypernym)
rose / flower
car / vehicle
poodle / dog / canine / animal

Polysemy

When a word has many related meanings (can be metaphorical or metonymic or systematic). Think: one lexeme with multiple senses. Examples:

foot (body part / mountain base)
head (body part / leader / top of something)
run (move fast on foot / operate a business)
get (obtain / understand / become)
paper (material / academic article / publication / newspaper company)

Homonymy

When words coincidentally look or sound the same, despite having completely different origins and distinct, unrelated meanings. Think: distinct lexemes sharing a surface form. Examples:

bat (flying mammal / baseball equipment)
bark (tree skin / dog sound)
lie (recline / falsehood)
right (correct / direction)
bank (financial institution / river bank)
stalk (plant stem / follow stealthily)

Word Vectors Encode These Relations

Embedding spaces capture many lexical relations geometrically. The famous example: king − man + woman ≈ queen. Synonymy, hyponymy, and analogy all emerge from distributional statistics — no one programmed them in.
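Here is a minimal sketch of the arithmetic behind the famous example, using cosine similarity. The vectors are hypothetical, hand-picked 3-dimensional toys chosen so the analogy works; real embeddings are learned from text, have hundreds of dimensions, and give messier results.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# The dimensions loosely mean (royalty, maleness, femaleness); learned
# embeddings have no such tidy labels.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))
```

With these toy vectors the query returns queen. Excluding the three input words from the candidates is a common convention in analogy demos, since the nearest neighbor of the target is often one of the inputs.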

Exercise: Run a live word embedding query using the TensorFlow Embedding Projector. Demonstrate synonymy, hyponymy, and vector analogy (king − man + woman ≈ queen). What does it mean that geometric relationships encode semantic ones?

TODO

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

What is the difference between open-class and closed-class words? Give examples of each.
Open-class words (nouns, verbs, adjectives) accept new members freely — "selfie," "to google." Closed-class words (prepositions, conjunctions, determiners) form a small, stable set and are the structural glue of syntax.
What is fast mapping?
Fast mapping is the ability of children to infer a word’s meaning from a single exposure in context. Children acquire roughly 10 new words per day between ages 2 and 8 this way.
What is prototype theory (Rosch), and what does it predict about category membership?
Categories are organized around best examples (prototypes) rather than necessary and sufficient conditions. Membership is graded — a robin is a better bird than a penguin. Category boundaries are fuzzy.
Distinguish polysemy from homonymy, and give an example of each.
Polysemy: one word with multiple related meanings (head: body part / leader / top of something). Homonymy: distinct words with unrelated meanings that share a form (bat: animal / baseball bat).

Summary

We’ve covered:

TODO
TODO