We’ve seen that human languages have evolved over millions of years, picking up symbols, words, grammar, pragmatics, social cognition, and much more along the way. Now, let’s explore what happens when we apply machine learning—or more accurately, a statistical learning engine—to billions of pages of human text to see what such a system can learn and do.
We’re only going to scratch the surface of language modeling, but you’ll be able to at least pick up some vocabulary with which to enter your ML or NLP courses, and get a sense of what an LLM actually does.
It’s always a good idea to start with a 3Blue1Brown video if one exists for the topic of interest!
You can also jump-start your learning by reading the Large Language Model article on Wikipedia or taking the LLM Course on Hugging Face. Getting involved with the Hugging Face community is worthwhile too.
Roughly speaking, language modeling is assigning a probability distribution over sequences of tokens.
Formally, $P(w_1, w_2, w_3, \dots, w_n)$ is the probability of a sequence, and $P(w_n \mid w_1, w_2, \dots, w_{n-1})$ lets us predict the next token given all previous tokens. Everything you can do with LLMs—summarization, translation, programming, and reasoning—emerges from doing this well at scale.
Not surprisingly, predictive power can come from throwing machine learning at the problem. Hopefully, we can do this well enough to learn:
Excellent prediction is either language understanding or an approximation of understanding, depending on your perspective.
A language model that is really big is called a large language model (LLM). One that isn’t very big at all, and fits on a single device, is called a smol language model.
LLMs can do cool things, like:
LLMs collapse most of these into a single framework: generation conditioned on a prompt.
Just an introduction today: we won’t go into the details of speech (acoustic models, ASR) or multimodal models (vision-language).
Here is a beautiful idea. Represent words (actually tokens) as vectors in a high-dimensional space, where similar words are close together. After all, neural networks operate on vectors of real numbers. If words are vectors, neural networks can do language tasks.
The very naive approach to this is One-Hot Encoding. It’s simple but not really helpful. Here’s how it works.
Assume a vocabulary of size V = 50,000 words. In one-hot encoding, each word is a vector of length 50,000: all zeros except a single 1.
This only encodes identity, not relationships, e.g. dog and puppy are orthogonal — as far apart as dog and democracy. The dimensionality is enormous, but the representation is sparse.
It’s little more than a useful toy to prime understanding; it’s not used in modern LLMs directly.
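Still, it’s worth seeing once. A minimal sketch in NumPy (the tiny vocabulary is made up for illustration):

```python
import numpy as np

vocab = ["dog", "puppy", "democracy"]          # toy vocabulary, V = 3
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, V=len(vocab)):
    """Return a length-V vector: all zeros except a single 1 at the word's index."""
    v = np.zeros(V)
    v[index[word]] = 1.0
    return v

# Every pair of distinct words has dot product 0: "dog" is exactly as far
# from "puppy" as it is from "democracy".
print(one_hot("dog") @ one_hot("puppy"))       # 0.0
print(one_hot("dog") @ one_hot("democracy"))   # 0.0
```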
You shall know a word by the company it keeps. — J.R. Firth (1957)
The idea here is that words that appear in similar contexts tend to have similar meanings. It’s empirically true (words are defined in terms of other words, and their meanings often vary based on context). It’s also extraordinarily powerful and the philosophical foundation of every word embedding ever built.
2013 was a big year, with the introduction of Word2Vec.
Let’s oversimplify but capture the intuition here.
The idea is to train a shallow neural network so that each word is a dense vector of (around) 300 real numbers that encode semantic and syntactic relationships.
Training was pretty cool. It employed two interesting ideas:
A good embedding space will give us several useful properties:
This is what meaning in a vector space looks like!
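If you want to poke at a real embedding space yourself, here’s a sketch using the gensim library and its downloadable Google News Word2Vec vectors (assuming gensim is installed; the vectors are a large one-time download):

```python
import gensim.downloader as api

# Load pretrained Word2Vec vectors (300 dimensions, trained on Google News).
wv = api.load("word2vec-google-news-300")

# The classic analogy: king - man + woman is close to queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Nearby vectors tend to be semantically related words.
print(wv.most_similar("puppy", topn=5))
```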
Cosine Similarity: Let’s get technical. The cosine between two vectors v and w is defined as the dot product of v and w divided by the product of their magnitudes. This gives us a measure of similarity that ranges from -1 (opposite) to 1 (identical), with 0 meaning orthogonal (unrelated).
$$\text{cosine\_similarity}(v, w) = \frac{v \cdot w}{\|v\| \, \|w\|}$$
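A direct translation into NumPy (the vectors below are made-up stand-ins for real embeddings):

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product of v and w divided by the product of their magnitudes."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([0.2, 0.9, 0.4])    # pretend embedding for "dog"
w = np.array([0.3, 0.8, 0.5])    # pretend embedding for "puppy"
u = np.array([-0.7, 0.1, -0.6])  # pretend embedding for "democracy"

print(cosine_similarity(v, w))   # close to 1: similar
print(cosine_similarity(v, u))   # negative: dissimilar
```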
Word2Vec gives one vector per word — always
"bank" (river) = "bank" (financial) — same vector
Language is massively context-dependent
We need representations that shift based on surrounding context
This will be solved by the transformer
Don’t tokenize at word boundaries — tokenize at subword boundaries
[Demo moment: show tiktoken on unbelievably or a Turkish verb]
Algorithm: start with characters; iteratively merge the most frequent pair
unbelievably → [un, believ, ably] (approximately)
Why: handles unknown words gracefully, captures morphology, manageable vocabulary size (~50K tokens)
Trade-off: $\textsf{token} \neq \textsf{word}$ — keep this in mind when thinking about what LLMs process
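A quick sketch with OpenAI’s tiktoken library (assuming it is installed; the exact splits depend on which encoding you load):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4-era models

for text in ["unbelievably", "transformer", "def fibonacci(n):"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", pieces)

# Common words tend to stay whole, while rare or compound words
# get split into subword pieces.
```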
TODO: Tokenizer live demo (tiktoken) — show BPE on compound words, rare words, code
TODO: Embedding visualization — nearest neighbors for "king," "bank" (show polysemy problem)
Takes a vector of inputs $x$
Computes a weighted sum: $z = w \cdot x + b$
Applies a nonlinearity: output = $f(z)$ where $f$ is (e.g.) ReLU = $\max(0, z)$
One neuron = one learned feature detector
Without it, stacking layers = one big linear transformation (provable)
Nonlinearity lets networks approximate arbitrary functions (Universal Approximation Theorem — don’t need the proof, just the fact)
ReLU is popular because it’s fast, doesn’t vanish for large inputs, and works well in practice
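Here is a single neuron in NumPy, with made-up weights rather than learned ones:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([1.0, -2.0, 0.5])   # input vector
w = np.array([0.3, 0.8, -0.5])   # weights (made up here; normally learned)
b = 0.1                          # bias (also learned)

z = np.dot(w, x) + b             # weighted sum
output = relu(z)                 # nonlinearity
print(z, output)                 # z is negative here, so ReLU clips it to 0
```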
Stack N neurons operating in parallel on the same input
Output: a vector of length N
Each neuron learns a different feature
Stack L layers: input → layer 1 → layer 2 → … → layer L → output
Each layer transforms the representation
Early layers: simple patterns; deep layers: abstract features
This is empirically observed in vision networks; believed (with more uncertainty) in language networks
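Stacking is just repeating the same operation. A sketch with random, untrained weights, purely to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    """One layer: weighted sums for N neurons in parallel, then ReLU."""
    return np.maximum(0.0, W @ x + b)

x = rng.normal(size=16)                              # input vector
W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)     # layer 1: 16 -> 32
W2, b2 = rng.normal(size=(8, 32)), np.zeros(8)       # layer 2: 32 -> 8

h = layer(x, W1, b1)    # each of the 32 neurons computes a different feature
y = layer(h, W2, b2)    # deeper layers build on earlier ones
print(y.shape)          # (8,)
```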
Define a loss function: how wrong is the current prediction?
For language modeling: cross-entropy loss between predicted token distribution and actual next token
Backpropagation: compute gradient of loss w.r.t. every parameter via chain rule
Gradient descent: nudge every parameter in the direction that reduces loss
Repeat for billions of examples
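In PyTorch, one such step looks roughly like this (a sketch; `model` stands for any network that maps token IDs to next-token logits):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, input_ids, target_ids):
    # logits: (batch, seq_len, vocab_size); targets: the same tokens shifted by one.
    logits = model(input_ids)
    loss = F.cross_entropy(                  # how wrong is the current prediction?
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
    )
    loss.backward()                          # backpropagation via the chain rule
    optimizer.step()                         # nudge every parameter downhill
    optimizer.zero_grad()
    return loss.item()
```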
Reinforcement learning from human feedback (RLHF) — mentioned briefly, not detailed
Constitutional AI — mentioned briefly, not detailed
Learning rate: step size — too big = diverge, too small = never converge
Batch size: how many examples per gradient step
Epochs: passes through the dataset (LLMs typically train for ~1 epoch on huge datasets)
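All three knobs appear directly in the training setup. A toy, self-contained sketch (the tiny linear model and random data are stand-ins; the values are illustrative, not recommendations):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(16, 4)                                   # stand-in for a real network
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)       # learning rate: step size
loader = DataLoader(dataset, batch_size=32, shuffle=True)        # batch size: examples per step

for epoch in range(1):                                           # epochs: passes over the data
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```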
The network has no symbolic rules
It has billions of numbers (weights) that get adjusted until predictions are good
The "knowledge" is distributed across all the weights — not localized
This is deeply unlike classical AI / expert systems — and unlike how we naively think of memory
A "large" language model: 7B to 700B+ parameters
Each parameter is one floating point number
GPT-2 (2019): 1.5B parameters — could fit on a laptop
GPT-4 (estimated): ~1T parameters — requires thousands of GPUs
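A rough back-of-the-envelope calculation shows why this matters for hardware (weights only, at 16-bit precision; activations and optimizer state add more):

```python
# Each parameter is one floating point number.
# At 16-bit precision, that is 2 bytes per parameter just to store the weights.
for params in [1.5e9, 7e9, 70e9, 1e12]:
    gigabytes = params * 2 / 1e9
    print(f"{params:.1e} parameters -> ~{gigabytes:,.0f} GB of weights")
# 1.5B fits comfortably on a laptop; 1T does not fit on any single GPU.
```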
Why does scale matter? We’ll revisit this in the section on What LLMs can and cannot do
Before transformers: RNNs (Recurrent Neural Networks) processed tokens one at a time, left to right
Information about token 1 must "travel" through every subsequent hidden state to reach token 500
Result: vanishing gradients — early context gets diluted or lost
Also: sequential processing = no parallelism = slow training
Proposed replacing recurrence entirely with attention mechanisms
The transformer processes all tokens simultaneously
Every token can look at every other token directly
Maybe they should have said transformers are all you need
For each token, compute three vectors: a query $Q_i$, a key $K_i$, and a value $V_i$ (each a learned linear projection of the token’s embedding)
Attention score between token i and token j:
$$\text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}$$
Softmax over all j → attention weights (sum to 1)
Output for token i = weighted sum of all Vⱼ
In plain English: each token figures out how much to "attend to" every other token, then mixes their values proportionally
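The whole mechanism fits in a few lines of NumPy. A sketch of a single attention head on one sequence, with no masking:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Returns one new vector per token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # score(i, j) for every pair of tokens
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # mix the values proportionally

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
print(attention(Q, K, V).shape)          # (5, 64)
```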
Subject-verb agreement: "The dogs that chased the cat are…" — "are" attends to "dogs" not "cat"
Coreference: "The trophy didn’t fit in the suitcase because it was too big" — "it" attends to "trophy"
Syntactic structure, semantic roles, discourse relations
These were hand-engineered features in classical NLP — here they emerge from gradient descent
Run h parallel attention operations ("heads") with different Q/K/V matrices
Each head can learn a different type of relationship
Concatenate and project the results
Typical model: 12–96 heads per layer
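PyTorch ships this as a ready-made module; a usage sketch with random inputs, just to show the shapes:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)     # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)         # self-attention: Q, K, V all come from x
print(out.shape, attn_weights.shape)     # (1, 10, 512), (1, 10, 10)
```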
Input
↓
Layer Norm
↓
Multi-Head Self-Attention ← residual connection back to input
↓
Layer Norm
↓
Feed-Forward Network (two linear layers + ReLU) ← residual connection
↓
Output (same shape as input)
Stack 12 to 96+ of these blocks. That’s the model.
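As code, one block looks roughly like this (a simplified sketch; real implementations add dropout, often use GELU instead of ReLU, and differ in where they place the layer norms):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                 # residual connection around attention
        x = x + self.ff(self.ln2(x))     # residual connection around the FFN
        return x                         # same shape as the input

x = torch.randn(1, 10, 512)
print(TransformerBlock()(x).shape)       # torch.Size([1, 10, 512])
```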
Positional Encoding
Self-attention is position-agnostic — "cat bit dog" and "dog bit cat" look the same without positional info
Solution: add a positional signal to each token embedding before the first layer
Original paper: sinusoidal functions (elegant, fixed)
Modern models: learned positional embeddings, or RoPE (Rotary Position Embedding)
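The original sinusoidal scheme, sketched in NumPy:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """pe[p, 2i] = sin(p / 10000^(2i/d)), pe[p, 2i+1] = cos(p / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first transformer block.
print(sinusoidal_positions(seq_len=128, d_model=512).shape)   # (128, 512)
```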
The maximum number of tokens the model can attend to at once
Early GPT: 512 tokens (~400 words)
Modern models: 128K–1M tokens
Attention cost scales quadratically with context length: O(n²) — the main computational bottleneck
Active research area: sparse attention, linear attention approximations
Original transformer: encoder + decoder (for translation)
Language modeling uses decoder-only: predict next token, left to right
Causal masking: when computing attention for token i, mask out tokens i+1, i+2… (can’t look at the future)
This is what makes autoregressive generation work: generate one token, append it, generate the next
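Both ideas in one sketch: the causal mask that gets added to attention scores, and the sample-append-repeat loop (the `next_token_logits` function below is a random stand-in for a real model’s forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size = 5, 50_000

# Causal mask: token i may only attend to tokens 0..i (no peeking at the future).
# Adding -inf before the softmax zeroes out the masked positions.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)

def next_token_logits(tokens):
    return rng.normal(size=vocab_size)   # a real model would compute these

tokens = [101]                           # start with some prompt token(s)
for _ in range(10):
    logits = next_token_logits(tokens)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tokens.append(int(rng.choice(vocab_size, p=probs)))   # sample, append, repeat
print(tokens)
```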
TODO: Attention visualization — BertViz or similar, show what "it" attends to in ambiguous sentences
Common Crawl: a scrape of much of the web — petabytes of text
Books, Wikipedia, code repositories, scientific papers, forums
Preprocessing: deduplication, quality filtering, language identification, toxicity filtering
The data mixture matters enormously — and is often not fully disclosed
This is the "company it keeps" principle from distributional semantics, applied at civilizational scale
Autoregressive language modeling: predict next token
For each training example: run the sequence through the model, compare predicted distribution to actual token, compute cross-entropy loss, backpropagate
Do this for ~10¹³ tokens (tens of trillions)
Training a frontier model: thousands of GPUs for months
Estimated cost: $10M–$100M+ for a single training run
Why GPUs? Matrix multiplication is what they’re designed for — and transformers are basically stacked matrix multiplications
This is a genuine barrier to entry and a geopolitical resource
Small models: predict tokens, nothing more impressive
Scale up: few-shot learning appears. Then chain-of-thought. Then code generation. Then...
Emergent abilities: capabilities that appear abruptly at scale thresholds, not predicted by smooth extrapolation
Hotly debated: are these truly emergent, or artifacts of how we measure?
The honest answer: we don’t fully understand why scale works as well as it does
Pretrained model: knows language, not how to be helpful or safe
Supervised fine-tuning (SFT): train on curated examples of good assistant behavior
Reinforcement Learning from Human Feedback (RLHF): humans rank model outputs → train a reward model → optimize policy against that reward
This is how GPT-3 → InstructGPT → ChatGPT
The details are a whole lecture on their own
We’ve seen that LLMs are characterized by three conceptual handles:
So how does this translate into what LLMs can and cannot do?
We can all agree that LLMs can do a lot of things well:
They’re not perfect. Let’s be aware of limitations.
Models often generate confident, fluent, false claims.
Why? The objective is to predict plausible text, not true text. Also, if the training data contains errors, the model can’t detect them.
TODO: Live hallucination demo — ask a model a plausible-but-false question about an obscure topic; show the confident wrong answer
Small prompt changes → large output changes
"Solve this step by step" dramatically outperforms "Solve this" on math — same model
The model is sensitive to surface form in ways humans aren’t
The context window is all there is: no learning between conversations (without fine-tuning)
Every conversation starts fresh
Struggle with novel compositional combinations ("the blue sphere to the left of the red cube that is behind the green cylinder")
May be learning shortcuts rather than compositional rules
Surprisingly bad at multi-step arithmetic unless given explicit scratchpad
"Chain of thought" prompting helps significantly — why?
Probably because it externalizes intermediate steps, letting the model condition on them
TODO: Chain-of-thought demo — same math problem with and without "let’s think step by step"
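For flavor, the two prompts differ only in the final instruction (illustrative only; results vary by model):

```python
problem = "A book costs $13. You buy 7 copies and pay with a $100 bill. What's your change?"

direct_prompt = f"{problem}\nAnswer:"
cot_prompt = f"{problem}\nLet's think step by step."

# The second prompt tends to elicit intermediate steps (7 * 13 = 91, 100 - 91 = 9),
# which the model can then condition on when producing the final answer.
```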
In natural language, words get meaning from perception, action, joint attention, causal anchoring, and more. LLMs have none of this: they simply have distributional co-occurrence at massive scale. But it’s remarkable how well this seems to work.
But is it understanding? That’s a debate (and depends on how you define understanding).
LLMs are "stochastic parrots" — form without meaning
They remix training data without comprehension
The critique is useful; the parrot metaphor is probably too strong
Parrots don’t write novel proofs, debug unfamiliar code, or explain jokes
The truth is probably somewhere uncomfortable in between
Here’s a short video by Parth
A brief timeline of key developments in language models:
As of mid-2026, here are some popular models:
| Family | Producer | Type | Open-weight | Tiers / Variants | Notes |
|---|---|---|---|---|---|
| Claude | Anthropic | General LLM | No | Haiku, Sonnet, Opus, Mythos | Mythos is a new tier above Opus |
| Gemini | Google DeepMind | General LLM | No | Flash Lite, Flash, Pro, Ultra | Deep Think is a mode of Pro/Ultra, not a separate model |
| Gemma | Google DeepMind | Small/edge LLM | Yes | 2B, 9B, 27B | Open-weight counterpart to Gemini |
| GPT | OpenAI | General LLM | No | Mini, Nano, standard, Pro | GPT-5.x current generation |
| Phi | Microsoft | Small/edge LLM | Yes | Phi-4, Phi-4-mini, Phi-4-multimodal | Optimized for on-device / edge |
| MAI | Microsoft | Specialized | No | Transcribe-1, Voice-1, Image-2 | No general-purpose frontier model yet |
| Llama | Meta | General LLM | Yes | Scout, Maverick | Behemoth announced but unreleased |
| Muse | Meta | General LLM | No | Spark | Proprietary; from Meta Superintelligence Labs |
| Nemotron | NVIDIA | General LLM | Partial | Nano, Super, Ultra | |
| DeepSeek | DeepSeek | General + Reasoning | Yes | V3, V4, R1 | V/R series are different model types, not just tiers |
| Qwen | Alibaba Cloud | General LLM | Yes | Qwen3, Qwen3-Coder, Qwen3-VL | Rapid versioning; many specialist variants |
| GLM | Z.ai | General LLM | Partial | GLM-5, GLM-5.1 | |
| Kimi | Moonshot AI | General LLM | No | | Known for very long context windows |
| Mistral | Mistral AI | General LLM | Partial | Ministral, Mistral, Devstral, Magistral | Mix of open and closed models |
| Grok | xAI | General LLM | Partial | | |
| Sonar | Perplexity AI | Search-optimized | No | Sonar, Sonar Pro, Sonar Reasoning | Fine-tune of LLaMA 3.3; not trained from scratch |
| Apple Foundation Models | Apple | On-device + server | No | On-device (~3B), Server (MoE) | No public branding; powers Apple Intelligence |
And popular products:
| Product | Vendor | Category | Primary model(s) | Includes third-party models? | Notes |
|---|---|---|---|---|---|
| Claude.ai | Anthropic | Chatbot / assistant | Claude | No | |
| Claude API | Anthropic | API | Claude | No | |
| Claude Code | Anthropic | Coding assistant | Claude | No | |
| ChatGPT | OpenAI | Chatbot / assistant | GPT | No | |
| OpenAI API | OpenAI | API | GPT | No | |
| Gemini app | Google | Chatbot / assistant | Gemini | No | |
| Google AI Studio | Google | API / Development | Gemini, Gemma | No | |
| Microsoft Copilot | Microsoft | Productivity assistant | GPT (primary), MAI, Phi | Yes — Claude, Gemini | Routes between models by task |
| GitHub Copilot | Microsoft | Coding assistant | GPT | Yes — Claude and others | Developer IDE integration |
| Perplexity | Perplexity AI | Search-answer engine | Sonar | Yes — GPT, Claude, Gemini, Grok | Core differentiator is search grounding + routing |
| Meta AI | Meta | Chatbot / assistant | Llama, Muse | No | Integrated into WhatsApp, Instagram, etc. |
| Grok app | xAI | Chatbot / assistant | Grok | No | |
| Apple Intelligence | Apple | On-device assistant | Apple Foundation Models | Yes — ChatGPT, Gemini | Privacy-first; runs on-device by default |
| Kimi app | Moonshot AI | Chatbot / assistant | Kimi | No |
See also Wikipedia’s List of large language models.
Systems built around LLMs are surprisingly effective! LLMs learned the regularities of language. They apparently reconstructed a lot of the underlying structure—grammar, semantics, pragmatics—from prediction alone. Clearly they are missing human elements such as grounding, embodiment, social cognition, and continuous experience.
A fundamental philosophical question: how much of that absence matters for the tasks we care about?
The interesting question isn’t "are LLMs intelligent" — that’s mostly a definitional debate
The interesting question is: what cognitive and linguistic capacities are necessary for what purposes, and which of those does this architecture actually instantiate?
There are more questions:
A great question! After all, the architecture is more or less fixed, and we want to know what the next big thing will be. Some people are already looking ahead, since we have a decent sense of the current limitations.
These notes were not intended to be anything more than a very light introduction. We got the mental model:
Raw text (tokens) $\longrightarrow$ Embedding (tokens to vectors) $\longrightarrow$ Transformer blocks (contextual representations) $\longrightarrow$ Output head (distribution over next token) $\longrightarrow$ sample a token $\longrightarrow$ append $\longrightarrow$ repeat
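That entire loop is packaged behind one call in the Hugging Face transformers library. A sketch using the small GPT-2 model (assuming transformers and a backend such as PyTorch are installed):

```python
from transformers import pipeline

# Tokenize -> embed -> transformer blocks -> next-token distribution -> sample -> repeat,
# all hidden behind a single call.
generator = pipeline("text-generation", model="gpt2")
print(generator("Language models are", max_new_tokens=30, do_sample=True)[0]["generated_text"])
```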
But we were light on details. You’ll no doubt move on to more advanced studies such as:
If you’re impatient: Andrej Karpathy’s "Neural Networks: Zero to Hero" series on YouTube builds GPT from scratch in Python.
As you prepare for further study, consider making flashcards for:
And of course, read and watch:
Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.
Coming soon
We’ve covered: