Introduction to Large Language Models

In addition to looking at language mathematically and evolutionarily, we can also look at it as a statistical machine learning problem.

Getting Started

We’ve seen that human languages have evolved over millions of years, picking up symbols, words, grammar, pragmatics, social cognition, and much more along the way. Now, let’s explore what happens when we apply machine learning—or more accurately, a statistical learning engine—to billions of pages of human text to see what such a system can learn and do.

We’re only going to scratch the surface of language modeling, but you’ll be able to at least pick up some vocabulary with which to enter your ML or NLP courses, and get a sense of what an LLM actually does.

It’s always a good idea to start with a 3Blue1Brown video if one exists for the topic of interest!

You can also jump-start your learning by reading the Large Language Model article on Wikipedia or taking the LLM Course on Hugging Face. Getting involved with the Hugging Face community is worthwhile too.

Language Modeling

Roughly speaking, language modeling is assigning a probability distribution over sequences of tokens.

Formally, $P(w_1, w_2, w_3, \ldots, w_n)$ is the probability of a sequence, and $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$ lets us predict the next token given all previous tokens. Everything you can do with LLMs—summarization, translation, programming, and reasoning—emerges from doing this well at scale.

Not surprisingly, predictive power can come from throwing machine learning at the problem. Hopefully, we can do this well enough to support a provocative claim:

Excellent prediction is either language understanding or an approximation of understanding, depending on your perspective.

A language model that is really big is called a large language model (LLM). One that isn’t very big at all, and fits on a single device, is called a smol language model.

LLMs can do cool things: summarize documents, translate between languages, answer questions, write and debug code, and carry out multi-step reasoning, among others.

LLMs collapse most of these into a single framework: generation conditioned on a prompt.

Just an introduction today

We won’t go into the details of speech (acoustic models, ASR) or multimodal models (vision-language) today.

Embeddings

Here is a beautiful idea. Represent words (actually tokens) as vectors in a high-dimensional space, where similar words are close together. After all, neural networks operate on vectors of real numbers. If words are vectors, neural networks can do language tasks.

The Naive Approach

The very naive approach to this is One-Hot Encoding. It’s simple but not really helpful. Here’s how it works.

Assume a vocabulary of size V = 50,000 words. In one-hot encoding, each word is a vector of length 50,000: all zeros except a single 1.

This only encodes identity, not relationships: dog and puppy are orthogonal — as far apart as dog and democracy. The dimensionality is enormous, and the representation is sparse and wasteful.
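A tiny sketch in Python makes the problem concrete; the three-word vocabulary here is made up for illustration:

```python
# One-hot encoding: every word is its own axis, so no two words are similar.
import numpy as np

vocab = {"dog": 0, "puppy": 1, "democracy": 2}   # pretend V = 3 instead of 50,000

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

dog, puppy, democracy = one_hot("dog"), one_hot("puppy"), one_hot("democracy")
print(np.dot(dog, puppy))        # 0.0: "dog" is as unrelated to "puppy"...
print(np.dot(dog, democracy))    # 0.0: ...as it is to "democracy"
```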

It’s little more than a useful toy to prime understanding; it’s not used in modern LLMs directly.

The Distributional Hypothesis

You shall know a word by the company it keeps. — J.R. Firth (1957)

The idea here is that words that appear in similar contexts tend to have similar meanings. It’s empirically true (words are defined in terms of other words, and their meanings often vary based on context). It’s also extraordinarily powerful and the philosophical foundation of every word embedding ever built.

Exercise: Relate this idea to the principle of compositionality from the philosophy of language, and to the idea of "meaning as use" from Wittgenstein. How are these ideas similar? How are they different?

Word2Vec

2013 was a big year, with the introduction of Word2Vec.

Let’s oversimplify but capture the intuition here.

The idea is to train a shallow neural network so that each word is a dense vector of (around) 300 real numbers that encode semantic and syntactic relationships.

Training was pretty cool. It employed two interesting ideas:

Embedding Space Properties

A good embedding space will give us several useful properties:

This is what meaning in a vector space looks like!
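You can poke at these properties yourself. Here is a hedged sketch using the gensim library and its downloadable pretrained Google News Word2Vec vectors (an assumption on our part; any pretrained embedding will do, and the exact neighbors you get will vary):

```python
# Exploring a pretrained Word2Vec embedding space (first run downloads ~1.6 GB).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

print(wv.similarity("dog", "puppy"))          # high: nearby in the space
print(wv.similarity("dog", "democracy"))      # low: far apart
print(wv.most_similar("puppy", topn=5))       # nearest neighbors
print(wv.most_similar(positive=["king", "woman"],
                      negative=["man"], topn=3))  # the classic analogy, roughly "queen"
```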

Cosine Similarity

Let’s get technical. The cosine between two vectors v and w is defined as the dot product of v and w divided by the product of their magnitudes. This gives us a measure of similarity that ranges from -1 (opposite) to 1 (identical), with 0 meaning orthogonal (unrelated).

$$ \text{cosine\_similarity}(v, w) = \frac{v \cdot w}{\|v\| \, \|w\|} $$
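In code this is a one-liner; here is a minimal NumPy sketch with made-up vectors:

```python
# Cosine similarity: dot product divided by the product of the magnitudes.
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(v, w))                            # 1.0: same direction
print(cosine_similarity(v, -w))                           # -1.0: opposite direction
print(cosine_similarity(v, np.array([3.0, 0.0, -1.0])))   # 0.0: orthogonal
```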

Early Approaches Were Context-Free

Word2Vec gives one vector per word — always

"bank" (river) = "bank" (financial) — same vector

Language is massively context-dependent

We need representations that shift based on surrounding context

This will be solved by the transformer

Subword Tokenization (Byte-Pair Encoding, BPE)

Don’t tokenize at word boundaries — tokenize at subword boundaries

[Demo moment: show tiktoken on unbelievably or a Turkish verb]

Algorithm: start with characters; iteratively merge the most frequent pair

unbelievably → [un, believ, ably] (approximately)

Why: handles unknown words gracefully, captures morphology, manageable vocabulary size (~50K tokens)

Trade-off: $\textsf{token} \neq \textsf{word}$ — keep this in mind when thinking about what LLMs process
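If you want to try this before the class demo, here is a quick sketch, assuming the tiktoken package is installed (the exact splits depend on which encoding you load):

```python
# Inspecting BPE tokenization with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of the OpenAI BPE vocabularies
for text in ["unbelievably", "antidisestablishmentarianism", "def f(x): return x**2"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")
```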

CLASSWORK
TODO: Tokenizer live demo (tiktoken) — show BPE on compound words, rare words, code
CLASSWORK
TODO: Embedding visualization — nearest neighbors for "king," "bank" (show polysemy problem)

The Very Basics of Neural Networks

What a Neuron Does

Takes a vector of inputs $x$

Computes a weighted sum: $z = w \cdot x + b$

Applies a nonlinearity: output = $f(z)$ where $f$ is (e.g.) ReLU = $\max(0, z)$

One neuron = one learned feature detector
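Here is that single neuron as a few lines of NumPy; the weights and inputs are made up:

```python
# One neuron: weighted sum plus bias, passed through a nonlinearity.
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, -1.2, 3.0])   # vector of inputs
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias

z = np.dot(w, x) + b             # weighted sum
print(z, relu(z))                # pre-activation and output
```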

Why Nonlinearity?

Without it, stacking layers = one big linear transformation (provable)

Nonlinearity lets networks approximate arbitrary functions (Universal Approximation Theorem — don’t need the proof, just the fact)

ReLU is popular because it’s fast, doesn’t vanish for large inputs, and works well in practice

A Layer

Stack N neurons operating in parallel on the same input

Output: a vector of length N

Each neuron learns a different feature

A Deep Network

Stack L layers: input → layer 1 → layer 2 → … → layer L → output

Each layer transforms the representation

Early layers: simple patterns; deep layers: abstract features

This is empirically observed in vision networks; believed (with more uncertainty) in language networks
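As a sketch, a whole deep network is just that neuron computation repeated: each layer is a matrix of weights (one row per neuron), and layers are composed. The sizes below are made up:

```python
# A deep network as stacked layers: x -> layer 1 -> layer 2 -> ... -> output.
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]                       # input -> two hidden layers -> output
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=8)
for W, b in params:                          # each layer transforms the representation
    x = relu(W @ x + b)
print(x.shape)                               # (4,)
```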

Training: Gradient Descent

Define a loss function: how wrong is the current prediction?

For language modeling: cross-entropy loss between predicted token distribution and actual next token

Backpropagation: compute gradient of loss w.r.t. every parameter via chain rule

Gradient descent: nudge every parameter in the direction that reduces loss

Repeat for billions of examples
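To make the loop concrete, here is a hedged PyTorch sketch of a single training step. The "model" is a toy embedding-plus-linear next-token predictor and the data is random; this is the shape of the computation, not anyone's actual training code:

```python
# One step of gradient descent on the next-token prediction objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 16))    # a fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from its prefix

logits = model(inputs)                            # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # backpropagation via the chain rule
optimizer.step()                                  # nudge every parameter downhill
optimizer.zero_grad()
print(loss.item())                                # starts near ln(100), about 4.6
```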

Reinforcement learning from human feedback (RLHF) — mentioned briefly, not detailed

Constitutional AI — mentioned briefly, not detailed

Key Parameters

Learning rate: step size — too big = diverge, too small = never converge

Batch size: how many examples per gradient step

Epochs: passes through the dataset (LLMs typically train for ~1 epoch on huge datasets)

What "Learning" Means Here

The network has no symbolic rules

It has billions of numbers (weights) that get adjusted until predictions are good

The "knowledge" is distributed across all the weights — not localized

This is deeply unlike classical AI / expert systems — and unlike how we naively think of memory

Scale

A "large" language model: 7B to 700B+ parameters

Each parameter is one floating point number

GPT-2 (2019): 1.5B parameters — could fit on a laptop

GPT-4 (estimated): ~1T parameters — requires thousands of GPUs

Why does scale matter? We’ll revisit this in the section on What LLMs can and cannot do

The Transformer Architecture

The Problem with Sequential Models

Before transformers: RNNs (Recurrent Neural Networks) processed tokens one at a time, left to right

Information about token 1 must "travel" through every subsequent hidden state to reach token 500

Result: vanishing gradients — early context gets diluted or lost

Also: sequential processing = no parallelism = slow training

"Attention Is All You Need" (Vaswani et al., 2017)

Proposed replacing recurrence entirely with attention mechanisms

The transformer processes all tokens simultaneously

Every token can look at every other token directly

Maybe they should have said transformers are all you need

The Self-Attention Mechanism — The Core Idea

For each token, compute three vectors: a query (Q), a key (K), and a value (V)

Attention score between token i and token j:

$$ \text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}} $$

Softmax over all $j$ → attention weights (sum to 1)

Output for token $i$ = weighted sum of all $V_j$

In plain English: each token figures out how much to "attend to" every other token, then mixes their values proportionally
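Here is the whole mechanism as a short NumPy sketch for a single attention head; all dimensions and weights are made up:

```python
# Single-head self-attention: scores, softmax weights, weighted sum of values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # a query, key, and value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # score(i, j) = Q_i . K_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # output_i = sum_j weights_ij * V_j

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 8)
```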

What Attention Learns

Subject-verb agreement: "The dogs that chased the cat are…" — "are" attends to "dogs" not "cat"

Coreference: "The trophy didn’t fit in the suitcase because it was too big" — "it" attends to "trophy"

Syntactic structure, semantic roles, discourse relations

These were hand-engineered features in classical NLP — here they emerge from gradient descent

Multi-Head Attention

Run h parallel attention operations ("heads") with different Q/K/V matrices

Each head can learn a different type of relationship

Concatenate and project the results

Typical model: 12–96 heads per layer

The Full Transformer Block

Input

Layer Norm

Multi-Head Self-Attention ← residual connection back to input

Layer Norm

Feed-Forward Network (two linear layers + ReLU) ← residual connection

Output (same shape as input)

Stack 12 to 96+ of these blocks. That’s the model.
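In code, one block looks roughly like the following PyTorch sketch (a pre-norm variant; the dimensions are illustrative and not those of any particular model):

```python
# One transformer block: attention and feed-forward, each with layer norm and a residual.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual connection
        x = x + self.ff(self.norm2(x))                      # residual connection
        return x                                            # same shape as input

x = torch.randn(2, 10, 64)      # (batch, tokens, d_model)
print(Block()(x).shape)         # torch.Size([2, 10, 64])
```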

Positional Encoding

Self-attention is position-agnostic — "cat bit dog" and "dog bit cat" look the same without positional info

Solution: add a positional signal to each token embedding before the first layer

Original paper: sinusoidal functions (elegant, fixed)

Modern models: learned positional embeddings, or RoPE (Rotary Position Embedding)
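The original sinusoidal scheme is easy to write down; here is a minimal NumPy sketch of it:

```python
# Sinusoidal positional encodings, added to token embeddings before the first block.
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

print(positional_encoding(50, 16).shape)             # (50, 16)
```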

Context Window

The maximum number of tokens the model can attend to at once

Early GPT: 512 tokens (~400 words)

Modern models: 128K–1M tokens

Attention cost scales quadratically with context length: O(n²) — the main computational bottleneck

Active research area: sparse attention, linear attention approximations

The Decoder-Only Architecture (GPT-style)

Original transformer: encoder + decoder (for translation)

Language modeling uses decoder-only: predict next token, left to right

Causal masking: when computing attention for token i, mask out tokens i+1, i+2… (can’t look at the future)

This is what makes autoregressive generation work: generate one token, append it, generate the next
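Causal masking itself is just a triangular pattern of minus-infinities added to the attention scores before the softmax. A tiny NumPy sketch with random scores:

```python
# Causal mask: token i may attend only to tokens 0..i.
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))       # raw attention scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)          # True above the diagonal
scores[future] = -np.inf                                     # future positions vanish...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)               # ...after the softmax
print(np.round(weights, 2))                                  # lower-triangular weights
```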

CLASSWORK
TODO: Attention visualization — BertViz or similar, show what "it" attends to in ambiguous sentences

Training at Scale

Data

Common Crawl: a scrape of much of the web — petabytes of text

Books, Wikipedia, code repositories, scientific papers, forums

Preprocessing: deduplication, quality filtering, language identification, toxicity filtering

The data mixture matters enormously — and is often not fully disclosed

This is the "company it keeps" principle from distributional semantics, applied at civilizational scale

The Objective, Revisited

Autoregressive language modeling: predict next token

For each training example: run the sequence through the model, compare predicted distribution to actual token, compute cross-entropy loss, backpropagate

Do this for roughly 10²² tokens

Compute Requirements

Training a frontier model: thousands of GPUs for months

Estimated cost: $10M–$100M+ for a single training run

Why GPUs? Matrix multiplication is what they’re designed for — and transformers are basically stacked matrix multiplications

This is a genuine barrier to entry and a geopolitical resource

Emergence

Small models: predict tokens, nothing more impressive

Scale up: few-shot learning appears. Then chain-of-thought. Then code generation. Then...

Emergent abilities: capabilities that appear abruptly at scale thresholds, not predicted by smooth extrapolation

Hotly debated: are these truly emergent, or artifacts of how we measure?

The honest answer: we don’t fully understand why scale works as well as it does

Fine-Tuning and RLHF

Pretrained model: knows language, not how to be helpful or safe

Supervised fine-tuning (SFT): train on curated examples of good assistant behavior

Reinforcement Learning from Human Feedback (RLHF): humans rank model outputs → train a reward model → optimize policy against that reward

This is how GPT-3 → InstructGPT → ChatGPT

The details are a whole lecture on their own

What LLMs Can and Cannot Do

We’ve seen that LLMs are characterized by three conceptual handles:

So how does this translate into what LLMs can and cannot do?

What They’re Genuinely Good At

We can all agree that LLMs can do a lot of things well:

Shortcomings

They’re not perfect. Let’s be aware of limitations.

Hallucination

Models often generate confident, fluent, false claims.

Why? The objective is to predict plausible text, not true text. Also, if the training data contains errors, the model can’t detect them.

CLASSWORK
TODO: Live hallucination demo — ask a model a plausible-but-false question about an obscure topic; show the confident wrong answer

Brittleness

Small prompt changes → large output changes

"Solve this step by step" dramatically outperforms "Solve this" on math — same model

The model is sensitive to surface form in ways humans aren’t

No Persistent Memory

Context window is it. No learning between conversations (without fine-tuning)

Every conversation starts fresh

Systematic Generalization

Struggle with novel compositional combinations ("the blue sphere to the left of the red cube that is behind the green cylinder")

May be learning shortcuts rather than compositional rules

Arithmetic and Formal Reasoning

Surprisingly bad at multi-step arithmetic unless given an explicit scratchpad

"Chain of thought" prompting helps significantly — why?

Probably because it externalizes intermediate steps, letting the model condition on them

CLASSWORK
TODO: Chain-of-thought demo — same math problem with and without "let’s think step by step"

The Grounding Problem Again

In natural language, words get meaning from perception, action, joint attention, causal anchoring, and more. LLMs have none of this: they simply have distributional co-occurrence at massive scale. But it’s remarkable how well this seems to work.

But is it understanding? That’s a debate (and depends on how you define understanding).

Exercise: Read about Searle’s Chinese Room: a system that produces appropriate outputs just by following rules. Is this what LLMs do? Does it matter? Do you think human understanding is also a form of sophisticated pattern matching, just implemented in neurons?

The Stochastic Parrot Critique

LLMs are "stochastic parrots" — form without meaning

They remix training data without comprehension

The critique is useful; the parrot metaphor is probably too strong

Parrots don’t write novel proofs, debug unfamiliar code, or explain jokes

The truth is probably somewhere uncomfortable in between

Simulating Emotions

Here’s a short video by Parth

LLMs in the World

Where We’ve Been

A brief timeline of key developments in language models:

Today’s Models and Products

As of mid-2026, here are some popular models:

| Family | Producer | Type | Open-weight | Tiers / Variants | Notes |
|---|---|---|---|---|---|
| Claude | Anthropic | General LLM | No | Haiku, Sonnet, Opus, Mythos | Mythos is a new tier above Opus |
| Gemini | Google DeepMind | General LLM | No | Flash Lite, Flash, Pro, Ultra | Deep Think is a mode of Pro/Ultra, not a separate model |
| Gemma | Google DeepMind | Small/edge LLM | Yes | 2B, 9B, 27B | Open-weight counterpart to Gemini |
| GPT | OpenAI | General LLM | No | Mini, Nano, standard, Pro | GPT-5.x current generation |
| Phi | Microsoft | Small/edge LLM | Yes | Phi-4, Phi-4-mini, Phi-4-multimodal | Optimized for on-device / edge |
| MAI | Microsoft | Specialized | No | Transcribe-1, Voice-1, Image-2 | No general-purpose frontier model yet |
| Llama | Meta | General LLM | Yes | Scout, Maverick | Behemoth announced but unreleased |
| Muse | Meta | General LLM | No | Spark | Proprietary; from Meta Superintelligence Labs |
| Nemotron | NVIDIA | General LLM | Partial | Nano, Super, Ultra | |
| DeepSeek | DeepSeek | General + Reasoning | Yes | V3, V4, R1 | V/R series are different model types, not just tiers |
| Qwen | Alibaba Cloud | General LLM | Yes | Qwen3, Qwen3-Coder, Qwen3-VL | Rapid versioning; many specialist variants |
| GLM | Z.ai | General LLM | Partial | GLM-5, GLM-5.1 | |
| Kimi | Moonshot AI | General LLM | No | | Known for very long context windows |
| Mistral | Mistral AI | General LLM | Partial | Ministral, Mistral, Devstral, Magistral | Mix of open and closed models |
| Grok | xAI | General LLM | Partial | | |
| Sonar | Perplexity AI | Search-optimized | No | Sonar, Sonar Pro, Sonar Reasoning | Fine-tune of LLaMA 3.3; not trained from scratch |
| Apple Foundation Models | Apple | On-device + server | No | On-device (~3B), Server (MoE) | No public branding; powers Apple Intelligence |

And popular products:

| Product | Vendor | Category | Primary model(s) | Includes third-party models? | Notes |
|---|---|---|---|---|---|
| Claude.ai | Anthropic | Chatbot / assistant | Claude | No | |
| Claude API | Anthropic | API | Claude | No | |
| Claude Code | Anthropic | Coding assistant | Claude | No | |
| ChatGPT | OpenAI | Chatbot / assistant | GPT | No | |
| OpenAI API | OpenAI | API | GPT | No | |
| Gemini app | Google | Chatbot / assistant | Gemini | No | |
| Google AI Studio | Google | API / Development | Gemini, Gemma | No | |
| Microsoft Copilot | Microsoft | Productivity assistant | GPT (primary), MAI, Phi | Yes — Claude, Gemini | Routes between models by task |
| GitHub Copilot | Microsoft | Coding assistant | GPT | Yes — Claude and others | Developer IDE integration |
| Perplexity | Perplexity AI | Search-answer engine | Sonar | Yes — GPT, Claude, Gemini, Grok | Core differentiator is search grounding + routing |
| Meta AI | Meta | Chatbot / assistant | Llama, Muse | No | Integrated into WhatsApp, Instagram, etc. |
| Grok app | xAI | Chatbot / assistant | Grok | No | |
| Apple Intelligence | Apple | On-device assistant | Apple Foundation Models | Yes — ChatGPT, Gemini | Privacy-first; runs on-device by default |
| Kimi app | Moonshot AI | Chatbot / assistant | Kimi | No | |

See also Wikipedia’s List of large language models.

Effectiveness

Systems built around LLMs are surprisingly effective! LLMs learned the regularities of language. They apparently reconstructed a lot of the underlying structure—grammar, semantics, pragmatics—from prediction alone. Clearly they are missing human elements such as grounding, embodiment, social cognition, and continuous experience.

A fundamental philosophical question: how much of that absence matters for the tasks we care about?

The interesting question isn’t "are LLMs intelligent" — that’s mostly a definitional debate

The interesting question is: what cognitive and linguistic capacities are necessary for what purposes, and which of those does this architecture actually instantiate?

Open Problems

There are more questions:

What Comes After LLMs?

A great question! After all, the core architecture has been more or less fixed since 2017, and we already have a good sense of its limitations, so some people are looking ahead to the next big thing.

Next Steps

These notes were not intended to be anything more than a very light introduction. We got the mental model:

Raw text (tokens) $\longrightarrow$ Embedding (tokens to vectors) $\longrightarrow$ Transformer blocks (contextual representations) $\longrightarrow$ Output head (distribution over next token) $\longrightarrow$ sample a token $\longrightarrow$ append it $\longrightarrow$ repeat
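To make the tail of that pipeline concrete, here is a hedged PyTorch sketch of the sampling loop; `model` is a placeholder for anything that maps a batch of token ids to next-token logits, not a specific library's API:

```python
# The autoregressive loop: predict a distribution, sample, append, repeat.
import torch

@torch.no_grad()
def generate(model, tokens, n_new, temperature=1.0):
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]                      # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)   # distribution over vocabulary
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token id
        tokens = torch.cat([tokens, next_token], dim=1)       # append and continue
    return tokens
```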

But we were light on details. You’ll no doubt move on to more advanced studies such as:

If you’re impatient: Andrej Karpathy’s "Neural Networks: Zero to Hero" series on YouTube builds GPT from scratch in Python.

Terms to Know

As you prepare for further study, consider making flashcards for:

Further Reading

And of course, read and watch:

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

Coming soon

Summary

We’ve covered:

  1. Language Modeling
  2. Embeddings
  3. The Very Basics of Neural Networks
  4. The Transformer Architecture
  5. Training at Scale
  6. What LLMs Can and Cannot Do
  7. LLMs in the World