Introduction to Large Language Models

In addition to looking at language mathematically and evolutionarily, we can also look at it as a statistical machine learning problem.

Getting Started

We’ve seen that human languages have evolved over millions of years, picking up symbols, words, grammar, pragmatics, social cognition, and much more along the way. Now, let’s explore what happens when we apply machine learning—or more accurately, a statistical learning engine—to billions of pages of human text to see what such a system can learn and do.

We’re only going to scratch the surface of language modeling, but you’ll be able to at least pick up some vocabulary with which to enter your ML or NLP courses, and get a sense of what an LLM actually does.

It’s always a good idea to start with a 3Blue1Brown video if one exists for the topic of interest!

You can also jump-start your learning by reading the Large Language Model article on Wikipedia or taking the LLM Course on Hugging Face. Getting involved with the Hugging Face community is worthwhile too.

Language Modeling

Roughly speaking, language modeling is assigning a probability distribution over sequences of tokens.

Formally, $P(w_1, w_2, w_3, \ldots, w_n)$ is the probability of a sequence, and $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$ lets us predict the next token given all previous tokens. Everything you can do with LLMs—summarization, translation, programming, and reasoning—emerges from doing this well at scale.

Not surprisingly, predictive power can come from throwing machine learning at the problem. Hopefully, we can do this well enough to support a provocative claim:

Excellent prediction is either language understanding or an approximation of understanding, depending on your perspective.

A language model that is really big is called a large language model (LLM). One that isn’t very big at all, and fits on a single device, is called a smol language model.

LLMs can do cool things: summarize documents, translate between languages, answer questions, write and debug code, and carry out multi-step reasoning, among others.

LLMs collapse most of these into a single framework: generation conditioned on a prompt.

Just an introduction today

We won’t go into the details of speech (acoustic models, ASR) or multimodal models (vision-language) today.

Embeddings

Here is a beautiful idea. Represent words (actually tokens) as vectors in a high-dimensional space, where similar words are close together. After all, neural networks operate on vectors of real numbers. If words are vectors, neural networks can do language tasks.

The Naive Approach

The very naive approach to this is One-Hot Encoding. It’s simple but not really helpful. Here’s how it works.

Assume a vocabulary of size V = 50,000 words. In one-hot encoding, each word is a vector of length 50,000: all zeros except a single 1.

This only encodes identity, not relationships: dog and puppy are orthogonal — as far apart as dog and democracy. The dimensionality is enormous, and the representation is sparse and wasteful.
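A tiny sketch in Python makes the problem concrete; the three-word vocabulary here is made up for illustration:

```python
# One-hot encoding: every word is its own axis, so no two words are similar.
import numpy as np

vocab = {"dog": 0, "puppy": 1, "democracy": 2}   # pretend V = 3 instead of 50,000

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

dog, puppy, democracy = one_hot("dog"), one_hot("puppy"), one_hot("democracy")
print(np.dot(dog, puppy))        # 0.0: "dog" is as unrelated to "puppy"...
print(np.dot(dog, democracy))    # 0.0: ...as it is to "democracy"
```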

It’s little more than a useful toy to prime understanding; it’s not used in modern LLMs directly.

The Distributional Hypothesis

You shall know a word by the company it keeps. — J.R. Firth (1957)

The idea here is that words that appear in similar contexts tend to have similar meanings. It’s empirically true (words are defined in terms of other words, and their meanings often vary based on context). It’s also extraordinarily powerful and the philosophical foundation of every word embedding ever built.

Exercise: Relate this idea to the principle of compositionality from the philosophy of language, and to the idea of "meaning as use" from Wittgenstein. How are these ideas similar? How are they different?

Word2Vec

2013 was a big year, with the introduction of Word2Vec.

Let’s oversimplify but capture the intuition here.

The idea is to train a shallow neural network so that each word is a dense vector of (around) 300 real numbers that encode semantic and syntactic relationships.

Training was pretty cool. It employed two interesting ideas:

Embedding Space Properties

A good embedding space will give us several useful properties:

This is what meaning in a vector space looks like!
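You can poke at these properties yourself. Here is a hedged sketch using the gensim library and its downloadable pretrained Google News Word2Vec vectors (an assumption on our part; any pretrained embedding will do, and the exact neighbors you get will vary):

```python
# Exploring a pretrained Word2Vec embedding space (first run downloads ~1.6 GB).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

print(wv.similarity("dog", "puppy"))          # high: nearby in the space
print(wv.similarity("dog", "democracy"))      # low: far apart
print(wv.most_similar("puppy", topn=5))       # nearest neighbors
print(wv.most_similar(positive=["king", "woman"],
                      negative=["man"], topn=3))  # the classic analogy, roughly "queen"
```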

Cosine Similarity

Let’s get technical. The cosine between two vectors v and w is defined as the dot product of v and w divided by the product of their magnitudes. This gives us a measure of similarity that ranges from -1 (opposite) to 1 (identical), with 0 meaning orthogonal (unrelated).

$$ \text{cosine\_similarity}(v, w) = \frac{v \cdot w}{\|v\| \, \|w\|} $$
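In code this is a one-liner; here is a minimal NumPy sketch with made-up vectors:

```python
# Cosine similarity: dot product divided by the product of the magnitudes.
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(v, w))                            # 1.0: same direction
print(cosine_similarity(v, -w))                           # -1.0: opposite direction
print(cosine_similarity(v, np.array([3.0, 0.0, -1.0])))   # 0.0: orthogonal
```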

Early Approaches Were Context-Free

Word2Vec gives one vector per word — always

"bank" (river) = "bank" (financial) — same vector

Language is massively context-dependent

We need representations that shift based on surrounding context

This will be solved by the transformer

Subword Tokenization (Byte-Pair Encoding, BPE)

Don’t tokenize at word boundaries — tokenize at subword boundaries

[Demo moment: show tiktoken on unbelievably or a Turkish verb]

Algorithm: start with characters; iteratively merge the most frequent pair

unbelievably → [un, believ, ably] (approximately)

Why: handles unknown words gracefully, captures morphology, manageable vocabulary size (~50K tokens)

Trade-off: $\textsf{token} \neq \textsf{word}$ — keep this in mind when thinking about what LLMs process
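If you want to try this before the class demo, here is a quick sketch, assuming the tiktoken package is installed (the exact splits depend on which encoding you load):

```python
# Inspecting BPE tokenization with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of the OpenAI BPE vocabularies
for text in ["unbelievably", "antidisestablishmentarianism", "def f(x): return x**2"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")
```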

CLASSWORK
TODO: Tokenizer live demo (tiktoken) — show BPE on compound words, rare words, code
CLASSWORK
TODO: Embedding visualization — nearest neighbors for "king," "bank" (show polysemy problem)

The Very Basics of Neural Networks

What a Neuron Does

Takes a vector of inputs $x$

Computes a weighted sum: $z = w \cdot x + b$

Applies a nonlinearity: output = $f(z)$ where $f$ is (e.g.) ReLU = $\max(0, z)$

One neuron = one learned feature detector
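Here is that single neuron as a few lines of NumPy; the weights and inputs are made up:

```python
# One neuron: weighted sum plus bias, passed through a nonlinearity.
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, -1.2, 3.0])   # vector of inputs
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias

z = np.dot(w, x) + b             # weighted sum
print(z, relu(z))                # pre-activation and output
```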

Why Nonlinearity?

Without it, stacking layers = one big linear transformation (provable)

Nonlinearity lets networks approximate arbitrary functions (Universal Approximation Theorem — don’t need the proof, just the fact)

ReLU is popular because it’s fast, doesn’t vanish for large inputs, and works well in practice

A Layer

Stack N neurons operating in parallel on the same input

Output: a vector of length N

Each neuron learns a different feature

A Deep Network

Stack L layers: input → layer 1 → layer 2 → … → layer L → output

Each layer transforms the representation

Early layers: simple patterns; deep layers: abstract features

This is empirically observed in vision networks; believed (with more uncertainty) in language networks
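As a sketch, a whole deep network is just that neuron computation repeated: each layer is a matrix of weights (one row per neuron), and layers are composed. The sizes below are made up:

```python
# A deep network as stacked layers: x -> layer 1 -> layer 2 -> ... -> output.
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]                       # input -> two hidden layers -> output
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=8)
for W, b in params:                          # each layer transforms the representation
    x = relu(W @ x + b)
print(x.shape)                               # (4,)
```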

Training: Gradient Descent

Define a loss function: how wrong is the current prediction?

For language modeling: cross-entropy loss between predicted token distribution and actual next token

Backpropagation: compute gradient of loss w.r.t. every parameter via chain rule

Gradient descent: nudge every parameter in the direction that reduces loss

Repeat for billions of examples
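To make the loop concrete, here is a hedged PyTorch sketch of a single training step. The "model" is a toy embedding-plus-linear next-token predictor and the data is random; this is the shape of the computation, not anyone's actual training code:

```python
# One step of gradient descent on the next-token prediction objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 16))    # a fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from its prefix

logits = model(inputs)                            # (batch, seq, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # backpropagation via the chain rule
optimizer.step()                                  # nudge every parameter downhill
optimizer.zero_grad()
print(loss.item())                                # starts near ln(100), about 4.6
```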

Reinforcement learning from human feedback (RLHF) — mentioned briefly, not detailed

Constitutional AI — mentioned briefly, not detailed

Key Parameters

Learning rate: step size — too big = diverge, too small = never converge

Batch size: how many examples per gradient step

Epochs: passes through the dataset (LLMs typically train for ~1 epoch on huge datasets)

What "Learning" Means Here

The network has no symbolic rules

It has billions of numbers (weights) that get adjusted until predictions are good

The "knowledge" is distributed across all the weights — not localized

This is deeply unlike classical AI / expert systems — and unlike how we naively think of memory

Scale

A "large" language model: 7B to 700B+ parameters

Each parameter is one floating point number

GPT-2 (2019): 1.5B parameters — could fit on a laptop

GPT-4 (estimated): ~1T parameters — requires thousands of GPUs

Why does scale matter? We’ll revisit this in the section on What LLMs can and cannot do

The Transformer Architecture

The Problem with Sequential Models

Before transformers: RNNs (Recurrent Neural Networks) processed tokens one at a time, left to right

Information about token 1 must "travel" through every subsequent hidden state to reach token 500

Result: vanishing gradients — early context gets diluted or lost

Also: sequential processing = no parallelism = slow training

"Attention Is All You Need" (Vaswani et al., 2017)

Proposed replacing recurrence entirely with attention mechanisms

The transformer processes all tokens simultaneously

Every token can look at every other token directly

Maybe they should have said transformers are all you need

The Self-Attention Mechanism — The Core Idea

For each token, compute three vectors: a query (Q), a key (K), and a value (V)

Attention score between token i and token j:

$$ \text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}} $$

Softmax over all $j$ → attention weights (sum to 1)

Output for token $i$ = weighted sum of all $V_j$

In plain English: each token figures out how much to "attend to" every other token, then mixes their values proportionally
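Here is the whole mechanism as a short NumPy sketch for a single attention head; all dimensions and weights are made up:

```python
# Single-head self-attention: scores, softmax weights, weighted sum of values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # a query, key, and value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # score(i, j) = Q_i . K_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # output_i = sum_j weights_ij * V_j

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 8)
```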

What Attention Learns

Subject-verb agreement: "The dogs that chased the cat are…" — "are" attends to "dogs" not "cat"

Coreference: "The trophy didn’t fit in the suitcase because it was too big" — "it" attends to "trophy"

Syntactic structure, semantic roles, discourse relations

These were hand-engineered features in classical NLP — here they emerge from gradient descent

Multi-Head Attention

Run h parallel attention operations ("heads") with different Q/K/V matrices

Each head can learn a different type of relationship

Concatenate and project the results

Typical model: 12–96 heads per layer

The Full Transformer Block

Input

Layer Norm

Multi-Head Self-Attention ← residual connection back to input

Layer Norm

Feed-Forward Network (two linear layers + ReLU) ← residual connection

Output (same shape as input)

Stack 12 to 96+ of these blocks. That’s the model.
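In code, one block looks roughly like the following PyTorch sketch (a pre-norm variant; the dimensions are illustrative and not those of any particular model):

```python
# One transformer block: attention and feed-forward, each with layer norm and a residual.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual connection
        x = x + self.ff(self.norm2(x))                      # residual connection
        return x                                            # same shape as input

x = torch.randn(2, 10, 64)      # (batch, tokens, d_model)
print(Block()(x).shape)         # torch.Size([2, 10, 64])
```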

Positional Encoding

Self-attention is position-agnostic — "cat bit dog" and "dog bit cat" look the same without positional info

Solution: add a positional signal to each token embedding before the first layer

Original paper: sinusoidal functions (elegant, fixed)

Modern models: learned positional embeddings, or RoPE (Rotary Position Embedding)
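The original sinusoidal scheme is easy to write down; here is a minimal NumPy sketch of it:

```python
# Sinusoidal positional encodings, added to token embeddings before the first block.
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

print(positional_encoding(50, 16).shape)             # (50, 16)
```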

Context Window

The maximum number of tokens the model can attend to at once

Early GPT: 512 tokens (~400 words)

Modern models: 128K–1M tokens

Attention cost scales quadratically with context length: O(n²) — the main computational bottleneck

Active research area: sparse attention, linear attention approximations

The Decoder-Only Architecture (GPT-style)

Original transformer: encoder + decoder (for translation)

Language modeling uses decoder-only: predict next token, left to right

Causal masking: when computing attention for token i, mask out tokens i+1, i+2… (can’t look at the future)

This is what makes autoregressive generation work: generate one token, append it, generate the next
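Causal masking itself is just a triangular pattern of minus-infinities added to the attention scores before the softmax. A tiny NumPy sketch with random scores:

```python
# Causal mask: token i may attend only to tokens 0..i.
import numpy as np

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))       # raw attention scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)          # True above the diagonal
scores[future] = -np.inf                                     # future positions vanish...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)               # ...after the softmax
print(np.round(weights, 2))                                  # lower-triangular weights
```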

CLASSWORK
TODO: Attention visualization — BertViz or similar, show what "it" attends to in ambiguous sentences

Training at Scale

Data

Common Crawl: a scrape of much of the web — petabytes of text

Books, Wikipedia, code repositories, scientific papers, forums

Preprocessing: deduplication, quality filtering, language identification, toxicity filtering

The data mixture matters enormously — and is often not fully disclosed

This is the "company it keeps" principle from distributional semantics, applied at civilizational scale

The Objective, Revisited

Autoregressive language modeling: predict next token

For each training example: run the sequence through the model, compare predicted distribution to actual token, compute cross-entropy loss, backpropagate

Do this for roughly 10²² tokens

Compute Requirements

Training a frontier model: thousands of GPUs for months

Estimated cost: $10M–$100M+ for a single training run

Why GPUs? Matrix multiplication is what they’re designed for — and transformers are basically stacked matrix multiplications

This is a genuine barrier to entry and a geopolitical resource

Emergence

Small models: predict tokens, nothing more impressive

Scale up: few-shot learning appears. Then chain-of-thought. Then code generation. Then...

Emergent abilities: capabilities that appear abruptly at scale thresholds, not predicted by smooth extrapolation

Hotly debated: are these truly emergent, or artifacts of how we measure?

The honest answer: we don’t fully understand why scale works as well as it does

Fine-Tuning and RLHF

Pretrained model: knows language, not how to be helpful or safe

Supervised fine-tuning (SFT): train on curated examples of good assistant behavior

Reinforcement Learning from Human Feedback (RLHF): humans rank model outputs → train a reward model → optimize policy against that reward

This is how GPT-3 → InstructGPT → ChatGPT

The details are a whole lecture on their own

What LLMs Can and Cannot Do

We’ve seen that LLMs are characterized by three conceptual handles:

So how does this translate into what LLMs can and cannot do?

What They’re Genuinely Good At

We can all agree that LLMs can do a lot of things well:

Shortcomings

They’re not perfect. Let’s be aware of limitations.

Hallucination

Models often generate confident, fluent, false claims.

Why? The objective is to predict plausible text, not true text. Also, if the training data contains errors, the model can’t detect them.

CLASSWORK
TODO: Live hallucination demo — ask a model a plausible-but-false question about an obscure topic; show the confident wrong answer

Brittleness

Small prompt changes → large output changes

"Solve this step by step" dramatically outperforms "Solve this" on math — same model

The model is sensitive to surface form in ways humans aren’t

No Persistent Memory

Context window is it. No learning between conversations (without fine-tuning)

Every conversation starts fresh

Systematic Generalization

Struggle with novel compositional combinations ("the blue sphere to the left of the red cube that is behind the green cylinder")

May be learning shortcuts rather than compositional rules

Arithmetic and Formal Reasoning

Surprisingly bad at multi-step arithmetic unless given an explicit scratchpad

"Chain of thought" prompting helps significantly — why?

Probably because it externalizes intermediate steps, letting the model condition on them

CLASSWORK
TODO: Chain-of-thought demo — same math problem with and without "let’s think step by step"

The Grounding Problem Again

In natural language, words get meaning from perception, action, joint attention, causal anchoring, and more. LLMs have none of this: they simply have distributional co-occurrence at massive scale. But it’s remarkable how well this seems to work.

But is it understanding? That’s a debate (and depends on how you define understanding).

Exercise: Read about Searle’s Chinese Room: a system that produces appropriate outputs just by following rules. Is this what LLMs do? Does it matter? Do you think human understanding is also a form of sophisticated pattern matching, just implemented in neurons?

The Stochastic Parrot Critique

LLMs are "stochastic parrots" — form without meaning

They remix training data without comprehension

The critique is useful; the parrot metaphor is probably too strong

Parrots don’t write novel proofs, debug unfamiliar code, or explain jokes

The truth is probably somewhere uncomfortable in between

Simulating Emotions

Here’s a short video by Parth

LLMs in the World

Where We’ve Been

A brief timeline of key developments in language models:

Today’s Models and Products

As of mid-2026, here are some popular models:

| Family | Producer | Type | Open-weight | Tiers / Variants | Notes |
|---|---|---|---|---|---|
| Claude | Anthropic | General LLM | No | Haiku, Sonnet, Opus, Mythos | Mythos is a new tier above Opus |
| Gemini | Google DeepMind | General LLM | No | Flash Lite, Flash, Pro, Ultra | Deep Think is a mode of Pro/Ultra, not a separate model |
| Gemma | Google DeepMind | Small/edge LLM | Yes | 2B, 9B, 27B | Open-weight counterpart to Gemini |
| GPT | OpenAI | General LLM | No | Mini, Nano, standard, Pro | GPT-5.x current generation |
| Phi | Microsoft | Small/edge LLM | Yes | Phi-4, Phi-4-mini, Phi-4-multimodal | Optimized for on-device / edge |
| MAI | Microsoft | Specialized | No | Transcribe-1, Voice-1, Image-2 | No general-purpose frontier model yet |
| Llama | Meta | General LLM | Yes | Scout, Maverick | Behemoth announced but unreleased |
| Muse | Meta | General LLM | No | Spark | Proprietary; from Meta Superintelligence Labs |
| Nemotron | NVIDIA | General LLM | Partial | Nano, Super, Ultra | |
| DeepSeek | DeepSeek | General + Reasoning | Yes | V3, V4, R1 | V/R series are different model types, not just tiers |
| Qwen | Alibaba Cloud | General LLM | Yes | Qwen3, Qwen3-Coder, Qwen3-VL | Rapid versioning; many specialist variants |
| GLM | Z.ai | General LLM | Partial | GLM-5, GLM-5.1 | |
| Kimi | Moonshot AI | General LLM | No | | Known for very long context windows |
| Mistral | Mistral AI | General LLM | Partial | Ministral, Mistral, Devstral, Magistral | Mix of open and closed models |
| Grok | xAI | General LLM | Partial | | |
| Sonar | Perplexity AI | Search-optimized | No | Sonar, Sonar Pro, Sonar Reasoning | Fine-tune of LLaMA 3.3; not trained from scratch |
| Apple Foundation Models | Apple | On-device + server | No | On-device (~3B), Server (MoE) | No public branding; powers Apple Intelligence |

And popular products:

| Product | Vendor | Category | Primary model(s) | Includes third-party models? | Notes |
|---|---|---|---|---|---|
| Claude.ai | Anthropic | Chatbot / assistant | Claude | No | |
| Claude API | Anthropic | API | Claude | No | |
| Claude Code | Anthropic | Coding assistant | Claude | No | |
| ChatGPT | OpenAI | Chatbot / assistant | GPT | No | |
| OpenAI API | OpenAI | API | GPT | No | |
| Gemini app | Google | Chatbot / assistant | Gemini | No | |
| Google AI Studio | Google | API / Development | Gemini, Gemma | No | |
| Microsoft Copilot | Microsoft | Productivity assistant | GPT (primary), MAI, Phi | Yes — Claude, Gemini | Routes between models by task |
| GitHub Copilot | Microsoft | Coding assistant | GPT | Yes — Claude and others | Developer IDE integration |
| Perplexity | Perplexity AI | Search-answer engine | Sonar | Yes — GPT, Claude, Gemini, Grok | Core differentiator is search grounding + routing |
| Meta AI | Meta | Chatbot / assistant | Llama, Muse | No | Integrated into WhatsApp, Instagram, etc. |
| Grok app | xAI | Chatbot / assistant | Grok | No | |
| Apple Intelligence | Apple | On-device assistant | Apple Foundation Models | Yes — ChatGPT, Gemini | Privacy-first; runs on-device by default |
| Kimi app | Moonshot AI | Chatbot / assistant | Kimi | No | |

See also Wikipedia’s List of large language models.

Effectiveness

Systems built around LLMs are surprisingly effective! LLMs learned the regularities of language. They apparently reconstructed a lot of the underlying structure—grammar, semantics, pragmatics—from prediction alone. Clearly they are missing human elements such as grounding, embodiment, social cognition, and continuous experience.

A fundamental philosophical question: how much of that absence matters for the tasks we care about?

The interesting question isn’t "are LLMs intelligent" — that’s mostly a definitional debate

The interesting question is: what cognitive and linguistic capacities are necessary for what purposes, and which of those does this architecture actually instantiate?

Open Problems

There are more questions:

What Comes After LLMs?

A great question! After all, the core architecture has been more or less fixed since 2017, and we already have a good sense of its limitations, so some people are looking ahead to the next big thing.

Next Steps

These notes were not intended to be anything more than a very light introduction. We got the mental model:

Raw text (tokens) $\longrightarrow$ Embedding (tokens to vectors) $\longrightarrow$ Transformer blocks (contextual representations) $\longrightarrow$ Output head (distribution over next token) $\longrightarrow$ sample a token $\longrightarrow$ append it $\longrightarrow$ repeat
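To make the tail of that pipeline concrete, here is a hedged PyTorch sketch of the sampling loop; `model` is a placeholder for anything that maps a batch of token ids to next-token logits, not a specific library's API:

```python
# The autoregressive loop: predict a distribution, sample, append, repeat.
import torch

@torch.no_grad()
def generate(model, tokens, n_new, temperature=1.0):
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :]                      # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)   # distribution over vocabulary
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token id
        tokens = torch.cat([tokens, next_token], dim=1)       # append and continue
    return tokens
```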

But we were light on details. You’ll no doubt move on to more advanced studies such as:

If you’re impatient: Andrej Karpathy’s "Neural Networks: Zero to Hero" series on YouTube builds GPT from scratch in Python.

Terms to Know

As you prepare for further study, consider making flashcards for:

Further Reading

And of course, read and watch:

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

Coming soon

Summary

We’ve covered:

  1. Language Modeling
  2. Embeddings
  3. The Very Basics of Neural Networks
  4. The Transformer Architecture
  5. Training at Scale
  6. What LLMs Can and Cannot Do
  7. LLMs in the World