Lecture 24
Prompts and In-Context Learning
1. Prompting as the Interface to LLMs
Prompting is the main way we “program” modern large language models (LLMs).
Instead of changing the model weights, we:
- Fix the model parameters, and
- Change the input text (the prompt) to define the task and desired behavior.
1.1 What is a Prompt?
A prompt is any text we feed into the model before it generates an output.
It can include:
- Instructions (e.g., “Summarize the following article in one paragraph.”)
- Examples of the task (e.g., input–output pairs)
- Constraints on style or format (e.g., “Answer in JSON.”)
The model then continues the text by sampling or choosing the most likely next tokens.
1.2 Prompting vs. Traditional Supervised Learning
Traditional supervised learning:
- Learns a mapping from input (x) to label (y) via gradient descent.
- Requires a labeled dataset and task-specific training.
Prompting with a frozen LLM:
- Keeps the model weights fixed.
- Encodes the task in the prompt structure.
- Uses the model as a general conditional distribution
\(p_\theta(\text{output} \mid \text{prompt})\).
The key idea: a single pre-trained model can perform many different tasks just by changing the prompt.
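To make the conditional-distribution view concrete, here is a minimal sketch (assuming the Hugging Face transformers library and GPT-2 as a stand-in for a larger frozen backbone) that scores candidate outputs under \(p_\theta(\text{output} \mid \text{prompt})\); the prompt and labels are illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Minimal sketch: score candidate outputs under a frozen LM, i.e. log p(output | prompt).
# GPT-2 is only a stand-in for a larger model; no weights are ever updated.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def output_logprob(prompt: str, output: str) -> float:
    """Sum of token log-probabilities of `output` conditioned on `prompt`.
    Assumes the prompt/output boundary falls on a token boundary (true here)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + output, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # The token at position `pos` is predicted from the logits at `pos - 1`.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = 'Review: "This movie was amazing!"\nSentiment:'
for label in [" positive", " negative"]:
    print(label, output_logprob(prompt, label))
```

Changing only the prompt string turns the same frozen model into a different "task solver": the weights define the distribution, the prompt selects the behavior.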
2. Zero-Shot and Few-Shot Learning
2.1 Zero-Shot Learning
Zero-shot learning: the model is asked to perform a task without seeing any examples of that task in the prompt.
The prompt typically includes:
- A short natural-language description of the task, and
- The specific instance to solve.
Example (question answering):
Answer the question using a short phrase.
Question: Where was Tom Brady born?
Answer:
The model uses its internal knowledge and language understanding to complete the answer, even though we never fine-tuned it specifically for this QA dataset.
A well-known illustration of zero-shot behavior comes from summarization with the prompt “TL;DR:”. Even without any fine-tuning or task-specific examples, GPT-2 can produce a reasonable one-sentence summary simply by appending “TL;DR:” to the end of an article. During pre-training, the model encountered this construction frequently on the web, where “TL;DR:” conventionally introduces a brief summary. As a result, a single well-chosen prompt is enough to activate this implicit knowledge and elicit summary-like behavior in a purely zero-shot setting.
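A minimal sketch of this trick, assuming the transformers text-generation pipeline with GPT-2; the article text is a placeholder, not real data.

```python
from transformers import pipeline

# Minimal sketch of zero-shot summarization via the "TL;DR:" prompt.
generator = pipeline("text-generation", model="gpt2")

article = (
    "The city council voted on Tuesday to expand the bike-lane network, "
    "citing a steady rise in cycling commuters. "
    "Construction is expected to begin next spring."
)

prompt = article + "\nTL;DR:"
result = generator(prompt, max_new_tokens=30, do_sample=False)
# The pipeline returns the prompt plus the continuation; keep only the continuation.
print(result[0]["generated_text"][len(prompt):])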
2.2 Few-Shot Learning
Few-shot learning: we prepend a few input–output examples of the task directly into the prompt.
Example (sentiment classification):
Review: "This movie was amazing!"
Sentiment: positive
Review: "The plot was boring and predictable."
Sentiment: negative
Review: "Performances were mixed but I liked the atmosphere."
Sentiment:
The examples in the prompt implicitly define:
- The task (sentiment analysis),
- The label space (e.g., positive/negative),
- The output format.
The model then “continues the pattern” for the new review.
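As a concrete illustration, here is a small sketch of assembling such a few-shot prompt programmatically; the helper name and the Review:/Sentiment: fields are illustrative choices, not a fixed API.

```python
# Minimal sketch: assemble a few-shot sentiment prompt from labeled examples.
examples = [
    ("This movie was amazing!", "positive"),
    ("The plot was boring and predictable.", "negative"),
]

def build_few_shot_prompt(examples, query):
    # Each demonstration uses the same fields and ordering; the query omits the label.
    parts = [f'Review: "{text}"\nSentiment: {label}' for text, label in examples]
    parts.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(examples, "Performances were mixed but I liked the atmosphere.")
print(prompt)
```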
2.3 Prompts as Task Descriptions
From the model’s perspective, all of these are just token sequences.
However, the pattern of tokens in the prompt:
- Encodes how to interpret the input, and
- Suggests how to format the output.
As models scale, they increasingly succeed at zero- and few-shot tasks without any extra training, which is one of the surprising emergent abilities of LLMs.
2.4 Prompt design: good patterns and failure modes
In practice, prompt design matters a lot. Small changes in phrasing can lead to surprisingly large differences in performance.
Some helpful patterns:
- Be explicit about the task and output format (e.g., “Classify the sentiment as positive or negative.” rather than “What do you think?”).
- Provide clear separators between examples (e.g., using Input: ... / Output: ... or Q: / A:).
- Use consistent labeling and formatting across all examples in a few-shot prompt.
Common failure modes:
- Under-specified prompts: the model guesses the task or label space and may produce inconsistent outputs.
- Leading or biased phrasing: prompts that contain strong opinions (e.g., “Obviously this is bad, right?”) can push the model toward sycophantic or biased responses.
- Prompt length vs. context budget: adding too many examples can exceed the context window or push important information (like the final question) too far from the end of the prompt; a simple guard is to count tokens before sending the prompt, as in the sketch below.
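Here is a minimal sketch of that guard; GPT-2's tokenizer and its 1024-token context length are assumptions, so substitute the values for the model you actually use.

```python
from transformers import GPT2Tokenizer

# Minimal sketch: check a few-shot prompt against the model's context budget.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
MAX_CONTEXT = 1024          # GPT-2's context length; replace with your model's limit
RESERVED_FOR_OUTPUT = 64    # leave room for the model's answer

def fits_in_context(prompt: str) -> bool:
    n_tokens = len(tokenizer(prompt).input_ids)
    return n_tokens + RESERVED_FOR_OUTPUT <= MAX_CONTEXT

example = 'Review: "This movie was amazing!"\nSentiment: positive\n\n'
prompt = example * 10 + 'Review: "Great soundtrack, weak script."\nSentiment:'
print(fits_in_context(prompt))
```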
Prompting is therefore both a powerful interface and a source of instability: good prompts can elicit impressive behavior, but brittle prompts can hide the model’s true capabilities or amplify biases.
3. In-Context Learning (ICL)
3.1 Definition
In-context learning is the phenomenon where a language model appears to “learn” a new mapping from examples provided in the context window, without explicitly updating its parameters.
- We show the model a few examples \((x_i, y_i)\) in the prompt.
- We then give it a new \(x_{\text{test}}\) and ask it to generate \(y_{\text{test}}\).
- All adaptation happens implicitly through the forward pass on the prompt.
This is sometimes described as “learning via inference rather than via gradient updates.”
3.2 Example: Translation via ICL
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>
The model:
- Recognizes the pattern (English phrase, arrow, French phrase).
- Produces an appropriate French translation for “cheese”.
No weights were updated; the examples were only in the prompt.
3.3 Relationship to Few-Shot Learning
“Few-shot prompting” and “in-context learning” are closely related:
- Few-shot prompting = how we construct prompts with examples.
- In-context learning = the behavior where the model uses those examples to solve new inputs.
Empirically:
- Performance often improves as we add more relevant in-context examples (up to the context window limit).
- For many tasks, a large LLM with good few-shot prompts can rival or beat smaller, task-specific fine-tuned models.
4. Chain-of-Thought (CoT) Prompting
4.1 Motivation
Standard prompts often ask the model to directly output an answer:
Q: 23 − 7 = ?
A:
This can work, but for more complex reasoning (multi-step arithmetic, logic puzzles, word problems), models may:
- Rely on shallow heuristics.
- Make mistakes that a detailed derivation would avoid.
Chain-of-thought (CoT) prompting encourages the model to output the reasoning steps, not just the final answer.
4.2 Example of CoT
Instead of:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he now have?
A:
We prompt the model to reason step-by-step:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he now have?
A: Let's think step by step.
Roger has 5 tennis balls.
He buys 2 cans, each with 3 balls, so that is 2 * 3 = 6 balls.
In total, he has 5 + 6 = 11 balls.
The answer is 11.
When given more problems with this style of answer, the model tends to:
- Generate intermediate reasoning steps.
- Reach correct final answers more often.
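A minimal sketch of how CoT prompting is typically wired up in code: wrap the question with a reasoning trigger and parse the final answer out of the completion. The completion below is hard-coded from the example above; in practice it would come from the model.

```python
import re

COT_TRIGGER = "A: Let's think step by step."

def make_cot_prompt(question):
    # Append the chain-of-thought trigger so the model writes out its reasoning first.
    return f"Q: {question}\n{COT_TRIGGER}\n"

def extract_final_answer(completion):
    # Convention from the example above: the reasoning ends with "The answer is X."
    match = re.search(r"[Tt]he answer is\s*([-\d.,]+)", completion)
    return match.group(1).rstrip(".,") if match else None

completion = (
    "Roger has 5 tennis balls.\n"
    "He buys 2 cans, each with 3 balls, so that is 2 * 3 = 6 balls.\n"
    "In total, he has 5 + 6 = 11 balls.\n"
    "The answer is 11."
)
print(extract_final_answer(completion))  # 11
```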
4.3 Why CoT Helps
Chain-of-thought prompting:
- Encourages the model to expand the computation into multiple steps.
- Reduces the chance that it jumps to a wrong shortcut.
- Makes the model’s reasoning inspectable to the user (you can check intermediate steps).
However:
- CoT may also increase verbosity.
- Correct-looking reasoning can still be wrong or “hallucinated,” so it must be interpreted carefully.
5. Reasoning Models
The lecture also touches on reasoning models, which are architectures and training methods explicitly designed to improve reasoning ability.
5.1 Limitations of Vanilla LLMs
Base LLMs:
- Are trained with a simple next-token prediction objective.
- Have no explicit semantics, task goals, or reward signal.
They can still show surprising reasoning behaviors, but:
- They may be brittle.
- They can hallucinate plausible-but-wrong chains of thought.
This motivates specialized designs for structured reasoning.
5.2 Ideas Behind Reasoning Models
Reasoning models often introduce one or more of:
- Decomposition: breaking a complex problem into smaller steps or subproblems (e.g., “tree of thoughts”, multi-step planning).
- Tool use: allowing the model to call external tools (calculators, search engines, code execution) during reasoning.
- Verification: checking intermediate steps or final answers using verifiable signals (tests, constraints, or external systems).
- Deliberate thinking time: allocating more computation to “harder” questions (e.g., sampling multiple reasoning paths, then choosing the best, as in the self-consistency sketch below).
The general theme: augment LLMs so that reasoning paths become more reliable, interpretable, and verifiable compared to naive next-token prediction.
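One concrete instance of “sampling multiple reasoning paths, then choosing the best” is self-consistency: sample several chains of thought and majority-vote on their final answers. Below is a minimal sketch in which sample_completion is a placeholder for a stochastic call (temperature > 0) to whatever model you use.

```python
import re
from collections import Counter

def sample_completion(prompt):
    """Placeholder for a stochastic model call that returns one chain of thought."""
    raise NotImplementedError

def final_answer(completion):
    # Same convention as before: the reasoning ends with "The answer is X."
    match = re.search(r"[Tt]he answer is\s*([-\d.,]+)", completion)
    return match.group(1).rstrip(".,") if match else None

def self_consistency(prompt, n_samples=10):
    # Sample several independent reasoning paths, then majority-vote on the answers.
    answers = [final_answer(sample_completion(prompt)) for _ in range(n_samples)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```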
6. Soft Prompting
6.1 From Hard Prompts to Soft Prompts
So far, we considered hard prompts:
- Human-written text in natural language.
- Interpretable, but sometimes brittle or suboptimal.
Soft prompting replaces or augments these with learned prompt vectors:
- Instead of writing a textual prefix like “You are a helpful assistant.”, we insert a sequence of trainable embedding vectors before the input tokens.
- These vectors are optimized via gradient descent on a downstream task, while keeping the main model parameters frozen.
6.2 How Soft Prompting Works (High Level)
Conceptually:
- Take the pre-trained model and freeze its weights.
- Introduce a new set of parameters: a sequence of \(k\) prompt embeddings \(\{p_1, \dots, p_k\}\).
- For each training example, feed the model the concatenation \([p_1, \dots, p_k, x_1, x_2, \dots, x_T]\), where \(x_i\) are the embeddings of the input tokens.
- Train only the prompt embeddings to minimize a task loss (e.g., a classification or generation objective).
Result:
- The learned prompt acts like a continuous control knob that steers the model’s behavior for that task.
- We get a small, task-specific parameter set (the prompt) instead of full fine-tuning.
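Here is a minimal sketch of this procedure with a frozen GPT-2 backbone via the transformers library; the prompt length k, initialization scale, learning rate, and loss setup are illustrative choices rather than required settings of the method.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Minimal sketch of soft prompting (prompt tuning) with a frozen GPT-2 backbone.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # freeze the backbone; only the soft prompt is trained

k, hidden = 20, model.config.n_embd
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(k, hidden))  # the only trainable parameters
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def train_step(text):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.transformer.wte(input_ids)                  # (1, T, hidden)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    # Mask out the k prompt positions so the LM loss covers only the real tokens.
    labels = torch.cat([torch.full((1, k), -100, dtype=torch.long), input_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: nudge the frozen model toward a sentiment-tagging format.
print(train_step('Review: "This movie was amazing!"\nSentiment: positive'))
```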
6.3 Benefits of Soft Prompting
Soft prompting is an example of parameter-efficient adaptation:
- Storage-efficient: we can keep one base model plus many small soft prompts for different tasks or users.
- Computation-efficient: training is lighter since we only update a small number of parameters.
- Modular: task behavior is encapsulated in the prompt; swapping prompts can quickly change model behavior.
Trade-offs:
- Soft prompts are not human-readable (they are dense vectors).
- They rely on a strong pre-trained backbone; if the base model lacks relevant knowledge, soft prompting alone may not be enough.
7. Key Takeaways
- Prompting is the main interface to modern LLMs: we steer behavior by crafting inputs rather than changing weights.
- Zero-shot and few-shot learning show that large models can adapt to new tasks from instructions and a handful of examples.
- In-context learning lets models “learn” a task inside the context window, using examples as implicit training data.
- Chain-of-thought prompting can significantly improve performance on complex reasoning tasks by encouraging the model to show its work.
- Reasoning models and related methods seek to formalize and enhance the reasoning abilities of LLMs through decomposition, tool use, and verification.
- Soft prompting provides a parameter-efficient way to specialize a frozen LLM using learned prompt embeddings, enabling flexible adaptation and personalization without full fine-tuning.
Reading: Li & Liang (2021), “Prefix-Tuning: Optimizing Continuous Prompts for Generation”
- Prefix-tuning as continuous prompts: instead of learning all model weights, the paper learns a short continuous prefix of activations that every token can attend to, acting like “virtual tokens” at every layer of the Transformer while the base LM stays frozen. This is more expressive than just learning text prompts.
- Parameter-efficiency vs. fine-tuning/adapters: on table-to-text generation and summarization, prefix-tuning matches or nearly matches full fine-tuning while training only about 0.1–2% as many parameters, and is more parameter-efficient than adapter-tuning (which usually adds ~2–4%).
- Low-data and extrapolation benefits: in very small-data regimes and when evaluating on unseen topics/domains (e.g., unseen WebNLG categories or news-to-sports summarization), prefix-tuning often outperforms full fine-tuning, suggesting that freezing the base LM can actually help generalization.
- Design choices for continuous prompts: the paper empirically studies several design choices: optimal prefix length (too long overfits), why tuning the prefix at all layers beats tuning only the embedding layer (which is closer to discrete prompt tuning), and how initializing prefixes with activations of real words like “summarize” or “table-to-text” works better than random initialization.
- Personalization and multi-user setting: because only small prefixes need to be stored, the authors highlight prefix-tuning as attractive for per-user personalization and for batching many users’ requests with different prefixes on a single shared LM without cross-contaminating data.