Reflecting on Richard Sutton’s interview with Dwarkesh Patel, I was struck by how much the two participants talked past each other and failed to find common ground. In particular, they couldn’t agree on the power of reinforcement learning, even though it’s such a large part of the LLM workflow.
To be clear, it isn’t the large language model that engages in reinforcement learning; it’s the person applying the LLM to their task. That’s all that prompt engineering is. Here’s the workflow (sketched in code after the list):
1. Identify a goal.
2. Hypothesize a prompt that leads the LLM to satisfy the goal.
3. Try out the prompt and generate an outcome.
4. Observe the gap between the outcome and the intended goal.
5. Repeat steps 1-4 (yes, including step 1, as you might refine the goal as well as the prompt) until the gap becomes acceptably small.
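To make the loop concrete, here is a minimal sketch in Python. Everything in it is a placeholder of my own devising: `call_llm` stands in for whatever model client you actually use, and `gap` and `refine` stand in for the human judgment that really drives the loop.

```python
# A minimal sketch of the prompt-refinement loop above.
# call_llm, gap, and refine are hypothetical placeholders, not a real API.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (step 3)."""
    return f"(model output for: {prompt})"

def gap(outcome: str, goal: str) -> float:
    """Placeholder for judging how far the outcome misses the goal (step 4).
    In practice this is usually a person reading the output."""
    return 0.0  # pretend the first attempt succeeds

def refine(prompt: str, goal: str, outcome: str) -> tuple[str, str]:
    """Placeholder for revising the prompt, and possibly the goal itself (step 5)."""
    return prompt, goal

goal = "summarize the interview in three bullet points"  # step 1
prompt = f"Please {goal}."                               # step 2
acceptable = 0.1

while True:
    outcome = call_llm(prompt)                 # step 3: try the prompt
    if gap(outcome, goal) <= acceptable:       # step 4: observe the gap
        break
    prompt, goal = refine(prompt, goal, outcome)  # step 5: refine and repeat

print(outcome)
```

The point of the sketch is that the while loop lives outside the model call: the LLM only ever executes step 3.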
This process, commonly known in its isolated, bureaucratized form as “science”, is fundamental to the learning experience. It’s the same shape as John Dewey’s description of reflective thought, as Honey and Mumford’s learning cycle, and as the Shewhart cycle (Plan-Do-Check-Act). Start at step 4 and it’s Boyd’s Observe-Orient-Decide-Act loop. And it’s fundamental to how LLM-assisted work unfolds. But because the LLM is incapable of running the cycle itself, it’s left to the person to do all the learning.