The Rise of the Reward Shaper
From Prompt Engineering to Outcome Engineering
A rule of thumb is quietly circulating in AI circles: the complexity of the tasks we demand from AI is doubling roughly every seven months.
Our first answer to this challenge was the Harness Era—building ever-better scaffolding around large language models. Our next, now upon us, is to forge the intelligence itself.
1. The Harness Era (2023–2024)
For the last two years, building an agent meant perfecting a "harness"—an abstraction layer of prompts, memory, and tools to tame a powerful but opaque model.
| Harness Pillar | Purpose | Typical Techniques |
| --- | --- | --- |
| Prompts | Provide core instructions & persona | System/role prompts, structured templates |
| Memory / Context | Inject relevant history | Vector search, RAG, short-term caches |
| Tools | Act on the outside world | Function calling, API orchestration, code execution |
We became modern-day spell-crafters, coaxing consistent behavior out of non-deterministic "people spirits" conjured from the internet. But this approach had a hard ceiling: the model's core intelligence remained a lab-controlled black box, and progress was gated by the next major release.
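To make the three pillars concrete, here is a minimal sketch of a harness in Python. The `call_model` client, the tool-call convention, and the prompt wording are all illustrative placeholders, not a reference implementation.

```python
# Minimal harness sketch: prompts + memory + tools wrapped around an opaque model.
# `call_model` stands in for whatever chat-completion client you already use.

from dataclasses import dataclass, field
from typing import Callable

SYSTEM_PROMPT = "You are a concise support agent. Use tools when needed."

@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]           # pillar 3: name -> callable tool
    memory: list[str] = field(default_factory=list)  # pillar 2: short-term context cache

    def run(self, user_msg: str, call_model: Callable[[str], str]) -> str:
        # Pillar 1: assemble prompt + memory into the context window.
        context = "\n".join([SYSTEM_PROMPT, *self.memory, f"User: {user_msg}"])
        reply = call_model(context)

        # Pillar 3: naive dispatch if the model asked for a tool, e.g. "TOOL:search_orders:1234".
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            reply = self.tools[name](arg)

        # Pillar 2: persist the turn for the next call.
        self.memory.append(f"User: {user_msg}\nAgent: {reply}")
        return reply
```

Everything here steers the model indirectly, through text in the context window; nothing in the loop ever touches the weights, which is exactly the ceiling the rest of this post is about.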
2. The Eyebrow-Raise Moment
Two signals marked the turning point:
- DeepSeek-R1 proved that reinforcement learning (RL) could deliver step-function gains in a model's core capabilities—not just add polish.
- Weights & Biases' Fully Connected '25 made RL for multi-turn agents a headline topic, pushing it from research curiosity to urgent roadmap item.
Hear a disruptive idea once, it’s a novelty. Hear it twice from operators who ship, and the eyebrow goes up. The conclusion is clear: RL for production agents isn’t the distant future. It is the immediate frontier.
We are graduating from prompt engineers to reward shapers.
3. The New Production Trifecta
Reasoning + Tool Use + Reinforcement Learning
- Reasoning: A strong foundation model, open or proprietary.
- Tool Use: A sandboxed API layer so the agent can read, write, and transact.
- Reinforcement Learning: A continuous feedback loop (e.g., GRPO, sketched below) that teaches the model your definition of success.
Instead of just dressing the model in a better harness, you are recasting its core policy.
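To make the reward-shaping leg concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO: sample several responses to the same prompt, score them, and normalize each score against its own group. The toy rewards and function names are illustrative, not a production implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean/std of its own group (responses to the same prompt)."""
    # rewards has shape (num_prompts, group_size)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled responses each, scored by your reward function.
rewards = np.array([
    [0.9, 0.2, 0.5, 0.4],   # e.g., refund resolved correctly vs. not
    [0.1, 0.1, 0.8, 0.3],
])
advantages = group_relative_advantages(rewards)
# Responses that beat their own group get positive advantage and are reinforced;
# the rest are pushed down. No separate value network is needed.
```

These advantages then weight the policy-gradient update (typically with a KL penalty back to the base model), which is the one piece a harness-era stack never touches.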
4. The Optimization Flywheel
This is a practical playbook for building self-improving agents today.
| Step | What Happens |
| --- | --- |
| Ideate | Define success. "Resolve 95% of refunds autonomously at ≥ 4.8 CSAT." |
| Implement | Ship a v0 agent with initial prompts, memory, and tools. |
| Generate | Let the agent produce multiple candidate responses and action sequences. |
| Evaluate | Rank trajectories via humans, LLM-as-judge, and hard heuristics (cost, latency); see the sketch below. |
| Reward-Shape | Fine-tune with an RL algorithm (GRPO, PPO-RLHF) on those preferences. |
| Deploy & Observe | Stream real-world interactions back into the preference data store. |
| Repeat | Each loop compounds accuracy, speed, and alignment. |
Frameworks like NVIDIA NeMo’s Curate → Customize → Evaluate → Guardrail → Retrieve map directly onto this flywheel.
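One way to make the Evaluate step operational: score each trajectory with a blend of an LLM-judge rating and hard penalties for cost and latency, then keep the ranked pairs as preference data for the Reward-Shape step. The weights and the `Trajectory` fields here are assumptions for illustration, not a fixed recipe.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Trajectory:
    response: str
    judge_score: float   # 0-1 rating from an LLM-as-judge or a human reviewer
    cost_usd: float      # hard heuristic: spend for the episode
    latency_s: float     # hard heuristic: wall-clock time

def score(t: Trajectory, w_judge=1.0, w_cost=0.5, w_latency=0.1) -> float:
    # Composite reward: quality minus weighted penalties (weights are illustrative).
    return w_judge * t.judge_score - w_cost * t.cost_usd - w_latency * t.latency_s

def preference_pairs(trajectories: list[Trajectory]) -> list[tuple[Trajectory, Trajectory]]:
    # Every (winner, loser) pair feeds the Reward-Shape step (GRPO, PPO-RLHF, DPO).
    ranked = sorted(trajectories, key=score, reverse=True)
    return [(winner, loser) for winner, loser in combinations(ranked, 2)]
```

Deploy & Observe then simply appends fresh trajectories to the same store, which is what makes the loop a flywheel rather than a one-off fine-tune.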
5. The Payoff
This shift transforms the unit economics of building and running AI.
| Dimension | Harness-Tuning (Old Way) | Outcome-Tuning (RL Way) |
| --- | --- | --- |
| Accuracy | Indirect, via prompt tweaks | Direct, measurable, continuous gains |
| Latency | Locked to large model size | Smaller, faster specialist models |
| Cost / Call | Frontier-model pricing | 30-70% lower after distillation |
| Iteration Speed | Gated by external lab releases | Continuous, in-house, and compounding |
The result is smaller, faster, cheaper—and fundamentally better—agents that become true domain specialists, not just generic chatbots.
6. One Face, a Hundred Hands
To the customer, the experience is a single, on-brand AI concierge.
Behind the curtain, a swarm of a hundred task-focused, RL-shaped agents quietly handles fulfillment, returns, subscription changes, and real-time analytics. Every interaction fuels the flywheel, so the entire swarm improves constantly.
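A hedged sketch of how the one-face, many-hands pattern might look in code: a thin concierge routes each request to a specialist agent and logs the interaction back into the flywheel's preference store. The intent labels, specialist names, and `log_interaction` hook are hypothetical.

```python
from typing import Callable

# Specialist agents: each is a small, RL-shaped model behind a simple function.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "refund": lambda msg: f"[refund agent] processing: {msg}",
    "subscription": lambda msg: f"[subscription agent] updating: {msg}",
    "analytics": lambda msg: f"[analytics agent] reporting on: {msg}",
}

def classify_intent(message: str) -> str:
    # Placeholder router; in practice this is itself a small trained model.
    for intent in SPECIALISTS:
        if intent in message.lower():
            return intent
    return "refund"  # hypothetical default route

def concierge(message: str, log_interaction: Callable[[str, str], None]) -> str:
    """Single customer-facing entry point; a hundred hands behind it."""
    intent = classify_intent(message)
    reply = SPECIALISTS[intent](message)
    log_interaction(message, reply)  # every turn feeds the optimization flywheel
    return reply
```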
7. Welcome to Outcome Engineering
This marks the beginning of the Age of the Reward Shaper.
The teams that master this discipline will pull decisively ahead. They won’t just direct intelligence; they will define and manufacture it.