The Rise of the Reward Shaper
From Prompt Engineering to Outcome Engineering
A rule of thumb is quietly circulating in AI circles: the complexity of the tasks we demand from AI is doubling roughly every seven months.
Our first answer to this challenge was the Harness Era—building ever-better scaffolding around large language models. Our next, now upon us, is to forge the intelligence itself.
1. The Harness Era (2023–2024)
For the last two years, building an agent meant perfecting a "harness"—an abstraction layer of prompts, memory, and tools to tame a powerful but opaque model.
| Harness Pillar | Purpose | Typical Techniques |
| --- | --- | --- |
| Prompts | Provide core instructions & persona | System/role prompts, structured templates |
| Memory / Context | Inject relevant history | Vector search, RAG, short-term caches |
| Tools | Act on the outside world | Function calling, API orchestration, code execution |
We became modern-day spell-crafters, coaxing consistent behavior out of non-deterministic "people spirits" conjured from the internet. But this approach had a hard ceiling: the model's core intelligence remained a lab-controlled black box, and progress was gated by the next major release.
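To make the three pillars concrete, here is a minimal sketch of a harness in Python. The `call_model` client, the tool-call convention, and the prompt wording are all illustrative placeholders, not a reference implementation.

```python
# Minimal harness sketch: prompts + memory + tools wrapped around an opaque model.
# `call_model` stands in for whatever chat-completion client you already use.

from dataclasses import dataclass, field
from typing import Callable

SYSTEM_PROMPT = "You are a concise support agent. Use tools when needed."

@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]           # pillar 3: name -> callable tool
    memory: list[str] = field(default_factory=list)  # pillar 2: short-term context cache

    def run(self, user_msg: str, call_model: Callable[[str], str]) -> str:
        # Pillar 1: assemble prompt + memory into the context window.
        context = "\n".join([SYSTEM_PROMPT, *self.memory, f"User: {user_msg}"])
        reply = call_model(context)

        # Pillar 3: naive dispatch if the model asked for a tool, e.g. "TOOL:search_orders:1234".
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            reply = self.tools[name](arg)

        # Pillar 2: persist the turn for the next call.
        self.memory.append(f"User: {user_msg}\nAgent: {reply}")
        return reply
```

Everything here steers the model indirectly, through text in the context window; nothing in the loop ever touches the weights, which is exactly the ceiling the rest of this post is about.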
2. The Eyebrow-Raise Moment
Two signals marked the turning point:
- DeepSeek-R1 proved that reinforcement learning (RL) could deliver step-function gains in a model's core capabilities—not just add polish.
- Weights & Biases' Fully Connected '25 made RL for multi-turn agents a headline topic, pushing it from research curiosity to urgent roadmap item.
Hear a disruptive idea once, it’s a novelty. Hear it twice from operators who ship, and the eyebrow goes up. The conclusion is clear: RL for production agents isn’t the distant future. It is the immediate frontier.
We are graduating from prompt engineers to reward shapers.
3. The New Production Trifecta
Reasoning + Tool Use + Reinforcement Learning
- Reasoning: A strong foundation model, open or proprietary.
- Tool Use: A sandboxed API layer so the agent can read, write, and transact.
- Reinforcement Learning: A continuous feedback loop (e.g., GRPO, sketched below) that teaches the model your definition of success.
Instead of just dressing the model in a better harness, you are recasting its core policy.
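To make the reward-shaping leg concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO: sample several responses to the same prompt, score them, and normalize each score against its own group. The toy rewards and function names are illustrative, not a production implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean/std of its own group (responses to the same prompt)."""
    # rewards has shape (num_prompts, group_size)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled responses each, scored by your reward function.
rewards = np.array([
    [0.9, 0.2, 0.5, 0.4],   # e.g., refund resolved correctly vs. not
    [0.1, 0.1, 0.8, 0.3],
])
advantages = group_relative_advantages(rewards)
# Responses that beat their own group get positive advantage and are reinforced;
# the rest are pushed down. No separate value network is needed.
```

These advantages then weight the policy-gradient update (typically with a KL penalty back to the base model), which is the one piece a harness-era stack never touches.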
4. The Optimization Flywheel
This is a practical playbook for building self-improving agents today.
| Step | What Happens |
| --- | --- |
| Ideate | Define success. "Resolve 95% of refunds autonomously at ≥ 4.8 CSAT." |
| Implement | Ship a v0 agent with initial prompts, memory, and tools. |
| Generate | Let the agent produce multiple candidate responses and action sequences. |
| Evaluate | Rank trajectories via humans, LLM-as-judge, and hard heuristics (cost, latency); see the sketch below. |
| Reward-Shape | Fine-tune with an RL algorithm (GRPO, PPO-RLHF) on those preferences. |
| Deploy & Observe | Stream real-world interactions back into the preference data store. |
| Repeat | Each loop compounds accuracy, speed, and alignment. |
Frameworks like NVIDIA NeMo’s Curate → Customize → Evaluate → Guardrail → Retrieve map directly onto this flywheel.
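One way to make the Evaluate step operational: score each trajectory with a blend of an LLM-judge rating and hard penalties for cost and latency, then keep the ranked pairs as preference data for the Reward-Shape step. The weights and the `Trajectory` fields here are assumptions for illustration, not a fixed recipe.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Trajectory:
    response: str
    judge_score: float   # 0-1 rating from an LLM-as-judge or a human reviewer
    cost_usd: float      # hard heuristic: spend for the episode
    latency_s: float     # hard heuristic: wall-clock time

def score(t: Trajectory, w_judge=1.0, w_cost=0.5, w_latency=0.1) -> float:
    # Composite reward: quality minus weighted penalties (weights are illustrative).
    return w_judge * t.judge_score - w_cost * t.cost_usd - w_latency * t.latency_s

def preference_pairs(trajectories: list[Trajectory]) -> list[tuple[Trajectory, Trajectory]]:
    # Every (winner, loser) pair feeds the Reward-Shape step (GRPO, PPO-RLHF, DPO).
    ranked = sorted(trajectories, key=score, reverse=True)
    return [(winner, loser) for winner, loser in combinations(ranked, 2)]
```

Deploy & Observe then simply appends fresh trajectories to the same store, which is what makes the loop a flywheel rather than a one-off fine-tune.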
5. The Payoff
This shift transforms the unit economics of building and running AI.
| Dimension | Harness-Tuning (Old Way) | Outcome-Tuning (RL Way) |
| --- | --- | --- |
| Accuracy | Indirect, via prompt tweaks | Direct, measurable, continuous gains |
| Latency | Locked to large model size | Smaller, faster specialist models |
| Cost / Call | Frontier-model pricing | 30-70% lower after distillation |
| Iteration Speed | Gated by external lab releases | Continuous, in-house, and compounding |
The result is smaller, faster, cheaper—and fundamentally better—agents that become true domain specialists, not just generic chatbots.
6. One Face, a Hundred Hands
To the customer, the experience is a single, on-brand AI concierge.
Behind the curtain, a swarm of a hundred task-focused, RL-shaped agents quietly handles fulfillment, returns, subscription changes, and real-time analytics. Every interaction fuels the flywheel, so the entire swarm improves constantly.
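A hedged sketch of how the one-face, many-hands pattern might look in code: a thin concierge routes each request to a specialist agent and logs the interaction back into the flywheel's preference store. The intent labels, specialist names, and `log_interaction` hook are hypothetical.

```python
from typing import Callable

# Specialist agents: each is a small, RL-shaped model behind a simple function.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "refund": lambda msg: f"[refund agent] processing: {msg}",
    "subscription": lambda msg: f"[subscription agent] updating: {msg}",
    "analytics": lambda msg: f"[analytics agent] reporting on: {msg}",
}

def classify_intent(message: str) -> str:
    # Placeholder router; in practice this is itself a small trained model.
    for intent in SPECIALISTS:
        if intent in message.lower():
            return intent
    return "refund"  # hypothetical default route

def concierge(message: str, log_interaction: Callable[[str, str], None]) -> str:
    """Single customer-facing entry point; a hundred hands behind it."""
    intent = classify_intent(message)
    reply = SPECIALISTS[intent](message)
    log_interaction(message, reply)  # every turn feeds the optimization flywheel
    return reply
```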
7. Welcome to Outcome Engineering
This marks the beginning of the Age of the Reward Shaper.
The teams that master this discipline will pull decisively ahead. They won’t just direct intelligence; they will define and manufacture it.