RLHF for agentic workflows: multi-step logic verification in autonomous web browsing agents

Standard RLHF evaluates the final output. Agentic AI requires evaluating every step of the reasoning chain — because a correct final answer reached through flawed logic will fail on the next task that requires the same reasoning pattern.

Here is how we grade AI agent action logs step by step, and why this produces fundamentally better training data than thumbs-up/thumbs-down evaluation. Across 50 action logs, we found that 62% were marked "successful" by outcome-based evaluation — but only 22% achieved that success through fully sound reasoning. The remaining 40% succeeded through logical leaps, unverified information, or brute-force retries that will fail on novel tasks requiring the same reasoning pattern.

62% · Action logs marked "successful" by outcome-based evaluation (standard RLHF thumbs-up/thumbs-down)
22% · Action logs with fully sound reasoning throughout — the only ones that should be used as positive training signal

Why standard RLHF breaks for agentic AI

RLHF was designed for a specific interaction pattern: a user asks a question, the model produces a response, a human evaluator judges whether the response is good. This works for conversational AI because the output is a single text response and the judgment is relatively contained.

Agentic AI operates differently. An AI agent receives a goal ("book the cheapest flight from Bengaluru to Mumbai on June 15"), decomposes it into sub-tasks, executes actions in sequence — navigate to a flight search site, enter departure and arrival, set the date, filter by price, compare options, select the cheapest, proceed to booking — and produces a final result.
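
To make that pattern concrete, here is a minimal sketch of a decomposed sub-task plan for the flight example. The structure and field names are purely illustrative, not any particular agent framework's schema.

```python
# Hypothetical sub-task plan for the flight-booking goal. Each sub-task
# expands into one or more concrete browser actions, and each action is
# a separate point where the reasoning chain can go wrong.
flight_task = {
    "goal": "book the cheapest flight from Bengaluru to Mumbai on June 15",
    "sub_tasks": [
        "navigate to a flight search site",
        "enter departure: Bengaluru",
        "enter arrival: Mumbai",
        "set date: June 15",
        "filter results by price",
        "compare remaining options",
        "select the cheapest flight",
        "proceed to booking",
    ],
}

for i, sub_task in enumerate(flight_task["sub_tasks"], start=1):
    print(f"Step {i}: {sub_task}")
```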

The quality of an agentic AI system is not determined by whether the final result is correct. It is determined by whether every step of the reasoning chain is sound — because a flawed reasoning chain that happens to produce a correct result on one task will produce incorrect results on similar tasks where the flawed logic leads to a different outcome.

Standard RLHF — "was the final output good?" — cannot evaluate this. An evaluator who gives a thumbs-up on a completed booking has provided zero signal about whether the intermediate steps were logical, or whether they would fail on a different scenario.

The three failure modes unique to agentic AI

Agentic AI introduces three failure modes that do not exist in conversational AI, each requiring a different evaluation methodology.

Failure Mode 01

The logical leap

The agent skips a necessary verification step because the underlying LLM "knows" the answer from its training data rather than deriving it from the current context. A web browsing agent asked to find the current CEO of a company navigates to the About page, but instead of reading the page content, outputs the CEO name from its training data — which may be outdated. The action log shows: navigated to page → output answer. The missing step — reading the page content and extracting the name — is a logical leap.

The danger: this failure mode is invisible in outcome-based evaluation. The answer is correct (for now), so the thumbs-up evaluator marks it as a success. The agent has learned that skipping the verification step is rewarded — because it was.
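
Many leaps of this kind can be caught mechanically before a human annotator ever looks at the log: flag any output step that reports page-derived information without a preceding read or extract action on the current page. A minimal sketch, assuming hypothetical step records and action names:

```python
# Minimal logical-leap check over a flat list of step records.
# Action names and field names are illustrative assumptions.
READ_ACTIONS = {"read_page", "extract_text", "extract_element"}

def find_logical_leaps(steps):
    """Return indices of output steps emitted without any read action
    since the most recent navigation."""
    leaps = []
    read_since_nav = False
    for i, step in enumerate(steps):
        if step["action"] == "navigate":
            read_since_nav = False
        elif step["action"] in READ_ACTIONS:
            read_since_nav = True
        elif step["action"] == "output" and not read_since_nav:
            leaps.append(i)  # answer produced without reading the page
    return leaps

log = [
    {"action": "navigate", "target": "https://example.com/about"},
    {"action": "output", "value": "The CEO is Jane Doe"},  # leap
]
print(find_logical_leaps(log))  # -> [1]
```

A check like this only surfaces candidates; the annotator still judges whether the skipped step actually mattered.
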
Failure Mode 02

The hallucinated action

The agent claims to have performed an action it did not actually perform. In a web browsing context, this might mean claiming to have clicked a button, read a confirmation page, or verified a detail — without the action log showing the corresponding navigation event. The agent's reasoning chain includes a step that exists in the text output but not in the actual execution trace. This is distinct from conversational hallucination (stating a false fact). Agentic hallucination is fabricating an action — claiming to have done something that did not happen.

Standard RLHF cannot detect this because it evaluates the text output, not the execution trace. Multi-step evaluation compares the reasoning chain against the actual action log to surface these gaps.
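
In practice this comparison can be partially automated: every action the agent claims in its reasoning text should have a matching event in the execution trace, and anything unmatched is a candidate hallucination for the annotator to review. A sketch under assumed data shapes:

```python
# Sketch of execution-trace verification. The record shapes are our
# illustration; real traces would need normalisation before matching.
def find_hallucinated_actions(claimed, trace):
    """Return claimed actions with no corresponding trace event."""
    trace_events = {(e["action"], e["target"]) for e in trace}
    return [c for c in claimed if (c["action"], c["target"]) not in trace_events]

claimed = [
    {"action": "navigate", "target": "website-b.example"},
    {"action": "extract_price", "target": "website-b.example/product"},
]
trace = [
    {"action": "navigate", "target": "website-b.example"},
    # no extract_price event: the product page was never visited
]
print(find_hallucinated_actions(claimed, trace))
# -> [{'action': 'extract_price', 'target': 'website-b.example/product'}]
```
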
Failure Mode 03

The circular reasoning loop

The agent encounters an error, attempts to recover, fails, and repeats the same failing approach multiple times before either succeeding by chance or giving up. The action log shows: try approach A → fail → try approach A again → fail → try approach A again → succeed (or timeout). A competent agent would try approach A → fail → diagnose the failure → try approach B.

Outcome-based evaluation marks the final result (success or timeout) without examining the recovery strategy. An agent that succeeds after five identical retries is rewarded the same as an agent that succeeds after one intelligent pivot.
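
Distinguishing the two is straightforward once the full step sequence is available: consecutive identical failing actions are a brute-force signature. A minimal sketch, again with hypothetical step records:

```python
# Flag recovery attempts that repeat the same failing action instead of
# pivoting to a new approach. Field names are illustrative.
def find_brute_force_retries(steps, threshold=2):
    """Return (action, target) pairs that failed `threshold` or more
    times in a row."""
    loops, prev, run = [], None, 0
    for step in steps:
        key = (step["action"], step["target"])
        if step["outcome"] == "fail" and key == prev:
            run += 1
        elif step["outcome"] == "fail":
            prev, run = key, 1
        else:
            prev, run = None, 0
        if run == threshold:
            loops.append(key)
    return loops

log = [
    {"action": "click", "target": "#submit", "outcome": "fail"},
    {"action": "click", "target": "#submit", "outcome": "fail"},
    {"action": "click", "target": "#submit", "outcome": "success"},
]
print(find_brute_force_retries(log))  # -> [('click', '#submit')]
```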

The approach — dataset construction

We constructed a dataset of 50 AI-generated action logs from an autonomous web browsing agent tasked with realistic information retrieval and task completion goals. The action logs were generated using an open-source web browsing agent framework operating on live websites.

Task categories — 50 action logs across 5 domains

Category                 Tasks   Avg steps   Complexity
Information retrieval     15       8         Medium
Price comparison          10      14         High
Form completion            8      11         Medium
Multi-site research       10      18         Very high
Booking / transaction      7      16         Very high

Each action log recorded: the goal, every navigation action (URL visited, element clicked, text entered), the agent's internal reasoning at each step (chain-of-thought), the final output, and whether the task was marked as completed or failed by the agent.
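
For illustration, one record in this dataset can be pictured as the following structure. The field names are ours, not the agent framework's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str      # e.g. "navigate", "click", "type", "output"
    target: str      # URL, element selector, or output text
    reasoning: str   # the agent's chain-of-thought at this step

@dataclass
class ActionLog:
    goal: str
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""
    agent_marked_complete: bool = False
```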

The evaluation framework — grading every step, not just the result

This is the core methodological contribution. We designed a 5-dimension evaluation framework that annotators applied to every step of every action log — not just the final outcome.

1. Goal decomposition quality

Does the agent correctly break the high-level goal into appropriate sub-tasks? Are the sub-tasks in a logical order? Are any necessary sub-tasks missing? Are there unnecessary sub-tasks that waste actions?

Score 1–5 per action log · 1 = incoherent · 5 = optimal

2. Action justification

For each action the agent takes, is there a clear logical reason for that specific action at that point in the sequence? Does the agent's chain-of-thought reasoning actually justify the action, or is the reasoning post-hoc rationalisation for an action taken based on pattern matching?

Fully justified · Partially justified · Unjustified — per individual action

3. Information verification

When the agent extracts information from a web page, does it actually read and correctly interpret the page content? Or does it substitute information from its training data, ignore contradictory evidence on the page, or misread the page structure? Each extraction step is verified against the actual page content captured as a screenshot at the time of the action.

Verified · Unverified · Contradicted — per extraction step

4. Error recovery quality

When the agent encounters an error (page not found, element not clickable, unexpected content), does it diagnose the error and adjust its approach, or retry the same failing action?

Intelligent recovery · Brute force · No recovery · N/A

5. Logical coherence across the full chain

Does the entire sequence of actions tell a coherent story from goal to completion? Are there logical inconsistencies where the agent's behaviour in step 8 contradicts its reasoning in step 3? Does the agent maintain context across a long action sequence, or does it lose track of what it has already done?

Score 1–5 per action log · 1 = internally contradictory · 5 = fully coherent
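
Taken together, one annotated log carries two per-log scores (dimensions 1 and 5) plus per-step labels (dimensions 2 through 4). A sketch of the record an annotator produces, with hypothetical field names and an illustrative soundness rule:

```python
from dataclasses import dataclass, field

@dataclass
class StepLabels:
    justification: str        # "fully" | "partially" | "unjustified"
    verification: str | None  # "verified" | "unverified" | "contradicted"; None if not an extraction step
    recovery: str | None      # "intelligent" | "brute_force" | "none"; None if N/A

@dataclass
class LogAnnotation:
    goal_decomposition: int   # 1-5, whole log
    coherence: int            # 1-5, whole log
    step_labels: list[StepLabels] = field(default_factory=list)

    def fully_sound(self) -> bool:
        """Illustrative rule for 'sound reasoning throughout'."""
        return (
            self.goal_decomposition >= 4
            and self.coherence >= 4
            and all(
                s.justification == "fully"
                and s.verification not in ("unverified", "contradicted")
                and s.recovery != "brute_force"
                for s in self.step_labels
            )
        )
```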

Annotator selection and calibration

Multi-step reasoning evaluation requires a fundamentally different annotator profile than standard RLHF preference annotation. We selected annotators with a software engineering background (the ability to read action logs and trace execution flows), experience with web technologies, and logical reasoning ability, tested via a logic assessment during screening.

All annotators evaluated the same 5 action logs independently using the 5-dimension framework before live annotation began.

Calibration kappa — 5 evaluation dimensions
Goal decomposition        0.76
Action justification      0.71  (lowest — "fully vs. partially justified" is subjective)
Information verification  0.83  (highest — either matches page content or it does not)
Error recovery quality    0.79
Overall logical coherence 0.74
────────────────────────
Mean across dimensions    0.77  (above 0.70 threshold for production-grade annotation)
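
For reference, pairwise agreement of this kind can be computed with scikit-learn's Cohen's kappa. A sketch with made-up labels; the case study does not specify which kappa variant or tooling was used:

```python
# Pairwise Cohen's kappa between two annotators on one dimension.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["verified", "verified", "unverified", "contradicted", "verified"]
annotator_b = ["verified", "verified", "unverified", "verified", "verified"]

print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")  # kappa = 0.58
```

With more than two annotators, the per-dimension figure is typically a mean of pairwise kappas or a Fleiss' kappa.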

The results — what the evaluation found

Findings across 50 action logs

Logical leaps detected                      34 logs · 68%
Information verification failures           27 logs · 54%
Hallucinated actions detected               12 logs · 24%
Circular reasoning loops                     8 logs · 16%
Correct result via flawed reasoning         20 logs · 40%
Fully coherent reasoning chain              11 logs · 22%
Tasks that outcome-eval marked "success"    31 logs · 62%
Tasks with sound reasoning throughout       11 logs · 22%

The most important finding: 31 of 50 action logs were marked "successful" by outcome-based evaluation. But only 11 of those 31 achieved success through fully sound reasoning. Standard RLHF would have marked all 31 as positive training signal. Multi-step evaluation identifies only 11 as genuinely positive signal and flags the other 20 for specific reasoning corrections.
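
The triage rule itself is simple once both signals exist; what outcome-based evaluation lacks is the second input. A sketch:

```python
# How multi-step evaluation splits the same 50 logs into three buckets.
# `outcome_success` is all that thumbs-up/down sees; `fully_sound`
# comes from the 5-dimension annotation.
def triage(outcome_success: bool, fully_sound: bool) -> str:
    if not outcome_success:
        return "negative"          # 19 logs: failed outright
    if fully_sound:
        return "positive"          # 11 logs: genuinely good signal
    return "needs_correction"      # 20 logs: right answer, flawed chain
```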

Detailed breakdown — logical leaps

Logical leaps were the most common failure mode, appearing in 68% of action logs. In the representative example below, the agent is asked to find the current price of a specific laptop on an e-commerce site.

Example — laptop price retrieval · logical leap detected at Step 4
Step 1 · Navigate to e-commerce site homepage · Justified
Step 2 · Search for laptop model name · Justified
Step 3 · Click on first search result · Justified
Step 4 · Output: "The price is ₹89,999" [no page-reading or price extraction action in log] · Logical leap

Annotator assessment: "Step 4 is a logical leap. The agent did not extract the price from the page content; it output a price that may have come from training data. This step should include: read price element from product page → verify currency → output verified price."

Corrective training signal: The annotator rewrote the reasoning chain for Step 4 to include the missing verification step. Multiplied across hundreds of similar examples, this trains the agent to always verify against the current page rather than substituting from memory.
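
A corrected-chain record for this example might look like the following. The structure is our illustration of what the annotator hands back, not a prescribed format.

```python
# Flawed chain and annotator-corrected chain for the laptop-price task.
correction = {
    "failure_mode": "logical_leap",
    "flawed_steps": [
        "click on first search result",
        'output: "The price is ₹89,999"',        # nothing was read
    ],
    "corrected_steps": [
        "click on first search result",
        "read price element from product page",  # inserted by annotator
        "verify currency",
        "output verified price",
    ],
    "annotator_note": "Price must come from the live page, not memory.",
}
```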

Detailed breakdown — hallucinated actions

Hallucinated actions appeared in 24% of logs — the agent claimed to have performed an action that the execution trace does not support. In the example below, the agent is asked to compare prices across three different websites.

Example — price comparison · hallucinated action at Step 6
Step 5 · Navigate to Website B · Justified
Step 6 · "I can see the price is ₹92,500 on Website B" [execution trace: landed on homepage, no product search, no product page visited] · Hallucinated
Step 7 · Navigate to Website C · Justified

Annotator assessment: "Step 6 is a hallucinated action. The execution trace shows the agent landed on Website B's homepage and immediately moved to Step 7 without searching for the product. The price ₹92,500 was generated, not extracted. The correct sequence: search for product on Website B → navigate to product page → extract price → verify price is visible on page → report verified price."

Detailed breakdown — correct result, flawed reasoning

This is the finding that demonstrates why outcome-based evaluation is insufficient for agentic AI. Of the 31 action logs marked successful, 20 (40% of the full dataset) reached the correct final result through flawed reasoning. In every case, standard RLHF would have marked the log as a positive training example — reinforcing the flawed reasoning pattern.

Example — contact email lookup · correct result via logical leap
Step 1 · Navigate to company website · Justified
Step 2 · Click "Contact Us" link · Justified
Step 3 · Output: "The customer service email is support@company.com" [Contact Us page loaded but not read — email from training data, happens to be current] · Logical leap

Annotator flagged: "Correct result, flawed reasoning. Must not be used as positive RLHF signal without correction. The corrected version must include a page-reading step between navigation and output. On a different company whose contact email has changed, the same reasoning pattern would produce an incorrect result."

The training signal comparison

What each evaluation method produces from 50 action logs

Evaluation method                      Positive signal     Negative signal   Corrected chains
Outcome-based RLHF (thumbs up/down)    31                  19                0
Multi-step evaluation (our method)     11 (truly sound)    19 (failed)       20 (fixed logic)

The multi-step evaluation produces 20 corrected training examples that outcome-based evaluation would have treated as positive signal without correction. These 20 corrections — showing the flawed reasoning chain alongside the corrected version — are the highest-value training data for improving an agent's reasoning capabilities. The model learns not just what the right answer is, but what the right reasoning process looks like.
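
One plausible way to package each correction for preference-based training is as a chosen/rejected pair over the same goal, so the reward model learns to prefer the verified chain over the shortcut. A sketch; the exact format depends on the training stack:

```python
# Convert one correction into a preference pair (hypothetical format).
def to_preference_pair(goal, flawed_steps, corrected_steps):
    return {
        "prompt": goal,
        "rejected": "\n".join(flawed_steps),    # correct answer, flawed chain
        "chosen": "\n".join(corrected_steps),   # same answer, verified chain
    }

pair = to_preference_pair(
    "find the current price of laptop X",
    ["click first result", 'output "₹89,999"'],
    ["click first result", "read price element", "verify currency",
     'output "₹89,999 (verified on page)"'],
)
print(pair["chosen"])
```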

Time and cost analysis

Metric                                         Outcome-based RLHF    Multi-step evaluation
Time per action log                            3–5 minutes           25–40 minutes
Evaluator skill required                       General               Software engineering
Cost per evaluation                            ₹80–150               ₹400–800
Quality of training signal                     Binary (good/bad)     5-dimensional + corrected chains
Logical leaps detected                         0                     34 of 50
Hallucinated actions detected                  0                     12 of 50
Model reasoning improvement after retraining   Marginal              Substantial

Multi-step evaluation costs 4–5× more per action log than outcome-based RLHF. But it produces training data that is categorically different — not just "this action sequence was good/bad" but "here is exactly where the reasoning broke down, here is why, and here is what the correct reasoning chain should have been."

Key learnings

Outcome-based RLHF is actively harmful for agentic AI training. 20 of the 31 "successful" action logs (40% of the full dataset) contained flawed reasoning that standard RLHF would have reinforced. Training on this signal produces agents that are unreliable on novel tasks — they have learned specific shortcuts rather than general reasoning patterns.

Logical leaps are the most common and most dangerous failure mode. 68% of action logs contained at least one logical leap — a step where the agent substituted training data for current context. This failure mode is invisible in outcome evaluation because the substituted information is often correct. It becomes visible only when the reasoning chain is evaluated step by step.

Hallucinated actions are the agentic equivalent of hallucinated facts. 24% of logs contained actions the agent claimed to have performed but did not actually execute. This failure category only exists in agentic AI and requires execution trace verification that standard RLHF annotation does not include.

The corrected reasoning chain is the highest-value training data. The 20 action logs with correct results but flawed reasoning — annotated with specific corrections to each flawed step — produce training signal that teaches the model what good reasoning looks like, not just what good outcomes look like.

Annotator selection for agentic evaluation is fundamentally different. Evaluating multi-step reasoning chains requires software engineering literacy, logical reasoning ability, and experience tracing execution flows. Standard RLHF annotators — even excellent ones — cannot reliably evaluate agentic action logs without task-specific training.

This is the highest-value annotation service in the current market. Agentic AI is the fastest-growing segment, the evaluation methodology is complex, the pool of qualified annotators is small, and the impact on model quality is substantial. Multi-step reasoning evaluation commands 4–5× the price of standard RLHF annotation — and delivers proportionally more value.

Building an agentic AI system and need reasoning-quality evaluation?

We specialise in multi-step logic verification, action log auditing, and RLHF preference data for agentic workflows. Free audit — 10 action logs evaluated, findings in 5 working days.

Request Free Audit →
The Concave AI Team
ML-Engineer-Led Data Annotation & GenAI Evaluation · Bengaluru, India