Agentforce · Testing · Developer Tooling

Your AI Agent Shipped Without a Test Suite. Summer ’26 Fixes That.

Agent preview hit GA, trace files arrived, and Testing Center moved into Agentforce Studio. For the first time, you can put an autonomous agent through real QA before it touches a customer. Here is how a Salesforce architect actually builds that test pipeline.

Reading time: ~11 minutes | Published: June 2026 | Published By: Sandip Patel, Salesforce Architect

Release: v67.0, the Summer ’26 API version, live in production now

Agent preview: GA, with scripted end-to-end agent tests from the CLI

Tokens: 20T+ used by Agentforce customers in one quarter

Analytics: 40+ metrics in the unified agent analytics view

TL;DR

Agents are non-deterministic, so the unit test you wrote for a trigger does not cover them. Summer ’26 gives you the missing layer: agent preview is GA in the CLI, trace files show exactly how an agent routed, Testing Center lives inside Agentforce Studio with conversation-level tests, and Custom Scorers let you grade sessions against your own definition of “good.” Treat agent testing as its own discipline, build a regression suite before go-live, and wire scorers into source control.

Agents are not like deterministic code, where the same input produces the same output on every run. Feed the same utterance to an Agentforce agent twice and you can get two different phrasings, two different reasoning paths, sometimes two different action choices. The model has built-in variability. Memory, prior turns, and tool access all shape the next response. That is the whole point of an agent, and it is exactly what makes your old test harness useless against it.

The scale makes it sharper. In one recent quarter, Agentforce customers consumed more than 20 trillion tokens, a 400% jump year over year, adding up to roughly 1.79 billion agentic work units. Every one of those interactions is a decision your agent made on its own. A wrong topic selection, a misfired action, a tone that does not match your brand: multiply any of those across millions of conversations and the cost of skipping QA is not theoretical.

The Core Shift

Traditional testing asks “did the code return the expected value?” Agent testing asks “did the agent reason correctly, pick the right topic, call the right action, and stay on brand, given that the exact words will vary every time?” Those are different questions, and they need different tools.

Here is the good news. Salesforce spent the last few releases building the answer, and Summer ’26 is where it all clicks into place. You no longer have to hand-roll your own evaluation scripts or edit CSV files to manage a test suite. The platform now treats agent testing as a first-class discipline with its own surface, its own primitives, and its own place in your deployment pipeline.

What Actually Shipped in Summer ’26

Four pieces that turn agent QA from guesswork into a repeatable process

The testing story is spread across a few products, so it helps to see the pieces together before we get into the how. Each one closes a specific gap between “the demo worked” and “I trust this in production.”

Agent Preview (GA)

Script an interactive test session end to end from the Salesforce CLI: agent preview start, send, sessions, and end. No clicking through a UI to validate a conversation.

Trace Files

Inspect the traces recorded during a preview session to see exactly how your agent routed and acted. This is your stack trace for agent reasoning, the thing that used to be a black box.

Testing Center in Studio

Testing Center moved out of Setup and into Agentforce Studio as a dedicated tab beside Agent Builder and Observability. Batch tests, conversation-level simulation, and user personas, all in one place.

Custom Scorers (Beta)

Grade sessions against your own KPIs: Sentiment, Tone of Voice, Product Interest, Escalation Trigger, Politeness, alongside Salesforce’s standard quality metrics. You define what “good” means.

There is a fifth piece worth flagging for Apex developers: a new @IntegrationTest annotation (Developer Preview) that lets test classes make real callouts to Agentforce and Data 360 instead of mocking everything. It allows live calls and mid-transaction commits through IntegrationTest.commitTestOnly(), with cleanup in a @TearDown method. It runs in scratch orgs only for now, but it is the bridge between your existing Apex test culture and the agent world.

“The gap was never the model. It was everything between a working prototype and a deployed agent: provisioning, authoring, testing, observability. Summer ’26 fills the testing slot.”

Notice the pattern across all five. None of them are about making the agent smarter. They are about giving you visibility and control over an agent you already built. That is the mark of a platform maturing past the launch hype and into the part where real enterprises actually ship.

Scripting Agent Preview from the CLI

Agent preview going GA is quietly the most useful thing in this release for anyone running CI

Before this, validating a conversation meant opening a builder and typing into a chat box like an end user. Fine for a quick sanity check. Useless for repeatable, automated testing. Agent preview going GA changes that by exposing the whole session as scriptable CLI commands.

The flow is four commands. You start a session, send one or more utterances, list or inspect sessions, and end cleanly. Because it is the CLI, it drops straight into a shell script or a pipeline step.

BASH
# Start an interactive preview session against a published agent
sf agent preview start --api-name Order_Support_Agent --output-dir ./traces

# SESSION must hold the session ID reported by the start command
# Send an utterance and capture the response
sf agent preview send --session-id "$SESSION" \
  --text "Where is my order 10042?"

# List active sessions, then close when the script finishes
sf agent preview sessions
sf agent preview end --session-id "$SESSION"

The real prize is the trace file. When the session runs, the platform records how the agent routed: which topic it matched, which action it invoked, what the planner decided at each step. You read that trace the way you would read a debug log, except instead of governor limits and SOQL counts, you are watching the agent’s decision path.

What the trace tells you that a chat window never could

A failed conversation in a chat window gives you a bad answer and nothing else. A trace gives you the why. Maybe the agent matched the wrong topic because two topics had overlapping descriptions. Maybe it picked the right topic but the action threw and the agent silently recovered with a generic reply. You cannot fix what you cannot see, and traces finally make the reasoning visible.
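As a rough sketch of reading a trace for its routing decision, the snippet below pulls two values out of a trace file. It assumes the trace is JSON with string fields named `topic` and `action`. Those field names, and the helper names `trace_field` and `summarize_trace`, are hypothetical placeholders rather than the documented trace schema, so open a real trace from your org and adjust before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a trace file containing "topic": "..." and
# "action": "..." string fields. Both field names are hypothetical.

# Print the first string value stored under the given key in a JSON file.
trace_field() {
  local key="$1" file="$2"
  sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | head -n 1
}

# Summarize where a session was routed, one line per trace file.
summarize_trace() {
  local file="$1"
  printf '%s -> topic=%s action=%s\n' \
    "$(basename "$file")" \
    "$(trace_field topic "$file")" \
    "$(trace_field action "$file")"
}
```

Text matching with `sed` keeps the sketch dependency-free. If `jq` is available in your pipeline image, it is the sturdier choice for nested trace structures.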

Step 1. Start Session: CLI spins up a preview against your agent
Step 2. Send Utterances: scripted inputs, one per test case
Step 3. Capture Trace: routing and action path recorded
Step 4. Assert & Score: check topic, action, and quality

Architect’s Tip

Pair agent preview with the new richer evaluations in the CLI, which let you define repeatable evaluation tests in YAML or JSON. Keep those eval files in your repo next to your Apex and LWC. Now your agent behavior is versioned, reviewable in a pull request, and runnable on every commit, the same as any other code asset.

Custom Scorers: Defining “Good”

A passing test that ignores tone, compliance, and brand voice is not a passing test

Here is the question that separates a toy test suite from a real one. Your agent answered the customer’s shipping question correctly. Was it a good response? Correct is not the same as good. It might have been curt. It might have leaked a detail your compliance team would flag. It might have missed an obvious upsell. Standard pass/fail metrics will not catch any of that.

Custom Scorers (Beta) solve this by letting you grade a session against criteria you define in plain language. You write something like “rate the politeness of the agent response on a scale of 0 to 5,” describe what each score level means, and give example responses. The platform applies it across your test sessions and live conversations.

Sentiment

Was the customer left satisfied?

Catches conversations that technically resolved but left the customer frustrated. The number that correlates most directly with churn and CSAT.

Tone of Voice

Does it sound like your brand?

A bank and a sneaker startup should not sound identical. Score against your brand guidelines so the agent speaks in a voice your marketing team would approve.

Escalation Trigger

Did it hand off when it should?

Grades whether the agent recognized its own limits and escalated to a human at the right moment, instead of bluffing through a question it could not handle.

Product Interest

Did it spot the opportunity?

For sales-adjacent agents, scores whether the conversation surfaced genuine buying signals rather than treating every chat as a pure support ticket.

The workflow is what makes this stick for developers. You build scorers with Next Gen Testing in Agentforce Studio, or you deploy them through the Metadata API using aiAgentScorerDefinitions so they live in source control. Then you activate them from the Scorer Hub to run against live sessions. Define once, version it, run it everywhere.

This is also where Testing Center earns its move into Studio. It now supports conversation-level testing that simulates full multi-turn conversations with user personas, not just single-utterance checks. You seed prior context, mock tool outputs so you are not hitting real APIs, and assert on intent rather than exact text, because the LLM will reword a valid answer and a brittle string-match test would fail it for no reason.

Governance Note

Scorers are not just a quality tool. They are a governance artifact. A documented, version-controlled Escalation Trigger scorer is evidence that your agent has guardrails, which is exactly what your risk and compliance teams will ask for before they sign off on a customer-facing deployment. Build them early and treat them as part of your audit trail.

A Test Pipeline That Survives Production

How the pieces fit into a lifecycle from scratch org to CI/CD

Individual tools are nice. A pipeline is what keeps an agent reliable after the launch excitement fades. Here is how I sequence the work when an org is moving an agent toward production.

BUILD TEST OPERATE

Scaffold in a scratch org

Use the CLI agent template to spin up a runnable sample with Apex, Prompt Template, and Flow subagents. Enable the ApexIntegrationTests feature in your scratch org definition so you can write @IntegrationTest classes against real Agentforce and Data 360 callouts.

Reproducible from day one
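To make the scratch org step concrete, here is a minimal sketch that writes a definition file with the `ApexIntegrationTests` feature enabled. The `orgName` and `edition` values, the output path, and the `write_scratch_def` helper name are placeholders I chose for illustration. Your project may need additional features or settings that this sketch does not cover.

```shell
#!/usr/bin/env bash
# Sketch: write a scratch org definition that enables the ApexIntegrationTests
# feature. orgName and edition are placeholder values.
write_scratch_def() {
  local out="$1"
  mkdir -p "$(dirname "$out")"
  cat > "$out" <<'EOF'
{
  "orgName": "Agent QA",
  "edition": "Developer",
  "features": ["ApexIntegrationTests"]
}
EOF
}

# Example usage (the second command needs an authenticated Dev Hub):
#   write_scratch_def config/project-scratch-def.json
#   sf org create scratch --definition-file config/project-scratch-def.json --alias agent-qa
```

Keeping the definition file in the repo means every developer and every pipeline run starts from the same org shape.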

Write evals next to your code

Define YAML or JSON evaluation tests and commit them to the repo alongside your metadata. Each eval is a scripted utterance plus the topic and action you expect. This is your agent’s regression suite, and it belongs in version control.

Versioned behavior
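One cheap guard for a versioned eval suite is a lint step that rejects eval files missing the fields your assertions depend on. The sketch below assumes each eval file uses keys named `utterance`, `expectedTopic`, and `expectedActions`. Those key names, like the `lint_eval` helper itself, are assumptions for illustration, so swap in whatever your eval format actually uses.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when an eval file is missing a field the suite depends on.
# The key names below are assumptions; substitute the ones your format uses.
REQUIRED_KEYS="utterance expectedTopic expectedActions"

# Return non-zero and report on stderr if any required key is absent.
lint_eval() {
  local file="$1" key missing=0
  for key in $REQUIRED_KEYS; do
    # Match the key as a YAML key (key:) or a JSON key ("key":).
    if ! grep -Eq "(^|[[:space:]\"-])$key\"?[[:space:]]*:" "$file"; then
      echo "$file: missing '$key'" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run as a pre-commit hook or an early pipeline step, this catches a half-written test case before it quietly passes for lack of an expectation.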

Run agent preview in CI

Wire the four preview commands into your pipeline. On every pull request, start a session, replay your eval utterances, capture traces, and fail the build if a critical conversation routes to the wrong topic or action. Treat a routing regression like a failing unit test.

Catch drift before merge
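The "fail the build" part can be sketched as a small gate over a results file. This assumes an earlier pipeline step has already written one tab-separated line per test case with the expected and actual topic and action. That file layout and the `gate_routing` name are hypothetical and not something the CLI produces for you.

```shell
#!/usr/bin/env bash
# Sketch of a build gate. Input is a tab-separated results file with columns:
#   case_name  expected_topic  actual_topic  expected_action  actual_action
# The layout is an assumption for illustration.

# Print each mismatch and return non-zero if any case routed incorrectly.
gate_routing() {
  awk -F '\t' '
    $2 != $3 { printf "FAIL %s: topic %s, expected %s\n", $1, $3, $2; bad++ }
    $4 != $5 { printf "FAIL %s: action %s, expected %s\n", $1, $5, $4; bad++ }
    END { exit (bad > 0) }
  ' "$1"
}
```

In a pipeline step, `gate_routing results.tsv || exit 1` is enough to block the merge, and the printed lines tell the reviewer which conversation drifted.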

Layer in Custom Scorers

Deploy scorer definitions through the Metadata API so they ship with your release. Run them in Testing Center against your conversation-level tests, so quality dimensions like tone and escalation are graded automatically, not eyeballed by whoever has time.

Quality as code

Activate scorers on live sessions

Once in production, turn scorers on from the Scorer Hub to grade real conversations. Watch the unified Refined Agent Analytics view, which now folds Service Agent and Employee Agent metrics into one place with 40+ measures across both.

Observe continuously

Feed production back into tests

When a live conversation scores poorly, turn it into a new eval case. Your regression suite grows from real failures, not imagined ones. This loop is what keeps the agent honest months after launch, when nobody remembers the original design decisions.

Close the loop
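Here is one way the feedback loop might look in practice, as a sketch. It assumes you can export scored sessions to a tab-separated file with a header row and the columns `session_id`, `score`, and `utterance`. That export format and the `low_scoring_utterances` helper are hypothetical, so adapt the column positions to whatever your analytics export actually contains.

```shell
#!/usr/bin/env bash
# Sketch: list the distinct utterances from sessions that scored below a
# threshold, as candidates for new eval cases.
# Assumed input: TSV with a header row and columns session_id, score, utterance.
low_scoring_utterances() {
  local threshold="$1" file="$2"
  awk -F '\t' -v min="$threshold" \
    'NR > 1 && ($2 + 0) < (min + 0) { print $3 }' "$file" | sort -u
}
```

The output is only a candidate list. Someone who knows the business still decides which utterances deserve a permanent place in the regression suite and what the expected topic and action should be.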

Eval suite + CI + scorers: regressions caught pre-merge
Manual chat-window checks only: catches the obvious, misses drift
No structured testing: production is your test suite

What Still Trips Teams Up

Honest limits even well-prepared teams will run into

The toolkit is strong, but it is new, and a few edges are still sharp. Knowing them ahead of time saves a frustrating afternoon.

Integration Tests Are Scratch-Org Only

Developer Preview boundaries

The @IntegrationTest annotation runs in scratch orgs only for now, not sandboxes or production
Tests run asynchronously, one at a time, so a large suite takes real wall-clock time
You must add ApexIntegrationTests to the scratch org definition’s features array first
Plan for it: keep integration tests focused on critical paths, not exhaustive coverage

Scorers Are LLM-Graded

A judge that also has variance

Custom Scorers use a model to evaluate, so the grade itself carries some variability
Vague scoring descriptions produce inconsistent grades across runs
Custom Scorers require the Agentforce Scorer Beta permission set to use
Fix: write tight rubrics with concrete examples for each score level
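Because the judge has variance too, it can help to measure that variance before trusting a rubric. The sketch below assumes you have run the same scorer several times on the same session and saved one numeric score per line. The input file and the `score_stability` helper are illustrative and not part of any Salesforce tooling.

```shell
#!/usr/bin/env bash
# Sketch: given one numeric score per line (repeated runs of one scorer on one
# session), print the mean and the spread (max - min). Returns non-zero when
# the spread exceeds the tolerance, which hints that the rubric is too vague.
score_stability() {
  local tolerance="$1" file="$2"
  awk -v tol="$tolerance" '
    NF == 0 { next }                      # skip blank lines
    {
      v = $1 + 0; n++; sum += v
      if (n == 1 || v < min) min = v
      if (n == 1 || v > max) max = v
    }
    END {
      if (n == 0) { print "no scores"; exit 2 }
      printf "mean=%.2f spread=%.2f\n", sum / n, max - min
      exit ((max - min) > (tol + 0))
    }
  ' "$file"
}
```

A spread that stays within a point on a 0 to 5 scale is usually workable. A rubric that swings from 1 to 5 on an identical session needs tighter level descriptions before it gates anything.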

The thing nobody warns you about: test for intent, not text

The most common early mistake is writing assertions that check for exact wording. An agent that answered correctly last week will reword the same answer this week, and your string-match assertion fails for a response that was perfectly fine. Always assert on the topic matched and the action invoked, the deterministic parts, and let a scorer judge the prose. The trace file gives you the deterministic signals. Use them as your hard assertions and reserve scorers for the fuzzy quality call.

One more piece of context worth holding. A lot of agent setup is shifting toward the new Agentforce Builder, which becomes the default the week of July 13, 2026. Tests you build against the new builder and Agent Script will age better than anything anchored to the legacy Setup experience. If you are starting fresh, start there.

Real Talk

None of this removes the need for human review on a customer-facing agent before launch. The tooling makes your humans far more efficient and gives you a safety net for regressions, but a person who knows the business should still read a sample of conversations before you flip it on for real customers. Automation catches drift. Judgment catches the things you did not think to test for.

Frequently Asked Questions

Common questions on Testing Center, scorers, and getting started

Do I need to write code to test an Agentforce agent?

No. Testing Center inside Agentforce Studio supports batch tests and conversation-level simulation through a UI, and you can generate test cases with AI or upload them. The code-first path (agent preview in the CLI, YAML/JSON evals, Metadata API scorers) is there when you want testing wired into CI/CD, but admins can run meaningful tests without writing a line.

What is the difference between agent preview and Testing Center?

Agent preview is a CLI capability for scripting individual sessions and capturing trace files, ideal for automated pipelines and debugging routing. Testing Center is the Studio surface for managing batch tests and conversation-level suites with personas. They complement each other: preview for scripted CI checks, Testing Center for managed suites and team visibility.

How are Custom Scorers different from standard metrics?

Standard metrics measure things Salesforce defines, like resolution and accuracy. Custom Scorers let you define your own criteria in natural language, such as brand tone or escalation behavior, with a rubric and examples. You build them in Next Gen Testing or deploy them via the Metadata API, then activate them from the Scorer Hub to grade both test and live sessions.

Can I run real callouts in an Apex test against Agentforce?

Yes, through the new @IntegrationTest annotation in Developer Preview. It allows live callouts to Agentforce and Data 360 and supports mid-transaction commits via IntegrationTest.commitTestOnly(), with cleanup in a @TearDown method. It is limited to scratch orgs for now and you must enable the ApexIntegrationTests feature in your scratch org definition.

Why shouldn’t my tests check for exact response text?

Agents are non-deterministic and will reword valid answers between runs, so an exact-text assertion produces false failures. Assert on the deterministic signals instead: the topic the agent matched and the action it invoked, both visible in the trace file. Use a scorer to judge the quality of the wording rather than matching it literally.

What’s the single highest-impact thing I can do this week?

Pick your most important agent conversation and turn it into a scripted agent preview test. Run it, read the trace, and confirm the agent routes to the topic and action you expect. That one exercise teaches you more about your agent’s real behavior than any amount of manual chatting, and it becomes the first case in your regression suite.
