Your AI Agent Shipped Without a Test Suite. Summer ’26 Fixes That.
Agent preview hit GA, trace files arrived, and Testing Center moved into Agentforce Studio. For the first time, you can put an autonomous agent through real QA before it touches a customer. Here is how a Salesforce architect actually builds that test pipeline.
Agents are non-deterministic, so the unit test you wrote for a trigger does not cover them. Summer ’26 gives you the missing layer: agent preview is GA in the CLI, trace files show exactly how an agent routed, Testing Center lives inside Agentforce Studio with conversation-level tests, and Custom Scorers let you grade sessions against your own definition of “good.” Treat agent testing as its own discipline, build a regression suite before go-live, and wire scorers into source control.
You know how to test Apex. Mock the callout, assert the result, roll back the data, watch the coverage number climb. That model has worked for fifteen years because Apex is deterministic. Same input, same output, every single run.
Agents are not like that. Feed the same utterance to an Agentforce agent twice and you can get two different phrasings, two different reasoning paths, sometimes two different action choices. The model has built-in variability. Memory, prior turns, and tool access all shape the next response. That is the whole point of an agent, and it is exactly what makes your old test harness useless against it.
The scale makes it sharper. In one recent quarter, Agentforce customers consumed more than 20 trillion tokens, a 400% jump year over year, adding up to roughly 1.79 billion agentic work units. Every one of those interactions is a decision your agent made on its own. A wrong topic selection, a misfired action, a tone that does not match your brand: multiply any of those across millions of conversations and the cost of skipping QA is not theoretical.
Traditional testing asks “did the code return the expected value?” Agent testing asks “did the agent reason correctly, pick the right topic, call the right action, and stay on brand, given that the exact words will vary every time?” Those are different questions, and they need different tools.
Here is the good news. Salesforce spent the last few releases building the answer, and Summer ’26 is where it all clicks into place. You no longer have to hand-roll your own evaluation scripts or edit CSV files to manage a test suite. The platform now treats agent testing as a first-class discipline with its own surface, its own primitives, and its own place in your deployment pipeline.
The testing story is spread across a few products, so it helps to see the pieces together before we get into the how. Each one closes a specific gap between “the demo worked” and “I trust this in production.”
Agent preview (GA): script an interactive test session end to end from the Salesforce CLI with agent preview start, send, sessions, and end. No clicking through a UI to validate a conversation.
Trace files: inspect the traces recorded during a preview session to see exactly how your agent routed and acted. This is your stack trace for agent reasoning, the thing that used to be a black box.
Testing Center: it moved out of Setup and into Agentforce Studio as a dedicated tab beside Agent Builder and Observability. Batch tests, conversation-level simulation, and user personas, all in one place.
Custom Scorers (Beta): grade sessions against your own KPIs (Sentiment, Tone of Voice, Product Interest, Escalation Trigger, Politeness) alongside Salesforce’s standard quality metrics. You define what “good” means.
There is a fifth piece worth flagging for Apex developers: a new @IntegrationTest annotation (Developer Preview) that lets test classes make real callouts to Agentforce and Data 360 instead of mocking everything. It allows live calls and mid-transaction commits through IntegrationTest.commitTestOnly(), with cleanup in a @TearDown method. It runs in scratch orgs only for now, but it is the bridge between your existing Apex test culture and the agent world.
Notice the pattern across all five. None of them are about making the agent smarter. They are about giving you visibility and control over an agent you already built. That is the mark of a platform maturing past the launch hype and into the part where real enterprises actually ship.
Before this, validating a conversation meant opening a builder and typing into a chat box like an end user. Fine for a quick sanity check. Useless for repeatable, automated testing. Agent preview going GA changes that by exposing the whole session as scriptable CLI commands.
The flow is four commands. You start a session, send one or more utterances, list or inspect sessions, and end cleanly. Because it is the CLI, it drops straight into a shell script or a pipeline step.
```bash
# Start an interactive preview session against a published agent
sf agent preview start --api-name Order_Support_Agent --output-dir ./traces

# SESSION is assumed to hold the session ID reported by the start command
# Send an utterance and capture the response
sf agent preview send --session-id "$SESSION" \
  --text "Where is my order 10042?"

# List active sessions, then close when the script finishes
sf agent preview sessions
sf agent preview end --session-id "$SESSION"
```
The real prize is the trace file. When the session runs, the platform records how the agent routed: which topic it matched, which action it invoked, what the planner decided at each step. You read that trace the way you would read a debug log, except instead of governor limits and SOQL counts, you are watching the agent’s decision path.
What the trace tells you that a chat window never could
A failed conversation in a chat window gives you a bad answer and nothing else. A trace gives you the why. Maybe the agent matched the wrong topic because two topics had overlapping descriptions. Maybe it picked the right topic but the action threw and the agent silently recovered with a generic reply. You cannot fix what you cannot see, and traces finally make the reasoning visible.
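To make that concrete, here is a minimal Python sketch of reading a trace for the two failure modes just described. The trace shape (a list of steps with `type`, `name`, and `status` keys) and the names `SAMPLE_TRACE` and `diagnose` are assumptions for illustration, not the documented trace file format, so adapt the field names to whatever your trace files actually contain.

```python
from typing import Dict, List

# Hypothetical trace shape: a list of step dicts. Real trace files may differ.
SAMPLE_TRACE = [
    {"type": "topic", "name": "Returns"},
    {"type": "action", "name": "Lookup_Order", "status": "error"},
    {"type": "reply", "text": "Sorry, I can't help with that right now."},
]


def diagnose(trace: List[Dict[str, str]], expected_topic: str,
             expected_action: str) -> List[str]:
    """Return findings that explain why a conversation went wrong."""
    findings = []
    topics = [s["name"] for s in trace if s.get("type") == "topic"]
    actions = [s for s in trace if s.get("type") == "action"]

    # Failure mode 1: the agent matched a different topic than expected.
    if expected_topic not in topics:
        findings.append(
            f"wrong topic: expected {expected_topic}, matched {topics or 'none'}"
        )

    # The expected action never ran at all.
    if expected_action not in [a["name"] for a in actions]:
        findings.append(f"action not invoked: {expected_action}")

    # Failure mode 2: an action threw, even if the reply looked harmless.
    for action in actions:
        if action.get("status") == "error":
            findings.append(f"action failed: {action['name']}")
    return findings


if __name__ == "__main__":
    for finding in diagnose(SAMPLE_TRACE, "Order_Status", "Lookup_Order"):
        print(finding)
```

The sample trace produces two findings, a wrong topic and a failed action, which is exactly the information the generic apology in the chat window hides.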
Pair agent preview with the new richer evaluations in the CLI, which let you define repeatable evaluation tests in YAML or JSON. Keep those eval files in your repo next to your Apex and LWC. Now your agent behavior is versioned, reviewable in a pull request, and runnable on every commit, the same as any other code asset.
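As a rough illustration of what "runnable on every commit" can look like, the Python sketch below loads a small eval file and turns the results into a process exit code a pipeline can act on. The JSON schema (`utterance`, `expected_topic`, `expected_action`) and the function names are hypothetical; the real evaluation file format is defined by the CLI, not by this example. The `observed` mapping stands in for whatever your pipeline extracts from the trace files.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical eval file contents; the real YAML/JSON schema may differ.
EVAL_JSON = """
[
  {"utterance": "Where is my order 10042?",
   "expected_topic": "Order_Status", "expected_action": "Lookup_Order"},
  {"utterance": "I want to return my shoes",
   "expected_topic": "Returns", "expected_action": "Create_Return"}
]
"""

REQUIRED_KEYS = {"utterance", "expected_topic", "expected_action"}


def load_cases(raw: str) -> List[Dict[str, str]]:
    """Parse eval cases and fail fast on a malformed entry."""
    cases = json.loads(raw)
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"case {index} is missing keys: {sorted(missing)}")
    return cases


def run_suite(cases: List[Dict[str, str]],
              observed: Dict[str, Tuple[str, str]]) -> int:
    """Compare expected and observed (topic, action) pairs.

    Returns 0 when every case passes and 1 otherwise, so a CI step
    can pass the value straight to sys.exit().
    """
    failures = 0
    for case in cases:
        want = (case["expected_topic"], case["expected_action"])
        got = observed.get(case["utterance"])
        if got != want:
            failures += 1
            print(f"FAIL {case['utterance']!r}: wanted {want}, got {got}")
    return 1 if failures else 0
```

Because the cases are plain data, a pull request that changes an expected topic or action shows up as a one-line diff a reviewer can question.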
Here is the question that separates a toy test suite from a real one. Your agent answered the customer’s shipping question correctly. Was it a good response? Correct is not the same as good. It might have been curt. It might have leaked a detail your compliance team would flag. It might have missed an obvious upsell. Standard pass/fail metrics will not catch any of that.
Custom Scorers (Beta) solve this by letting you grade a session against criteria you define in plain language. You write something like “rate the politeness of the agent response on a scale of 0 to 5,” describe what each score level means, and give example responses. The platform applies it across your test sessions and live conversations.
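One way to keep yourself honest about "describe what each score level means" is to lint the rubric before you ever hand it to the platform. The Python sketch below is a local helper under my own assumptions: the `Rubric` class and `rubric_gaps` function are hypothetical and are not the Custom Scorer definition format. It only checks that every score on the scale has both a description and an example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rubric:
    """Local, hypothetical representation of a scorer rubric."""
    name: str
    prompt: str
    levels: Dict[int, str]    # score -> what that score means
    examples: Dict[int, str]  # score -> sample response earning that score
    min_score: int = 0
    max_score: int = 5


def rubric_gaps(rubric: Rubric) -> List[str]:
    """List every score level that lacks a description or an example."""
    gaps = []
    for score in range(rubric.min_score, rubric.max_score + 1):
        if not rubric.levels.get(score, "").strip():
            gaps.append(f"score {score}: no description")
        if not rubric.examples.get(score, "").strip():
            gaps.append(f"score {score}: no example")
    return gaps


# A rubric that only defines the two ends of the scale.
politeness = Rubric(
    name="Politeness",
    prompt="Rate the politeness of the agent response on a scale of 0 to 5.",
    levels={0: "Rude or dismissive.", 5: "Warm, respectful, thanks the customer."},
    examples={0: "Read the FAQ.", 5: "Thanks for your patience, happy to help!"},
)

if __name__ == "__main__":
    for gap in rubric_gaps(politeness):
        print(gap)
```

Here the middle of the scale is undefined, which is the part of a rubric where a model-based grader has the most room to drift.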
Sentiment: catches conversations that technically resolved but left the customer frustrated. The number that correlates most directly with churn and CSAT.
Tone of Voice: a bank and a sneaker startup should not sound identical. Score against your brand guidelines so the agent speaks in a voice your marketing team would approve.
Escalation Trigger: grades whether the agent recognized its own limits and escalated to a human at the right moment, instead of bluffing through a question it could not handle.
Product Interest: for sales-adjacent agents, scores whether the conversation surfaced genuine buying signals rather than treating every chat as a pure support ticket.
The workflow is what makes this stick for developers. You build scorers with Next Gen Testing in Agentforce Studio, or you deploy them through the Metadata API using aiAgentScorerDefinitions so they live in source control. Then you activate them from the Scorer Hub to run against live sessions. Define once, version it, run it everywhere.
This is also where Testing Center earns its move into Studio. It now supports conversation-level testing that simulates full multi-turn conversations with user personas, not just single-utterance checks. You seed prior context, mock tool outputs so you are not hitting real APIs, and assert on intent rather than exact text, because the LLM will reword a valid answer and a brittle string-match test would fail it for no reason.
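The mocking idea is easy to picture outside the platform too. The Python sketch below is an offline analogy, not Testing Center's API: `MockTools` and `fake_agent_turn` are hypothetical names, and the toy agent exists only so the example runs. It shows the shape of the technique: seed context, return a canned tool output instead of calling a real API, then assert on which action was called and with what arguments.

```python
from typing import Any, Dict, List, Tuple


class MockTools:
    """Stand-in for real actions: returns canned outputs, records every call."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self._canned = canned
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def invoke(self, action: str, **kwargs: Any) -> Any:
        # Record the call first so a test can inspect it even on failure.
        self.calls.append((action, kwargs))
        if action not in self._canned:
            raise KeyError(f"no canned output for action {action!r}")
        return self._canned[action]


def fake_agent_turn(utterance: str, context: Dict[str, str],
                    tools: MockTools) -> str:
    """Toy stand-in for the agent under test, so the sketch runs offline."""
    order_id = context["order_id"]  # seeded prior context
    status = tools.invoke("Lookup_Order", order_id=order_id)["status"]
    return f"Order {order_id} is currently {status}."


if __name__ == "__main__":
    tools = MockTools({"Lookup_Order": {"status": "shipped"}})
    fake_agent_turn("Where is it?", {"order_id": "10042"}, tools)
    # Assert on the action and its arguments, not on the reply wording.
    assert tools.calls == [("Lookup_Order", {"order_id": "10042"})]
```

The recorded call list is the stable thing to assert on; the reply string is free to change from run to run.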
Scorers are not just a quality tool. They are a governance artifact. A documented, version-controlled Escalation Trigger scorer is evidence that your agent has guardrails, which is exactly what your risk and compliance teams will ask for before they sign off on a customer-facing deployment. Build them early and treat them as part of your audit trail.
Individual tools are nice. A pipeline is what keeps an agent reliable after the launch excitement fades. Here is how I sequence the work when an org is moving an agent toward production.
@IntegrationTest classes against real Agentforce and Data 360 callouts.
The toolkit is strong, but it is new, and a few edges are still sharp. Knowing them ahead of time saves a frustrating afternoon.
- The @IntegrationTest annotation runs in scratch orgs only for now, not sandboxes or production
- Tests run asynchronously, one at a time, so a large suite takes real wall-clock time
- You must add ApexIntegrationTests to the scratch org definition’s features array first
- Plan for it: keep integration tests focused on critical paths, not exhaustive coverage
- Custom Scorers use a model to evaluate, so the grade itself carries some variability
- Vague scoring descriptions produce inconsistent grades across runs
- Custom Scorers require the Agentforce Scorer Beta permission set to use
- Fix: write tight rubrics with concrete examples for each score level
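Since the grade itself varies, one hedge is to grade the same session several times and look at the spread before trusting a single number. The Python sketch below assumes you can wrap a scorer call in a zero-argument function; `stable_grade`, `score_once`, and the default thresholds are my own illustrative choices, not platform features.

```python
import statistics
from typing import Callable, Dict, List, Union


def stable_grade(score_once: Callable[[], int], runs: int = 5,
                 max_spread: int = 1) -> Dict[str, Union[float, bool, List[int]]]:
    """Grade the same session several times and summarize the result.

    score_once is any zero-argument callable that returns one grade.
    A spread above max_spread suggests the rubric is too vague.
    """
    scores = [score_once() for _ in range(runs)]
    spread = max(scores) - min(scores)
    return {
        "median": statistics.median(scores),
        "spread": spread,
        "stable": spread <= max_spread,
        "scores": scores,
    }


if __name__ == "__main__":
    # Fake scorer that replays fixed grades, so the example is deterministic.
    grades = iter([4, 4, 5, 4, 4])
    print(stable_grade(lambda: next(grades)))
```

A rubric that keeps coming back unstable on identical input is a candidate for tighter level descriptions and more examples, per the fix above.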
The thing nobody warns you about: test for intent, not text
The most common early mistake is writing assertions that check for exact wording. An agent that answered correctly last week will reword the same answer this week, and your string-match assertion fails for a response that was perfectly fine. Always assert on the topic matched and the action invoked, the deterministic parts, and let a scorer judge the prose. The trace file gives you the deterministic signals. Use them as your hard assertions and reserve scorers for the fuzzy quality call.
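A small Python sketch shows why the string match breaks while the intent check holds. The `TurnResult` shape and both helper functions are hypothetical; the topic and action fields stand in for values you would read out of a trace file.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    topic: str   # topic the trace says was matched
    action: str  # action the trace says was invoked
    reply: str   # generated prose; wording varies between runs


def passes_exact(result: TurnResult, expected_reply: str) -> bool:
    """Brittle check: fails whenever the model rewords a valid answer."""
    return result.reply == expected_reply


def passes_intent(result: TurnResult, topic: str, action: str) -> bool:
    """Hard assertion on the deterministic signals only."""
    return (result.topic, result.action) == (topic, action)


# Same routing, same action, two equally valid phrasings.
last_week = TurnResult("Order_Status", "Lookup_Order",
                       "Your order 10042 shipped on Tuesday.")
this_week = TurnResult("Order_Status", "Lookup_Order",
                       "Good news! Order 10042 went out Tuesday.")

if __name__ == "__main__":
    baseline = last_week.reply
    print("exact :", passes_exact(last_week, baseline),
          passes_exact(this_week, baseline))
    print("intent:", passes_intent(last_week, "Order_Status", "Lookup_Order"),
          passes_intent(this_week, "Order_Status", "Lookup_Order"))
```

The exact-match check passes last week and fails this week on an answer that is still correct, while the intent check passes both times. Whether the new phrasing is on brand is a separate question for a scorer.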
One more piece of context worth holding. A lot of agent setup is shifting toward the new Agentforce Builder, which becomes the default the week of July 13, 2026. Tests you build against the new builder and Agent Script will age better than anything anchored to the legacy Setup experience. If you are starting fresh, start there.
None of this removes the need for human review on a customer-facing agent before launch. The tooling makes your humans far more efficient and gives you a safety net for regressions, but a person who knows the business should still read a sample of conversations before you flip it on for real customers. Automation catches drift. Judgment catches the things you did not think to test for.
Apex also gets a new @IntegrationTest annotation in Developer Preview. It allows live callouts to Agentforce and Data 360 and supports mid-transaction commits via IntegrationTest.commitTestOnly(), with cleanup in a @TearDown method. It is limited to scratch orgs for now, and you must enable the ApexIntegrationTests feature in your scratch org definition.
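If you maintain scratch org definitions by script, enabling the feature is a small, idempotent edit to the features array. The Python sketch below is one way to do it; the `enable_feature` helper is hypothetical, and the other keys in the sample definition are illustrative values rather than requirements for integration tests.

```python
import json
from typing import Any, Dict


def enable_feature(definition: Dict[str, Any], feature: str) -> Dict[str, Any]:
    """Return a copy of a scratch org definition with `feature` enabled.

    Leaves the input untouched and never adds the same feature twice.
    """
    updated = dict(definition)
    features = list(updated.get("features", []))
    if feature not in features:
        features.append(feature)
    updated["features"] = features
    return updated


if __name__ == "__main__":
    # Illustrative definition; only the features entry matters here.
    scratch_def = {"orgName": "Agent Integration Tests", "edition": "Developer"}
    updated = enable_feature(scratch_def, "ApexIntegrationTests")
    print(json.dumps(updated, indent=2))
```

Write the printed JSON back to your scratch org definition file (conventionally config/project-scratch-def.json in a Salesforce DX project) before creating the org.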