AI Testing Has Entered a New Era: Adam Sandman Discusses Non-Deterministic Systems and SureWire on TestGuild

March 16th, 2026 by Adam Sandman


Artificial intelligence is rapidly changing the way software is built, tested, and trusted. On a recent episode of the TestGuild Automation Podcast, Adam Sandman, co-founder of Inflectra, joined Joe Colantonio to discuss one of the most important quality challenges facing modern software teams: how to test non-deterministic systems. The conversation explored why traditional pass/fail approaches are no longer sufficient for AI-enabled applications, how quality teams need to think differently about risk, and how Inflectra’s upcoming SureWire product is being designed to help organizations move from AI experimentation to enterprise-grade reliability.

For decades, software testing has largely depended on a stable assumption: if the same input is given under the same conditions, the system should return the same output every time. That assumption still holds for deterministic software, and it remains essential for testing traditional application layers such as user interfaces, APIs, business logic, and databases. But AI systems are different. As discussed on the podcast, non-deterministic systems can produce different responses to the same prompt or input, even when the environment appears unchanged.

That means QA teams can no longer rely exclusively on binary notions of “pass” and “fail” when evaluating AI-powered workflows.

What Makes Non-Deterministic Systems So Hard to Validate

One of the core points from the episode is that AI is not just another feature layer. It changes the validation model itself. In deterministic testing, you can define a single expected result and compare actual against expected. In non-deterministic testing, the same prompt may generate multiple acceptable responses, different quality levels, or even responses that stay within policy but vary in tone, specificity, or usefulness.

That changes how teams define success. The right question is often no longer, “Did it match exactly?” but instead, “Did it behave within acceptable boundaries?” Those boundaries may include safety, relevance, accuracy, harmful content controls, policy compliance, brand appropriateness, or task completion quality. The more critical the application, the more carefully those thresholds need to be defined.
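As a minimal sketch of what "behaved within acceptable boundaries" can look like in code, the example below replaces a single expected string with several independent checks. Everything here is illustrative: the FORBIDDEN_PHRASES set, the criteria names, and the length limits are assumptions made for the sketch, not part of any tool discussed in the episode.

```python
# Minimal sketch: judging an AI response against boundaries instead of
# a single exact-match expectation. All criteria, thresholds, and data
# here are hypothetical.

FORBIDDEN_PHRASES = {"internal use only", "as an ai language model"}

def within_boundaries(response: str, topic_keywords: list[str],
                      min_length: int = 20, max_length: int = 2000) -> dict:
    """Score one response against several independent pass/fail boundaries."""
    text = response.lower()
    checks = {
        # Safety: no blocked phrase may appear anywhere in the output.
        "safety": not any(p in text for p in FORBIDDEN_PHRASES),
        # Relevance: at least one expected topic keyword is mentioned.
        "relevance": any(k.lower() in text for k in topic_keywords),
        # Usefulness proxy: the answer is neither empty nor rambling.
        "length_ok": min_length <= len(response) <= max_length,
    }
    checks["overall"] = all(checks.values())
    return checks

# Two differently worded responses to the same prompt can both pass,
# which is the point: the boundary, not the wording, is the test.
print(within_boundaries(
    "You can reset your password from the account settings page.",
    topic_keywords=["password", "reset"],
))
```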

Why Pass/Fail Testing Is No Longer Enough for AI QA

A major theme is that QA teams need to move beyond a purely binary concept of success when evaluating AI systems. Because 100% testability is impossible in non-deterministic environments, quality teams must shift from traditional pass/fail metrics toward identifying an acceptable level of risk for the use case. The episode also notes that the right threshold depends heavily on context: an e-commerce chatbot and a safety-critical system like air traffic control warrant very different tolerances.

This is one of the most important mindset shifts for the industry:

  • AI quality is not just about correctness in the narrowest sense.
  • It is about bounded risk.
  • It is about confidence.
  • It is about being able to say, with evidence, that a system behaves well enough for the scenario in which it will be deployed.

For software leaders, that is a much more realistic and much more valuable framing. It aligns testing to real-world consequences instead of forcing a probabilistic system into a deterministic rubric that does not reflect how the system actually behaves.

The Growing Gap Between AI Development Speed and QA Capacity

Another critical issue discussed in the episode is the pressure that AI is putting on quality teams. We have known for a while that AI-generated code is raising the stakes for QA while budgets stay flat. That means teams are being asked to validate more software, more quickly, while also developing entirely new methods for testing AI-driven features such as chatbots, copilots, and agentic workflows.

This is where many organizations are feeling strain. AI can help teams produce more functionality faster, but quality does not happen automatically. In fact, the faster the pace of development, the greater the risk that fragile assumptions, incomplete evaluation criteria, or unmeasured AI behaviors will make their way into production.

That is why QA leaders need better tooling and better frameworks, not just more optimism about AI.

If development accelerates but testing remains stuck in yesterday’s model, the gap between shipping and trust will only widen.

SureWire and the Need for a New Category of AI Quality Assurance

That gap is exactly why we created SureWire, built specifically for agentic AI and designed to help teams know whether their AI agents are safe, reliable, and effective. SureWire acts like an experienced test manager, designing and implementing dynamic tests using collections of bespoke testing agents. The workflow is organized around three steps:

  1. Teams describe their AI agent and their concerns.
  2. SureWire assesses the system dynamically.
  3. The platform helps them improve through a report highlighting strengths, weaknesses, risks, and quality concerns.

That structure matters because it translates a difficult technical problem into an operational one. Instead of asking teams to manually invent everything from scratch, it gives them a path to define concerns, execute assessment, and interpret results in a more scalable way.

Many AI systems look promising in demos, prototypes, and controlled environments. But enterprise teams need something more than promising. They need repeatable evidence that a system performs acceptably under pressure, across variation, and within the constraints of the real business environment.

The challenge is not whether AI can produce impressive results. We already know it can. The challenge is whether those results can be trusted consistently enough for production use in environments where quality, compliance, and customer confidence matter.

It is not about replacing all existing QA practices. It is about filling the specific gap that appears when deterministic test models meet non-deterministic behavior.

What Still Remains Deterministic in an AI Application

Another useful point from the TestGuild episode is that teams should decompose their applications instead of treating the whole stack as non-deterministic. When testing AI-enabled systems, teams should still apply traditional deterministic testing to areas like the web UI and data layers, while using new statistical or agent-based testing approaches for the non-deterministic AI components.

This is important because it gives organizations a practical adoption path. You do not have to throw out everything you know about quality engineering. You still need regression suites, integration tests, API validation, requirements traceability, access control verification, and release discipline. What changes is that you now need an additional quality layer for the parts of the system that can vary meaningfully in behavior from one interaction to the next.
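To make the "statistical" part of that new quality layer concrete, one common pattern, sketched here under assumptions, is to run the same input through the non-deterministic component many times and release only if the observed pass rate clears a threshold chosen for the use case. The call_agent stub and the acceptance check below are hypothetical placeholders for a real system and evaluator.

```python
import random

# Hypothetical stand-ins for the real system and evaluator under test.
def call_agent(prompt: str) -> str:
    # Placeholder: a real call would invoke the model or agent under test.
    return random.choice(["Ship it back within 30 days for a refund.",
                          "Please contact support to arrange a return."])

def meets_acceptance_criteria(response: str) -> bool:
    # Placeholder boundary check; see the earlier sketch for a richer one.
    return "refund" in response.lower() or "support" in response.lower()

def pass_rate(prompt: str, runs: int = 50) -> float:
    """Run the same prompt repeatedly and measure how often it passes."""
    passed = sum(1 for _ in range(runs)
                 if meets_acceptance_criteria(call_agent(prompt)))
    return passed / runs

# The threshold is a risk decision, not a universal constant: an
# e-commerce bot might ship at 0.95; safety-critical systems demand
# far stricter evidence (and different methods entirely).
THRESHOLD = 0.95

rate = pass_rate("How do I return a damaged item?")
print(f"Observed pass rate: {rate:.0%}")
assert rate >= THRESHOLD, f"Pass rate {rate:.0%} is below threshold"
```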

That decomposition is one of the most practical ways to think about AI QA. It keeps the discipline grounded while making room for new methods where they are actually needed.

AI Testing Is Really About Risk Thresholds

Perhaps the most important takeaway from the episode is that testing non-deterministic systems is fundamentally about risk management. Adam and Joe discuss how acceptable risk thresholds vary with the context and consequences of failure. A consumer support bot and a mission-critical control system should not be evaluated with the same tolerance model.

That idea should shape how organizations build their AI quality strategies. Every team needs to answer questions like:

  • What kinds of failure are unacceptable?
  • How much variability is tolerable?
  • What dimensions matter most — accuracy, safety, consistency, policy compliance, or something else?
  • How should those dimensions be weighted?
  • What evidence is required before release?
Those are governance questions as much as technical ones. And that is why AI testing is becoming such a strategic function inside engineering organizations. It is no longer just about defect detection. It is about defining operational trust.
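One way to make those answers operational, sketched here purely as an illustration, is to write them down as an explicit policy per deployment context. Every name and number below is a hypothetical example of the governance decisions listed above, not a recommended value.

```python
# Illustrative only: risk thresholds and evaluation weights captured as
# explicit, reviewable data rather than implicit team folklore.

RISK_POLICIES = {
    "ecommerce_chatbot": {
        "hard_failures": ["unsafe_content", "policy_violation"],  # never tolerable
        "min_pass_rate": 0.95,   # variability this context can absorb
        "weights": {"accuracy": 0.3, "safety": 0.4, "consistency": 0.3},
        "release_evidence": "scored sample of conversations per build",
    },
    "safety_critical_control": {
        "hard_failures": ["unsafe_content", "policy_violation", "missed_alert"],
        "min_pass_rate": 0.999,  # far stricter tolerance for this context
        "weights": {"accuracy": 0.5, "safety": 0.4, "consistency": 0.1},
        "release_evidence": "full evaluation suite plus expert human review",
    },
}

# Sanity check: each policy's dimension weights should sum to 1.
for name, policy in RISK_POLICIES.items():
    assert abs(sum(policy["weights"].values()) - 1.0) < 1e-9, name
```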

Watch the TestGuild Episode

If your team is working through how to test AI-enabled applications, this TestGuild conversation is a useful place to start. It covers why the old rules are breaking down, how non-deterministic systems require a different QA mindset, and why tools like SureWire are emerging as part of a new discipline for enterprise AI reliability. TestGuild’s episode summary also highlights related themes including the difference between deterministic and non-deterministic systems, the pressure AI places on QA budgets, and why testers who embrace AI as a tool will be better positioned to lead going forward.

As AI changes how software is built, it will also change how software is trusted. The organizations that adapt their quality practices early will be the ones best positioned to scale AI with confidence.

Interested in learning more about SureWire?

  • SureWire is Inflectra’s upcoming AI quality assurance platform built to help teams assess non-deterministic and agentic AI systems for safety, reliability, and effectiveness.
  • Join the waitlist to get early access and follow the latest product updates from Inflectra.

About the Author

Adam Sandman

Adam Sandman is a visionary entrepreneur and a respected thought leader in the enterprise software industry, currently serving as the CEO of Inflectra. He spearheads Inflectra’s suite of ALM and software testing solutions, from test automation (Rapise) to enterprise program management (SpiraPlan). Adam has dedicated his career to revolutionizing how businesses approach software development, testing, and lifecycle management.

Spira Helps You Deliver Quality Software, Faster and with Lower Risk.

Get Started with Spira for Free

And if you have any questions, please email us or call +1 (202) 558-6885