Large Language Models (LLMs) offer transformative potential, promising to redefine efficiency and innovation across industries. For business leaders, however, this potential is often overshadowed by their perceived unpredictability — a major barrier to enterprise adoption. In environments where reliability, repeatability, and compliance are paramount, the concern that an LLM might produce a different answer to the same question each time can halt adoption and frustrate teams.
The common belief that LLMs are fundamentally non-deterministic is, in fact, an oversimplification. What we observe as “randomness” is better understood as variance — often a byproduct of engineering choices rather than an intrinsic property of the model. As shown in recent research by Horace He and colleagues at Thinking Machines (2025), much of what appears non-deterministic arises from how inference systems handle concurrency, batching, and floating-point arithmetic, not from any inherent unpredictability of the model itself.
This article deconstructs the myth of “randomness” in LLMs to offer a strategic and technically grounded view. We begin by exploring why LLMs appear unpredictable, then outline the deterministic nature of their computational process, identify the true sources of observed variance, and summarize empirical work demonstrating engineered determinism. Finally, we propose a practical path toward building repeatable, enterprise-grade AI systems — with humility about what we can control and awareness of what remains open research.
The Common Misconception: Why LLMs Seem Unpredictable
To architect a deterministic system, we must first understand why LLMs appear to be inherently variable. This perception is rooted in user experience — prompting the same model multiple times can yield slightly different results. Yet this variability is not a symptom of chaos or “random thought,” but a consequence of the probabilistic sampling step at the end of text generation.
At its final stage, an LLM doesn’t “decide” on a single next word; it computes a vector of probabilities over every possible next token. Instead of committing to, say, “love,” it produces a ranked distribution: perhaps “love” at 40%, “lights” at 35%, and so forth. Several plausible candidates coexist with varying likelihoods.
In practice, it is not standard to always choose the top-probability token. Instead, developers apply sampling strategies (often guided by parameters such as temperature) to modulate the degree of variation, effectively trading strict repeatability for creative diversity; a higher temperature increases the likelihood of selecting less probable tokens. Importantly, this sampling-induced variance occurs after the model’s deterministic computation, not within it. The core forward pass remains mathematically fixed for identical inputs, as the sketch below illustrates.
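To make that split concrete, here is a minimal, self-contained sketch of temperature sampling over a fixed logit vector. The logits and candidate tokens are illustrative inventions, not output from any particular model.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a next-token index from a logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))            # greedy decoding: fully repeatable
    scaled = logits / temperature                # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))  # the only stochastic step

# Illustrative logits for three candidates, e.g. "love", "lights", "linger".
logits = [2.0, 1.8, 0.5]
rng = np.random.default_rng(seed=7)              # fixed seed -> repeatable draws
print([sample_next_token(logits, temperature=0.7, rng=rng) for _ in range(5)])
print(sample_next_token(logits, temperature=0))  # always index 0
```

Everything above the final sampling line is a fixed calculation; only the draw from the distribution introduces variation, and even that becomes repeatable once the seed is pinned.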
The Core Truth: An LLM’s Forward Pass is Deterministic
Understanding the concept of the “forward pass” is key to building reliable and repeatable LLM applications. While the model’s final output may appear probabilistic, the computation beneath — the forward pass — is fully deterministic.
The forward pass is the core mathematical journey of data through the network. It takes an input (a prompt) and, through a series of fixed linear and nonlinear operations, computes a probability distribution over possible next tokens. Given identical inputs, parameters, and precision settings, the forward pass produces the exact same probability vector every time. This is not speculation but a property of the underlying numerical computation — a fixed, reproducible function.
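To see this in miniature, consider a toy network in PyTorch, offered as a stand-in rather than the study’s actual setup: with fixed weights and a fixed, single-request execution path, repeated forward passes over the same input agree bitwise.

```python
import torch

torch.manual_seed(0)                  # fixed weights for the toy network
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.GELU(),
    torch.nn.Linear(32, 8),           # stand-in for the final logit projection
).eval()

x = torch.randn(1, 16)                # the same "prompt" both times
with torch.no_grad():
    first = model(x)
    second = model(x)

# Same weights, same input, same execution path: bitwise-identical outputs.
print(torch.equal(first, second))     # True
```

The caveat is the phrase “fixed execution path”: production inference rarely holds it constant, which is exactly where the discussion below picks up.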
This distinction is critical. The model is not “imagining” new probabilities each time; it performs a consistent calculation. The apparent variation arises only after this fixed computation, through downstream processes like sampling or hardware-level execution variance. This invites a deeper question: if the core is deterministic, why do our results still vary in practice?
The Real Culprits of Non-Determinism
The paradox of a deterministic process producing variable outcomes dissolves once we look beyond the model itself and into the complex execution environment surrounding it. The apparent “randomness” is not a fundamental property of the LLM, but an artifact of the computational and infrastructural context. As the Thinking Machines study (He, 2025) demonstrated, two primary system-level factors dominate this effect: floating-point arithmetic and batching dynamics.
- Floating-Point Arithmetic: Modern AI models perform trillions of floating-point operations. Because floating-point math is non-associative, the order of additions or multiplications affects the result: (a + b) + c is not always perfectly equal to a + (b + c). These minute rounding differences, insignificant in isolation, can propagate through layers and subtly alter the final output probabilities, a phenomenon He (2025) discusses at length as “the original sin of nondeterminism” (see the sketch after this list).
- Batch Processing: To maximize GPU utilization, inference servers group multiple requests into batches. However, batch composition changes continuously as new user calls arrive. Because certain GPU kernels are not batch-invariant, even identical inputs can yield subtly different outputs when run in different batch contexts. The Thinking Machines team pinpointed this as the primary source of inference variance — not randomness, but structural sensitivity to batch size and execution order.
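The non-associativity at the root of both effects takes only a few lines of Python to demonstrate; the numbers below are illustrative, not drawn from the study.

```python
# Floating-point addition is not associative: regrouping changes rounding.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is absorbed by -1e16 before it can survive

# The same effect at scale: one set of numbers, two summation orders.
import random
random.seed(0)
values = [random.random() for _ in range(100_000)]
print(sum(values) - sum(reversed(values)))  # typically a tiny nonzero residue
```

Inside a GPU kernel, batch size and concurrency influence the order in which partial sums are reduced, so the same request can traverse a different summation order and emerge with slightly different logits.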
This understanding — that observed variance arises from the environment, not the model — is reinforced by empirical results showing that controlling these conditions yields full repeatability. In other words, determinism is not hypothetical; it’s achievable through careful engineering.
Proof of Concept: The Thinking Machines Study
Moving from theory to practice requires evidence. A recent study by Horace He and colleagues at Thinking Machines (2025) provides exactly that — a rigorous demonstration that LLM inference can, in fact, be made deterministic when system variables are controlled. The researchers hypothesized that observed variance was “most likely related to the server side,” then systematically eliminated such sources.
The results were striking:
- Baseline: Without special controls, 1,000 identical queries yielded between 80 and 100 distinct completions — clear variance even at temperature 0.
- Controlled: After implementing batch-invariant kernels and environmental locking, 10,000 identical runs produced one unique completion.
This empirical result — all completions bitwise identical — confirms that the variability we often attribute to “randomness” is instead an artifact of infrastructure. It reframes determinism not as a philosophical impossibility but as an engineering challenge.
The significance of this finding cannot be overstated — but it also invites humility. It shows that determinism can be achieved in principle, yet sustaining it at scale requires disciplined system design. The path to reliability is therefore not about altering model weights, but about mastering the surrounding architecture. With this shift, determinism moves from theory to practice — and from mystery to method.
A Blueprint for Repeatability: How to Engineer Determinism
Achieving determinism requires more than a configuration checklist — it’s a mindset shift from prompt engineering to systems architecture. The apparent simplicity of a single API call conceals a complex web of dependencies and environmental variables. The following blueprint, adapted from both practice and research (He, 2025), is less a recipe than a discipline: a way of thinking about reliability as a designed property, not an incidental one.
- Lock the Environment. This is foundational. Use the exact same model, and crucially, the same version. API providers frequently deploy silent updates that can subtly change output distributions. Version-locking the model and inference framework ensures that the ground you stand on stays fixed.
- Control Generation Parameters. Leverage the underused parameters that govern generation behavior. A fixed random seed makes sampling choices repeatable, and tools such as logit_bias can further constrain vocabulary use, shaping a controlled “semantic space” that aligns with a company’s voice or domain. This kind of control moves the model from a probabilistic text generator toward a repeatable reasoning system (the first sketch after this list combines these settings with a pinned model snapshot).
- Adopt a “Divide and Conquer” Strategy. This embodies systems thinking: build systems, not prompts. In one real-world example, a single, monolithic prompt for KPI extraction failed catastrophically in production. Success came only after decomposing the task into nearly twenty smaller steps, each explicit, testable, and partially deterministic. Determinism emerges not from rigidity, but from modularity.
- Treat Information Retrieval as an Engineering Task. The aim is not sophistication for its own sake, but reliability. In one project, a graph-based retrieval system was replaced by a simple keyword search — and accuracy shot to nearly 100%. This is not regression but maturity: determinism begins with using the simplest tool that works consistently, not the most impressive acronym.
- Leverage Specialist Models. For targeted subtasks, smaller open models offer greater transparency, repeatability, and cost control. A model fine-tuned for data extraction or classification may outperform a general-purpose one, and behave more consistently while doing so. Determinism often correlates with simplicity: fewer moving parts, fewer surprises.
- Enforce Constrained Outputs. Determinism thrives on structure. Forcing the model to produce predictable formats (e.g., JSON or XML schemas) matters not only for integration but also for compliance: structured outputs make both systems and audits reproducible. In practice there are two routes. Re-prompting strategies (templates, self-check prompts, post-parse repair) are simple but brittle at scale. Constrained decoding instead enforces the schema during token generation (via finite-state machines built from JSON Schema, regex, or context-free grammars), guaranteeing well-formed outputs rather than fixing them after the fact. For enterprise workflows, prefer constrained decoding and consider .txt’s Outlines: an open-source library that enforces structure at generation time, supports JSON Schema, regex, CFG, and type constraints, and works across OpenAI, vLLM, Transformers, llama.cpp, and more (vLLM even exposes it as a structured-outputs backend). This shifts you from “parse what you got” to “only generate what’s valid” (see the second sketch after this list).
- Prioritize Data Preparation. The old principle of “garbage in, garbage out” is amplified in generative systems. Ambiguous, contradictory, or shifting input contexts produce corresponding variance in results. Clean, standardized data is not a luxury — it’s the foundation of predictable reasoning.
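To ground the first two items, here is a hedged sketch using the OpenAI Python SDK. The snapshot name is only an example, and the seed parameter is documented as best-effort rather than a hard guarantee, which is why logging system_fingerprint matters.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",      # dated snapshot, not a floating alias
    temperature=0,                   # suppress sampling variance
    seed=42,                         # best-effort reproducible sampling
    messages=[
        {"role": "user", "content": "Classify this support ticket: ..."},
    ],
)

# system_fingerprint changes when the provider's serving stack changes,
# so logging it makes silent backend updates detectable.
print(response.system_fingerprint)
print(response.choices[0].message.content)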
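For constrained outputs, a minimal sketch with Outlines follows, written against the pre-1.0 outlines.generate API (newer releases rename these entry points) and using an example model ID.

```python
from pydantic import BaseModel
import outlines

class KPI(BaseModel):
    name: str
    value: float
    unit: str

# Load a local model via HF Transformers; the model ID is an example.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# The generator masks every token that would break the KPI schema,
# so structure is enforced during decoding, not repaired afterwards.
generator = outlines.generate.json(model, KPI)
result = generator("Extract the KPI: 'Q3 revenue grew to 4.2M EUR.'")
print(result)   # a validated KPI instance
```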
This blueprint offers a technical path toward determinism — but it also surfaces a deeper strategic and philosophical question: should we always want perfect repeatability?
The Strategic Question: Should We Always Aim for Determinism?
While we now understand that determinism is technically achievable, its strategic desirability is more nuanced. The context — business, creative, or ethical — dictates whether predictability is a feature or a constraint. Framed more humanly: would we want our colleagues to be deterministic?
The answer, of course, depends on the task — and on our philosophy of intelligence itself.
In some contexts, determinism is non-negotiable: extracting KPIs, enforcing compliance, classifying support tickets. In these cases, reliability is performance — precision and repeatability define value.
In others, however, variance is not noise but texture. Creativity, exploration, and ideation depend on it. Forcing determinism in these contexts would sterilize the process. A creative model, like a creative person, must sometimes surprise us.
Ultimately, effective AI design is about calibrating variance. Determinism and creativity are not opposites — they are coordinates on a spectrum. Engineering the right balance between them is less a technical act than a philosophical one: a reflection of what kind of intelligence we want to build, and what kind of world we wish to inhabit.
Conclusion
The notion that Large Language Models are inherently “random” has long hindered their enterprise adoption. The reality is more balanced: determinism in LLMs is an engineering discipline, not a contradiction. The model’s core computation is stable; the variability we observe stems from its operating environment and sampling design — both within our capacity to control.
Moving beyond this misconception allows leaders to see LLMs not as volatile black boxes, but as systems whose behavior reflects design choices. The challenge ahead is not theoretical — it’s architectural. It demands patience, rigor, and humility in the face of complexity. Yet the reward is profound: the ability to design AI systems that are not only repeatable, but responsibly so — deterministic when necessary, variable when meaningful, and always aligned with human intent.
—
Written by Adrian Sanchez de la Sierra, Head of AI Consulting at Zartis.
References
- He, H., et al. (2025). Defeating Nondeterminism in LLM Inference. Thinking Machines.
- dottxt-ai. (2025). Outlines: Structured Outputs for LLMs [software library]. GitHub.
- Lee, Y., Ka, S., Son, B., Kang, P., & Kang, J.-W. (2024). Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models. arXiv.
- Kim, Z. M., Ramachandran, A., Tavazoee, F., Kim, J.-K., Rokhlenko, O., & Kang, D. Y. (2025). Align to Structure: Aligning Large Language Models with Structural Information. arXiv.