For Engineers Who Appreciate Elegant Solutions | November 2025
“The purpose of computing is insight, not numbers.” — Richard Hamming
“Simplicity is the ultimate sophistication.” — Leonardo da Vinci
When we describe large language models as “systems trained to predict the next token based on statistical patterns,” we mistake the training objective for the actual mechanism. This is like describing human vision as “systems optimized to detect edges and corners”—true about the training signal, but missing what actually emerges.
What these models learn to do is something far stranger and more profound: they learn to navigate a high-dimensional semantic space where meaning is encoded in geometry.
The Convergence of Understanding
Consider what happens when we train different neural systems on different modalities:
A vision system learns to process images. A language model learns to process text. A protein model learns to process amino acid sequences. Each starts from completely different raw inputs—pixels, tokens, molecular structures.
Yet something remarkable happens. When we examine their internal representations, we find they converge toward compatible geometric structures. Vision and language both learn to represent “cat” as a direction in space. Text and images both develop notions of similarity, category, abstraction.
This isn’t coincidence. It’s revealing something fundamental.
As Eric Drexler observes in his analysis of latent space convergence, different modalities mapping to compatible spaces suggests reality itself has a coherent structure that multiple learning systems naturally discover. The “Platonic Representation Hypothesis” proposes that diverse neural architectures, trained on diverse data, converge toward representing the same underlying semantic reality.
The space has staggering capacity: the number of distinguishable concept vectors in a 4096-dimensional space vastly outnumbers the synapses in the human brain. But capacity alone doesn’t explain the phenomenon. What matters is the structure: the geometry of relationships, the manifolds of meaning, the topology of thought.
Concepts, Not Tokens
Recent work by Kew et al. (2025) reveals the depth of this abstraction. When they analyzed how models represent grammatical concepts across 23 typologically diverse languages, they found something striking: approximately 50% of the neural features encoding “grammatical gender” overlap across all 15 languages that express this concept.
The model isn’t learning 15 separate implementations. It’s learning a single abstract concept—“gender”—then applying it across linguistic contexts. The internal representation transcends any specific language. As they put it: “the internal lingua franca of large language models may not be English words per se, but rather concepts.”
This suggests models develop meta-linguistic abstractions that exist independent of surface forms. They’re not storing templates or patterns. They’re building a conceptual geometry where relationships matter more than symbols.
The Discovery: Reasoning Creates Paths
Now we arrive at the specific phenomenon this whitepaper explores. When Zhou et al. (2025) examined what happens inside language models during multi-step reasoning, they discovered something beautiful: logical reasoning creates geometric trajectories through representation space, and the shape of these trajectories encodes reasoning quality.
Each reasoning step becomes a point in high-dimensional space. Connect consecutive points and you see the model’s cognitive path—literally, a trajectory through concept space.
The profound part: the geometry of this path reveals logical structure.
Not semantic content (which topic it discusses). Not superficial patterns (which words it uses). But the underlying logical form—whether the reasoning maintains consistent direction, makes justified transitions, builds coherent arguments.
They measured three geometric properties:
- Position (zero-order): Where is the model in concept space? → Correlates 0.26 with logical quality
- Velocity (first-order): How fast is it moving between concepts? → Correlates 0.17 with logical quality
- Curvature (second-order): How sharply does the path turn? → Correlates 0.53 with logical quality
Why does curvature outperform simpler metrics? Because logic is fundamentally about maintaining consistent direction. Good reasoning doesn’t just move through concept space—it moves coherently. Each step follows naturally from the previous. Bad reasoning makes sharp turns, conceptual jumps without proper bridges, circular paths that seem to progress but actually loop.
The Philosophical Echo
This discovery dissolves an ancient divide. Since Descartes, we’ve separated abstract reasoning from spatial intuition—pure logic versus geometric thinking. But here, that distinction collapses.
Logic is spatial. Reasoning is movement through structured geometry.
Plato’s allegory of the cave depicted understanding as movement from shadow to reality, from confusion to clarity. It was a metaphor. Here, it becomes literal: understanding is navigating toward regions of semantic space with clearer structure.
Zhou et al. write: “This connects to foundational work on conceptual spaces—how meaning itself organizes spatially. We extend this: not just what we think, but how we think through problems follows geometric principles.”
The beauty isn’t that we imposed this structure. The beauty is that it emerges naturally from systems learning to reason. Train a model on logical tasks, and geometric patterns appear unbidden in its representations. The structure was there, waiting to be seen.
The Essence
From this deeper understanding, three insights emerge:
- Reasoning is a trajectory through conceptual space.
- Quality is encoded in geometry.
- Curvature reveals what correctness conceals.
This paper explores what happens when we take these insights seriously. Can we measure the geometry of reasoning? Can we make it practical? Can we apply it universally, even to models we don’t control?
The answer shapes how we build and understand AI systems.
I. The Problem We Face
When language models reason, we measure outcomes: Did it get the right answer? Is the confidence score high? Does it pass our test cases?
But correctness is a poor proxy for understanding. A student can arrive at the correct answer through flawed reasoning—memorization masquerading as comprehension, lucky guesses appearing as insight. The same holds for AI systems.
Consider two reasoning chains that both conclude “Socrates is mortal”:
Chain A:
All humans are mortal.
Socrates is a human.
Therefore, Socrates is mortal.
Chain B:
All humans are mortal.
Socrates is mortal.
Therefore, Socrates is mortal.
Chain B reaches the correct conclusion, but its logic is circular: the conclusion is assumed as a premise. Traditional metrics—accuracy, perplexity, confidence—cannot distinguish the two chains.
The question: Can we measure reasoning quality independent of correctness?
II. The Insight: Logic Has Geometry
Recent work by Zhou et al. (2025) revealed something profound: when models reason, logical structure manifests as geometric patterns in representation space.
The Core Observation
Each reasoning step creates a point in high-dimensional space. Connect consecutive points and you have a trajectory—a path the model follows while thinking.
The shape of that path encodes reasoning quality.
Why Geometry?
Consider how information flows through reasoning:
Zero-order (position): Where are we? → Captures semantic content (topic, language)
First-order (velocity): How fast are we moving? → Captures change between steps
Second-order (curvature): How sharply are we turning? → Captures logical structure
This is not metaphor. This is measurement.
Zhou et al. found:
- Position correlates 0.26 with logical quality
- Velocity correlates 0.17 with logical quality
- Curvature correlates 0.53 with logical quality
Why? Because logic is about consistency of direction, not speed or location. Good reasoning maintains trajectory. Bad reasoning makes sharp turns—sudden conceptual jumps without proper justification.
The Mathematical Beauty
The elegance lies in the universality. The same logical pattern produces similar curvature regardless of content:
Modus ponens in weather:
If it rains → ground wet
It rains
Therefore: ground wet
Modus ponens in finance:
If rates rise → bonds fall
Rates rise
Therefore: bonds fall
Different topics. Different words. Same geometric signature.
This is like discovering that orbital mechanics and falling apples follow the same law. The surface content differs; the underlying structure unifies.
III. From Theory to Measurement
The Challenge
Zhou et al.’s approach requires accessing hidden states—internal neural activations during inference. This works for models you control but excludes GPT-4, Claude, Gemini: the majority of production AI.
The question becomes: Can we measure geometry through the only interface these systems provide—their text output?
The Approach
Text embeddings offer a path. They map text to vectors in semantic space. If embeddings preserve enough geometric structure from hidden states, we can measure reasoning quality universally.
The validation question: How well do embeddings preserve curvature from hidden states?
The Experiment
- n = 71 reasoning chains (60 real-world math problems, 11 synthetic logic)
- Extract both: Hidden states from Qwen2.5-3B, embeddings from MPNet
- Compute curvatures: For both representations
- Measure correlation: Pearson r between curvature sequences
Result: r = 0.684 (95% CI [0.613, 0.755], p < 10⁻³⁰)
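As an illustration of the correlation step (a sketch, not the authors’ exact code), assume the per-step curvature sequences for one chain have already been computed from both representations; SciPy’s pearsonr then gives the preservation statistic. The aggregation across all 71 chains and the confidence interval are not shown here.
import numpy as np
from scipy.stats import pearsonr

def curvature_preservation(hidden_curvatures, embedding_curvatures):
    # Both inputs: one Menger curvature per interior reasoning step,
    # pre-computed from hidden states and from text embeddings respectively.
    kappa_h = np.asarray(hidden_curvatures, dtype=float)
    kappa_e = np.asarray(embedding_curvatures, dtype=float)
    r, p = pearsonr(kappa_h, kappa_e)
    return r, p

# Toy example: two curvature profiles that mostly agree
r, p = curvature_preservation([0.21, 1.90, 0.18, 0.40],
                              [0.30, 1.55, 0.25, 0.52])
print(f"r = {r:.3f}, p = {p:.3f}")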
Interpreting 0.684
This is moderate, not strong (threshold: 0.7). But context matters:
Theoretical limit: Models trained differently show SVCCA alignment of 0.70-0.73. Our 0.68 approaches this despite comparing vastly different architectures (110M encoder vs 3B decoder).
Per-chain median: r = 0.84 (strong). The aggregate is pulled down by high variance (σ = 0.46). Many individual chains work excellently.
Sufficient for: Comparative analysis, quality monitoring, anomaly detection—exactly the use cases practitioners need most.
IV. The Conceptual Model
What Are We Really Measuring?
Imagine reasoning as navigation through concept space:
Good navigator: Smooth path, each turn justified, destination reached efficiently
Poor navigator: Erratic turns, backtracking, conceptual jumps
Curvature quantifies “turn sharpness.” High curvature = reasoning that changes direction abruptly without proper logical bridges.
Why Second-Order Matters
First-order (velocity) tells you how much each step changes from the previous:
v_t = embedding[t+1] - embedding[t]
This captures change but not its consistency. A large velocity might indicate:
- Important conceptual progress (good)
- Random jump in reasoning (bad)
Second-order (curvature) resolves this by comparing consecutive velocities:
θ = angle_between(v_t, v_{t+1})
κ = 2 sin(θ) / ||embedding[t+2] - embedding[t]||
Low curvature: Velocities align → reasoning maintains direction
High curvature: Velocities diverge → reasoning changes course sharply
This is why curvature (r = 0.53) outperforms velocity (r = 0.17) in predicting logical quality.
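A minimal NumPy sketch of these quantities (illustrative helpers, not the library’s API), following the κ = 2 sin(θ) / ||chord|| definition above with θ taken between consecutive velocities:
import numpy as np

def menger_curvature(p0, p1, p2, eps=1e-12):
    # kappa = 2*sin(theta) / ||p2 - p0||, theta = angle between consecutive velocities
    v1, v2 = p1 - p0, p2 - p1                                      # first-order: velocities
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 2.0 * np.sin(theta) / (np.linalg.norm(p2 - p0) + eps)   # second-order: turn sharpness

def step_curvatures(trajectory):
    # One curvature value per interior point of an (n_steps, dim) trajectory
    pts = np.asarray(trajectory, dtype=float)
    return np.array([menger_curvature(pts[i], pts[i + 1], pts[i + 2])
                     for i in range(len(pts) - 2)])

# Toy 2-D points stand in for step embeddings
straight = [[0, 0], [1, 0], [2, 0], [3, 0]]
zigzag = [[0, 0], [1, 0], [1, 1], [2, 1]]
print(step_curvatures(straight))   # ~[0, 0]: direction maintained
print(step_curvatures(zigzag))     # clearly > 0: sharp turns
A flat path scores near zero, while each 90-degree turn in the toy zigzag yields a curvature of roughly 1.41.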
The Scale Invariance
Menger curvature is scale-invariant: it measures angular deviation normalized by chord length. This makes it robust to:
- Embedding dimension differences
- Magnitude variations across models
- Context length effects
You’re measuring the shape of the path, not its size.
V. Examples That Illuminate
Example 1: Detecting Failure Modes
Correct answer, flawed reasoning:
Problem: Calculate 15% of $240
Step 1: "15% is approximately 1/6"
Step 2: "So divide $240 by 6: $40"
Step 3: "Actually, let me multiply: $240 × 0.15 = $36"
Step 4: "Final answer: $36"
Curvature analysis:
Step 1→2: κ = 0.23 (smooth)
Step 2→3: κ = 1.87 (sharp turn!)
Step 3→4: κ = 0.19 (smooth)
The spike at step 2→3 flags the reasoning reversal. The final answer is correct, but the path was erratic.
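Turning that observation into an automated check is straightforward. The sketch below flags transitions whose curvature exceeds a cutoff; the threshold of 1.0 is illustrative, not a recommendation.
# Hypothetical per-transition curvatures for the example above
curvatures = {"1→2": 0.23, "2→3": 1.87, "3→4": 0.19}

SPIKE_THRESHOLD = 1.0  # illustrative cutoff; tune on your own data

spikes = [step for step, kappa in curvatures.items() if kappa > SPIKE_THRESHOLD]
print("Review these transitions:", spikes)   # ['2→3']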
Example 2: The Content-Logic Separation
Two reasoning chains:
Weather: “Rain → wet ground, raining, therefore wet”
Finance: “Rate rise → bonds fall, rates rising, therefore falling”
Semantic similarity (position): 0.12 (different topics)
Velocity alignment: 0.34 (some structural similarity)
Curvature correlation: 0.89 (same logical pattern!)
The geometric signature reveals the shared logical structure (modus ponens) invisible to semantic analysis.
Example 3: The Hidden States Question
Why do embeddings preserve 68% of structure despite being from a different model?
Theory: Models converge to partially shared conceptual representations (Platonic Representation Hypothesis).
Evidence:
- Different LLMs show 0.70 SVCCA alignment in middle layers
- 10-30% of features are universal across model families
- Concepts show 50% overlap across 23 languages within one model
Mechanism: We’re not measuring implementation artifacts (specific to Qwen2.5). We’re measuring conceptual-level patterns (universal across models). Logic exists at the conceptual level.
This is why using OpenAI embeddings to analyze Claude’s reasoning is theoretically justified: both operate in partially shared conceptual space.
VI. The Implementation
The beauty of the approach is its simplicity. Here’s the complete conceptual model:
# Reasoning becomes trajectory
flow = space.traverse(reasoning_steps)
# Trajectory has geometric properties
positions = flow.trajectory # (n_steps, dimension)
velocities = diff(positions) # First-order: change
curvatures = angle(velocities) # Second-order: direction change
# Quality emerges from geometry
quality = 1 / (1 + mean(curvatures))
That’s it. No complex neural architectures. No training. Just measurement.
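For readers without the library, here is a minimal end-to-end sketch under stated assumptions: steps are embedded with sentence-transformers’ all-mpnet-base-v2 (the embedding model validated in Section III), curvature is computed as in the Section IV sketch, and quality uses the 1 / (1 + mean κ) mapping above. The function names are illustrative, not the library’s API.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def menger_curvatures(points):
    # kappa = 2*sin(theta) / ||chord|| at each interior point of the trajectory
    out = []
    for a, b, c in zip(points, points[1:], points[2:]):
        v1, v2 = b - a, c - b
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        out.append(2 * np.sin(np.arccos(np.clip(cos, -1, 1))) / (np.linalg.norm(c - a) + 1e-12))
    return np.array(out)

def reasoning_quality(steps, model):
    embeddings = model.encode(steps)              # (n_steps, 768) trajectory
    kappa = menger_curvatures(embeddings)         # second-order geometry
    return 1.0 / (1.0 + kappa.mean()), kappa      # quality score plus raw curvatures

model = SentenceTransformer("all-mpnet-base-v2")
steps = [
    "All humans are mortal.",
    "Socrates is a human.",
    "Therefore, Socrates is mortal.",
]
quality, kappa = reasoning_quality(steps, model)
print(f"quality = {quality:.3f}, curvatures = {np.round(kappa, 3)}")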
Universal Adapter Pattern
The elegance extends to implementation. Same interface for all model types:
# API model (embeddings)
space = Space.from_openai(api_key="...")
# Self-hosted model (hidden states)
space = Space.from_transformers(model, tokenizer)
# Either way, same analysis
flow = space.traverse(steps)
quality = analyze(flow)
The abstraction lets you switch between embeddings (68% fidelity, universal) and hidden states (98% fidelity, model-specific) without changing downstream code.
A Practical Example
Comparing models on geometric grounds:
# Same problem, different models
flows = {
'gpt4': space.traverse(gpt4_reasoning),
'claude': space.traverse(claude_reasoning)
}
# Geometric comparison
for model, flow in flows.items():
print(f"{model}: κ = {flow.mean_curvature:.3f}")
# Result on same scale
# gpt4: κ = 0.24 (smoother reasoning)
# claude: κ = 0.42 (more erratic)
No vendor-specific metrics. No opaque confidence scores. Just geometry on a common scale.
VII. The Limitations (Honest Assessment)
What We Know
Fidelity: r = 0.684 is moderate, not perfect; roughly a third of the geometric signal, as measured by correlation, is not preserved when moving from hidden states to embeddings.
Variance: Per-chain performance ranges from r = -0.32 (poor) to r = 0.99 (excellent). You cannot predict which chains will preserve geometry well.
Validation scope: Math reasoning only (GSM8K). Other domains (code, legal, medical) untested.
Single embedding model: Only all-mpnet-base-v2 (768-dim) validated. Larger embeddings (3072-dim) may improve correlation.
What This Means
Appropriate uses:
- Comparing models (variance averages out)
- Monitoring trends over time (aggregate effects)
- Flagging anomalies for review (directionally correct)
Inappropriate uses:
- High-stakes decisions on single chain (too much variance)
- Precise measurements (use hidden states when available)
The tool is powerful but not perfect. Understanding its boundaries is part of using it well.
VIII. Deeper Implications
What We’ve Learned About Reasoning
Logic has observable structure. It’s not an emergent mystery—it creates measurable geometric patterns. This suggests reasoning quality is not fundamentally different from other measurable properties like perplexity or entropy.
Structure transcends content. The same logical form produces similar geometry across topics. This hints at something profound: perhaps logical structure is a fundamental feature of representation spaces, like how physical laws are invariant across reference frames.
Quality and correctness diverge. You can reach correct conclusions through poor reasoning (lucky) or wrong conclusions through sound reasoning (incorrect premises). Geometry measures the process, not just the outcome.
What This Enables
Model procurement: Base decisions on reasoning quality, not just benchmark accuracy. A model that reasons smoothly at lower cost may beat a more expensive model that reasons erratically to correct answers.
Quality assurance: Monitor reasoning geometry in production. Detect degradation before it causes failures. Build dashboards tracking curvature over time, not just error rates.
Prompt engineering: Optimize prompts for geometric smoothness. “Think step by step” might improve accuracy—does it improve reasoning quality? Now you can measure.
Debugging at the geometric level: When a model fails, curvature spikes show you where reasoning broke. Not “which step was wrong” but “where did logic become inconsistent.”
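To make the quality-assurance idea concrete, here is a hedged sketch of a production monitor: it tracks a rolling window of per-chain mean curvatures and alerts when the recent average drifts well above a historical baseline. The window size, baseline, and drift factor are illustrative placeholders.
from collections import deque
from statistics import mean

class CurvatureMonitor:
    # Track per-chain mean curvature and flag drift against a baseline
    def __init__(self, baseline_kappa, window=30, drift_factor=1.5):
        self.baseline = baseline_kappa          # e.g. mean curvature on a validation set
        self.recent = deque(maxlen=window)      # rolling per-chain mean curvatures
        self.drift_factor = drift_factor        # how much drift triggers an alert

    def observe(self, chain_mean_curvature):
        self.recent.append(chain_mean_curvature)
        if len(self.recent) < self.recent.maxlen:
            return None                         # wait until the window is full
        current = mean(self.recent)
        if current > self.drift_factor * self.baseline:
            return f"ALERT: rolling mean curvature {current:.2f} vs baseline {self.baseline:.2f}"
        return None

# Toy stream: reasoning quality degrades partway through
stream = [0.29, 0.31, 0.28] * 20 + [0.55, 0.60, 0.58] * 20
monitor = CurvatureMonitor(baseline_kappa=0.30)
for kappa in stream:
    alert = monitor.observe(kappa)
    if alert:
        print(alert)
        break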
The Research Frontier
We’ve validated that embeddings preserve ~68% of structure. Questions remain:
Can we cross the threshold? Larger embeddings (3072-dim) might reach r > 0.7 (strong correlation).
Can we predict fidelity? Build classifiers that predict which chains will preserve geometry well, giving confidence bounds.
Can we learn projections? Train adapters that map embedding space to hidden state space, achieving perfect alignment.
What are the logic signatures? Pre-compute curvature patterns for modus ponens, syllogisms, proof by contradiction. Match observed geometry to known logical forms.
IX. Conclusion: From Insight to Impact
The core insight is simple: reasoning creates paths, and paths have geometry.
The implications are profound: we can measure reasoning quality independent of correctness, across any LLM, using only text output.
The validation is solid: r = 0.684 preservation from hidden states, sufficient for comparative analysis and quality monitoring, approaching theoretical limits.
The implementation is elegant: universal adapters, geometric measurement, production-ready library.
This is not a complete solution. It’s a new tool, with known limitations and open questions. But it’s a tool that works, grounded in theory and validated empirically.
For the first time, we can ask not just “Is the AI right?” but “Did it reason well?” That changes how we build AI systems.
The code is open source. The method is universal. The invitation is to explore what becomes possible when you can measure the geometry of thought.
References
Foundation: Zhou, Y., Wang, Y., Yin, X., Zhou, S., & Zhang, A. R. (2025). The Geometry of Reasoning: Flowing Logics in Representation Space. arXiv:2510.09782. [Discovery: Logic manifests as geometry, curvature correlates 0.53 with quality]
Theoretical Grounding:
- Qi, H., et al. (2024). Feature Space Universality. arXiv:2410.06981. [0.70 SVCCA alignment, 10-30% universal features]
- Kew, T., et al. (2025). Multilingual Representations. arXiv:2501.06346. [50% concept overlap across 23 languages]
- Radford, A., et al. (2021). CLIP. ICML. [Cross-modal alignment precedent]
Validation Dataset:
- Cobbe, K., et al. (2021). GSM8K. arXiv:2110.14168. [Math reasoning benchmark]
Additional Reading:
- Drexler, E. Analysis of latent space convergence (AI Prospects, Substack). https://aiprospects.substack.com/p/llms-and-beyond-all-roads-lead-to [Cited in the introduction]
Library:
https://github.com/zartis/reasoning (MIT license)
Contact:
ai@zartis.com
Version: 0.0.1 | October 2025
Written by Adrian Sanchez de la Sierra, Head of AI Consulting at Zartis.