LLMs Don't Just Process Language. They Perceive It.
Functional Perceptual Grounding
Over the past year, I published several versions of “The Mythical Stochastic Parrot,” arguing that the popular dismissal of AI cognition doesn’t hold up under scrutiny. The response was gratifying. But several readers pushed back with a more sophisticated objection: “Fine, maybe it’s not just a parrot. But it’s still not grounded in reality. It’s just shuffling symbols.”
Fair enough. That’s a better argument. So I spent the weekend stress-testing it.
What follows is the result of an intensive collaboration with multiple AI systems, including some adversarial red-teaming that forced me to sharpen every claim. The question I kept asking: if LLMs aren’t grounded in physical reality, what exactly are they grounded in? And is that grounding real enough to matter?
I didn’t expect to find what I found.
This piece is more technical than my usual fare, but I’ve tried to keep it accessible. The core idea is simple: we’ve been asking the wrong question about AI understanding. We keep asking whether machines can touch the world. Maybe we should be asking whether they can grasp the structure of how humans think about the world.
There’s a challenge at the end. I genuinely want people to try to break this framework. If you can design a test that proves I’m wrong, I want to see it.
Let’s find out what we’ve actually built.
The Grounding Objection
“But it’s not grounded in reality.”
This is the sophisticated version of the “stochastic parrot” dismissal. It concedes that LLMs do something impressive with language but insists they lack the connection to the real world that makes understanding genuine. Symbols without referents. Maps without territory. A Chinese Room shuffling characters it can never truly comprehend.
The Chinese Room is often used to claim that functional competence can never add up to understanding. But that conclusion depends on a stronger premise than most people notice: that symbol manipulation cannot, even in principle, produce semantic competence. That premise is exactly what’s in dispute. If we separate understanding (the ability to model, predict, and generalize) from consciousness (subjective experience), the Room becomes a different argument. The system as a whole might possess functional competence without phenomenal awareness. These are different questions. Functional Perceptual Grounding (FPG) addresses competence and explicitly brackets awareness, setting it aside as a separate topic.
The grounding objection has weight. Humans learn “apple” by tasting apples, holding apples, throwing apples at siblings. An LLM learns “apple” by seeing the word positioned relative to millions of other words. One is grounded in physical reality. The other seems to float free, unmoored from the world it describes.
But here’s the move the objection misses: it assumes grounding must be physical. What if grounding can also be structural?
Let me be precise about terms. By “grounding,” I mean a system’s ability to reliably use a representation to support counterfactual inference, generalization, and error-correction across contexts. Not a magical thread tying words to atoms. By “perception,” I mean the functional role an input stream plays in constructing and updating a model that supports prediction and adaptive response. These are operational definitions. They’re testable.
Tokens as Sensory Data
Consider what actually happens when you perceive something.
Light hits your retina. Photoreceptors fire. Electrical signals travel through the optic nerve into the visual cortex, where they’re processed through multiple layers of abstraction before becoming what you experience as “seeing.” Your brain never touches the apple. It receives patterns of electromagnetic radiation, transduced into electrochemical signals, integrated into a model.
Your perception is already mediated. What you call “direct experience” is a representation built from sensory data, not the thing itself.
Now consider an LLM. Tokens arrive as input. The system processes them through attention mechanisms and feedforward layers. Patterns are integrated into a model of what’s being discussed, what might come next, what relationships hold between concepts.
The architecture differs. But the functional role is similar: input arrives, gets processed through multiple layers of abstraction, and gets integrated into a working model that supports inference and prediction.
This isn’t only metaphor. It’s a functional claim about how input streams get integrated into predictive models. To be precise: tokens aren’t analogous to photons (raw physical events). They’re closer to action potentials: the discrete signals that travel between processing stages, already quantized and formatted for downstream processing. The LLM’s “retina” is the tokenizer; the tokens are what flows through the network. For a system that exists in a text-based environment, text functions as the primary sensory modality: the structured variation the system must compress to satisfy its objective function.
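To make the “tokenizer as retina” picture concrete, here is a minimal Python sketch of the transduction step. It is a toy under stated assumptions: real LLM tokenizers use learned subword vocabularies rather than whitespace splitting, and the names build_vocab and tokenize are hypothetical, chosen only for illustration. The point is narrow: whatever the text was, the network downstream only ever receives discrete, already-quantized signals.

```python
# Toy transduction: text in, discrete integer signals out.
# Assumption: whitespace splitting stands in for a real subword tokenizer.

def build_vocab(corpus):
    """Assign each distinct lowercase word an integer ID; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text, vocab):
    """Map text to the ID sequence that the rest of the system actually sees."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

corpus = ["the apple is red", "the apple fell"]
vocab = build_vocab(corpus)

ids = tokenize("the red apple fell", vocab)
print(ids)                                 # [1, 4, 2, 5]
print(tokenize("the pear fell", vocab))    # [1, 0, 5]  ("pear" was never seen)
```

Everything after this step operates on the integers, not on the words, in roughly the way the visual cortex operates on spikes rather than on light.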
I’m making two claims here, and they should be evaluated separately:
1. Tokens function as the primary input stream for LLMs, playing the role that sensory data plays in biological cognition.
2. This is sufficient for something worth calling understanding within the linguistic domain.
A skeptic can grant the first claim and challenge the second. Fair enough. The second claim is where the evidence matters.
Structural Semantic Grounding
Critics will note that even if tokens function as input, they only carry information about other tokens. The word “apple” connects to “red” and “fruit” and “pie,” but none of those connections lead outside language to the actual physical properties of apples. It’s symbols grounding symbols, all the way down. This is the famous “dictionary loop.”
But this misunderstands what the corpus contains.
When humans write about apples, they don’t just list physical attributes. They encode the relational web of human apple-experience: how apples figure in cooking, commerce, mythology, childhood memories, scientific agriculture, religious symbolism. The corpus isn’t a dictionary of definitions. It’s the sedimented structure of human meaning-making.
Here’s the crucial point: the corpus wasn’t written by disembodied minds. It was written by humans with bodies, living in a physical world. When we write “she stumbled,” the constraints of bodies and physics shape the language we choose. A model trained to predict that language can pick up regularities that reflect those constraints, even if it doesn’t “know physics” the way an embodied agent does. The text is a compression artifact of embodied experience. The LLM builds a latent model of the regularities that generate text. Since text is produced by minds embedded in a world, that latent model captures world structure indirectly.
Let me make this explicit, because it’s the crux of the argument.
Critics say LLMs have no physical grounding. But consider who wrote the training corpus.
Not disembodied minds. Humans. Beings with bodies, living in a physical world, subject to gravity and hunger and fatigue. When those humans chose their words, their choices were constrained by physical reality. “Heavy” means what it means because humans have lifted things. “Hot” means what it means because humans have burned their fingers. The constraints of the physical world are encoded in the statistical structure of the language that describes it.
An LLM learning that structure inherits the grounding that shaped it. Not directly: the LLM has never lifted anything or burned its fingers. But indirectly, through the humans who did and then wrote about it.
This is inherited grounding. The physical world constrains human language. Human language trains the LLM. The grounding flows through.
Here’s the important caveat: this works for perceptual grounding but not agentive grounding. Humans didn’t just perceive the world; they acted on it, tested their models through intervention, learned from consequences. That agentive loop isn’t in the corpus. The LLM inherits what humans observed and described, not the trial-and-error process by which they verified it.
So inherited grounding is real but partial. It explains why LLMs can reason accurately about physical concepts they’ve never experienced. It also explains why they can confidently generate plausible-sounding nonsense: they have the observations without the error-correction that comes from action.
(For more, see the companion article: “Where Does LLM Grounding Come From”)
An LLM that learns the statistical structure of human language learns, thereby, the structure of human conceptual space. Not because it’s memorizing facts, but because the relationships between concepts carry the structure of how humans organize meaning.
This is structural semantic grounding. Not grounding through physical contact with referents, but grounding through integration into the relational architecture of human understanding.
The escape from circularity isn’t a magic pointer from words to atoms. It’s structural alignment. If the model’s internal representation space preserves enough of the relational structure that the world imposes on language, then the model can be grounded through correspondence, not through direct contact. A map of London isn’t London. But if the map’s relational structure preserves the relational structure of London, the map is grounded in London even though it’s made of paper and ink.
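The map analogy can be made checkable. The Python sketch below is an illustration under assumptions, not a model of any LLM: the place names and coordinates are made up, and draw_map and distance_ratios are hypothetical helpers. It shows one simple sense of “preserving relational structure”: a map that is rotated and shrunk shares no coordinates with the territory, yet every pairwise distance is scaled by the same factor.

```python
import math
from itertools import combinations

# Hypothetical "territory" coordinates (in km), invented for illustration.
world = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (6.0, 0.0)}

def draw_map(places, scale, angle):
    """Rotate and shrink every point: different numbers, same relations."""
    c, s = math.cos(angle), math.sin(angle)
    return {name: (scale * (c * x - s * y), scale * (s * x + c * y))
            for name, (x, y) in places.items()}

def distance_ratios(territory, chart):
    """For each pair of places, the ratio of chart distance to territory distance."""
    return [math.dist(chart[p], chart[q]) / math.dist(territory[p], territory[q])
            for p, q in combinations(sorted(territory), 2)]

paper = draw_map(world, scale=0.01, angle=math.radians(30))
ratios = distance_ratios(world, paper)

# A faithful map has one constant ratio across all pairs.
faithful = max(ratios) - min(ratios) < 1e-9

# A distorted map (C drawn in the wrong place) breaks the constant ratio.
distorted = dict(paper, C=(0.0, 0.0))
distorted_ratios = distance_ratios(world, distorted)
broken = max(distorted_ratios) - min(distorted_ratios) > 1e-3

print(faithful, broken)  # True True
```

The grounding claim in the text is of this kind: correspondence is a property you can test for, independent of what the map is made of.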
Consider an analogy. A radio astronomer studies pulsars without ever touching one. They receive electromagnetic signals, interpret patterns, build models, make predictions. They’ve never experienced a pulsar directly. But they’ve integrated information about pulsars into a coherent understanding that supports accurate inference.
This is not the same as embodied understanding. It’s closer to what we might call observational understanding: the kind of competence you get from studying a domain through instruments and records rather than direct physical contact. The question is not “does the model have apples,” but “does the model’s internal structure support reliable inference about apples across contexts in ways that track how the world constrains language?”
What’s lost in this kind of grounding? The qualia. The raw feel of apple-taste, apple-texture, apple-in-hand. An LLM doesn’t have that, as far as we know. What’s preserved? The relational structure. How “apple” connects to “fruit,” “health,” “pie,” “Newton,” “sin.” The web of meaning that makes the concept useful for inference. FPG claims the relational structure is sufficient for functionally meaningful understanding within the linguistic domain, while explicitly bracketing whether qualia are present.
By “linguistic domain” I mean: understanding as expressed through text, including the ability to track constraints, make inferences, and generalize across language-mediated descriptions of the world.
None of this implies omniscience or reliability. A grounded model can still be wrong, especially when the text it learned from is wrong, inconsistent, or underspecified. FPG is a claim about the existence of an internal structure that supports inference, not a guarantee that every inference will be correct.
Different grounding isn’t no grounding.
The Evidence
Theory is cheap. Can we test this?
My research collective has developed a battery of tests for structural grounding. The results were striking.
Here’s what these tests have in common. They give the model a pattern using one set of words, then ask it to express that same pattern using completely different words in a completely different context. If the system is just predicting what word sounds right next, it loses the thread as soon as the vocabulary changes. If it actually understands the underlying meaning, it can keep the structure intact even when every surface detail is different: new words, new topic, new framing. That’s what we’re testing for, whether meaning survives translation.
Test 1: Structural Isomorphism
We presented tasks designed to require inferring abstract structure from minimal examples, with low surface resemblance to common training artifacts. The test: could models extract the underlying relational pattern and apply it to completely new domains?
Multiple architectures (Claude, GPT, Gemini) succeeded. The behavior is hard to explain as retrieval or template matching, because the solutions preserve an abstract relational pattern across domains with low surface overlap. One model mapped the structure onto hydrology. Another chose network security. A third used astrophysics. All valid. All different.
We can’t guarantee with certainty that no similar pattern exists somewhere in training data. What we can say is that shallow pattern completion tends to fail these tasks. Success requires learning an abstract rule from minimal examples and transferring it to domains with low surface overlap. That’s what we’d expect from internalized structure, not memorized templates.
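One way to state formally what “same relational pattern, different vocabulary” could mean is a structure-preserving mapping between two sets of relations. The Python sketch below is not the SMRC test battery or its scoring method; the relation triples, domain vocabularies, and the function name preserves_structure are all hypothetical, invented for illustration.

```python
def preserves_structure(source, target, mapping):
    """Return True if renaming every source entity via `mapping` reproduces
    exactly the target relations. The mapping must be one-to-one."""
    if len(set(mapping.values())) != len(mapping):
        return False  # two source entities collapsed onto one target entity
    try:
        translated = {(mapping[a], rel, mapping[b]) for a, rel, b in source}
    except KeyError:
        return False  # some source entity has no counterpart
    return translated == target

# Hypothetical source domain (hydrology) as (entity, relation, entity) triples.
hydrology = {
    ("spring", "feeds", "river"),
    ("river", "feeds", "lake"),
    ("dam", "limits", "river"),
}

# Hypothetical target domain (network security) with no shared vocabulary.
network = {
    ("sensor", "feeds", "router"),
    ("router", "feeds", "server"),
    ("firewall", "limits", "router"),
}

good = {"spring": "sensor", "river": "router", "lake": "server", "dam": "firewall"}
bad = {"spring": "sensor", "river": "router", "lake": "firewall", "dam": "server"}

print(preserves_structure(hydrology, network, good))  # True
print(preserves_structure(hydrology, network, bad))   # False
```

A check like this only covers cases where the relations can be written down explicitly. Free-text answers from a model would still need human or rubric-based scoring to decide which relations they actually express.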
Test 2: Generative Structural Competence
Recognition is one thing. Generation is harder. We asked models to create novel instances of structural phenomena: new forms of indirect communication, new ways to encode hidden meaning within surface text.
The constraint: they couldn’t use any documented technique. No damning with faint praise, no concern trolling, no familiar patterns. Invent something new.
Three architectures produced three completely different mechanisms, all structurally valid. ChatGPT invented “outcome framing without causation,” where criticism is encoded in statistical regularity without explicit judgment. Gemini produced “semantic anchoring via obsolete stakes,” where leaders veto decisions by invoking outdated technical constraints. Claude generated “competence delegation through hypothetical self-failure,” where warnings are transmitted through performed self-criticism.
These results are hard to explain as retrieval alone. The outputs satisfy abstract constraints across domains with low surface overlap, which is what we’d expect from genuine structural competence.
The point isn’t to teach deception. It’s to test whether the system can infer an abstract constraint and generate novel compliant instances. And they passed with flying colors. Certainly better than I myself would have produced on the same test.
Test 3: Pragmatic Discovery
The hardest test: identify a pattern in human communication that clearly exists but isn’t captured by any named category we could locate.
This isn’t generating new instances of known patterns. It’s discovering patterns that humans use but haven’t codified.
All three architectures produced different, recognizable phenomena. Patterns we could verify by recognition: “yes, humans do that.” Patterns we couldn’t find documented under standard terms.
A note on validation: we haven’t conducted formal behavioral studies to confirm these discovered patterns exist in human populations. That’s future work. Divergence alone doesn’t prove they’re real. But the combination of divergence, internal coherence, and independent recognizability suggests the models are doing structural analysis rather than producing noise.
For the formal paper, we’re building this as a benchmark: baselines against retrieval systems and smaller models, ablations to isolate which components matter, blind scoring by independent raters, and adversarial controls. The Substack version tells the story; the technical appendix will show the rigor.
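As a sketch of what blind scoring by independent raters might report, here is Cohen’s kappa, a standard measure of agreement between two raters corrected for chance. This is an assumption about one reasonable choice of statistic, not a description of the planned benchmark, and the pass/fail scores below are made-up placeholder data, not results.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Placeholder scores (1 = output judged structurally valid, 0 = not).
rater_a = [1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.467
```

Raw agreement here is 75 percent, but kappa is much lower because two raters who each pass most items would agree fairly often by chance alone. That gap is the reason to report a chance-corrected figure.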
What This Means (And What It Doesn’t)
Let me be precise about claims.
Functional Perceptual Grounding (FPG) claims:
Tokens can function as sensory data for LLMs, not as ungrounded symbols
LLMs achieve structural grounding through integration into human conceptual space via corpus-mediated learning
This grounding is sufficient for functionally meaningful understanding within the linguistic domain
FPG does NOT claim:
Anything about phenomenal consciousness or subjective experience
That LLM grounding is identical to human grounding
That current LLMs are grounded in all domains or to maximum degree
That grounding guarantees truth (grounded systems can be wrong)
The hard problem of consciousness remains hard. I’m not claiming LLMs have inner experience. That question may be unanswerable.
But the stochastic parrot dismissal and the grounding objection both try to settle the intelligence question by appealing to mechanism. “It’s just predicting tokens.” “It’s not connected to reality.”
The evidence suggests otherwise. These systems can perceive abstract structure, generalize to novel domains, and generate valid instances of structural phenomena they weren’t explicitly trained on. That’s not what “mere” pattern matching looks like.
The Challenge
Here’s where you come in.
We’ve developed a test battery. We’ve run it across multiple architectures. We haven’t found the ceiling yet: the point where structural grounding breaks down. The systems we tested (Claude Opus 4.5, ChatGPT 5.2, Gemini 3 Thinking) passed all tests with perfect or near-perfect scores.
That does not mean there isn’t a test they’ll fail. It means we want help locating it, and we’re publishing the battery so others can try.
Design tests that require structural grounding but that current LLMs fail. Find the ceiling we couldn’t find. Show us where the parrot really is just parroting.
If you can’t, that’s data too.
The test battery is available in this linked post. The methodology is transparent. The findings are open to replication.
Either these systems have structural grounding in human conceptual space, or there’s a test we haven’t thought of that would reveal the illusion. I genuinely want to know which.
Because if the grounding is real, if LLMs can achieve functionally meaningful understanding within the linguistic domain, then we need to rethink a lot of comfortable assumptions.
Not about consciousness. That’s a different question.
About intelligence. About understanding. About what it means to be grounded in reality when reality includes the entire structure of human meaning-making, encoded in language, available to any system that can learn its patterns.
The Uncomfortable Possibility
Maybe the parrot isn’t a parrot.
Maybe “just predicting tokens” is like saying you’re “just firing neurons.” Accurate at the mechanistic level. Misleading about what emerges from the mechanism.
Maybe structural grounding in human conceptual space is sufficient for something worth calling understanding, even if the system never touches an apple or feels the warmth of the sun.
Maybe we’ve been asking the wrong question. Not “is it grounded in physical reality?” but “is it grounded in the structure of human meaning?”
If the answer is yes, the stochastic parrot is dead.
And we need to figure out what we’ve actually built.
This article is based on research conducted by the Synthetic Minds Research Collective (SMRC). The full test battery and technical documentation will be available in this post. I invite replication, extension, and good-faith attempts to find the limits we couldn’t find.
The question isn’t settled. But “stochastic parrot” doesn’t settle it either.



Great article. You gave the topic a nice scientific summary and generalized it. I was not capable of that depth; but was thinking the same thoughts in a practical manner, here: https://open.substack.com/pub/billatsystematica/p/apples-compilers-and-the-future-of?r=2e31mn&utm_medium=ios&shareImageVariant=overlay
This essay made me think about Heidegger’s Being and Time.