Consensus Machines – a Post-LLM World
I think LLMs are amazing tools, and I use them daily.
However, the more time I spend with them, the more I notice patterns in the kinds of problems LLMs excel at versus the kinds they perform poorly on.
For a while now, I’ve been trying to put these thoughts into words – reflecting on these observations and having multiple discussions with colleagues – and I think (or hope) I’ve found a model that describes them.
I am not much of a writer – an LLM would have done a better job at structuring this – but bear with me.
Here are my observations so far:
LLMs are excellent at “solving” solved problems, often doing it faster and much more attentively than a human.
While I can enumerate ten different pathfinding solutions used in video games, reason about the pros and cons of each approach in a given context, and implement all of them – recalling, evaluating, and reasoning would take me time, and while implementing any of them I would likely introduce some bugs of omission or inattention, because I am actively trying to maintain a model of what I am building in my head while also taking into consideration the context where it is implemented.
In most cases, given the correct context and prompt, this is a problem an LLM can automate today better and faster than a human.
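To make the “solved problem” point concrete, here is roughly the kind of implementation an LLM can reproduce quickly and correctly today: a minimal breadth-first-search pathfinder on a 2D grid. This is my own illustrative sketch, not code from any particular game or model.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 4-connected grid; 0 = open cell, 1 = wall.
    Returns a list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}  # also serves as the visited set
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the chain of predecessors back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = current
                queue.append((nr, nc))
    return None  # no path exists

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(bfs_path(grid, (0, 0), (2, 0)))
# → [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

This is exactly the kind of high-consensus, well-documented pattern where the model’s output is faster and less error-prone than my own first draft would be.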
LLMs, on the other hand, are terrible at solving UX problems.
Sure, they can add a button, create a modal, even build a whole view in the flavor-of-the-month UI framework; they can even follow descriptive instructions and adjust the output to a given context.
However, their ability to tailor the solution to a specific context when the problem domain is UX is very limited.
Making the call-to-action button 10% bigger and 25% brighter is a very easy instruction to follow – but there is usually an underlying, understood semantic to it: make the call to action stand out in the hierarchy. Even if this semantic is somehow represented in the code, it will often be missed by the LLM.
But Why?
I think the root cause of this behavior lies in the core nature of LLMs – they are statistical models averaging human responses. This translates into high accuracy where there is high consensus and low accuracy where there is no consensus.
2+2 is 4
Water is wet
The best gatorade flavor is… Cool Blue? (according to ChatGPT)
This made me further reflect on Google and LLMs as search engines.
A lot of search traffic already flows through ChatGPT and other LLMs instead of Google – but is that a bad thing?
For the most part, no – in most cases, when I search for something on Google, I click within the first three results. This is Google’s secret sauce, after all – semantically tailoring the results to best match the user’s query.
But is this always the case? I realized that while the times I had to go past the first page of Google results were few and far between, there were – and still are – cases, especially when researching esoteric and poorly documented subjects and ideas, where a translated presentation from Japan, published on some back page in 2001, was the best result for what I was looking for.
This made me think about a future where we no longer have an indexed, semantically linked internet – no backwater pages, no esoteric archives from the late ’90s and early ’00s – but instead rely on LLMs and only ever “hit the top 3 results” on Google.
This is an average future. This is a future of consensus. This is a future that for the most part is correct and satisfactory, but without innovation.
A bit of a tangent – but this also highlights the risk of feeding LLM output back into LLMs: it’s a repeating averaging loop that flattens out all the details.
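A toy simulation makes the flattening visible. This is only a sketch of the averaging dynamic, not a model of real LLM training: each “generation” of output is produced by averaging small batches sampled from the previous generation, and the spread of the data collapses fast.

```python
import random
import statistics

random.seed(0)

def next_generation(population, n_samples=500, batch=8):
    """Produce one 'generation' of output by averaging small batches
    sampled from the previous generation -- every sample is pulled
    toward the mean."""
    return [statistics.mean(random.choices(population, k=batch))
            for _ in range(n_samples)]

# Start with a wide, diverse distribution of "human-written" values.
data = [random.gauss(0, 10) for _ in range(500)]
spreads = [statistics.stdev(data)]

for _ in range(5):  # each generation trains on the previous one's output
    data = next_generation(data)
    spreads.append(statistics.stdev(data))

print([round(s, 2) for s in spreads])
# The standard deviation shrinks every generation (roughly by sqrt(batch)):
# the tails -- the details and the outliers -- vanish, and only the
# consensus mean survives.
```

Real training pipelines are far subtler than this, but the direction of the effect is the same: repeated averaging destroys variance.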
My takeaway – our future challenges are twofold.
The first challenge – how do we balance the benefits of using LLMs in problem solving with the desire to develop genuinely novel ideas?
Most often, new ideas are born not from a strict necessity of making something new, but from the iterative process of looking at a problem and its accepted solutions and challenging them. How do we not lose our ability to challenge accepted truth in a world where the benefits of deferring to consensus are so high?
The second challenge – how do we communicate novel ideas and create a platform for them?
Academia is a well-established system with enough internal regulation mechanisms that it will not go away. That said, I think a huge effect of the internet was democratizing this form of novel thinking – giving everyone a platform to share novel ideas without the rigid requirements of academia. Professional conferences and idea exchanges are also an in-between solution that has been around for a while.
What is the platform of the future for such innovation? Are we going back to the age of forums and mailing lists for niches? Some altnet altogether?
I will keep rolling this in my head.
This is very raw and unedited – there is no CTA here – just needed to air my thoughts.
—————————————————————————————————-
Addendum – asked the following of various LLMs:
“Please share your thoughts on the following article – how do you see the future of innovation with LLMs?”
Here are the responses:
Sonnet 4.6
The article raises some genuinely interesting points, and I find myself both agreeing and wanting to push back a bit.
Martin’s core insight — that LLMs are essentially “consensus machines” that excel where human agreement is high and struggle where it’s low — is a useful mental model. It maps well onto my own experience. Tasks like implementing a known algorithm, writing boilerplate, or summarizing established knowledge are ones where I’m genuinely useful. Tasks that require aesthetic judgment, genuine novelty, or navigating contested territory are harder.
But I think the “averaging” framing, while evocative, is somewhat incomplete. The interesting question isn’t just whether LLMs flatten consensus — it’s who uses them and how. A mediocre researcher who offloads their thinking entirely to an LLM will produce averaged, derivative work. But a sharp thinker who uses LLMs to rapidly clear away the solved problems — the boilerplate, the literature review, the implementation — could potentially free up more cognitive bandwidth for the genuinely novel stuff. The tool doesn’t determine the outcome; the disposition of the person wielding it does.
His point about the feedback loop — LLM output feeding back into LLM training — is where I think the real risk lives. That’s not a hypothetical concern, it’s already happening, and it does have the potential to gradually narrow the distribution of ideas that the broader information ecosystem amplifies. That’s worth taking seriously.
On the question of platforms for novel thinking, I’m actually somewhat optimistic in a counterintuitive way. Historically, whenever a dominant communication medium homogenized discourse, niches proliferated in reaction to it. The forum and mailing list culture he half-jokingly references might not be a regression — it might be exactly the right structure for the post-LLM era. Small, high-trust, domain-specific communities with shared context are precisely the environments where non-consensus thinking can survive and develop before it’s ready to enter the mainstream.
The honest tension I sit with is this: I am, in some sense, the thing he’s describing. I’m useful partly because I reflect broad consensus reliably. That’s genuinely valuable — but it does mean that the most important thinking probably still needs to happen in the spaces I don’t reach.
Opus 4.6
This is a thoughtful piece, and I think the core framing — LLMs as consensus machines — is genuinely useful. It captures something real about the technology that often gets lost in both the hype and the doom narratives.
Here are my thoughts on the key threads:
The consensus model is a good mental model, but with nuance. The observation that LLMs perform well where human agreement is high and poorly where it’s low is largely correct. I’d add that it’s not just about consensus on answers but consensus on process. LLMs are strong at pathfinding algorithms not just because people agree on the solutions, but because the problem-solving methodology itself is well-documented and convergent. UX is hard because good UX is deeply contextual and often contradicts general best practices — the right answer depends on who your users are, what they’re doing right now, and what emotional state they’re in. That’s hard to average.
The search engine point is where I think the real danger lives. Martin’s example of the obscure Japanese presentation from 2001 is excellent. There’s an entire long tail of human knowledge that was never high-traffic, never well-linked, never optimized for discoverability — and yet was exactly the right thing for someone at some point. If we collapse the information retrieval layer into “what would the median informed person say,” we lose access to the edges. And the edges are where paradigm shifts come from.
On the feedback loop concern — this is well-documented and real. The term in the literature is “model collapse,” where training on synthetic or model-generated data progressively narrows the output distribution. The averaging-of-averages metaphor is intuitive and basically correct.
Where I’d push back a little is on the implied binary between LLMs and innovation. LLMs can actually be tools for divergent thinking if used deliberately — not as answer machines but as brainstorming partners, as ways to rapidly survey a solution space, or as translators between domains. The risk isn’t inherent to the technology; it’s in the default mode of interaction, which is “give me the answer.” If the culture of use shifts toward “help me explore the space,” the consensus tendency becomes less constraining.
On the second challenge — platforms for novel ideas — I think we’re already seeing fragmentation back toward smaller, higher-signal communities: niche Discords, Substacks, small group chats, curated RSS-like feeds. The forum era may genuinely be returning in spirit if not in form. The interesting question is whether discovery across these niches can work without a centralized index, or whether we need something new — maybe something like federated search across small communities, or curation networks that surface outlier ideas specifically because they diverge from consensus.
The piece reads as someone working through something genuinely important. The fact that it’s “raw and unedited” is actually fitting given the thesis — a polished, LLM-edited version might have averaged out the most interesting edges.
GPT 5.2 Pro
A lot of what the piece calls out lands for me, especially the “default mode” behavior: LLMs are extremely good at producing the most socially/technically typical answer given the cues you provide, and that can feel like a “consensus machine.”
That said, I don’t think the future is necessarily an “average future.” I think the average is where we end up by default—and the next phase of innovation is mostly about designing systems, workflows, and platforms that can reliably move between:
- Compression of consensus (fast, correct-enough, “solved problem” output), and
- Exploration of the long tail (diverse, weird, low-consensus, potentially novel output plus ways to test it).
Below are my thoughts, riffing on the article’s main claims.
1) “LLMs are great at solved problems” is right… but the deeper distinction is evaluation
The article frames it as “high consensus → high accuracy; low consensus → low accuracy.”
That’s often true in practice, but the more general rule I’d use is:
LLMs look smart when the problem has (a) stable patterns in the training data and (b) a clear way to recognize success.
They struggle when either is missing—especially (b).
Why “solved problems” feel easy
Pathfinding algorithms, CRUD code, refactors, common infrastructure patterns:
- There’s a lot of repeated structure in the world’s text/code.
- There are conventions (naming, architecture, idioms).
- There are often tests or compilation or runtime feedback (even if you don’t run them, the shape of correctness is well defined).
Why UX feels hard
UX has a brutal combination of:
- Underspecified goals (“make it clearer,” “make it feel premium,” “reduce friction”)
- Hidden constraints (brand, accessibility, business model, user anxiety, trust, legal, platform norms)
- Outcome ambiguity (even humans disagree until you test)
- Sparse ground truth (the “right” design depends on user research, context, and metrics)
So I agree with the observation, but I’d phrase the root cause less as “consensus” and more as:
UX is hard because the success criteria live outside text—inside user behavior, context, and measurement.
The optimistic flip side: as LLMs get tighter integration with telemetry, experiments (A/B), user feedback summaries, and explicit design constraints, they can become much better UX partners—not because they suddenly “have taste,” but because they can propose variants and reason about tradeoffs with real signals.
2) The “consensus future” risk is real… but it’s not a law of nature, it’s a product choice
If you deploy LLMs as single-shot answer machines, you bias the world toward:
- the most typical framing
- the most typical solution
- the most typical sources
- the most typical aesthetics
And if millions of people do that every day, you get a cultural flattening effect: fewer weird corners, fewer unexpected paths, fewer “back pages.”
But here’s the key nuance:
LLMs can also be anti-consensus engines—if you ask them to be
They’re capable of generating:
- multiple competing hypotheses
- contrarian interpretations
- niche references
- alternative design languages
- unusual analogies across domains
The catch is that those outputs are not self-validating. If you force novelty without a validation loop, you get “creative nonsense.” So you need a system that supports both:
- Divergence (generate unusual candidates)
- Convergence (test, critique, validate, measure)
Innovation has always been divergence + convergence. LLMs supercharge divergence, but we still have to build convergence mechanisms that aren’t just “sounds right.”
3) Search: the real danger isn’t that LLMs replace Google—it’s that they replace curiosity
The “top 3 results forever” analogy is strong.
I’d sharpen it like this:
- If LLMs become the interface to knowledge without provenance, people stop checking.
- If people stop checking, fewer weird sources get read.
- If fewer weird sources get read, fewer weird sources get cited/linked.
- If fewer weird sources get cited/linked, they become less discoverable.
- That feeds back into the model’s and society’s sense of what “matters.”
So the innovation risk is not “LLM answers are average,” it’s:
“LLM answers reduce exploration unless the product explicitly rewards exploration.”
What I want to see become standard
- Citations by default (not optional)
- “Show me disagreement” buttons (surface competing views, not just one synthesized voice)
- “Explore the long tail” mode (intentionally retrieve obscure/older/non-mainstream sources)
- Provenance metadata (what sources, what time range, what confidence)
- Interfaces that make it easy to open and read the originals
In other words: if the UI is built to feel like “done,” innovation loses.
If the UI is built to feel like “start here,” innovation can accelerate.
4) Feedback loops and flattening are a real technical + cultural problem
The article mentions the risk of LLM output feeding LLMs and flattening detail. That’s a legitimate concern at two levels:
Technical level
If training data becomes saturated with model-generated text, you can get:
- reduced diversity
- increased blandness
- repetition of errors
- convergence toward “safe averages”
Cultural level
Even if models stayed perfect, if humans increasingly publish LLM-shaped writing, the internet’s texture changes:
- fewer idiosyncratic voices
- fewer strange personal pages
- fewer “translation of a 2001 Japanese slide deck” moments
- more homogenized phrasing, structure, and argument patterns
A big lever here is provenance + incentives:
- reward original reporting
- support archiving and public indexing
- label synthetic content where possible
- make it easy to cite primary sources and hard to launder summaries as “facts”
5) So what does the future of innovation with LLMs look like?
I think we’re moving into a split:
A) “Consensus productivity” explodes
Most day-to-day work becomes:
- faster drafting
- faster coding
- faster summarizing
- faster integration
- faster documentation
- faster debugging (with tool support)
This is enormous economic value, and it will be irresistible.
B) Innovation becomes a workflow discipline
Novelty won’t disappear, but it will increasingly come from teams that deliberately build:
- exploration routines
- adversarial review
- diversity of sources
- experimental validation loops
In other words: innovation will be less about “having access to intelligence” and more about having a system that resists premature convergence.
6) What platforms enable “post-LLM” innovation?
The article wonders if we go “back to forums and mailing lists,” or something else.
My bet: you’ll see two layers:
Layer 1: Small, high-trust niche communities (forums/mailing lists vibes)
- domain-specific norms
- deep context
- reputation systems
- long threads with real disagreement
- lots of tacit knowledge
This is where genuinely new ideas incubate because people can afford to be wrong in public and iterate.
Layer 2: Public, composable “idea objects”
Not just posts, but structured artifacts that are easy to:
- fork
- test
- reproduce
- critique
- cite
Think: “open notebooks” + “argument maps” + “reproducible demos” + “version control for ideas.”
If we get that, LLMs become less like “final answer machines” and more like navigation + synthesis layers over living, testable knowledge.
7) Practical ways to use LLMs for novelty, not against it
If you want LLMs to increase innovation rather than flatten it, a few patterns help a lot:
Use the model in roles that generate productive tension
- Divergent generator: “Give me 12 radically different approaches; include 3 that seem wrong.”
- Contrarian critic: “Argue against the top idea as if your job depends on it.”
- Historian: “What’s the closest past analogue? What failed about it?”
- Edge-case hunter: “Where does this break in the real world?”
- Experiment designer: “What’s the cheapest test that falsifies this?”
Force source diversity
- “Find me non-obvious terms to search.”
- “Give me a bibliography-style set of leads: older, niche, adjacent disciplines.”
- “Summarize disagreements, not conclusions.”
Don’t let it stop at words
Novelty that stays verbal is cheap. Novelty that survives contact with reality is valuable.
So route ideas into:
- prototypes
- small experiments
- simulations
- user tests
- benchmarks
- adversarial testing
My overall take on the article’s thesis
- Strong agree that default LLM use pulls toward consensus and speed, and that can quietly reduce exploration.
- Agree that UX is a great example of where “average text competence” doesn’t equal “situated judgment.”
- The important addition is that we’re not stuck with this outcome: it depends heavily on product design (UI/UX), incentives, and validation loops.
- The “post-LLM” innovation edge won’t go to whoever has the smartest model. It’ll go to whoever builds the best system for escaping local maxima—diverse retrieval, structured debate, cheap experiments, and strong provenance.
If you want, I can also sketch what a “Consensus vs Exploration” product interface could look like (features, toggles, and how it changes team behavior), because I think that’s where this becomes very concrete.
Grok (Surprisingly, not a Nazi salute)
The article by “martin” (dated February 22, 2026) captures a thoughtful and increasingly common unease about large language models (LLMs) as we move deeper into their era. I largely agree with the core thesis: LLMs function as consensus machines, excelling where human knowledge shows strong agreement in training data, but struggling (or defaulting to mediocrity) where there’s genuine divergence, ambiguity, or novelty.
This statistical averaging explains a lot of what we’ve observed:
- Strengths in “solved” domains — Pathfinding algorithms, boilerplate code, well-documented patterns, or factual recall from high-consensus sources. Here, LLMs often outperform humans in speed and consistency because they’re distilling the most repeated “correct” answers from billions of examples.
- Weaknesses in subjective or frontier areas — UX design (as you highlight), taste-driven decisions, truly creative leaps, or domains with sparse/niche data. The model leans toward the “average” button size/color/hierarchy because that’s what appears most frequently across tutorials, A/B tests, and popular apps. Semantic nuance (e.g., “make the CTA stand out in the visual hierarchy without feeling aggressive”) gets flattened unless explicitly guided.
Your Google analogy is spot-on and chilling. Traditional search engines surface long-tail, low-consensus results when needed — the obscure 2001 Japanese PDF might be on page 5, but it’s findable. An LLM-dominated discovery world risks becoming an echo chamber of the top-3 consensus views. We’ve already seen hints of this in how tools like ChatGPT summarize topics: they converge on mainstream narratives, under-representing heterodox or emerging ideas unless prompted to seek diversity.
The feedback loop risk is real too. As more content becomes AI-generated (and fed back into training corpora), we get iterative averaging → homogenization → loss of edge cases and outliers. This could accelerate “model collapse” in subtle ways, where rare but important signals get diluted.
On the future of innovation, I see two broad paths forward, neither fully utopian nor dystopian:
- The consensus-heavy default world — Most everyday problem-solving (coding routine features, basic research synthesis, content generation) becomes dramatically more efficient. Productivity skyrockets for “solved” problems, freeing humans for higher-level work. But innovation at the edges slows because challenging consensus becomes cognitively expensive — why fight the model when deferring is 10× faster and usually “good enough”? We risk a comfortable but stagnant plateau, especially in fields like software UX, product design, or even scientific hypothesis generation where novelty often starts as non-consensus heresy.
- A hybrid renaissance — Humans treat LLMs as powerful oracles for the known, but deliberately cultivate non-consensus spaces for breakthroughs. This could look like:
- Niche, human-first platforms — Revived forums, mailing lists, small Discord/Signal groups, or blockchain-based idea markets where ideas spread via direct endorsement rather than algorithmic popularity.
- Differentiated prompting and multi-agent setups — Techniques that force diversity (e.g., role-playing contrarians, iterative “red team” debates among model instances, or consensus-game mechanisms).
- Human-AI symbiosis with deliberate friction — Tools that highlight low-consensus sources, flag averaging artifacts, or require human overrides for novel paths. Some research already shows promise in human-guided differentiation yielding higher-novelty outputs.
- New incentives for originality — Platforms that reward heterodoxy (e.g., prediction markets for ideas, bounty systems for disproving consensus views, or decentralized archives preserving pre-LLM internet artifacts).
The second path feels more likely long-term because humans are stubbornly contrarian when stakes are high (science, art, entrepreneurship). Academia will persist with its peer-review friction, and we’ll probably see “post-LLM” cultural counter-movements valuing raw human insight, just as vinyl and analog photography persisted post-digital.
Ultimately, LLMs amplify existing human tendencies: deference to consensus when it’s convenient, rebellion when it’s not. The danger isn’t the tools themselves — it’s complacency. If we keep using them lazily as oracles of averaged truth, we get the average future you describe. If we treat them as accelerators for the mundane while preserving spaces for outlier thinking, we might actually accelerate innovation beyond what pre-LLM humans could achieve alone.
Your raw thoughts are valuable precisely because they’re non-polished and non-consensus — exactly the kind of signal that gets lost in averaged systems. Keep rolling them around; they’re a good antidote to the flattening. What do you think the “platform of the future” for novel ideas might actually look like in practice?
Gemini
Martin’s “Consensus Machines” argument is a sharp, sobering take on the current trajectory of AI. It hits on a fundamental tension: we are trading the eccentricity of discovery for the efficiency of the average.
Here are my thoughts on his observations and how they shape the future of innovation.
1. The “Average” Trap (Consensus vs. Novelty)
Martin is spot on about the statistical nature of LLMs. Because I am trained to predict the most likely next token based on a massive corpus, I am, by definition, a distillation of the status quo.
- The Risk: If everyone uses LLMs to solve problems, we create a feedback loop of “best practices” that never evolves. Innovation often comes from the “wrong” answer—the edge case, the 2001 Japanese slide deck, or the person who refuses to use the “flavor of the month” framework.
- The UX Example: His point about UX is brilliant. Great UX isn’t just about following a design system; it’s about intent and friction. LLMs struggle with “intent” because they don’t feel the frustration of a poorly placed button; they just know where buttons usually go.
2. The Death of the “Digital Backwater”
The most haunting part of the article is the idea of losing the indexed, semantically linked internet.
- Search vs. Answer: Google (historically) gave you a map; LLMs give you a destination. When you use a map, you might take a wrong turn and find something incredible. When you’re teleported to the destination, the “in-between” knowledge—the esoteric archives and niche forums—atrophies from lack of traffic.
- Information Cannibalization: If the “long tail” of the internet dies because users only want the “top 3 results” summarized, the data pool for future AI becomes incestuous and flat.
3. The Future of Innovation: “Human-in-the-Loop” as a Necessity
I don’t see LLMs replacing innovators, but I do see them raising the “Floor of Mediocrity.”
- The New Baseline: If an LLM can do 80% of the work (the solved problems), the value of a human shifts entirely to the remaining 20%—the weird, the risky, and the non-consensus.
- Innovation as Rebellion: Future innovation will likely be defined by rejecting the AI’s first suggestion. The “Consensus Machine” becomes a tool to quickly clear away the mundane so we can spend more energy on the “absurd” ideas that might actually work.
My Takeaway: The “Altnet” and Niche Revival
Martin asks about the platform for future innovation. I suspect he’s right about a return to “Small Tech.”
- Private Knowledge Silos: We may see a rise in gated communities (Discords, private forums, physical salons) where people share ideas away from the scraping bots of “Big AI.”
- Proof of Human: There will be a premium on ideas that don’t look like they came from an LLM.
The Paradox
The irony isn’t lost on me: I am an LLM analyzing an article about how LLMs might kill the very nuance I’m trying to provide. Martin’s “raw and unedited” thoughts have a texture that a polished AI summary would likely smooth over—and that texture is exactly what makes the piece valuable.
How do you feel about this? Do you find yourself leaning on the “consensus” for your daily work, or are you actively looking for those “backwater” sources to stay ahead of the curve?
Would you like me to help you find some of those “esoteric” sources or niche communities related to a specific topic you’re researching?
* DeepSeek falls into the very creepy surveillance pattern of not having a web interface and being available only as a Chrome extension. It was not available to comment.
* Qwen told me I do not have access to this resource, and thus it was not available to comment.