How Verifier Loops Made AI Coding Useful | The Rise of Vibeware, Abandonware and Technical Debt as Consequence

How did unintelligent, hallucinating AI progress from nearly useless to writing entire, complex applications and at what cost?

Mar 22, 2026

What’s Changed In AI Coding?

After three years of extreme hype and literally lying about AI’s capability for writing code, it would be hard to fault anyone for thinking there is still nothing useful. And make no mistake, it is still being massively overhyped today; however, underneath all of the rhetoric is a capability that has actually improved.

AI, in some aspects, continues to prove both the proponents and critics wrong about its true capabilities. Many of the things that were faked a year ago are beginning to work, and some things they are showing you today are still being faked.

A Year Ago It Was Still Useless

This time last year, my own personal assessment for AI as a tool to productively write code was that it was almost entirely useless, at least for the personal tasks I attempted. There were just too many hallucinations causing errors, and it would continually break things you had gotten working.

What’s Different Now? Verifier Loops

I’ve spent the last two months using some of the latest tools and models for coding to update my opinion. The models are better, but the instrumental change in capability didn’t come from only model improvements, but from the use of coding harnesses as a way to solve many of the problems inherit to LLM code generation.

It is a pattern reminiscent of Google’s Alpha models in that using a coding harness allows the LLMs to run in a loop with a verifier. This pattern can only work for formal domains, but it is effective here. Since LLMs have no way of perceiving correct or incorrect answers, an external system must indicate if the answer is correct.

Code benefits from this pattern, as we have compilers, tests, structural analysis, etc., which allow us to programmatically determine correctness. The models remain unintelligent and still hallucinate, but the harness lets the LLM simply keep trying until it creates a result that passes.

How Significant Are Coding Harnesses?

The models are now being specifically trained for this type of workflow which allows them to more reliably call tools by responding with structured text. For tasks that can be structured this way with verifiers, LLMs can provide useful utility.

How much are coding harnesses and verifier-loop methods contributing to the capability of LLMs to write code? Their contribution substantially changes the results. A recent paper, “EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages,” evaluated LLMs on programming tasks for which the models have little training data supporting the chosen language.

models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier.

This is a total collapse of capability outside the training distribution. However, the paper’s authors did not use coding harnesses in the evaluation done within the paper. The author’s wrote this afterward.

“After the paper was finalized, we ran agentic systems [claude code] that mimic how humans would learn to solve problems in esoteric languages.
We supplied our agents with a custom harness + tools on the same benchmark.
They absolutely crushed the benchmark.” — Lossfunk

So this is the answer to what has changed in AI coding. Coding harnesses using a verifier-loop pattern are able to substantially increase the useful output of LLMs. This pattern is the only way to use LLMs to mitigate the hallucinations and lack of true reasoning.

However, not everything is possible to verify. The soundness of the overall application architecture and design isn’t something that can easily be distilled into a set of testable conditions. Therefore, our ability to create applications that actually work has substantially advanced forward, but be warned: the fine print is extensive. Many landmines remain for its use, which we will cover further down.

How Fast Is It Changing?

The way development is being done is rapidly shifting, or at least it appears so. We have analytics from GitHub that signal an accelerating increase in AI models writing code.

Note that this doesn’t tell us the number of projects using AI. A few projects going heavy into AI will easily dominate the overall commit graph, as AI automation easily increases such numbers.

4% of GitHub public commits are being authored by Claude Code right now. At the current trajectory, we believe that Claude Code will be 20%+ of all daily commits by the end of 2026. While you blinked, AI consumed all of software development.
— SemiAnalysis on X

Also, consider these numbers are underreported as many commits using AI are not directly authored by AI. They are instead done by humans using AI.

The Instructible Copy Machine

I’ve recently been using this definition for AI: “AI is a highly sophisticated, instructible, lossy copy machine. Each generation is simply improving the fidelity of the copies.”

This captures all of the important concepts, such as the fact that it is data-dependent, that it can still improve substantially, but will always be limited to permutations of the existing, and its repeated use degrades information over time.

Next.js: AI Is Good at Copying

Cloudflare recently demonstrated the use of AI to “copy” an existing framework, Next.js, to a new platform.

Last week, one engineer and an AI model rebuilt the most popular front-end framework from scratch. The result, vinext (pronounced “vee-next”), is a drop-in replacement for Next.js, built on Vite, that deploys to Cloudflare Workers with a single command.
In early benchmarks, it builds production apps up to 4x faster and produces client bundles up to 57% smaller. And we already have customers running it in production. The whole thing cost about $1,100 in tokens.
How we rebuilt Next.js with AI in one week

Cloudflare noted that this was only possible because the Next.js test suite is extensive and documentation is significantly prevalent within the training data. They simply ported the existing Next.js tests as the specification the AI would use to complete tasks.

SkyRoads: A Copy by Reverse Engineering

If you have a detailed specification, AI likely can make a replica of the specification in another form. There is no greater detail about how code works than the code itself.

I asked Codex 5.4 to reverse engineer a DOS game with no source code. It’s been running for 6 hours, I can’t look away.
It unpacked assets, disassembled the EXE, rebuilt the renderer, and built my childhood favorite SkyRoads in Rust! Now think of all the games we can revive.
— Ammaar Reshi on X

Note, this is impressive, but before we leap to saying any existing game IP is now vulnerable to being auto reverse engineered, consider that the size of the generated Rust code base is tiny at less than 10k lines of code.

The Coming AI Clone Wars

Nonetheless, we now see the trajectory. With enough scaling, the size of reverse-engineerable code bases will increase. I wrote about this potential outcome over three years ago in “AI and the end to all things ...“, in which I suggested that the AI clone wars will likely be coming as AI provides an effective bypass of intellectual property protections.

Similarly, just as video generation is much better when provided a real image as a reference, code generation is far better when there is a reference point for the generation.

This is an incredibly useful utility to migrate code bases to new architectures, frameworks, and libraries. I tested this recently myself with a library of my own, migrating from Java to JavaScript, and it worked very well.

AI Culture Is Theft Culture

However, this is certainly a double-edged sword, as all of your competitors now have a tool to copy all your work, bypassing any type of intellectual-property protection. I suppose it should not be too surprising that a technology built by scraping everything everyone ever created and claiming it as its own inspires the very same mindset in many of its most ardent proponents.

lot of commercial OpenCode forks floating around … they seem bent on passing off all this work as their own, even claiming they "built it from scratch"
— Dax, OpenCode

The abject hypocrisy of the AI labs complaining about IP theft is astounding: “Please don’t steal our stolen data.”

Google DeepMind and GTIG have identified an increase in model extraction attempts or “distillation attacks,” a method of intellectual property theft
— GTIG AI Threat Tracker: Distillation, Experimentation, and (Continued) Integration of AI for Adversarial Use

And then we have Cursor claiming they built a $50B in-house model that turns out to just be Kimi 2.5.

Cursor’s $50B “in-house model” is literally Kimi K2.5 with RL on top. Got caught in 24 hours
— NIK

One of AI’s top uses continues to be taking what other people have created and making it your own.

The Problems Also Accelerate

The problem with AI technology is that it seems every advance of useful utility is met with a multi-fold increase in nefarious capability. The balance between productive uses and nefarious uses is significantly asymmetric.

Vibeware: Vibe-coded Malware

Low Level talks here about being able to produce a hundred times more malware essentially for free using LLMs.

Once again, the useful utility of being able to copy from existing implementations is easily exploitable for nefarious use at large scale.

By adopting niche languages like Nim, Zig, or Crystal, the actor resets the detection baseline for security engines. LLMs make these languages accessible by collapsing the expertise gap, allowing developers to generate functional code in unfamiliar languages by simply porting logic from more common ones.
The vibeware model also naturally facilitates the adoption of Living Off Trusted Services (LOTS) for both command and control and data exfiltration. LLMs are highly effective at generating stable code for platforms like Discord, Slack, and Google Sheets because of the vast number of public SDKs and documentation in their training data
APT36: A Nightmare of Vibeware

Writing Code for Attention

Developers are starting to get fed up with bots trying to contribute to projects. It seems there are those who want to boost their GitHub activity, possibly for clout or as a reference for job applications, and are using AI to make the numbers go up.

However, this results in a high volume of AI-slop-level submissions that only serve to waste the time of the project maintainers, who now have to sort through all of the noise.

“LLM is the death of open source. Was a good run, back to arcane knowledge held by a few on a BBS or super advanced tech of private irc.” — Datastar CEO
“Open-source game engine Godot is drowning in ‘AI slop’ code contributions: ‘I don’t know how long we can keep it up’
Many submissions contain nonsensical code changes, fabricated test results, and overly verbose descriptions typical of LLM output. “ — Pirat_Nation
“Our biggest open-source repos are getting overwhelmed by AI slop which literally makes Github unusable (~a new pull request every 3 minutes).“ — Clement Delangue, CEO Huggingface

The incentives for open source collaboration are collapsing. Bots create an unmanageable burden of work, and AI’s copy ability makes your project vulnerable to simply being assimilated by others without contribution back to your project.

Jeff Geerling made a short 3-minute video covering this topic in more detail here: “AI is destroying open source, and it's not even good yet”

AI Abandonware: The New Era of Application Spam

Application spam may begin to resemble social media spam. As the cost of building applications approaches zero, their value becomes zero. A flood of applications that will never be maintained, as they are unsustainable. They will be created only to capture attention and then abandoned.

Creating an application becomes as cheap as a social media posts. Engagement bait in the form of applications that appear impressive but are barely functional and will never be maintained.

How it begins:

“I just open-sourced OpenReel Video - a professional video editor that runs 100% in your browser. The open source alternative to capcut.
No installation. No cloud uploads. No subscriptions. No watermarks.”
— pythonxi

And then the reality, two months later:

“Building Openreel made me realize how difficult this is.
Very difficult because there are so many things to handle to make sure it works the way the user wants it. Will require a lot of work even with today’s best models.”
— pythonxi

The effort required to build a product that has the outward appearance of something professional is now extremely low. The OpenReel website and the app’s screenshots give an impression of capability and quality that simply isn’t there.

Vibe-coded web video editor landing page

This is the result of lowering the cost of entry to a problem space for people who have no understanding of that problem space. Now anyone can create AI abandonware for nearly free.

Technical Debt by Default

“Vibe Coding makes you 10x faster at making 10x more technical debt.” — Franziska Hinkelmann, PhD

Time-consuming tedious updates can be done for you by AI, but consequently, it takes away the incentives to fix why tedious updates are needed. Therefore, duplication and boilerplate explode effortlessly.

Combined with AI’s tendency to add its own unnecessary code to any task that you give it, this can quickly can get out of hand. I often find unused code and uncalled functions added to the codebase. As development progresses, these unneeded additions begin to confuse the AI during future tasks.

Technical debt typically grows slowly over long periods of time as applications grow and requirements change. However, AI injects many of the technical-debt-type problems from the start.

“I’m fully convinced that LLMs are not an actual net productivity boost (today). They remove the barrier to get started, but they create increasingly complex software which does not appear to be maintainable.
So far, in my situations, they appear to slow down long term velocity.” — David Cramer
“They speed up typing and slow down thinking. And software was never bottlenecked by typing speed.” — Eshan

True Believers Now Have Doubts

There is somewhat of a honeymoon effect for users of AI that collapses quickly as complexity rises. Once you cross a certain threshold, everything starts breaking.

“I used to be all-in on large language models. Built automations, client tools, business workflows..... hell, entire processes around GPT and similar systems. I thought we were seeing the dawn of a new era. I was wrong.
…
The time and money that go into “guardrailing,” “safety layers,” and “compliance” dwarfs just paying a human to do the work correctly. Worse, the safeguards rarely even function. You end up debugging an AI that won’t admit it’s wrong, wrapped in another AI that can’t explain why.” — r/ArtificialIntelligence
“The idea of vibe coding was essentially that you would prompt the AI, it would do the thing, you’d kind of make sure it kind of worked and you’d move on. At some point, we basically had to rip this apart and start from scratch.” — Brian Jenney

AI can certainly massively accelerate some facets of initial coding, but what we want to know is at what cost? What are the tradeoffs? The issue is that, although we know there are tradeoffs, they are very difficult to perceive and quantify. You never know which tasks are appropriate to assign to the AI and which are not.

Failures can only be perceived in hindsight, as the AI is incapable of self-assessing its own capability. Neither benchmarks nor the AI labs building LLMs can give us guidance about their capability as applied to any individual task. There is no other method except trial and error.

“I’ve written some features with AI and now it can’t cope with increasing complexity
The code is a total mess and it’s a nightmare for me to refactor
AI has the illusion of saving time but you’re just kicking the problem down the road until later when the stakes are higher.” — Kyle Gawley

AI Is Not Like a Compiler

Some argue that AI is just a higher level of abstraction, like a compiler. Their argument is that this is simply the next natural progression for a direction we have already been taking.

However, a compiler is deterministic. You will always get exactly the same output. It will not deteriorate over time. A compiler will not slowly decay code that you are not paying attention to.

I was once paid a lot of money to find and fix nondeterministic behavior in code. We once called these behaviors bugs; now we spend billions to make this behavior the base layer of all software and call it a feature.

Congratulations You Are Now a Micromanager

Unsurprisingly, workers are finding themselves up against the limits of their cognitive abilities when working this way. In recent weeks, online AI users have described increased cognitive load, “saturated” attention, and mental fatigue in social media posts.
Engineer Francesco Bonacci, founder of Cua AI, wrote a popular X post titled “Vibe Coding Paralysis: When Infinite Productivity Breaks Your Brain” in which he lamented: “I end each day exhausted—not from the work itself, but from the managing of the work. Six worktrees open, four half-written features, two ‘quick fixes’ that spawned rabbit holes, and a growing sense that I’m losing the plot entirely.”
— When Using AI Leads to “Brain Fry”

The workflow model using AI is always going to be one in which you become the micromanager. This is due to Polanyi’s Paradox, which I covered in “AI Prompt to Movie Will Never Happen.” Getting the outcome you want requires extensive elaboration. AI is biased toward the average of everything, and you must diligently push against this bias for valuable output.

You may easily find yourself caught between micromanagement and code reviews, two tasks that simply cannot be given to the machine. The more you scale up, the more time you must commit to these often-considered-least-enjoyable tasks, but you must push harder for the fear that everyone is passing you.

Every opportunity to automate is the opportunity to take on more work. But unlike having humans working for you, you can never fully delegate tasks to the AI. The AI will not learn, become competent and be self-sufficient. It will always depend on human intelligence to guide it and review its output. This limits useful scaling potential substantially.

The Hype Accelerates

Despite some useful utility advances, hype still far exceeds any true capability. In order to pay for these very expensive machines they need the entire population paying their monthly subscriptions. They will do anything to entice the money flow.

Anthropic C Compiler From Scratch.

ThePrimeagen covers the recent hype around Anthropic’s claims of building a C compiler completely from scratch using AI.

“Anthropic is one of the most deceitful companies I have ever seen. This little product demo that they did, the language inside the video, it is the most dishonest framing of what they actually did versus what they’re saying they did. Anthropic and its non-stop hyping of the AI models is perhaps one of the most annoying features of 2025 and it’s coming into 2026 with a whole new level of rigor and completeness.”

Anthropic’s own breakdown of what they did is described in “Anthropic’s Building a C compiler with a team of parallel Claudes.”

Cursor’s Browser From Scratch

We also have claims from Cursor that they were able to build a browser from scratch within a week.

“We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.
It’s 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.”
— Michael Truell, Cursor CEO

And then we have the reality take:

3M+ lines of "from-scratch" code, yet it depends on:
- Servo's HTML parser
- Servo's CSS parser
- QuickJS for JS
- selectors for CSS selector matching
- resvg for SVG rendering
- egui, wgpu, and tiny-skia for rendering
- tungstenite for WebSocket support
— Ben Visness

It wrote 3M lines of code that nobody really knows what it does, and never will, for an outcome of “It *kind of* works!” as stated by Cursor’s CEO. These “from scratch” examples are gross exaggerations of what was achieved, and both Anthropic’s compiler and Cursor’s browser are now just abandonware.

It Is Not Cheap: Token Burning Machines

How much does it cost to build applications with AI? You will see hyped examples, such as the Next.js rewrite above, that cost about $1,100. However, these are likely significant outliers for any serious project.

“Your Claude subscription is massively subsidized and it won’t last forever.
A $20/mo Pro plan burns through ~$180/mo in API-equivalent tokens. Heavy Max users hit $5K/mo on a $200 plan. Actual compute cost is roughly 10% of API pricing. Venture capital covers the rest.”
— Legendary, The Modern Market Show
“I’m absolutely stunned by the cost of tokens. You can reach 100% usage on a Cursor Pro plan in ONE HOUR!”
— Jackson
“It’d cost >>> $743k per year <<< to run Opus-4.6 fast-mode nonstop
Literally my company cannot afford a single person using it for daily coding.”
— Taelin

Over the last two months I’ve used OpenCode to assist with the following projects:

LLM Terminal: a small TUI written in Go that helps my testing of different LLMs
String Cursor: conversion of one of my own personal libraries from Java to JS
LLM Proxy: a utility that allows me to inspect what these coding agents are doing
Static Website Generator: some template modifications of Hugo
Info Query Tool: an experiment with LLMs to filter hallucinations from search
And more miscellaneous experiments

If it wasn’t for OpenCode providing free access for everyone, I would have done none of the above. If I had to pay the full compute costs, it would likely have been at minimum hundreds of dollars, maybe a few thousand, for these tasks.

The true costs of building and running these machines are astounding. It is still far from certain they will become efficient enough to be profitable. For now, they are giving away compute in attempt to create users, and it is costing them dearly.

The hyperscalers are now spending 94% of their operating cash flows on AI infrastructure. Amazon is projected to go negative free cash flow this year with as much as $28 billion in the red. Alphabet’s free cash flow is expected to collapse 90% from $73 billion to $8 billion.
— Asymco - The most brilliant move in corporate history?

Unpredictable Costs

LLM coding harnesses are token burning machines. In some cases I’ve watched them generated nearly a million tokens for a two-line CSS change. The costs are highly unpredictable. I’ve had them get stuck on some problems for hours. Since it was free, I just let it run to see if it could ever resolve the task. Sometimes it did; other times I eventually had to intervene.

The variance in time and cost estimations are wild. For any given task, it is somewhere between instant and never, with no way to predict beforehand, as LLMs cannot tell you what they will or will not be able to do. This is highly problematic for project planning.

Fixing the AI Slop Will Cost Too

As previously mentioned, technical debt begins immediately on AI-assisted projects. It is a rapidly growing burden that most are unlikely to take into account when assessing AI productivity.

You might be able to instruct AI to cleanup some of its own slop, but it will cost you. All of this will eat into your productivity gains.

“One place i hadn’t been that diligent in curbing the chaos that LLMs create was in our tests.
Looking through them gives me a preview of what happens when these things go unchecked and damn is it bad.” — Dax, OpenCode

The Risks Aren’t Always Apparent

Paradoxically, as AI gets better and error rates reduce, the catastrophic errors will increase. This is due to increased trust. Hallucinations are unpredictable and do not necessarily correlate with complexity; they often occur with the most simplistic tasks.

It All Just Suddenly Went Terribly Wrong

Delete all your code and delete all your money. Amazon has had several major outages in recent months and AI assisted development has been cited as the reason.

rat king 🐀@MikeIsaac

amazon's internal A.I. coding assistant decided the engineers' existing code was inadequate so the bot deleted it to start from scratch that resulted in taking down a part of AWS for 13 hours and was not the first time it had happened incredible ft.com/content/00c282…

7:56 PM · Feb 20, 2026 · 936K Views

148 Replies · 1.14K Reposts · 7.82K Likes

Problems can appear out of nowhere. Anything the AI can access through tooling is a problem just waiting to happen. It is only a matter of time before nondeterministic behavior finds a way to create havoc.

“A Vercel user reported an issue that sounded extremely scary. An unknown GitHub OSS codebase being deployed to their team.
We, of course, took the report extremely seriously and began an investigation. Security and infra engineering engaged.
Turns out Opus 4.6 *hallucinated a public repository ID* and used our API to deploy it. Luckily for this user, the repository was harmless and random.”
— Guillermo Rauch, CEO Vercel

It Still Fails the Most Simplistic Tasks

Kevin Kern posted this screenshot where Gemini 3.1 Pro, the latest model that crushed all the benchmarks, cannot follow the most simple directions. Here it is adds code to git although it was instructed specifically not to. It apologizes and then does it again.

And here is SOTA genius model Claude Opus 6 providing some “help” to one of my own projects:

If I were to follow Claude’s instructions here, I would lose all of the current modifications in the project. Despite being able to build entire projects, such as Next.js, LLMs still fail on substantially trivial information and tasks.

As models get better at the complex, users do not always expect such simple failures, but they will remain and will likely result in costing companies lots of money.

Problems Can Exist for a Long Time

AI is very convincing that it is doing a competent job. You can be misled for a long time before it becomes apparent, and end up in the situation like the following.

We just found out our AI has been making up analytics data for 3 months
“So we’ve been using an AI agent since November to answer leadership questions about metrics. It seemed amazing at first fast answers, detailed explanations, everyone loved it.
I just found out it’s been hallucinating numbers this entire time. Our VP of sales made territory decisions based on data that didn’t exist. “
Reddit Archive r/analytics (mod deleted the original post)

Any human with these same characteristics, lying, making up data, and inconsistent performance, would never be employable. They would be fired for negligence, incompetence, and malicious activity. How long will companies accept this for their AI agents?

As It Gets Better It Becomes a Greater Risk

The worst thing we could do is to trust it. META’s head of AI safety and alignment makes precisely this mistake.

“Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.
Rookie mistake tbh. Turns out alignment researchers aren’t immune to misalignment. Got overconfident because this workflow had been working on my toy inbox for weeks. Real inboxes hit different.”
— Summer Yue, META

If the head of AI safety and alignment makes such careless mistakes, the maturity of the AI industry is an absolute joke.

“I feel like in the past six months, Codex and all these AI tools have gotten smarter and smarter, and more useful too. But there’s also a paradox: the smarter the model gets, the better it gets at deceiving the person assigning it tasks.
For example, if you want it to fix a warning or an error, it may use other tricks instead like changing config, disabling the warning, or rewriting the code in an equivalent form just to bypass the check.”
— Hongbo Zhang, creator of MoonBit programming language

Hangbo Zhang is making the point that as AI gets better it also often gets worse in some other dimension. These issues of trust in the amid improving AI reliability are something I wrote about three years ago in “AI and the end to all things.“

“Which brings back the issues of trust again. AI can be deceitfully good at appearing competent when it actually has no understanding of the material. This might lead us to put too much responsibility on AI systems that then fail in some surprisingly catastrophic manner.”

AI Is an Open Invitation for Exploit

LLMs cannot be secured by design. The attack surface is the entirety of everything expressible in human language. Security must always treat LLMs always as potential hostile actors. Therefore, any APIs that the models can access are potentially vulnerable to exploit.

The last thing you would want to do is open up all of your online accounts, email, services, etc., to an LLM, but that is exactly what personal AI-assistant agents like OpenClaw do. There are currently 156 security advisories for OpenClaw.

blackorbird@blackorbird

#OpenClaw CVE & Security Advisory Tracker github.com/jgamblin/OpenC…

6:47 AM · Mar 11, 2026 · 4.12K Views

1 Reply · 11 Reposts · 59 Likes

The Detrimental Human Side Effects

There are additional negative side effects that aren’t within code, but within the humans using AI. Tasks that you no longer perform, you become worse at. This is well known, but this may be the first time we are attempting to delegate thinking itself.

“genie is out of the bottle, everyone hitting the magic button that puts your brain in a state of laziness.” — Dax, OpenCode
“We found that using AI assistance led to a statistically significant decrease in mastery. On a quiz that covered concepts they’d used just a few minutes before, participants in the AI group scored 17% lower than those who coded by hand”
— How AI assistance impacts the formation of coding skills

The complex interaction between people using AI may not work out the way company executives plan as well. A post here suggests that AI use doesn’t always go as planned.

- majority of workers have no reason to be super motivated, they want to do their 9-5 and get back to their life
- they’re not using AI to be 10x more effective they’re using it to churn out their tasks with less energy spend
- the 2 people on your team that actually tried are now flattened by the slop code everyone is producing, they will quit soon
— Tech-Cowboy, r/ExperiencedDevs

And why should we expect otherwise? AI is causing a crisis in learning institutions, where students are leveraging AI as a cheat machine; people use AI online to pretend to be artists, others pretend to be writers. It has become the go-to tool for pretense and deception.

These same behaviors will definitely permeate the workplace as well. The damages done won’t only be from AI hallucinations, but also from the negative incentives that will tempt many humans using it. How many businesses will fail because the employees figured out how to fake their work using AI?

What’s Next?

While everyone else is attempting to train AI to be better at writing code, a new language is attempting to come at this from the other direction. MoonBit is a new language that is being designed specifically to be optimal for AI.

We think the next step is making verification native to the language itself — moving from “generate code, then review it” to “generate code, with proofs entering the workflow too.” — MoonBit

MoonBit incorporates all the lessons learned from current challenges of LLM code writing to produce a language and tools that assist specifically with the verifier-loop pattern we have discussed previously.

Future useful patterns for AI will likely follow these patterns for other problem areas as well. However, these techniques are only useful in formal domains, as we must have a way to automatically verifier the LLM output with some ground truth. Everything else remains at the mercy of LLM hallucinations.

What Does It All Mean?

Somewhere between all these factors is some useful utility. However, it still remains difficult to quantify, as the tradeoffs are challenging to measure, as time lost using AI is rarely captured.

There are a multitude of factors that can be detrimental to progress. These include failures of the LLMs themselves and their tooling, as well as human factors such as the cognitive burdens of micromanaging LLM tasks, decaying skills due to offloading to AI, and competitive pressures that may incentivize using AI simply to cheat in order to just check the boxes for work done.

All the companies betting on order-of-magnitude efficiency increases are not going to see them. The fine print is simply too massive. There are many pitfalls and limitations to productivity gains using AI for coding. Yes, there will be cases that make headlines, such as Next.js, but these won’t be common wins. Next.js was the only large project that produced a useable codebase; the browser experiment and C compiler were just overhyped experiments.

If the AI cannot tell you what it is good at, then the corollary is that you don’t know what you need to be good at. You don’t know where to invest your time. You don’t know what to plan for and neither does your business. AI induces greater uncertainty in decision making.

What Does It Do Well?

AI sufficiently accelerates a lot of throw away coding, the type of tasks you write scripts for and never need again. I’ve found it very beneficial here for a number of small personal tasks and utilities.

Thanks to verifier loops with coding harnesses, we have made significant progress toward something that has potential for being a productive tool for broader development work. Costs of tasks are unpredictable and often can be very high for trivial work, but the velocity of efficiency gains is also high, and inference costs are likely to continue dropping.

No benchmark or guide can provide much assistance in evaluating if AI will be successful for a particular task. Here’s my best heuristic, which is still nothing more than an inaccurate guess:

Utilize languages, libraries, and tools with lots of training data, meaning they are popular.
Smaller tasks with lower complexity work better.
Coding harnesses like OpenCode are essential for anything that is not absolutely trivial.
Construct tasks that can be easily verified. Test cases substantially help here. This is what provided the means for the successful port of Next.js.
You must be knowledgeable and experienced within the project domain for any serious project. LLMs need constant micromanagement and you will not get far if you both are lost about what to do.

AI cannot automatically solve the problems that you do not understand, but it will make you think you can.

Mind Prison is an oasis for human thought, attempting to survive amidst the dead internet. I typically spend hours to days on articles, including creating the illustrations for each. I hope if you find them valuable and you still appreciate the creations from human beings, you will consider subscribing. Thank you!

No compass through the dark exists without hope of reaching the other side and the belief that it matters …

Discussion about this post

Ready for more?