Maybe Don't Bet the Farm on AI Coding
tl;dr: I used some AI coding tools over the last year and got decidedly mixed results
This is an expansion of a thread I posted on Bluesky. The state of affairs on Bluesky is such that some standard disclaimers apply: I am about as virulently anti-GenAI as one can be, often (according to those around me) annoyingly so. The topic makes me cranky in a hurry, and I sometimes fantasize about erecting a sign proclaiming that I Am an AI Hater. I often still use quotes around “AI” because it's not AI, which I fully acknowledge is a lost cause. Throughout this post I will use LLM instead in my references to the products. What falls under the umbrella of GenAI is a collection of rapacious, extractive products aimed at undermining intellectual labor and funneling money and power into the hands of their purveyors. Its hydra-headed push into every sector corresponds with a similar rise in right-wing rhetoric, and its loose connection to the shape of truth makes it a convenient tool for technofascists.
So why use it at all? Call it a combination of cynicism, curiosity, and rational self-interest. While I still have something of a choice about whether to use these products at all, even in my daily work, there are places where they've crept in even so, and similar to Michael Taggart's reasoning, there are pragmatic reasons I need to be aware of how they work and how to guard against their many, many pitfalls. For instance, even if I don't use them, others around me are, and if I am going to remain accountable for the outputs of those other people from a DevOps and IT security perspective, or from a supervisory perspective, well, I have to know something about them. Additionally, there is a difference in applicability between the two things people are using LLMs for now:
- plausible output that is sometimes, but not necessarily, true, and for which establishing truth is costly; and
- plausible output that compiles and can be executed, for which establishing truth is fairly cheap, and which also needn't be “true” in anything but a logical sense.

LLMs produce both of these, but one is, theoretically, more useful than the other. For my money, I simply can't trust LLMs for anything in the first case, so that leaves the second, which is making code.
The main promise of these products is that they'll make you more productive as a programmer. Now, productivity here is always some ratio, generally between quantity of software and the cost to produce it, noting that the definition of software here must include some concept of both scope and quality, while cost must additionally be some function of how long it took to make it, etc. Since a rundown of programming productivity isn't the thrust of this post, we'll leave it at this simplistic overview and offer this instead: for LLMs, the measure of programming productivity is the time between “having the idea for a system” and “having working code that satisfies the idea”. Hopefully you can see in this the same basic measures of scope, quality, time, and cost.
So then if we're going to measure gains in productivity, we need to think about how fast something can go from being an idea (or specification, or requirement) to being ready for use by a user (deployed to production, released, merged to main).
Let me take a moment to outline the projects I had in mind and a quick assessment of their status. Some of these are new codebases, and some are not.
The projects
- A DokuWiki quiz plugin that injects a quiz onto a page with simple syntax. It is similar to other plugins that didn't do exactly what I wanted them to do and had some baffling options. Started in April 2026. Iterations: 4? Status: acceptable.
- A refactoring of the dlx-rest search and browse components to improve various interactions. Started in June 2025. Iterations: dozens. Status: largely acceptable after 6 months of remediation.
- A refactoring of the dlx-rest record editor component to improve maintainability and reduce interaction issues. Started in March 2026. Iterations: dozens, plus restarts. Status: provisional.
- An algorithm and web application to render the works of William Blake into a cybertext (in the Espen Aarseth sense) allowing recombinant narrative flow. Started in April 2026. Iterations: dozens. Status: largely failed.
I realize in looking at this list that I have unintentionally arranged them, not in chronological order, but in decreasing order of success.
The Quiz
The most successful of the projects, a DokuWiki quiz syntax plugin, was also the simplest to implement. In looking at the code that GitHub Copilot generated, the structure of a DokuWiki syntax plugin is quite minimal, which suggests to me that a) I could have done it myself in a few hours, most of which would have been spent learning by replicating an existing plugin; and b) not terribly much can go wrong with it. This is a case where I may or may not have bothered to make the plugin otherwise, and since I'm very rusty in PHP and not looking to expand my knowledge of it right now, I'm only invested in it to the extent it's useful for the project it was designed to complement, the Voces del Lunfardo site that my Digital Humanities praxis partner and I are working on.
The Search Interface
Moving down the list, I undertook a feverish refactoring of the main search interface for the Dag Hammarskjöld Library's MARC Metadata Editor, mostly to change the screen interactions and simplify things a bit. The basic interface took a few weeks of plugging away with GitHub Copilot before leaving me with months' worth of remedial work tracking down wonky and unintended behaviors. In all, that remediation lasted some six additional months. Unlike the PHP case, this is something I spend a nontrivial portion of my time working on, either developing it, fixing it, or simply keeping it running day to day. It pays for me to know what's in the code, so I can't just vibe it and hope for the best. And because I started with an intimate knowledge of the code, I was well positioned to evaluate the output, which I will note is voluminous enough to make a line-by-line review difficult. Success here is ambiguous.
The Record Editor
Next is a refactoring of the much more complicated record editor component in the MARC Metadata Editor. There are two main concerns with this code. First, because my team and I had yet to fully settle on Vue, it contains a labyrinth of nested control structures that perform DOM manipulation, including setting up and (hopefully) tearing down event listeners, etc. The second concern is that it's Vue 2, which is now quite dated (though functional!), creating a long-term maintenance issue. Now, the nature of this codebase is such that if I begin tugging on one thread, I must look in horror as myriad other threads move with it. It's no simple matter of surgically replacing the thing; it's integrated into other things, and most properly calls for a comprehensive review (read: tear it all down and start again from first principles). From that perspective, there can be little hope of outright success. And in fact, the numerous iterations that have gone into it so far speak to the difficulty of finding the right questions* that will allow adequate progress.
This is a good time to pause and talk about churn. If you've used LLMs for coding (or perhaps for other tasks), you may recognize that they sometimes, or maybe frequently, enter some sort of churn state, where the thing they're doing isn't so much making progress toward your goal as doing a lot of busywork in pursuit of some phantom goal the LLM made up based on its training data. “If you want,” it might say, “I can ...” followed by whatever scheme is statistically a plausible next step. Sometimes that next step actually makes sense. Often it looks sensible but will lead to numerous iterations aimed at optimizing something that it may take you a bit to understand is a red herring. This is churn. It's not doing anything specifically productive, but it sure is doing a lot of it, all at your expense, and it takes all day to do it.
There's no real way of escaping churn within the context of the “conversation” in which it occurs, and so the best recourse is to put a lid on it and start a new session from an uncontaminated branch. You are branching these changes, right? When people talk about how these products don't make you more productive, per se, but just intensify the process in something of an addictive* way, this is what they mean. The constant back-and-forth wrangling with the LLM is intense, especially when you think the end of this development path is just beyond the horizon. So if you can just get there...
During this refactor, I realized I had entered a churn state while refactoring the refactor. You see, the code generated in the first refactor was still quite large and somewhat monolithic. True, it was broken down into a handful of sensible components, as the sample component set had been. I thought it might make sense to reduce the size of some of those components as well, since there were some duplicative methods being chained together via communication buses between the components and their subcomponents. Additionally, I wanted (and still want) automated tests that express the functionality of the component set and can say something about whether new code breaks that functionality. You know, regression tests.
Several iterations in, I noticed that the LLM was spawning hundreds of lines of code per iteration, only some of which were new tests. In the process of refactoring, the LLM had invented a service layer to abstract and reuse some of the logic that was duplicated within the components. This seemed sensible at the time, but as this secondary codebase grew, I couldn't help but realize that what it was making was an entire vanilla JavaScript application full of things I had specifically chosen Vue to handle. In other words, it was reshaping the application to use a different architecture entirely, essentially replacing the neatly packaged calculator (Vue 3) I had intended to use with hand-written pages of calculations and tables. This diversion was costly in terms of both time and tokens.*
Presently, I have the skeleton of a working refactor that appears to do most of the things that were envisioned for it. But it encompasses some 10,000 lines of code that I have to review, and it undoubtedly introduces a host of new problems while leaving some very glaring problems in place (because they were out of scope). If I were able to get this into production within six months, I would be very surprised. In all likelihood, I will use the resulting structure as a learning opportunity as I conduct the aforementioned “comprehensive review”, since there are other problems to address that are outside the scope of the record editor.
The Cybertext Machine
This is the least successful of the projects so far, and the main reason is that it is exploratory, with ill-defined inputs and outcomes. The ostensible goal was something I could see in my head but not express in specific enough terms for an LLM coding product: it is supposed to be an algorithm for chunking up the body of William Blake's complete works into a traversable graph based on line endings and all their possible next line starts, along with some kind of a branching-choice web presentation system to afford user traversal. It's possible there is already an algorithm in existence for this (in which case feel free to shout it at me on the social networks I inhabit), but so far nailing one down has remained elusive. Meanwhile, the interface itself basically works, and was always the simplest aspect in most respects.
Since the underlying algorithm doesn't yield the correct graph structure, however, the interface working is sort of irrelevant. This is, so to speak, a decently built road that goes nowhere useful. There are scripts that parse the text successfully. There are scripts that load the text into JSON and even interface with a Kuzu store. There are query APIs for getting data back out, and there are data schemas and models and such. It's a full factory of gizmos and gadgets, and they all move in more or less the prescribed way, except that the algorithm they're using is incorrect.
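For the curious, here is one naive sketch of the sort of graph-builder I mean. The `ending_key` heuristic (keying a line by its final word) and the toy two-poem corpus are purely illustrative assumptions of mine for this post; they are emphatically not the actual algorithm, which, as noted, remains elusive.

```python
from collections import defaultdict

def ending_key(line: str) -> str:
    """Key a line by its final word, lowercased and stripped of punctuation.
    (A hypothetical choice; picking the right key is exactly the hard part.)"""
    words = [w.strip(".,;:!?-'\"()") for w in line.split()]
    words = [w for w in words if w]
    return words[-1].lower() if words else ""

def build_graph(poems: list[list[str]]) -> dict[str, list[str]]:
    """Map each ending key to every line that followed a line with that ending
    anywhere in the corpus. A traversal then branches from the current line's
    ending key to any recorded successor, allowing recombinant flow."""
    successors = defaultdict(list)
    for lines in poems:
        for prev, nxt in zip(lines, lines[1:]):
            successors[ending_key(prev)].append(nxt)
    return dict(successors)

# Toy corpus: two Blake fragments, each as an ordered list of lines.
corpus = [
    ["Tyger Tyger, burning bright,", "In the forests of the night;"],
    ["And did those feet in ancient time", "Walk upon Englands mountains green:"],
]
graph = build_graph(corpus)
```

With a real corpus, any line ending in “bright” could branch to every line that ever followed a “bright” ending, which is roughly the recombinant shape described above; the trouble is that this simple keying does not actually yield a graph worth traversing.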
The failure here is one that always would have been likely. I simply don't know enough about the domain in question (extracting the text to build the right graph shape) to proceed. That means I likely don't even know enough to ask the right questions, and this LLM coding business has all the hallmarks of being oracular, which means that knowing what to ask and how to ask it (i.e., specificity) makes a difference.* Rather than accept the recommendations of the LLM, whose goal is to shovel whatever it can into this knowledge gap, whether it's appropriate or not, I chose to walk away from it and write this, and an academic essay, instead.
* The big old * in the room
Throughout, I've dropped in asterisks, which all point here. What this mode of working feels like is a slot machine. Dorian Taylor calls it a “slop-machine”, but then likens it to a pachinko game instead. You put (pour) in tokens and pull the lever to see what comes out the other side. The main difference is that sometimes the output looks like it might be useful. But what's coming out of the other end is essentially a worse kind of hot dog. It's made up of the stuff that went in, but you know it contains something unseemly. What you're hoping for when you pull the lever is that the output is not just useful, but that it solves the problem you were trying to articulate. When it doesn't, you rephrase, offer corrections, or rethink something, and then you reach for the tokens and the lever again, ad infinitum.
And there's the problem, of course. When I said this business had the hallmarks of being oracular, I meant that these LLMs occupy a place where they promise to help you turn your words into code, and that all you have to do is come to them with the right words, ask the right questions, and you'll get what you're seeking. What I've exhibited above is that this promise only bears out some of the time, enough to keep you coming back for more. On the whole, however, it's a losing proposition unless you're one of these LLM companies, though see below. The dopamine hit you get from scoring anything at all means it's primed to be addictive, which in turn means it entices you to use it even when you haven't got the question sorted out ahead of time.
As for the companies peddling these products, it's unclear what the end game is here. On the one hand, these models are expensive to operate, and the companies do so at a tremendous loss. LLMs are subsidized at a rate of around 90%, which means that customers right now are only paying 10% of the cost the companies would need to break even. While it's clear this is a market share land grab, suppose for a moment that these products can do all they're billed to be able to do. What then? What's left for people to do when all the jobs these corporations believe can be replaced are replaced? Are they planning to pay us all in company scrip? They don't seem to have a plan for that.
For now, the economy is beginning to take what looks like its final form: a massive series of casinos operated by a handful of oligarchs, designed to foster addiction and to limit individual ownership of anything within it. “AI” is just one more tool in this arsenal.
Conclusion
Where does that leave us? It leaves us basically with a set of tools that, at least for coding purposes, if you're specific enough in your description and are well equipped to evaluate the output, might be of some limited assistance. They're probably good enough to improve your documentation and might be able to help you write more tests than you would have had time for. But you have to know when enough is enough, and be prepared to admit when the LLM is just spinning its wheels for its own purposes, which aren't aligned with yours. Are they worth betting the farm on? I wouldn't.
++++ Like what you just read? You can subscribe to new posts on this blog via any ActivityPub platform (Mastodon, Pleroma, etc.) at @aaron@www.aaronhelton.com or via RSS at https://www.aaronhelton.com/feed
Alternatively, you can follow me on Bluesky: https://bsky.app/profile/aaronhelton.com