If you, like millions of others, watched the Super Bowl for the commercials, you saw a Microsoft ad for AI. Microsoft paid to show the world how easy AI has become. NFL recruiters. Player stats. Copilot in Excel. A few prompts, and the best linebacker is obvious. Every time it aired, people laughed. Not the reaction Microsoft was looking for. (Of course, who sells the hard part?)
They laughed because anyone who has ever made a real personnel decision knows that question has 15 years of film study, injury reports, locker room culture, and salary cap math behind it. The ad showed the words. It didn't show the work underneath the words.
That gap has a name. Psychologists call it the Dunning-Kruger effect — the cognitive bias where limited experience produces outsized confidence. Right now AI is producing Dunning-Kruger at industrial scale. The demo works. The room gets excited. The budget gets proposed. And nobody talks about what the demo didn't show you.
The Computer Does What You Tell It
I learned to program in BASIC. The first real lesson wasn't syntax. It was humility. The computer was going to do exactly what I told it to do. Not what I meant. Not what I intended. What I said.
That has always been true. What changed is the language.
For forty-something years, people adapted to computers. You learned their language. BASIC. SQL. Python. The instruction and the outcome had a one-to-one relationship. You typed a thing. It did that thing. Deterministic, the engineers call it. It wasn't always what the person typing wanted, but it was what they told the computer to do.
Large language models flipped that. Now the machine interprets human language. That means the quality of what comes out depends on the precision and context of what went in. The computer still does exactly what you tell it. We just tell it in English now. And English, it turns out, is a lot harder to get right than Python.
The Valley Nobody Shows You
This is where the Dunning-Kruger valley lives. The demo worked. Production doesn't. The query breaks when someone asks it differently than the person you designed it for did. The assumptions baked in for one persona don't hold for another. The answer comes back technically correct and completely useless. There is a special frustration reserved for answers that are true and wrong at the same time. (As Gilfoyle says in Silicon Valley: "The reward function was a little underspecified.")
Teams go looking for an engineering fix. The model needs to be smarter. The architecture needs to change. The data pipeline needs work. Sometimes those things are true. Often, the problem is 30 words upstream, in the markdown prompt that wasn't precise enough to encode what the user actually needed to know.
They're debugging the model when they should be editing the sentence.
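To make that concrete, here's a minimal sketch of what editing the sentence looks like. Every detail is invented for illustration — the scenario, the prompt wording, and the model_call hook are all assumptions, not any real system. The point is that the fix sits 30 words upstream, not in the architecture.

```python
# A sketch, not a real system: the "fix" is an edit to the words,
# not a change to the model or the pipeline.

UNDERSPECIFIED = "Summarize how the top stores are doing."

# The edited version encodes who is asking, against what baseline,
# and in what order the answer should arrive.
EDITED = (
    "You are answering a district manager responsible for 12 stores. "
    "Rank her stores by same-store sales growth against the same four "
    "weeks last year. Lead with the store that moved the most. Flag "
    "any store with missing data instead of guessing."
)

def ask(model_call, prompt: str, question: str) -> str:
    """Same model, same question; only the upstream words change."""
    return model_call(system=prompt, user=question)
```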
I've Done This Work Before
Before data and AI, I worked in journalism. As an editor, I fixed grammar. Not because I enjoyed the arguments — and there are arguments; the AP style vs. common usage debate has the same energy as the newsroom rumble scenes in Anchorman, and someone always ends up with a trident — but because grammar standards exist to ease understanding. Every rule is in service of one thing: making sure the meaning the writer intended is the meaning the reader receives.
The rest of the job was figuring out why the words on the page didn't match the conversations that happened before a word was written, or after. A reporter files a story with every fact correct and every sentence grammatically clean, and it's still the wrong story. The words, strung together, don't tell the story or explain what's happening in a way a reader can absorb. In person, the reporter could make the point land with body language and voice. On the page, all of that meaning has to be carried by the words alone.
That's an editing problem.
Building natural language query systems feels exactly like that. "How are my top stores doing?" is a clean sentence and an almost entirely ambiguous question. Top by what metric? Doing relative to what baseline? My stores — her district, her region, the 12 locations she personally sweats over first thing Monday morning? The system made assumptions. In the demo, they happened to be right. At scale, for users you never interviewed, they won't be.
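Here's a hypothetical sketch of what those hidden assumptions look like once you write them down. None of these table or column names come from a real schema; they're invented to show that "top" and "my" each hide a decision the system makes one way or another.

```python
# Two readings of the same question: "How are my top stores doing?"
# Every identifier here is invented for illustration.

# Reading 1: a district manager. "My" means her 12 stores;
# "top" means best same-store sales growth; "doing" means vs. plan.
district_manager_query = """
SELECT store_id, sales_vs_plan
FROM store_weekly
WHERE district_id = :district
ORDER BY yoy_growth DESC
LIMIT 5;
"""

# Reading 2: a merchant. "My" means stores carrying her category;
# "top" means highest revenue; "doing" means sell-through.
merchant_query = """
SELECT store_id, sell_through_rate
FROM category_store_sales
WHERE category_id = :category
ORDER BY revenue DESC
LIMIT 5;
"""

# Both answers are "technically correct."
# Only one is useful to the person who asked.
```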
The real work — the work that never shows up in the demo — is the conversation that has to happen before a word is written. Who is asking? Why are they asking? What decision are they about to make? How will they interpret the answer? And yes, whether the structure of the response matches the order in which a human being actually processes information. A reporter who buries the lede writes a bad story. A prompt that buries the lede produces an answer the system never finds. Or worse: sometimes it does find the lede, and you don't know whether this is one of those times.
This isn't an engineering problem. It's a language problem. I've been solving language problems for 30 years. The medium changed. The problem didn't.
The Climb Out Looks Like Editing
The organizations that made it through the valley stopped asking how to make the AI smarter. They started asking how to make the question clearer.
They mapped who was asking. They documented the assumptions baked into every query. They built diagnostic frameworks for when an answer comes back wrong — not to debug the model but to find the place in the language chain where meaning broke down. That work looks boring from the outside. It looks like whiteboard sessions, markdown files, and careful arguments about what a single data column actually means to a district manager versus a merchant. (It also occasionally looks like a 20-minute argument about the Oxford comma. Those are sometimes the most important 20 minutes of the project.)
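One sketch of what documenting those assumptions can look like — the fields and values below are all hypothetical, not a prescribed format. What matters is that each decision is written down where the next person can argue with it.

```python
# An invented entry in an assumptions log for one query.
ASSUMPTIONS = {
    "question": "How are my top stores doing?",
    "persona": "district manager",
    "my": "stores in the asker's district, not the region",
    "top": "same-store sales growth, trailing four weeks",
    "doing": "performance vs. plan, not vs. last year",
    "decided_by": "user interviews, not the demo",
}
```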
Language is not the soft side of AI work. It is the work. The demo shows you the published story. It doesn't show you the conversation that had to happen before a word was written. How you find the place where meaning broke down turns out to be its own story.
If any of this sounds familiar, we should talk.