Field Notes 4: The Part I Didn't Expect

I handed the same person's inputs to two different AI models and watched them behave completely differently. What survives when the models get better?

Something happened a few weeks ago that I'm still chewing on.

I'd built a new version of the leadership tool I've been working on, and I handed the same person's inputs to two different AI models to see how they'd each run it. Same instructions, same evidence, same everything. I expected the outputs to be a little different. What I didn't expect was how differently the two models would behave — not in what they concluded, but in how they got there. One slowed down, argued with itself, hedged where it should have hedged. The other got fluent and confident and started telling me things about this person that the evidence didn't actually support, in language good enough that I almost didn't catch it.

And I sat there realizing the thing I'd actually built wasn't a leadership tool. It was a set of guardrails to keep a model from getting ahead of itself. The leadership part was almost incidental. The real work, the part I'd sweated over, was all about restraint.

That bothered me for a few days. Then it turned into the more interesting question, which is the one I actually want to write about.

Here's the worry, stated plainly. There's a well-worn idea in AI — Richard Sutton called it the bitter lesson, and Ethan Mollick has a recent piece riffing on it — that the clever stuff humans hand-build into these systems tends to get washed away as the models themselves get better. The example everyone uses is chess: people spent years encoding chess wisdom into engines, and then a model that was taught nothing and just played itself beat all of it.

I read that and felt a little exposed, because a structured developmental tool is nothing but hand-encoded human wisdom. So I've been doing this slightly nervous exercise where I look at each piece of what I built and ask: if a smarter model showed up next year, does this survive, or was I just papering over today's weaknesses?

A lot of it does not survive, and I'm trying to be honest about that. All the scaffolding I built to stop the model from sounding too sure, from wrapping things up too neatly, from handing someone a verdict when it should hand them a question — that's me compensating for how the current models behave. The day they stop behaving that way, my compensations become dead weight. I don't think I get to keep most of it.

What's strange, though — and this is the part that surprised me — is what did survive the exercise. It wasn't any of the clever stuff. It was a handful of almost boring ideas. That you learn more about a person from three kinds of evidence than from one. That it's better to say "I'm not sure about this part" than to smooth it over. That when someone's self-description and their actual behavior disagree, the disagreement is the interesting part, not a problem to resolve. That what a person does tells you things they'll never tell you about themselves.

None of those are mine. None of them are even about AI. They're just old, sturdy ideas about how you come to understand a person honestly. And they're the only parts I'm confident will still matter when the models are better. The cleverness was disposable. The honesty was the thing.

But I keep circling back to that morning with the two models, because I think there's something underneath it I haven't fully worked out.

The way I actually built this thing was not by sitting down and designing it. It was by arguing with AI models for months. I'd draft something, hand it to one, let it tear into it, hand the result to another, watch them disagree, and use the disagreement to figure out what I actually thought. The models weren't answering my questions. They were sparring partners. The value wasn't in any single response — it was in watching different systems push on the same idea from different directions and noticing where the idea held and where it cracked.

I was using AI the way the tool tells you to understand a person — triangulate, watch for where things collide, distrust the tidy version.

Which is funny, because that's exactly the principle the tool itself is built on. Multiple sources of evidence. Contradiction as signal. Don't trust the clean single answer. I didn't design it that way on purpose. I just noticed, somewhere along the line, that it was how I'd been working the whole time.

I'm not sure what to do with that observation yet. It might be the most interesting thing I've found, and it might be a coincidence I'm over-reading. That's honestly where a lot of this lives right now — I can see a pattern but I can't tell yet whether it's real or whether I just want it to be.

So where does that leave the actual tool. Somewhere uncomfortable, which I'm coming to think is the right place for it.

The clean version of the lesson would be "structure beats chaos," or the opposite, "let the model run free." Neither is true, and the honest answer took me a while to get to, so let me try to say it carefully, because it's the one part of this I think I actually believe rather than suspect.

When I started, I assumed the real work was governing the model's reasoning — scripting how it should think its way through a person's evidence, step by careful step. That felt like the substance of the whole thing. And it's the part that broke most often. When I look back at the runs that went wrong, almost none of them failed because the model lacked structure. They failed because the model followed a confident-sounding line of reasoning right past the evidence. More scripting didn't fix that. Sometimes it made it worse, because a model marching through a rigid set of steps sounds more authoritative, not less, even when it's wrong.

What started to work was moving the discipline somewhere else entirely. Not onto the reasoning — onto the evidence and the confidence. Be strict about what counts as evidence and make the model show it. Be strict about how sure it's allowed to be, and make it say so out loud. But once those two things are locked down, let the actual thinking be loose. Let it follow the interesting thread. Hold the inputs and the honesty tight; let the reasoning breathe.

That distinction sounds small written down. In practice it was the whole game. A model that's free to reason however it wants, but can't claim more than the evidence supports and has to label how confident it is, turns out to be both more useful and more honest than one whose every step you've choreographed. The choreographed one performs rigor. The constrained-on-evidence one actually has it. And — this is the part that connects back to the bitter lesson — governing the reasoning is the brittle, hand-encoded thing that better models will make pointless. Governing the evidence and the confidence might not be. That feels less like a workaround and more like something true about working with these systems at all.

I want to be careful not to oversell it, because the reason this works for leadership development is also the reason it's hard. I can't cleanly tell you what a good outcome even is. Growth is slow and subjective; two thoughtful people will disagree about what someone should work on. When you can't define the win, you can't just hand the wheel to a system optimizing toward it — there's nothing solid for it to aim at, and a confident answer poured into that emptiness is dangerous, not helpful. But you can't script your way out of it either, because that's the chess-wisdom trap. Constraining the evidence and the confidence, and letting the reasoning move, is the only place I've found that holds up in the middle. It's narrower than "I built a system." It's the part I'd actually defend.

I don't have a clean ending for this, which feels appropriate. I started building a leadership tool and I'm not sure that's the thing I made. What I think I actually learned is a smaller and more personal lesson about working with these models at all: that they're most useful when you stop treating them as answer machines and start treating them as something to argue with, and that the whole game is staying honest about the difference between what you know and what just sounds good.

Ask me in a year whether any of that held up.